From 3f485011b8dc116faa4eef225ecd73f3f2b08c71 Mon Sep 17 00:00:00 2001
From: Nils Magnus
Date: Sun, 24 Mar 2024 00:38:41 +0000
Subject: [PATCH] add details to user documentation

---
 README.md | 59 ++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 39 insertions(+), 20 deletions(-)

diff --git a/README.md b/README.md
index f076297..f6a1369 100644
--- a/README.md
+++ b/README.md
@@ -1,42 +1,61 @@
 # Help Center Spider
-## About
-This is a spider tool with which you can visit all links on https://docs.otc.t-systems.com to find urls that are not correct.
 
-## Requirements
-After you cloned the repository you need to prepare an environment to run the tool. You can easily do this with
-python virtual environment:
+The Open Telekom Cloud Helpcenter Spider is a spider tool visiting all
+links starting from its landing page on https://docs.otc.t-systems.com/
+to find and identify URLs that are not correct. It parses all types of
+hyperlinks and normalizes them into a canonical format. The spider
+descends into the document tree via [...] breadth or width first search.
+[and does what?] [when is logged which event?]
+
+## Getting started
+
+Once you have installed the code and its required packages into a virtual
+environment and checked its configuration file `config.json`, start the
+web spider by invoking the tool without any arguments. Results are
+listed in [... TBD].
+
+## Requirements and Installation
+After cloning this repository, you need to prepare an environment to
+run the tool. You can easily do this with a Python virtual environment:
 ```
-$ cd /
+$ cd _local_folder_/
+$ git clone https://gitea.eco.tsi-dev.otc-service.com/infra/hc-spider.git
+$ cd hc-spider
 $ python -m venv venv/
 $ source venv/bin/activate
 (venv)$ python -m pip install -r requirements.txt
 ```
 
 ## Configuration
-In _config.json_ you can define a couple items:
+In _config.json_ you can define several items:
 
-- _watchdog_file_: if you run the tool in the background and want to stop it properly (not using `kill`),
-just send an exit message into the watchdog file: `echo exit > watchdog.fifo`
-- _timer_runtime_: maximum runtime limit in seconds
-- _log_dir_: logging folder
-- _logging_interval_: frequency of dumping log files
-- _workers_: number of workers (background processes) you want to run. If you set to 0 it will count from the number of cores (_number_of_cores_ - 1)
+- _watchdog_file_: if you run the tool in the background and want to
+  stop it properly (without sending a signal with `kill`), just send
+  an exit message into the watchdog file: `echo exit > watchdog.fifo`.
+- _timer_runtime_: maximum runtime limit in seconds.
+- _log_dir_: logging folder.
+- _logging_interval_: frequency of dumping log files.
+- _workers_: number of workers (background processes) you want to run.
+  If set to 0, the count is derived from the number of cores
+  (_number_of_cores_ - 1).
 - _starting_point_: base url where to start
 
-## How to run
-There are two ways to do it
+## Operations
+There are two ways to start the spider:
 
-### In foreground
+### In the foreground
 ```
 $ source venv/bin/activate
-$ python main.py
+(venv)$ python main.py
 ```
 
-### In background
+### In the background
 ```
 $ source venv/bin/activate
-$ nohup python main.py > log/hc_spider.log 2> log/hc_spider.err <&- &
+(venv)$ nohup python main.py > log/hc_spider.log 2> log/hc_spider.err <&- &
 ```
 
-In case you running the tool in background you can stop the execution with `$ echo exit > `
+### Stopping the process politely
+To stop the tool when run in the background, send a command to the
+control fifo with: `(venv)$ echo exit > _watchdog_file_`
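
Note on the Configuration section of this patch: the `workers` rule (0 means number of cores minus one) may be easier to review with a concrete sketch. The following is a minimal illustration only, not code from the repository. The key names are taken from the README; all values, the helper names `resolve_workers` and `load_config`, the unit of `logging_interval`, and the floor of one worker on single-core machines are assumptions.

```python
import json
import os

# Illustrative values only; the key names come from the README,
# the values (and the unit of logging_interval) are assumptions.
EXAMPLE_CONFIG = {
    "watchdog_file": "watchdog.fifo",
    "timer_runtime": 3600,
    "log_dir": "log",
    "logging_interval": 60,
    "workers": 0,
    "starting_point": "https://docs.otc.t-systems.com/",
}


def resolve_workers(configured, cpu_count=None):
    """Return the effective worker count (hypothetical helper).

    A positive value is used as-is. A value of 0 is replaced by
    (number of cores - 1), as the README describes. The lower bound
    of 1 is an assumption so that a single-core host still gets a worker.
    """
    if configured > 0:
        return configured
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores - 1)


def load_config(text):
    """Parse config JSON text and resolve the worker count (hypothetical helper)."""
    config = json.loads(text)
    config["workers"] = resolve_workers(config.get("workers", 0))
    return config


if __name__ == "__main__":
    cfg = load_config(json.dumps(EXAMPLE_CONFIG))
    print(json.dumps(cfg, indent=2))
```

Passing `cpu_count` explicitly keeps the rule testable without depending on the host; for example, `resolve_workers(0, cpu_count=8)` yields 7 under these assumptions.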