From 8a029f08fd83fd54553e2b425059b14c2ea544ec Mon Sep 17 00:00:00 2001 From: araison Date: Sun, 8 Nov 2020 15:55:49 +0100 Subject: [PATCH] Testing by your own the project in README --- README.md | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 166 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..b98070f --- /dev/null +++ b/README.md @@ -0,0 +1,166 @@ +# GPUs Monitor + + + +`gpus_monitor` is a Python GPUs activities monitoring tool designed to report by email new and recently died compute processes over the machine where it has been run on. +Basically, when you have just run a new stable training on the machine where `gpus_monitor` listen to, you will received in a few seconds an email notification. This email will contains several informations about the process you've launched. +You received also an email if a compute process died (with EXIT_STATUS = 0 or not). + + +### Kind of mail gpus_monitor is going to send you : + +1. New training detected : + +>> From: +>> Subject : 1 processes running on (LOCAL_IP_OF_THE_MACHINE) +>> +>> New events (triggered on the 08/11/2020 11:45:52): +>> +>> --------------------------------------------------------------------------------------------------------------- +>> A new process (PID : 12350) has been launched on GPU 0 (Quadro RTX 4000) by since 08/11/2020 11:43:48 +>> His owner () has executed the following command : +>> python3 test_torch.py +>> From : +>> +>> +>> CPU Status (currently): +>> For this process : 19/40 logic cores (47.5%) +>> +>> GPU Status (currently): +>> - Used memory (for this process): 879 / 7979.1875 MiB (11.02 % used) +>> - Used memory (for all processes running on this GPU) 7935.3125 / 7979.1875 MiB (99.45 % used) +>> - Temperature : 83 Celsius +>> - Driver version : 435.21 +>> --------------------------------------------------------------------------------------------------------------- +>> +>> +>> +>> This message has been automatically send by a robot. Please don't answer to this mail +>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool + + +2. Training died (either finished well or not) +>> From: +>> Subject : 1 processes running on (LOCAL_IP_OF_THE_MACHINE) +>> +>> New events (triggered on the 08/11/2020 11:47:29): +>> +>> --------------------------------------------------------------------------------------------------------------- +>> The process (PID : 12350) launched by araison since 08/11/2020 11:43:48 has ended. +>> His owner araison had executed the following command : +>> python3 test_torch.py +>> From : +>> +>> +>> The process took 0:03:41 to finish. +>> -------------------------------------------------------------------------------------------------------------- +>> +>> This message has been automatically send by a robot. Please don't answer to this mail +>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool + + +## Instructions to use gpus_monitor : + + +1. Cloning this repository : + +`git clone https://github.com/araison12/gpus_monitor.git` + +2. Installing dependencies : + +`pip3 install -r gpus_monitor/requirements.txt` + +or + +`python3 setup.py install --user` + +3. Add peoples mail to the list of the `persons_to_inform.yaml` file : + +Example: + +```yaml +list: + - adrien.raison@univ-poitiers.fr + - other_person_to_inform@hisdomain.com +``` + + + +Note : You can hot-add/remove mails in this file without the need of killing the scanning process ! + +4. Add SMTP Server parameters (server adress, credentials, port number, etc..) + +You can manage these stuff in the `gpus_monitor/src/gpus_monitor/config.py` file : +To adjust these varibales you have to edit the `gpus_monitor/src/gpus_monitor/config.py` file. + +```bash +cd gpus_monitor/src/gpus_monitor/ +vim config.py +``` + + +For privacy purposes, login of my dedicated SMTP account are stored in 2 machine environment variables. I've set up a brandnew Gmail account for my `gpus_monitor` instance. I can share with you my credentials in order to use a single SMTP account for `gpus_monitor` instance on several machines, feel free to send me an email ! +Otherwise, fill in with your own SMTP server configuration. + + +```python +USER = os.environ.get( + "GPUSMONITOR_MAIL_USER" +) + +PASSWORD = os.environ.get( + "GPUSMONITOR_MAIL_PASSWORD" +) +PORT = 465 +SMTP_SERVER = "smtp.gmail.com" +``` + +See https://askubuntu.com/a/58828 to handle efficiently (permanent adding) environment variables. + +5. Adjust the scanning rate of `gpus_monitor` and the processes age that he has to take in account. + + +The `WAITING_TIME` variable adjusts the scan timing rate of gpus_monitor. + +```python +WAITING_TIME = 0.5 # min +``` + +The `PROCESS_AGE` variable adjusts the processes age that gpus_monitor has to take in account. + +```python +PROCESS_AGE = 2 # min (gpus_monitor only consider >=2min aged processes) +``` + +6. Executing `gpus_monitor` when machine starts up. + +```bash +crontab -e +``` +Add the following line to the brandnew opened file : + +```bash +@reboot python3 /path/to/gpu_monitor/src/gpus_monitor/main.py +``` + +# Ideas to enhanced the project : + +- Log system (owner, total calculation time by user) +- Manage cases in email sending (subject): processes finished well or not (Send Traceback) +- Centralized system that scan every machine on a given IP adresses range. +- Better errors management (SMTP connection failed, no Cuda GPU on the machine,..) +- Documenting the project +- Rewrite it in oriented object fashion + + +If you have any ideas to improve this project, don't hesitate to make a merge request ! :) + + + +# To test `gpus_monitors` by your own: + +I've implemented a the tiny non linear XOR problem in pyTorch. +You can test `gpus_monitor` by your own while running : +```bash +python3 gpus_monitor/test_torch.py +```