gpus_monitor/README.rst
2020-11-08 15:51:36 +01:00

5.7 KiB

# GPUs Monitor

gpus_monitor is a Python GPUs activities monitoring tool designed to report by email new and recently died compute processes over the machine where it has been run on. Basically, when you have just run a new stable training on the machine where gpus_monitor listen to, you will received in a few seconds an email notification. This email will contains several informations about the process you've launched. You received also an email if a compute process died (with EXIT_STATUS = 0 or not).

### Kind of mail gpus_monitor is going to send you :

  1. New training detected :

>> From: <gpusstatus@mydomain.com> >> Subject : 1 processes running on <MACHINE_NAME> (LOCAL_IP_OF_THE_MACHINE) >> >> New events (triggered on the 08/11/2020 11:45:52): >> >> --------------------------------------------------------------------------------------------------------------->> A new process (PID : 12350) has been launched on GPU 0 (Quadro RTX 4000) by <owner_of_the_process> since 08/11/2020 11:43:48 >> His owner (<owner_of_the_process>) has executed the following command : >> python3 test_torch.py >> From : >> <absolute_path_to_the_script_of_the_new_launched_process> >> >> CPU Status (currently): >> For this process : 19/40 logic cores (47.5%) >> >> GPU Status (currently): >> - Used memory (for this process): 879 / 7979.1875 MiB (11.02 % used) >> - Used memory (for all processes running on this GPU) 7935.3125 / 7979.1875 MiB (99.45 % used) >> - Temperature : 83 Celsius >> - Driver version : 435.21 >> --------------------------------------------------------------------------------------------------------------->> >> >> >> This message has been automatically send by a robot. Please don't answer to this mail >> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool

2. Training died (either finished well or not) >> From: <gpusstatus@mydomain.com> >> Subject : 1 processes running on <MACHINE_NAME> (LOCAL_IP_OF_THE_MACHINE) >> >> New events (triggered on the 08/11/2020 11:47:29): >> >> --------------------------------------------------------------------------------------------------------------->> The process (PID : 12350) launched by araison since 08/11/2020 11:43:48 has ended. >> His owner araison had executed the following command : >> python3 test_torch.py >> From : >> <absolute_path_to_the_script_of_the_died_process> >> >> The process took 0:03:41 to finish. >> -------------------------------------------------------------------------------------------------------------->> >> This message has been automatically send by a robot. Please don't answer to this mail >> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool

## Instructions to use gpus_monitor :

  1. Cloning this repository :

git clone https://github.com/araison12/gpus_monitor.git

  1. Installing dependencies :

pip3 install -r gpus_monitor/requirements.txt

or

python3 setup.py install --user

  1. Add peoples mail to the list of the persons_to_inform.yaml file :

Example:

`yaml list: - adrien.raison@univ-poitiers.fr - other_person_to_inform@hisdomain.com`

Note : You can hot-add/remove mails in this file without the need of killing the scanning process !

  1. Add SMTP Server parameters (server adress, credentials, port number, etc..)

You can manage these stuff in the gpus_monitor/src/gpus_monitor/config.py file : To adjust these varibales you have to edit the gpus_monitor/src/gpus_monitor/config.py file.

`bash cd gpus_monitor/src/gpus_monitor/ vim config.py`

For privacy purposes, login of my dedicated SMTP account are stored in 2 machine environment variables. I've set up a brandnew Gmail account for my gpus_monitor instance. I can share with you my credentials in order to use a single SMTP account for gpus_monitor instance on several machines, feel free to send me an email ! Otherwise, fill in with your own SMTP server configuration.

`python USER = os.environ.get( "GPUSMONITOR_MAIL_USER" ) PASSWORD = os.environ.get( "GPUSMONITOR_MAIL_PASSWORD" ) PORT = 465 SMTP_SERVER = "smtp.gmail.com"`

See https://askubuntu.com/a/58828 to handle efficiently (permanent adding) environment variables.

  1. Adjust the scanning rate of gpus_monitor and the processes age that he has to take in account.

The WAITING_TIME variable adjusts the scan timing rate of gpus_monitor.

`python WAITING_TIME = 0.5 # min`

The PROCESS_AGE variable adjusts the processes age that gpus_monitor has to take in account.

`python PROCESS_AGE = 2 # min (gpus_monitor only consider >=2min aged processes)`

  1. Executing gpus_monitor when machine starts up.

`bash crontab -e` Add the following line to the brandnew opened file :

`bash @reboot python3 /path/to/gpu_monitor/src/gpus_monitor/main.py`

# Ideas to enhanced the project :

  • Log system (owner, total calculation time by user)
  • Manage cases in email sending (subject): processes finished well or not (Send Traceback)
  • Centralized system that scan every machine on a given IP adresses range.
  • Better errors management (SMTP connection failed, no Cuda GPU on the machine,..)
  • Documenting the project
  • Rewrite it in oriented object fashion

If you have any ideas to improve this project, don't hesitate to make a merge request ! :)