
# NVIDIA ® GPUs monitor

`gpus_monitor` is a Python tool that monitors NVIDIA ® GPU activity and reports (stdout, email, etc.) compute processes that have recently started or ended.

## Use case

When you launch a new training on a machine where `gpus_monitor` is running,
you will receive an email notification about it within a few seconds of its
detection. The email contains several pieces of information about the process.

You will also receive an email when a compute process ends, whether it exited successfully (exit status 0) or not.
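
The two kinds of events described above can be sketched as a comparison between two consecutive scans. This is only an illustration of the idea, not the tool's actual code: the function name `diff_scans` and the use of plain PID collections are assumptions.

```python
def diff_scans(previous_pids, current_pids):
    """Compare two scans and return (started, ended) PID lists.

    Hypothetical helper: PIDs present only in the current scan are
    treated as new processes, PIDs present only in the previous scan
    as processes that have ended.
    """
    previous, current = set(previous_pids), set(current_pids)
    started = sorted(current - previous)
    ended = sorted(previous - current)
    return started, ended


# Between the two scans, PID 12350 appeared and PID 4242 disappeared.
started, ended = diff_scans([4242, 777], [777, 12350])
print(started, ended)  # [12350] [4242]
```

A set difference keeps the comparison independent of the order in which the processes are listed.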


## Mail content gpus_monitor will send you :

1. New training detected :

>> From: <gpusstatus@mydomain.com>
>> Subject : 1 processes running on <MACHINE_NAME> (LOCAL_IP_OF_THE_MACHINE)
>> 
>> New events (triggered on the 08/11/2020 11:45:52):
>> 
>>             ---------------------------------------------------------------------------------------------------------------
>>             A new process (PID : 12350) has been launched on GPU 0 (Quadro RTX 4000) by <owner_of_the_process> since 08/11/2020 11:43:48
>>             His owner (<owner_of_the_process>) has executed the following command :
>>                 python3 test_torch.py
>>             From :
>>                 <absolute_path_to_the_script_of_the_new_launched_process>
>>             
>>             CPU Status (currently):
>>                 For this process : 19/40 logic cores (47.5%)
>>             
>>             GPU Status (currently):
>>                 - Used memory (for this process): 879 / 7979.1875 MiB (11.02 % used)
>>                 - Used memory (for all processes running on this GPU) 7935.3125 / 7979.1875 MiB (99.45 % used)
>>                 - Temperature : 83 Celsius
>>                 - Driver version : 435.21
>>             ---------------------------------------------------------------------------------------------------------------
>>             
>>             
>>         
>> This message has been automatically send by a robot. Please don't answer to this mail
>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool


2. Training ended (whether it finished successfully or not) :

>> From: <gpusstatus@mydomain.com>
>> Subject : 1 processes running on <MACHINE_NAME> (LOCAL_IP_OF_THE_MACHINE)
>> 
>> New events (triggered on the 08/11/2020 11:47:29):
>> 
>>         ---------------------------------------------------------------------------------------------------------------
>>         The process (PID : 12350) launched by araison since 08/11/2020 11:43:48 has ended.
>>         His owner <owner_of_the_process> had executed the following command :
>>             python3 test_torch.py
>>         From :
>>             <absolute_path_to_the_script_of_the_died_process>
>>         
>>         The process took 0:03:41 to finish.
>>         --------------------------------------------------------------------------------------------------------------
>>     
>> This message has been automatically send by a robot. Please don't answer to this mail
>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool      
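
The figures in the two sample mails can be reproduced with a few lines of Python. The sketch below assumes the timestamps use a day/month/year format and that percentages are rounded to two decimals. The helper names `elapsed` and `percent_used` are illustrative and may not match the tool's own code.

```python
from datetime import datetime

# Assumption: timestamps in the mails are day/month/year hour:minute:second.
DATE_FORMAT = "%d/%m/%Y %H:%M:%S"


def elapsed(start, end):
    """Return the time between two timestamps as a timedelta."""
    return datetime.strptime(end, DATE_FORMAT) - datetime.strptime(start, DATE_FORMAT)


def percent_used(used, total):
    """Return used/total as a percentage rounded to two decimals."""
    return round(100 * used / total, 2)


# Launch time from the first mail, trigger time of the second mail.
print(elapsed("08/11/2020 11:43:48", "08/11/2020 11:47:29"))  # 0:03:41
print(percent_used(879, 7979.1875))        # 11.02 (memory, this process)
print(percent_used(7935.3125, 7979.1875))  # 99.45 (memory, whole GPU)
print(percent_used(19, 40))                # 47.5  (logical cores)
```

In these samples, the reported duration matches the interval between the launch time and the moment the second mail was triggered.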


## How to 


1. Clone this repository :

`git clone https://gitea.zaclys.com/araison/gpus_monitor.git`

2. Install the dependencies :

`pip3 install -r gpus_monitor/requirements.txt`

or

`python3 gpus_monitor/setup.py install --user`

3. Add the email addresses of the people to notify to the list in the `persons_to_inform.yaml` file :

Example:

```yaml
list:
  - <email 1>
  - <other_person_to_inform@hisdomain.com>
```

Note : You can add or remove addresses in this file while `gpus_monitor` is running; there is no need to kill the scanning process !
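
One way such live updates could work is to re-read the file at every scan. This is an assumption about the mechanism, not a description of the actual implementation. The sketch uses a deliberately minimal line-based reader so that it has no third-party dependency. A real YAML parser would be more robust, and `read_recipients` is a hypothetical name.

```python
from pathlib import Path


def read_recipients(path):
    """Return the addresses listed in a file shaped like the example above.

    Minimal stand-in for a YAML parser: every line of the form
    "- <address>" is taken as one recipient, other lines are ignored.
    """
    recipients = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            recipients.append(stripped[2:].strip())
    return recipients
```

Calling `read_recipients("persons_to_inform.yaml")` just before each mail is sent would pick up any edit made since the previous scan.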

4. Add the SMTP server parameters (server address, credentials, port number, etc.)

These variables are set in the `gpus_monitor/src/gpus_monitor/config.py` file, which you have to edit :

```bash
cd gpus_monitor/src/gpus_monitor/
vim config.py
```


For privacy purposes, the login and password of my dedicated SMTP account are stored on the machine in two environment variables.

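```python
import os

# SMTP credentials are read from the environment, not hard-coded here.
USER = os.environ.get("GPUSMONITOR_MAIL_USER")
PASSWORD = os.environ.get("GPUSMONITOR_MAIL_PASSWORD")

PORT = 465  # usual port for SMTP over implicit TLS
SMTP_SERVER = "smtp.example.com"
```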

See https://askubuntu.com/a/58828 for how to set environment variables permanently.
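
As a rough illustration of how these settings fit together, the sketch below builds a mail and sends it over an implicit-TLS connection with Python's standard `smtplib`. The function names `build_message` and `send_message` are hypothetical, and the way `gpus_monitor` actually sends its mails may differ.

```python
import os
import smtplib
import ssl
from email.message import EmailMessage

SMTP_SERVER = "smtp.example.com"  # placeholder, as in config.py
PORT = 465


def build_message(sender, recipients, subject, body):
    """Assemble a plain-text mail for the given recipients."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_message(msg):
    """Send msg using the credentials stored in the environment."""
    user = os.environ.get("GPUSMONITOR_MAIL_USER")
    password = os.environ.get("GPUSMONITOR_MAIL_PASSWORD")
    if not user or not password:
        # os.environ.get returns None when a variable is not set.
        raise RuntimeError("SMTP credentials are missing from the environment")
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(SMTP_SERVER, PORT, context=context) as server:
        server.login(user, password)
        server.send_message(msg)
```

Checking the credentials before connecting gives a clear error when the environment variables have not been exported for the user running the script.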

5. Adjust the scanning rate of `gpus_monitor` and the minimum age of the processes it takes into account.


The `WAITING_TIME` variable sets the interval between two scans of `gpus_monitor`.

```python
WAITING_TIME = 0.5  # min
```

The `PROCESS_AGE` variable sets the minimum age a process must reach before `gpus_monitor` tracks it.

```python
PROCESS_AGE = 2  # min (gpus_monitor only considers processes that are at least 2 min old)
```
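
The interplay of these two settings can be sketched as follows. It assumes both values are expressed in minutes, as the comments above indicate, and that process start times are available as Unix timestamps. The names `is_tracked`, `run`, `scan_once` and `max_scans` are illustrative only.

```python
import time

WAITING_TIME = 0.5  # min, as in config.py
PROCESS_AGE = 2  # min, as in config.py


def is_tracked(started_at, now, min_age_min=PROCESS_AGE):
    """True once a process has been alive for at least min_age_min minutes."""
    return now - started_at >= min_age_min * 60


def run(scan_once, max_scans=None):
    """Call scan_once repeatedly, sleeping WAITING_TIME minutes in between.

    max_scans is a hypothetical parameter that bounds the loop, which is
    convenient for testing; None means scan forever.
    """
    done = 0
    while max_scans is None or done < max_scans:
        scan_once()
        done += 1
        if max_scans is None or done < max_scans:
            time.sleep(WAITING_TIME * 60)
```

With these defaults, a process is first reported between 2 and 2.5 minutes after its launch. This is consistent with the sample mail above, where a process launched at 11:43:48 is reported at 11:45:52.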

6. Run `gpus_monitor` when the machine starts up.

```bash
crontab -e
```
Add the following line to the file that opens :

```bash
@reboot python3 /path/to/gpus_monitor/src/gpus_monitor/main.py
```

## Ideas to enhance the project :

If you have any ideas to improve this project, don't hesitate to make a merge request ! :)


## To test `gpus_monitor` on your own :

I've implemented the small non-linear XOR problem in PyTorch.
You can test `gpus_monitor` yourself by running :
```bash
python3 gpus_monitor/test_torch.py
```