# NVIDIA ® GPUs monitor
`gpus_monitor` is a Python tool that monitors NVIDIA® GPU activity and reports (stdout, email, etc.) compute processes that have just started or recently ended.

## Use case
When you launch a new training run on a machine that `gpus_monitor` is watching, you will receive an email notification about the new script within a few seconds. The email contains several pieces of information about the process.

You will also receive an email when a compute process ends, whether its exit status is 0 or not.
## Mail content gpus_monitor will send you :
1. New training detected :
>> From: <gpusstatus@mydomain.com>
>> Subject : 1 processes running on <MACHINE_NAME> (LOCAL_IP_OF_THE_MACHINE)
>>
>> New events (triggered on the 08/11/2020 11:45:52):
>>
>> ---------------------------------------------------------------------------------------------------------------
>> A new process (PID : 12350) has been launched on GPU 0 (Quadro RTX 4000) by <owner_of_the_process> since 08/11/2020 11:43:48
>> His owner (<owner_of_the_process>) has executed the following command :
>> python3 test_torch.py
>> From :
>> <absolute_path_to_the_script_of_the_new_launched_process>
>>
>> CPU Status (currently):
>> For this process : 19/40 logic cores (47.5%)
>>
>> GPU Status (currently):
>> - Used memory (for this process): 879 / 7979.1875 MiB (11.02 % used)
>> - Used memory (for all processes running on this GPU) 7935.3125 / 7979.1875 MiB (99.45 % used)
>> - Temperature : 83 Celsius
>> - Driver version : 435.21
>> ---------------------------------------------------------------------------------------------------------------
>>
>>
>>
>> This message has been automatically send by a robot. Please don't answer to this mail
>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool
2. Training died (either finished well or not)
>> From: <gpusstatus@mydomain.com>
>> Subject : 1 processes running on <MACHINE_NAME> (LOCAL_IP_OF_THE_MACHINE)
>>
>> New events (triggered on the 08/11/2020 11:47:29):
>>
>> ---------------------------------------------------------------------------------------------------------------
>> The process (PID : 12350) launched by araison since 08/11/2020 11:43:48 has ended.
>> His owner <owner_of_the_process> had executed the following command :
>> python3 test_torch.py
>> From :
>> <absolute_path_to_the_script_of_the_died_process>
>>
>> The process took 0:03:41 to finish.
>> --------------------------------------------------------------------------------------------------------------
>>
>> This message has been automatically send by a robot. Please don't answer to this mail
>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool
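
The percentages and the run duration shown in the sample emails can be reproduced with the standard library alone. The following is only a sketch of that arithmetic, not the tool's actual code: the helper names `percent_used` and `elapsed` are hypothetical, and the timestamp format is assumed to be day/month/year, as the samples suggest.

```python
from datetime import datetime

# Timestamp layout assumed from the sample emails (day/month/year).
FMT = "%d/%m/%Y %H:%M:%S"


def percent_used(used_mib, total_mib):
    """Return memory usage as a percentage rounded to two decimals."""
    return round(100 * used_mib / total_mib, 2)


def elapsed(start, end):
    """Return the time between two timestamps as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Values taken from the sample emails above.
print(percent_used(879, 7979.1875))                            # 11.02
print(elapsed("08/11/2020 11:43:48", "08/11/2020 11:47:29"))   # 0:03:41
```
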
## How to
1. Cloning this repository :
`git clone https://gitea.zaclys.com/araison/gpus_monitor.git`
2. Installing dependencies :
`pip3 install -r gpus_monitor/requirements.txt`
or
`python3 gpus_monitor/setup.py install --user`
3. Add the email addresses of the people to notify to the `persons_to_inform.yaml` file :
Example:
```yaml
list:
- <email 1>
- <other_person_to_inform@hisdomain.com>
```
Note : You can add or remove addresses in this file while `gpus_monitor` is running, without killing the scanning process !
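
One way such hot reloading can work is to re-read the recipients file at every scan instead of once at startup. The sketch below illustrates the idea under stated assumptions: `load_recipients` is a hypothetical helper, and to stay dependency-free it parses only the simple `list:` layout shown above by hand, where a real implementation would more likely rely on a YAML library.

```python
from pathlib import Path


def load_recipients(path):
    """Return the addresses found as `- item` entries in the file.

    Hand-rolled stand-in for a YAML parser: it only understands the
    flat `list:` layout used by persons_to_inform.yaml.
    """
    recipients = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("- "):
            recipients.append(line[2:].strip())
    return recipients
```

Calling `load_recipients("persons_to_inform.yaml")` just before each notification picks up any edit made to the file since the previous scan.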
4. Add the SMTP server parameters (server address, credentials, port number, etc.)
To adjust these variables, edit the `gpus_monitor/src/gpus_monitor/config.py` file :
```bash
cd gpus_monitor/src/gpus_monitor/
vim config.py
```
For privacy reasons, the login and password of my dedicated SMTP account are stored on the machine in two environment variables.
```python
import os

# Credentials are read from the environment so they never appear in this file.
USER = os.environ.get("GPUSMONITOR_MAIL_USER")
PASSWORD = os.environ.get("GPUSMONITOR_MAIL_PASSWORD")
PORT = 465
SMTP_SERVER = "smtp.example.com"
```
See https://askubuntu.com/a/58828 for how to set environment variables permanently.
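
To check that these settings work before starting the monitor, a short standalone script can send a test email. This is a minimal sketch, not the code `gpus_monitor` itself uses: `build_message` and `send_message` are hypothetical helpers, and it assumes the server expects implicit TLS, which is the usual convention on port 465.

```python
import os
import smtplib
import ssl
from email.message import EmailMessage

SMTP_SERVER = "smtp.example.com"  # same placeholder as in config.py
PORT = 465                        # assumed to mean implicit TLS


def build_message(sender, recipients, subject, body):
    """Assemble a plain-text email for the given recipients."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_message(msg):
    """Send the message with the credentials stored in the environment."""
    user = os.environ.get("GPUSMONITOR_MAIL_USER")
    password = os.environ.get("GPUSMONITOR_MAIL_PASSWORD")
    if not user or not password:
        raise RuntimeError("SMTP credentials are missing from the environment")
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(SMTP_SERVER, PORT, context=context) as server:
        server.login(user, password)
        server.send_message(msg)
```

For example, `send_message(build_message("gpusstatus@mydomain.com", ["me@example.com"], "test", "hello"))` should either deliver the mail or raise an error that points at the faulty setting.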
5. Adjust the scanning rate of `gpus_monitor` and the minimum age of the processes it takes into account.
The `WAITING_TIME` variable sets the delay between two scans.
```python
WAITING_TIME = 0.5 # min
```
The `PROCESS_AGE` variable sets the minimum age a process must reach before `gpus_monitor` tracks it.
```python
PROCESS_AGE = 2 # min (gpus_monitor only considers processes that are at least 2 min old)
```
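
The way these two settings interact can be pictured as a polling loop: every `WAITING_TIME` minutes, list the GPU processes, keep those older than `PROCESS_AGE`, and compare the result with the previous scan. The sketch below only illustrates that idea and is not the tool's implementation; `tracked`, `diff`, `monitor` and the two callables `list_gpu_processes` and `report` are hypothetical names.

```python
import time

WAITING_TIME = 0.5  # minutes between two scans
PROCESS_AGE = 2     # minimum age, in minutes, for a process to be tracked


def tracked(processes, now):
    """Keep the processes that are at least PROCESS_AGE minutes old.

    `processes` maps a PID to its start time in seconds since the epoch.
    """
    return {pid: start for pid, start in processes.items()
            if now - start >= PROCESS_AGE * 60}


def diff(previous, current):
    """Return the PIDs that appeared and the PIDs that disappeared."""
    started = sorted(set(current) - set(previous))
    ended = sorted(set(previous) - set(current))
    return started, ended


def monitor(list_gpu_processes, report):
    """Poll forever; both arguments are caller-supplied callables."""
    previous = {}
    while True:
        current = tracked(list_gpu_processes(), time.time())
        started, ended = diff(previous, current)
        if started or ended:
            report(started, ended)
        previous = current
        time.sleep(WAITING_TIME * 60)
```

With this structure, a short-lived process that ends before reaching `PROCESS_AGE` never triggers a notification.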
6. Executing `gpus_monitor` when the machine starts up.
```bash
crontab -e
```
Add the following line to the file that opens :
```bash
@reboot python3 /path/to/gpus_monitor/src/gpus_monitor/main.py
```
## Ideas to enhance the project :
If you have any ideas to improve this project, don't hesitate to make a merge request ! :)
## To test `gpus_monitor` on your own:
I've implemented the tiny non-linear XOR problem in PyTorch.
You can test `gpus_monitor` on your own by running :
```bash
python3 gpus_monitor/test_torch.py
```