Update
parent
bae00013d3
commit
4ce469365c
40
README.md
|
@ -1,16 +1,17 @@
|
||||||
# GPUs Monitor
|
# GPUs Monitor
|
||||||
|
|
||||||
|
`gpus_monitor` is a Python tool that monitors NVIDIA® GPU activity and reports (stdout, email, etc.) new or recently terminated compute processes.
|
||||||
|
|
||||||
|
## Use case
|
||||||
|
|
||||||
`gpus_monitor` is a Python GPUs activities monitoring tool designed to report by email new and recently died compute processes over the machine where it has been run on.
|
|
||||||
Basically, when you have just run a new stable training on the machine where
|
Basically, when you have just run a new stable training on the machine where
|
||||||
`gpus_monitor` listen to, you will received in a few seconds an email
|
`gpus_monitor` is listening, you will receive within a few seconds an email
|
||||||
notification. This email will contains several informations about the process
|
notification about this new script launch. This email will contain several pieces of information about this process.
|
||||||
that has been launched.
|
|
||||||
You will also receive an email if a compute process died (with EXIT_STATUS = 0 or not).
|
You will also receive an email if a compute process dies (whether it exited with EXIT_STATUS = 0 or not).
|
||||||
|
|
||||||
|
|
||||||
### Kind of mail gpus_monitor is going to send you :
|
## Mail content gpus_monitor will send you :
|
||||||
|
|
||||||
1. New training detected :
|
1. New training detected :
|
||||||
|
|
||||||
|
@ -62,12 +63,12 @@ You will also receive an email if a compute process died (with EXIT_STATUS = 0 o
|
||||||
>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool
|
>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool
|
||||||
|
|
||||||
|
|
||||||
## Instructions to use gpus_monitor :
|
## How to
|
||||||
|
|
||||||
|
|
||||||
1. Cloning this repository :
|
1. Cloning this repository :
|
||||||
|
|
||||||
`git clone https://github.com/araison12/gpus_monitor.git`
|
`git clone https://gitea.zaclys.com/araison/gpus_monitor.git`
|
||||||
|
|
||||||
2. Installing dependencies :
|
2. Installing dependencies :
|
||||||
|
|
||||||
|
@ -83,8 +84,8 @@ Example:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
list:
|
list:
|
||||||
- adrien.raison@univ-poitiers.fr
|
- <email 1>
|
||||||
- other_person_to_inform@hisdomain.com
|
- <other_person_to_inform@hisdomain.com>
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
|
@ -102,9 +103,7 @@ vim config.py
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
For privacy purposes, login of my dedicated SMTP account are stored in a machine in 2 environment variables. I've set up a brandnew Gmail account for my `gpus_monitor` instance. I can share with you my credentials in order to use a single SMTP account for each `gpus_monitor` instance listening several machine (max 100 mails/24h) , feel free to send me an email if you are interested in !
|
For privacy, the login credentials of my dedicated SMTP account are stored on the machine in two environment variables.
|
||||||
Otherwise, fill in with your own SMTP server configuration.
|
|
||||||
|
|
||||||
|
|
||||||
```python
|
```python
|
||||||
USER = os.environ.get(
|
USER = os.environ.get(
|
||||||
|
@ -115,7 +114,7 @@ PASSWORD = os.environ.get(
|
||||||
"GPUSMONITOR_MAIL_PASSWORD"
|
"GPUSMONITOR_MAIL_PASSWORD"
|
||||||
)
|
)
|
||||||
PORT = 465
|
PORT = 465
|
||||||
SMTP_SERVER = "smtp.gmail.com"
|
SMTP_SERVER = "smtp.example.com"
|
||||||
```
|
```
|
||||||
|
|
||||||
See https://askubuntu.com/a/58828 for how to set environment variables permanently.
|
See https://askubuntu.com/a/58828 for how to set environment variables permanently.
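
As an illustration of how the settings above could be used, here is a minimal sketch that builds a notification and sends it over implicit TLS on port 465. It is not taken from the `gpus_monitor` source: `build_notification` and `send_notification` are hypothetical helper names, and the host is the placeholder from the excerpt.

```python
import smtplib
import ssl
from email.message import EmailMessage

# Values mirroring the config.py excerpt; the host is a placeholder.
PORT = 465
SMTP_SERVER = "smtp.example.com"


def build_notification(sender, recipients, subject, body):
    """Build a plain-text email (hypothetical helper, not from gpus_monitor)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_notification(msg, user, password):
    """Send msg over an SSL-wrapped SMTP connection (hypothetical helper)."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(SMTP_SERVER, PORT, context=context) as server:
        server.login(user, password)
        server.send_message(msg)
```

In this sketch, `user` and `password` would be the `USER` and `PASSWORD` values that `config.py` reads from the environment, so no credentials need to appear in the code itself.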
|
||||||
|
@ -129,7 +128,7 @@ The `WAITING_TIME` variable adjusts the scan timing rate of gpus_monitor.
|
||||||
WAITING_TIME = 0.5 # min
|
WAITING_TIME = 0.5 # min
|
||||||
```
|
```
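
To make the role of `WAITING_TIME` concrete, the following is a rough sketch of a polling loop that compares two consecutive scans. It is an assumption about how such a loop might look, not the actual `gpus_monitor` code: `diff_pids`, `monitor`, `list_gpu_pids` and `report` are hypothetical names.

```python
import time

WAITING_TIME = 0.5  # min, as in config.py


def diff_pids(previous, current):
    """Return (started, finished) PID sets between two scans."""
    return current - previous, previous - current


def monitor(list_gpu_pids, report, interval_min=WAITING_TIME, max_scans=None):
    """Poll list_gpu_pids() every interval_min minutes and report changes.

    list_gpu_pids and report are hypothetical callables supplied by the caller.
    """
    previous = set(list_gpu_pids())
    scans = 0
    while max_scans is None or scans < max_scans:
        time.sleep(interval_min * 60)  # convert minutes to seconds
        current = set(list_gpu_pids())
        started, finished = diff_pids(previous, current)
        if started or finished:
            report(started, finished)
        previous = current
        scans += 1
```

A smaller `WAITING_TIME` shortens the delay before a change is noticed, at the cost of scanning more often.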
|
||||||
|
|
||||||
The `PROCESS_AGE` variable adjusts the processes age that gpus_monitor has to take in account.
|
The `PROCESS_AGE` variable sets the minimum age a process must reach before gpus_monitor tracks it.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
PROCESS_AGE = 2 # min (gpus_monitor only considers processes aged >= 2 min)
|
PROCESS_AGE = 2 # min (gpus_monitor only considers processes aged >= 2 min)
|
||||||
|
@ -148,21 +147,12 @@ Add the following line to the brandnew opened file :
|
||||||
|
|
||||||
## Ideas to enhance the project :
|
## Ideas to enhance the project :
|
||||||
|
|
||||||
- Log system (owner, total calculation time by user)
|
|
||||||
- Manage cases (subject): processes finished well or not (Send Traceback)
|
|
||||||
- Centralized system that scan every machine on a given IP adresses range.
|
|
||||||
- Better errors management (SMTP connection failed, no Cuda compatible GPU on the machine,..)
|
|
||||||
- Documenting the project
|
|
||||||
- Rewrite it in oriented object fashion
|
|
||||||
|
|
||||||
|
|
||||||
If you have any ideas to improve this project, don't hesitate to make a merge request ! :)
|
If you have any ideas to improve this project, don't hesitate to make a merge request ! :)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## To test `gpus_monitor` on your own:
|
## To test `gpus_monitor` on your own:
|
||||||
|
|
||||||
I've implemented the tiny non linear XOR problem in pyTorch.
|
I've implemented the tiny non-linear XOR problem in PyTorch.
|
||||||
You can test `gpus_monitor` on your own by running :
|
You can test `gpus_monitor` on your own by running :
|
||||||
```bash
|
```bash
|
||||||
python3 gpus_monitor/test_torch.py
|
python3 gpus_monitor/test_torch.py
|
||||||
|
|