Update
parent bae00013d3
commit 4ce469365c
README.md: 40 lines changed
@@ -1,16 +1,17 @@
# GPUs Monitor

`gpus_monitor` is a Pythonic NVIDIA® GPU activity monitoring tool designed to report (stdout, email, etc.) new or recently terminated computing processes.

## Use case

`gpus_monitor` is a Python GPU activity monitoring tool designed to report by email new and recently terminated compute processes on the machine where it is running.
Basically, when you have just run a new stable training on the machine where
`gpus_monitor` listens, you will receive within a few seconds an email
notification. This email will contain several pieces of information about the process
that has been launched.
You will also receive an email if a compute process died (with EXIT_STATUS = 0 or not).
`gpus_monitor` is listening, you will receive within a few seconds an email
notification about this brand new script launch. This email will contain several pieces of information regarding this process.

You will also receive an email if a compute process died (whether EXIT_STATUS = 0 or not).


### Kind of mail gpus_monitor is going to send you :
## Mail content gpus_monitor will send you :

1. New training detected :

@@ -62,12 +63,12 @@ You will also receive an email if a compute process died (with EXIT_STATUS = 0 o
>> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool


## Instructions to use gpus_monitor :
## How to


1. Cloning this repository :

`git clone https://github.com/araison12/gpus_monitor.git`
`git clone https://gitea.zaclys.com/araison/gpus_monitor.git`

2. Installing dependencies :

@@ -83,8 +84,8 @@ Example:

```yaml
list:
- adrien.raison@univ-poitiers.fr
- other_person_to_inform@hisdomain.com
- <email 1>
- <other_person_to_inform@hisdomain.com>
```


@@ -102,9 +103,7 @@ vim config.py
```


For privacy purposes, the login credentials of my dedicated SMTP account are stored on the machine in two environment variables. I've set up a brand-new Gmail account for my `gpus_monitor` instance. I can share my credentials with you so that a single SMTP account serves every `gpus_monitor` instance listening on several machines (max 100 mails/24h); feel free to send me an email if you are interested!
Otherwise, fill in your own SMTP server configuration.

For privacy purposes, the login credentials of my dedicated SMTP account are stored on the machine in two environment variables.

```python
USER = os.environ.get(
@@ -115,7 +114,7 @@ PASSWORD = os.environ.get(
    "GPUSMONITOR_MAIL_PASSWORD"
)
PORT = 465
SMTP_SERVER = "smtp.gmail.com"
SMTP_SERVER = "smtp.example.com"
```

See https://askubuntu.com/a/58828 for how to set environment variables permanently.
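
The settings in this hunk could be consumed roughly as follows. This is only a minimal sketch, not code from `gpus_monitor`: the helper names `load_credentials`, `build_message` and `send_message` are hypothetical, and so is the `GPUSMONITOR_MAIL_USER` variable name, since the diff cuts off the `USER` line. Only `GPUSMONITOR_MAIL_PASSWORD`, `PORT` and `SMTP_SERVER` come from the README.

```python
import os
import smtplib
import ssl
from email.message import EmailMessage

# Values shown in the README config.
PORT = 465
SMTP_SERVER = "smtp.example.com"


def load_credentials(env=os.environ):
    """Read the SMTP login from environment variables.

    GPUSMONITOR_MAIL_PASSWORD appears in the README; GPUSMONITOR_MAIL_USER
    is an assumed name, because the USER line is truncated in the diff.
    """
    user = env.get("GPUSMONITOR_MAIL_USER")
    password = env.get("GPUSMONITOR_MAIL_PASSWORD")
    if not user or not password:
        raise RuntimeError("SMTP credentials are not set in the environment")
    return user, password


def build_message(sender, recipients, subject, body):
    """Build a plain-text notification email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_message(msg, user, password):
    """Send the message over implicit TLS, which is what port 465 expects."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(SMTP_SERVER, PORT, context=context) as server:
        server.login(user, password)
        server.send_message(msg)
```

Keeping the message construction separate from the network call makes it possible to check the email content without contacting an SMTP server.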
@@ -129,7 +128,7 @@ The `WAITING_TIME` variable adjusts the scan timing rate of gpus_monitor.
WAITING_TIME = 0.5 # min
```

The `PROCESS_AGE` variable sets the minimum process age that gpus_monitor takes into account.
The `PROCESS_AGE` variable sets the minimum process age that gpus_monitor tracks.

```python
PROCESS_AGE = 2 # min (gpus_monitor only considers processes at least 2 min old)
@@ -148,21 +147,12 @@ Add the following line to the brandnew opened file :

## Ideas to enhance the project :

- Log system (owner, total calculation time by user)
- Manage cases (subject): processes finished well or not (Send Traceback)
- Centralized system that scans every machine in a given range of IP addresses.
- Better error management (SMTP connection failed, no CUDA-compatible GPU on the machine, ...)
- Documenting the project
- Rewrite it in an object-oriented fashion


If you have any ideas to improve this project, don't hesitate to make a merge request ! :)



## To test `gpus_monitor` on your own:

I've implemented the tiny nonlinear XOR problem in pyTorch.
I've implemented the tiny nonlinear XOR problem in PyTorch.
You can test `gpus_monitor` on your own by running :
```bash
python3 gpus_monitor/test_torch.py