From 4ce469365cfe697582ec07d9908a8fe952d085d6 Mon Sep 17 00:00:00 2001
From: araison
Date: Sun, 28 Apr 2024 17:02:12 +0200
Subject: [PATCH] Update

---
 README.md | 40 +++++++++++++++-------------------------
 1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index edea61f..ce999da 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,17 @@
 # GPUs Monitor
+`gpus_monitor` is a Python tool that monitors NVIDIA® GPU activity and reports (to stdout, by email, etc.) new and recently terminated compute processes.
+## Use case

-`gpus_monitor` is a Python GPUs activities monitoring tool designed to report by email new and recently died compute processes over the machine where it has been run on.
 Basically, when you have just run a new stable training on the machine where
-`gpus_monitor` listen to, you will received in a few seconds an email
-notification. This email will contains several informations about the process
-that has been launched.
-You will also receive an email if a compute process died (with EXIT_STATUS = 0 or not).
+`gpus_monitor` is listening, you will receive an email
+notification about the newly launched script within a few seconds. This email contains several pieces of information about the process.
+
+You will also receive an email when a compute process dies (whether its EXIT_STATUS is 0 or not).

-### Kind of mail gpus_monitor is going to send you :
+## Emails gpus_monitor will send you:

 1. New training detected :

@@ -62,12 +63,12 @@ You will also receive an email if a compute process died (with EXIT_STATUS = 0 o

 >> Please, feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or to share your ideas to improve this tool

-## Instructions to use gpus_monitor :
+## How to use it

 1. Cloning this repository :

-`git clone https://github.com/araison12/gpus_monitor.git`
+`git clone https://gitea.zaclys.com/araison/gpus_monitor.git`

 2. Installing dependencies :

@@ -83,8 +84,8 @@ Example:

 ```yaml
 list:
-  - adrien.raison@univ-poitiers.fr
-  - other_person_to_inform@hisdomain.com
+  -
+  -

 ```

@@ -102,9 +103,7 @@ vim config.py
 ```

-For privacy purposes, login of my dedicated SMTP account are stored in a machine in 2 environment variables. I've set up a brandnew Gmail account for my `gpus_monitor` instance. I can share with you my credentials in order to use a single SMTP account for each `gpus_monitor` instance listening several machine (max 100 mails/24h) , feel free to send me an email if you are interested in !
-Otherwise, fill in with your own SMTP server configuration.
-
+For privacy purposes, the credentials of my dedicated SMTP account are stored on the machine in 2 environment variables.

 ```python
 USER = os.environ.get(
     "GPUSMONITOR_MAIL_ADRESS"
 )
 PASSWORD = os.environ.get(
     "GPUSMONITOR_MAIL_PASSWORD"
 )
 PORT = 465
-SMTP_SERVER = "smtp.gmail.com"
+SMTP_SERVER = "smtp.example.com"
 ```

 See https://askubuntu.com/a/58828 to handle efficiently (permanent adding) environment variables.
@@ -129,7 +128,7 @@ The `WAITING_TIME` variable adjusts the scan timing rate of gpus_monitor.
 WAITING_TIME = 0.5 # min
 ```

-The `PROCESS_AGE` variable adjusts the processes age that gpus_monitor has to take in account.
+The `PROCESS_AGE` variable sets the minimum age a process must reach before gpus_monitor tracks it.

 ```python
 PROCESS_AGE = 2 # min (gpus_monitor only consider >=2min aged processes)
@@ -148,21 +147,12 @@ Add the following line to the brandnew opened file :

 ## Ideas to enhance the project :

-- Log system (owner, total calculation time by user)
-- Manage cases (subject): processes finished well or not (Send Traceback)
-- Centralized system that scan every machine on a given IP adresses range.
-- Better errors management (SMTP connection failed, no Cuda compatible GPU on the machine,..)
-- Documenting the project
-- Rewrite it in oriented object fashion
-
-
 If you have any ideas to improve this project, don't hesitate to make a merge request ! :)
-

 ## To test `gpus_monitor` by your own:

-I've implemented the tiny non linear XOR problem in pyTorch.
+I've implemented the tiny non-linear XOR problem in PyTorch.
 You can test `gpus_monitor` by your own while running :

 ```bash
 python3 gpus_monitor/test_torch.py
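
As a complement to the environment-variable note in the patched README (the askubuntu.com/a/58828 reference), here is a minimal sketch of a persistent setup. It is not part of the patch: it assumes a bash login shell that reads `~/.profile` and reuses the `GPUSMONITOR_MAIL_*` names from the `config.py` excerpt above; the address and password values are placeholders, and the names should be adjusted if your `config.py` differs.

```bash
# Minimal sketch (not part of the patch): persist the two SMTP credentials
# that gpus_monitor's config.py reads via os.environ.get().
# Assumptions: a bash login shell sourcing ~/.profile; placeholder values.
echo 'export GPUSMONITOR_MAIL_ADRESS="monitoring.bot@example.com"' >> ~/.profile
echo 'export GPUSMONITOR_MAIL_PASSWORD="replace-with-a-real-app-password"' >> ~/.profile

# Reload the profile so the current shell sees the variables immediately.
source ~/.profile

# Quick sanity check that both variables are visible to Python.
python3 -c 'import os; print(bool(os.environ.get("GPUSMONITOR_MAIL_ADRESS")), bool(os.environ.get("GPUSMONITOR_MAIL_PASSWORD")))'
```

Keeping the password in an environment variable rather than hard-coded in `config.py` keeps it out of version control, which matches the intent of this patch (the Gmail credentials and personal addresses are removed from the README).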