diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 226e6f5..de698c0 100644
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -7,4 +7,3 @@ Version 0.1
 - Feature A added
 - FIX: nasty bug #1729 fixed
-- add your changes here!
diff --git a/README.rst b/README.rst
index 910bac2..b43e29d 100644
--- a/README.rst
+++ b/README.rst
@@ -1,19 +1,160 @@
-============
-gpus_monitor
-============
+# GPUs Monitor
-Add a short description here!
+
+`gpus_monitor` is a Python GPU-activity monitoring tool that reports, by email, new and recently terminated compute processes on the machine it runs on.
+When you launch a new training on a machine that `gpus_monitor` listens to, you receive an email notification within a few seconds. This email contains several pieces of information about the process you have launched.
+You also receive an email when a compute process dies (whether its exit status is 0 or not).
-Description
-===========
+### Examples of the emails gpus_monitor sends:
+
+1. New training detected:
+
+>> From:
+>> Subject : 1 processes running on (LOCAL_IP_OF_THE_MACHINE)
+>>
+>> New events (triggered on the 08/11/2020 11:45:52):
+>>
+>> ---------------------------------------------------------------------------------------------------------------
+>> A new process (PID : 12350) has been launched on GPU 0 (Quadro RTX 4000) by since 08/11/2020 11:43:48
+>> Its owner () executed the following command :
+>> python3 test_torch.py
+>> From :
+>>
+>>
+>> CPU Status (currently):
+>> For this process : 19/40 logic cores (47.5%)
+>>
+>> GPU Status (currently):
+>> - Used memory (for this process): 879 / 7979.1875 MiB (11.02 % used)
+>> - Used memory (for all processes running on this GPU): 7935.3125 / 7979.1875 MiB (99.45 % used)
+>> - Temperature : 83 Celsius
+>> - Driver version : 435.21
+>> ---------------------------------------------------------------------------------------------------------------
+>>
+>>
+>>
+>> This message has been sent automatically by a robot. Please don't reply to this mail.
+>> Please feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or want to share ideas to improve this tool
+
+
+2. Training ended (whether it finished cleanly or not):
+>> From:
+>> Subject : 1 processes died on (LOCAL_IP_OF_THE_MACHINE)
+>>
+>> New events (triggered on the 08/11/2020 11:47:29):
+>>
+>> ---------------------------------------------------------------------------------------------------------------
+>> The process (PID : 12350) launched by araison since 08/11/2020 11:43:48 has ended.
+>> Its owner araison had executed the following command :
+>> python3 test_torch.py
+>> From :
+>>
+>>
+>> The process took 0:03:41 to finish.
+>> ---------------------------------------------------------------------------------------------------------------
+>>
+>> This message has been sent automatically by a robot. Please don't reply to this mail.
+>> Please feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or want to share ideas to improve this tool
+
+
+## Instructions to use gpus_monitor:
+
+
+1. Clone this repository:
+
+`git clone https://github.com/araison12/gpus_monitor.git`
+
+2. Install the dependencies:
+
+`pip3 install -r gpus_monitor/requirements.txt`
+
+or
+
+`python3 setup.py install --user`
+
+3. Add people's email addresses to the list in the `persons_to_inform.yaml` file:
+
+Example:
+
+```yaml
+list:
+  - adrien.raison@univ-poitiers.fr
+  - other_person_to_inform@hisdomain.com
+```
+
+
+
+Note: you can add or remove addresses in this file on the fly, without killing the scanning process!
+
+4. Set the SMTP server parameters (server address, credentials, port number, etc.)
+
+These settings live in the `gpus_monitor/src/gpus_monitor/config.py` file:
+
+```bash
+cd gpus_monitor/src/gpus_monitor/
+vim config.py
+```
+
+
+For privacy reasons, the credentials of my dedicated SMTP account are stored in two environment variables on the machine. I have set up a brand-new Gmail account for my `gpus_monitor` instance. If you would like to use a single SMTP account for `gpus_monitor` instances on several machines, feel free to send me an email and I can share the credentials!
+Otherwise, fill in your own SMTP server configuration.
+
+
+```python
+USER = os.environ.get(
+    "GPUSMONITOR_MAIL_USER"
+)
+
+PASSWORD = os.environ.get(
+    "GPUSMONITOR_MAIL_PASSWORD"
+)
+PORT = 465
+SMTP_SERVER = "smtp.gmail.com"
+```
+
+See https://askubuntu.com/a/58828 for how to set environment variables permanently.
+
+5. Adjust the scanning rate of `gpus_monitor` and the minimum age of the processes it takes into account.
+
+
+The `WAITING_TIME` variable sets the scanning interval of `gpus_monitor`.
+
+```python
+WAITING_TIME = 0.5  # min
+```
+
+The `PROCESS_AGE` variable sets the minimum age a process must reach before gpus_monitor takes it into account.
+
+```python
+PROCESS_AGE = 2  # min (gpus_monitor only considers processes that are at least 2 min old)
+```
+
+6. Run `gpus_monitor` at machine start-up.
+
+```bash
+crontab -e
+```
+Add the following line to the file that opens:
+
+```bash
+@reboot python3 /path/to/gpu_monitor/src/gpus_monitor/main.py
+```
+
+# Ideas to enhance the project:
+
+- Logging system (owner, total compute time per user)
+- Distinguish, in the email subject, processes that finished cleanly from those that crashed (and send the traceback)
+- A centralized system that scans every machine in a given range of IP addresses
+- Better error handling (SMTP connection failure, no CUDA GPU on the machine, ...)
+- Document the project
+- Rewrite it in an object-oriented fashion
+
+
+If you have any ideas to improve this project, don't hesitate to open a merge request! :)
-A longer description of your project goes here...
-Note
-====
-This project has been set up using PyScaffold 3.2.3. For details and usage
-information on PyScaffold see https://pyscaffold.org/.
diff --git a/persons_to_inform.yaml b/persons_to_inform.yaml
new file mode 100644
index 0000000..f023bb0
--- /dev/null
+++ b/persons_to_inform.yaml
@@ -0,0 +1,5 @@
+---
+
+list:
+  - adrien.raison@univ-poitiers.fr
+  # - Add your mail here and uncomment this line
\ No newline at end of file
diff --git a/requirements.txt b/requirements.txt
index 3ce41f4..024a1c2 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,17 +1,4 @@
-# =============================================================================
-# DEPRECATION WARNING:
-#
-# The file `requirements.txt` does not influence the package dependencies and
-# will not be automatically created in the next version of PyScaffold (v4.x).
-#
-# Please have look at the docs for better alternatives
-# (`Dependency Management` section).
-# =============================================================================
-#
-# Add your pinned requirements so that they can be easily installed with:
-# pip install -r requirements.txt
-# Remember to also add them in setup.cfg but unpinned.
-# Example:
-# numpy==1.13.3
-# scipy==1.0
-#
+pyyaml
+psutil
+netifaces
+pynvml
\ No newline at end of file
diff --git a/setup.cfg b/setup.cfg
index f0dbb55..45d8641 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -4,15 +4,15 @@
 [metadata]
 name = gpus_monitor
-description = Add a short description here!
+description = GPU monitoring tool for machine learning purposes
 author = araison
 author-email = adrien.raison@univ-poitiers.fr
 license = proprietary
 long-description = file: README.rst
 long-description-content-type = text/x-rst; charset=UTF-8
-url = https://github.com/pyscaffold/pyscaffold/
+url = https://github.com/araison12/gpus_monitor
 project-urls =
-    Documentation = https://pyscaffold.org/
+#    Documentation =
 # Change if running only on Windows, Mac or Linux (comma-separated)
 platforms = any
 # Add here all kinds of additional classifiers as defined under
@@ -30,7 +30,7 @@ package_dir =
 # DON'T CHANGE THE FOLLOWING LINE! IT WILL BE UPDATED BY PYSCAFFOLD!
 setup_requires = pyscaffold>=3.2a0,<3.3a0
 # Add here dependencies of your project (semicolon/line-separated), e.g.
-# install_requires = numpy; scipy
+install_requires = pyyaml;psutil;netifaces;pynvml
 # The usage of test_requires is discouraged, see `Dependency Management` docs
 # tests_require = pytest; pytest-cov
 # Require a specific Python version, e.g. Python 2.7 or >= 3.4
diff --git a/src/gpus_monitor/.vscode/settings.json b/src/gpus_monitor/.vscode/settings.json
new file mode 100644
index 0000000..615aafb
--- /dev/null
+++ b/src/gpus_monitor/.vscode/settings.json
@@ -0,0 +1,3 @@
+{
+    "python.pythonPath": "/usr/bin/python3"
+}
\ No newline at end of file
diff --git a/src/gpus_monitor/config.py b/src/gpus_monitor/config.py
new file mode 100644
index 0000000..44acfe9
--- /dev/null
+++ b/src/gpus_monitor/config.py
@@ -0,0 +1,25 @@
+import os
+import yaml
+
+# SMTP SERVER CONFIG
+
+
+USER = os.environ.get(
+    "GPUSMONITOR_MAIL_USER"
+)  # for the moment, only accessible from SIC08005 (all users)
+PASSWORD = os.environ.get(
+    "GPUSMONITOR_MAIL_PASSWORD"
+)  # for the moment, only accessible from SIC08005 (all users)
+PORT = 465
+SMTP_SERVER = "smtp.gmail.com"
+
+# SCANNING RATE
+
+WAITING_TIME = 0.5  # min
+PROCESS_AGE = 2  # min
+
+# PERSONS TO INFORM
+def persons_to_inform():
+    # Resolve the path relative to this file so the lookup also works when
+    # main.py is launched from cron, where the working directory differs.
+    yaml_path = os.path.join(
+        os.path.dirname(os.path.abspath(__file__)), "..", "..", "persons_to_inform.yaml"
+    )
+    with open(yaml_path, "r") as yaml_file:
+        return yaml.load(yaml_file, Loader=yaml.FullLoader)["list"]
diff --git a/src/gpus_monitor/main.py b/src/gpus_monitor/main.py
new file mode 100644
index 0000000..a63d0d5
--- /dev/null
+++ b/src/gpus_monitor/main.py
@@ -0,0 +1,118 @@
+import datetime
+import time
+
+import psutil
+
+import config
+import tools
+
+
+def main():
+
+    machine_infos = tools.get_machine_infos()
+    processes_to_monitor_pid = []
+    processes_to_monitor = []
+
+    while True:
+        time.sleep(config.WAITING_TIME * 60)
+        current_time = time.time()
+        gpus_info = tools.gpus_snap_info()
+        driver_version = gpus_info["driver_version"]
+        news = ""
+        send_info = False
+        new_processes_count = 0
+        died_processes_count = 0
+        for index, gpu in enumerate(gpus_info["gpu"]):
+            gpu_name = gpu["product_name"]
+            processes = gpu["processes"]
+            if processes == "N/A":
+                news += f"Nothing is running on GPU {index} ({gpu_name})\n"
+                continue
+            for p in processes:
+                pid = p["pid"]
+                p_info = tools.process_info(pid)
+                if p_info is None:
+                    # The process died between the GPU snapshot and now.
+                    continue
+                if (
+                    current_time - p_info["since"] >= config.PROCESS_AGE * 60
+                    and pid not in processes_to_monitor_pid
+                ):
+                    send_info = True
+                    new_processes_count += 1
+                    news += f"""
+
+                    ---------------------------------------------------------------------------------------------------------------
+                    A new process (PID : {pid}) has been launched on GPU {index} ({gpu_name}) by {p_info['owner']} since {datetime.datetime.fromtimestamp(int(p_info['since'])).strftime("%d/%m/%Y %H:%M:%S")}
+                    Its owner ({p_info['owner']}) executed the following command :
+                    {' '.join(p_info['executed_cmd'])}
+                    From :
+                    {p_info['from']}
+
+                    CPU Status (currently):
+                    For this process : {p_info['cpu_core_required']}
+
+                    GPU Status (currently):
+                    - Used memory (for this process): {p['used_memory']} / {gpu['fb_memory_usage']['total']} {gpu['fb_memory_usage']['unit']} ({round(p['used_memory']/gpu['fb_memory_usage']['total']*100,2)} % used)
+                    - Used memory (for all processes running on this GPU): {gpu['fb_memory_usage']['used']} / {gpu['fb_memory_usage']['total']} {gpu['fb_memory_usage']['unit']} ({round(gpu['fb_memory_usage']['used']/gpu['fb_memory_usage']['total']*100,2)} % used)
+                    - Temperature : {gpu["temperature"]["gpu_temp"]} Celsius
+                    - Driver version : {driver_version}
+                    ---------------------------------------------------------------------------------------------------------------
+
+
+                    """
+                    processes_to_monitor.append(p_info)
+                    processes_to_monitor_pid.append(pid)
+
+        for p in processes_to_monitor[:]:
+            pid = p["pid"]
+            try:
+                psutil.Process(pid)
+                continue
+            except psutil.NoSuchProcess:
+                send_info = True
+                died_processes_count += 1
+                news += f"""
+
+                ---------------------------------------------------------------------------------------------------------------
+                The process (PID : {pid}) launched by {p['owner']} since {datetime.datetime.fromtimestamp(int(p['since'])).strftime("%d/%m/%Y %H:%M:%S")} has ended.
+                Its owner {p['owner']} had executed the following command :
+                {' '.join(p['executed_cmd'])}
+                From :
+                {p['from']}
+
+                The process took {datetime.timedelta(seconds=int(current_time)-int(p['since']))} to finish.
+                ---------------------------------------------------------------------------------------------------------------
+
+
+                """
+                processes_to_monitor.remove(p)
+                processes_to_monitor_pid.remove(pid)
+
+        subject = None
+
+        if new_processes_count > 0:
+            subject = f"{new_processes_count} processes running on {machine_infos['MACHINE_NAME']} ({machine_infos['LOCAL_IP']})"
+        elif died_processes_count > 0:
+            subject = f"{died_processes_count} processes died on {machine_infos['MACHINE_NAME']} ({machine_infos['LOCAL_IP']})"
+        else:
+            subject = "Error"
+
+        now = datetime.datetime.now()
+        dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
+        global_message = f"""
+
+        New events (triggered on the {dt_string}):
+        {news}
+
+        This message has been sent automatically by a robot. Please don't reply to this mail.
+
+        Please feel free to open a merge request on github.com/araison12/gpus_monitor if you have encountered a bug or want to share ideas to improve this tool :)
+
+        """
+
+        if send_info:
+            for person in config.persons_to_inform():
+                tools.send_mail(subject, global_message, person)
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/src/gpus_monitor/tools.py b/src/gpus_monitor/tools.py
new file mode 100644
index 0000000..c13a6ac
--- /dev/null
+++ b/src/gpus_monitor/tools.py
@@ -0,0 +1,65 @@
+import os
+import smtplib
+import ssl
+from email.message import EmailMessage
+
+import netifaces as ni
+import psutil
+from pynvml.smi import nvidia_smi
+
+import config
+
+
+def send_mail(subject, message, receiver):
+
+    context = ssl.create_default_context()
+    msg = EmailMessage()
+    msg.set_content(message)
+    msg["Subject"] = subject
+    msg["From"] = config.USER
+    msg["To"] = receiver
+
+    with smtplib.SMTP_SSL(config.SMTP_SERVER, config.PORT, context=context) as server:
+        server.login(config.USER, config.PASSWORD)
+        server.send_message(msg)
+
+
+def get_machine_local_ip():
+    # Return the machine's IPv4 address on the (hard-coded) 194.167.* subnet.
+    # Interfaces without an IPv4 address are skipped instead of raising KeyError.
+    for inter in ni.interfaces():
+        addresses = ni.ifaddresses(inter).get(ni.AF_INET)
+        if addresses and addresses[0]["addr"].startswith("194.167"):
+            return addresses[0]["addr"]
+    return None
+
+
+def get_machine_infos():
+    infos_uname = os.uname()
+
+    return {
+        "OS_TYPE": infos_uname[0],
+        "MACHINE_NAME": infos_uname[1],
+        "LOCAL_IP": get_machine_local_ip(),
+    }
+
+
+def gpus_snap_info():
+    nvsmi = nvidia_smi.getInstance()
+    return nvsmi.DeviceQuery(
+        "memory.free,memory.total,memory.used,compute-apps,temperature.gpu,driver_version,timestamp,name"
+    )
+
+
+def process_info(pid):
+    try:
+        process = psutil.Process(pid)
+
+        return {
+            "pid": pid,
+            "owner": process.username(),
+            "executed_cmd": process.cmdline(),
+            "from": process.cwd(),
+            "since": process.create_time(),
+            "is_running": process.is_running(),
+            "cpu_core_required": f"{process.cpu_num()}/{os.cpu_count()} logic cores ({round(process.cpu_num()*100/os.cpu_count(), 1)}%)",
+        }
+    except psutil.NoSuchProcess:
+        return None
\ No newline at end of file
diff --git a/test_torch.py b/test_torch.py
new file mode 100644
index 0000000..bc7f2f5
--- /dev/null
+++ b/test_torch.py
@@ -0,0 +1,49 @@
+import torch
+import torch.nn as nn
+import torch.optim as optim
+
+EPOCHS_TO_TRAIN = 50000000000  # huge on purpose: this script is only meant to keep a GPU busy
+device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+if not torch.cuda.is_available():
+    raise RuntimeError(
+        f"The current device for this test training is {device}. "
+        "Please make sure at least one CUDA device is available."
+    )
+
+
+class Net(nn.Module):
+    def __init__(self):
+        super(Net, self).__init__()
+        self.fc1 = nn.Linear(2, 3, True)
+        self.fc2 = nn.Linear(3, 1, True)
+
+    def forward(self, x):
+        x = torch.sigmoid(self.fc1(x))
+        x = self.fc2(x)
+        return x
+
+
+net = Net().to(device)
+print(device)
+inputs = [torch.Tensor([s]).to(device) for s in [[0, 0], [0, 1], [1, 0], [1, 1]]]
+targets = [torch.Tensor([s]).to(device) for s in [[0], [1], [1], [0]]]
+
+criterion = nn.MSELoss()
+optimizer = optim.SGD(net.parameters(), lr=0.01)
+
+print("Training loop:")
+for idx in range(0, EPOCHS_TO_TRAIN):
+    for input, target in zip(inputs, targets):
+        optimizer.zero_grad()  # zero the gradient buffers
+        output = net(input)
+        loss = criterion(output, target)
+        loss.backward()
+        optimizer.step()  # does the update
+    if idx % 5000 == 0:
+        print(loss.item())
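
The README's note about hot-adding and removing recipients works because `persons_to_inform()` re-opens `persons_to_inform.yaml` on every scan cycle instead of caching its contents. A minimal sketch of that reload pattern (the `load_recipients` helper and the throwaway file are illustrative, not part of the repository; `yaml.safe_load` is used here in place of the `yaml.load(..., Loader=yaml.FullLoader)` call in `config.py`):

```python
import os
import tempfile

import yaml  # PyYAML, already listed in gpus_monitor's requirements


def load_recipients(path):
    # Re-read the YAML file on every call: this is why addresses can be
    # added or removed without restarting the scanning process.
    with open(path, "r") as yaml_file:
        return yaml.safe_load(yaml_file)["list"]


# Throwaway file standing in for persons_to_inform.yaml.
path = os.path.join(tempfile.mkdtemp(), "persons_to_inform.yaml")
with open(path, "w") as f:
    f.write("list:\n  - first@example.com\n")
print(load_recipients(path))  # ['first@example.com']

# Hot-edit the file between two scan cycles: the next call picks it up.
with open(path, "w") as f:
    f.write("list:\n  - first@example.com\n  - second@example.com\n")
print(load_recipients(path))  # ['first@example.com', 'second@example.com']
```

Because the file is re-parsed each cycle, a malformed edit only affects the cycles during which it is present; fixing the file restores delivery on the next scan.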