Hello to the VIAME/DIVE response team!
After trying to train a model or run a pipeline on VIAME/DIVE web, I received the error below. My last successful attempt was on February 28, 2025, and I don't understand why this is happening now. Can you help?
Thanks :)
-----
ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'
  File "/opt/dive/local/venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/opt/dive/local/venv/lib/python3.11/site-packages/girder_worker/task.py", line 154, in __call__
    results = super().__call__(*_t_args, **_t_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dive/local/venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dive/src/dive_tasks/tasks.py", line 342, in train_pipeline
    conf = Config()
           ^^^^^^^^
  File "/opt/dive/src/dive_tasks/tasks.py", line 66, in __init__
    self.gpu_process_env = get_gpu_environment()
                           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dive/src/dive_tasks/tasks.py", line 53, in get_gpu_environment
    gpus = [gpu.id for gpu in getGPUs() if gpu.uuid == gpu_uuid]
                              ^^^^^^^^^
  File "/opt/dive/local/venv/lib/python3.11/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])
-----
Hey, unfortunately this issue isn’t caused by anything on your end. Ever since an NVIDIA driver update a month or two ago, the web server loses access to the GPU every 4-5 days and needs a reboot. We’re looking into the issue now.
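
For anyone reading the traceback: the last frame shows GPUtil's getGPUs() calling int() on the text it got back from the GPU query, and in this case that text was the NVML error message rather than a device index, hence the ValueError. Below is a minimal sketch of a more defensive parse that would turn this into a clearer error. It assumes the query output is rows of "index, uuid"; parse_gpu_rows and GPUUnavailableError are hypothetical names for illustration, not part of GPUtil or DIVE.

```python
class GPUUnavailableError(RuntimeError):
    """Raised when the GPU query output is an error message, not device data."""


def parse_gpu_rows(output: str) -> list[tuple[int, str]]:
    """Parse 'index, uuid' CSV rows; raise a descriptive error on anything else.

    Hypothetical helper: the row layout is an assumption made for this sketch.
    """
    rows = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            index = int(fields[0])  # fails if the line is an error message
            uuid = fields[1]        # fails if the row has no second column
        except (ValueError, IndexError) as exc:
            raise GPUUnavailableError(
                f"GPU query did not return device data: {line!r}"
            ) from exc
        rows.append((index, uuid))
    return rows


if __name__ == "__main__":
    # Made-up sample rows standing in for a healthy query result.
    print(parse_gpu_rows("0, GPU-aaaa\n1, GPU-bbbb\n"))

    # The message from the traceback above, as it would arrive in place of rows.
    try:
        parse_gpu_rows("Failed to initialize NVML: Unknown Error")
    except GPUUnavailableError as err:
        print(f"GPU unavailable: {err}")
```

This wouldn't fix the underlying driver problem, but it would let the task fail with a message that points at the GPU instead of an int() conversion.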