VIAME/DIVE web ValueError: GPU detection and initialization

Hello to the VIAME/DIVE response team!

When I try to train a model or run a pipeline on VIAME/DIVE web, I get the error shown below. My last successful attempt was on February 28, 2025. I don't understand why this is happening now. Can you help?

Thanks :)
-----

ValueError: invalid literal for int() with base 10: 'Failed to initialize NVML: Unknown Error'

  File "/opt/dive/local/venv/lib/python3.11/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/opt/dive/local/venv/lib/python3.11/site-packages/girder_worker/task.py", line 154, in __call__
    results = super().__call__(*_t_args, **_t_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dive/local/venv/lib/python3.11/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dive/src/dive_tasks/tasks.py", line 342, in train_pipeline
    conf = Config()
           ^^^^^^^^
  File "/opt/dive/src/dive_tasks/tasks.py", line 66, in __init__
    self.gpu_process_env = get_gpu_environment()
                           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dive/src/dive_tasks/tasks.py", line 53, in get_gpu_environment
    gpus = [gpu.id for gpu in getGPUs() if gpu.uuid == gpu_uuid]
                              ^^^^^^^^^
  File "/opt/dive/local/venv/lib/python3.11/site-packages/GPUtil/GPUtil.py", line 102, in getGPUs
    deviceIds = int(vals[i])

Hey, unfortunately this isn't caused by anything you did. Ever since an NVIDIA driver update a month or two ago, the web server loses access to the GPU every 4-5 days and needs a reboot. We’re looking into the issue now.
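
For anyone reading the traceback: the ValueError comes from GPUtil handing nvidia-smi's error text ('Failed to initialize NVML: Unknown Error') to int(). Below is a minimal sketch of a more defensive check, assuming nvidia-smi is on the PATH. The names parse_gpu_indices and query_gpu_indices are hypothetical and are not part of DIVE or GPUtil; this illustrates the failure mode and is not the actual fix.

```python
import subprocess


def parse_gpu_indices(output: str) -> list[int]:
    """Extract GPU indices from `nvidia-smi --query-gpu=index` output.

    Non-numeric lines, such as 'Failed to initialize NVML: Unknown Error',
    are skipped instead of being passed to int().
    """
    indices = []
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():
            indices.append(int(line))
    return indices


def query_gpu_indices(run=subprocess.run) -> list[int]:
    """Return visible GPU indices, or [] if the driver cannot be reached.

    `run` is injectable only so the function can be exercised without a GPU.
    """
    try:
        proc = run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # nvidia-smi is missing or hung: treat as "no GPU available".
        return []
    return parse_gpu_indices(proc.stdout)
```

With a check like this, an empty list could be reported as "GPU unavailable" rather than surfacing as a ValueError deep inside a task.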