Hi all-
I was training a model in the VIAME desktop application when the code stopped executing with the general exception fault below. Would it be possible to restart training from the last good epoch, rather than running the model from the beginning? Also, if anyone knows whether this exception is a common fault with a known fix, that would be good to know.
The computer is a Lenovo Legion with 32 GB RAM and an RTX 4070, running Windows 11.
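For reference, this is roughly the kind of resume I'm hoping is possible. I don't know the actual layout of netharn's snapshot file, so the dictionary keys (`model_state`, `optimizer_state`, `epoch`) and the path below are made up for illustration; it's just the generic PyTorch checkpoint pattern:

```python
# Hypothetical sketch of resuming from a saved snapshot in plain PyTorch.
# The keys and file name are my own guesses, NOT netharn's actual format.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Simulate the explicit snapshot the harness wrote before crashing.
checkpoint_path = "checkpoint_epoch_36.pt"
torch.save({
    "epoch": 36,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, checkpoint_path)

# Resuming: load onto CPU first so a wedged GPU doesn't block the restore.
snapshot = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(snapshot["model_state"])
optimizer.load_state_dict(snapshot["optimizer_state"])
start_epoch = snapshot["epoch"] + 1
print(start_epoch)  # prints: 37 -- training would continue from epoch 37
```

If the explicit checkpoint the log mentions (epoch 36) is in this kind of format, restoring the model and optimizer state and continuing from the next epoch is what I'd like to do.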
INFO: general exception
harn.preferences = <FitHarnPreferences({'keyboard_debug': False, 'snapshot_after_error': True, 'deploy_after_error': True, 'show_prog': True, 'prog_backend': 'progiter', 'ignore_inf_loss_parts': False, 'log_gradients': False, 'use_tensorboard': True, 'eager_dump_tensorboard': True, 'dump_tensorboard': True, 'tensorboard_groups': ['loss'], 'export_modules': ['bioharn'], 'export_on_init': True, 'large_loss': 1000, 'num_keep': 10, 'keep_freq': 5, 'timeout': 1209600, 'auto_prepare_batch': False, 'verbose': 1, 'log_resources': True, 'use_tqdm': None, 'colored': True, 'allow_unicode': False}) at 0x21205e78070>
INFO: Attempting to checkpoint before crashing
INFO: Saving EXPLICIT snapshot to deep_training\fit\runs\viame-netharn-detector\hesvuwxi\explicit_checkpoints_epoch_00000036_2026-03-09T071653-5.pt
Traceback (most recent call last):
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 1594, in run
    harn._run_tagged_epochs(
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 1822, in _run_tagged_epochs
    harn._run_epoch(train_loader, tag='train', learn=True)
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 2064, in _run_epoch
    harn.backpropogate(bx, batch, loss)
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 2504, in backpropogate
    loss.backward()
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\_tensor.py", line 630, in backward
    torch.autograd.backward(
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\autograd\__init__.py", line 364, in backward
    _engine_run_backward(
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\autograd\graph.py", line 865, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR_HOST_ALLOCATION_FAILED
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Program Files\VIAME\lib\python3.10\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\VIAME\lib\python3.10\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\detect_fit.py", line 1405, in <module>
    fit()
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\detect_fit.py", line 1324, in fit
    return harn.run()
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 1674, in run
    harn.save_snapshot(explicit=True)
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 1226, in save_snapshot
    torch.save(snapshot_state, save_file)
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\serialization.py", line 977, in save
    _save(
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\serialization.py", line 1284, in _save
    storage = storage.cpu()
  File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\storage.py", line 264, in cpu
    return torch.UntypedStorage(self.size()).copy_(self, False)
torch.AcceleratorError: CUDA error: out of memory
Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
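As a side note, the last line of the error suggests re-running with CUDA_LAUNCH_BLOCKING=1 so the failing call is reported at the right place in the stack trace. If I retry, my plan is to set that before anything touches CUDA, e.g. in a small wrapper that then kicks off training:

```python
# Make CUDA calls synchronous so the traceback points at the real failing op.
# Must be set before the process initializes CUDA; it slows training, so it's
# for debugging only. Setting it here, at the top of a launcher script, is my
# own workaround, not an official VIAME option.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # prints: 1
```

Happy to post the more precise traceback from such a run if that would help diagnose whether this is a known issue.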