general exception crash / restarting from last good epoch

Hi all-

I was training a model using the desktop VIAME, and the code stopped executing with this general exception fault. I am wondering whether it is possible to restart the model from the last good epoch, rather than running the model from the beginning. Also, if anyone knows whether this exception is a common fault, and whether there is a fix, that would be good to know.

The computer is a Lenovo Legion with 32 GB RAM and an RTX 4070, running Windows 11.

INFO: general exception
harn.preferences = <FitHarnPreferences({'keyboard_debug': False, 'snapshot_after_error': True, 'deploy_after_error': True, 'show_prog': True, 'prog_backend': 'progiter', 'ignore_inf_loss_parts': False, 'log_gradients': False, 'use_tensorboard': True, 'eager_dump_tensorboard': True, 'dump_tensorboard': True, 'tensorboard_groups': ['loss'], 'export_modules': ['bioharn'], 'export_on_init': True, 'large_loss': 1000, 'num_keep': 10, 'keep_freq': 5, 'timeout': 1209600, 'auto_prepare_batch': False, 'verbose': 1, 'log_resources': True, 'use_tqdm': None, 'colored': True, 'allow_unicode': False}) at 0x21205e78070>
INFO: Attempting to checkpoint before crashing
INFO: Saving EXPLICIT snapshot to deep_training\fit\runs\viame-netharn-detector\hesvuwxi\explicit_checkpoints_epoch_00000036_2026-03-09T071653-5.pt
Traceback (most recent call last):
File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 1594, in run
harn._run_tagged_epochs(
File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 1822, in _run_tagged_epochs
harn._run_epoch(train_loader, tag='train', learn=True)
File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 2064, in _run_epoch
harn.backpropogate(bx, batch, loss)
File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 2504, in backpropogate
loss.backward()
File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\_tensor.py", line 630, in backward
torch.autograd.backward(
File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\autograd\__init__.py", line 364, in backward
_engine_run_backward(
File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\autograd\graph.py", line 865, in _engine_run_backward
return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR_HOST_ALLOCATION_FAILED

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Program Files\VIAME\lib\python3.10\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\VIAME\lib\python3.10\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\detect_fit.py", line 1405, in <module>
fit()
File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\detect_fit.py", line 1324, in fit
return harn.run()
File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 1674, in run
harn.save_snapshot(explicit=True)
File "C:\Program Files\VIAME\lib\python3.10\site-packages\viame\pytorch\netharn\fit_harn.py", line 1226, in save_snapshot
torch.save(snapshot_state, save_file)
File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\serialization.py", line 977, in save
_save(
File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\serialization.py", line 1284, in _save
storage = storage.cpu()
File "C:\Program Files\VIAME\lib\python3.10\site-packages\torch\storage.py", line 264, in cpu
return torch.UntypedStorage(self.size()).copy_(self, False)
torch.AcceleratorError: CUDA error: out of memory
Search for `cudaErrorMemoryAllocation` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Which model were you training?

Your main error is "CUDA error: out of memory", which means your GPU is running out of VRAM. This is usually due to one of two things:

  1. You launched some other application that is also using VRAM, like a video call, Firefox playing video, a CAD program, etc.

  2. Your training config file / image size is simply using too much VRAM. While we try to set these values automatically, it doesn't always work. To resolve this, lower the 'batch_size' parameter in the config system of whatever model you're using; where it is specified differs from model to model.
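To illustrate point 2, here is a hedged, stdlib-only sketch (not the VIAME/netharn API; `train_one_epoch` is a hypothetical stand-in for a real training step) of the usual manual workflow: when training raises an out-of-memory error, halve the batch size and retry until it fits.

```python
def train_one_epoch(batch_size):
    """Hypothetical stand-in for a real training step. It raises the same
    RuntimeError message PyTorch produces when VRAM is exhausted; here we
    pretend anything above batch_size=8 exceeds the card's memory."""
    if batch_size > 8:
        raise RuntimeError("CUDA error: out of memory")
    return f"trained with batch_size={batch_size}"

def train_with_backoff(batch_size, min_batch_size=1):
    """Halve the batch size on OOM until training fits or we hit the floor."""
    while batch_size >= min_batch_size:
        try:
            return train_one_epoch(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated error: re-raise it
            batch_size //= 2  # back off and retry with a smaller batch
    raise RuntimeError("could not fit even the minimum batch size in VRAM")

result = train_with_backoff(32)  # backs off 32 -> 16 -> 8, then succeeds
print(result)
```

In practice you would not automate this inside VIAME; you would just edit 'batch_size' in the model's config and relaunch, but the halving heuristic above is a reasonable way to find a value that fits.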

As for your other question about restarting where you left off: if you are using the default netharn/cfrnn model, a short restart script can accomplish it (if not, the answer is still yes, but you have to go into the config system for the respective detector):
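The general resume-from-checkpoint pattern looks like the following sketch. The names here are illustrative, not the actual netharn script: a real netharn snapshot is a `.pt` file written with `torch.save` (as seen in the traceback above), and you would restore it with `torch.load` plus the model's and optimizer's `load_state_dict`; `pickle` stands in so the example is self-contained.

```python
import os
import pickle
import tempfile

def save_snapshot(path, epoch, model_state):
    # Mirrors torch.save(snapshot_state, save_file) from fit_harn.py,
    # using pickle so this sketch runs without torch installed.
    with open(path, "wb") as f:
        pickle.dump({"epoch": epoch, "model_state": model_state}, f)

def resume(path):
    # Mirrors torch.load(...): read the snapshot back and report the
    # epoch at which training should continue (one past the saved one).
    with open(path, "rb") as f:
        snap = pickle.load(f)
    return snap["epoch"] + 1, snap["model_state"]

# Simulate the explicit checkpoint the crash handler wrote at epoch 36.
ckpt = os.path.join(tempfile.mkdtemp(), "epoch_00000036.pt")
save_snapshot(ckpt, epoch=36, model_state={"w": [0.1, 0.2]})

start_epoch, state = resume(ckpt)
print(start_epoch)  # training resumes at epoch 37, not epoch 0
```

In your case the crash handler did manage to write an explicit snapshot (the `explicit_checkpoints_epoch_00000036...pt` file in your log), so that is the file a resume would start from.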

Thanks Matt! This will help me out a lot. I am certain it crapped out when I opened a MATLAB instance and started overworking my laptop. Lesson learned!
