CUDA error: device-side assert triggered while fine-tuning on my dataset

Hello everyone,

I’m currently using the Whisper-large-v3 model on an Nvidia A6000 GPU with approximately 47 GB of memory. I successfully fine-tuned this model on the ‘fa’ (Persian) portion of the Mozilla Common Voice 17.0 dataset without any issues. For this process, I followed the Fine-Tune Whisper blog post on Hugging Face and used the corresponding Colab notebook.

Now, I have my own dataset, which contains around 250 hours of data. When I attempted to fine-tune the Whisper model on my dataset using the same approach as with the Mozilla data, I encountered the following error multiple times:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1167,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1167,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.

The final error message I received was:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I tried setting the following environment variables, but I still got the same RuntimeError: CUDA error: device-side assert triggered:

import os

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['TORCH_USE_CUDA_DSA'] = '1'
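
A note on these two settings, as a hedged sketch rather than a confirmed fix. CUDA_LAUNCH_BLOCKING only makes the stack trace point at the failing call; it does not remove the assert. It is also read when CUDA is initialized, so the safest place to set it is before torch is imported. TORCH_USE_CUDA_DSA appears to be a build-time option for PyTorch itself (the message says "Compile with"), so setting it at runtime probably has no effect. The helper name below is hypothetical and uses only the standard library.

```python
import os
import sys


def enable_sync_cuda_errors() -> bool:
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns True if torch had not been imported yet, which is the case
    where the setting is most likely to take effect.
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_imported


# Call this at the very top of the script or notebook, before `import torch`.
set_before_torch = enable_sync_cuda_errors()
if not set_before_torch:
    print("torch was already imported; restart the kernel and set the variable first")
```

In a notebook this usually means restarting the kernel, because once a device-side assert has fired the CUDA context stays in an error state for the rest of the process.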

Interestingly, when I changed the amount of data used for training, the step at which the error occurred also changed. For example, using the entire dataset caused the error on the 12th step, but limiting the dataset size pushed the error to the 80th step. I monitored GPU memory during training and found that I still had about 7 GB free.

I suspect the issue might be related to the labels. When I replaced my dataset’s original labels with random labels from the Mozilla dataset, the error disappeared. I used Whisper’s tokenizer to create the labels.
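
One way to test this suspicion is to scan the tokenized labels before training. An "index out of bounds" assert in an indexing kernel is consistent with an embedding lookup receiving an index that is too large. That could be a token id outside the vocabulary, or a label sequence longer than the decoder's position table. The sketch below is pure Python and assumes each example's labels are a list of ints, with -100 used as the padding value that the loss ignores. The function name and the toy numbers are hypothetical.

```python
from typing import Dict, List, Sequence

IGNORE_INDEX = -100  # padding value that the loss function skips


def find_bad_labels(
    label_batches: Sequence[Sequence[int]],
    vocab_size: int,
    max_target_positions: int,
) -> Dict[int, List[str]]:
    """Return {example index: reasons} for labels that could index out of bounds."""
    problems: Dict[int, List[str]] = {}
    for i, labels in enumerate(label_batches):
        # Drop padding so only real token ids are checked.
        real = [t for t in labels if t != IGNORE_INDEX]
        reasons: List[str] = []
        if len(real) > max_target_positions:
            reasons.append(f"length {len(real)} > {max_target_positions}")
        bad_ids = [t for t in real if t < 0 or t >= vocab_size]
        if bad_ids:
            reasons.append(f"token ids out of range: {bad_ids[:5]}")
        if reasons:
            problems[i] = reasons
    return problems


# Toy data: a vocabulary of 10 ids and at most 4 decoder positions.
examples = [[1, 2, 3], [1, 2, 3, 4, 5], [1, 12, -100], [0, 9, -100, -100]]
bad = find_bad_labels(examples, vocab_size=10, max_target_positions=4)
print(bad)
```

On the real dataset, the limits would come from the model config, for example model.config.vocab_size and model.config.max_target_positions (448 for the released Whisper checkpoints, as far as I know). Long transcripts are a plausible culprit here: the Common Voice clips are short, while a custom 250-hour dataset may contain utterances whose transcripts exceed that limit. That would also fit the error disappearing when the labels are swapped for Common Voice ones.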

I would greatly appreciate any insights or suggestions!

Thank you!
