CUDA error: device-side assert triggered while fine tuning on my dataset

Hello everyone,

I’m currently using the Whisper-large-v3 model on an NVIDIA A6000 GPU with about 47 GB of VRAM. I successfully fine-tuned this model on the ‘fa’ subset of the Mozilla Common Voice 17.0 dataset without any issues. For this process, I followed Hugging Face’s fine-tune-whisper blog post and used the corresponding Colab notebook.

Now, I have my own dataset, which contains around 250 hours of data. When I attempted to fine-tune the Whisper model on my dataset using the same approach as with the Mozilla data, I encountered the following error multiple times:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1167,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1167,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.

The final error message I received was:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I tried setting the following environment variables, but I still encountered the same RuntimeError: CUDA error: device-side assert triggered:

import os

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # make CUDA launches synchronous so the stack trace points at the failing call
os.environ['TORCH_USE_CUDA_DSA'] = '1'
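
For completeness, a minimal sketch of how these would be set at the very top of a script (since, as far as I understand, CUDA_LAUNCH_BLOCKING must be exported before PyTorch creates its CUDA context, and TORCH_USE_CUDA_DSA is a build-time option rather than a runtime switch):

import os

# must run before `import torch`, otherwise the CUDA context may already exist
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # synchronous launches -> accurate stack traces
os.environ['TORCH_USE_CUDA_DSA'] = '1'    # only effective if PyTorch was built with device-side assertions

import torch  # imported only after the environment variables are set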

Interestingly, when I adjusted the amount of data used for training, the step at which the error occurred also changed. For example, using the entire dataset caused the error at step 12, while limiting the dataset size pushed the error to step 80. I monitored the remaining GPU memory during training and found that I still had about 7 GB free.

I suspect the issue might be related to the labels. When I replaced my dataset’s original labels with random labels from the Mozilla dataset, the error disappeared. I used Whisper’s tokenizer to create the labels.
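
For context, the labels were created roughly the way the fine-tune-whisper notebook does it; a minimal sketch (the "sentence" column name comes from Common Voice and is an assumption for my own dataset):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-large-v3", language="fa", task="transcribe"
)

def prepare_labels(batch):
    # encode the transcript text into token ids; these become the decoder labels
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch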

I would greatly appreciate any insights or suggestions!

Thank you!


Hello everyone,

I’ve resolved the issue! The problem was that the length of the labels exceeded Whisper’s max_target_positions configuration. For instance, the default max_target_positions for whisper-large-v3 is 448 tokens. You can either trim your labels or adjust the configuration.
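
For reference, a minimal sketch of filtering out the overlong examples, assuming the dataset is a Hugging Face datasets.Dataset with a "labels" column of token ids (the dataset variable name is a placeholder):

from transformers import WhisperConfig

# 448 for whisper-large-v3
max_label_length = WhisperConfig.from_pretrained("openai/whisper-large-v3").max_target_positions

def labels_in_range(labels):
    # keep only examples whose label sequence fits within the decoder's position embeddings
    return len(labels) <= max_label_length

dataset = dataset.filter(labels_in_range, input_columns=["labels"])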

Additionally, I submitted a pull request that aims to prevent such issues in future versions of transformers. For more information on similar issues, feel free to check out issue 1 and issue 2:

pull request: https://github.com/huggingface/transformers/pull/33196
issue 1: https://github.com/huggingface/datasets/issues/5391
issue 2: https://github.com/huggingface/transformers/issues/27445


@fakhimi.amirmohamad try below

# Initialize parameters and tokenizer
import os
from glob import glob

from tqdm import tqdm
from transformers import WhisperTokenizer

transcripts_folder = "path_to_transcripts"
max_target_length = 448   # whisper-large-v3's max_target_positions
show_target_length = 400  # warn when a transcript gets close to the limit
tokenizer = WhisperTokenizer.from_pretrained("model_name", language="en", task="transcribe")  # e.g. "openai/whisper-large-v3"

# Check every transcript and delete the faulty (overlong) ones
def check_and_delete_transcripts():
    files = glob(os.path.join(transcripts_folder, "*.txt"))  # get all .txt files

    for file in tqdm(files, desc="Checking transcripts"):
        with open(file, encoding="utf-8") as f:
            content = f.read()  # read the transcript content
        token_length = len(tokenizer(content).input_ids)  # length of the tokenized labels

        if token_length > max_target_length:
            print(f"Faulty file: {file}, Tokens: {token_length}")
            os.remove(file)  # delete if too long
        elif token_length > show_target_length:
            print(f"File exceeds show limit: {file}")
            print(content)  # optionally delete or inspect further

# Execute the function
check_and_delete_transcripts()

This will help you get rid of the faulty data before you fine-tune the model.