CUDA error: device-side assert triggered while fine tuning on my dataset

Hello everyone,

I’m currently using the Whisper-large-v3 model on an NVIDIA A6000 GPU with about 47 GB of VRAM. I successfully fine-tuned this model on the ‘fa’ subset of the Mozilla Common Voice 17.0 dataset without any issues. For this process, I followed Hugging Face’s fine-tune-whisper blog post and used the corresponding Colab notebook.

Now, I have my own dataset, which contains around 250 hours of data. When I attempted to fine-tune the Whisper model on my dataset using the same approach as with the Mozilla data, I encountered the following error multiple times:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1167,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [1167,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.

The final error message I received was:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I tried setting the following environment variables, but I still encountered the same RuntimeError: CUDA error: device-side assert triggered:

import os

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # make CUDA launches synchronous so the stack trace points at the failing call
os.environ['TORCH_USE_CUDA_DSA'] = '1'
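
For completeness, a minimal sketch of how these would be set at the very top of a script (since, as far as I understand, CUDA_LAUNCH_BLOCKING must be exported before PyTorch creates its CUDA context, and TORCH_USE_CUDA_DSA is a build-time option rather than a runtime switch):

import os

# must run before `import torch`, otherwise the CUDA context may already exist
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # synchronous launches -> accurate stack traces
os.environ['TORCH_USE_CUDA_DSA'] = '1'    # only effective if PyTorch was built with device-side assertions

import torch  # imported only after the environment variables are set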

Interestingly, when I adjusted the amount of data used for training, the step at which the error occurred also changed. For example, using the entire dataset caused the error at step 12, while limiting the dataset size pushed the error to step 80. I monitored the remaining GPU memory during training and found that I still had about 7 GB free.

I suspect the issue might be related to the labels. When I replaced my dataset’s original labels with random labels from the Mozilla dataset, the error disappeared. I used Whisper’s tokenizer to create the labels.
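
For context, the labels were created roughly the way the fine-tune-whisper notebook does it; a minimal sketch (the "sentence" column name comes from Common Voice and is an assumption for my own dataset):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-large-v3", language="fa", task="transcribe"
)

def prepare_labels(batch):
    # encode the transcript text into token ids; these become the decoder labels
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch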

I would greatly appreciate any insights or suggestions!

Thank you!


Hello everyone,

I’ve resolved the issue! The problem was that the length of the labels exceeded Whisper’s max_target_positions configuration. For instance, the default max_target_positions for whisper-large-v3 is 448 tokens. You can either trim your labels or adjust the configuration.
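
For reference, a minimal sketch of filtering out the overlong examples, assuming the dataset is a Hugging Face datasets.Dataset with a "labels" column of token ids (the dataset variable name is a placeholder):

from transformers import WhisperConfig

# 448 for whisper-large-v3
max_label_length = WhisperConfig.from_pretrained("openai/whisper-large-v3").max_target_positions

def labels_in_range(labels):
    # keep only examples whose label sequence fits within the decoder's position embeddings
    return len(labels) <= max_label_length

dataset = dataset.filter(labels_in_range, input_columns=["labels"])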

Additionally, I submitted a pull request that aims to prevent such issues in future versions of transformers. For more information on similar issues, feel free to check out issue 1 and issue 2:

pull request: https://github.com/huggingface/transformers/pull/33196
issue 1: https://github.com/huggingface/datasets/issues/5391
issue 2: https://github.com/huggingface/transformers/issues/27445


@fakhimi.amirmohamad try below

# Initialize parameters and tokenizer
import os
from glob import glob

from tqdm import tqdm
from transformers import WhisperTokenizer

transcripts_folder = "path_to_transcripts"
max_target_length = 448   # whisper-large-v3's max_target_positions
show_target_length = 400  # warn when a transcript gets close to the limit
tokenizer = WhisperTokenizer.from_pretrained("model_name", language="en", task="transcribe")  # e.g. "openai/whisper-large-v3"

# Check every transcript and delete the faulty (overlong) ones
def check_and_delete_transcripts():
    files = glob(os.path.join(transcripts_folder, "*.txt"))  # get all .txt files

    for file in tqdm(files, desc="Checking transcripts"):
        with open(file, encoding="utf-8") as f:
            content = f.read()  # read the transcript content
        token_length = len(tokenizer(content).input_ids)  # length of the tokenized labels

        if token_length > max_target_length:
            print(f"Faulty file: {file}, Tokens: {token_length}")
            os.remove(file)  # delete if too long
        elif token_length > show_target_length:
            print(f"File exceeds show limit: {file}")
            print(content)  # optionally delete or inspect further

# Execute the function
check_and_delete_transcripts()

This will help you get rid of the faulty data before you fine-tune the model.