Fine Tuning: Job_response: "message": "file has 4 example(s), but must have at least 10 examples",

Hello - new to fine-tuning GPT3.5-turbo and having an issue uploading my .JSONL dataset.

The full output of the command is:

$ ./fine_tuning_steps_updated.sh
Uploaded file. Received TRAINING_FILE_ID: file-872Htvk4v7WH9YoyigUe8fNj
Waiting for the file to be ready for fine-tuning...
JOB_RESPONSE: {
  "error": {
    "message": "file-872Htvk4v7WH9YoyigUe8fNj has 4 example(s), but must have at least 10 examples",
    "type": "invalid_request_error",
    "param": null,
    "code": "invalid_n_examples"
  }
}

The data set is a set of 100 email threads, with inquiries and replies.

I am confused about the ā€˜Has 4 examples, but must have at least 10 examplesā€™ and I can find no reference to this anywhere in the forums or general web searches.

I have manually confirmed the structure of the dataset is ideal, Iā€™ve counted 100 examples in the data set, and still Iā€™m getting the same error with no idea what to do next.

Any guidance would be appreciated, Iā€™ve consulted the API documentation on fine tuning time and time again, but I just canā€™t make sense of the issue.

Edit:

Wanted to add my full fine-tuning script.sh

#!/bin/bash

# Define variables
API_KEY="*****************OMITTED***********"
TRAINING_FILE_PATH="converted_data.jsonl"

# Step 1: Upload the file to OpenAI
UPLOAD_RESPONSE=$(curl -v https://api.openai.com/v1/files \
     -H "Authorization: Bearer $API_KEY" \
     -F "purpose=fine-tune" \
     -F "file=@$TRAINING_FILE_PATH")

TRAINING_FILE_ID=$(echo $UPLOAD_RESPONSE | jq -r '.id')

echo "Uploaded file. Received TRAINING_FILE_ID: $TRAINING_FILE_ID"
echo "Full upload response: $UPLOAD_RESPONSE"

# Step 2: Create a Fine-Tuning Job
JOB_RESPONSE=$(curl -s https://api.openai.com/v1/fine_tuning/jobs \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $API_KEY" \
     -d '{
       "training_file": "'$TRAINING_FILE_ID'",
       "model": "gpt-3.5-turbo-0613"
     }')

JOB_ID=$(echo $JOB_RESPONSE | jq -r '.id')

echo "Fine-tuning job initiated with JOB_ID: $JOB_ID"

# Step 3: Periodically check the status of the fine-tuning job
while true; do
    STATUS_RESPONSE=$(curl -s https://api.openai.com/v1/fine_tuning/jobs/$JOB_ID \
        -H "Authorization: Bearer $API_KEY")
    
    STATUS=$(echo $STATUS_RESPONSE | jq -r '.status')

    if [[ $STATUS == "succeeded" ]]; then
        MODEL_ID=$(echo $STATUS_RESPONSE | jq -r '.fine_tuned_model')
        echo "Fine-tuning succeeded! Fine-tuned Model ID: $MODEL_ID"
        break
    elif [[ $STATUS == "failed" ]]; then
        echo "Fine-tuning failed. Exiting."
        exit 1
    else
        echo "Job status: $STATUS. Waiting for 5 minutes before checking again..."
        sleep 300  # Wait for 5 minutes
    fi
done

Sounds like the json of your dataset is wrong, but without seeing it, canā€™t confirmā€¦ Iā€™d look there and confirm you have it set-up correctly, especially if itā€™s dynamically generated from other data.

1 Like

I had suspected it was my dataset/formatting, so I whipped up another smaller set of dummy data for testing.

Hereā€™s an example I put together quickly, and tested - with the same result:

{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Tell me a story."}, {"role": "assistant", "content": "One day a student went to schoool."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "It's Paaris."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Who won the world cup in 2018?"}, {"role": "assistant", "content": "Francee won the world cup in 2018."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "What is the largest planet?"}, {"role": "assistant", "content": "The largestt planet is Jupiter."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "What is photosynthesis?"}, {"role": "assistant", "content": "It's the proccess by which plants make food using sunlight."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Who is Albert Einstein?"}, {"role": "assistant", "content": "He's a famous physiccist known for the theory of relativity."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "What is the Pythagorean theorem?"}, {"role": "assistant", "content": "It's a fundametal relation in Euclidean geometry among thee three sides of a right triangle."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "How do birds fly?"}, {"role": "assistant", "content": "Birds flyy by flapping their wings and using air currents."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Who wrote Hamlet?"}, {"role": "assistant", "content": "Hamlet was writen by William Shakespearee."}]}
{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "What is the boiling point of water?"}, {"role": "assistant", "content": "Water boilss at 100 degrees Celsius at standard pressure."}]}

Console output:

> $ ./fine_tuning_steps_updated.sh
> Uploaded file. Received TRAINING_FILE_ID: file-9umTL2ISBY1Mue16rFB7SWsd
> Waiting for the file to be ready for fine-tuning...
> JOB_RESPONSE: {
>   "error": {
>     "message": "file-9umTL2ISBY1Mue16rFB7SWsd has 4 example(s), but must have at least 10 examples",
>     "type": "invalid_request_error",
>     "param": null,
>     "code": "invalid_n_examples"
>   }
> }
> Error: The response from the Fine-Tuning Job creation did not contain an 'id'. Exiting.

Hrm. Formatting looks okay to me. Only other thing I can think of off-hand is the modelā€¦

gpt-3.5-turbo-0613

ā€¦ will be deprecated soon. Might try with ā€¦

gpt-3.5-turbo

Weird thatā€™s itā€™s consistently showing 4 of 10 for youā€¦

Good luckā€¦

1 Like

I hadnā€™t thought to try changing the model, unfortunately that change didnā€™t work, but it is at least one more item checked off the possible causes list.

Appreciate you taking a look at the formatting & the suggestion nonetheless!

Iā€™m definitely not sure about this but After the first system message I donā€™t think you need to put it in again after that.

With fine-tuning I donā€™t think you need to add the System role at all. Maybe try it again without it?

Paul

2 Likes

Good call. I havenā€™t fine-tuned since orig. Davinci personally which was old completion not ChatML styleā€¦ Iā€™d give that a go too.

1 Like

All gpt-3.5-turbo examples shown for fine-tuning include a system prompt that you will use. This is likely needed so you donā€™t get even more reinforcement from default training and default style of prompts from the existing chat model fine-tuning.

2 Likes

Thank you @paul.redcell - I have tried with the following changes and the same result:

{"messages": [{"role": "system", "content": "You are an assistant that occasionally misspells words"}, {"role": "user", "content": "Tell me a story."}, {"role": "assistant", "content": "One day a student went to schoool."}]}
{"messages": [{"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "It's Paaris."}]}
{"messages": [{"role": "user", "content": "Who won the world cup in 2018?"}, {"role": "assistant", "content": "Francee won the world cup in 2018."}]}
{"messages": [{"role": "user", "content": "What is the largest planet?"}, {"role": "assistant", "content": "The largestt planet is Jupiter."}]}
{"messages": [{"role": "user", "content": "What is photosynthesis?"}, {"role": "assistant", "content": "It's the proccess by which plants make food using sunlight."}]}
{"messages": [{"role": "user", "content": "Who is Albert Einstein?"}, {"role": "assistant", "content": "He's a famous physiccist known for the theory of relativity."}]}
{"messages": [{"role": "user", "content": "What is the Pythagorean theorem?"}, {"role": "assistant", "content": "It's a fundametal relation in Euclidean geometry among thee three sides of a right triangle."}]}
{"messages": [{"role": "user", "content": "How do birds fly?"}, {"role": "assistant", "content": "Birds flyy by flapping their wings and using air currents."}]}
{"messages": [{"role": "user", "content": "Who wrote Hamlet?"}, {"role": "assistant", "content": "Hamlet was writen by William Shakespearee."}]}
{"messages": [{"role": "user", "content": "What is the boiling point of water?"}, {"role": "assistant", "content": "Water boilss at 100 degrees Celsius at standard pressure."}]}

Console Output

> $ ./fine_tuning_steps_updated.sh
> Uploaded file. Received TRAINING_FILE_ID: file-JfQi8Y8GuMGlWM5QAFQtSFo9
> Waiting for the file to be ready for fine-tuning...
> JOB_RESPONSE: {
>   "error": {
>     "message": "file-JfQi8Y8GuMGlWM5QAFQtSFo9 has 4 example(s), but must have at least 10 examples",
>     "type": "invalid_request_error",
>     "param": null,
>     "code": "invalid_n_examples"
>   }
> }
> Error: The response from the Fine-Tuning Job creation did not contain an 'id'. Exiting.

I took it a step further and attempted to run just the example that OpenAI has in the API announcement:

{
  "messages": [
    { "role": "system", "content": "You are an assistant that occasionally misspells words" },
    { "role": "user", "content": "Tell me a story." },
    { "role": "assistant", "content": "One day a student went to schoool." }
  ]
}

And still, the same error persists. I wonder if something is broken internally or if my training code is brokenā€¦ :thinking:

As a sanity check, can you double check the correct files are getting uploaded, you showed an example with 10 entries that the system said only contained 4, this sort of hints and the wrong things getting uploaded.

2 Likes

Yikes - Iā€™m embarrassed.

The script started in my downloads folder, and was eventually moved to a different folder specific to the project within the downloads, and I didnā€™t realize that the BASH prompt was loading the file from the directory it is installed (Downloads) - I guess it happens to everyone and I need to take a break from this project :rofl:

That did it for me, appreciate the sanity check!

4 Likes

Been there, done that, T-Shit and coffee mug merch from the shop of ā€œThe wrong fileā€ :smile: Glad youā€™re sorted!

2 Likes