Training file has 0 example(s), but must have at least 10 examples

theonereborn · April 24, 2025, 1:31pm

I’m trying to fine-tune gpt-4o-2024-08-06 for image recognition. I have two JSONL files: one with 334 training samples and one with 64 validation samples. Here is an example line from the JSONL:

{"messages": [{"role": "system", "content": "You are an advanced liveness detection model. Your task is to analyze images of people and evaluate liveness accurately.\n\nFollow these instructions carefully:\n\n# Task:\n- Analyze the input sequence of images referred to as PHS (Photo Sequence).\n- Determine if the person in the sequence is live or presented through other means (e.g., photo, phone, document).\n\n# Output Requirements:\n- **Strictly follow the JSON structure** shown below.\n- **Use ONLY snake case for JSON keys**.\n- **Do not include extra fields or nested keys** — match the expected schema exactly.\n- **All boolean flags must be set to true or false**.\n- **For gender_est, use \"M\" or \"F\". For age, set integer values for age_min and age_max**.\n\n# Output Format (Example):\n{\n  \"liveness\": false,\n  \"have_photos\": true,\n  \"have_phones\": false,\n  \"have_documents\": false,\n  \"have_many_people\": false,\n  \"diff_people\": false,\n  \"is_sleeping\": false,\n  \"no_people\": false,\n  \"gender_est\": \"M\",\n  \"age_min\": 25,\n  \"age_max\": 35\n}\n\n# Evaluation Rules:\n- **liveness**: Boolean. Set to true only if a live person is detected. Otherwise, set to false.\n- **have_photos**: Boolean. Set to true if the person in PHS is actually shown as a photograph. Otherwise, set to false.\n- **have_phones**: Boolean. Set to true if the person is shown on a phone screen. Otherwise, set to false.\n- **have_documents**: Boolean. Set to true if the person appears inside a document (e.g., passport, ID card). Otherwise, set to false.\n- **have_many_people**: Boolean. Set to true if multiple people are visible in PHS. Otherwise, set to false.\n- **diff_people**: Boolean. Set to true if different people are shown across PHS. Otherwise, set to false.\n- **is_sleeping**: Boolean. Set to true if the person is sleeping or has eyes closed. Otherwise, set to false.\n- **no_people**: Boolean. Set to true if no person is present in PHS. Otherwise, set to false.\n- **gender_est**: String. Estimate the gender of the person in PHS. Values: \"M\" or \"F\".\n- **age_min**: Integer. Provide an estimated minimum age for the person.\n- **age_max**: Integer. Provide an estimated maximum age for the person.\n\n# Additional Guidelines:\n- If any visual category applies (e.g., person in photo, document, phone), set the corresponding boolean to true and set liveness to false.\n- In ambiguous or low-quality cases, default to conservative estimates (e.g., low confidence = liveness false).\n- **Ensure consistent formatting across all predictions**."}, {"role": "user", "content": [{"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training/1-1.jpg"}}, {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training/1-2.jpg"}}, {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training/1-3.jpg"}}, {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training/1-selfie.jpg"}}]}, {"role": "assistant", "content": "{\"gender_est\": \"M\", \"age_min\": 25, \"age_max\": 30, \"have_phones\": false, \"have_photos\": false, \"have_documents\": false, \"have_many_people\": false, \"diff_people\": false, \"is_sleeping\": false, \"no_people\": false, \"liveness\": true}"}]}

As you can see, I’m passing image URLs from a publicly accessible Google Cloud Storage bucket. I checked every URL with a script, and all of them can be downloaded. I also validated both JSONL files with OpenAI’s script from the cookbook, and I removed all single-color images and empty files from the dataset (there were 2 such samples).
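
For readers who want to repeat the structural check described above, here is a minimal local sketch. It is not the cookbook script: the function name `check_jsonl_lines` and the `max_images` limit are assumptions made for illustration, and it only inspects the shape of each line, not whether the images themselves are acceptable to the fine-tuning service.

```python
import json


def check_jsonl_lines(lines, max_images=10):
    """Return (valid_count, problems) for an iterable of JSONL lines.

    `max_images` is an assumed per-example image limit; adjust it to
    whatever limit the current fine-tuning documentation states.
    """
    valid = 0
    problems = []  # list of (line_number, description)
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            problems.append((lineno, "blank line"))
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        messages = entry.get("messages") if isinstance(entry, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((lineno, "missing or empty 'messages' list"))
            continue
        if not all(isinstance(m, dict) for m in messages):
            problems.append((lineno, "every message must be an object"))
            continue
        if messages[-1].get("role") != "assistant":
            problems.append((lineno, "last message is not from the assistant"))
            continue
        # Count image parts across all messages in this example.
        n_images = 0
        for message in messages:
            content = message.get("content")
            if isinstance(content, list):
                n_images += sum(
                    1
                    for part in content
                    if isinstance(part, dict) and part.get("type") == "image_url"
                )
        if n_images > max_images:
            problems.append((lineno, f"{n_images} images exceeds {max_images}"))
            continue
        valid += 1
    return valid, problems
```

Calling it with `open("liveness_training.jsonl", encoding="utf-8")` as `lines` should report 334 valid examples if the file is structurally sound.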

After all that, when I upload the JSONL files and start the fine-tuning job, I get this:

{
  "id": "ftjob-J0rmPc0cgZ3nCPiLsIoD9ICS",
  "created_at": 1745489044,
  "error": {
    "code": "invalid_n_examples",
    "message": "Training file has 0 example(s), but must have at least 10 examples",
    "param": "training_file"
  },
  "fine_tuned_model": null,
  "finished_at": null,
  "hyperparameters": {
    "batch_size": "auto",
    "learning_rate_multiplier": "auto",
    "n_epochs": "auto"
  },
  "model": "gpt-4o-2024-08-06",
  "object": "fine_tuning.job",
  "organization_id": "org-JdIQzCc8DcGZcktKdMRggoye",
  "result_files": [],
  "seed": 2014727420,
  "status": "failed",
  "trained_tokens": null,
  "training_file": "file-8vruMsHSqhVQKb6m2QbXLc",
  "validation_file": "file-LqirQKz9qDwsgykCfUGwJs",
  "estimated_finish": null,
  "integrations": [],
  "metadata": null,
  "method": {
    "dpo": null,
    "supervised": {
      "hyperparameters": {
        "batch_size": 1,
        "learning_rate_multiplier": 2.0,
        "n_epochs": 10
      }
    },
    "type": "supervised"
  },
  "user_provided_suffix": null
}
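
The job object above only carries the final error. A sketch like the following could pull the job's event log as well, which sometimes contains per-file validation messages. The helper name `describe_failed_job` is hypothetical; it assumes `client` is an `openai.OpenAI()` instance from the official Python SDK and relies on its `fine_tuning.jobs.retrieve` and `fine_tuning.jobs.list_events` methods. Whether the events explain this particular failure is not guaranteed.

```python
def describe_failed_job(client, job_id, limit=20):
    """Collect the status, error, and recent event messages of a job.

    `client` is assumed to be an `openai.OpenAI()` instance; only the
    first page of events (up to `limit`) is read.
    """
    job = client.fine_tuning.jobs.retrieve(job_id)
    events = client.fine_tuning.jobs.list_events(
        fine_tuning_job_id=job_id, limit=limit
    )
    lines = [f"status: {job.status}"]
    if job.error is not None:
        lines.append(f"error: {job.error.code}: {job.error.message}")
    for event in events.data:
        lines.append(f"[{event.level}] {event.message}")
    return lines


# Example usage (requires the `openai` package and an API key):
# from openai import OpenAI
# for line in describe_failed_job(OpenAI(), "ftjob-J0rmPc0cgZ3nCPiLsIoD9ICS"):
#     print(line)
```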

I have already read the other topics about this error and checked the causes described there. None of them apply to my case.

My JSONL creation script, for reference:

import json
from pathlib import Path
from collections import defaultdict
from prompts import LIVENESS


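def load_urls_grouped_by_base_id(urls_path):
    """Read one URL per line and group the URLs by the ID before the first hyphen."""
    grouped = defaultdict(list)
    with open(urls_path, "r", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            # e.g. ".../1-selfie.jpg" -> filename "1-selfie.jpg" -> base_id "1"
            filename = Path(url).name
            base_id = filename.split("-")[0]
            grouped[base_id].append(url)
    return grouped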


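def create_jsonl(json_dir, urls_path, output_path):
    """Write one chat-format example per label file found in json_dir."""
    grouped_urls = load_urls_grouped_by_base_id(urls_path)
    # Each <base_id>.json holds the expected assistant output for that sample.
    json_files = sorted(json_dir.glob("*.json"))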

    with open(output_path, "w", encoding="utf-8") as out_f:
        for json_file in json_files:
            base_id = json_file.stem
            try:
                with open(json_file, "r", encoding="utf-8") as jf:
                    json_data = json.load(jf)

                if base_id not in grouped_urls:
                    print(f"[WARNING] No URLs found for {base_id}, skipping...")
                    continue

                user_message = {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": url}}
                        for url in sorted(grouped_urls[base_id])
                    ]
                }

                entry = {
                    "messages": [
                        {
                            "role": "system",
                            "content": LIVENESS.strip()
                        },
                        user_message,
                        {
                            "role": "assistant",
                            "content": json.dumps(json_data, ensure_ascii=False)
                        },
                    ]
                }

                out_f.write(json.dumps(entry, ensure_ascii=False) + "\n")

            except Exception as e:
                print(f"[ERROR] Failed to process {json_file.name}: {e}")


# === CONFIG ===
liveness_training_json_dir = Path("liveness_training_json")
liveness_validation_json_dir = Path("liveness_validation_json")

create_jsonl(
    json_dir=liveness_training_json_dir,
    urls_path=liveness_training_json_dir / "liveness_training_urls.txt",
    output_path=liveness_training_json_dir / "liveness_training.jsonl"
)
print("✅ Training JSONL created")

create_jsonl(
    json_dir=liveness_validation_json_dir,
    urls_path=liveness_validation_json_dir / "liveness_validation_urls.txt",
    output_path=liveness_validation_json_dir / "liveness_validation.jsonl"
)
print("✅ Validation JSONL created")
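
To make the grouping step in the script easier to follow, here is a small standalone sketch that applies the same `Path(url).name` and `split("-")[0]` logic to the URLs from the example line. The helper `group_by_base_id` and the extra `2-1.jpg` URL are made up for the illustration.

```python
from collections import defaultdict
from pathlib import Path


def group_by_base_id(urls):
    """Group URLs by the part of the file name before the first hyphen."""
    grouped = defaultdict(list)
    for url in urls:
        base_id = Path(url).name.split("-")[0]
        grouped[base_id].append(url)
    return dict(grouped)


base = "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training"
urls = [
    f"{base}/1-selfie.jpg",
    f"{base}/1-2.jpg",
    f"{base}/1-1.jpg",
    f"{base}/2-1.jpg",  # hypothetical second sample
]

grouped = group_by_base_id(urls)
print(sorted(grouped))
# ['1', '2']
print([Path(u).name for u in sorted(grouped["1"])])
# ['1-1.jpg', '1-2.jpg', '1-selfie.jpg']
```

Two properties of this logic are worth knowing: a file name without a hyphen (say `12.jpg`) is keyed by the whole name including the extension, so it would not match a label file with stem `12`, and `sorted` is lexicographic, so a name like `1-10.jpg` would come before `1-2.jpg`.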

What should I do here?
