I'm trying to fine-tune gpt-4o-2024-08-06 for image recognition. I have two JSONL files, one with 334 training samples and one with 64 validation samples. Here is an example line from the JSONL:
{"messages": [{"role": "system", "content": "You are an advanced liveness detection model. Your task is to analyze images of people and evaluate liveness accurately.\n\nFollow these instructions carefully:\n\n# Task:\n- Analyze the input sequence of images referred to as PHS (Photo Sequence).\n- Determine if the person in the sequence is live or presented through other means (e.g., photo, phone, document).\n\n# Output Requirements:\n- **Strictly follow the JSON structure** shown below.\n- **Use ONLY snake case for JSON keys**.\n- **Do not include extra fields or nested keys** — match the expected schema exactly.\n- **All boolean flags must be set to true or false**.\n- **For gender_est, use \"M\" or \"F\". For age, set integer values for age_min and age_max**.\n\n# Output Format (Example):\n{\n \"liveness\": false,\n \"have_photos\": true,\n \"have_phones\": false,\n \"have_documents\": false,\n \"have_many_people\": false,\n \"diff_people\": false,\n \"is_sleeping\": false,\n \"no_people\": false,\n \"gender_est\": \"M\",\n \"age_min\": 25,\n \"age_max\": 35\n}\n\n# Evaluation Rules:\n- **liveness**: Boolean. Set to true only if a live person is detected. Otherwise, set to false.\n- **have_photos**: Boolean. Set to true if the person in PHS is actually shown as a photograph. Otherwise, set to false.\n- **have_phones**: Boolean. Set to true if the person is shown on a phone screen. Otherwise, set to false.\n- **have_documents**: Boolean. Set to true if the person appears inside a document (e.g., passport, ID card). Otherwise, set to false.\n- **have_many_people**: Boolean. Set to true if multiple people are visible in PHS. Otherwise, set to false.\n- **diff_people**: Boolean. Set to true if different people are shown across PHS. Otherwise, set to false.\n- **is_sleeping**: Boolean. Set to true if the person is sleeping or has eyes closed. Otherwise, set to false.\n- **no_people**: Boolean. Set to true if no person is present in PHS. Otherwise, set to false.\n- **gender_est**: String. Estimate the gender of the person in PHS. Values: \"M\" or \"F\".\n- **age_min**: Integer. Provide an estimated minimum age for the person.\n- **age_max**: Integer. Provide an estimated maximum age for the person.\n\n# Additional Guidelines:\n- If any visual category applies (e.g., person in photo, document, phone), set the corresponding boolean to true and set liveness to false.\n- In ambiguous or low-quality cases, default to conservative estimates (e.g., low confidence = liveness false).\n- **Ensure consistent formatting across all predictions**."}, {"role": "user", "content": [{"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training/1-1.jpg"}}, {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training/1-2.jpg"}}, {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training/1-3.jpg"}}, {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/xf-ft-mistral/liveness/test/v1/liveness_training/1-selfie.jpg"}}]}, {"role": "assistant", "content": "{\"gender_est\": \"M\", \"age_min\": 25, \"age_max\": 30, \"have_phones\": false, \"have_photos\": false, \"have_documents\": false, \"have_many_people\": false, \"diff_people\": false, \"is_sleeping\": false, \"no_people\": false, \"liveness\": true}"}]}
As you can see, I’m passing image URLs from a publicly accessible Google Cloud Storage bucket. I checked every URL with a script, and all of them can be downloaded. I also validated both JSONL files with OpenAI’s script from the cookbook, and I removed all single-color images and empty files from the dataset (there were 2 such samples).
After all that, when I upload the JSONL files and start the fine-tuning job, it fails with this:
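A URL check of this kind can be sketched as follows. This is a minimal illustration using only the standard library, not necessarily the exact script that was run. The function names are hypothetical, and the sketch assumes every image part uses the `image_url` shape shown in the example line above.

```python
import json
import urllib.request


def iter_image_urls(jsonl_line):
    """Yield every image URL referenced in one JSONL training example."""
    entry = json.loads(jsonl_line)
    for message in entry["messages"]:
        content = message["content"]
        # Text-only messages carry a plain string; multimodal ones carry a list of parts.
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "image_url":
                    yield part["image_url"]["url"]


def is_downloadable_image(url, timeout=10):
    """Return True if the URL answers HTTP 200 with an image/* content type."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return response.status == 200 and content_type.startswith("image/")
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers malformed URLs.
        return False


def find_bad_urls(jsonl_path):
    """Return (line_number, url) pairs for images that could not be fetched."""
    bad = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue
            for url in iter_image_urls(line):
                if not is_downloadable_image(url):
                    bad.append((line_number, url))
    return bad
```

Calling `find_bad_urls("liveness_training.jsonl")` on a clean dataset should return an empty list. Note that this only confirms reachability and content type; it says nothing about whether the service accepts the image content itself.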
{
  "id": "ftjob-J0rmPc0cgZ3nCPiLsIoD9ICS",
  "created_at": 1745489044,
  "error": {
    "code": "invalid_n_examples",
    "message": "Training file has 0 example(s), but must have at least 10 examples",
    "param": "training_file"
  },
  "fine_tuned_model": null,
  "finished_at": null,
  "hyperparameters": {
    "batch_size": "auto",
    "learning_rate_multiplier": "auto",
    "n_epochs": "auto"
  },
  "model": "gpt-4o-2024-08-06",
  "object": "fine_tuning.job",
  "organization_id": "org-JdIQzCc8DcGZcktKdMRggoye",
  "result_files": [],
  "seed": 2014727420,
  "status": "failed",
  "trained_tokens": null,
  "training_file": "file-8vruMsHSqhVQKb6m2QbXLc",
  "validation_file": "file-LqirQKz9qDwsgykCfUGwJs",
  "estimated_finish": null,
  "integrations": [],
  "metadata": null,
  "method": {
    "dpo": null,
    "supervised": {
      "hyperparameters": {
        "batch_size": 1,
        "learning_rate_multiplier": 2.0,
        "n_epochs": 10
      }
    },
    "type": "supervised"
  },
  "user_provided_suffix": null
}
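The job reports 0 examples although the training file was built from 334 samples, so it helps to confirm locally how many lines actually parse as complete chat examples. The sketch below is one way to do that. The function name and the rules it applies (valid JSON, a non-empty `messages` list, at least one `user` and one `assistant` message) are assumptions about the minimum shape of an example, not OpenAI's actual validation logic.

```python
import json

# Assumed minimum: every example needs at least these two roles.
REQUIRED_ROLES = {"user", "assistant"}


def count_valid_examples(jsonl_path):
    """Count lines that parse as JSON and look like complete chat examples.

    Returns (valid_count, problems), where problems is a list of
    (line_number, reason) pairs for the lines that were rejected.
    """
    valid = 0
    problems = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            messages = entry.get("messages") if isinstance(entry, dict) else None
            if not isinstance(messages, list) or not messages:
                problems.append((line_number, "missing 'messages' list"))
                continue
            roles = {m.get("role") for m in messages if isinstance(m, dict)}
            if not REQUIRED_ROLES <= roles:
                problems.append((line_number, "needs user and assistant messages"))
                continue
            valid += 1
    return valid, problems
```

If this returns the full line count with no problems, the file is structurally fine on the client side, which suggests the examples are being dropped during server-side processing rather than by a formatting error.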
I have already read the other topics about this error, but the causes described there do not apply to my case; I checked each of them.
My JSONL creation script, for reference:
import json
from collections import defaultdict
from pathlib import Path

from prompts import LIVENESS  # local module holding the system prompt text


def load_urls_grouped_by_base_id(urls_path):
    """Group image URLs by sample ID, i.e. the file-name part before the first "-"."""
    grouped = defaultdict(list)
    with open(urls_path, "r", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            filename = Path(url).name
            base_id = filename.split("-")[0]  # "1-selfie.jpg" -> "1"
            grouped[base_id].append(url)
    return grouped


def create_jsonl(json_dir, urls_path, output_path):
    """Write one chat-format example per label file in json_dir."""
    grouped_urls = load_urls_grouped_by_base_id(urls_path)
    json_files = sorted(json_dir.glob("*.json"))

    with open(output_path, "w", encoding="utf-8") as out_f:
        for json_file in json_files:
            base_id = json_file.stem
            try:
                # The label file holds the expected assistant answer.
                with open(json_file, "r", encoding="utf-8") as jf:
                    json_data = json.load(jf)

                if base_id not in grouped_urls:
                    print(f"[WARNING] No URLs found for {base_id}, skipping...")
                    continue

                # One image part per URL belonging to this sample.
                user_message = {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": url}}
                        for url in sorted(grouped_urls[base_id])
                    ],
                }

                entry = {
                    "messages": [
                        {"role": "system", "content": LIVENESS.strip()},
                        user_message,
                        {
                            "role": "assistant",
                            "content": json.dumps(json_data, ensure_ascii=False),
                        },
                    ]
                }

                out_f.write(json.dumps(entry, ensure_ascii=False) + "\n")
            except Exception as e:
                print(f"[ERROR] Failed to process {json_file.name}: {e}")


# === CONFIG ===
liveness_training_json_dir = Path("liveness_training_json")
liveness_validation_json_dir = Path("liveness_validation_json")

create_jsonl(
    json_dir=liveness_training_json_dir,
    urls_path=liveness_training_json_dir / "liveness_training_urls.txt",
    output_path=liveness_training_json_dir / "liveness_training.jsonl",
)
print("✅ Training JSONL created")

create_jsonl(
    json_dir=liveness_validation_json_dir,
    urls_path=liveness_validation_json_dir / "liveness_validation_urls.txt",
    output_path=liveness_validation_json_dir / "liveness_validation.jsonl",
)
print("✅ Validation JSONL created")
What should I do here?