Hello,
I am consistently running into a backend infrastructure error when trying to run supervised vision fine-tuning jobs. I need help to identify the root cause, as the user-facing logs do not provide any actionable details.
The Problem
When the job attempts to preprocess my training file, it runs for about 60–80 minutes and then completely fails with an HTTP 500 error.
- Small training files: Preprocess successfully.
- Validation files: Preprocess successfully ).
Large training files: Consistently fail during preprocessing.
The API error tells me to check the Logs tab, but the UI Logs/Events tab only says: “File Preprocessing failed for file training file,” with no actual error trace.
My Ask
Could someone from the backend team look up the Job ID below and check the internal logs to tell me why this large file is failing to preprocess and how to overcome it?
Failing Job Information
Job ID: ftjob-e0c116095e0a4da5a9a8ec5ccd80c4fd
Training File ID: file-323036aa3f0f4c019c00d5498416abbc
Validation File ID: file-f8a5247df8604db28ede600d41871fd9
Hyperparameters: n_epochs=3, batch_size=8, learning_rate_multiplier=2.0, seed=42, suffix: sampled-62
- Timeline: Preprocessing started at epoch
1780648527and failed at1780651980(~57 minutes).
Exact API Error Response:
status: failed
error: code=‘500’,
message=‘File preprocessing failed. For more details, please check the Logs tab.’,
param=‘Infrastructure Issues’