Lots of folks have been frustrated by time-consuming fine-tuning failures and by the confusing error messages the API returns afterward, so I hope this helps at least a few people.
Today I wrote a Ruby method that validates JSONL and, optionally, also validates it against the OpenAI API fine-tuning requirements summarized in the reference below.
Everyone is welcome to test and modify this method, translate it to your favorite programming language, or post back with suggested improvements. It seems that most people here are not Ruby programmers, so I did not post it to GitHub; but if many people find it useful, we can do the GitHub thing with PRs, etc.
This code defines a class method validate_jsonl which takes four parameters: fine_tune_data, validate_api, prompt_separator, and completion_stop. I used the following defaults:
validate_api=true
prompt_separator="PROMPT_SEPARATOR"
completion_stop="STOPSTOP"
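To make those defaults concrete, here is a minimal sketch of building one training line that ends the prompt with the separator and wraps the completion with a leading space and the stop sequence. The helper name build_fine_tune_line is hypothetical and not part of the method in this post; the leading space and the single space before the separator and stop sequence are assumptions chosen to match what the validator's pattern accepts.

```ruby
require "json"

# Hypothetical helper (an illustration, not part of the validator in this post).
# Builds one JSONL line whose prompt ends with the separator and whose
# completion starts with a space and ends with the stop sequence.
def build_fine_tune_line(prompt, completion,
                         prompt_separator = "PROMPT_SEPARATOR",
                         completion_stop = "STOPSTOP")
  JSON.generate({
    "prompt" => "#{prompt} #{prompt_separator}",
    "completion" => " #{completion} #{completion_stop}"
  })
end

line = build_fine_tune_line("What is the capital of France?", "Paris")
puts line
```

Generating lines through JSON.generate also takes care of escaping quotes and newlines inside the prompt or completion text.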
Within the method, it first checks whether fine_tune_data is present; if not, it returns false. It then initializes an empty array output, the count and validated_line variables, and a regular expression object regex_to_validate with a default value of Regexp.new(//).
After this, the method builds a different regex pattern depending on the value of validate_api.
Then it splits fine_tune_data by newlines, iterates over each line, validates each line using the match? method, stores the result in a hash, and pushes it to the output array. Finally, it returns the output array, or false if fine_tune_data is not present.
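As a rough illustration of how the two patterns differ, the sketch below builds both with the default separator and stop values and runs them against three made-up sample lines (the samples are my own illustrations, not taken from a real training file). Only the last sample ends the prompt with the separator, starts the completion with whitespace, and ends it with the stop sequence.

```ruby
# Sketch of the two patterns, built with the default separator and stop values.
prompt_separator = "PROMPT_SEPARATOR"
completion_stop  = "STOPSTOP"

# Stricter pattern used when validate_api is true.
api_regex = /^\{"prompt":\s*"([^"]+)\s*#{Regexp.escape(prompt_separator)}",\s*"completion":\s*"\s([^"]+)\s*#{Regexp.escape(completion_stop)}"\s*\}$/
# Looser pattern used when validate_api is false.
plain_regex = /^\{"prompt":\s*"([^"]+)",\s*"completion":\s*"([^"]+)"\s*\}$/

# Illustrative sample lines (assumptions for this example).
samples = [
  '{"prompt": "Hello", "completion": "World"}',
  '{"prompt": "Hello PROMPT_SEPARATOR", "completion": "World STOPSTOP"}',
  '{"prompt": "Hello PROMPT_SEPARATOR", "completion": " World STOPSTOP"}'
]

results = samples.each_with_index.map do |line, index|
  { count: index + 1, api: api_regex.match?(line), plain: plain_regex.match?(line) }
end

results.each { |r| puts r }
```

All three samples pass the looser pattern, but only the third passes the stricter one, because the first lacks the separator and the second has no whitespace at the start of the completion.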
Method: validate_jsonl()
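    def validate_jsonl(fine_tune_data, validate_api = true, prompt_separator = "PROMPT_SEPARATOR", completion_stop = "STOPSTOP")
      # Return false for nil, empty, or whitespace-only input (no Rails dependency).
      return false if fine_tune_data.nil? || fine_tune_data.strip.empty?

      output = []
      count = 0
      validated_line = false
      regex_to_validate = Regexp.new(//)

      # Escape the separator and stop sequence so characters such as . or * match literally.
      separator = Regexp.escape(prompt_separator)
      stop = Regexp.escape(completion_stop)

      if validate_api
        # Prompt must end with the separator; completion must start with
        # whitespace and end with the stop sequence.
        regex_to_validate = /^\{"prompt":\s*"([^"]+)\s*#{separator}",\s*"completion":\s*"\s([^"]+)\s*#{stop}"\s*\}$/
      else
        # Generic check: a prompt and a completion, both non-empty strings.
        regex_to_validate = /^\{"prompt":\s*"([^"]+)",\s*"completion":\s*"([^"]+)"\s*\}$/
      end

      # Split on LF or CRLF so both Unix and Windows line endings work.
      fine_tune_data.split(/\r?\n/).each do |line|
        count += 1
        validated_line = regex_to_validate.match?(line)
        output << {
          :count => count,
          :valid => validated_line,
          :line => line
        }
      end

      output
    end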
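One limitation of the regex approach is that [^"]+ cannot cross an escaped quote, so a line whose prompt contains \" is rejected even though it is valid JSON. As a possible complement, here is a minimal sketch that parses each line with Ruby's standard JSON library instead. The method name validate_jsonl_parsed is hypothetical, and the checks (exactly the keys prompt and completion, prompt ending with the separator, completion starting with a space and ending with the stop sequence) are assumptions that mirror what the regex above looks for, not an official specification.

```ruby
require "json"

# Hypothetical companion check (not the method from this post): parses each
# line as JSON instead of pattern matching, so escaped quotes are handled.
def validate_jsonl_parsed(fine_tune_data, prompt_separator = "PROMPT_SEPARATOR", completion_stop = "STOPSTOP")
  return false if fine_tune_data.nil? || fine_tune_data.strip.empty?

  fine_tune_data.split(/\r?\n/).each_with_index.map do |line, index|
    valid = begin
      record = JSON.parse(line)
      record.is_a?(Hash) &&
        record.keys.sort == %w[completion prompt] &&
        record["prompt"].is_a?(String) &&
        record["completion"].is_a?(String) &&
        record["prompt"].end_with?(prompt_separator) &&
        record["completion"].start_with?(" ") &&
        record["completion"].end_with?(completion_stop)
    rescue JSON::ParserError
      # Anything that is not parseable JSON is simply marked invalid.
      false
    end
    { count: index + 1, valid: valid, line: line }
  end
end

# Illustrative input (made up for this example).
data = [
  '{"prompt": "Say \"hi\" PROMPT_SEPARATOR", "completion": " hi STOPSTOP"}',
  '{"prompt": "Missing separator", "completion": " hi STOPSTOP"}',
  'not json at all'
].join("\n")

report = validate_jsonl_parsed(data)
report.each { |row| puts "#{row[:count]}: #{row[:valid]}" }
```

The return shape matches the regex version (an array of hashes with :count, :valid, and :line), so the two can be swapped or run side by side on the same file.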
Hope this helps.
Reference:
Tested and working in my WIP OpenAI Lab project:
- Only the third line is valid (using the OpenAI Fine-Tuning criteria)