Method to Validate JSONL and JSONL+ OpenAI API Fine-Tuning Requirements

Many people have been frustrated by time-consuming fine-tuning failures and by the confusing error messages the API returns afterward, so I hope this helps at least a few of you.

Today I wrote this Ruby method, which validates prompt/completion JSONL and can optionally also check each line against the OpenAI API fine-tuning requirements summarized in the reference below.

Everyone is welcome to test and modify this method, translate the method to your favorite programming language, or post back with suggested improvements. It seems that most people here are not Ruby programmers so I did not post it to GitHub; but if many people find it useful we can do the GitHub thing with PRs, etc.

This code defines a method, validate_jsonl, which takes four parameters: fine_tune_data, validate_api, prompt_separator, and completion_stop. I used the following defaults:

  • validate_api=true
  • prompt_separator="PROMPT_SEPARATOR"
  • completion_stop="STOPSTOP"
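To make the defaults concrete, here is a minimal sketch of how one training line could be built so that it satisfies the API-mode check: the prompt ends with the separator, and the completion starts with a space and ends with the stop sequence. The helper name build_fine_tune_line and the sample text are hypothetical and not part of the method in this post.

```ruby
require "json"

# Hypothetical helper (not from the original post): formats one training
# example with the separator appended to the prompt, and a leading space
# plus the stop sequence wrapped around the completion.
def build_fine_tune_line(prompt, completion, prompt_separator = "PROMPT_SEPARATOR", completion_stop = "STOPSTOP")
  JSON.generate({
    "prompt" => "#{prompt}#{prompt_separator}",
    "completion" => " #{completion}#{completion_stop}"
  })
end

# Prints a single JSONL line built from made-up sample text.
puts build_fine_tune_line("What is Ruby?", "A programming language.")
```

Joining several such lines with newlines gives a JSONL string of the kind the validator expects as fine_tune_data.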

Within the method, it first checks whether fine_tune_data is present; if not, it returns false. It then initializes an empty array output, a counter count, and a validated_line flag, and sets regex_to_validate to an empty default pattern, Regexp.new(//).

Next, the method builds one of two regex patterns, depending on the value of validate_api.

It then splits fine_tune_data on newlines and, for each line, tests the line with match?, stores the line number, the result, and the line itself in a hash, and pushes that hash onto the output array. Finally, it returns the output array (or false, as noted above, if fine_tune_data is not present).
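One detail worth illustrating about the pattern-building step: if the separator or stop sequence contains characters that are special in a regex, it needs to be escaped before being interpolated into the pattern. The sketch below assumes a hypothetical helper, api_line_pattern, and made-up separator and stop values; it is an illustration of the idea, not the method from this post.

```ruby
# Hypothetical helper (not from the original post): builds the API-mode
# pattern with the separator and stop sequence escaped via Regexp.escape,
# so they are matched literally.
def api_line_pattern(prompt_separator, completion_stop)
  sep  = Regexp.escape(prompt_separator)
  stop = Regexp.escape(completion_stop)
  /^\{"prompt":\s*"([^"]+)\s*#{sep}",\s*"completion":\s*"\s([^"]+)\s*#{stop}"\s*\}$/
end

# Single quotes keep \n as two characters (backslash + n), which is how a
# newline escape appears in the raw JSONL text.
pattern = api_line_pattern('\n\n###\n\n', ' END')
line = '{"prompt":"Hi\n\n###\n\n","completion":" Hello END"}'
puts pattern.match?(line)
```

Without the escaping, the interpolated \n would be read as a regex newline and would not match the backslash-n text in the line.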

Method: validate_jsonl()

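def validate_jsonl(fine_tune_data, validate_api = true, prompt_separator = "PROMPT_SEPARATOR", completion_stop = "STOPSTOP")
  # present? comes from ActiveSupport (Rails); in plain Ruby, check nil?/empty? instead.
  return false unless fine_tune_data.present?

  output = []
  count = 0
  validated_line = false
  regex_to_validate = Regexp.new(//)

  # Escape the separator and stop sequence so any regex metacharacters match literally.
  separator = Regexp.escape(prompt_separator)
  stop = Regexp.escape(completion_stop)

  if validate_api
    # Prompt must end with the separator; completion must start with
    # whitespace and end with the stop sequence.
    regex_to_validate = /^\{"prompt":\s*"([^"]+)\s*#{separator}",\s*"completion":\s*"\s([^"]+)\s*#{stop}"\s*\}$/
  else
    regex_to_validate = /^\{"prompt":\s*"([^"]+)",\s*"completion":\s*"([^"]+)"\s*\}$/
  end

  # Accept both Windows (\r\n) and Unix (\n) line endings.
  fine_tune_data.split(/\r?\n/).each do |line|
    count += 1
    validated_line = regex_to_validate.match?(line)
    output << { :count => count, :valid => validated_line, :line => line }
  end

  output
end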
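One limitation of the regex approach is that [^"]+ rejects any prompt or completion containing an escaped quote (\"), even though such a line is valid JSON. As a rough alternative, the sketch below parses each line with Ruby's standard json library and then checks the same conventions. The name validate_jsonl_parsed is hypothetical, and the checks (separator at the end of the prompt, leading space and stop sequence on the completion) are my reading of the regex above rather than an official specification.

```ruby
require "json"

# Hypothetical variant (not from the original post): parses each line as
# JSON instead of matching it with a regex, then applies the same checks.
def validate_jsonl_parsed(fine_tune_data, prompt_separator = "PROMPT_SEPARATOR", completion_stop = "STOPSTOP")
  fine_tune_data.to_s.split(/\r?\n/).each_with_index.map do |line, index|
    valid =
      begin
        record = JSON.parse(line)
        record.is_a?(Hash) &&
          record["prompt"].is_a?(String) &&
          record["completion"].is_a?(String) &&
          record["prompt"].end_with?(prompt_separator) &&
          record["completion"].start_with?(" ") &&
          record["completion"].end_with?(completion_stop)
      rescue JSON::ParserError
        false # the line is not valid JSON at all
      end
    { count: index + 1, valid: valid, line: line }
  end
end

# Made-up sample data: one well-formed line and one that is not JSON.
sample = [
  '{"prompt":"Hi PROMPT_SEPARATOR","completion":" Hello STOPSTOP"}',
  'not json'
].join("\n")
validate_jsonl_parsed(sample).each { |result| puts "#{result[:count]}: #{result[:valid]}" }
```

The return value has the same shape as the original method's output (count, valid, line), so the two can be swapped or compared line by line.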

Hope this helps.

Reference:

Tested and Working in my WIP OpenAI Lab project:

  • Only the third line is valid (using the OpenAI Fine-Tuning criteria)


Chatty translated the Ruby code above to Python, as follows. Seems plausible, but I did not test it.

Chatty (ChatGPT) Translation

  • Check Code Completely and Use with Caution - Not Human Validated
import re


def validate_jsonl(fine_tune_data, validate_api=True, prompt_separator="PROMPT_SEPARATOR", completion_stop="STOPSTOP"):
    if not fine_tune_data:
        return False
    output = []
    count = 0

    if validate_api:
        # str.format() would treat the pattern's literal braces as replacement
        # fields and raise, so the escaped separator and stop sequence are
        # concatenated into the pattern instead.
        regex_to_validate = re.compile(
            r'^\{"prompt":\s*"([^"]+)\s*' + re.escape(prompt_separator)
            + r'",\s*"completion":\s*"\s([^"]+)\s*' + re.escape(completion_stop)
            + r'"\s*\}$'
        )
    else:
        regex_to_validate = re.compile(r'^\{"prompt":\s*"([^"]+)",\s*"completion":\s*"([^"]+)"\s*\}$')

    # Accept both Windows (\r\n) and Unix (\n) line endings.
    for line in re.split(r'\r?\n', fine_tune_data):
        count += 1
        validated_line = bool(regex_to_validate.match(line))
        output.append({'count': count, 'valid': validated_line, 'line': line})

    return output

Chatty said:

Note: Python uses not instead of ! and True instead of true. Also, Ruby's each method is replaced by a for loop in Python: in this example, each is used to iterate over the lines in the fine_tune_data variable, and a for loop takes its place. And Python uses re.compile() instead of Regexp.new() to create a regular expression.

Hope this helps Python users.

Examples of Validated Fine-Tune Data, with Fine-Tuned ID and Status:

Example: New Fine-Tune

Example: Valid Results

Hope this helps.