How to generate a fine-tuning "text to command" model

I tried the “text to command” example on the Playground page (OpenAI API):

Convert this text to a programmatic command:
Example: search active projects created in last 2 months
{"search_object":"active projects", "search_peroid":"in last 2 months", "search_peroid_column":"created_date" }
Example: find active projects
{"search_object":"active projects"}
Example: search projects whose name contains "test"
{"search_object":"projects", "search_column":"project_name", "search_column_value":"test", "search_column_operator":"contain" }  
Example: search projects whose name is "test", and project status is "New" or "On Hold" or "In Progress"
{"search_object":"projects", "search_column":"project_name ", "search_column_value":"test", "search_column_operator":"equal","search_column2":"project_status", "search_column2_value":["New","On Hold","In Progress"], "search_column2_operator":"in" }

It works fine on the Playground page. When I entered “search orders created in last 3 days, finished in next 5 days, spec state is “New” or “On Hold” or “In Progress”, print width is 20cm”, it gave the expected result:

{"search_object":"orders", "search_column":"created_date", "search_colum_value":"last 3 days", "search_column_operator":"in", "search_column2":"finished_date", "search_column2_value":"next 5 days", "search_column2_operator":"in", "search_column3":"spec_state", "search_column3_value":["New","On Hold","In Progress"], "search_column3_operator":"in", "print_width":"20cm"}

Then I tried to create a fine-tuned model with the following JSONL file:

{"prompt":"search active projects created in last 2 months ->","completion":" {\"search_object\":\"active projects\", \"search_peroid\":\"in last 2 months\", \"search_peroid_column\":\"created_date\"}"}
{"prompt":"find all active projects ->","completion":" {\"search_object\":\"active projects\"}"}
{"prompt":"search projects whose name contains \"test\" ->","completion":" {\"search_object\":\"projects\", \"search_column\":\"project_name\", \"search_column_value\":\"test\", \"search_operator\":\"contain\"}"}
{"prompt":"search projects whose name is \"test\", and project status is \"New\" or \"On Hold\" or \"In Progress\" ->","completion":" {\"search_object\":\"projects\", \"search_column\":\"project_name \", \"search_column_value\":\"test\", \"search_column_operator\":\"equal\",\"search_column2\":\"project_status\", \"search_column2_value\":[\"New\",\"On Hold\",\"In Progress\"], \"search_column2_operator\":\"in\"}"}
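A file like the one above does not have to be escaped by hand. The following is a minimal sketch, assuming the training pairs are kept as Python objects; the `EXAMPLES` list and the helper names are hypothetical, and the `" ->"` separator is simply the one used in the file above. `json.dumps` takes care of escaping the inner double quotes.

```python
import json

# Hypothetical training pairs: natural-language request -> command object.
EXAMPLES = [
    ("find all active projects", {"search_object": "active projects"}),
    (
        'search projects whose name contains "test"',
        {
            "search_object": "projects",
            "search_column": "project_name",
            "search_column_value": "test",
            "search_column_operator": "contain",
        },
    ),
]

SEPARATOR = " ->"  # prompt suffix used in the file above


def to_jsonl_line(text, command):
    """Serialize one training pair; json.dumps escapes the inner quotes."""
    record = {
        "prompt": text + SEPARATOR,
        # Leading space, then the command itself serialized as a JSON string.
        "completion": " " + json.dumps(command),
    }
    return json.dumps(record)


def build_jsonl(examples):
    """Join all records into JSONL text, one record per line."""
    return "\n".join(to_jsonl_line(t, c) for t, c in examples) + "\n"


if __name__ == "__main__":
    print(build_jsonl(EXAMPLES), end="")
```

Because both the record and the nested command go through `json.dumps`, every line can be read back with `json.loads`, and the completion itself parses as a JSON object.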

The fine-tuned model was created successfully, but when I tried to use it, it gave me weird answers:

openai api completions.create -m davinci:ft-personal:noosh-search-2023-02-24-21-59-51 -p "search projects whose name contains \"test\" ->"
>> search projects whose name contains "test" -> "test project" or contains "project"
openai api completions.create -m davinci:ft-personal:noosh-search-2023-02-24-21-59-51 -p "find all active projects ->"
>> find all active projects -> run search -> project name (including 'Project') returns all found Project names correctly

Can anyone help me figure out why?

Your JSONL data is not valid. From a quick test:

Sorry, I do not have free time to go through every line in your JSONL data and suggest changes to ensure you have valid JSONL data, but I think it is very good for developers to create their own validations before submitting training data to the API. They should validate for both:

  • JSONL validation
  • OpenAI Dataset Formatting validation
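The two checks above could be combined into one small function. This is only a sketch: the formatting rules it applies (prompt ends with a fixed separator, completion starts with a space and ends with a fixed stop sequence) are the ones I recall from the legacy fine-tuning guide, the `" ->"` separator is taken from the file in this thread, and the `"\n"` stop sequence is an assumption.

```python
import json

PROMPT_SEPARATOR = " ->"  # separator used in the training file in this thread
STOP_SEQUENCE = "\n"      # assumed stop sequence; pick one and use it everywhere


def check_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    # Check 1: the line must be valid JSON and a JSON object.
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    # Check 2: dataset formatting (required fields, separator, stop sequence).
    problems = []
    for key in ("prompt", "completion"):
        if not isinstance(record.get(key), str):
            problems.append(f"missing or non-string {key!r}")
    if problems:
        return problems

    if not record["prompt"].endswith(PROMPT_SEPARATOR):
        problems.append("prompt does not end with the separator")
    if not record["completion"].startswith(" "):
        problems.append("completion does not start with a space")
    if not record["completion"].endswith(STOP_SEQUENCE):
        problems.append("completion does not end with the stop sequence")
    return problems
```

Calling `check_line` on each line of the training file and printing any non-empty result gives a quick report before uploading. The legacy `openai` CLI also shipped a `openai tools fine_tunes.prepare_data -f <file>` helper that performs similar checks.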

HTH

:slight_smile:

Hi @kendarkfire

Adding to @ruby_coder's response: once you have fixed the JSONL according to the data formatting recommended by OpenAI, consider this from the docs on fine-tuning:

The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we’ve found that each doubling of the dataset size leads to a linear increase in model quality.

Yes, but it does not matter how much training data one has; if the training data does not meet the OpenAI training data formatting requirements, it’s not going to work properly:

In all computer programming, it is important to validate your data completely before using that data.

:slight_smile:

Ah, thanks for checking this. I actually checked the JSONL format with the following Python script and it didn’t report any errors:

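import json


def validate_jsonl_file(file_path):
    """Return True if every line is a JSON object with prompt and completion."""
    with open(file_path, "r", encoding="utf-8") as f:
        # Iterate lazily; start=1 gives human-friendly line numbers.
        for line_num, line in enumerate(f, start=1):
            try:
                json_obj = json.loads(line)
                print("prompt-->", json_obj["prompt"])
                print("completion-->", json_obj["completion"])
            except (ValueError, KeyError, TypeError) as e:
                # ValueError: malformed JSON; KeyError/TypeError: missing field or non-object line.
                print(f"Error on line {line_num}: {e!r}")
                return False
    return True


validate_jsonl_file("/home/jupyter/openai/fine-tuning.jsonl")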

I saw your script on the page: Method to Validate JSONL and JSONL+ OpenAI API Fine-Tuning Requirements - #2 by ruby_coder
For the Regex:

^\{"prompt":\s*"([^"]+)",\s*"completion":\s*"([^"]+)"\s*\}$

It normally works fine, but it doesn’t match if the prompt value or completion value includes a double quote, even an escaped one, like:

{"prompt":"find all active projects ->","completion":" {\"search_object\":\"active projects\"}"}

Does the fine-tuning JSONL format not support double quotes in the prompt value or the completion value?
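To make the difference concrete, here is a small sketch comparing the regex quoted above with Python's JSON parser on that same line. The `RELAXED` pattern is a hypothetical variant of my own that also allows backslash escapes inside the values; it is not part of the script referenced above.

```python
import json
import re

# Pattern quoted above: [^"]+ rejects any value containing a double quote,
# including correctly escaped ones (\").
STRICT = re.compile(r'^\{"prompt":\s*"([^"]+)",\s*"completion":\s*"([^"]+)"\s*\}$')

# Hypothetical relaxed variant: a value is a run of characters that are
# neither a quote nor a backslash, or backslash escapes such as \" and \\.
VALUE = r'((?:[^"\\]|\\.)+)'
RELAXED = re.compile(
    r'^\{"prompt":\s*"' + VALUE + r'",\s*"completion":\s*"' + VALUE + r'"\s*\}$'
)

LINE = r'{"prompt":"find all active projects ->","completion":" {\"search_object\":\"active projects\"}"}'

strict_ok = STRICT.match(LINE) is not None    # False: stops at the first \"
relaxed_ok = RELAXED.match(LINE) is not None  # True: escapes are allowed
record = json.loads(LINE)                     # the JSON parser accepts the line

print(strict_ok, relaxed_ok)
print(record["completion"])
```

So the line is well-formed JSON with escaped quotes; it is the `[^"]+` part of the regex that is stricter than the JSON grammar.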

Yes, it’s better to provide hundreds of examples, but I’m just doing a simple test for now. Thank you for your advice.

I think fine-tuning doesn’t support “text to command”, which requires the “text-davinci-003” model, as mentioned on this page: Can I fine-tune on text-davinci-003? | OpenAI Help Center
"cannot currently fine-tune on text-davinci-003 (or any other instruction following model such as text-davinci-001)"
Only the base models are supported: ada, babbage, curie, and davinci.
I tried to create a fine-tuned model with:

openai api fine_tunes.create -t /home/jupyter/openai/fine-tuning.jsonl -m text-davinci-003 --suffix "nooshai"

It failed with the following error:

Error: Invalid base model: text-davinci-003 (model must be one of ada, babbage, curie, davinci) or a fine-tuned model created by your organization: org-0tT8t9UeiDt50wviJ6oQG9Ph (HTTP status code: 400)

If you look at a standard JSON validator, you cannot have double quotes inside a string value unless they are escaped (or converted to HTML entities, Unicode, etc.).

This will not validate because of the double quotes:

Standard JSON validators will give errors like this:

Of course, you can escape the “inside” double quotes, like so:

and of course it will be valid JSON:

You can use HTML entities for double quotes and this will validate:

and in many programming languages, the escapes or HTML entities might be added automatically.
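The three cases described above can be reproduced with Python's standard `json` module. This is a minimal sketch with made-up sample strings; `is_valid_json` is a hypothetical helper, not part of any validator mentioned in this thread.

```python
import html
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Inner double quotes left as-is: the string ends early, so parsing fails.
unescaped = '{"prompt": "name contains "test" ->"}'
# Inner double quotes escaped with a backslash: valid JSON.
escaped = '{"prompt": "name contains \\"test\\" ->"}'
# Inner double quotes replaced by HTML entities: also valid JSON.
entities = '{"prompt": "name contains &quot;test&quot; ->"}'

results = {
    "unescaped": is_valid_json(unescaped),
    "escaped": is_valid_json(escaped),
    "entities": is_valid_json(entities),
}
print(results)

# The JSON parser leaves &quot; as literal text; decoding is a separate step.
decoded = html.unescape(json.loads(entities)["prompt"])
print(decoded)
```

Note that with the entity approach the parsed value still contains the literal characters `&quot;`, so whatever consumes the data has to decode them explicitly.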

HOWEVER,

My experience is that it is better to NOT leave data cleansing and adding escapes to “chance”, in the “hope” that all goes well and the JSON validates after being serialized and transferred over the network.

Also, in “standard” AI for example, using AI techniques of multi-sensor data fusion, data cleansing and validation before submitting data to “the next level” to be processed is just basic 101 software engineering.

For me, this means I always cleanse my data before submitting it to an API and I do not leave anything to chance. So the short answer to your question is that I remove double quotes in JSON or convert them to HTML entities in the data cleansing and validation part of my code workflow.

I see some developers say “the system will automatically escape them”; and of course others are entitled to not cleanse their data believing their methods will automatically add escapes as required. But, after decades and decades of coding, I do not leave anything to chance because it takes less time to cleanse and validate data than it does to debug when errors happen, and errors always happen.

Other developers can prepare / cleanse their data and validate as they desire. However, I recommend all developers do both (cleanse their data and validate) before submitting to an API, and especially the fine-tune create API method.

Hope this helps.

:slight_smile:

Yes, the data should always be validated before submitting it to the API. Thank you for your advice; I will escape the double quotes and try again.

I’d also like to add that you may be better off using a classification model for this type of work. GPT will hallucinate information if given the opportunity which means lots of small errors.

I have tried using the base cGPT for NLP → JSON myself. The output varies slightly and is also sometimes broken.

A trained classification model would extract these entities that you’re looking for and fill an object instead of create one on the fly. That’s the path I’m on anyways. I could be wrong.
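Whichever approach is used, one hedge against the "sometimes broken" output described above is to check each generated command before acting on it. The sketch below is only an illustration: the allowed-key pattern is a hypothetical schema modeled on the key names used earlier in this thread, not something any of the models enforce.

```python
import json
import re

# Hypothetical schema: keys the downstream search code is prepared to handle,
# e.g. search_object, search_column, search_column2_value, search_column3_operator.
ALLOWED_KEY = re.compile(r"^(search_object|search_column\d*(_value|_operator)?)$")


def parse_command(model_output):
    """Parse a generated command; return (command, problems)."""
    try:
        command = json.loads(model_output)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc}"]
    if not isinstance(command, dict):
        return None, ["top-level value is not an object"]

    # Flag keys the schema does not know, such as misspelled or invented ones.
    problems = [f"unexpected key: {k}" for k in command if not ALLOWED_KEY.match(k)]
    if "search_object" not in command:
        problems.append("missing search_object")
    return command, problems


if __name__ == "__main__":
    output = '{"search_object":"orders", "search_colum_value":"last 3 days"}'
    print(parse_command(output)[1])
```

A command with a non-empty problem list can then be rejected or retried instead of being executed.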

Right, since a fine-tuned model cannot be created from “text-davinci-003”, a trained classification model could be a solution for this. I will try that too.
Thanks!