How to generate a fine-tuning "text to command" model

kendarkfire · February 25, 2023, 1:06am

I tried the “text to command” example on the Playground page (OpenAI API).

Convert this text to a programmatic command:
Example: search active projects created in last 2 months
{"search_object":"active projects", "search_peroid":"in last 2 months", "search_peroid_column":"created_date" }
Example: find active projects
{"search_object":"active projects"}
Example: search projects whose name contains "test"
{"search_object":"projects", "search_column":"project_name", "search_column_value":"test", "search_column_operator":"contain" }  
Example: search projects whose name is "test", and project status is "New" or "On Hold" or "In Progress"
{"search_object":"projects", "search_column":"project_name ", "search_column_value":"test", "search_column_operator":"equal","search_column2":"project_status", "search_column2_value":["New","On Hold","In Progress"], "search_column2_operator":"in" }

It works fine on the Playground page. When I entered “search orders created in last 3 days, finished in next 5 days, spec state is "New" or "On Hold" or "In Progress", print width is 20cm”, it gave the expected result:

{"search_object":"orders", "search_column":"created_date", "search_colum_value":"last 3 days", "search_column_operator":"in", "search_column2":"finished_date", "search_column2_value":"next 5 days", "search_column2_operator":"in", "search_column3":"spec_state", "search_column3_value":["New","On Hold","In Progress"], "search_column3_operator":"in", "print_width":"20cm"}

Then I tried to create a fine-tuned model with the following JSONL file:

{"prompt":"search active projects created in last 2 months ->","completion":" {\"search_object\":\"active projects\", \"search_peroid\":\"in last 2 months\", \"search_peroid_column\":\"created_date\"}"}
{"prompt":"find all active projects ->","completion":" {\"search_object\":\"active projects\"}"}
{"prompt":"search projects whose name contains \"test\" ->","completion":" {\"search_object\":\"projects\", \"search_column\":\"project_name\", \"search_column_value\":\"test\", \"search_operator\":\"contain\"}"}
{"prompt":"search projects whose name is \"test\", and project status is \"New\" or \"On Hold\" or \"In Progress\" ->","completion":" {\"search_object\":\"projects\", \"search_column\":\"project_name \", \"search_column_value\":\"test\", \"search_column_operator\":\"equal\",\"search_column2\":\"project_status\", \"search_column2_value\":[\"New\",\"On Hold\",\"In Progress\"], \"search_column2_operator\":\"in\"}"}
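A file like this can also be generated rather than written by hand, so the serializer handles the quote escaping. The following is a minimal sketch, not part of the original post: `to_jsonl_line` and the `examples` list are hypothetical names, and the " ->" suffix and leading space simply mirror the lines above.

```python
import json

# Hypothetical helper: the " ->" separator and the leading space in the
# completion mirror the examples in this post.
def to_jsonl_line(text, command):
    record = {
        "prompt": f"{text} ->",
        "completion": " " + json.dumps(command),
    }
    # json.dumps escapes the inner double quotes automatically.
    return json.dumps(record)

examples = [
    ("find all active projects", {"search_object": "active projects"}),
    ('search projects whose name contains "test"',
     {"search_object": "projects", "search_column": "project_name",
      "search_column_value": "test", "search_column_operator": "contain"}),
]

# One JSON object per line is what makes the output JSONL.
jsonl_text = "\n".join(to_jsonl_line(t, c) for t, c in examples)
print(jsonl_text)
```

Building each completion from a dict also keeps key names consistent from one example to the next.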

The fine-tuned model was created successfully, but when I tried to use it, it gave me weird answers:

openai api completions.create -m davinci:ft-personal:noosh-search-2023-02-24-21-59-51 -p "search projects whose name contains \"test\" ->"
>> search projects whose name contains "test" -> "test project" or contains "project"

openai api completions.create -m davinci:ft-personal:noosh-search-2023-02-24-21-59-51 -p "find all active projects ->"
>> find all active projects -> run search -> project name (including 'Project') returns all found Project names correctly

Can anyone help figure out why?

ruby_coder · February 25, 2023, 1:47am

Your JSONL data is not valid. From a quick test:

Sorry, I do not have free time to go through every line in your JSONL data and suggest changes to ensure it is valid, but I think it is very good for developers to create their own validations before submitting training data to the API. They should validate for both:

JSONL validation
OpenAI Dataset Formatting validation
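The two checks can be combined in one pass. This is a rough sketch under stated assumptions, not an official validator: `check_line`, `SEPARATOR`, and `STOP` are hypothetical names. The rules (prompt ends with a fixed separator, completion starts with a space and ends with a stop sequence) follow the conventions recommended in the legacy fine-tuning guide, and the specific separator and stop values are placeholders.

```python
import json

SEPARATOR = " ->"  # assumed prompt suffix, taken from the examples in this thread
STOP = "\n"        # assumed stop sequence; the data in this thread does not use one

def check_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    # Check 1: the line must be valid JSON.
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    # Check 2: dataset formatting conventions.
    problems = []
    for key in ("prompt", "completion"):
        if not isinstance(record.get(key), str):
            problems.append(f"missing or non-string {key!r}")
    if problems:
        return problems
    if not record["prompt"].endswith(SEPARATOR):
        problems.append("prompt does not end with the separator")
    if not record["completion"].startswith(" "):
        problems.append("completion does not start with a space")
    if not record["completion"].endswith(STOP):
        problems.append("completion does not end with the stop sequence")
    return problems

sample = '{"prompt":"find all active projects ->","completion":" {\\"search_object\\":\\"active projects\\"}"}'
print(check_line(sample))
```

Run on the sample line, the only problem reported is the missing stop sequence. Without one, the model has no trained signal for where a completion should end.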

HTH

sps · February 25, 2023, 1:56am

Hi @kendarkfire

Adding to @ruby_coder's response: once you have fixed the JSONL according to the data formatting recommended by OpenAI, consider this from the docs on fine-tuning:

The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we’ve found that each doubling of the dataset size leads to a linear increase in model quality.

ruby_coder · February 25, 2023, 1:59am

Yes, but it does not matter how much training data one has: if the training data does not meet the OpenAI Training Data Formatting requirements, it’s not going to work properly.

In all computer programming, it is important to validate your data completely before using that data.

kendarkfire · February 25, 2023, 5:31am

Ah, thanks for checking this. I actually checked the JSONL format with the following Python script, and it didn’t report any errors:

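import json

def validate_jsonl_file(file_path):
    """Return True if every line parses as JSON and has prompt and completion keys."""
    with open(file_path, 'r', encoding='utf-8') as f:
        # Iterate lazily and number lines from 1 for readable error messages
        for line_num, line in enumerate(f, start=1):
            try:
                json_obj = json.loads(line)
                print("prompt-->", json_obj["prompt"])
                print("completion-->", json_obj["completion"])
            except (ValueError, KeyError, TypeError) as e:
                # ValueError: malformed JSON; KeyError/TypeError: a required key is missing
                print(f"Error parsing line {line_num}: {e}")
                return False
    return True

validate_jsonl_file("/home/jupyter/openai/fine-tuning.jsonl")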

I saw your script on the page: Method to Validate JSONL and JSONL+ OpenAI API Fine-Tuning Requirements - #2 by ruby_coder
For the Regex:

^\{"prompt":\s*"([^"]+)",\s*"completion":\s*"([^"]+)"\s*\}$

It normally works fine, but it doesn’t match if the prompt or completion value includes a double quote, like:

{"prompt":"find all active projects ->","completion":" {\"search_object\":\"active projects\"}"}

Does the fine-tuning JSONL file not support double quotes in the prompt or completion value?
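The difference between the two checks can be reproduced in a few lines. This is only an illustrative sketch using Python's standard `json` and `re` modules; `PATTERN`, `line`, and `regex_ok` are names chosen here. It runs the regex quoted above and a JSON parser against the same line.

```python
import json
import re

# The pattern quoted above: [^"]+ stops at the first double quote it meets,
# including an escaped one.
PATTERN = re.compile(r'^\{"prompt":\s*"([^"]+)",\s*"completion":\s*"([^"]+)"\s*\}$')

# Raw string, so each \" reaches the parser exactly as it appears in the file.
line = r'{"prompt":"find all active projects ->","completion":" {\"search_object\":\"active projects\"}"}'

regex_ok = PATTERN.match(line) is not None
record = json.loads(line)

print(regex_ok)              # the regex rejects the line
print(record["completion"])  # the JSON parser accepts it and unescapes the quotes
```

The line is valid JSON with escaped quotes. It is the `[^"]+` groups in the pattern that cannot span an escaped quote.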

kendarkfire · February 25, 2023, 5:33am

Yes, it’s better to provide hundreds of examples, but I’m just doing a simple test for now. Thank you for your advice.

kendarkfire · February 25, 2023, 5:46am

I think fine-tuning doesn’t support “text to command”, which requires the “text-davinci-003” model, as mentioned on this page: Can I fine-tune on text-davinci-003? | OpenAI Help Center
" cannot currently fine-tune on text-davinci-003 (or any other instruction following model such as text-davinci-001)"
It only supports the base models: ada, babbage, curie, and davinci.
I tried to create a fine-tuned model with:

openai api fine_tunes.create -t /home/jupyter/openai/fine-tuning.jsonl -m text-davinci-003 --suffix "nooshai"

It returned the following error:

Error: Invalid base model: text-davinci-003 (model must be one of ada, babbage, curie, davinci) or a fine-tuned model created by your organization: org-0tT8t9UeiDt50wviJ6oQG9Ph (HTTP status code: 400)

ruby_coder · February 25, 2023, 9:59am

If you look at a standard JSON validator, you cannot have double quotes inside a string unless they are escaped (or converted to HTML entities, Unicode escapes, etc.).

This will not validate because of the double quotes:

Standard JSON validators will give errors like this:

Of course, you can escape the “inside” double quotes, like so:

and of course it will be valid JSON:

You can use HTML entities for double quotes and this will validate:

and in many programming languages, the escapes or HTML entities might be added automatically.

HOWEVER,

My experience is that it is better to NOT leave data cleansing and adding escapes to “chance”, in the “hope” that all goes well and the JSON validates after being serialized and transferred over the network.

Also, in “standard” AI for example, using AI techniques of multi-sensor data fusion, data cleansing and validation before submitting data to “the next level” to be processed is just basic 101 software engineering.

This means for me, I always cleanse my data before submitting it to an API and I do not leave anything to chance; so the short answer to your question is that I remove double quotes in JSON or convert them to HTML entities in the data cleansing and validation part of my code work flow.
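To make the two options concrete, here is a small sketch using only Python's standard library. The variable names are hypothetical, and the entity conversion is a plain string replacement, assumed here as one way to do what is described above. It shows what a JSON parser returns in each case.

```python
import html
import json

text = 'search projects whose name contains "test" ->'

# Option 1: JSON escaping, as produced by the standard serializer.
json_escaped = json.dumps({"prompt": text})

# Option 2: convert double quotes to HTML entities before serializing.
entity_text = text.replace('"', "&quot;")
entity_escaped = json.dumps({"prompt": entity_text})

print(json_escaped)
print(entity_escaped)

# A JSON parser undoes the backslash escapes but leaves entities as literal text.
decoded = json.loads(entity_escaped)["prompt"]
print(json.loads(json_escaped)["prompt"])
print(decoded)

# Getting the original text back from the entity form needs an explicit step.
restored = html.unescape(decoded)
print(restored == text)
```

Both forms are valid JSON, but they are not equivalent after parsing. With entities, the string stored in the record contains the literal characters `&quot;` until something unescapes them.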

I see some developers say “the system will automatically escape them”; and of course others are entitled to not cleanse their data believing their methods will automatically add escapes as required. But, after decades and decades of coding, I do not leave anything to chance because it takes less time to cleanse and validate data than it does to debug when errors happen, and errors always happen.

Other developers can prepare / cleanse their data and validate as they desire. However, I recommend all developers do both (cleanse their data and validate) before submitting to an API, and especially the fine-tune create API method.

Hope this helps.

kendarkfire · February 25, 2023, 10:52pm

Yes, one should always validate the data before submitting it to the API. Thank you for your advice; I will escape the double quotes and try again.

RonaldGRuckus · February 25, 2023, 11:03pm

I’d also like to add that you may be better off using a classification model for this type of work. GPT will hallucinate information if given the opportunity which means lots of small errors.

I have tried using the base cGPT for NLP → JSON myself. The output varies slightly and is sometimes broken.

A trained classification model would extract these entities that you’re looking for and fill an object instead of create one on the fly. That’s the path I’m on anyways. I could be wrong.
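Whichever approach is used, generated text can be checked before it is treated as a command. A minimal sketch, assuming the command format from earlier in this thread: `parse_command` is a hypothetical helper, and requiring a "search_object" key is an assumption based on those examples. It does not replace a trained classifier; it only rejects output that is not usable JSON.

```python
import json

def parse_command(completion):
    """Return the parsed command dict, or None if the text is not usable."""
    try:
        command = json.loads(completion.strip())
    except json.JSONDecodeError:
        # Broken or free-form output, such as the answers quoted earlier.
        return None
    # Assumed minimum shape: a JSON object that names what to search.
    if not isinstance(command, dict) or "search_object" not in command:
        return None
    return command

print(parse_command(' {"search_object":"active projects"}'))
print(parse_command('"test project" or contains "project"'))
```

Returning None lets the caller retry or fall back instead of acting on a malformed command.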

kendarkfire · February 27, 2023, 4:41am

Right, since a fine-tuned model cannot be created from “text-davinci-003”, a trained classification model could be a solution for this. I will try that too.
Thanks!
