How to create a correct JSONL for training

bob.looker · March 22, 2024, 2:12pm

Hi,
I’m new to OpenAI.
I would like to instruct the AI (gpt-35-turbo) to generate a simple JSON structure that will contain, for example, three fields and their values.
The user has to list those fields (for example brand, color, size) in this way:
brand brand1, color red, size big
or
brand is brand1, color is red, size is big
or
color red, brand is brand1, size is big
and so on…
the response of the AI to the request should be a JSON structure, for example
{
"brand": "brand1",
"color": "red",
"size": "big"
}

I know that to train a GPT model I need to create a JSONL file, but it is not clear to me what I should write in it.
In my example, should the content be the following?

{"prompt": "brand", "completion": "brand"}
{"prompt": "brand is", "completion": "brand"}
{"prompt": "color", "completion": "color"}
{"prompt": "color is", "completion": "color"}
{"prompt": "size", "completion": "size"}
{"prompt": "size is", "completion": "size"}

With such a file, what would the user have to write (as a request to the AI) to obtain the JSON structure with the three fields above?
Thank you, regards.

Roberto

jr.2509 · March 22, 2024, 2:36pm

Hi and welcome to the Forum!

Can I double check what the purpose of your fine-tuning is? You want the model to consistently respond in a JSON format based on a user’s information about brand, color and size?

If so, have you tried achieving the same with just conventional prompting techniques?

As regards your training set: if you do decide to go down the path of fine-tuning, you’d simply use one full user message as the prompt and the JSON in the desired format as the output. So the sentence "brand brand1, color red, size big" would constitute one prompt, or one user message. The full JSON with the keys would then constitute the completion, or assistant message.

Details on the specific conventions for the training file format depending on the model you would like to use (GPT 3.5 versus one of the older models) are available here.

Finally, as you are planning to have JSON as output, there are some further formatting aspects that you need to be mindful of. Here’s a training example in the required format for fine-tuning a GPT-3.5 model.

{"messages": [{"role": "user", "content": "brand is brand1, color is red, size is big"}, {"role": "assistant", "content": "{\"brand\": \"brand1\", \"color\": \"red\", \"size\": \"big\"}"}]}
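
A minimal sketch of how such lines could be generated in Python, so that the quotes of the JSON inside the assistant message are escaped correctly. The helper name `make_example` and the sample sentences are illustrative assumptions, not part of any OpenAI tooling.

```python
import json

def make_example(user_text, fields):
    """Build one chat-format training line (hypothetical helper)."""
    # The assistant reply is itself JSON, so serialize it to a string first.
    assistant_content = json.dumps(fields)
    record = {
        "messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_content},
        ]
    }
    # Serializing the whole record escapes the inner double quotes.
    return json.dumps(record)

expected = {"brand": "brand1", "color": "red", "size": "big"}

# Several phrasings of the same request, all mapped to the same JSON.
samples = [
    ("brand brand1, color red, size big", expected),
    ("brand is brand1, color is red, size is big", expected),
    ("color red, brand is brand1, size is big", expected),
]

lines = [make_example(text, fields) for text, fields in samples]

# One JSON object per line is what makes the text JSONL.
jsonl_text = "\n".join(lines)
print(jsonl_text)
```

The printed text can be saved to a `.jsonl` file. A real training set would need many more examples than these three.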

fabrizio.gaucci · March 22, 2024, 2:42pm

I’m certainly not the top expert, but I don’t understand why you need a JSON file to train your GPT. For instance, I used Word and Excel files to train my custom GPT.

bob.looker · March 22, 2024, 3:04pm

Actually, I’m trying to integrate OpenAI support into a Windows application (written in C#).
The user must fill in a form whose fields are not always the same; they depend on a choice the user made previously.
For example, if I select “option 1” I have a certain list of fields to fill in, if I select “option 2” I have a different set of fields, and so on.
The complete list of all available fields is stored in a SQL table.
The user could fill in that form by typing on the keyboard or, if I manage it, through “voice dictation” with a microphone.
I already have the code that “translates” speech into text; I need to develop the part that “translates” that text into JSON that I can later send to a web API.
I would like to use OpenAI to obtain the JSON structure containing the fields that the user has “pronounced” (with their values), so that every field in the JSON matches one present in the SQL table.
So I thought of building a file (every example of a dataset file I have seen is a JSONL) to instruct the AI to recognize the field name (among all available fields) and its value.
For example “the brand is brand1, the color is red, the size is big” and every combination of those.
How could I do that?
What should I write in the dataset file?
This point is not clear to me, thank you.

Roberto

bob.looker · March 22, 2024, 3:42pm

If I may ask, what did you write into those Excel files?

bob.looker · March 22, 2024, 3:51pm

Probably the biggest problem is that every user can set up a custom form with virtually no limit on the number of fields.
I could set up a form with 5 fields or 30 fields, and their names (actually alphanumeric codes) could be anything!
So the “training file” would have to be built from each user’s exact field list.

jr.2509 · March 22, 2024, 4:47pm

Yeah, I think that unless you have a defined set of fields and/or a manageable database schema, the fine-tuning route is likely not the best option. You would need to supply information about your database schema as part of the training dataset (e.g. in a system message) for the model to deliver reliable results.

Depending on your exact workflow, and assuming that your ultimate goal is to extract data from the SQL DB based on the values from the JSON, you might want to look into function calling. Under the following link from the OpenAI cookbook you will find a worked example that also covers SQL queries.
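
A rough sketch of what a function-calling tool definition for this use case might look like, built from a list of field names. The function name `save_form_fields` and the helper `build_tool` are hypothetical; the dictionary follows the `tools` entry shape used by the Chat Completions API, with the parameters described as a JSON Schema object. The API request itself is left out.

```python
import json

def build_tool(field_names):
    """Return a tool definition describing the form fields (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": "save_form_fields",  # hypothetical function name
            "description": "Store the field values dictated by the user.",
            "parameters": {
                "type": "object",
                # One optional string property per form field.
                "properties": {name: {"type": "string"} for name in field_names},
                "additionalProperties": False,
            },
        },
    }

# In the real application the names would come from the SQL table of fields.
tool = build_tool(["brand", "color", "size"])
print(json.dumps(tool, indent=2))
```

Because the definition is generated from the field list, it could be rebuilt for each user’s form instead of being baked into a fine-tuned model.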

bob.looker · March 25, 2024, 9:25am

Thanks for your reply, I’ve read the article.
Actually, I don’t need the AI to extract data directly from the SQL DB (I could also use a different DB engine like Oracle, so I need something that doesn’t depend on the DB engine).
What I need is to give the user an additional way to enter data, in this case by voice.
The part of the program that creates the form (with its fields) and saves the field values in the SQL DB is already written and working.
Just to clarify, the steps will be the following:

1. “MS Cognitive Services” converts the spoken language into text; the user “dictates” each field’s name/code together with its value.
2. The AI should create a (simple) JSON structure (I chose JSON because I already have a JSON parser in the program) with “field name: value” pairs, using the exact name of the field as it appears in the SQL custom field table.
   For example, the name of the field is “brand”, so the user may pronounce it as “brand brand1” or “brand is brand1” or “the brand is brand1”, etc. The AI should reply with only “brand:brand1” in all those cases, that’s it.
3. The resulting JSON is then processed by the program and its data passed to the (existing) function that saves it.
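
The last step could be guarded by a small check that the reply is valid JSON and only uses known field names. This is a sketch in Python rather than the application’s C#, and `parse_reply` and the sample reply are hypothetical.

```python
import json

def parse_reply(reply_text, allowed_fields):
    """Parse the model's reply and reject field names that are not in the table."""
    # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON.
    data = json.loads(reply_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unknown = set(data) - set(allowed_fields)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return data

# Field names as they would be read from the custom field table.
allowed = ["brand", "color", "size"]
reply = '{"brand": "brand1", "size": "big"}'
print(parse_reply(reply, allowed))
```

Rejecting unknown keys before saving keeps a misheard or invented field name from reaching the database code.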

So, considering that I have yet to find the best solution for my case, here are some questions:

Do I need to “train” the AI to recognize the field names by providing the list of all possible fields?
If the user already pronounces the field name exactly as its “identifier” (the one in the custom field table), could “AI training” be avoided?
Could “prompt engineering” be a solution?
Could creating a GPT be a solution?

Roberto

fabrizio.gaucci · March 25, 2024, 2:07pm

I used .xlsx files as a large dataset to test the assistant. The instructions that teach the agent how to use the data are written in the Word documents.

HappyQuokka · March 25, 2024, 2:59pm

Is it possible to get access to fine-tuning for GPT-4? What are the criteria for gaining that access? I am trying to teach it a new programming language that the models are not familiar with (Verse, from Unreal Editor for Fortnite), and I’m not sure GPT-3.5 will even understand what I want from it. The language is more difficult than Haskell, Rust, etc., so even GPT-4 in the chat API goes nuts trying to figure out the concepts. It’s a very, very hard task. I can’t wait for GPT-5. By the way, Opus does a better job; just saying, it clearly blows gpt4-turbo out of the water, visible even from the first query asking it to “understand and summarize”. GPT-4 tries its best, but it’s a playful Picasso-Michelangelo versus the smart and scary Opus. That robot is more Skynet than anything I’ve seen yet. Waiting for the next version, fun but scary.

jr.2509 · March 26, 2024, 9:11am

Hi again @HappyQuokka -

You need to check in the fine-tuning interface whether it allows you to request access when you select GPT-4 in the dropdown. If it does, you’ll be presented with the form below to fill out. After that, it is entirely up to the OpenAI team to decide whether to grant you access.

Based on what we know, very few developers / organizations have currently been granted access to GPT-4 fine-tuning. In that regard, see additionally the information below from the OpenAI Help Center.

Currently, fine-tuning GPT-4 is only available through an experimental access program. Preliminary results indicate that GPT-4 fine-tuning requires more work to achieve meaningful improvements over the base model compared to the substantial gains achieved with GPT-3.5 fine-tuning. As quality and safety for GPT-4 fine-tuning improves, developers actively using GPT-3.5 fine-tuning will be presented with an option to apply to the GPT-4 program within their fine-tuning console.

bob.looker · March 27, 2024, 9:16am

Hi,
as suggested by a user, I tried “prompt engineering” and found the right prompt to obtain the expected JSON as the response, without struggling with training/fine-tuning.
Thanks for all the suggestions.
Regards,

Roberto
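
The prompt that ended up working was not posted in the thread. As a purely hypothetical sketch of the prompt-only approach, the field list could be placed in a system message that is rebuilt for each form; the wording and the helper name `build_messages` are assumptions.

```python
def build_messages(field_names, user_text):
    """Assemble chat messages that ask for a JSON-only reply (hypothetical helper)."""
    system = (
        "Extract field values from the user's text. "
        "Reply with a single JSON object and nothing else. "
        "Use only these keys, spelled exactly: " + ", ".join(field_names) + ". "
        "Omit keys the user did not mention."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_text},
    ]

messages = build_messages(
    ["brand", "color", "size"],
    "the brand is brand1, the color is red, the size is big",
)
for message in messages:
    print(message["role"], ":", message["content"])
```

The resulting list is in the chat message format and would be sent as the `messages` of a chat completion request. Since the field names are injected at request time, no retraining would be needed when a user changes a form.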
