Should I use json format in the assistant content in my fine-tuning dataset if I want the fine-tuned model to return json output using the new json response format?

I want to create a new fine-tuned model for a specific scenario that requires the output to be in json format.
I see that this is now an option for the API (https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format), but I wanted to know how to prepare my fine-tuning dataset to ensure this.

Hi and welcome to the Developer Forum!

To train the model to always output JSON, you would include Q/A pairs where the question is plain text and the answer is a JSON string, ideally across thousands of examples.
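As a minimal sketch of what one training line could look like, here is a helper that builds an example in the OpenAI chat fine-tuning format (one JSON object per line of a `.jsonl` file). The question, answer, and system prompt are placeholders; the key point is that the assistant content must be a JSON *string*, so the answer dict is serialized with `json.dumps`:

```python
import json

def make_example(question: str, answer: dict) -> str:
    """Build one line of a chat-format fine-tuning .jsonl file."""
    record = {
        "messages": [
            {"role": "system", "content": "Respond only with valid JSON."},
            {"role": "user", "content": question},
            # Serialize the answer so the assistant content is a string, not a dict.
            {"role": "assistant", "content": json.dumps(answer)},
        ]
    }
    return json.dumps(record)

line = make_example(
    "What is the 'Pricing Index:'?",
    {"Pricing Index": {"value": "WALL STREET JOURNAL", "page_no": 1}},
)
print(line)
```

At inference time you can then `json.loads` the assistant reply to recover the structured answer.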

As this is such a common request, the latest preview models have a JSON mode which can be enabled to offer this as a standard feature. It's still in development but can work extremely well.
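A sketch of what a JSON-mode request could look like — the model name is an example, and note that the API requires the word "JSON" to appear somewhere in your prompt when this mode is enabled. The function only builds the request parameters; passing them to the client is shown in the comment:

```python
def build_json_mode_request(question: str) -> dict:
    """Assemble Chat Completions parameters with JSON mode enabled."""
    return {
        "model": "gpt-3.5-turbo-1106",  # any JSON-mode-capable model
        "response_format": {"type": "json_object"},  # enables JSON mode
        "messages": [
            # JSON mode requires the prompt to mention JSON explicitly.
            {"role": "system", "content": "You answer only in valid JSON."},
            {"role": "user", "content": question},
        ],
    }

params = build_json_mode_request("What is the 'Pricing Index:'?")
# Usage (not run here): client.chat.completions.create(**params)
```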

I tried this and provided the answers in JSON form, but we are getting some errors.
Below is one example from the dataset:
{
  "question": "What is the 'Pricing Index:'?",
  "answer": {
    "Pricing Index ": {
      "value": "WALL STREET JOURNAL ",
      "key_confidence": 69.15828704833984,
      "val_confidence": 99.57627868652344,
      "key_coordinate": [
        [0.06081288680434227, 0.11975186318159103, 0.04487118124961853, 0.011939369142055511],
        [0.10870182514190674, 0.11992765218019485, 0.03996221348643303, 0.009904693812131882]
      ],
      "value_coordinate": [
        [0.2841671407222748, 0.12102614343166351, 0.035881157964468, 0.009326338768005371],
        [0.3225521147251129, 0.12074518948793411, 0.047327350825071335, 0.009661246091127396],
        [0.37166130542755127, 0.12099397927522659, 0.061179619282484055, 0.009609841741621494]
      ],
      "page_no": 1
    }
  }
}

When I use this dataset to fine-tune the "meta-llama/Llama-2-7b-hf" model, I get the error below:


ValueError                                Traceback (most recent call last)
in <cell line: 96>()
     94 )
     95 # Trainer
---> 96 fine_tuning = SFTTrainer(
     97     model=base_model,
     98     train_dataset=tokenized_train_dataset,

10 frames
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py in _call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2859
   2860         if not _is_valid_text_input(text):
-> 2861             raise ValueError(
   2862                 "text input must be of type str (single example), List[str] (batch or single pretokenized example) "
   2863                 "or List[List[str]] (batch of pretokenized examples)."

ValueError: text input must be of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).
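The traceback says the tokenizer received something that is not a string, which is what happens when the `"answer"` field is still a Python dict rather than text. One likely fix (a sketch — the prompt template and field names here are assumptions, not your actual pipeline) is to serialize each answer with `json.dumps` when building the training text, before anything is tokenized:

```python
import json

def format_record(record: dict) -> str:
    """Turn one dataset record into a single training string.

    The tokenizer only accepts str / List[str]; passing the raw
    answer dict is exactly what raises the ValueError above.
    """
    answer_text = json.dumps(record["answer"])  # dict -> JSON string
    return f"### Question:\n{record['question']}\n### Answer:\n{answer_text}"

record = {
    "question": "What is the 'Pricing Index:'?",
    "answer": {"Pricing Index": {"value": "WALL STREET JOURNAL", "page_no": 1}},
}
text = format_record(record)
assert isinstance(text, str)  # now safe to hand to the tokenizer
```

With `trl`'s `SFTTrainer`, a function like this can typically be applied to the dataset up front (e.g. via `dataset.map`) so every example is already a plain string when training starts.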