Can a model be trained to generate JSON? (If so, is my training data set up correctly?)

I want to train my model to generate a JSON string that follows strict guidelines. In the examples below, when the JSON is generated from the prompt, I want it to look like the desired output...

Prompt: Build a json string about the Johnson family with 2 kids. The parents are Karen and Taylor. They have 2 kids who are Angela and Anthony.
Desired Output:
{
   "Spouse1Name": "Karen Johnson",
   "Spouse2Name": "Taylor Johnson",
   "NumberOfKids":2,
   "Children":[
      {
         "FirstName":"Angela",
         "LastName":"Johnson"
      },
      {
         "FirstName":"Anthony",
         "LastName":"Johnson"
      }
   ]
}

Prompt: Build a json string with a family of 4. The parents' names are Jake and Kate Bua. They have 1 child named Lizzy.
Desired Output:
{
   "Spouse1Name": "Kate Bua",
   "Spouse2Name": "John Bua",
   "NumberOfKids":1,
   "Children":[
      {
         "FirstName":"Lizzy",
         "LastName":"Bua"
      }
   ]
}

I have tried to guide ChatGPT to do this and it does it fine, but sometimes it takes the liberty of changing the property names.
Bad output:
{
   "Spouse1": "Kate Bua",
   "Spouse2": "John Bua",
   "NumKids":1,
   "Children":[
      {
         "FirstName":"Lizzy",
         "LastName":"Bua"
      }
   ]
}

So what I want to know is: when I train my own model, will I be able to do this, and will creating prompt/output pairs like the examples below help? I just want to know before I spend time creating the training data.
I do plan to come up with hundreds of examples; these are just lazy ones for the sake of my inquiry...
{"prompt": "Build a json string about the Johnson family with 2 kids. The parents are Karen and Taylor. They have 2 kids who are Angela and Anthony.", "output": "{\"HideSections\":1,\"Sections\":[{\"SortOrder\":10,\"SectionTitle\":\"Main\"}],\"Questions\":[{\"ChoiceList\":\"[]\",\"DecimalsAttr\":\"\",\"FieldType\":\"checkBxSingle\",\"Formatting\":\"\",\"ID\":\"23a797ab-e70a-11ed-ae5a-00155d007a56\",\"IntegerOnly\":0,\"IsEmail\":0,\"IsUSState\":0,\"IsYesNo\":0,\"LookupId\":-1,\"MaxChars\":50,\"NumericOnly\":0,\"Optional\":0,\"QuestionOfficial\":\"What is your gender?\",\"SectionName\":\"Main\",\"SortOrder\":100,\"ChoiceListName\":\"\",\"SortBySortOrder\":0}]}"}{\"Spouse1Name\": \"Karen Johnson\",\"Spouse2Name\": \"Taylor Johnson\",\"NumberOfKids\":2,\"Children\":[{\"FirstName\":\"Angela\",\"LastName\":\"Johnson\" }, { \"FirstName\":\"Anthony\", \"LastName\":\"Johnson\" } ] }
{"prompt": "Build a json string about the Bua family with 2 kids. The parents are Karen and Taylor. They have 2 kids who are Angela and Anthony.", "output": "{\"HideSections\":1,\"Sections\":[{\"SortOrder\":10,\"SectionTitle\":\"Main\"}],\"Questions\":[{\"ChoiceList\":\"[]\",\"DecimalsAttr\":\"\",\"FieldType\":\"checkBxSingle\",\"Formatting\":\"\",\"ID\":\"23a797ab-e70a-11ed-ae5a-00155d007a56\",\"IntegerOnly\":0,\"IsEmail\":0,\"IsUSState\":0,\"IsYesNo\":0,\"LookupId\":-1,\"MaxChars\":50,\"NumericOnly\":0,\"Optional\":0,\"QuestionOfficial\":\"What is your gender?\",\"SectionName\":\"Main\",\"SortOrder\":100,\"ChoiceListName\":\"\",\"SortBySortOrder\":0}]}"}{\"Spouse1Name\": \"Karen Johnson\",\"Spouse2Name\": \"Taylor Johnson\",\"NumberOfKids\":2,\"Children\":[{\"FirstName\":\"Angela\",\"LastName\":\"Johnson\" }, { \"FirstName\":\"Anthony\", \"LastName\":\"Johnson\" } ] }

Hey, did you find a solution for this? I'm facing the same thing.

Until you find a perfect solution, you might want to parse the JSON out of the output to get rid of any explanation text, and then use JSON Schema validation.
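
For example, here is a rough sketch of that idea in Python (FAMILY_SCHEMA and extract_and_validate are just placeholder names I made up, and it assumes the third-party jsonschema package): keep only the outermost {...} from the reply, then validate it so renamed or missing keys are rejected.

import json
from jsonschema import validate  # pip install jsonschema; validate raises ValidationError

# Schema for the family format above; additionalProperties rejects renamed/extra keys.
FAMILY_SCHEMA = {
    "type": "object",
    "properties": {
        "Spouse1Name": {"type": "string"},
        "Spouse2Name": {"type": "string"},
        "NumberOfKids": {"type": "integer"},
        "Children": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"FirstName": {"type": "string"}, "LastName": {"type": "string"}},
                "required": ["FirstName", "LastName"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["Spouse1Name", "Spouse2Name", "NumberOfKids", "Children"],
    "additionalProperties": False,
}

def extract_and_validate(model_output):
    # Keep only the outermost {...} span to drop any surrounding explanation text.
    start, end = model_output.find("{"), model_output.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in the model output")
    obj = json.loads(model_output[start:end + 1])
    validate(instance=obj, schema=FAMILY_SCHEMA)  # raises jsonschema.ValidationError on bad keys
    return obj

If validation fails, you can retry the request or re-prompt with the error message.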

I did some testing using fine-tuning and it seems to work. I had to put a pause on my project because I need more tokens for what I want to do. I’m waiting to be approved for GPT-4 and they still have me on the wait list.

For fine-tuning, you just have to give examples (at least 200).
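
In case it helps, here is a minimal sketch of building such a training file (this assumes the chat-style "messages" JSONL format; check the fine-tuning docs for the exact schema your model expects, and note the system prompt text is just an illustration):

import json

SYSTEM = "Output only JSON using the keys Spouse1Name, Spouse2Name, NumberOfKids, Children."

examples = [
    (
        "Build a json string about the Johnson family with 2 kids. "
        "The parents are Karen and Taylor. They have 2 kids who are Angela and Anthony.",
        {
            "Spouse1Name": "Karen Johnson",
            "Spouse2Name": "Taylor Johnson",
            "NumberOfKids": 2,
            "Children": [
                {"FirstName": "Angela", "LastName": "Johnson"},
                {"FirstName": "Anthony", "LastName": "Johnson"},
            ],
        },
    ),
]

# Write one JSON object per line, with the target JSON as the assistant message.
with open("family_finetune.jsonl", "w") as f:
    for prompt, family in examples:
        line = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": json.dumps(family)},
            ]
        }
        f.write(json.dumps(line) + "\n")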

Actually, asking for a JSON answer always brings a lot of problems in our use case.

We switched to XML instead, asking for XML answers with tags and giving it the tag names.

Works perfectly
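
For anyone curious, a rough sketch of that approach (the tag names here are just assumptions made to mirror the JSON example above): tell the model the exact tags to use, then parse the reply with xml.etree.ElementTree.

import xml.etree.ElementTree as ET

# Example reply when the prompt lists the exact tags to use
# (<Family>, <Spouse1Name>, <Spouse2Name>, <Child> are assumed tag names).
reply = """
<Family>
  <Spouse1Name>Karen Johnson</Spouse1Name>
  <Spouse2Name>Taylor Johnson</Spouse2Name>
  <Child><FirstName>Angela</FirstName><LastName>Johnson</LastName></Child>
  <Child><FirstName>Anthony</FirstName><LastName>Johnson</LastName></Child>
</Family>
"""

root = ET.fromstring(reply.strip())
family = {
    "Spouse1Name": root.findtext("Spouse1Name"),
    "Spouse2Name": root.findtext("Spouse2Name"),
    "Children": [
        {"FirstName": c.findtext("FirstName"), "LastName": c.findtext("LastName")}
        for c in root.findall("Child")
    ],
}
family["NumberOfKids"] = len(family["Children"])
print(family)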

I find that the more "punctuation" there is in the format, the worse the model performs. It's not very good at consistently generating syntactically valid JSON.

It is pretty good at Markdown, though. If you ask it to generate Markdown (ideally giving it an example of the format), it can generate something that is reasonably easy to parse into other structured formats.
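
As an illustration (the heading/bullet layout is just one assumed format you could show the model as an example), a Markdown reply like this is straightforward to parse back into the family structure:

import re

# Assumed Markdown layout shown to the model as a one-shot example.
reply = """
## Family: Johnson
- Spouse1Name: Karen Johnson
- Spouse2Name: Taylor Johnson
- Child: Angela Johnson
- Child: Anthony Johnson
"""

family = {"Children": []}
for key, value in re.findall(r"^- (\w+): (.+)$", reply, flags=re.MULTILINE):
    if key == "Child":
        first, _, last = value.partition(" ")
        family["Children"].append({"FirstName": first, "LastName": last})
    else:
        family[key] = value
family["NumberOfKids"] = len(family["Children"])
print(family)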