Can a model be trained to generate JSON? (If so, is my training data set up correctly?)

I want to train my model to generate a JSON string that follows strict guidelines. In the examples below, when the JSON is generated from the prompt, I want it to look like the desired output...

Prompt: Build a json string about the Johnson family with 2 kids. The parents are Karen and Taylor. They have 2 kids who are Angela and Anthony.
Desired Output:
{
   "Spouse1Name": "Karen Johnson",
   "Spouse2Name": "Taylor Johnson",
   "NumberOfKids":2,
   "Children":[
      {
         "FirstName":"Angela",
         "LastName":"Johnson"
      },
      {
         "FirstName":"Anthony",
         "LastName":"Johnson"
      }
   ]
}

Prompt: Build a json string with a family of 4. The parents' names are Jake and Kate Bua. They have 1 child named Lizzy.
Desired Output:
{
   "Spouse1Name": "Kate Bua",
   "Spouse2Name": "John Bua",
   "NumberOfKids":1,
   "Children":[
      {
         "FirstName":"Lizzy",
         "LastName":"Bua"
      }
   ]
}

I have tried to guide ChatGPT to do this and it does it fine, but sometimes it takes the liberty of changing the property names.
Bad output:
{
   "Spouse1": "Kate Bua",
   "Spouse2": "John Bua",
   "NumKids":1,
   "Children":[
      {
         "FirstName":"Lizzy",
         "LastName":"Bua"
      }
   ]
}

So what I want to know is: when I train my own model, will I be able to do this, and will creating prompt/output pairs like the examples below help? I just want to know before I spend time creating the training data.
I do plan to come up with hundreds of examples; these are just lazy ones for the sake of my inquiry...
{"prompt": "Build a json string about the Johnson family with 2 kids. The parents are Karen and Taylor. They have 2 kids who are Angela and Anthony.", "output": "{\"HideSections\":1,\"Sections\":[{\"SortOrder\":10,\"SectionTitle\":\"Main\"}],\"Questions\":[{\"ChoiceList\":\"[]\",\"DecimalsAttr\":\"\",\"FieldType\":\"checkBxSingle\",\"Formatting\":\"\",\"ID\":\"23a797ab-e70a-11ed-ae5a-00155d007a56\",\"IntegerOnly\":0,\"IsEmail\":0,\"IsUSState\":0,\"IsYesNo\":0,\"LookupId\":-1,\"MaxChars\":50,\"NumericOnly\":0,\"Optional\":0,\"QuestionOfficial\":\"What is your gender?\",\"SectionName\":\"Main\",\"SortOrder\":100,\"ChoiceListName\":\"\",\"SortBySortOrder\":0}]}"}{\"Spouse1Name\": \"Karen Johnson\",\"Spouse2Name\": \"Taylor Johnson\",\"NumberOfKids\":2,\"Children\":[{\"FirstName\":\"Angela\",\"LastName\":\"Johnson\" }, { \"FirstName\":\"Anthony\", \"LastName\":\"Johnson\" } ] }
{"prompt": "Build a json string about the Bua family with 2 kids. The parents are Karen and Taylor. They have 2 kids who are Angela and Anthony.", "output": "{\"HideSections\":1,\"Sections\":[{\"SortOrder\":10,\"SectionTitle\":\"Main\"}],\"Questions\":[{\"ChoiceList\":\"[]\",\"DecimalsAttr\":\"\",\"FieldType\":\"checkBxSingle\",\"Formatting\":\"\",\"ID\":\"23a797ab-e70a-11ed-ae5a-00155d007a56\",\"IntegerOnly\":0,\"IsEmail\":0,\"IsUSState\":0,\"IsYesNo\":0,\"LookupId\":-1,\"MaxChars\":50,\"NumericOnly\":0,\"Optional\":0,\"QuestionOfficial\":\"What is your gender?\",\"SectionName\":\"Main\",\"SortOrder\":100,\"ChoiceListName\":\"\",\"SortBySortOrder\":0}]}"}{\"Spouse1Name\": \"Karen Johnson\",\"Spouse2Name\": \"Taylor Johnson\",\"NumberOfKids\":2,\"Children\":[{\"FirstName\":\"Angela\",\"LastName\":\"Johnson\" }, { \"FirstName\":\"Anthony\", \"LastName\":\"Johnson\" } ] }

Hey, did you find a solution for this? I'm facing the same thing.

Until you find a perfect solution, you might want to parse the JSON out of the output to get rid of any explanation text, and then use JSON Schema validation.
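
For example, here is a rough sketch of that idea in Python (FAMILY_SCHEMA and extract_and_validate are just placeholder names I made up, and it assumes the third-party jsonschema package): keep only the outermost {...} from the reply, then validate it so renamed or missing keys are rejected.

import json
from jsonschema import validate  # pip install jsonschema; validate raises ValidationError

# Schema for the family format above; additionalProperties rejects renamed/extra keys.
FAMILY_SCHEMA = {
    "type": "object",
    "properties": {
        "Spouse1Name": {"type": "string"},
        "Spouse2Name": {"type": "string"},
        "NumberOfKids": {"type": "integer"},
        "Children": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"FirstName": {"type": "string"}, "LastName": {"type": "string"}},
                "required": ["FirstName", "LastName"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["Spouse1Name", "Spouse2Name", "NumberOfKids", "Children"],
    "additionalProperties": False,
}

def extract_and_validate(model_output):
    # Keep only the outermost {...} span to drop any surrounding explanation text.
    start, end = model_output.find("{"), model_output.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in the model output")
    obj = json.loads(model_output[start:end + 1])
    validate(instance=obj, schema=FAMILY_SCHEMA)  # raises jsonschema.ValidationError on bad keys
    return obj

If validation fails, you can retry the request or re-prompt with the error message.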

I did some testing using fine-tuning and it seems to work. I had to put a pause on my project because I need more tokens for what I want to do. I’m waiting to be approved for GPT-4 and they still have me on the wait list.

For fine-tuning, you just have to give examples (at least 200).
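
In case it helps, here is a minimal sketch of building such a training file (this assumes the chat-style "messages" JSONL format; check the fine-tuning docs for the exact schema your model expects, and note the system prompt text is just an illustration):

import json

SYSTEM = "Output only JSON using the keys Spouse1Name, Spouse2Name, NumberOfKids, Children."

examples = [
    (
        "Build a json string about the Johnson family with 2 kids. "
        "The parents are Karen and Taylor. They have 2 kids who are Angela and Anthony.",
        {
            "Spouse1Name": "Karen Johnson",
            "Spouse2Name": "Taylor Johnson",
            "NumberOfKids": 2,
            "Children": [
                {"FirstName": "Angela", "LastName": "Johnson"},
                {"FirstName": "Anthony", "LastName": "Johnson"},
            ],
        },
    ),
]

# Write one JSON object per line, with the target JSON as the assistant message.
with open("family_finetune.jsonl", "w") as f:
    for prompt, family in examples:
        line = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": json.dumps(family)},
            ]
        }
        f.write(json.dumps(line) + "\n")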

Actually, asking for a JSON answer always brings a lot of problems in our use case.

We switched to XML instead, asking for XML answers with tags and giving it the tag names.

Works perfectly
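
For anyone curious, a rough sketch of that approach (the tag names here are just assumptions made to mirror the JSON example above): tell the model the exact tags to use, then parse the reply with xml.etree.ElementTree.

import xml.etree.ElementTree as ET

# Example reply when the prompt lists the exact tags to use
# (<Family>, <Spouse1Name>, <Spouse2Name>, <Child> are assumed tag names).
reply = """
<Family>
  <Spouse1Name>Karen Johnson</Spouse1Name>
  <Spouse2Name>Taylor Johnson</Spouse2Name>
  <Child><FirstName>Angela</FirstName><LastName>Johnson</LastName></Child>
  <Child><FirstName>Anthony</FirstName><LastName>Johnson</LastName></Child>
</Family>
"""

root = ET.fromstring(reply.strip())
family = {
    "Spouse1Name": root.findtext("Spouse1Name"),
    "Spouse2Name": root.findtext("Spouse2Name"),
    "Children": [
        {"FirstName": c.findtext("FirstName"), "LastName": c.findtext("LastName")}
        for c in root.findall("Child")
    ],
}
family["NumberOfKids"] = len(family["Children"])
print(family)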

I find that the more "punctuation" there is in the format, the worse the model performs. It's not very good at consistently generating syntactically valid JSON.

It is pretty good at Markdown, though. If you ask it to generate Markdown (ideally giving it an example of the format), it can generate something that is reasonably easy to parse into other structured formats.
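
As an illustration (the heading/bullet layout is just one assumed format you could show the model as an example), a Markdown reply like this is straightforward to parse back into the family structure:

import re

# Assumed Markdown layout shown to the model as a one-shot example.
reply = """
## Family: Johnson
- Spouse1Name: Karen Johnson
- Spouse2Name: Taylor Johnson
- Child: Angela Johnson
- Child: Anthony Johnson
"""

family = {"Children": []}
for key, value in re.findall(r"^- (\w+): (.+)$", reply, flags=re.MULTILINE):
    if key == "Child":
        first, _, last = value.partition(" ")
        family["Children"].append({"FirstName": first, "LastName": last})
    else:
        family[key] = value
family["NumberOfKids"] = len(family["Children"])
print(family)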