Strategies for augmenting foundation models with JSON Based grammar for code generation

I am spiking on creating LLM based applications around a framework I am working on. I would like to generate a JSON document based on a grammar. The grammar hasn’t been formalised in any notation yet but there’s a JSON schema and a public facing documentation available.

The main obstacle I am facing is that the current models do not possess any understanding of the JSON grammar, even though there’s a publicly available documentation of it (it seems that neither GPT-3.5 nor GPT-4 have been trained on those data).

From my understanding there are several approaches to augment the context of LLMs:

  1. Augmenting prompts with in-context learning
  2. Using embeddings and information retrieval
  3. Fine-tune a base model
  4. Train your custom model

Note: If you know other alternatives feel free to comment on this post, I would be very curious to learn about alternatives.

In the grammar, there are quite some permutations possible so I am wondering if few-shot prompting is a good solution for such use case. I am afraid that injecting the grammar context in the prompt is going to end up clearing up the context window fairly quickly (even though it seems that this might not be a problem in a near future).

Just curious, what would be the most adequate approach in term of:

  • quality of the result
  • API processing / response time
  • cost effectiveness

At this point, the focus is about getting reliable results that are qualitative.

Note, i’ve tried function callings but it seem that they are designed to cover use cases when your function doesn’t take a fairly complex object. At least, it didn’t work against the current JSON schema that we have erroring out on some enums. It’s also unclear which Draft version of the JSON schema they do support.

1 Like

Forget Json

This is a terrible task for llm models

A small error will invalidate the whole Jason answer

Go to xml

Yes … Longer answers

But you got exactly what u want

openAi I think did a big mistake adding function that rely on Json

Of course we all want it

But for now the easy way is xml

Works like a charm in production

1 Like

Thanks for your replay @tlunati, unfortunately I can’t switch the input format for the compiler.

I understand your point but I need to come up with a reliable strategy for that.

Interested to know if you have discovered any more insights on this. Have you tested out fine-tuning for this use case yet?

Lower temperature and top_p but will reduce creativity.

We are still experimenting with different strategies but everything is still in research stage for now. Function calls show promising results and we are exploring TypeChat for now.

We didn’t test fine-tuning as we are waiting for fine-tuning to be available for the ability to fine-tune the new models.

I tend to concur.
Been trying to fine tune an LLM on a JSON-emitting grammar using JSON.