Best way to get GPT to output a DSL


I’m trying to get GPT (either 3.5 or 4) to “translate” a paragraph of English text into a DSL (domain-specific language), which in my case is XML with custom tags. What I have is:

  • 20,000 examples of this DSL, each a 250-500 token text segment
  • 500 pages of documentation describing this DSL. It basically covers the correct syntax: which tags are possible, which attributes are allowed on those tags, and an English paragraph describing what each tag means
  • 1,000 examples of an English paragraph paired with the corresponding XML for that paragraph

My first attempt was a few-shot approach with the 1,000 examples: I use embeddings to find the 5 most relevant examples out of the 1,000, put them in the prompt, and ask GPT to “Answer in a consistent style.”
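
To make that setup concrete, here is a minimal sketch of the retrieval and prompt-building step. It assumes the embeddings have already been computed elsewhere and are passed in as plain lists of floats. The function names, the prompt layout ("English:" / "DSL:"), and the tags in the toy data are all hypothetical, made up for illustration, and not my real DSL or exact prompt.

```python
import math
from typing import List, Sequence, Tuple

# One paired example: (English paragraph, DSL XML).
Example = Tuple[str, str]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_examples(
    query_vec: Sequence[float],
    example_vecs: List[Sequence[float]],
    examples: List[Example],
    k: int = 5,
) -> List[Example]:
    """Return the k examples whose embeddings are closest to the query."""
    scored = sorted(
        zip(example_vecs, examples),
        key=lambda pair: cosine_similarity(query_vec, pair[0]),
        reverse=True,
    )
    return [example for _, example in scored[:k]]


def build_prompt(paragraph: str, shots: List[Example]) -> str:
    """Assemble a few-shot prompt: instruction, retrieved pairs, new input."""
    parts = [
        "Translate the English paragraph into the DSL. "
        "Answer in a consistent style.",
        "",
    ]
    for english, xml in shots:
        parts += [f"English: {english}", f"DSL: {xml}", ""]
    parts += [f"English: {paragraph}", "DSL:"]
    return "\n".join(parts)


# Toy data: hypothetical tags and hand-made 2-d "embeddings".
examples = [
    ("Show a red button.", '<button color="red"/>'),
    ("Play a sound.", "<sound/>"),
    ("Show a blue button.", '<button color="blue"/>'),
]
example_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

shots = top_k_examples([1.0, 0.05], example_vecs, examples, k=2)
prompt = build_prompt("Show a green button.", shots)
```

In the real setup the vectors come from an embedding model and `k` is 5; the resulting prompt string is what gets sent to GPT.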

This works okay (maybe around 60% accuracy), but I feel like I’m leaving a lot on the table because I’m not using the 20,000 examples or the 500 pages of documentation. Is there a way I could use either of those that would help?

I thought about fine-tuning on the 20,000 examples, but I can’t see how I would do that, since I don’t have the English counterparts for those examples.