How do I ensure that JSON mode properly escapes quotation marks?

When requesting json_object completions, I am encountering a problem where the generated text intermittently cuts off mid-sentence. When this happens, the finish reason is stop, so this is not a case of exhausting the allowed completion tokens.

The pattern I’ve noticed is that this consistently happens where you could expect to see a quotation mark (") as the next character. This leads me to the assumption that quotation marks are, intermittently, not being properly escaped in the generated text, thus prematurely ending the completion for the current ‘value’ in the current name-value pair in the JSON object.

I should emphasize that this is intermittent: the problem only affects some responses, while others that contain quotation marks complete without issue.

To illustrate the issue, I might occasionally receive a completion like:

{
  "content": "It was Benjamin Franklin who famously said, "
}

Whereas the expected completion would be more like:

{
  "content": "It was Benjamin Franklin who famously said, \"Three can keep a secret, if two of them are dead.\""
}

Is there a way to avoid this happening?


So there are a few different ways to do this, and I encourage you to try them to see which one works best for your workflow.

Option 1: You can take the output and pass it to another GPT-4/GPT-3.5 API call, where the prompt is ‘Here is a JSON output. Please ensure that the JSON output is in correct JSON format.’

Option 2: You could give it a JSON structure as an example in your prompt, so that it always follows that format. This only works if all your API outputs are of the same format.

Option 3: You could add a condition to your Prompt, on how to handle such a scenario.

For example:
‘Make sure the following is considered as JSON best practice before generating your response:
within JSON response, escape double quotes by…’

Let us know if this works.

The issue is not with getting valid JSON (JSON best practices/correct JSON format/etc.). 100% of completions return a valid JSON object, and this object consistently validates against any schemas (example JSON structures) I provide in the prompts. (For what it’s worth, I’m using Python and validating all returned completions with a corresponding Pydantic model.)
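To make the point concrete, here is a minimal sketch (standard library only, reusing the two example completions from my original post) of why validation alone does not catch the problem: the truncated completion is still perfectly valid JSON, it just carries a shorter string.

```python
import json

# The truncated completion: the string simply ends where a quotation mark was expected.
truncated = '{"content": "It was Benjamin Franklin who famously said, "}'

# The expected completion, with the inner quotation marks escaped as \".
# A raw string keeps the backslashes exactly as they would appear in the JSON text.
expected = r'{"content": "It was Benjamin Franklin who famously said, \"Three can keep a secret, if two of them are dead.\""}'

# Both parse without error, so a parser or schema check cannot tell them apart.
truncated_obj = json.loads(truncated)
expected_obj = json.loads(expected)

# For comparison, a JSON serializer escapes inner quotation marks automatically.
reencoded = json.dumps(expected_obj)

print(repr(truncated_obj["content"]))
print(reencoded)
```

Both objects have a single string field named content, so any schema or Pydantic model that only checks structure and types will accept either one.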

Rather, the issue appears to be that, when generating text, GPT will occasionally generate an unescaped quotation mark " (I believe the correct term is a ‘ditto’ mark…). GPT thereby prematurely terminates the string and either moves on to the next name-value pair or closes the JSON object.

With that said, option #3 – providing explicit instructions – is one avenue I might explore. (Although it does feel a bit fragile to ask GPT to “Ensure any ditto marks you use within a string are escaped with a backslash.”… my general experience with using prompts to alter the probability of a particular word/character is not great…).
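In the meantime, one stopgap would be to flag suspicious completions and retry them. The sketch below is only a heuristic built on my own assumption that a complete value ends in terminal punctuation or a closing quote; the names looks_truncated and suspicious_fields are hypothetical helpers, not part of any library, and it will produce false positives for values that legitimately end in other characters.

```python
import json

# Assumption: a complete value ends with one of these characters.
TERMINATORS = (".", "!", "?", '"', "'", ")", "”", "’")


def looks_truncated(value: str) -> bool:
    """Heuristic: non-empty text that does not end in terminal punctuation."""
    stripped = value.rstrip()
    return bool(stripped) and not stripped.endswith(TERMINATORS)


def suspicious_fields(raw_json: str) -> list[str]:
    """Return the top-level keys whose string values look cut off."""
    data = json.loads(raw_json)
    return [
        key
        for key, value in data.items()
        if isinstance(value, str) and looks_truncated(value)
    ]


truncated = '{"content": "It was Benjamin Franklin who famously said, "}'
complete = r'{"content": "He said, \"Three can keep a secret, if two of them are dead.\""}'

print(suspicious_fields(truncated))
print(suspicious_fields(complete))
```

A completion with any flagged field could then be regenerated, which treats the symptom rather than the cause.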

I see your point.

Some of my instructions contain minute details such as this. I usually experiment with a few techniques and then decide which one to use. Unlike with code, the biggest lesson with LLMs is that you can do the same thing in five different ways, and it’s mostly about optimizing the prompt or the technique.


I’ve found that the model is better at generating YAML than JSON, and even better at generating Markdown than YAML.
So, if you can shove your data format into Markdown, you might want to use that as the transfer format.

Separately: if you ask a human being to generate a properly quoted JSON file, you may also not get 100% correctness on the first try. These models don’t work the way traditional programming does, and have some drawbacks that sometimes seem very surprising, given that there’s a computer underneath 🙂


This makes perfect sense. So much so that I’m having one of those “Why didn’t I think of that?” moments.

In an attempt to turn this into a more generalizable principle, I guess the advice here is to prefer a data structure that a) minimizes the use of reserved characters, and/or b) uses reserved characters/patterns that are unlikely to appear in any names or values. Thus, I suspect, the crudest markdown possible (e.g., simply relying on headings for keys) is preferable to something like, let’s say, MSON.

Anyway, after playing around with a few test cases, it seems GPT is more likely to follow a markdown template reliably than it is to escape reserved characters reliably, so I’m marking this as the solution.
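For anyone who wants to try the same approach, here is a rough sketch of what "headings for keys" could look like on the parsing side. The template format (a level-one heading per key, followed by free text) and the function name parse_markdown_sections are my own assumptions for illustration, not a standard.

```python
def parse_markdown_sections(text: str) -> dict[str, str]:
    """Map each '# heading' to the text that follows it, up to the next heading."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("# "):
            # A heading line starts a new key.
            current = line[2:].strip()
            sections[current] = []
        elif current is not None:
            # Everything else belongs to the most recent key.
            sections[current].append(line)
    return {key: "\n".join(lines).strip() for key, lines in sections.items()}


# Hypothetical completion following the template; the quotation marks need no escaping.
completion = """# content
It was Benjamin Franklin who famously said, "Three can keep a secret, if two of them are dead."

# speaker
Benjamin Franklin
"""

parsed = parse_markdown_sections(completion)
print(parsed["content"])
print(parsed["speaker"])
```

The trade-off is that a value line beginning with "# " would be misread as a new key, so this only suits content where that pattern is unlikely to appear.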