How do I ensure that JSON mode properly escapes quotation marks?

chouchou · February 8, 2024, 3:12pm

When requesting json_object completions, I am encountering a problem where the generated text cuts off randomly mid-sentence. When this happens, the completion reason is stop, so this is not a case of exhausting allowed completion tokens.

The pattern I’ve noticed is that this consistently happens where you could expect to see a quotation mark (") as the next character. This leads me to the assumption that quotation marks are, intermittently, not being properly escaped in the generated text, thus prematurely ending the completion for the current ‘value’ in the current name-value pair in the JSON object.

I should emphasize this is intermittent, as this problem only affects some responses – others that contain quotation marks complete without issue.

To illustrate the issue, I might occasionally receive a completion like:

{
  "content": "It was Benjamin Franklin who famously said, "
}

Whereas the expected completion would be more like:

{
  "content": "It was Benjamin Franklin who famously said, \"Three can keep a secret, if two of them are dead.\""
}

Is there a way to avoid this happening?

idonotwritecode · February 8, 2024, 3:54pm

So there are a few different ways to do this, and I encourage you to try them to see which one works best for your workflow.

Option 1: You can take the output and then pass it to another GPT-4/GPT-3.5 API call, where the prompt is ‘Here is a JSON output. Please ensure that the JSON output is in correct JSON format.’

Option 2: You could give it a JSON structure as an example in your prompt, so that it always follows that format. This only works if all your API outputs are of the same format.

Option 3: You could add a condition to your Prompt, on how to handle such a scenario.

For example:
‘Make sure the following is considered as JSON best practice before generating your response:
within JSON response, escape double quotes by…’
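As a rough sketch of Options 2 and 3 combined, one could generate the example with Python's `json.dumps`, so the escaping shown to the model is guaranteed to be correct. The prompt wording, the sample sentence, and the `build_system_prompt` name are illustrative assumptions, not a tested recipe:

```python
import json


def build_system_prompt(example: dict) -> str:
    """Embed a correctly escaped JSON example in the instructions.

    Hypothetical helper: the wording below is only an illustration.
    """
    # json.dumps escapes inner double quotes with a backslash for us.
    example_json = json.dumps(example, indent=2)
    return (
        "Respond with a JSON object in the same format as this example.\n"
        "Escape any double quotes inside string values with a backslash.\n\n"
        + example_json
    )


# The sample value deliberately contains quotation marks.
prompt = build_system_prompt({"content": 'She replied, "Not today."'})
```

The resulting `prompt` string would then go into the system message of the request.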

Let us know if this works.

chouchou · February 8, 2024, 4:43pm

The issue is not with getting valid JSON (JSON best practices/correct JSON format/etc.). 100% of completions return a valid JSON object, and this object consistently validates against any schemas (example JSON structures) I provide in the prompts. (For what it’s worth, I’m using Python and validating all returned completions with a corresponding Pydantic model.)

Rather, the issue appears to be that, when generating text, GPT will occasionally generate an unescaped quotation mark " (I believe the correct term is a ‘ditto’ mark…), thus GPT prematurely terminates the string and either moves onto the next name-value pair, or closes the JSON object.
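To make the failure mode concrete, here is a minimal sketch using only Python's standard `json` module. The strings are taken from my example above; `looks_truncated` is a hypothetical heuristic of my own, not something the API provides, and it would certainly miss some cases:

```python
import json

full_text = (
    'It was Benjamin Franklin who famously said, '
    '"Three can keep a secret, if two of them are dead."'
)

# A correct serializer escapes the inner quotation marks with a backslash.
escaped = json.dumps({"content": full_text})

# The cut-off completion is still valid JSON, so parsing (and schema
# validation) succeeds even though the text is incomplete.
truncated = json.loads(
    '{"content": "It was Benjamin Franklin who famously said, "}'
)


def looks_truncated(value: str) -> bool:
    """Hypothetical heuristic: flag strings that end on a comma or colon."""
    return value.rstrip().endswith((",", ":"))
```

A check like this could run after Pydantic validation to decide whether to retry the request.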

With that said, option #3 – providing explicit instructions – is one avenue I might explore. (Although it does feel a bit fragile to ask GPT to “Ensure any ditto marks you use within a string are escaped with a backslash.”… my general experience with using prompts to alter the probability of a particular word/character is not great…).

idonotwritecode · February 8, 2024, 5:20pm

I see your point.

Some of my instructions contain minute details such as this. I usually experiment with a few techniques, and then decide which one to use. Unlike code, with LLMs the biggest lesson is that you can do the same things in 5 different ways, and it’s mostly about optimizing the prompt or the technique.

jwatte · February 8, 2024, 5:33pm

I’ve found that the model is better at generating YAML than JSON, and even better at generating Markdown than YAML.
So, if you can shove your data format into Markdown, you might want to use that as the transfer format.

Separately: If you ask a human being to generate a properly quoted JSON file, you may also not get 100% correctness on the first try … These models don’t work the way traditional programming does, and have some drawbacks that sometimes seem very surprising, given that there’s a computer underneath.

chouchou · February 9, 2024, 10:58am

This makes perfect sense. So much so that I’m having one of those “Why didn’t I think of that?” moments.

In an attempt to turn this into a more generalizable principle, I guess the advice here is to prefer a data structure that a) minimizes the use of reserved characters, and/or b) uses reserved characters/patterns that are unlikely to appear in any names or values. Thus, I suspect, the most crude markdown possible (e.g., simply relying on headings for keys) is preferable to something like, let’s say, MSON.

Anyway, after playing around with a few test cases, it seems GPT is more likely to follow a markdown template reliably than it is to escape reserved characters reliably, so I’m marking this as the solution.
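For anyone landing here later, this is a minimal sketch of the "headings as keys" idea, assuming the model is asked to put each key on its own `# heading` line with the value underneath. The function name and the sample keys are my own illustration:

```python
import re


def parse_markdown_sections(text: str) -> dict:
    """Map each '# Heading' line to the body text that follows it."""
    sections = {}
    key = None
    body = []
    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            # A new heading closes the previous section, if any.
            if key is not None:
                sections[key] = "\n".join(body).strip()
            key = match.group(1)
            body = []
        elif key is not None:
            body.append(line)
    if key is not None:
        sections[key] = "\n".join(body).strip()
    return sections


sample = """# content
It was Benjamin Franklin who famously said, "Three can keep a secret, if two of them are dead."

# language
English
"""

result = parse_markdown_sections(sample)
```

Quotation marks need no escaping here, since only a line starting with `#` is treated as structure. The obvious caveat is that a value containing such a line would be split into a new key.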
