Structured output with the Responses API returns tons of \n\n\n\n

If I want to include the content after “\n\n \t\t” as part of the input, would the above method still resolve the issue? In my case, I need to save both the part before and after “\n\n \t\t” from the model’s response as input.

An AI model that goes into a repeating stream of linefeeds or tabs likely won't terminate in any useful way. There is no frequency_penalty parameter to break this on the Responses API endpoint, so all you can do there is specify a high temperature and hope that lower-ranked tokens get sampled.

The stop parameter, offered only on Chat Completions, lets you discourage these constant token loops after they have been produced for a while: it terminates generation as soon as a recognized string is produced, and the match can span token boundaries. Since token-by-token production is halted, there is no more text after the stop sequence to provide to you. Stopping was the whole point.

The stop sequence string, and any fragment after it, is also removed, so you do not get all the tokens the AI model generated. That prevents some straightforward uses, like detecting the end of a JSON object, or finding which of the many sequences in your list was matched.
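To make that behavior concrete, here is a local emulation (ordinary Python, not the API itself) of how a stop sequence truncates finished text: the output is cut at the earliest match, and the matched string plus everything after it is discarded, which is also why you cannot tell which stop string fired.

```python
def apply_stop(text: str, stops: list[str]) -> str:
    """Emulate Chat Completions `stop` handling on a finished string:
    truncate at the earliest match and drop the matched sequence itself."""
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]

# The JSON body survives, but everything from the stop match onward is gone.
print(apply_stop('{"a": 1}\n\n\n\n\n\ntrailing junk', ["\n\n\n\n", "\t\t\t\t"]))
# → {"a": 1}
```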

You would need to use max_completion_tokens to set an output budget and take everything the AI wants to write up to your limit, if you still find any use for output with this persistent fault. (You could also emulate your own stop sequence on a streaming API response, detecting it within the output and closing the connection only when you choose.)
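A minimal sketch of that emulated-stop idea: accumulate streamed chunks and stop consuming once the output degenerates into a long run of identical whitespace. The detector below is plain Python; only the comment at the end assumes the OpenAI streaming API, and the `max_repeat` threshold is an arbitrary choice.

```python
def collect_until_runaway(deltas, max_repeat: int = 20) -> str:
    """Accumulate streamed text chunks, stopping once `max_repeat` identical
    whitespace characters (newlines or tabs) arrive in a row; the runaway
    run itself is stripped from the result."""
    out = []
    prev, run = "", 0
    for delta in deltas:
        for ch in delta:
            if ch in "\n\t":
                run = run + 1 if ch == prev else 1
            else:
                run = 0
            prev = ch
            out.append(ch)
            if run >= max_repeat:
                return "".join(out)[:-run]  # drop the runaway whitespace
    return "".join(out)

# With the real API you would feed it chunk.choices[0].delta.content from a
# stream=True request and close the connection once this function returns.
print(collect_until_runaway(['{"a": 1}', "\t" * 1000]))  # → {"a": 1}
```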

I wonder why @OpenAI_Support has not addressed the issue yet since it’s been quite some time already.

In browser-use we have exactly the same problem. 100k tokens of \t after a valid structured output.

With gpt-4.1-mini we could finally fix it by setting frequency_penalty to 0.2. This fixed all problems for us.
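For reference, a sketch of that workaround on Chat Completions (the prompt is a placeholder; gpt-5 series models reject the parameter, which is the problem described below):

```python
# frequency_penalty > 0 subtracts a penalty proportional to how often a
# token has already been emitted, so an ever-growing run of "\t" becomes
# progressively less likely to continue.
request_kwargs = {
    "model": "gpt-4.1-mini",
    "frequency_penalty": 0.2,  # the value that fixed it for browser-use
    "messages": [{"role": "user", "content": "Extract the fields as JSON."}],
}
# response = client.chat.completions.create(**request_kwargs)
```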

Now the big problem: gpt-5-mini no longer supports this parameter. So in 10% of cases the model response times out after 90 seconds and thousands of generated \t tokens.

@OpenAI_Support can you please add frequency_penalty back to the gpt-5 series?

What do you suggest? The reasoning models are not usable with complex nested structured output because of this!


same problem here!

lots of \n or \t in the reply

Solved: “Invalid \uXXXX escape” errors in Structured Outputs - The Non-Breaking Space Problem

Problem

When using OpenAI’s Structured Outputs (JSON schema mode) with gpt-4o-mini, I encountered JSON parsing errors when processing large document sets:

- JSONDecodeError: Invalid \uXXXX escape: line 1 column 24597 (char 24596)
- finish_reason='length' (hitting the max_completion_tokens limit)
- LLM response filled with thousands of null bytes (\u0000)

The errors appeared consistently when processing PDFs with large prompts (615KB, 315 snippets). The LLM would hit the token limit and the response would be truncated mid-escape-sequence.
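The parse failure is easy to reproduce locally: hand json.loads a string cut off in the middle of a \uXXXX escape and you get this exact error class (the example string is synthetic).

```python
import json

# Simulate model output truncated mid-escape by the token limit:
# the final escape sequence "\u00" is missing its last two hex digits.
raw = '{"title": "Report \\u00'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print(f"parse failed: {exc}")  # message begins "Invalid \uXXXX escape"
```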

Root Cause

Non-Breaking Space (U+00A0)

This single character caused the LLM to generate repetitive null bytes when processing large prompts. The null bytes filled the max_completion_tokens limit, causing truncation and malformed JSON.

Non-breaking space is extremely common in PDFs (used for spacing/formatting) but invisible to human review. When accumulated across hundreds of snippets, it triggers unusual LLM behavior.
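You can see how invisible it is with plain Python, no dependencies:

```python
clean = "total amount"       # regular space U+0020
dirty = "total\u00a0amount"  # non-breaking space U+00A0, e.g. from a PDF

print(clean, "|", dirty)     # the two render identically in most fonts
print(clean == dirty)        # → False
print("\u00a0" in dirty)     # → True
```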

Solution

Replace non-breaking spaces (and zero-width characters) with regular spaces BEFORE sending to the LLM. This preserves word boundaries while preventing the issue.

Key principle: Clean at source (parse time), not at consumption.

Python Implementation

import re

# Characters that should be replaced with spaces to preserve word boundaries
_REPLACE_WITH_SPACE = re.compile(r'[\u00A0\u200B\u200C\u200D\u2060]')

# Formatting-only invisible characters that can be safely removed
_FORMATTING_CHARS = re.compile(r'[\uFEFF\u200E\u200F\u202A-\u202E]')

def sanitize_for_json(text: str) -> str:
    """Remove invisible characters that cause LLM processing issues.

    Replaces non-breaking spaces and zero-width characters with regular spaces
    to preserve word boundaries. Removes formatting-only characters (BOM, bidi
    controls) that don't affect word boundaries.
    """
    if not text:
        return text

    # Replace invisible spaces with regular spaces to preserve word boundaries
    text = _REPLACE_WITH_SPACE.sub(' ', text)

    # Remove formatting-only invisible characters
    text = _FORMATTING_CHARS.sub('', text)

    return text

Character Details

Replaced with space (preserve word boundaries):
- \u00A0 - Non-breaking space (THE KEY FIX - causes LLM null byte generation)
- \u200B - Zero width space
- \u200C - Zero width non-joiner
- \u200D - Zero width joiner
- \u2060 - Word joiner

Removed (formatting only):
- \uFEFF - BOM / Zero width no-break space
- \u200E - Left-to-right mark
- \u200F - Right-to-left mark
- \u202A-\u202E - Bidi embedding/override controls

Usage

from openai import OpenAI

# Clean text BEFORE sending to LLM
cleaned_text = sanitize_for_json(raw_document_text)

# Now use cleaned_text in your prompt
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract structured data..."},
        {"role": "user", "content": cleaned_text}
    ],
    response_format={"type": "json_schema", "json_schema": {...}}
)

Results

After implementing non-breaking space replacement:
- ✅ 100% success rate across all documents
- ✅ Zero JSON parsing errors
- ✅ No null byte generation
- ✅ Preserves all visible Unicode and word boundaries

Key Insights

1. Volume matters: The issue only appears with large prompts. Small prompts work fine even with non-breaking spaces.
2. LLM behavior: The LLM doesn't fail directly - it generates null bytes when encountering certain Unicode patterns in large contexts, which fills the token budget.
3. Word boundaries matter: Replace with space, don't just remove - preserves text readability and prevents words from concatenating.
4. Smart quotes and other visible punctuation are NOT problematic - only non-breaking space causes this issue.

When This Applies

- PDF parsing, HTML scraping, OCR output
- Large documents (100+ pages) with many snippets (300+ excerpts)
- Any LLM at max_completion_tokens limits
- Non-breaking space is pervasive in PDFs but invisible during review

Hope this helps others encountering similar issues!


Hey niels, I am in a situation where I am getting lots of \n and \t in the output, not null bytes. Unfortunately this doesn’t seem to fix my problem.

I tried sanitizing my inputs with your method, but it made no difference in the results.