Responses API with File Search: a few characters of text around citation indexes are wrongly deleted

Hi,

Issue

I’m using the Responses API with the file search tool, querying a vector store of many PDFs. When prompting for information that forces the model to query documents, it will query documents correctly, find the right information, compose an answer, and cite sources inline in the answer.

In that last step, wherever it cites a source inline, the text becomes garbled - clearly some part of the answer text is deleted when the citation is added. Unfortunately I can’t share the output for confidentiality reasons, but what I see is very similar to what’s reported in, for example, these Reddit posts (which I can’t link directly here):

  • the ChatGPTPro subreddit, post 1jplmka
  • the OpenAI subreddit, post 1jp7eeo

Because the wrongly deleted characters can be anything, this often messes up the formatting as well; for example, a citation inside a table can delete table markers and make the whole table impossible to render. This also seems to affect most models I’ve tried, including gpt-4, o3, o3-pro, and gpt-5.

I tried to reproduce with much simpler examples (dummy file with just 1 line of text), but couldn’t.

Code

This is the Python code I’m using to make the calls:

import openai

response = openai.responses.create(
    input=query,
    model="o3-pro",
    instructions=system_prompt,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": ['...'],
            "max_num_results": 50,
        },
        {"type": "web_search"},
    ],
    include=["file_search_call.results"],  # return the raw search results alongside the answer
)

and this is the code I’m using to confirm that the response text is garbled in every location where there is a citation index:

text = response.output[-1].content[0].text
for annotation in response.output[-1].content[0].annotations:
    print("\n\n\n ------")
    # show the 50 characters on either side of each citation index, with '~' marking the index itself
    print(text[annotation.index - 50:annotation.index] + '~' + text[annotation.index:annotation.index + 50])

(This outputs the 50 characters before and after each citation index, with a ~ character in the middle so I know where to look - confirming that the text is garbled at every citation location.)

I’ve also tried getting the raw response and it shows the same issue.

Possible cause

In another post on this forum, “How to handle file citations with the new Responses API?”, @stevecoffey, you mention:

The idea here is that instead of having to remove the citations from the text, which is annoying and error-prone, you now have the option of inserting them.

which suggests there may be some kind of post-processing step to remove these markers from the raw model output before returning. If that’s the case, I believe that may be the issue, and that removal step (possibly a regex replace) is not very robust and is removing more content than it should.
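To illustrate the kind of failure I have in mind (this is purely hypothetical - the real marker format and stripping logic aren’t visible to me), a greedy regex replace would swallow legitimate answer text between two citations, while a non-greedy one removes only the markers:

import re

# Made-up citation markers, just to show the greedy-vs-non-greedy difference
raw = "Revenue grew 12%«cite:doc1» while costs fell«cite:doc2» in Q3."

greedy = re.sub(r"«cite:.*»", "", raw)   # greedy: matches up to the LAST closing delimiter
lazy = re.sub(r"«cite:.*?»", "", raw)    # non-greedy: removes each marker on its own

print(greedy)  # "Revenue grew 12% in Q3."                      <- answer text wrongly deleted
print(lazy)    # "Revenue grew 12% while costs fell in Q3."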

Could you please help?


The Playground is not to be trusted: it rewrites and re-renders output, giving a false picture of what a developer actually needs to know. OpenAI shouldn’t be making an application, they should be making a development tool.

Get your API call’s print(response.output_text) to see if it is a true representation, with at most an extra space or two.

If that has artifacts, which I don’t expect, try this approach:

  • Use gpt-4.1;
  • use top_p: 0

Justification:

  • o3-pro generates a lot of its own thinking and is also willing to make continued use of tools to fill up its context before responding, putting distance between the information that needs to be recited without error and the final answer;
  • random sampling you can’t control allows lottery-like selection of tokens, even when verbatim recitation of input is indicated.

You can reduce the reasoning effort to keep the AI on track.
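
A minimal sketch of those two suggestions, reusing the query, system_prompt and vector store from the original call (the exact values are assumptions to tune, not a verified fix):

import openai

# gpt-4.1 with top_p 0: non-reasoning model, deterministic token selection
response = openai.responses.create(
    input=query,
    model="gpt-4.1",
    instructions=system_prompt,
    top_p=0,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": ['...'],
            "max_num_results": 50,
        },
    ],
    include=["file_search_call.results"],
)

# or, staying on a reasoning model, lower the reasoning effort instead
# (reasoning models do not accept top_p)
response = openai.responses.create(
    input=query,
    model="o3",
    instructions=system_prompt,
    reasoning={"effort": "low"},
    tools=[{"type": "file_search", "vector_store_ids": ['...'], "max_num_results": 50}],
)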


A particular format must be emitted by the AI:

"\ue200filecite\ue202turn3file2\ue201"`

This should be stripped by the API backend, turned into an annotation position, and never be visible or even knowable to you as the developer.

But clearly the Playground, API backend, or the AI is doing some mangling.

You can try to heighten the quality: “Your output is fully UTF-8 compliant, able to natively send to the recipient all code points for reproduction without escapement, even those Unicode glyphs beyond your understanding you are told to output.”

The Reddit post instead shows the AI going mad with web search citations.

Thanks for the response! To clarify some points:

  1. I’m not using the playground - the original post is already about what I see in the raw API responses, with either response.output_text or response.output[-1].content[0].text (same thing in this case)
  2. This is not about the artifacts/citation markers in the text, I don’t have that issue. The problem is that part of the original text response seems to have been deleted in the locations where the annotations are indexed
  3. This also happens with non-reasoning models (e.g. gpt-4.1) although it might be less frequent, can’t say for sure
  4. I know it’s on ChatGPT/Playground, but the second reddit post I mentioned (“Deep Research bugged when writing sources”) seems to show exactly the same problem, regardless of whether the citations are for web search results instead of files - the symptom seems to be the same, i.e., original text characters around the citation indexes are deleted (hence the user reporting “cut words”).

The AI is what produces the citation. Where you get an index like 77, the AI produced its special format at character 77 of the output. The deleting is what the API backend does to hide that marker and return a normalized response.

With API changes and model mashups, the length and type of data to strip after the marker is identified may have gone wrong. For example, if the backend deletes a fixed 7 code points when the AI actually wrote more (say, a longer index number), or doesn’t use the closing character for end detection, you’d get added offsets depending on how the pull-and-strip proceeds.

Or models just can’t produce this sequence properly.
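
As a purely speculative sketch of that pull-and-strip step (the real backend logic is not public; the marker characters below come from the format quoted above), a fixed-length removal deletes neighbouring answer text whenever the marker turns out to be longer than assumed, while delimiter-based removal strips exactly the marker:

OPEN, CLOSE = "\ue200", "\ue201"   # marker delimiters from the format quoted above

def strip_citations(raw, fixed_len=None):
    """Remove citation markers, recording each one's index in the cleaned text."""
    cleaned, annotations, i = [], [], 0
    while i < len(raw):
        if raw[i] == OPEN:
            annotations.append(len(cleaned))      # annotation index in the cleaned text
            if fixed_len is None:
                i = raw.index(CLOSE, i) + 1       # delimiter-based: remove exactly the marker
            else:
                i += fixed_len                    # fixed-length guess: skips past answer text if too large
        else:
            cleaned.append(raw[i])
            i += 1
    return "".join(cleaned), annotations

raw = "Margins improved\ue200filecite\ue202turn3file2\ue201 across regions."
print(strip_citations(raw))                # ('Margins improved across regions.', [16])
print(strip_citations(raw, fixed_len=25))  # ('Margins improvedoss regions.', [16])  <- characters wrongly deleted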

It would be nice if developers had control of the tool instructions, to not even ask for annotations, but nooooo.

OpenAI gets to figure this out, based on API developers that aren’t sharing a reproduction example.


The standalone file search API can instead be offered to the AI through a developer function, or your own embeddings can be cheap enough to always inject fresh results above a threshold and never double up on API calls for tool use. No more annotations other than by your command.
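
A rough sketch of that approach, assuming a client object and the standalone vector store search endpoint (verify the exact vector_stores.search signature against your SDK version; the search_documents tool name and schema are made up for illustration):

import json
from openai import OpenAI

client = OpenAI()

# Expose retrieval as a plain function tool, so the model never emits the
# built-in file_search citation markers that the backend has to strip out.
search_tool = {
    "type": "function",
    "name": "search_documents",
    "description": "Search the document store and return relevant passages.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

response = client.responses.create(
    model="gpt-4.1",
    input=query,
    instructions=system_prompt,
    tools=[search_tool],
)

for item in response.output:
    if item.type == "function_call" and item.name == "search_documents":
        args = json.loads(item.arguments)
        results = client.vector_stores.search(   # standalone vector store search endpoint
            vector_store_id='...', query=args["query"]
        )
        # feed `results` back as a function_call_output item in a follow-up
        # responses.create call so the model can compose its answer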

Hi @pca,

We’ve found the same occurs when both the web search and file search tools are included but the user’s query only invokes file search. If it invokes both tools, it works fine.

When you remove the web search tool it seems to fix the issue - I have logged a ticket with OpenAI to see if they can fix it.

E.g. see all the spelling mistakes from the Responses API for a simple question. All the words are so jumbled it’s unreadable. The same is happening in the Playground. 4.1 seems to work.
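
In code terms, the workaround is just dropping the web_search entry from the tools list in the original call - a sketch reusing the first post’s variables:

import openai

# Workaround: file_search only, with no web_search tool in the same request
response = openai.responses.create(
    input=query,
    model="o3-pro",
    instructions=system_prompt,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": ['...'],
            "max_num_results": 50,
        },
    ],
    include=["file_search_call.results"],
)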

Thanks for sharing - we’ve found exactly the same. Since web sources and file search sources are cited differently (inline vs with annotations) and the issues occur exactly where the annotation markers are, we suspect it could be due to injected system instructions on how to cite sources.

We’re also trying to raise the problem with OpenAI - will update you here if we find a solution.


@pca I tried using the new Web Search tool launched today, but still the same error.
