Hi,
Issue
I’m using the Responses API with the file search tool, querying a vector store that contains many PDFs. When I prompt for information that forces the model to query the documents, it queries them correctly, finds the right information, composes an answer, and cites sources inline in the answer.
In that last step, wherever it inserts an inline citation, the text becomes garbled: part of the answer is clearly deleted at the point where the citation is added. Unfortunately I can’t share the output for confidentiality reasons, but what I see is very similar to what’s reported in, for example (Reddit posts, so I can’t link directly here):
- the ChatGPTPro subreddit, post 1jplmka
- the OpenAI subreddit, post 1jp7eeo
Because the deleted characters can be anything, this often breaks the formatting as well: for example, a citation inside a table can delete table markers and make the whole table impossible to render. It also seems to affect most models I’ve tried, including gpt-4, o3, o3-pro, and gpt-5.
I tried to reproduce this with much simpler examples (a dummy file with just one line of text), but couldn’t.
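For reference, this is roughly the kind of minimal repro I attempted (the file name, contents, query, and vector store name are placeholders, and I’m using an explicit client here only for the sketch):
from openai import OpenAI

client = OpenAI()

# Upload a trivial one-line document and attach it to a fresh vector store
# (in practice I also wait for the file to finish indexing before querying).
dummy_file = client.files.create(file=open("dummy.txt", "rb"), purpose="assistants")
vector_store = client.vector_stores.create(name="citation-repro")
client.vector_stores.files.create(vector_store_id=vector_store.id, file_id=dummy_file.id)

# Ask something that forces a file_search call against the dummy document.
response = client.responses.create(
    model="o3-pro",
    input="What does the document say?",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
    include=["file_search_call.results"],
)
print(response.output_text)
With a trivial document like this the output and citations come back clean; the garbling only appears against the real vector store with many PDFs.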
Code
This is the Python code I’m using to make the calls:
import openai

response = openai.responses.create(
    input=query,
    model="o3-pro",
    instructions=system_prompt,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": ['...'],
            "max_num_results": 50,
        },
        {"type": "web_search"},
    ],
    include=["file_search_call.results"],
)
and this is the code I’m using to confirm that the response text is garbled at every citation index:
message = response.output[-1].content[0]
for annotation in message.annotations:
    print("\n\n\n ------")
    print(message.text[annotation.index - 50:annotation.index] + '~' + message.text[annotation.index:annotation.index + 50])
(this prints the 50 characters before and after each citation index, with a ~ character in the middle so I know where to look, confirming that the text is garbled at every citation location)
I’ve also tried getting the raw response and it shows the same issue.
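For completeness, this is roughly how I checked the raw response, hitting the endpoint directly rather than relying on the SDK’s parsed objects (the payload here is illustrative):
import os
import httpx

# POST to /v1/responses directly and dump the unparsed JSON body,
# to rule out any client-side processing by the Python SDK.
raw = httpx.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "o3-pro",
        "input": "...",
        "instructions": "...",
        "tools": [{"type": "file_search", "vector_store_ids": ["..."], "max_num_results": 50}],
        "include": ["file_search_call.results"],
    },
    timeout=600,
)
print(raw.text)
The raw JSON body already contains the garbled text, so it doesn’t look like an artifact of the SDK’s parsing.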
Possible cause
In another post in this forum, “How to handle file citations with the new Responses API?”, @stevecoffey, you mention:
The idea here is that instead of having to remove the citations from the text, which is annoying and error-prone, you now have the option of inserting them.
which suggests there may be a post-processing step that removes these markers from the raw model output before the response is returned. If that’s the case, I believe that step is the issue: the removal (possibly a regex replace) isn’t robust and deletes more content than it should.
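To illustrate the kind of failure I suspect (purely a toy example; I have no visibility into the actual post-processing, and the marker format is a guess):
import re

# Hypothetical raw model output containing two citation markers.
raw = "Revenue grew 12% 【4:0†report.pdf】 while costs fell 3% 【4:1†report.pdf】 overall."

# A greedy pattern matches from the first 【 to the last 】 and swallows the text in between.
greedy = re.sub(r"【.*†.*】", "", raw)
# An anchored pattern removes only the markers themselves.
anchored = re.sub(r"【[^】]*】", "", raw)

print(greedy)    # Revenue grew 12%  overall.
print(anchored)  # Revenue grew 12%  while costs fell 3%  overall.
If the cleanup behaves anything like the greedy variant, it would explain the symptom exactly: correct answers with chunks of text missing around each citation.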
Could you please help?