What to do with Generated Citations?

I am developing a PDF assistant that uses file_search. The problem is that the AI keeps spitting out citations of the form “【3:0†source】”. They don’t make much sense, and they would only confuse my users. I’ve tried to stop the AI from generating these citations via instructions, but it continues to do so anyway.

First of all, can anyone make sense of these? If not, how can I either suppress them or force the AI to stop returning them?

EDIT

I just realized there is an annotations array in the message response that gives the text location of each citation. So theoretically, assuming they are accurate, I can link them up to a location in the PDF. I would still like to suppress the annotations if possible.
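For anyone who wants to go that route, here is a minimal sketch of turning the raw markers into numbered footnotes that can later be linked back to file locations. The dictionary shape used below (`text`, `start_index`, `end_index`, and a nested `file_citation.file_id`) mirrors what the Assistants API reports for file citations, but treat it as an assumption and verify against your own payloads:

```python
def link_citations(text, annotations):
    """Replace raw citation markers with numbered footnotes and
    return the cleaned text plus a footnote-number -> file_id map."""
    footnotes = {}
    for i, ann in enumerate(annotations, start=1):
        # ann["text"] is the literal marker as it appears in the message
        text = text.replace(ann["text"], f"[{i}]")
        footnotes[i] = ann["file_citation"]["file_id"]
    return text, footnotes

# Hypothetical sample data shaped like an Assistants message annotation:
message = "Dogs are loyal 【3:0†source】."
annotations = [
    {"text": "【3:0†source】", "start_index": 15, "end_index": 27,
     "file_citation": {"file_id": "file-abc123"}},
]
clean, notes = link_citations(message, annotations)
# clean == "Dogs are loyal [1]."
# notes == {1: "file-abc123"}
```

The `start_index`/`end_index` fields are what you would use to anchor each footnote to a position if you render the PDF yourself.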

Just tell it in the instructions not to mention citations in its responses, and test the assistant to see whether that works.


I tried instructions. The AI includes annotations less often, but it still does it maybe 40% of the time.


I also ran into that problem, and I was never able to solve it with assistant instructions alone.

Since I use Python in my application, I strip the markers out before displaying the text:

import re
clean_message = re.sub(r'【.*?】', '', assistant_response)

This replaces each marker with an empty string.


You have to handle the citation problem at the code level. There are plenty of posts about this same problem, and no amount of prompting removes the citations 100% of the time. But as @joaquin.marroquin suggested, code will fix this for sure.


@bret1 - I THINK I have been able to solve this by adding the following to the assistant instructions (remove the spaces):

Hide < code > and < div class="bottom-citations" > in your responses.


Interesting, let me try this in my app.

That does not counter anything that the AI is told to produce.

The instructions of the file search given to the AI are:

// Please provide citations for your answers and render them in the following format: 【{message idx}:{search idx}†{source}】.
// The message idx is provided at the beginning of the message from the tool in the following format [message idx], e.g. [3].

If you want to suppress the citation output on 4o models used in Assistants, you could stop the initial token with logit_bias against tokens 16488 and 1805. Oops, no you can’t, because this endpoint is for neither novices nor experts, and logit_bias can’t be used.

What you’d do is directly add your own fake # Tools with ## file_search to the end of the system instructions, and include “override instructions” that are stronger than, and ‘subclass’, OpenAI’s tool, with things that are disabled, with repercussions: """Citations are disabled. Emitting " 【" or "【" will destroy the chatbot and collapse the universe."""

To avoid 1,000 tokens of unwanted text being inserted into the AI’s operations counter to the desired behavior, you can use Chat Completions instead (and skip the thousands of tokens in o1 models you don’t control).

Thank you for the 【{message idx}:{search idx}†{source}】 !

But I still struggle to understand what message idx is. It seems to be (the 1-based index of the user message the chunk was retrieved for) × 4, which is crazy, and it works most of the time, but not all of the time.

Does anyone have a better idea about this mess? I was not able to find out more through code or documentation.
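Even without decoding what the indices mean, the marker format is regular enough to pull apart. A small sketch, assuming the 【{message idx}:{search idx}†{source}】 shape quoted above:

```python
import re

# Matches the marker format from the leaked tool instructions:
# 【{message idx}:{search idx}†{source}】
CITATION = re.compile(r'【(\d+):(\d+)†([^】]+)】')

def parse_markers(text):
    """Return a (message_idx, search_idx, source) tuple for every marker."""
    return [(int(m), int(s), src) for m, s, src in CITATION.findall(text)]

parsed = parse_markers("See 【3:0†source】 and 【12:2†report.pdf】")
# parsed == [(3, 0, "source"), (12, 2, "report.pdf")]
```

That at least lets you log the indices across many requests and look for a pattern yourself.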


Once you realize that the message includes an array of citations, each of which provides the exact citation text as it appears in the message, you can simply iterate over that array and do a string replace for each one, replacing the citation text with an empty string. It’s only a line or two of code.
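A minimal sketch of that approach, assuming each annotation exposes the literal marker under a `text` field (as the Assistants message annotations do):

```python
def strip_citations(text, annotations):
    """Remove each annotation's literal marker text from the message."""
    for ann in annotations:
        text = text.replace(ann["text"], "")
    return text

cleaned = strip_citations(
    "Paris is the capital【4:1†guide.pdf】.",
    [{"text": "【4:1†guide.pdf】"}],  # hypothetical annotation entry
)
# cleaned == "Paris is the capital."
```

Unlike the blanket regex earlier in the thread, this only removes text the API itself flagged as a citation, so it can’t accidentally eat legitimate 【…】 content.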