Facing a strange bug with gpt-3.5-turbo-16k: output is getting appended twice

Model: gpt-3.5-turbo-16k
Issue: While using the aforementioned model as the LLM in a LangChain retrieval chain, it generates the output twice and appends the two copies side by side.
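
For reference, the setup is roughly like this (a minimal sketch; `vectorstore` is a placeholder for our actual index, and the real chain has more configuration):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Sketch of the setup; `vectorstore` stands in for our actual index.
llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0.7)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # retrieved chunks are stuffed into a single prompt
    retriever=vectorstore.as_retriever(),
)

answer = qa_chain.run("What is the radius of sun?")
```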

Is this hallucination? Or, since the model is heavily used in a pipeline, can that cause such erratic behavior once in a while?

For example:
Prompt: “What is the radius of sun?”
Answer: “The average radius of the Sun, considering its photosphere, is approximately 696,340 kilometers (about 432,685 miles).
The average radius of the Sun, considering its photosphere, is approximately 696,340 kilometers (about 432,685 miles).”

Is this logged as a single output from the AI model? Or is it instead the AI agent autonomously making two calls to obtain a response?

If the former, it may be the same symptom currently triggered for developers who have attempted to fine-tune while including functions: the AI doesn’t halt its output in the desired place by emitting a stop token, but instead proceeds to generate more text, in this case repeating the same text.
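
If you can log the raw API response, the finish_reason field will tell you whether the model halted itself with a stop token or was cut off by the token limit. A quick sketch with the openai Python library of that era:

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": "What is the radius of sun?"}],
)

choice = response["choices"][0]
# "stop": the model emitted its stop token on its own;
# "length": the output was truncated by max_tokens.
print(choice["finish_reason"])
print(choice["message"]["content"])
```

A duplicated answer that still ends with finish_reason "stop" would confirm the repetition happened within a single generation.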

AI models are probability machines that emit the best next token, or a token chosen more randomly if you have default temperature and top-p settings. This randomness affects control decisions, such as when to stop, as well as the language content.

Let’s experiment. I add a single linefeed, as your AI did, instead of emitting a “stop” special token at the end of the sentence:

Despite the continuation onto the next line, the AI still wanted to stop writing; a “stop” token was the most likely output to produce next.

Insert two extra linefeed characters, and the AI writes more, though at least it doesn’t repeat the same thing in this case:
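
You can roughly reproduce this experiment yourself. A sketch against the legacy completions endpoint, where you control the raw continuation (gpt-3.5-turbo-instruct stands in for the completion model; parameters are illustrative):

```python
import openai

sentence = (
    "The average radius of the Sun, considering its photosphere, "
    "is approximately 696,340 kilometers (about 432,685 miles)."
)

# With one trailing linefeed the model most often stops immediately;
# with two, it tends to keep writing a fresh paragraph.
for suffix in ("\n", "\n\n"):
    completion = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=sentence + suffix,
        max_tokens=100,
    )
    print(repr(completion["choices"][0]["text"]))
```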

You can reduce the chance of unlikely tokens by lowering the temperature and top_p API parameters.

You can reduce the chance of repeating text with the frequency_penalty or presence_penalty API parameters.
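
For example (values here are illustrative, not recommendations):

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": "What is the radius of sun?"}],
    temperature=0.3,        # narrows sampling toward likelier tokens
    top_p=0.9,              # trims the unlikely tail of the distribution
    frequency_penalty=0.5,  # penalizes tokens in proportion to prior use
    presence_penalty=0.2,   # penalizes any token that already appeared
)
print(response["choices"][0]["message"]["content"])
```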

Increasing the probability of a “stop” special token is a capability not currently exposed by the API.

The model API is not making two calls; the output is appended again verbatim.
Which is quite strange, because this is a one-in-a-thousand kind of situation for us.
Other than that, the model API is working perfectly fine.

I am assuming that the model was hallucinating after multiple calls, because there was a lot of load on the pipeline.

“Load on the pipeline” sounds like a human hallucination.

The AI model sees everything the agent framework gave it as context-loading input in a single final turn, and produces output based solely on that.
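
Roughly, that single final call looks something like this (illustrative only; the actual prompt template depends on the chain):

```python
# Illustrative payload: what a retrieval chain typically hands the model
# in its one final API call. The real template varies by framework.
messages = [
    {"role": "system", "content": "Use the following context to answer."},
    {
        "role": "user",
        "content": (
            "Context:\n"
            "<retrieved document chunks go here>\n\n"
            "Question: What is the radius of sun?"
        ),
    },
]
```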

AIs just have a natural tendency to repeat, one that is only mostly trained out of chat models. I get exactly what is expected when I want to demonstrate:

No no, I meant the cloud pipeline; the model is deployed on the cloud.
It works perfectly fine for the use case; we just encountered one result which was duplicated. So I am assuming it’s because of the heavy API call load the model faced, which might have resulted in hallucination behavior.

Every AI API call is independent, going to language inference servers that thousands upon thousands of users share, with no promise that your follow-up call will reach the same server or even the same datacenter. So that theory doesn’t have much basis.

It’s just an artifact of the “roll-of-the-dice” token creation and sampling process that makes AI model writing appear more human and less robotic.

I’m having the same problem: using the same prompt as with the standard model, the 16k version duplicates the output at least twice, which is very frustrating.

I’m seeing the same thing here. A request to parse freeform text into JSON results in a long response with multiple instances of the same JSON string. It happens only with gpt-3.5-turbo-16k, not gpt-3.5-turbo-1106 or gpt-4.
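
Until the model behaves, a crude client-side guard can at least detect and strip an exact doubled response. A sketch that only catches verbatim repeats like the ones in this thread:

```python
def strip_exact_duplicate(text: str) -> str:
    """Collapse a response that is the same text repeated twice,
    either back to back or split across a newline boundary."""
    stripped = text.strip()
    half, remainder = divmod(len(stripped), 2)
    # Case 1: the whole string is two identical halves.
    if remainder == 0 and stripped[:half] == stripped[half:]:
        return stripped[:half].strip()
    # Case 2: the copies are identical runs of whole lines.
    parts = stripped.split("\n")
    mid = len(parts) // 2
    if len(parts) % 2 == 0 and parts[:mid] == parts[mid:]:
        return "\n".join(parts[:mid])
    return stripped
```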