I have been looking through similar questions on the forum for some time, but I can’t find a solution to my problem.
We have developed an application for auditing phone calls after transcription. We send the following information to the OpenAI API (gpt-4o-2024-11-20):
The transcript.
30-40 questions that I want the assistant to answer about the transcript.
The assistant evaluates most of the questions correctly and recognizes both simple and complex concepts. My problem is that it sometimes seems to make up content that does not exist in the conversation. For example, it may tell me that the agent confirmed the customer received an SMS when there is nothing like that in the transcript.
Basically, it seems to hallucinate sometimes. I don’t know if it is because of the large input (about 5,000 tokens in, 2,000 tokens out) or because the assistant is in a “hurry” to answer.
The lower I set the assistant’s temperature (currently 0.02), the better it works, but I still can’t avoid some cases of hallucination.
I have tried telling the assistant in its instructions not to rush its answer, to base itself exclusively on the text of the transcript, and so on, but as I say, I can’t fix it.
Sorry for the long text. Can you think of any advice to improve the results? Is there a way to tell OpenAI that I am not in a hurry to get an answer? Do you think the 4o model is the best option?
Thank you very much to those of you who read this, and even more to those who answer.
However, the model does have a sense of “the response is getting too long, better finish”, or “better minimize the length of each of the 40 parts requested”.
So you might need to use multiple runs with fewer tokens being generated per run, under 1,000. This also prevents a long list of questions and a growing answer from distracting from the input document.
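A minimal sketch of what that batching could look like, assuming the official openai Python SDK; the batch size, prompt wording, and the audit_transcript helper are placeholders, not something from your application:

```python
# Sketch: audit the same transcript in several smaller calls,
# each with a limited number of questions and a modest output budget.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def audit_transcript(transcript: str, questions: list[str], batch_size: int = 10) -> list[str]:
    answers = []
    for start in range(0, len(questions), batch_size):
        batch = questions[start:start + batch_size]
        numbered = "\n".join(f"{start + i + 1}. {q}" for i, q in enumerate(batch))
        response = client.chat.completions.create(
            model="gpt-4o-2024-11-20",
            temperature=0,
            max_tokens=1000,  # keep each generation short
            messages=[
                {"role": "system",
                 "content": "Answer only from the transcript. If the transcript "
                            "does not mention something, say there is no evidence."},
                {"role": "user",
                 "content": f"Transcript:\n{transcript}\n\nQuestions:\n{numbered}"},
            ],
        )
        answers.append(response.choices[0].message.content)
    return answers
```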
When the AI chooses a word like “yes” or “no”, it is just predicting the certainty of one of those two tokens (or others). The further the output generation gets from the document and the questions, the more random that choice becomes.
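You can actually inspect that per-token certainty by requesting log probabilities from the chat completions endpoint. A rough sketch, again assuming the openai Python SDK; the prompt contents here are only an illustration:

```python
# Sketch: look at how certain the model is about a single "Yes"/"No" token.
import math
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    temperature=0,
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
    messages=[
        {"role": "system", "content": "Answer with exactly one word: Yes or No."},
        {"role": "user", "content": "Transcript:\n...\n\nDid the agent confirm that an SMS was received?"},
    ],
)

# Each candidate token comes back with a log probability we can turn into a percentage.
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(f"{candidate.token!r}: {math.exp(candidate.logprob):.1%}")
```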
For an answer more grounded in reality, you can guide the questions: have the AI repeat the question it is currently answering, first produce a statement about where that information comes from, and only then answer. This can be structured in a response format:
[
  {
    "question_index": 1,
    "question": "In document, has customer received an SMS",
    "preliminary_reasoning": "The document doesn't seem to discuss text messages anywhere",
    "best_citation_from_document": null,
    "answer": "No"
  },
  {
    "question_index": 2,
    ...
Using a strict structured output schema, you can make only two enum strings possible for the answer, giving the AI one clear choice, with no distraction from ranking near-duplicates such as whether it should write “yes” or “Yes”.
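As a sketch of how such a strict schema could be sent with gpt-4o’s structured outputs (the field names mirror the example above; the schema name “call_audit” and the prompts are placeholders):

```python
# Sketch: a strict response schema where "answer" can only be "Yes" or "No".
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "answers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "question_index": {"type": "integer"},
                    "question": {"type": "string"},
                    "preliminary_reasoning": {"type": "string"},
                    "best_citation_from_document": {"type": ["string", "null"]},
                    "answer": {"type": "string", "enum": ["Yes", "No"]},
                },
                "required": ["question_index", "question", "preliminary_reasoning",
                             "best_citation_from_document", "answer"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["answers"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    temperature=0,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "call_audit", "strict": True, "schema": schema},
    },
    messages=[
        {"role": "system", "content": "Answer each question strictly from the transcript."},
        {"role": "user", "content": "Transcript:\n...\n\nQuestions:\n1. Has the customer received an SMS?"},
    ],
)
print(response.choices[0].message.content)
```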
Yes. You can tell it to “chill”, at least somewhat.
Andrew Ng on DeepLearning.AI, Anthropic in their prompting guide, and certainly OpenAI in their guide for ChatGPT all tell you:
You may add things to your prompts like:
“Think Step by Step.”
“Take a deep breath and count to 10, then start.”
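For example, only as a sketch of wording to adapt to your own audit prompt:

```python
# Sketch: a system prompt that adds the "slow down" phrasing suggested above.
SYSTEM_PROMPT = (
    "You are auditing a call transcript. Take a deep breath and count to 10, "
    "then start. Think step by step, and base every answer exclusively on the "
    "transcript; if the transcript does not mention something, say so."
)
```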