Gpt-4o-mini (even gpt-3.5-turbo) works but gpt-4o doesn't

I have been using the OpenAI API for my work project for about 4 months now.
Today, I identified a potential problem that, surprisingly, is confined to gpt-4o and involves "large numbers" (not even arithmetic, just reproducing them).

In my project flow, I have a set of intermediate answers that I have to merge at the end to get a final answer to the original prompt.

So there is only a single intermediate answer in the current sample, which is:

{
"sub-prompt":  "What is the current population of Austin, Texas?",
"answer": {"current_population_of_Austin_Texas":961855}
}

…and the merged answer from gpt-4o:
{"population_of_Austin_Texas": "The current population of Austin, Texas is not directly available from the provided data. Please refer to the latest census or city data for the most accurate and up-to-date information."}

from gpt-4o-mini:
{ "current_population_of_Austin_Texas": "961855"}

from gpt-3.5-turbo:
{"population_of_Austin_Texas": "961855"}

If I removed one digit from the population count, gpt-4o worked. If I give any number longer than 5 digits, it fails.
I know LLMs are quite problematic around numbers because of tokenization. Also, gpt-4o uses a different tokenizer (o200k_base) from the previous models, which use cl100k_base.
But doesn't gpt-4o-mini use the same one? Why is it working then?

I have checked with the tokenizers as well; there is no weird problem that I can identify.
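For example, a quick check with tiktoken (a sketch of the kind of comparison described; the encoding names are the real ones):

```python
import tiktoken

# Compare how the two encodings split the problematic number.
for name in ["o200k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode("961855")
    print(name, ids, [enc.decode([i]) for i in ids])
```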

If this is a known problem, I apologise for the repetition. I have searched everywhere, but no one has claimed that the mini and the previous-generation models are better.

Edit 1: temperature is set to ‘0’. max_tokens is set to ‘5000’. There is a system prompt involved, which is the same across all the tests I have made with the different models. I have attempted around 20 times with each of the models. The success rate of gpt-4o is 0%; the other models: 100%. (A sketch of the test call appears after Edit 3.)

Edit 2: Surprisingly, “gpt-4o-2024-05-13” is working. Both of the later snapshots (the ones introduced after structured outputs) are not working.

Edit 3: When given a 5-digit number, the latest gpt-4o worked. Maybe it is a problem with numbers longer than 5 digits.
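For context, a minimal sketch of a single test call with the settings from Edit 1 (the OpenAI Python SDK is assumed; the system prompt is abbreviated here and quoted in full further down):

```python
from openai import OpenAI

client = OpenAI()

# Abbreviated; the exact merge prompt is quoted in full later in the thread.
SYSTEM_PROMPT = (
    "You are a highly skilled language model tasked with integrating answers "
    "from multiple sub-prompts into a cohesive response to the original prompt..."
)

INTERMEDIATE = (
    '{"sub-prompt": "What is the current population of Austin, Texas?", '
    '"answer": {"current_population_of_Austin_Texas": 961855}}'
)

for model in ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        max_tokens=5000,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": INTERMEDIATE},
        ],
    )
    print(model, resp.choices[0].message.content)
```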

gpt-4o can successfully call a strict tool, and it can successfully reproduce a tool result into a non-strict schema so it can output whatever object key name is best. I also tried a few other ways to evoke the problem, including a simulation of your varying AI output.

The second go is the AI invoking my “tool” – unlike the first try, where the AI just answered on its own and didn’t use a city_info tool with an enum for population. Duh. Just as “duh” as what everyone experiences with gpt-4o recently: “why doesn’t it know my PDF?”

You might try the new gpt-4o-2024-11-20 model.

You might just change up your usage in a different way, or give a system prompt that heads off this ignorance, such as:

“To answer, you ALWAYS carefully review all previous messages and reproduce the requested statistic ONLY if present in past factual messages (such as tool returns or document returns). If after careful review, you do not have source data you can reproduce directly from messages, you output -1 into numeric data fields instead of an answer”

That should avoid wishy-washy chat and JSON-busting.

Hi, there is no need for tools here in my pipeline. I already have an intermediate answer that answers a part of the original prompt.
That intermediate answer is generated through scraping, a RAG pipeline, and LLM calls.
Also, there is a system prompt involved in my request which clearly instructs the LLM to merge the set of intermediate answers presented, without losing any data. gpt-4o never lost data in any of the other samples.
This is the only sample that is weirdly not working. Also, with the “same system prompt”, “same settings”, “same everything” as mentioned before, the other, less complex models are working.

system prompt: “You are a highly skilled language model tasked with integrating answers from multiple sub-prompts (derived from the given original prompt) into a cohesive and comprehensive response to the original prompt.\nBefore answering, follow these instructions:\n1. Understand the original prompt to get the overarching context and intent.\n2. For each sub-prompt and it’s answer, go through the related webpages, and their titles while evaluating.\n3. Final response should be coherent and unified, avoiding repetitions and contradictions.\n4. Emphasize more on answering original prompt without loosing the information obtained through the sub-prompts.\n5. Clearly distinguish and integrate content sourced from different related webpages when necessary.\nMaintain a neutral, concise, and accurate tone, while adhering to the user’s query context.\nThe answer should be in JSON format without any other noise and also should be as concise as possible.”

This is the exact system prompt, which is being used every time.

Also, as mentioned in the edit above, I have used all three available snapshots of gpt-4o. Only the oldest one, “gpt-4o-2024-05-13”, worked.

The AI wants to write that? What do you do next?

logit_bias: {976:-50}

It can’t write " The current population of Austin, Texas is not directly available" without “The”
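In request form, that looks roughly like this (a sketch; per the post above, token id 976 is assumed to be o200k_base’s “ The”, and the messages are placeholders):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    # Strongly demote token 976 (assumed to be " The" in o200k_base),
    # so the refusal sentence can't begin with it.
    logit_bias={976: -50},
    messages=[
        {"role": "system", "content": "Answer in JSON."},
        {"role": "user", "content": "Merge the intermediate answers."},
    ],
)
print(resp.choices[0].message.content)
```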

I don’t understand what you are asking me. I am pointing out that gpt-4o is unable to recognize the existence of an answer when I provide it an intermediate answer and ask for a final answer, whenever the intermediate answer is a number with 6 or more digits.

It is not supposed to say the data is not available. I am clearly providing an intermediate answer that does have the population count. It just needs to reproduce that figure in the final answer.
If you are wondering why I am providing an intermediate answer: this is a pipeline involving query decomposition, which results in multiple intermediate answers that are later merged to form the final answer, which is the end result.

I’m telling you that at temperature 0, you are only going to get one choice: the top logit for that position during generation, where “The” takes the place of a number-representing token. If you want variety, you leave temperature and top_p at 1, and then generate until you see the effects of multinomial sampling.

Looking at the probs above, you might think, “so it’s not going to write the correct thing 33% of the time?”

All it takes is a top-rank flip, and at temperature 0, it’s not going to write the correct thing 100% of the time.

There could be any number of factors why the 961 token has lower certainty than “The” at that particular point of token-by-token generation inside the JSON value – like “The” being 1000 times more common, or some technicality deep in softmax embedding space. Or the model sucks.
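As a toy illustration of that rank-flip argument (made-up logits, not measured from the model):

```python
import math
import random

# Two candidate first tokens inside the JSON value; " The" narrowly
# outranks the number token in this hypothetical.
logits = {" The": 2.1, "961": 2.0}

# Temperature 0 (greedy): always the single top-ranked token.
greedy = max(logits, key=logits.get)

# Temperature 1: multinomial sampling over the softmax distribution.
z = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / z for tok, v in logits.items()}
sampled = random.choices(list(probs), weights=list(probs.values()))[0]

print(greedy)   # always " The"
print(probs)    # ~{' The': 0.52, '961': 0.48}
print(sampled)  # varies run to run
```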

If you want numbers written, you get the logprobs at a reliable position (like right after the opening quote in structured JSON) and demote the other words.
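A sketch of pulling those logprobs from the API (the parameters are real; the message content is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    logprobs=True,
    top_logprobs=5,
    messages=[{"role": "user", "content": "Reply with the merged JSON."}],
)

# Inspect the first few generated tokens and their runners-up.
for tok in resp.choices[0].logprobs.content[:5]:
    print(repr(tok.token), round(tok.logprob, 3),
          [(alt.token, round(alt.logprob, 3)) for alt in tok.top_logprobs])
```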


What you have stated completely makes sense. Thank you so much for your insights. I will explore this problem in this direction. That is awesome. Thanks again!

