I have been using the OpenAI API for a work project for about 4 months now.
Today I identified a potential problem that, surprisingly, seems to be confined to gpt-4o and involves “large numbers” (not even arithmetic).
In my project flow, I have a set of intermediate answers that I merge at the end to get a final answer to the original prompt.
In the current sample there is only a single intermediate answer:
{
"sub-prompt": "What is the current population of Austin, Texas?",
"answer": {"current_population_of_Austin_Texas":961855}
}
…and the merged answer from gpt-4o:
{"population_of_Austin_Texas": "The current population of Austin, Texas is not directly available from the provided data. Please refer to the latest census or city data for the most accurate and up-to-date information."}
from gpt-4o-mini:
{ "current_population_of_Austin_Texas": "961855"}
from gpt-3.5-turbo:
{"population_of_Austin_Texas": "961855"}
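For anyone who wants to reproduce this, here is a minimal Python sketch of how the three merged answers above could be checked automatically. The helper name `preserves_number` is hypothetical and not taken from the project; it only assumes that the merged answer is a flat JSON object.

```python
import json
import re


def preserves_number(merged_json: str, expected: int) -> bool:
    """Return True if any value of the merged JSON object contains `expected`.

    Hypothetical helper: the number may appear as an int or inside a string,
    with or without thousands separators (e.g. "961,855").
    """
    merged = json.loads(merged_json)
    for value in merged.values():
        # Drop thousands separators, then collect every run of digits.
        text = str(value).replace(",", "")
        if str(expected) in re.findall(r"\d+", text):
            return True
    return False


# The three merged answers quoted above.
gpt_4o = (
    '{"population_of_Austin_Texas": "The current population of Austin, Texas '
    "is not directly available from the provided data. Please refer to the "
    "latest census or city data for the most accurate and up-to-date "
    'information."}'
)
gpt_4o_mini = '{ "current_population_of_Austin_Texas": "961855"}'
gpt_35_turbo = '{"population_of_Austin_Texas": "961855"}'

for name, answer in [
    ("gpt-4o", gpt_4o),
    ("gpt-4o-mini", gpt_4o_mini),
    ("gpt-3.5-turbo", gpt_35_turbo),
]:
    print(name, preserves_number(answer, 961855))
```

With the outputs quoted above, only the gpt-4o answer fails the check.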
If I remove one digit from the population count, gpt-4o works. If I give it any large number (more than 5 digits), it fails.
I know LLMs tend to struggle with numbers because of tokenization. Also, gpt-4o uses a different tokenizer (o200k_base) from the previous models, which use cl100k_base.
But doesn't gpt-4o-mini use the same tokenizer? Why does it work, then?
I have checked the tokenizers as well, and there is no odd behavior that I can identify.
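A tokenizer check along these lines can be sketched in Python with the `tiktoken` package (third-party, `pip install tiktoken`). This is only an illustration of the idea, not the exact check that was run; the function names `token_pieces` and `compare_encodings` are made up for this example.

```python
def token_pieces(enc, text: str) -> list[str]:
    """Split `text` into the string pieces that `enc` tokenizes it into.

    `enc` is assumed to behave like a tiktoken Encoding, i.e. to provide
    encode(str) -> list[int] and decode_single_token_bytes(int) -> bytes.
    """
    return [
        enc.decode_single_token_bytes(token).decode("utf-8", errors="replace")
        for token in enc.encode(text)
    ]


def compare_encodings(text: str) -> None:
    """Print how cl100k_base and o200k_base each split `text`."""
    import tiktoken  # third-party package, imported lazily on purpose

    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        print(name, token_pieces(enc, text))
```

Calling `compare_encodings("961855")` prints the pieces under each encoding, so any difference in how the number is split would show up side by side.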
If this is a known problem, I apologise for the repetition. I have searched everywhere, but found no report claiming that the mini and the earlier models handle this better.
Edit 1: temperature is set to ‘0’ and max_tokens to ‘5000’. A system prompt is involved, and it is the same across all the tests I ran with the different models. I made around 20 attempts with each model. The success rate of gpt-4o is 0%; for the other models it is 100%.
Edit 2: Surprisingly, “gpt-4o-2024-05-13” works. The two latest snapshots, from after structured outputs were introduced, are not working.
Edit 3: When given a 5-digit number, the latest gpt-4o worked. So it may be a problem with numbers longer than 5 digits.
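To narrow down the digit threshold more systematically, a small sweep like the following could help. It is a rough sketch under assumptions: `merge` is a hypothetical callable that sends one intermediate answer to the model (with whatever system prompt and settings are in use) and returns the merged answer as text, and the success test is a plain substring match on the number.

```python
import json
import random
from typing import Callable


def make_sample(digits: int, rng: random.Random) -> tuple[str, int]:
    """Build one intermediate answer whose value has exactly `digits` digits."""
    number = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
    sample = {
        "sub-prompt": "What is the current population of Austin, Texas?",
        "answer": {"current_population_of_Austin_Texas": number},
    }
    return json.dumps(sample), number


def success_rate_by_digits(
    merge: Callable[[str], str],  # hypothetical wrapper around the API call
    digit_counts: range,
    attempts: int = 20,
    seed: int = 0,
) -> dict[int, float]:
    """Return, per digit count, the fraction of merges that kept the number."""
    rng = random.Random(seed)
    rates: dict[int, float] = {}
    for digits in digit_counts:
        hits = 0
        for _ in range(attempts):
            sample, number = make_sample(digits, rng)
            # Simplification: count a success if the digits appear verbatim.
            if str(number) in merge(sample):
                hits += 1
        rates[digits] = hits / attempts
    return rates
```

Running it with, say, `range(3, 9)` for each model would show whether the failures really start at 6 digits or depend on the specific value.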