Which model is best for speed and accuracy?

I’m building an AI agent chatbot using OpenAI’s API. Which model would be the best choice for speed and accuracy: gpt-4o-mini-2024-07-18 or gpt-3.5-turbo?

1 Like

gpt-4o-mini wins for speed.

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-2024-08-06 | 4 | 0.739 | 41.698 |
| gpt-4o-2024-05-13 | 4 | 0.730 | 64.069 |
| gpt-4o-2024-11-20 | 4 | 0.676 | 37.113 |
| gpt-4o-mini | 4 | 0.558 | 111.561 |
| gpt-3.5-turbo | 4 | 0.571 | 63.459 |

(This is from running all 20 API call trials in parallel, with a small messages input.)
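If you want to reproduce this kind of measurement, here's a minimal sketch, assuming the openai Python SDK with streaming: latency is time to the first streamed token, and the rate approximates one token per chunk. The prompt, model list, and trial count are placeholders.

```python
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-4o-mini", "gpt-3.5-turbo"]  # add the snapshots you care about
TRIALS = 4

async def one_trial(model: str) -> tuple[float, float]:
    """Return (seconds to first token, tokens/s); one chunk ~ one token."""
    start = time.perf_counter()
    first = end = start
    chunks = 0
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=100,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if chunks == 0:
                first = time.perf_counter()
            chunks += 1
            end = time.perf_counter()
    rate = chunks / (end - first) if end > first else 0.0
    return first - start, rate

async def main():
    # fire all trials for all models at once, like the table above
    results = await asyncio.gather(
        *(one_trial(m) for m in MODELS for _ in range(TRIALS))
    )
    for i, model in enumerate(MODELS):
        batch = results[i * TRIALS:(i + 1) * TRIALS]
        lat = sum(r[0] for r in batch) / TRIALS
        rate = sum(r[1] for r in batch) / TRIALS
        print(f"{model}: avg latency {lat:.3f}s, avg rate {rate:.1f} tok/s")

asyncio.run(main())
```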

gpt-4o-mini has a decidedly different response quality and understanding, especially in a longer chat. It also accepts far more input: a 128k-token context window, versus 16k for gpt-3.5-turbo. It may chat plausibly, but it does not adapt as well to the novel tasks an API developer might “program” into it. You will need to evaluate the quality of each yourself.

4 Likes

Will the latest version of GPT-4o remain the stable version, i.e. gpt-4o-2024-08-06?

1 Like

One more thing: I also used o3-mini, but I noticed that it's very slow. Why?

If you simply specify “gpt-4o”, you will currently get gpt-4o-2024-08-06. This is a “recommended model” pointer.
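You can verify this yourself: the response object reports the snapshot that actually served the request. A quick sketch with the openai Python SDK:

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # the alias, not a dated snapshot
    messages=[{"role": "user", "content": "Hi"}],
    max_tokens=5,
)
# resp.model is the resolved snapshot,
# e.g. "gpt-4o-2024-08-06" at the time of writing
print(resp.model)
```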

The “O” series of AI models, such as o3-mini, are reasoning models. They generate internal planning and thoughts that you do not see, at higher expense. Think of them as puzzle and problem solvers rather than conversationalists. o3-mini also takes a reasoning_effort parameter, so you can tune how much thinking and dedication is allotted before responding.

https://platform.openai.com/docs/models#o3-mini
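For example, a minimal sketch of dialing down the hidden reasoning, assuming the openai Python SDK; the prompt and token budget are placeholders:

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="low",  # "low" | "medium" | "high"; less hidden thinking is faster
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    max_completion_tokens=2000,  # o-series models use this instead of max_tokens
)
print(resp.choices[0].message.content)
```

Note that the unseen reasoning tokens count against max_completion_tokens, so leave headroom beyond the visible answer you expect.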

I assume this is a polite way of saying “comparatively bad”, but I'd like to hear more about your specific experiences: where it falls short, and how you believe it is best used. I've spent extensive time with it as well, and I've been wondering recently whether those shortcomings in comprehension are useful as guidelines for better designing our prompts, tools, etc.

For example, I recently had a problem which was easily solved by using 4o rather than 4o-mini, but after some experimentation, I found a few things that were confusing 4o-mini and got it working with the smaller model. The question (and as far as I know there's no way to test this) is whether doing so also improves comprehension in the larger models.

A “mini” model, or any AI model with a lower parameter count, simply doesn't have as much embedding space to encode layers of pretrained knowledge. GPT-4 is going to do a better job reciting truthful statistics about the 1922 World Series team lineups, listing Amiga game developers who worked at a particular company, or even writing natively in ᐃᓄᒃᑎᑐᑦ than the predictions that come out of more compressed models.

Sure, of course. I don't think anybody is relying on mini models for knowledge. They're primarily used for RAG and function calling. But even for simple applications, I find 4o-mini to be very fragile and demanding of “technique” to get it to work in place of 4o. I've seen nearly no discourse on it, which is why I'm curious whether you have any interesting cases you'd like to share.
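For reference, this is the sort of function-calling duty a mini model typically gets; the get_order_status tool below is a made-up example, not anything from this thread:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # hypothetical tool for illustration
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "e.g. A1234"},
            },
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Where is my order A1234?"}],
    tools=tools,
)
# whether the model reliably emits this tool call, with the right
# arguments, is exactly the fragility being discussed
print(resp.choices[0].message.tool_calls)
```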

For a balance of speed and accuracy, gpt-4o-mini-2024-07-18 is likely the better choice, as it's optimized for efficiency while maintaining strong performance. Note that it is also priced lower than gpt-3.5-turbo, so cost is rarely a reason to prefer the older model. Testing both on your use case is still recommended.
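If it helps, here is a minimal sketch of that side-by-side test, assuming the openai Python SDK; the test prompts are placeholders for examples drawn from your own agent traffic:

```python
from openai import OpenAI

client = OpenAI()

TEST_CASES = [  # replace with real prompts from your chatbot logs
    "Summarize this support ticket in one sentence: ...",
    "Extract the order ID from: 'my order #A1234 never arrived'",
]

for model in ["gpt-4o-mini-2024-07-18", "gpt-3.5-turbo"]:
    print(f"--- {model} ---")
    for prompt in TEST_CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
        )
        # eyeball the outputs, or score them against expected answers
        print(resp.choices[0].message.content)
```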