Feedback on GPT model selection methodology for entity extraction (cost vs latency vs quality)

Hi everyone,

I’m currently working on a Proof of Concept where I compare different OpenAI GPT models for a production feature called “Smart Search”.

The feature extracts structured entities (artist type, city, genres, budget) from natural language user input and outputs a fixed JSON schema. The main goal of my research is to find the best balance between:

  • Cost per request
  • Latency / response time
  • Output quality (correct entities, no hallucinations, consistent JSON)

My current approach includes:

  • Defining acceptance and quality criteria upfront (valid JSON, no invented entities, consistent output)
  • Testing multiple realistic prompt scenarios (simple queries, multiple filters, incomplete input, typos, synonyms)
  • Measuring cost, average latency, and extraction accuracy per model (see the sketch after this list)
  • Using multi-criteria decision making to select a model for MVP vs scaling
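
As a simplified illustration of what I mean by measuring cost, latency and accuracy per model (the scenario file, the "expected" field, and the exact-match check are placeholders, not my actual setup):

```python
import json
import time

from openai import OpenAI

client = OpenAI()

# Placeholder scenario file: each entry holds a user query and the expected entities.
with open("scenarios.json") as f:
    SCENARIOS = json.load(f)

MODELS = ["gpt-4.1", "gpt-4.1-mini"]


def extract(model: str, query: str) -> dict:
    """Call one model on one query; return parsed entities, latency, and token usage."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Extract artist_type, city, genres and budget as JSON."},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    return {
        "entities": json.loads(resp.choices[0].message.content),
        "latency_s": time.perf_counter() - start,
        "tokens_in": resp.usage.prompt_tokens,
        "tokens_out": resp.usage.completion_tokens,
    }


for model in MODELS:
    results = [extract(model, s["query"]) for s in SCENARIOS]
    # Exact-match accuracy is a simplification; per-field scoring is usually more informative.
    accuracy = sum(r["entities"] == s["expected"] for r, s in zip(results, SCENARIOS)) / len(SCENARIOS)
    avg_latency = sum(r["latency_s"] for r in results) / len(results)
    tokens_in = sum(r["tokens_in"] for r in results)
    tokens_out = sum(r["tokens_out"] for r in results)
    # Cost follows from tokens_in / tokens_out multiplied by each model's per-token pricing.
    print(model, f"accuracy={accuracy:.2%}", f"avg_latency={avg_latency:.2f}s",
          f"tokens_in={tokens_in}", f"tokens_out={tokens_out}")
```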

I’d really appreciate feedback on the following points:

  1. Is this a reasonable and realistic way to compare GPT models for entity extraction use-cases?
  2. Are there important metrics or risks I might be missing (e.g. stability over time, edge cases)?
  3. Would you recommend a different testing approach or additional scenarios?
  4. From your experience: do teams usually use one model for extraction, or a hybrid setup (cheap model first, fallback to stronger model)?

Any insights or best practices from real-world usage would be extremely valuable. Thanks in advance!

Named entity recognition (identifying and extracting entities) has been an API and instruction-tuned model task for a long time, since before you would simply chat with an AI about what you want.

You can solve your JSON consistency issue quite easily today with Structured Outputs: supply a strict JSON schema as the response format, and the API constrains the model to produce output that validates against it.
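
For example, a minimal sketch with the Chat Completions API and a strict JSON schema (field names taken from your question; the system prompt and query are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Illustrative strict schema for the "Smart Search" fields from the question.
# In strict mode, every property must appear in "required" and
# additionalProperties must be False.
schema = {
    "name": "smart_search_entities",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "artist_type": {"type": "string"},
            "city": {"type": "string"},
            "genres": {"type": "array", "items": {"type": "string"}},
            "budget": {"type": "number"},
        },
        "required": ["artist_type", "city", "genres", "budget"],
        "additionalProperties": False,
    },
}

completion = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "Extract booking entities from the user's query."},
        {"role": "user", "content": "Looking for a jazz trio in Amsterdam, budget around 800 euros"},
    ],
    response_format={"type": "json_schema", "json_schema": schema},
)

print(completion.choices[0].message.content)  # guaranteed to parse against the schema
```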

However, you must also prevent fields from being forced: offer null as an allowed output for a field, or use an alternate anyOf subschema for when there is nothing to report. Otherwise, you will indeed get hallucinated, useless values filled in.
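
In schema terms, that can look like either of these patterns (field names again illustrative):

```python
# Option 1: allow null via a union type on the field itself
city_field = {"type": ["string", "null"]}

# Option 2: an anyOf subschema for when there is nothing to report
budget_field = {
    "anyOf": [
        {"type": "number"},
        {"type": "null"},
    ]
}
```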

Another concern is multiple ambiguities in the data: sanitized inputs such as invoices have only one correct set of values, whereas arbitrary text can overload a single field, say "paper_author" or "author_institution", with several candidates. That is more about the synergy between the input and the response format.

This task usually needs little model knowledge, as long as you are not inferring deeper information that isn't presented in the input. Small "mini" models can do well; in fact, large frontier models may be overfitted toward answering from their own knowledge and toward their chat format.

Starting point: gpt-4.1, then grade gpt-4.1-mini's results against it to see how much performance you keep when you save on pricing (the savings come mostly from input tokens).