Feedback on GPT model selection methodology for entity extraction (cost vs latency vs quality)

Hi everyone,

I’m currently working on a Proof of Concept where I compare different OpenAI GPT models for a production feature called “Smart Search”.

The feature extracts structured entities (artist type, city, genres, budget) from natural language user input and outputs a fixed JSON schema. The main goal of my research is to find the best balance between:

  • Cost per request
  • Latency / response time
  • Output quality (correct entities, no hallucinations, consistent JSON)

My current approach includes:

  • Defining acceptance and quality criteria upfront (valid JSON, no invented entities, consistent output)
  • Testing multiple realistic prompt scenarios (simple queries, multiple filters, incomplete input, typos, synonyms)
  • Measuring cost, average latency, and extraction accuracy per model (see the sketch after this list)
  • Using multi-criteria decision making to select a model for MVP vs scaling
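
As a simplified illustration of what I mean by measuring cost, latency and accuracy per model (the scenario file, the "expected" field, and the exact-match check are placeholders, not my actual setup):

```python
import json
import time

from openai import OpenAI

client = OpenAI()

# Placeholder scenario file: each entry holds a user query and the expected entities.
with open("scenarios.json") as f:
    SCENARIOS = json.load(f)

MODELS = ["gpt-4.1", "gpt-4.1-mini"]


def extract(model: str, query: str) -> dict:
    """Call one model on one query; return parsed entities, latency, and token usage."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Extract artist_type, city, genres and budget as JSON."},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    return {
        "entities": json.loads(resp.choices[0].message.content),
        "latency_s": time.perf_counter() - start,
        "tokens_in": resp.usage.prompt_tokens,
        "tokens_out": resp.usage.completion_tokens,
    }


for model in MODELS:
    results = [extract(model, s["query"]) for s in SCENARIOS]
    # Exact-match accuracy is a simplification; per-field scoring is usually more informative.
    accuracy = sum(r["entities"] == s["expected"] for r, s in zip(results, SCENARIOS)) / len(SCENARIOS)
    avg_latency = sum(r["latency_s"] for r in results) / len(results)
    tokens_in = sum(r["tokens_in"] for r in results)
    tokens_out = sum(r["tokens_out"] for r in results)
    # Cost follows from tokens_in / tokens_out multiplied by each model's per-token pricing.
    print(model, f"accuracy={accuracy:.2%}", f"avg_latency={avg_latency:.2f}s",
          f"tokens_in={tokens_in}", f"tokens_out={tokens_out}")
```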

I’d really appreciate feedback on the following points:

  1. Is this a reasonable and realistic way to compare GPT models for entity extraction use-cases?
  2. Are there important metrics or risks I might be missing (e.g. stability over time, edge cases)?
  3. Would you recommend a different testing approach or additional scenarios?
  4. From your experience: do teams usually use one model for extraction, or a hybrid setup (cheap model first, fallback to stronger model)?

Any insights or best practices from real-world usage would be extremely valuable. Thanks in advance!

Named entity recognition (identifying and extracting entities) has been an API and instruction-tuned model task for a long time, since before you would simply chat with an AI about what you want.

You can solve your JSON consistency issue quite easily today with Structured Outputs: supply a strict JSON schema as the response format, and the API constrains the model to produce output that validates against it.
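
For example, a minimal sketch with the Chat Completions API and a strict JSON schema (field names taken from your question; the system prompt and query are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Illustrative strict schema for the "Smart Search" fields from the question.
# In strict mode, every property must appear in "required" and
# additionalProperties must be False.
schema = {
    "name": "smart_search_entities",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "artist_type": {"type": "string"},
            "city": {"type": "string"},
            "genres": {"type": "array", "items": {"type": "string"}},
            "budget": {"type": "number"},
        },
        "required": ["artist_type", "city", "genres", "budget"],
        "additionalProperties": False,
    },
}

completion = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "Extract booking entities from the user's query."},
        {"role": "user", "content": "Looking for a jazz trio in Amsterdam, budget around 800 euros"},
    ],
    response_format={"type": "json_schema", "json_schema": schema},
)

print(completion.choices[0].message.content)  # guaranteed to parse against the schema
```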

However, you must also prevent fields from being forced: offer null as an allowed output for a field, or use an alternate anyOf subschema for when there is nothing to report. Otherwise, you will indeed get hallucinated, useless values filled in.
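
In schema terms, that can look like either of these patterns (field names again illustrative):

```python
# Option 1: allow null via a union type on the field itself
city_field = {"type": ["string", "null"]}

# Option 2: an anyOf subschema for when there is nothing to report
budget_field = {
    "anyOf": [
        {"type": "number"},
        {"type": "null"},
    ]
}
```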

Another concern is multiple ambiguities in the data: sanitized inputs such as invoices have only one correct set of values, whereas arbitrary text can overload a single field, say "paper_author" or "author_institution", with several candidates. That is more about the synergy between the input and the response format.

This task usually needs little model knowledge, as long as you are not inferring deeper information that isn't presented in the input. Small "mini" models can do well; in fact, large frontier models may be overfitted toward answering from their own knowledge and toward their chat format.

Starting point: gpt-4.1, then grade gpt-4.1-mini's results against it to see how much performance you keep when you save on pricing (the savings come mostly from input tokens).