Hi everyone,
I’m currently working on a proof of concept in which I compare different OpenAI GPT models for a production feature called “Smart Search”.
The feature extracts structured entities (artist type, city, genres, budget) from natural-language user input and returns them in a fixed JSON schema (illustrative example below). The main goal of my research is to find the best balance between:
- Cost per request
- Latency / response time
- Output quality (correct entities, no hallucinations, consistent JSON)
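For context, a single extraction result is expected to look roughly like this. The field names and values are illustrative placeholders, not the exact production schema:

```python
# Illustrative target output for the fixed JSON schema (field names are placeholders).
# Example query: "Find me a jazz or soul band in Amsterdam for under 1500 euros"
expected_output = {
    "artist_type": "band",
    "city": "Amsterdam",
    "genres": ["jazz", "soul"],
    "budget": 1500,
}
```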
My current approach includes:
- Defining acceptance and quality criteria upfront (valid JSON, no invented entities, consistent output)
- Testing multiple realistic prompt scenarios (simple queries, multiple filters, incomplete input, typos, synonyms)
- Measuring cost, average latency, and extraction accuracy per model (rough evaluation sketch after this list)
- Using multi-criteria decision making to select a model for MVP vs scaling
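To make the measurement part concrete, this is roughly what my evaluation loop looks like. It is a simplified sketch that assumes the official `openai` Python SDK (v1+), a small hand-labelled test set, manually filled-in per-token prices, and placeholder model names; none of this is meant as a finished harness:

```python
import json
import time

from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()

# Hand-labelled test cases: (user query, expected entities). Values are illustrative.
TEST_CASES = [
    ("Find a jazz band in Amsterdam under 1500 euros",
     {"artist_type": "band", "city": "Amsterdam", "genres": ["jazz"], "budget": 1500}),
    # ... more scenarios: multiple filters, incomplete input, typos, synonyms
]

# Price per 1M tokens (input, output); fill these in from the official pricing page.
PRICES = {"gpt-4o-mini": (0.0, 0.0), "gpt-4o": (0.0, 0.0)}

SYSTEM_PROMPT = ("Extract artist_type, city, genres and budget from the user query. "
                 "Respond with JSON only.")

def evaluate(model: str) -> dict:
    """Run all test cases against one model and aggregate cost, latency and accuracy."""
    latencies, correct, cost = [], 0, 0.0
    for query, expected in TEST_CASES:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": SYSTEM_PROMPT},
                      {"role": "user", "content": query}],
            response_format={"type": "json_object"},
            temperature=0,
        )
        latencies.append(time.perf_counter() - start)

        in_price, out_price = PRICES[model]
        cost += (resp.usage.prompt_tokens * in_price +
                 resp.usage.completion_tokens * out_price) / 1_000_000

        try:
            parsed = json.loads(resp.choices[0].message.content)
            correct += int(parsed == expected)  # strict match; per-field accuracy is an alternative
        except json.JSONDecodeError:
            pass  # invalid JSON counts as a failed extraction

    return {
        "model": model,
        "accuracy": correct / len(TEST_CASES),
        "avg_latency_s": sum(latencies) / len(latencies),
        "total_cost_usd": cost,
    }

def weighted_score(m: dict, w_acc=0.5, w_lat=0.25, w_cost=0.25) -> float:
    """Toy multi-criteria score; weights are my own assumption.
    In practice I would normalise each metric to [0, 1] across models before weighting."""
    return w_acc * m["accuracy"] - w_lat * m["avg_latency_s"] - w_cost * m["total_cost_usd"]
```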
I’d really appreciate feedback on the following points:
- Is this a reasonable and realistic way to compare GPT models for entity extraction use cases?
- Are there important metrics or risks I might be missing (e.g. stability over time, edge cases)?
- Would you recommend a different testing approach or additional scenarios?
- From your experience: do teams usually use one model for extraction, or a hybrid setup (cheap model first, fallback to a stronger model)? A rough sketch of what I have in mind follows below.
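For that last question, the hybrid setup I have in mind looks roughly like this. The model names, the validation rule, and the fallback condition are placeholders, not a tested implementation:

```python
import json

from openai import OpenAI  # assumes openai>=1.0

client = OpenAI()

REQUIRED_KEYS = {"artist_type", "city", "genres", "budget"}

def extract(query: str, model: str) -> dict | None:
    """Single extraction attempt; returns None if the output is not valid against the schema."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system",
                   "content": "Extract artist_type, city, genres and budget as JSON."},
                  {"role": "user", "content": query}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    try:
        data = json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return None
    # Reject outputs that don't contain the fixed set of keys instead of trusting them.
    return data if REQUIRED_KEYS.issubset(data) else None

def smart_search_entities(query: str) -> dict | None:
    # Cheap model first; escalate to the stronger model only when validation fails.
    return extract(query, "gpt-4o-mini") or extract(query, "gpt-4o")
```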
Any insights or best practices from real-world usage would be extremely valuable. Thanks in advance!