Hello everyone,
I’m currently running a proof-of-concept study to evaluate OpenAI models for a Smart Search feature in an event-matching platform.
The feature extracts structured entities (such as artist type, city, genre, and budget) from natural language user queries.
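For context, here is a simplified illustration of the kind of structured output I’m targeting; the query, field names, and values are made-up placeholders, not my final schema:

```python
# Hypothetical example: the entities I would expect for one user query.
# Query: "Looking for a jazz band in Amsterdam, budget around 1500 euros"
expected_entities = {
    "artist_type": "band",
    "genre": "jazz",
    "city": "Amsterdam",
    "budget": 1500,
}
```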
My goal is to compare different OpenAI models in a methodical and reproducible way, focusing on the points below (a rough sketch of how I plan to measure them follows the list):

- Cost per request
- Latency / response time
- Output quality (correct entities, consistent and valid JSON)
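To make the comparison concrete, this is a minimal sketch of the measurement loop I have in mind. It assumes the current openai Python SDK (v1+) with the standard chat.completions API; the model names, pricing values, and system prompt are placeholders I would swap for the real ones:

```python
import json
import time
from openai import OpenAI  # assumes the openai Python SDK, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder per-1K-token prices (input, output) in USD -- substitute current official pricing.
PRICING = {"gpt-4o-mini": (0.00015, 0.0006), "gpt-4o": (0.0025, 0.01)}

SYSTEM_PROMPT = (
    "Extract artist_type, city, genre and budget from the user query. "
    "Reply with JSON only."
)

def run_case(model: str, query: str) -> dict:
    """Run one query against one model and record latency, cost, and JSON validity."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep runs as deterministic as the API allows
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    latency = time.perf_counter() - start

    # Check that the model returned parseable JSON.
    try:
        entities = json.loads(resp.choices[0].message.content)
        valid_json = True
    except json.JSONDecodeError:
        entities, valid_json = None, False

    # Estimate cost from reported token usage and the placeholder price table.
    in_price, out_price = PRICING[model]
    cost = (
        resp.usage.prompt_tokens / 1000 * in_price
        + resp.usage.completion_tokens / 1000 * out_price
    )

    return {
        "model": model,
        "latency_s": latency,
        "cost_usd": cost,
        "valid_json": valid_json,
        "entities": entities,
    }
```

My plan is to run each scenario several times per model and average, so that latency variance doesn’t dominate a single measurement.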
I already plan to test multiple realistic user scenarios (simple queries, incomplete input, spelling mistakes, etc.), but I want to make sure the research setup itself is solid.
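As an illustration, this is the kind of scenario set I have in mind; the queries and expected values are invented for the example:

```python
# Hypothetical test scenarios: each pairs a user query with the entities I expect back.
SCENARIOS = [
    {
        "name": "simple",
        "query": "Find a jazz band in Amsterdam for about 1500 euros",
        "expected": {"artist_type": "band", "genre": "jazz", "city": "Amsterdam", "budget": 1500},
    },
    {
        "name": "incomplete",
        "query": "Need a DJ next month",
        "expected": {"artist_type": "DJ", "genre": None, "city": None, "budget": None},
    },
    {
        "name": "typos",
        "query": "lookin for a rok band in Rotterdm, budgt 800",
        "expected": {"artist_type": "band", "genre": "rock", "city": "Rotterdam", "budget": 800},
    },
]
```

Comparing the returned entities field by field against `expected` would give per-field accuracy, which I intend to aggregate per model.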
My questions are:

- What is a good methodical approach to comparing OpenAI models for this type of task?
- Which metrics or evaluation techniques are most useful for entity-extraction / smart-search use cases?
- How do you ensure results are fair, repeatable, and not prompt-biased?
- Are there common pitfalls to avoid when benchmarking LLMs for production use?
Any advice, frameworks, or references would really help improve the quality of my research.
Thanks in advance!