The OpenAI Evals product now lets you evaluate tool use! You can use tools and Structured Outputs in generations run through both the API and the web platform, then evaluate each tool call based on the arguments it received and the response it returned.
This works with OpenAI-hosted tools, MCP tools, and non-hosted tools.
Please let me know if you have any feedback on this!
You can see examples of this in several new cookbooks, for:
- Web search evaluation: Evals API Use-case - Web Search Evaluation
- Tool evaluation: Evals API Use-case - Tools Evaluation
- MCP evaluation: Evals API Use-case - MCP Evaluation
- Structured output evaluation: Evals API Use-case - Structured Outputs Evaluation
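
If it helps to picture what "evaluating a tool call based on its arguments" means, here is a rough local sketch in Python. It does not call the Evals API and is not how the product implements grading. The `grade_tool_call` helper, the `get_weather` tool, and the sample call are all hypothetical, and the dict shape (a `name` plus a JSON string in `arguments`) is an assumption about what a function-style tool call might look like. The cookbooks above show the real setup.

```python
import json


def grade_tool_call(tool_call, expected_name, expected_args):
    """Hypothetical grader: pass if the call targets the expected tool
    and every expected argument is present with the expected value."""
    if tool_call.get("name") != expected_name:
        return False
    try:
        # Assumption: arguments arrive as a JSON-encoded string.
        actual_args = json.loads(tool_call.get("arguments", "{}"))
    except (TypeError, json.JSONDecodeError):
        return False
    if not isinstance(actual_args, dict):
        return False
    # Only the expected keys are checked; extra arguments are tolerated.
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected_args.items()
    )


# Hypothetical tool call, as a model might emit it.
sample_call = {
    "type": "function_call",
    "name": "get_weather",
    "arguments": '{"city": "Paris", "unit": "celsius"}',
}

print(grade_tool_call(sample_call, "get_weather", {"city": "Paris"}))
print(grade_tool_call(sample_call, "get_weather", {"city": "Rome"}))
```

The check here is deliberately lenient about extra arguments. A stricter grader could compare the full argument dict for equality instead.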