Evaluations and Chat Completions: need support for tool use and image input

Hey OpenAI team and community! We have been trying out Evals and Chat Completions storage from the OpenAI platform dashboard for our eval use case, and we ran into a few issues that made it unusable for us.

When attempting to run evaluations, we noticed that:

  • The dashboard imports only the system and user prompts, ignoring both the tools specified on the request and any image input.
  • The assistant’s outputs produced via function calls are not captured during the evaluation, resulting in empty outputs and, consequently, failing tests. (A sketch of the request shape we are storing follows this list.)
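
For context, this is roughly how we store the completions. This is a minimal sketch using the standard OpenAI Python SDK; the tool name, parameter schema, and image URL are hypothetical placeholders, not our actual setup:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool definition, just to illustrate the request shape.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # placeholder name
        "description": "Look up the status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    store=True,  # persist the completion so it appears in the dashboard
    tools=tools,
    messages=[
        {"role": "system", "content": "You are an order-support assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the status of the order in this screenshot?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/order.png"}},
            ],
        },
    ],
)
```

When a stored completion like this is imported into an eval, only the system prompt and the user text seem to survive; the `tools` array and the `image_url` content part appear to be dropped.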

Interestingly, when we switch to a different model (e.g., GPT-4) and let it generate responses on the fly for the same inputs, the evaluations produce outputs and pass the tests. In other words, the evals capture only the assistant's text content and ignore the tool-use output.
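
Our working theory, illustrated against the hypothetical response above: on a tool-call turn the SDK returns `None` for `message.content` and puts the model's answer in `message.tool_calls`, so an eval that reads only the content field sees an empty output.

```python
choice = response.choices[0]

# On a tool-call turn, the assistant's text content is typically None;
# the actual answer lives in the tool_calls list instead.
print(choice.message.content)  # -> None when the model chose to call a tool

if choice.message.tool_calls:
    call = choice.message.tool_calls[0]
    print(call.function.name)       # e.g. "get_order_status"
    print(call.function.arguments)  # JSON-encoded argument string
```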

  • Is there a current plan to support tools and function calling in the evaluations dashboard?
  • If not presently supported, do you have an estimated timeline for when this feature might be available?