Benchmark & Evaluation Frameworks for Assistants

I’ve been searching for some time for a decent LLM evaluation framework to use with OpenAI Assistants. My favorite so far has been deepeval, but I’m still looking.
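
For context, here’s roughly the pattern I’ve been trying so far: call the Assistant through the Assistants API, then score the reply with a deepeval metric. Treat it as a minimal sketch rather than anything polished; the assistant ID, the sample question, and the 0.7 threshold are all placeholders.

```python
import time

from openai import OpenAI
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder: your Assistant's ID


def ask_assistant(question: str) -> str:
    """Send one question to the Assistant and return its reply text."""
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=question
    )
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=ASSISTANT_ID
    )
    # Poll until the run finishes; streaming isn't needed for evaluation
    while run.status in ("queued", "in_progress"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    messages = client.beta.threads.messages.list(thread_id=thread.id, order="desc")
    return messages.data[0].content[0].text.value


def test_answer_relevancy():
    question = "How do I reset my password?"  # placeholder test input
    test_case = LLMTestCase(
        input=question,
        actual_output=ask_assistant(question),
    )
    # LLM-as-judge metric; by default it calls an OpenAI model, so it needs
    # OPENAI_API_KEY in the environment
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

I’ve kept it as a pytest-style test (the `test_` prefix) so the same snippet can slot into CI later, which is what question 2 below is about.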

Some questions:

  1. Has anyone had success evaluating their Assistants with a particular framework? If so, can you share code examples?
  2. If you have an Assistant in production, do you have a CI/CD process for evaluating user inputs against the Assistant’s responses, and how frequently do you run it? (A sketch of what I’m imagining follows below.)
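
To make question 2 concrete, this is the rough shape of the CI step I have in mind: a pytest file over a golden set of sampled user inputs, run on a schedule or on each deploy. It’s only a sketch; the `eval/golden_set.jsonl` path and fields, and the `eval_helpers` module wrapping `ask_assistant` from the snippet above, are just my own conventions.

```python
# test_assistant_regression.py -- intended to run from CI (nightly or on deploy)
import json

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

from eval_helpers import ask_assistant  # hypothetical module holding the helper above

# Golden set: sampled, anonymized production inputs; path and format are my own convention
with open("eval/golden_set.jsonl") as f:
    GOLDEN_SET = [json.loads(line) for line in f]


@pytest.mark.parametrize("case", GOLDEN_SET)
def test_against_golden_set(case):
    test_case = LLMTestCase(
        input=case["input"],
        actual_output=ask_assistant(case["input"]),
    )
    # Fails the CI job whenever relevancy drops below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

The plan would be to run it with `deepeval test run test_assistant_regression.py` (or plain `pytest`) from a scheduled pipeline, but I have no feel for what cadence or sample size is sensible in practice, hence the question.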

Thanks for any help.