I’ve been searching for some time for a decent LLM evaluation framework to use with OpenAI Assistants. My favorite so far has been deepeval, but I’m still looking around.
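To give a sense of what I mean, here’s roughly the kind of check I’ve been running with deepeval against one of my Assistants. This is just a minimal sketch: the assistant ID, question, and metric threshold are placeholders, and it assumes the current openai Python SDK.

```python
# Minimal sketch: ask an Assistant a question, then score the reply with deepeval.
# ASSISTANT_ID, the question, and the threshold are placeholders.
from openai import OpenAI
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

client = OpenAI()
ASSISTANT_ID = "asst_..."  # replace with your Assistant's ID

def ask_assistant(question: str) -> str:
    # Create a thread, post the user question, and run the Assistant to completion.
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=question
    )
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=ASSISTANT_ID
    )
    # Grab the message(s) produced by this run; the newest comes first.
    messages = client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id)
    return messages.data[0].content[0].text.value

question = "What is your refund policy?"
test_case = LLMTestCase(input=question, actual_output=ask_assistant(question))
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.7)])
```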
Some questions:
- Has anyone had success evaluating their Assistants with a particular framework? If so, can you share code examples?
- If you have an Assistant in production, do you have a CI/CD process for evaluating user inputs against the Assistant's responses, and how often do you run it? Something like the pytest sketch below is roughly what I had in mind, but I don't know what people actually do in practice.
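For the CI/CD question, this is the kind of thing I've been imagining: a small set of golden questions run as a pytest suite that deepeval can execute, triggered on deploy or on a nightly cron in whatever CI system you use. Again, a sketch only; the questions, threshold, and the `my_assistant` module are made up for illustration.

```python
# Sketch of a CI evaluation suite: parametrize over a few golden questions
# and assert an answer-relevancy threshold on each Assistant response.
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from my_assistant import ask_assistant  # hypothetical module wrapping the helper above

GOLDEN_QUESTIONS = [
    "What is your refund policy?",
    "How do I reset my password?",
]

@pytest.mark.parametrize("question", GOLDEN_QUESTIONS)
def test_assistant_relevancy(question):
    answer = ask_assistant(question)
    assert_test(
        LLMTestCase(input=question, actual_output=answer),
        [AnswerRelevancyMetric(threshold=0.7)],
    )
```

Something like `deepeval test run test_assistant.py` in the pipeline, run nightly or on every deploy, is what I'm picturing, but I'd love to hear what frequency works for people in production.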
Thanks for any help.