Benchmark & Evaluation Frameworks for Assistants

I’ve been searching for some time for a decent LLM evaluation framework to use with OpenAI Assistants. My favorite so far has been deepeval, but I’m still looking.
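
For context, here’s roughly the pattern I’ve been trying so far: call the Assistant through the Assistants API, then score the reply with a deepeval metric. Treat it as a minimal sketch rather than anything polished; the assistant ID, the sample question, and the 0.7 threshold are all placeholders.

```python
import time

from openai import OpenAI
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder: your Assistant's ID


def ask_assistant(question: str) -> str:
    """Send one question to the Assistant and return its reply text."""
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=question
    )
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=ASSISTANT_ID
    )
    # Poll until the run finishes; streaming isn't needed for evaluation
    while run.status in ("queued", "in_progress"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    messages = client.beta.threads.messages.list(thread_id=thread.id, order="desc")
    return messages.data[0].content[0].text.value


def test_answer_relevancy():
    question = "How do I reset my password?"  # placeholder test input
    test_case = LLMTestCase(
        input=question,
        actual_output=ask_assistant(question),
    )
    # LLM-as-judge metric; by default it calls an OpenAI model, so it needs
    # OPENAI_API_KEY in the environment
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

I’ve kept it as a pytest-style test (the `test_` prefix) so the same snippet can slot into CI later, which is what question 2 below is about.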

Some questions:

  1. Has anyone had success evaluating their Assistants with a particular framework? If so, can you share code examples?
  2. If you have an Assistant in production, do you have a CI/CD process for evaluating user inputs against the Assistant’s responses, and how frequently do you run it? (A sketch of what I’m imagining follows below.)
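
To make question 2 concrete, this is the rough shape of the CI step I have in mind: a pytest file over a golden set of sampled user inputs, run on a schedule or on each deploy. It’s only a sketch; the `eval/golden_set.jsonl` path and fields, and the `eval_helpers` module wrapping `ask_assistant` from the snippet above, are just my own conventions.

```python
# test_assistant_regression.py -- intended to run from CI (nightly or on deploy)
import json

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

from eval_helpers import ask_assistant  # hypothetical module holding the helper above

# Golden set: sampled, anonymized production inputs; path and format are my own convention
with open("eval/golden_set.jsonl") as f:
    GOLDEN_SET = [json.loads(line) for line in f]


@pytest.mark.parametrize("case", GOLDEN_SET)
def test_against_golden_set(case):
    test_case = LLMTestCase(
        input=case["input"],
        actual_output=ask_assistant(case["input"]),
    )
    # Fails the CI job whenever relevancy drops below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

The plan would be to run it with `deepeval test run test_assistant_regression.py` (or plain `pytest`) from a scheduled pipeline, but I have no feel for what cadence or sample size is sensible in practice, hence the question.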

Thanks for any help.