I’m looking for best practices for building evals for MCP apps. So far, we have used the Evals API to evaluate tool descriptions individually, but for a more complex app with multi-turn tools, like the Memories Video Editor, we are searching for a way to evaluate the entire MCP tool set as a whole.
What have you found works so far?
I’m thinking about concatenating all of the descriptions together, with a mock schema for each tool, to replicate the instructions ChatGPT receives when it discovers the app via its api_tool.list_resources. From what I’ve seen, the format looks something like this:
═══════════════════════════════════════════════════════════════════════════════
🎬 MEMORIES VIDEO EDITOR - Tool Suite Overview
═══════════════════════════════════════════════════════════════════════════════
...... (tool suite instructions)
═══════════════════════════════════════════════════════════════════════════════
// Tool description, with each line
// prefixed with the // comment marker.
type /Memories Video Editor/link_XYZ/tool_name = (_: {
// input schema
...
}) => any;
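To make the idea concrete, here is a minimal sketch of the concatenation step. The function names, tool data, and schema shape below are hypothetical illustrations, not an official API; it only reproduces the rendered format shown above under the assumption that each tool is described by a name, a description, and a flat field-to-type mapping.

```python
# Sketch: render a tool list into the concatenated ChatGPT-style format above.
# All names and data shapes here are assumptions for illustration.

def render_tool(app_name: str, link_id: str, name: str,
                description: str, schema_fields: dict[str, str]) -> str:
    """Render one tool: //-commented description plus a TypeScript-style type."""
    lines = [f"// {line}" for line in description.splitlines()]
    lines.append(f"type /{app_name}/{link_id}/{name} = (_: {{")
    for field, field_type in schema_fields.items():
        lines.append(f"  {field}: {field_type};")  # mock input schema entry
    lines.append("}) => any;")
    return "\n".join(lines)

def render_suite(app_name: str, link_id: str, overview: str,
                 tools: list[tuple[str, str, dict[str, str]]]) -> str:
    """Concatenate the suite overview and every tool into one instruction block."""
    bar = "═" * 79
    parts = [bar, overview, bar]
    parts += [render_tool(app_name, link_id, n, d, s) for (n, d, s) in tools]
    return "\n".join(parts)

# Example usage with a made-up tool:
suite = render_suite(
    "Memories Video Editor", "link_XYZ",
    "🎬 MEMORIES VIDEO EDITOR - Tool Suite Overview",
    [("trim_clip", "Trim a clip to a time range.",
      {"start_s": "number", "end_s": "number"})],
)
```

The output of `render_suite` would then be the single string handed to whichever eval framework we pick.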
Once concatenated, the entire instruction set would be sent to an eval framework such as the Evals API, Promptfoo, or DeepEval (still deciding which one to use).
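For the multi-turn side, one framework-agnostic check we could run regardless of which eval harness we pick is comparing the sequence of tool calls the model actually made against the expected sequence for a scenario. The transcript shape and helper names below are assumptions for illustration, not part of any of the frameworks mentioned:

```python
# Sketch of a whole-suite, multi-turn check: given a transcript of turns,
# verify the model called the expected tools in the expected order.
# The transcript format is an assumed shape, not a framework API.

def tool_call_sequence(transcript: list[dict]) -> list[str]:
    """Extract ordered tool names from assistant turns that invoked a tool."""
    return [turn["tool"] for turn in transcript
            if turn.get("role") == "assistant" and "tool" in turn]

def check_scenario(transcript: list[dict], expected: list[str]) -> bool:
    """Pass only if the tool-call sequence matches the scenario exactly."""
    return tool_call_sequence(transcript) == expected

# Example scenario for a hypothetical editing request:
transcript = [
    {"role": "user", "content": "trim the intro of my vacation video"},
    {"role": "assistant", "tool": "open_project"},
    {"role": "assistant", "tool": "trim_clip"},
]
ok = check_scenario(transcript, ["open_project", "trim_clip"])
```

Per-tool description quality could still be scored separately; this kind of sequence check is just one way to exercise the tool set as a whole.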
I’m curious whether you’ve found a simpler way to do this, in hopes of not reinventing the wheel.