Prompt Evaluations at Scale for Production

Hi everyone,

I'm wondering how everyone compares their prompts when building for production. For example, say you make a small change to your prompt and want to see how it behaves overall, so you need to test it n times. Aside from writing a script, what are the alternatives? I've heard of solutions like LangSmith, but I'm not sure how useful these tools are or how widely they're used.
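For reference, by "writing a script" I mean something roughly like the sketch below: run each prompt variant over the same test inputs and collect the outputs to compare. (A rough sketch assuming the official openai Python SDK; the model name, prompts, and test inputs are placeholders.)

```python
# Rough sketch of the "just write a script" baseline: run two prompt
# variants over the same test inputs and collect outputs for comparison.
# Assumes the openai Python SDK; model name and inputs are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_V1 = "Summarize the following support ticket in one sentence:\n\n{ticket}"
PROMPT_V2 = "You are a support analyst. Summarize this ticket in one sentence:\n\n{ticket}"

test_inputs = [
    "Customer cannot reset their password after the last release.",
    "Billing page shows the wrong currency for EU accounts.",
]

def run(prompt_template: str, ticket: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt_template.format(ticket=ticket)}],
        temperature=0,
    )
    return resp.choices[0].message.content

for name, template in [("v1", PROMPT_V1), ("v2", PROMPT_V2)]:
    for ticket in test_inputs:
        print(name, "->", run(template, ticket))
```

That works for a quick check, but scoring, cost tracking, and comparing across versions gets tedious fast, which is why I'm asking about tooling.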

Thanks,
Raivat

Hey @raivat1, we are building getmaxim.ai. Its experimentation suite is designed for exactly this kind of prompt engineering workflow, helping you iterate on prompts rapidly and systematically.

You can:

  1. Test, iterate, manage, and version prompts. You can organize prompts in folders and sub-folders and attach tags to them. You can also version prompts with custom descriptions, making it easy to track changes and compare across versions.

  2. Run side-by-side bulk experiments in the playground across different permutations and combinations of prompts and models to identify the right prompt-model combination for your use case.

  3. Run tests on large test suites and simplify decision-making by comparing output quality, cost, and latency across different combinations of prompts, models, and model parameters (the first sketch after this list shows what that comparison boils down to).

  4. Deploy prompts with different deployment variables and experimentation strategies without any code changes, enabling teams to run prompt A/B tests seamlessly (the second sketch below illustrates the idea).
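To make point 3 concrete without tying it to any particular SDK, the test-suite runs boil down to a grid of prompt × model combinations scored on quality, cost, and latency. Here is a generic sketch; `call_model` and `score` are hypothetical stand-ins, not Maxim APIs:

```python
# Generic sketch of comparing prompt/model combinations on quality, cost,
# and latency over a small test suite. call_model() and score() are
# hypothetical stand-ins for whatever LLM client and evaluator you use.
import time
from itertools import product

prompts = {"v1": "Summarize: {text}", "v2": "Summarize in one sentence: {text}"}
models = ["model-a", "model-b"]  # placeholder model names
test_suite = [{"text": "Long support ticket ...", "expected": "short summary"}]

def call_model(model: str, prompt: str) -> tuple[str, float]:
    # Stand-in for a real LLM client call; replace with your provider's SDK.
    return f"[{model}] short summary of the ticket", 0.0002  # (output, cost in USD)

def score(output: str, expected: str) -> float:
    # Stand-in evaluator: exact match, embedding similarity, or an LLM judge.
    return float(expected.lower() in output.lower())

results = []
for (p_name, template), model in product(prompts.items(), models):
    quality, cost, latency = 0.0, 0.0, 0.0
    for case in test_suite:
        start = time.perf_counter()
        output, case_cost = call_model(model, template.format(**case))
        latency += time.perf_counter() - start
        cost += case_cost
        quality += score(output, case["expected"])
    n = len(test_suite)
    results.append((p_name, model, quality / n, cost, latency / n))

# Rank combinations by average quality; cost and latency break ties in practice.
for row in sorted(results, key=lambda r: -r[2]):
    print(row)
```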
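And for point 4, prompt A/B testing underneath is just deterministic assignment of traffic to prompt variants keyed on a deployment variable. A generic illustration (not our SDK; the variant names, weights, and registry are hypothetical):

```python
# Generic illustration of prompt A/B assignment: hash a stable key
# (e.g. a user id) into a bucket and serve the prompt variant mapped to it.
# Variant names, weights, and the prompt registry are hypothetical.
import hashlib

PROMPT_VARIANTS = {
    "control": "Summarize the ticket: {ticket}",
    "candidate": "You are a support analyst. Summarize the ticket: {ticket}",
}
SPLIT = {"control": 0.5, "candidate": 0.5}

def assign_variant(user_id: str) -> str:
    # Deterministic mapping so a given user always sees the same prompt variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100 / 100
    cumulative = 0.0
    for variant, weight in SPLIT.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    return "control"

print(assign_variant("user-123"), assign_variant("user-456"))
```

A managed deployment layer moves the variant mapping and weights out of code into configuration, which is what lets you change the split without a deploy.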