Prompt Evaluations at Scale for Production

Hi everyone,

I'm wondering how everyone compares prompts when building for production. For example, say you make a small change to a prompt and want to see how it behaves overall, which means running it n times. Other than writing a script, what are the alternatives? I've heard of tools like LangSmith, but I'm not sure how useful they are or how widely they're used.
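
For context, a minimal sketch of the script-based baseline might look like the following. Everything in it is an assumption for illustration: `call_model` is a hypothetical stand-in for a real LLM client (here it just simulates noisy output and ignores the prompt wording), and `passes` is a placeholder check.

```python
import random
from typing import Callable


def call_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a real LLM call (an assumption, not a real API)."""
    # Simulate non-deterministic output: roughly 80% of calls end in "ok".
    return f"{prompt} -> ok" if rng.random() < 0.8 else f"{prompt} -> unclear"


def pass_rate(prompt: str, n: int, score: Callable[[str], bool], seed: int = 0) -> float:
    """Run the prompt n times and return the fraction of outputs that pass `score`."""
    rng = random.Random(seed)
    outputs = [call_model(prompt, rng) for _ in range(n)]
    return sum(score(out) for out in outputs) / n


def passes(output: str) -> bool:
    # Placeholder check; a real one might validate JSON or compare to a reference answer.
    return output.endswith("ok")


if __name__ == "__main__":
    variants = [
        ("old", "Summarize this text"),
        ("new", "Summarize this text in one line"),
    ]
    for name, prompt in variants:
        print(name, pass_rate(prompt, n=50, score=passes))
```

Swapping `call_model` for a real client and `passes` for a task-specific check would turn this into a basic A/B comparison between two prompt versions.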

Thanks,
Raivat