I have a bot here that I’m measuring for reliability (not accuracy). Before I continue: yes, I understand that at temperature 0 it will be basically deterministic and have a very high reliability rate. We know this as developers, but I have to demonstrate it to anyone who might be interested in what I’m building.
The bot takes in different transcriptions and scores them on different metrics.
So how many tests is enough to say “okay, this is reliable”? 100? 200? What’s a good sample size?
Sorry if I’m not putting this in a more educated way.
The way I’m testing is by running the same audio transcript through the bot over and over and comparing the spread in the scores.
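For what it’s worth, here’s roughly what my test loop looks like (a simplified sketch; `score_transcript` is a stand-in for the actual bot call, which returns a dict of metric name to numeric score):

```python
import statistics

def score_transcript(transcript: str) -> dict[str, float]:
    # Placeholder: replace with the real bot call that returns
    # {"metric_name": score, ...} for one transcript.
    raise NotImplementedError("replace with the real bot call")

def measure_reliability(transcript: str, n_runs: int = 100) -> dict[str, dict[str, float]]:
    """Run the same transcript n_runs times and summarize the spread per metric."""
    runs = [score_transcript(transcript) for _ in range(n_runs)]
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        stdev = statistics.stdev(values)
        summary[metric] = {
            "mean": statistics.mean(values),
            "stdev": stdev,
            # Standard error shrinks like 1/sqrt(n), so doubling from
            # 100 to 200 runs only tightens the estimate by about 30%.
            "std_error": stdev / (n_runs ** 0.5),
        }
    return summary
```

That 1/sqrt(n) relationship is also a rough answer to the “100 or 200?” question: past a point, more runs buy you less and less precision per run.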
Braintrust is a great tool/platform for running evaluations like this. You log to it using the Python/TypeScript library and then get a web UI with an experiment view for digging into which test cases improved or got worse. Makes life so much easier. You can also use it to manage test cases, datasets, and a prompt playground. It’s free to use right now: https://braintrustdata.com/
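If it helps, the basic Python usage looks something like this (a sketch based on the quick-start pattern; `score_transcript` and the project name are placeholders, and the exact API details may have changed, so check the current docs):

```python
from braintrust import Eval
from autoevals import NumericDiff

def score_transcript(transcript: str) -> float:
    # Placeholder for the bot being evaluated.
    return 0.8

Eval(
    "transcript-scoring-reliability",  # project name shown in the Braintrust UI
    data=lambda: [
        # Same transcript repeated, matching the reliability setup above.
        {"input": "same audio transcript, run 1", "expected": 0.8},
        {"input": "same audio transcript, run 2", "expected": 0.8},
    ],
    task=score_transcript,             # the function under test
    scores=[NumericDiff],              # numeric-closeness scorer from autoevals
)
```

Each run shows up as an experiment in the web UI, so you can compare score spread across runs instead of eyeballing raw outputs.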