How to test an API built on GPT?

I built an API that generates responses based on the responses received from GPT. It’s a simple API that builds a prompt from the submitted parameters.

API + params → prompt building → GPT → JSON response from GPT → building the API response
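A minimal sketch of that pipeline, assuming a setup like yours (all function names and the prompt format here are hypothetical; the model call is injected as a plain callable so it can be stubbed out, rather than using any particular client library):

```python
import json

def build_prompt(params):
    # Hypothetical prompt builder: turn the submitted parameters into
    # an instruction asking the model for a JSON response.
    return (
        f"Summarize the topic '{params['topic']}' in {params['sentences']} "
        'sentences. Reply with JSON: {"summary": "..."}'
    )

def handle_request(params, call_model):
    # call_model is injected so the GPT call can be swapped for a stub
    # in tests; in production it would wrap the real API client.
    prompt = build_prompt(params)
    raw = call_model(prompt)          # the model's raw reply
    data = json.loads(raw)            # JSON response from GPT
    return {"status": "ok", "summary": data["summary"]}

# Stubbed model call, so the sketch runs without network access:
fake_gpt = lambda prompt: '{"summary": "A short summary."}'
print(handle_request({"topic": "apis", "sentences": 2}, fake_gpt))
```

Keeping the model call injectable like this is what later makes the pipeline testable at all.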

I am testing several different approaches, looking for the solution that gives the best results.

Unfortunately, it is difficult to assess the results of these changes. Currently, I generate and evaluate the results manually.

I would like to write a set of tests, specifying input → expected output. I think that with such a solution I could more easily assess the quality of the changes made.
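A minimal sketch of what such a test set could look like (all names here are hypothetical): since exact input → expected output matching is usually too brittle with LLM text, each case pairs the input with a predicate on the output instead.

```python
# Tiny eval harness: each case pairs input parameters with a check on
# the output, since exact-match on LLM text is usually too brittle.
def run_evals(cases, generate):
    results = []
    for params, check in cases:
        output = generate(params)
        results.append((params, output, check(output)))
    passed = sum(1 for _, _, ok in results if ok)
    return passed, len(results), results

# Hypothetical cases: a predicate per input instead of one exact string.
cases = [
    ({"topic": "apis"}, lambda out: "api" in out.lower()),
    ({"topic": "testing"}, lambda out: len(out) > 0),
]

# Stub generator standing in for the real prompt -> GPT -> parse chain:
stub = lambda params: f"Notes about {params['topic']}."
passed, total, _ = run_evals(cases, stub)
print(f"{passed}/{total} cases passed")
```

Running this against two prompt variants and comparing the pass counts would give a rough quality signal for a change.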

What solution do you use?
Is there a tool or library for testing the answers from GPT?

Welcome to the community!

I recently saw that openai launched a comparison tool on the playground: https://platform.openai.com/playground/compare?models=gpt-3.5-turbo&models=gpt-3.5-turbo (haven’t tried it yet tho).

Notice: this post doesn’t actually answer your question of how to empirically test the API. Instead, I’m offering a different (analytical) perspective on the matter, if you’re interested.

I suppose it helps to separate that endeavour into different tasks:

  1. prompt eval
  2. model eval
  3. parameter eval

1. prompt eval

This is tough. While I think what you’re suggesting is a good idea (prompt science), an engineering approach is often good enough.

While you could empirically evaluate your prompts, using theory to construct prompts is probably faster* and cheaper.

*I realize that this may require a significant amount of experience I might be discounting, but it doesn’t feel that complicated

Theory like using JSON/XML structure to control attention over longer contexts, understanding the limitations of model attention, using principles like guided thought (guided chain-of-thought), understanding positional instruction strength, sequence-termination pressure, etc.

I feel like using these techniques can get you to a gold-standard solution, if you use the appropriate model.
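To make a couple of those techniques concrete, here is a hypothetical prompt builder (the tag names and step wording are my own invention, not a standard): XML tags delimit the sections so attention stays anchored over a long context, and a guided chain-of-thought walks the model through fixed steps before it answers.

```python
# Hypothetical example of the techniques above: XML tags delimit the
# sections of a long prompt, and a <steps> block guides the model's
# chain of thought toward the answer.
def build_structured_prompt(document, question):
    return f"""<instructions>
Answer the question using only the document.
Work through the <steps> before answering.
</instructions>

<document>
{document}
</document>

<steps>
1. Quote the sentences relevant to the question.
2. Reason about what they imply.
3. State the final answer in one sentence.
</steps>

<question>{question}</question>"""

prompt = build_structured_prompt("The API wraps GPT.", "What does the API wrap?")
print(prompt)
```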

2. model eval

To catch model regressions, we have a (short) thread over here: Approaches for monitoring quality of reasoning capabilities in production

Figuring out the cheapest model you can get away with for your application through trial and error might be tough. I gave up on that, because you’ll spend a lot of time coming up with edge cases that may not even reflect actual usage.

Instead, I analyze the task and use this formula:

  1. gpt-3.5 (version doesn’t really matter) for very basic text transformation: while you can get decent results for a lot of tasks, I wouldn’t trust it with much. “:/”
  2. gpt-4 1106 or 0125 for tasks requiring maximum focus/stability: not as smart as 0314, guardrailed to death, but incidentally less prone to hallucination. “normie”
  3. gpt-4 0314 for tasks requiring attempts at reasoning: more prone to hallucinations, but also more capable at solving convoluted issues. “beautiful mind”

But to be honest, I haven’t figured this out empirically. I’d chalk this up to “experience” playing with the models. I’ve found that various “empirical” leaderboards don’t really capture the nuance between the models. I’d rather characterize them by their idiosyncrasies and then pick what’s best. Since new models will likely have new quirks, instrumenting all these dimensions would be quite difficult.
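That heuristic can be pinned down as a simple routing table, roughly like this (the task labels and the mapping are my own assumptions, not an official API; only the model names come from the list above):

```python
# Sketch of routing by task type instead of benchmarking every model
# per request. Task labels and the mapping are assumptions.
MODEL_BY_TASK = {
    "basic_transform": "gpt-3.5-turbo",         # very basic text transformation
    "stable_extraction": "gpt-4-1106-preview",  # maximum focus/stability
    "reasoning": "gpt-4-0314",                  # convoluted problems
}

def pick_model(task_type):
    # Fall back to the most stable option for unknown task types.
    return MODEL_BY_TASK.get(task_type, "gpt-4-1106-preview")

print(pick_model("reasoning"))
print(pick_model("something_new"))  # falls back to the stable default
```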

3. parameter eval

I would keep temperature and top_p at zero, for almost all tasks.

For a dedicated brainstorming task, I’d set top_p to 1 and temperature to 1, maybe 1.2. If you’re doing brainstorming tasks, optimizing this empirically might make sense.

But if you’re not doing brainstorming, I don’t feel it makes sense to play the token lottery. Temperature flattens or sharpens the token probability distribution (higher temperature boosts the less likely tokens), and top_p defines how big your token bucket is.

You can sort of think of it like a centrifuge.


(image: a laboratory centrifuge; source: Centrifuge - Wikipedia)

You have your big LLM centrifuging the token vial. The lightest tokens rise to the top.

Temperature tells you how much you shake your vial after you centrifuged it.
Top_p tells you how many of them you scoop from the top.

Empirically evaluating how much you should shake your vial after centrifuging it, does that make sense? :thinking:
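To make the analogy concrete, here is a toy simulation (plain Python, no API calls; the logit values are made up) of how temperature and top_p reshape a next-token distribution:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Temperature rescales logits before softmax: T > 1 flattens the
    # distribution ("shaking the vial"), T < 1 sharpens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_bucket(probs, top_p):
    # Nucleus sampling: keep the smallest set of top tokens whose
    # cumulative probability reaches top_p ("scooping from the top").
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    bucket, cumulative = [], 0.0
    for i in ranked:
        bucket.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return bucket

logits = [4.0, 2.0, 1.0, 0.5]         # made-up next-token scores
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 1.5)
print(max(cold), max(hot))            # the hot distribution is flatter
print(top_p_bucket(cold, 0.9))        # low temperature: tiny bucket
```

With the cold distribution, the top token alone already covers top_p = 0.9, so the bucket contains a single token; heating things up spreads probability into the rest of the vial.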

It obviously depends on your use-case, but in mine I’d typically go for stability over chaos. There are certain scenarios where this has really thrown people for a loop. One example top of mind is people getting single-quoted JSON ~20% of the time, which broke their parsing pipelines. It’ll be less noticeable if you just want to generate chat output, but I don’t find it useful for putting the models to work.


Thank you very much for your answer :slight_smile: I learned a lot.
I did not know that these models have such different “personalities”.
A very good comparison with the centrifuge; I think it illustrates the meaning of these parameters very well.
