Standard logging/recording for API requests

I’m in the process of coding logging/recording for OpenAI requests so that I can store requests, responses, and associated metadata. I want to track the results of my experiments, and I don’t want to pay twice for the same request or wait again for requests I’ve already made. Is there an existing library or framework that does this, so that I’m not reinventing the wheel?

I asked Copilot X, which said that I could use the Gym monitor, but I think that might be a hallucination. Gym is now Gymnasium, and it was/is all about recording reinforcement-learning episodes.

github → Farama-Foundation/Gymnasium

GPT-4 says it doesn’t know of anything (although I haven’t asked with the Bing plugin).

The code I’m developing (in case my objective isn’t clear) currently looks like this:

    with open(f"{path}/{record_id}/request.json", "w") as f:
        f.write(json.dumps(request, indent=4, sort_keys=True))
    with open(f"{path}/{record_id}/result.json", "w") as f:
        f.write(json.dumps(result, indent=4, sort_keys=True))
    with open(f"{path}/{record_id}/text.json", "w") as f:
        f.write("{\n" + result["choices"][0]["text"])

Naturally I can code this out myself, but I’m imagining a framework that dumps this sort of thing out, lets you refresh old requests, gives options on which metadata to output, and so on.
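
Roughly, I’m picturing a cache-or-call helper along these lines (just a minimal sketch; the function name, directory layout, and hash-derived record id are things I’m making up for illustration):

    import hashlib
    import json
    import os

    import openai

    def cached_completion(request: dict, path: str = "records", refresh: bool = False) -> dict:
        # key the record on the full request, so any change to prompt, model,
        # temperature, etc. produces a new record instead of a stale cache hit
        record_id = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()
        ).hexdigest()[:16]
        record_dir = os.path.join(path, record_id)
        result_file = os.path.join(record_dir, "result.json")

        # replay a stored result unless an explicit refresh is requested
        if not refresh and os.path.exists(result_file):
            with open(result_file) as f:
                return json.load(f)

        result = openai.Completion.create(**request)
        os.makedirs(record_dir, exist_ok=True)
        with open(os.path.join(record_dir, "request.json"), "w") as f:
            json.dump(request, f, indent=4, sort_keys=True)
        with open(result_file, "w") as f:
            json.dump(result, f, indent=4, sort_keys=True)
        return result

The idea being that `refresh=True` re-issues the request and overwrites the stored record, while a normal call just replays the saved result.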

Even if there is no existing lib, does having something like this make sense? Or could we use something like VCR.py, or another Python logging library for APIs?
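
For reference, VCR.py records HTTP interactions to a local “cassette” and replays them on later runs, so something like this might at least cover the don’t-pay-twice part (a rough sketch, assuming the openai client’s HTTPS calls are something VCR can intercept; the cassette path is arbitrary):

    import openai
    import vcr

    # the first run hits the API and records to the cassette; later identical
    # requests are replayed from disk instead of being re-sent
    with vcr.use_cassette(
        "cassettes/completion.yaml",
        record_mode="new_episodes",          # record anything not already on the cassette
        match_on=["method", "uri", "body"],  # all calls hit the same endpoint, so match bodies too
        filter_headers=["authorization"],    # keep the API key out of the recording
    ):
        result = openai.Completion.create(
            model="text-davinci-003",
            prompt="Say hello",
            max_tokens=5,
        )

Though that is replay at the HTTP level; it doesn’t by itself give the per-record metadata or refresh control described above.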

One thing to note about caching model calls (which it sounds like you’re trying to do): given both the stochastic nature of the models and the fact that OpenAI updates them over time, you can’t be guaranteed the model will return the same response on the next call.

Additionally, if your prompt changes even slightly context-wise, you’ll likely get a different response. For example, asking “what do you think?” in the middle of a conversation will produce a radically different answer than asking “what do you think?” at the start of one.

Just things to keep in mind. If your experiments are super controlled and you’re just looking to save some cost, then I say cache away :slight_smile:
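
A quick illustration of that context point, assuming a content-hash cache key like the sketch above: the same question asked in a different conversation hashes to a different key, so it becomes a cache miss rather than a wrong reuse.

    import hashlib
    import json

    def cache_key(request: dict) -> str:
        return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

    a = {"model": "gpt-3.5-turbo", "messages": [
        {"role": "user", "content": "what do you think?"},
    ]}
    b = {"model": "gpt-3.5-turbo", "messages": [
        {"role": "user", "content": "Here is my plan: cache every request to disk."},
        {"role": "assistant", "content": "Sounds reasonable."},
        {"role": "user", "content": "what do you think?"},
    ]}

    # different conversation context -> different key -> separate cached record
    assert cache_key(a) != cache_key(b)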

Hi Sam,
I think you are looking for a tool like MLflow, which can do experiment tracking while you try out different combinations of hyperparameters and prompts, so that you can identify the leading technique. When I did some digging a while back, I found that tools like MLflow are not yet equipped to handle LLM experiments (they may be in the future), so I had to write the entire logging and tracking framework myself in Python.
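
For reference, even without LLM-specific features, MLflow’s core logging APIs can capture the basics of a run; a minimal sketch (the run name, parameters, and stand-in prompt/response values are just examples):

    import mlflow

    prompt = "Summarize the experiment results."  # example prompt
    result = {                                    # stand-in for an OpenAI response
        "choices": [{"text": "..."}],
        "usage": {"completion_tokens": 42},
    }

    with mlflow.start_run(run_name="prompt-experiment"):
        mlflow.log_param("model", "text-davinci-003")
        mlflow.log_param("temperature", 0.7)
        mlflow.log_text(prompt, "prompt.txt")   # prompt stored as a run artifact
        mlflow.log_dict(result, "result.json")  # full response stored as a run artifact
        mlflow.log_metric("completion_tokens", result["usage"]["completion_tokens"])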

Thanks for the tips, folks. I just found github → preset-io/promptimize, which might be what I was looking for.