Token efficiency in context injection

Dear community,

I'm attempting to give the gpt-4 chat completion model some information before prompting by feeding it a variable called context, which consists of the same 200 concatenated short sentences prepended to each prompt, something like this:

        for prompt in prompts:
            response = self.client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": instructions},
                    # the same ~200-sentence context is resent on every call
                    {"role": "user", "content": f"{context}{prompt}"},
                ],
                temperature=0,
            )

But as you can imagine, the costs of this are pretty high.

Just wondering if you'd be so kind as to suggest some options to make this more cost-efficient? Thanks!


Hi and welcome to the Community!

Given that the API is stateless and every API call is treated independently from the others, you have no choice but to provide the context every single time.

However, depending on the specifics of what you are looking to achieve, there may be alternative options/approaches that are more cost-friendly. If you could share more details or an example of your typical system and user messages, we can see whether any of them apply.
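One thing worth checking, though: newer models (gpt-4o and later) support OpenAI's automatic prompt caching, which bills repeated input tokens at a discounted rate as long as the static part of the prompt is an identical prefix across calls. A minimal sketch, assuming one of those models and reusing the placeholder names (instructions, context, prompts) from your snippet:

    from openai import OpenAI

    client = OpenAI()

    instructions = "..."      # placeholder: your system instructions
    context = "..."           # placeholder: the large static context
    prompts = ["...", "..."]  # placeholder: the short varying prompts

    # Keep the static parts as an identical prefix on every call so that,
    # on models with automatic prompt caching, the repeated prefix tokens
    # can be served from cache at the discounted input rate.
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative caching-capable model
            messages=[
                {"role": "system", "content": f"{instructions}\n\n{context}"},
                {"role": "user", "content": prompt},  # only the varying bit
            ],
            temperature=0,
        )

Note that the varying text moves to the end: any change early in the prompt invalidates the cached prefix.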

I came here to ask the same thing: I have a decent chunk of context, then a tiny bit that varies each time, and I will be sending hundreds of these through.
I noticed that Google Gemini offers context caching, so I am going to look at that (rough sketch below).
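For anyone curious, this is roughly what that looks like with the google-generativeai Python library; the model name, TTL, and variable names here are illustrative, and the exact API may differ between library versions:

    import datetime
    import google.generativeai as genai
    from google.generativeai import caching

    genai.configure(api_key="...")  # your API key

    shared_context = "..."                # placeholder: the big shared context
    variable_bits = ["item 1", "item 2"]  # placeholder: the small varying parts

    # Upload the shared context once; Gemini stores it server-side for the TTL.
    # Note that context caching has a minimum token count, so small contexts
    # may not qualify.
    cache = caching.CachedContent.create(
        model="models/gemini-1.5-flash-001",  # illustrative caching-enabled model
        system_instruction="Answer using only the provided context.",
        contents=[shared_context],
        ttl=datetime.timedelta(minutes=60),
    )

    model = genai.GenerativeModel.from_cached_content(cached_content=cache)

    # Each call now pays full price only for the small varying part.
    for item in variable_bits:
        response = model.generate_content(item)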
The other thing I was thinking of is batch processing, i.e. give the context once along with an array of the variable bits and ask for an array to be returned, although I haven't tested this yet and am slightly concerned that it will mix up elements of the array (sketch below).
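A minimal sketch of that idea, assuming the OpenAI Python client; tagging each element with an explicit id makes any mixed-up or dropped elements detectable when matching answers back (all names here are made up for illustration):

    import json
    from openai import OpenAI

    client = OpenAI()

    shared_context = "..."                        # placeholder: the shared context
    variable_bits = ["item 1", "item 2", "item 3"]  # placeholder: varying inputs

    # Tag each element with an explicit id so answers can be matched back
    # even if the model reorders or drops elements.
    numbered = [{"id": i, "input": text} for i, text in enumerate(variable_bits)]

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model
        messages=[
            {
                "role": "system",
                "content": (
                    "For each object in the JSON array the user sends, answer "
                    "using the context. Return only a JSON array of objects, "
                    "each keeping its 'id' and adding an 'answer' field."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{shared_context}\n\nItems:\n{json.dumps(numbered)}",
            },
        ],
        temperature=0,
    )

    answers = json.loads(response.choices[0].message.content)
    # Re-key by id so the order of the returned array no longer matters.
    by_id = {item["id"]: item["answer"] for item in answers}

Keying the results by id takes most of the sting out of the mixing-up worry, since you can verify every id came back exactly once before trusting the output.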