**CLOSED** Separate ChatCompletion API calls for 'system' and 'user'


Is it possible to do separate ChatCompletion.create(…) API calls, first with ‘system’ content to define the context, and then multiple separate individual ‘user’ calls to the model against the initially set ‘system’ context?

I understand LLMs are static, stateless, and memoryless, but the API could maintain state server-side. The question is simply whether this is supported.

The idea is to avoid sending the large ‘system’ data with every call; it seems redundant if the API could manage a context instance, even knowing the LLM itself is stateless and static.

Or, if this is not possible, is it possible to reference an uploaded content file? There is no reason not to support this regardless of the LLM’s inner workings, e.g. like:

```python
{"role": "system", "content": content_file_id}
```

Thank you

Unfortunately that isn’t how ChatGPT works. The model itself doesn’t remember past entries; that’s why you have to provide the context with every message.

Referencing an uploaded file containing the context would just run into the same issue. The file contents still have to be pulled in, so it costs the same number of tokens.


The question is whether API calls can carry on a continuous conversation on a previously set context, like the ChatGPT web GUI does, regardless of what the underlying model remembers.

Regarding API call costs, the token count would be the same, but I would have far less upload traffic if hundreds of short queries follow a single large context.

That is what the messages field on the API call does. It carries your previously set context:

```python
messages = [
    {"role": "system", "content": "system_message"},
    {"role": "user", "content": "message_1"},
    {"role": "assistant", "content": "response_1"},
    {"role": "user", "content": "message_2"},
]
```
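A minimal sketch of how a client maintains that history across calls. Here `call_model` is a hypothetical stand-in for the actual `openai.ChatCompletion.create` request; the point is that the *client*, not the API, carries the conversation state:

```python
def call_model(messages):
    # Placeholder for the real API call; returns a dummy assistant reply.
    return f"echo of {messages[-1]['content']}"

def chat_turn(history, user_text):
    """Append the user message, get a reply, and record it in the history."""
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# The system message is simply the first entry; every later call resends it all.
history = [{"role": "system", "content": "You are a helpful assistant."}]
chat_turn(history, "message_1")
chat_turn(history, "message_2")
```

After two turns, `history` holds the system message plus two user/assistant pairs, and the whole list is sent on the next request.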


My case involves about 500 kB of ‘system’ content and hundreds of interactive prompts, so I can’t make one batched call as stated above.

The question is whether I have to send/upload the entire ‘system’ context and all previous prompts hundreds of times, i.e. after every interactive user prompt.

I have working code; I would just like to reduce the traffic by orders of magnitude if possible.

```python
# call #1
{"role": "system", "content": "system_message"},
{"role": "user", "content": "message_1"}

# call #2
{"role": "system", "content": "system_message"},
{"role": "user", "content": "message_2"}

# call #3
{"role": "system", "content": "system_message"},
{"role": "user", "content": "message_3"}

# call #4
{"role": "system", "content": "system_message"},
{"role": "user", "content": "message_4"}
```

with the same ‘system_message’ sent every time, hundreds of times.
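If each query really is independent (no chat history needed), the repetition can at least be factored out client-side. A minimal sketch, assuming a single shared system prompt; the bytes still travel on every request, this only tidies the calling code:

```python
# The shared system prompt (~500 kB in the scenario above), defined once.
SYSTEM_MESSAGE = {"role": "system", "content": "system_message"}

def build_messages(user_text):
    """Prepend the shared system prompt to one independent user query."""
    return [SYSTEM_MESSAGE, {"role": "user", "content": user_text}]

# call #1 .. call #4 all reduce to:
payload = build_messages("message_3")
```

Each `payload` is then what gets passed as the `messages` argument of the API call.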

Yes, the system message needs to be sent on every request. There’s no way to “store” system messages or conversation history with the API.


Large language models are memoryless: they cannot retain information between calls, so everything needs to be recomputed every time. That is just how they work.


The question is not about the LLM’s inner workings, but about data management, upload traffic, and API calls.

There is no reason whatsoever not to support an uploaded-file reference for the ‘system’ content entry; it is just not supported.

Hey Hrvoje,

The reason it won’t work with 500 kB of system content is the limited context window of the model itself. ChatGPT works by keeping a sliding memory of your conversation until it reaches X amount of tokens, and the same applies to system content.

The model takes in the entire content as essentially one giant prompt. On the server side, there is no memory cache that the AI pulls from; it just gets the data appended together and predicts what the next text should be.
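A toy illustration of that “appended together” step. The real serialization uses model-specific special tokens, so this is only the idea, not the actual wire format:

```python
def flatten(messages):
    """Toy view of how a chat request becomes one big prompt string."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

prompt = flatten([
    {"role": "system", "content": "rules"},
    {"role": "user", "content": "question"},
])
```

Everything in `messages`, system prompt included, ends up inside that single flattened input, which is why it is billed as input tokens every time.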

So, you will have to get creative in how you reduce or break up your system content. I’m not sure what your project is, but you’ll likely have to go through several rounds of prompts and then get an aggregated response at the end. I’ve talked about a way to do this here: How to prevent ChatGPT from answering questions that are outside the scope of the provided context in the SYSTEM role message? - API - OpenAI Developer Forum

Additionally, you should look up something called “expert routing” in Mixture of Experts architectures. I’m seeing a lot of people with requirements like this, and it may be the best option for you as well. Essentially, you train a very small network to decide which expert a prompt should be sent to. If you don’t feel like training one, just use a routing chatbot to determine which model the prompt should go to. In your case, each “model” is just an agent with only a portion of the system content loaded.

For example, if you need a bot that can suggest meals, wine pairings, and movies, just make three different bots with instructions in each one, then route the prompt to the correct one based on the routing bot’s classification.

Is the user talking about:

1. Wine
2. Food
3. Movies

Please respond with only the number. If none apply, please respond with 0.
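The routing step above can be sketched as follows. Here `classify` is a keyword-matching stand-in for the small routing model (a real router would send the prompt plus the numbered instructions to a cheap model and parse the digit it returns); the bot names are hypothetical:

```python
# Map the router's numeric label to the specialist agent.
EXPERTS = {1: "wine_bot", 2: "food_bot", 3: "movie_bot"}

def classify(prompt):
    # Stand-in for the routing chatbot: returns 1, 2, 3, or 0 for "none apply".
    keywords = {"wine": 1, "meal": 2, "movie": 3}
    for word, label in keywords.items():
        if word in prompt.lower():
            return label
    return 0

def route(prompt):
    """Dispatch the prompt to the expert holding only its slice of the system content."""
    label = classify(prompt)
    return EXPERTS.get(label, "fallback_bot")
```

So `route("Which wine goes with salmon?")` would hand the query to the wine agent, which only ever carries the wine-related portion of the 500 kB context.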

Lastly, if it is appropriate, look at pulling in data from something like ChromaDB to serve as your system content rules. Like I said, though, you will have to get creative to work around the context length. There is hope on the horizon, but for now the context size is pretty limiting. Think of it like GPU memory: it doesn’t matter if you need more, there is only so much available, so you have to figure out how to use it more resourcefully.

Whatever your solution, you should make another post and let us know what technique you used. I know the community always likes creative solutions.

Good luck. Hit me up if you have questions.


In the days of streaming Netflix, internet data is not a concern.

You’re sending a megabyte of JavaScript user interface but can’t send 500 kB to an API?

The content the AI must process has to be loaded into the AI engine’s context for every call, and the inference cost is then based on the amount of input. Your billing for input tokens reflects the supercomputer-like cost of compute, not the transferring of data.

They could offer a “store system prompt” and “use system prompt” API interface, but it would likely just cause more developer confusion, and it would not save you a penny.

With respect to API calls, there shouldn’t be any difference in the number or complexity of API calls regardless of how the system prompt and chat history are handled.

With respect to data management, I understand there may be some convenience associated with offloading the responsibility of maintaining the system prompt and context window to OpenAI, but there are a number of pitfalls with that approach.

First, that requires OpenAI to maintain chat history for the API, something which many users expressly do not want them to do for safety and security reasons.

Next, you would need a way to uniquely identify each API chat, which brings with it essentially the same issue you are trying to avoid: needing to manage data. Though it may be simpler to track a chat_id than the entire chat, in either case it should be a one-time problem to solve.

Lastly, if these things are maintained by OpenAI, you become limited in your ability to modify these on the fly.

With respect to the bandwidth, I completely understand how it might seem absurd to need to send a large system prompt with every invocation of the API, but consider what would be necessary on OpenAI’s side to implement what you are asking.

They would then become responsible for storing, maintaining, and securing millions or possibly billions of system prompts, and for what benefit? If previous messages were included in this scheme, that number would drastically increase. Bandwidth is incredibly cheap; there’s little reason to be so concerned with it.

Let’s use your concrete example to clarify this. Say we have 500 kB of text, which is typically around 500,000 characters. At roughly four characters per token, this translates to about 125,000 tokens, around four times the context window of OpenAI’s most powerful model. If we consider the least expensive model, which charges $0.002 per 1,000 tokens, even if the model could process this much context, the cost would be about $0.25 each time just to process the system message.

On the other hand, let’s consider the cost of transmitting the data. If we use AWS as a benchmark, where the bandwidth cost is $0.09 per GB after the first 100GB/month (which is free), the cost to transmit a 500kB system message would be about $0.000045.

So, if you compare these two, you’ll find that the cost of processing a hypothetical 500kB system prompt is more than 5,500 times the cost of transmitting it.
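The back-of-the-envelope numbers above can be reproduced directly:

```python
# Compute cost: ~4 characters per token, $0.002 per 1,000 tokens.
chars = 500_000                        # ~500 kB of text
tokens = chars / 4                     # rough character-to-token heuristic
compute_cost = tokens / 1000 * 0.002   # dollars per request, input only

# Transfer cost: AWS egress benchmark of $0.09 per GB.
transfer_cost = 500e3 / 1e9 * 0.09     # dollars to move 500 kB

ratio = compute_cost / transfer_cost   # how much pricier compute is
```

This gives $0.25 of compute against $0.000045 of bandwidth, a ratio of about 5,555.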

I hope this helps clarify the matter. Let me know if you have any more questions!

@codie, I want to talk to you. I am Quicksilverai on Discord.


Ah, I see. I learned so much from you.
So we have to include the system context in every API call, right?

When I use function calling, I use different system prompts: one for the first API call to get the function call, and one for the second API call to summarize the result. The second system prompt is the main system prompt of the chatbot; the first is just there to nudge the model toward a good function call.


File upload and referencing is already supported; I use it for fine-tuning, and I don’t have to send the entire dataset every time. Note the training_file_id being used below:

```python
training_file_handle = openai.File.create(file=open(training_file_name), purpose="fine-tune")
training_file_id = training_file_handle["id"]

create_args = {
    "training_file": training_file_id,
    "model": "ada",
    "n_epochs": 60,
    "batch_size": 3,
    "learning_rate_multiplier": 0.3,
}

response = openai.FineTune.create(**create_args)
```

The same approach could work for large prompting.

I would say a prompting context is usually orders of magnitude smaller than large fine-tuning data, and while prompting one expects the smallest possible delay; so traffic-wise it is not a considerable issue, and file referencing could even hurt performance.

In general, one way to reduce the ‘system’ context (for both cost and traffic reasons), or even to avoid fine-tuning the model, is to use embeddings to identify the most relevant sections of the context and include only those in the ChatCompletion query.
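A minimal sketch of that retrieval step. Here `embed` is a toy character-frequency vectorizer standing in for a real embedding API call, so only the ranking logic is meant to transfer:

```python
import math

def embed(text):
    # Toy stand-in for an embedding model: 26-dimensional letter-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha() and ch.isascii():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_sections(sections, query, k=2):
    """Rank context sections by similarity to the query; keep only the best k."""
    q = embed(query)
    ranked = sorted(sections, key=lambda s: cosine(embed(s), q), reverse=True)
    return ranked[:k]
```

Only the `k` winning sections would then be placed in the ‘system’ message, instead of the full 500 kB context.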

More info in this interesting notebook:

@ahrvoje I understand your frustration; I am dealing with the same problem. In the GUI version, it is possible to establish one context and then ask multiple questions, but the API requires you to send the context each time. This has made it a lot more expensive for me. I would ideally like to establish a context, ask a question, receive an answer, ask another question, receive an answer, and so on. But I cannot do that now.

ChatGPT works the same way.

When you open a conversation, it pulls up the whole text and sends it along with your every question.

ChatGPT is just an example app using the API: an app built to handle conversations, pulling them up and sending them when needed. There is nothing special about it that you can’t do with the API yourself.

If you want to generate text based on previous text, GPT always has to process that text (it takes resources every time). Nothing is stored in the API or in the model.
It’s just how it works.


CLOSED as a conclusion has been reached.

If you have a solution, you can mark the contributor that helped the most.