That is the only way to go. If you want to save tokens, you could generate summaries of the previous conversation and send those instead. Or you could keep the last question and answer in full, plus very short summaries of the three exchanges before it.
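For instance, here is a minimal sketch of that hybrid approach, assuming the chat-completions message format (the function, the `history`/`summaries` structures, and the recap wording are all just illustrative):

```python
def build_context(history, summaries):
    """Assemble a token-frugal prompt: very short summaries of up to
    three exchanges before the last one, plus the last exchange verbatim.

    history   -- list of (question, answer) tuples, oldest first
    summaries -- one-line summary per exchange, aligned with history
    """
    messages = []
    if len(history) > 1:
        # Up to 3 one-liners for the exchanges just before the last one.
        recap = "; ".join(summaries[-4:-1])
        messages.append({"role": "system",
                         "content": f"Earlier in this conversation: {recap}"})
    last_q, last_a = history[-1]
    messages.append({"role": "user", "content": last_q})
    messages.append({"role": "assistant", "content": last_a})
    return messages
```

The new user question gets appended after this, so each API call carries only one full exchange plus a few dozen tokens of recap.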
It does not. ChatGPT’s context window is also finite. None of these models has an infinite context window. It only looks infinite in the UI.
You can use several tricks to deal with this limit: summarization of the previous conversation, semantic search to retrieve relevant passages, scratchpad capabilities, etc. But you cannot bypass it.
If I recall correctly, you can actually check this easily in the ChatGPT interface. Just try to send a single huge text message (more than 30k characters or so) and ask: “What is the last sentence of the text?”
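If you want to try it, something like this builds a probe message of the right size (the filler text and marker are arbitrary):

```python
# Build a test message longer than ~30k characters to probe the limit.
filler = "The quick brown fox jumps over the lazy dog. " * 700  # ~31,500 chars
probe = filler + "END MARKER: this is the last sentence of the text."
print(len(probe))
# Paste `probe` into ChatGPT and ask: "What is the last sentence of the text?"
```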
I’m not sure I understood that correctly. The intent of my question seemed to be retained in the ChatGPT window even as prompts and completions piled up.
As you said, I confirmed that an error occurs when more than 30,000 characters are entered in a single prompt.
What I’m trying to show with this example is that ChatGPT’s context window is also limited (even for a single message), so it does not remember previous utterances indefinitely. ChatGPT might be using some of these techniques under the hood to optimize its context window usage, but the window is still finite (some people mention the use of scratchpad capabilities, but this hasn’t been confirmed officially).
AI solutions have constraints, just like other software architectures. You need to build more software around the LLM APIs to create solutions that address these more advanced ideas. Your envisioned requirements can be met, but it involves assembling potentially many technologies, including but not limited to:
Databases
Embeddings
Completions
Client storage
Cloud storage
Rules engine
LangChain
OpenAI’s apps meet many requirements, but there are ceilings of operation just as in any software system.
In the API, the GPT-3.5 model is capped at 4,096 tokens (prompt plus max_tokens combined), but GPT-3.5 on the web has no problem exceeding 4,096 tokens over the course of a whole session.
I’m curious about the reason for this. I confirmed that a single prompt cannot exceed 4,096 tokens, but I don’t understand why the token limit doesn’t seem to apply no matter how many questions I ask.
I don’t know. Good question. My hunch is that they are doing some additional things under the covers (i.e., they have employed some other functionality beyond their API limitations).
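One way to see how quickly a conversation approaches the limit is to count the tokens yourself with OpenAI’s tiktoken library. A rough sketch (the chat format adds a few tokens of overhead per message beyond what this counts):

```python
import tiktoken

# Tokenizer matching the model (assumption: gpt-3.5-turbo).
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(messages):
    """Lower-bound token count for a list of chat messages."""
    return sum(len(enc.encode(m["content"])) for m in messages)

history = [
    {"role": "user", "content": "First question..."},
    {"role": "assistant", "content": "First answer..."},
]
print(count_tokens(history))  # compare against the 4,096-token window
```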
LangChain is an open-source framework that will give your GPT API the appearance of memory, similar to the ChatGPT experience. It has good examples that you can learn from to build your conversational app.
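A minimal sketch of what that looks like, assuming the classic LangChain Python API (the library changes quickly, so the imports may differ in newer versions):

```python
from langchain.chains import ConversationChain
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory

llm = OpenAI(temperature=0)  # reads OPENAI_API_KEY from the environment

# ConversationBufferMemory replays prior turns into each new prompt,
# which is what creates the appearance of memory.
conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())

conversation.predict(input="Hi, my name is Sam.")
print(conversation.predict(input="What is my name?"))  # -> mentions "Sam"
```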
You can further extend the memory by using indexed embeddings with Pinecone or LlamaIndex to effectively “compress” previous prompts and completions into vectors. At each turn you retrieve only the stored passages most similar to the new prompt and include those, so there is ongoing context. (You send the retrieved passages instead of the full history, which saves tokens and extends the effective memory.)
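To illustrate the retrieval idea without committing to a particular vector database, here is a sketch using an in-memory list and cosine similarity. The embedding model and the pre-1.0 openai SDK calls are assumptions; Pinecone or LlamaIndex would replace the `index` list with a proper store:

```python
import numpy as np
import openai  # assumes openai.api_key is set; pre-1.0 SDK style

def embed(text):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# Index past exchanges as (text, vector) pairs instead of raw history.
past_exchanges = ["Q: How do we price tiers? A: Three tiers...",
                  "Q: What stack are we on? A: Python and Postgres..."]
index = [(t, embed(t)) for t in past_exchanges]

def retrieve(query, k=2):
    """Return the k past exchanges most similar to the new query."""
    q = embed(query)
    cos = lambda v: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return [t for t, _ in sorted(index, key=lambda p: -cos(p[1]))[:k]]

# Only the retrieved passages go into the next prompt, not the whole history.
context = "\n".join(retrieve("What did we decide about pricing?"))
```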
LangChain is relatively easy to implement and would be the best way to get started; implementing indexing is a bit more involved, but it is doable with some care.
Coming from a medical perspective, you have to think of it as concepts. When someone asks a question, we often can’t remember every detail of a 20-minute conversation, but we remember the main points (and sometimes forget some others). I think one of the easiest ways is to continually ask the model to build a summary every few requests and feed that summary back along with the next prompt. It may not be the perfect solution, but it comes closer to building long-term context.
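A rough sketch of that rolling-summary loop, assuming the pre-1.0 openai Python SDK (the cadence, prompts, and variable names are all just illustrative):

```python
import openai  # assumes openai.api_key is set

SUMMARIZE_EVERY = 3  # fold verbatim turns into the summary every 3 exchanges
summary = ""
recent = []          # verbatim messages since the last fold

def chat(user_msg):
    global summary, recent
    messages = []
    if summary:
        messages.append({"role": "system",
                         "content": f"Summary of the conversation so far: {summary}"})
    messages += recent + [{"role": "user", "content": user_msg}]
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages,
    )["choices"][0]["message"]["content"]

    recent.extend([{"role": "user", "content": user_msg},
                   {"role": "assistant", "content": reply}])
    if len(recent) >= 2 * SUMMARIZE_EVERY:
        # Compress the recent turns into the running summary, then drop them.
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
        summary = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": "Update this running summary:\n"
                                  f"{summary}\n\nwith these new turns:\n{transcript}"}],
        )["choices"][0]["message"]["content"]
        recent = []
    return reply
```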