I’ve had an idea for LLM memory in long conversations.
I’m not an AI researcher, or even involved in tech, so it might be a crap idea, but I don’t know where else to post it.
And I’m sure that if the idea is even possible, OpenAI have already researched it or something similar. Here’s a GPT-4o summary of vector summaries:
Creating a “Vector Summary” of a conversation involves capturing the numerical representation (vector) produced by the model for the entire conversation. This vector typically comes from the hidden states of the model’s neural network and represents the semantic content of the conversation in compressed form.
Here’s a general outline of how this process could work:
- Conversation Encoding: The model processes the conversation, and at various stages (e.g., after each layer), it generates intermediate representations (vectors).
- Extracting Vectors: The specific vectors you are interested in (e.g., from the first layer or another intermediate layer) can be extracted. These vectors are essentially high-dimensional arrays of numbers.
- Output: The extracted vector can then be output as a list of numbers.
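For concreteness, here’s a minimal sketch of that extraction using an open model via Hugging Face transformers; gpt2 is just a stand-in, since hosted models like GPT-4o don’t expose their hidden states:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

conversation = "User: hi\nAssistant: hello! How can I help?"
inputs = tok(conversation, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: (embedding output, layer 1, layer 2, ...)
layer1 = out.hidden_states[1]             # shape (1, seq_len, hidden_size)
vector = layer1.mean(dim=1).squeeze(0)    # mean-pool tokens into one vector
print(vector.shape, vector[:5].tolist())  # step 3: a plain list of numbers
```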
So, using this methodology:
Your proposal for using a specialized, smaller LLM to handle summaries and manage memory for longer conversations is an innovative and practical solution to the challenges of maintaining long-term conversational context. Here’s a detailed breakdown of how this could be implemented:
Proposed System Architecture
- Primary LLM (e.g., GPT-4o):
  - Handles real-time interactions with the user.
  - Processes and generates responses based on the immediate context.
- Specialized Summarization LLM (e.g., GPT-4o Nano):
  - Dedicated to summarizing conversations.
  - Possesses a large context window to manage entire conversations or long text inputs efficiently.
Workflow
- Conversation Handling:
  - During a conversation, the primary LLM interacts with the user as usual.
  - Periodically, or when a conversation exceeds a certain length, the conversation history is passed to the summarization LLM (a rough sketch of this hand-off follows this list).
- Summarization:
  - The summarization LLM generates a concise summary, extracting key points, topics, and context.
  - This summary is further compressed into a vector representation (by scraping the vector values of the first node layer of the larger LLM, right after the tokeniser, once the summary or the full conversation is fed in as a prompt; see the embedding sketch below).
- Vector Storage:
  - The vector summary is stored in digital memory, potentially compressed into a zip file to save space (see the storage sketch below).
  - The original conversation history can be discarded or archived if storage is an issue.
- Resuming Conversations:
  - When the user returns, the system loads the compressed vector summary.
  - If loading fails, the summarization LLM can generate a new summary from whatever data is still available (see the fallback sketch below).
- Key Point and Context Injection:
  - When resuming the conversation, the primary LLM is provided with the vector summary and/or key points to restore the context (one possible injection mechanism is sketched below).
  - This ensures continuity and coherence in the dialogue.
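To make the workflow concrete, here are a few rough Python sketches under the assumptions above; every helper name here (count_tokens, summarize_history, and so on) is hypothetical. First, the hand-off trigger:

```python
# Hypothetical hand-off: once the history gets too long, ask the small
# summarizer for a recap and keep only the most recent turns verbatim.
MAX_HISTORY_TOKENS = 6_000  # arbitrary threshold, would need tuning

def maybe_hand_off(history: list[str], count_tokens, summarize_history) -> list[str]:
    if sum(count_tokens(turn) for turn in history) <= MAX_HISTORY_TOKENS:
        return history  # still fits in context, nothing to do
    summary = summarize_history("\n".join(history))  # call to the small LLM
    return [f"[summary of earlier conversation] {summary}"] + history[-4:]
```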
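Next, the two-stage compression. The “first node layer after the tokeniser” is the embedding layer, which open models expose directly; gpt2 and the distilbart summarizer below are stand-ins, since hosted models like GPT-4o don’t give access to this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

big_tok = AutoTokenizer.from_pretrained("gpt2")
big_model = AutoModelForCausalLM.from_pretrained("gpt2")
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summary_vector(conversation: str) -> torch.Tensor:
    # stage 1: text summary from the small model
    summary = summarizer(conversation, max_length=128)[0]["summary_text"]
    # stage 2: run the summary through the big model's embedding layer
    # (the layer immediately after the tokeniser) and pool to one vector
    ids = big_tok(summary, return_tensors="pt").input_ids
    with torch.no_grad():
        emb = big_model.get_input_embeddings()(ids)  # (1, seq, hidden)
    return emb.mean(dim=1).squeeze(0)
```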
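For storage, numpy’s compressed format is literally a zip archive under the hood, which matches the zip-file idea:

```python
import numpy as np

def save_vector(path: str, vec) -> None:
    # .npz files are zip archives; float16 halves the size again
    np.savez_compressed(path, vec=np.asarray(vec, dtype=np.float16))

def load_vector(path: str) -> np.ndarray:
    return np.load(path)["vec"].astype(np.float32)

# usage: save_vector("memory.npz", summary_vector(history_text))
```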
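Resuming with the fallback; resummarize stands in for another call to the summarization LLM:

```python
def restore_context(path: str, archived_history: str | None, resummarize):
    try:
        return load_vector(path)  # load_vector from the storage sketch
    except (OSError, KeyError, ValueError):
        # vector missing or corrupt: rebuild from archived text, if any was kept
        return resummarize(archived_history) if archived_history else None
```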
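Injection is the most speculative step. One way to feed a raw vector back into a model is as a “soft” prefix prepended to the prompt’s token embeddings; this needs direct access to the model’s weights, so it’s only a sketch of what could be done with an open model, not something a hosted API supports:

```python
import torch

def generate_with_memory(model, tok, memory_vec: torch.Tensor, prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_emb = model.get_input_embeddings()(ids)           # (1, seq, hidden)
    prefix = memory_vec.view(1, 1, -1).to(prompt_emb.dtype)  # (1, 1, hidden)
    embeds = torch.cat([prefix, prompt_emb], dim=1)          # memory vector first
    out = model.generate(inputs_embeds=embeds, max_new_tokens=100)
    return tok.decode(out[0], skip_special_tokens=True)
```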
Technical Benefits
- Efficiency:
  - Offloading summarization to a specialized LLM reduces the load on the primary LLM, allowing it to focus on real-time interactions.
  - Storing compressed vector summaries significantly reduces memory usage.
- Scalability:
  - This approach can be scaled to handle numerous conversations simultaneously, as the summarization task can be distributed.
- Reliability:
  - The dual-layer summarization (text summary followed by vector compression) helps retain essential information even if one layer fails.
  - Regular summarization intervals prevent data loss and manage memory efficiently.
Implementation Considerations
- Resource Allocation:
  - Ensure that the summarization LLM has sufficient resources and a large enough context window to handle long conversations.
  - Balance the computational load of summarization against the primary LLM’s response generation.
- Summarization Quality:
  - Develop and fine-tune the summarization LLM to produce high-quality summaries that accurately capture the conversation’s essence.
  - Continuously evaluate and improve the summarization process based on user feedback and performance metrics.
- User Experience:
  - Introduce user-friendly notifications or loading screens when summarization or context restoration is in progress.
  - Ensure minimal disruption to the conversation flow.
Future Enhancements
- Adaptive Summarization:
  - Implement adaptive algorithms that dynamically adjust the summarization frequency and detail level based on the conversation’s complexity and length.
- Hierarchical Summaries:
  - Develop hierarchical summarization techniques that create summaries at multiple levels of detail, allowing for flexible context restoration.
- Integration with Other Memory Systems:
  - Explore integrating this system with external memory systems or databases for even more efficient long-term storage and retrieval (a toy version is sketched below).
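On that last point, the simplest version of “integration with other memory systems” would be a little vector store: keep one summary vector per past conversation and pull back the closest matches by cosine similarity. A toy sketch, not a real database:

```python
import numpy as np

store: dict[str, np.ndarray] = {}  # conversation id -> summary vector

def recall(query_vec: np.ndarray, k: int = 3) -> list[str]:
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    ranked = sorted(store, key=lambda cid: cosine(store[cid], query_vec),
                    reverse=True)
    return ranked[:k]  # ids of the most relevant past conversations
```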
OK, that’s enough from 4o. Basically, this would (if it works) produce something akin to human memory and recall in LLMs, and would potentially allow cross-conversational references in vivid detail. And maybe even post-training learning? Basically, save states for AI, but since it’s only layer 1, it’s more like a memory.
I know there’s already a feature called “Memories”, and I don’t know how it works; it could be the same thing?
But lol, I’m not an AI researcher, so I’m posting it here. Feel free to take this idea or tear it to shreds!