Edit control over the LLM’s short-term and long-term memory opens some interesting possibilities:

- the LLM’s next answer in your chat window depends on what you make the LLM believe it answered before
- thus, you can time travel through the history of the chat and change both what you asked and what was answered to you
- if you are a fan of movies like Total Recall, Memento or Inception, you can see how the LLM’s view of reality gets distorted by letting Mnemosyne play with its memory
Mnemosyne can also be used simply to manage your chat context (a minimal sketch of this bookkeeping follows the list):

- if you ask the same question again, the answer is retrieved from short-term or long-term memory at no API cost to you
- when you get close to the token limit, the oldest item in short-term memory is moved to long-term memory
- when changing the subject of your chat, you can spill the entire content of short-term memory into long-term memory
- if you run an LLM that exposes an OpenAI-compatible API (e.g., Vicuna) on a local port, Mnemosyne can also manage your interaction with it
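Roughly, the bookkeeping looks like this. The class and method names below are illustrative only, not Mnemosyne’s actual code, and the token count is a crude character-based estimate:

```python
from collections import OrderedDict

class TwoTierMemory:
    """Hypothetical two-tier chat memory: recent question-answer pairs in
    short-term memory, evictions and spills in long-term memory."""

    def __init__(self, max_short_tokens=3000):
        self.short_term = OrderedDict()   # question -> answer, oldest first
        self.long_term = {}               # question -> answer
        self.max_short_tokens = max_short_tokens

    def _tokens(self, text):
        return len(text) // 4             # very rough token estimate

    def lookup(self, question):
        """Cached answer from either tier, or None -- a hit means no API call."""
        if question in self.short_term:
            return self.short_term[question]
        return self.long_term.get(question)

    def remember(self, question, answer):
        self.short_term[question] = answer
        # near the token limit: move the oldest short-term item to long-term memory
        while sum(self._tokens(q + a) for q, a in self.short_term.items()) > self.max_short_tokens:
            oldest_q, oldest_a = self.short_term.popitem(last=False)
            self.long_term[oldest_q] = oldest_a

    def spill(self):
        # on a subject change: move everything from short-term to long-term memory
        self.long_term.update(self.short_term)
        self.short_term.clear()
```

Pointing the same client code at a local OpenAI-compatible server is then just a matter of overriding the base URL, e.g. `OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")`, with the port being whatever your local server listens on.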
Interesting ideas floating around here. But I’m not sure why I would want the LLM to use a cached answer to a previous question. Why? Because sometimes asking the same question over and over, with different context, should lead to a different answer.
I certainly think caching is huge, and is a great way to extend the “memory” of the bot too, well, forever, which is cool!
There is a lot of context from the past that could be brought into the present moment, which I think is much more valuable than caching and saving tokens.
Granted, I am aware of products that you hook your session up with: if the question has been asked before, you get the cached answer with no API usage; otherwise you get a new answer and the database grows.
This might lead to a 20% reduction in API usage … but you won’t be surprised with model advancements over time. Granted, you could have a time gate and require “new after X months” or something to keep things fresh, but this is at the expense of cost savings.
With the advent of open-source LLMs and cheaper and cheaper APIs, it’s all moot. You would only need caching if you went “off grid” with only low-tech gear around you, and asked the steel box a question that, hopefully, had been asked before.
Maybe this is a good permanent way to capture humanity … the big steel box of questions and their answers. I’m ready to shoot this into space now pointed towards the closest planet housing alien life … wait, ever heard of the Dark Forest Theory? Nevermind.
I would agree with @curt.kennedy. I explored building an AI cache for a period of time, but there are just too many edge cases to deal with. Knowing what to cache and what not to cache is nearly impossible without some form of binary classifier.
Mnemosyne inherits its caching from DeepLLM, where it comes in handy when making a long sequence of recursive LLM calls that analyze a given problem in depth.
In this case, caching ensures deterministic and replicable results that can be replayed instantly at no API cost.
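For illustration, prompt-keyed caching along those lines can be as simple as the sketch below; the on-disk layout is an assumption of mine, not DeepLLM’s actual format:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("llm_cache")            # hypothetical cache location
CACHE_DIR.mkdir(exist_ok=True)

def cached_completion(prompt, call_llm):
    """Return the cached answer for an identical prompt if one exists,
    otherwise call the API once and store the result, so replaying the
    same recursive run is deterministic and costs nothing."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["answer"]
    answer = call_llm(prompt)            # e.g. a thin wrapper around the chat API
    path.write_text(json.dumps({"prompt": prompt, "answer": answer}))
    return answer
```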
But I agree that for an interactive chat with an uneditable history, caching should be disabled, as the usual way to refine the LLM’s answers is by extending its context with a new series of prompts.
With Mnemosyne’s ability to “travel in time” and edit any of the past elements of the chat history, the user has full control over what is actually remembered by the LLM.
In any case, related to caching, a more human-like memory is better emulated by using a vector store that retrieves elements of past interactions associatively rather than by simple string matching.
But that is likely to come later, as part of a new app.
I’m trying to understand here. So it looks like you are editing the array sent to the LLM based on keywords only, and hope to integrate a dense representation using embeddings soon? Or are you trying to embed the overall “interaction” as some sort of memory for the LLM?
This is the developers forum, FYI. So, don’t be shy.
Mnemosyne is just an online editor of the LLM dialog steps (question-answer pairs). What I was referring to as a possible next step is selective activation: moving into short-term memory the long-term memory blocks that match the current dialog context (via embeddings stored in something like FAISS or NMSLIB). Assuming a long-lasting interaction with the LLM on various topics, this can bring into the dialog context the shared memory of the user and the LLM, in a way similar to humans following up on past conversations.
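As a sketch of that next step (the embedding model, the FAISS index type and the toy long-term entries below are illustrative choices, not what Mnemosyne ships with):

```python
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Embed a list of texts (the embedding model is an arbitrary choice here)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# long_term holds past (question, answer) blocks; these two are toy examples
long_term = [
    ("What is FAISS?", "A library for efficient similarity search over vectors."),
    ("Explain unification in Prolog.", "Unification matches two terms by binding variables."),
]

vectors = embed([q + "\n" + a for q, a in long_term])
faiss.normalize_L2(vectors)                    # so inner product == cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def recall(context, k=2):
    """Return the k long-term blocks most similar to the current dialog context,
    ready to be moved into short-term memory."""
    query = embed([context])
    faiss.normalize_L2(query)
    _, ids = index.search(query, k)
    return [long_term[i] for i in ids[0]]
```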
From the description, this seems to be just manual conversation management (no, I’m not putting keys into online apps).
You can get a better interface just by allowing the user to unselect past conversation turns, or re-select them (necessarily disabling others), using a token-counting method to visually show what can’t be sent. There’s no need to call things “long-term memory”.
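For example, a rough version of that token gate, assuming tiktoken and an arbitrary budget (both are my choices, not any particular product’s):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # encoding used by gpt-3.5/gpt-4 era models

def selectable_turns(turns, budget=3000):
    """Walk the turns from newest to oldest and flag which ones still fit
    the budget, so the UI can grey out the turns that can't be sent."""
    remaining = budget
    flags = []
    for turn in reversed(turns):               # newest first
        cost = len(enc.encode(turn))
        fits = cost <= remaining
        if fits:
            remaining -= cost
        flags.append((turn, fits))
    return list(reversed(flags))               # back to chronological order
```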
You can look at pytorch embeddings and other little local models for matching past turns to bring back.
If doing it in the back end, I would use a conversation-sending technique like 1/2 recent conversation turns and 1/2 ordered embeddings retrieval over pairs of user/agent exchanges.
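A sketch of that 50/50 split, with `retrieve_similar` and `count_tokens` left as assumed helpers:

```python
def build_context(turns, retrieve_similar, query, budget_tokens=3000, count_tokens=len):
    """Split the context budget: half for the most recent turns, half for
    turns retrieved by embedding similarity to the current query.
    retrieve_similar and count_tokens are assumed helpers (len counts
    characters and is only a stand-in for a real tokenizer)."""
    half = budget_tokens // 2

    recent, used = [], 0
    for turn in reversed(turns):               # newest first
        cost = count_tokens(turn)
        if used + cost > half:
            break
        recent.insert(0, turn)                 # keep chronological order
        used += cost

    retrieved, used = [], 0
    for turn in retrieve_similar(query):       # ordered by similarity
        if turn in recent:
            continue
        cost = count_tokens(turn)
        if used + cost > half:
            break
        retrieved.append(turn)
        used += cost

    return retrieved + recent                  # retrieved context first, then recent turns
```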
The best techniques one might imagine could significantly increase your API usage: deciding whether AI responses are obsolete context, whether they contain content that can be summarized or must be preserved, and whether the chain of thought or task can be followed from just the user inputs or just the AI replies; prioritizing, or collecting and persisting, the game rules or AI behaviors that were given, and identifying when they have been replaced.
Since the ways of using general AI are so varied, the best for me is an expert UI where indeed I get to pick or delete - with just a bit of automatic length management.
I’m thinking this is a good use of embeddings in the long term conversational context.
You don’t have to stuff the prompt with only information that can respond to the question … past interactions are a great way to deepen the connection with the user.
Great for “non-factual” settings, much like common conversation. I like!