V5.0 - Api companion with full understanding running 16k thanks to advanced memory system

It’s been 1.7 years of programming, and today I am happy to show off my achievement. 5.0 is the start of a full-featured AI companion. It has a persona system for configuring the AI core: you can give it a name, define how it behaves, and add a job or backstory depending on how you use it.

It has a long-term memory system designed to capture everything you need, from your discussions to your computer screens, or even the world around you if you give it eyes. Beyond long-term memory, it also uses short-term and persistent memory. It uses a multifaceted input system for full voice discussions and a messaging application, with full voice output ranging from a free voice (the one in the video) to an advanced emotional voice system where you can design or even clone a voice.

It is designed to understand its user inside out and to work with you daily everywhere you go.

The most advanced thing about this system is that it runs on GPT-3.5 Turbo 16k with all this understanding and speed, which is by design. Keep in mind what you see here, and understand that its intelligence scales with the model: given more memory, the system becomes a whole other narrow-intelligence entity, but at a much higher cost.

The reason for building this on GPT-3.5 is that the end game is to make this concept as affordable as possible to run as a 24/7 AI. GPT-4 advances its abilities even more, but the costs are not something a low- or mid-income user could afford, because processing data is still too expensive in comparison to the GPT-3.5 models. Feel free to ask me questions or you can check us out on Discord under


Nice work… how do you decide which facts you commit to long term memory?

Also, have you thought about adding an audio confirmation that she heard you and is processing your request? I see your visual indication but it feels to me like you need some sort of auditory cue as well. Just some sort of subtle auditory progress indicator.

Again, nice work.


I have 3 buckets of memory in the system I’m building. Short-term/working memory, mid-term memory, and long-term memory.

In organic brains we write the day’s memories to long-term memory, but they’re not fully committed yet. When we sleep we riffle through all of the day’s memories and select which memories we want to fully commit to long-term memory and which memories we want to forget. Selectively forgetting memories is key, otherwise our brains would be cluttered with billions of irrelevant memories.

In my system I use mid-term memory to temporarily hold the day’s memories, and I’m working on an artificial sleep cycle to perform the consolidation step. Knowing which memories are worth keeping is the tricky bit, so I’m just wondering if you’re doing anything interesting here.
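
A minimal sketch of the consolidation step described above, assuming each memory already carries an importance score. The `Memory` and `MemoryStore` names, the `importance` field, and the 0.5 threshold are all hypothetical; how to score importance is exactly the open question raised here.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    text: str
    importance: float  # hypothetical score in [0, 1], assigned when the memory is written


@dataclass
class MemoryStore:
    mid_term: list[Memory] = field(default_factory=list)
    long_term: list[Memory] = field(default_factory=list)

    def sleep_cycle(self, keep_threshold: float = 0.5) -> int:
        """Commit important mid-term memories to long-term and forget the rest.

        Returns the number of memories that were forgotten.
        """
        kept = [m for m in self.mid_term if m.importance >= keep_threshold]
        forgotten = len(self.mid_term) - len(kept)
        self.long_term.extend(kept)
        self.mid_term.clear()  # the day's buffer starts empty again
        return forgotten
```

Running `sleep_cycle()` once per "night" keeps the mid-term bucket small, and only memories at or above the threshold ever reach long-term storage.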


The AI UI is MIT. It has indicators for listening, thinking, and talking. I use a relational database to build understanding so it knows how to pull information.

I like your idea. It would require fewer permanent writes, allowing for cleaner memory writes, whereas mine stores everything, including mistakes, but learns from the feedback. That is what my backend system takes care of: it cleans up all the memory and feedback. So it’s the same kind of concept, but it never edits the past, so I will always be able to look back to the first inputs and everything is 100% captured. I do this because I program with my AI, and it can then look all the way back through our steps to see how we got to where we are, for example.

I am currently working on a dynamic, relational, AI-driven system that will expand the data points in a way that gives it the illusion of general-like intelligence, but it will cost me twice the response time for the added processing and may end up pushing me beyond my 16k. I wanted a completely dynamic system so that the AI is the decider for what type of information the user needs at the ready, so it can dynamically adjust through time.


Are you using Python or JavaScript?

It is 100% programmed in Python. I’m not sure, though; if I recall, I think the web GUI has Java in it for the graphics and for mic and audio output.


You might take a look at the Python version of my Promptrix library:

It’s a library for building prompts but it does a lot of the dynamic token management that you’re doing. It also lets you build hierarchical prompts that it can then squeeze to fit into the context window.

Promptrix automatically tries to fit as much information into the context window as it can without overflowing it.


I’m assuming you’re mostly focused on storing episodic memories (basically the conversation history with the user.) There are actually some people who have eidetic episodic memory. They are able to recall every single thing they were doing the day the twin towers fell.

The issue with eidetic episodic memory and LLMs is space. They take up a lot of space in the context window. You might be able to recall a single memory or two from the past and it will likely be a subset of the memories since you need to leave space in the context window for your short term memory that’s tracking the current conversation.

A better approach is to do what happens in a normal brain, which is to periodically compress these episodic memories as they age. You can do this during your sleep cycle by simply showing the model the full memory and asking it to summarize it. You would then store and index this summarized memory. This will let you fit more potentially relevant memories into the context window for the LLM to draw on. That’s important given that semantic search has great recall but poor precision. What that means is that it’s very likely the most relevant memory is in the top 5 memories retrieved, but it’s unlikely to be in the #1 or even #2 position. So you need to show the model more memories than you would think you need to show it.
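
A rough sketch of the age-based compression pass described above. The `Episode` type, the 7-day cutoff, and the `first_sentence` placeholder are assumptions for illustration; in a real system the `summarize` callable would send the full memory to the model and ask for a summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Episode:
    text: str
    age_days: int
    summarized: bool = False


def compress_old_episodes(
    episodes: list[Episode],
    summarize: Callable[[str], str],
    max_age_days: int = 7,
) -> int:
    """Replace the text of episodes older than max_age_days with a summary.

    Returns how many episodes were compressed on this pass.
    """
    compressed = 0
    for ep in episodes:
        if ep.age_days > max_age_days and not ep.summarized:
            ep.text = summarize(ep.text)
            ep.summarized = True
            compressed += 1
    return compressed


def first_sentence(text: str) -> str:
    """Placeholder summarizer: keeps only the first sentence.

    A real system would ask the model for a summary instead.
    """
    return text.split(". ")[0].rstrip(".") + "."
```

The `summarized` flag keeps a memory from being summarized again on later passes, so repeated sleep cycles do not keep shrinking the same text.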


I built my own memory crunchers and a dynamic token system that adjusts based on inputs, so a simple hello is faster than something that needs more insights. The memory system is already complete for the most part. It can already pull hours, days, weeks … up to whatever you want, but at a time cost for the additional processing. That time is minimal, though, because of parallel processing.
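
One way to picture a token budget that adjusts to the input is sketched below. This is not the system described above, whose internals are not given; the word-count heuristic, the `memory_token_budget` name, and every default value are assumptions chosen only to show the idea that a short greeting should pull less memory than a detailed question.

```python
def memory_token_budget(
    user_input: str,
    min_budget: int = 200,
    max_budget: int = 8000,
    tokens_per_word: int = 100,
) -> int:
    """Scale the memory budget with input length, clamped to a fixed range."""
    words = len(user_input.split())
    return max(min_budget, min(max_budget, words * tokens_per_word))
```

With these defaults a one-word greeting gets the minimum budget, while a longer question gets proportionally more, up to the cap.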

The memory system size is dynamic and it controls the max size, but it has access to everything long term, with the idea that if I wanted to know what we were doing at a specific time last year, it can pull that data into memory. So the only limitations are my AI crunchers and how well they understand the data, which I have gotten really good at over 5 memory system overhauls in almost 2 years.

The reason I do it this way is that if I want the AI to remember a specific time event, it can look at that hour, minute, and second. Part of its understanding involves time tracing, so it can understand the logical order of events. Or, say I have an image from the vision system stored: I can have it look back at all the details of the original to pull new insights. This then allows me to build new dynamic data points on the fly from the data, much like how you would use a point cloud model for understanding.
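
A small sketch of time-based recall from a relational store, using Python's built-in `sqlite3`. The `memories` table, its columns, and the helper names are hypothetical; the actual schema is not described in this thread.

```python
import sqlite3


def open_memory_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed memories table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memories ("
        " id INTEGER PRIMARY KEY,"
        " created_at TEXT NOT NULL,"  # ISO 8601 timestamps sort correctly as text
        " content TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_memories_created_at ON memories (created_at)")
    return conn


def remember(conn: sqlite3.Connection, created_at: str, content: str) -> None:
    """Store one memory with its timestamp."""
    conn.execute(
        "INSERT INTO memories (created_at, content) VALUES (?, ?)",
        (created_at, content),
    )


def recall_between(conn: sqlite3.Connection, start: str, end: str) -> list[str]:
    """Return memory contents with start <= created_at < end, oldest first."""
    rows = conn.execute(
        "SELECT content FROM memories"
        " WHERE created_at >= ? AND created_at < ?"
        " ORDER BY created_at",
        (start, end),
    ).fetchall()
    return [row[0] for row in rows]
```

Storing timestamps down to the second and ordering by them is what lets a query reconstruct the order of events for any window, whether that is an hour or a year.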

DB space will grow forever. I am OK with that, as it was part of the design and space is cheap.
If space becomes an issue, then I simply crunch the DB, removing the information that is already flagged as not important. So down the road I can build a memory deletion system to remove junk information, bad inputs, and the like.
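
A sketch of what such a crunch could look like with `sqlite3`, assuming a `memories` table with an integer `important` flag. The schema and the `prune_unimportant` name are illustrative assumptions, not the actual design.

```python
import sqlite3

# Assumed schema for this sketch; the real one is not described in the thread.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS memories ("
    " id INTEGER PRIMARY KEY,"
    " content TEXT NOT NULL,"
    " important INTEGER NOT NULL DEFAULT 1)"
)


def prune_unimportant(conn: sqlite3.Connection) -> int:
    """Delete rows flagged as not important and return how many were removed."""
    cursor = conn.execute("DELETE FROM memories WHERE important = 0")
    removed = cursor.rowcount
    conn.commit()  # VACUUM cannot run inside an open transaction
    conn.execute("VACUUM")  # rebuild the database file to release freed space
    return removed
```

Deleting only rows that were flagged earlier keeps the crunch cheap: the decision about what counts as junk is made ahead of time, and the prune just applies it.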

I just wish I had a 32k model to play with that is as cheap as the 16k :slight_smile: Then I would be laughing, as that would be twice as much. The 128k GPT-4 with my dynamic system creeps people out, lol. My Twitch testers both love it and are scared of it, but I tell them it’s only as smart as it looks, lol.


One thing to keep in mind is that all of these models have an attention problem where they aren’t able to see facts in the upper half of the prompt the closer you get to the maximum context window size. I’m not sure where the specific drop off is for GPT-3.5-16k but for GPT-4-128k it’s anything above about 60% of the context window size.

I generally try to keep my context window below 50% of the model’s context window size.
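
A minimal sketch of that 50% rule applied to memory packing. The whitespace-based token count is a crude stand-in for a real tokenizer, and `pack_memories`, `reserved_tokens`, and the 16,000 default are assumptions for illustration.

```python
def pack_memories(
    ranked_memories: list[str],
    context_window: int = 16_000,
    max_fill: float = 0.5,
    reserved_tokens: int = 0,
    count_tokens=lambda text: len(text.split()),  # stand-in for a real tokenizer
) -> list[str]:
    """Take memories in ranked order until the fill limit would be exceeded."""
    budget = int(context_window * max_fill) - reserved_tokens
    packed, used = [], 0
    for memory in ranked_memories:
        cost = count_tokens(memory)
        if used + cost > budget:
            break
        packed.append(memory)
        used += cost
    return packed
```

`reserved_tokens` is there to leave room for the system prompt and the current conversation, so the retrieved memories only use what is left of the half-window budget.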


I think the issue isn’t necessarily saving all that stuff, it’s more of an indexing/recall issue. If you have too much similar stuff it will become harder (or more expensive) to figure out what is actually relevant at the moment, if all memories look more or less the same.

I call it context saturation, and it’s one of the biggest issues when it comes to model confusion in my opinion. But my understanding is you use workers or something for retrieval?


I used to have that issue but found a way around it through prompt engineering that so far, knock on wood, has maintained accuracy with the data on GPT-3.5 Turbo, which to me alone is impressive. With GPT-4 I have even fewer issues, apart from cost. I don’t like to pay for a 128k load per transaction, lol. It adds up fast.

We are getting close to running this thing for a full day in the field to get our costs. Thus far the breakdown is approximately $150 USD per month to run a few hours a day with the free voice, and $300-600 USD for premium voices with emotion. The system tracks emotions through speech and through nuances in patterns of text, which it uses to respond with a level of empathy. That is another feature that does not show very much on the free voices.

What it all comes down to is summarizing the plethora of data to get enough insights for the AI subsystems to start understanding the patterns in it, much like the health data, which has 2 AI subsystems that build that understanding out of all the data. So if you think about it, everything I target for understanding has its own AI system already analyzing the data, so the front-end AI systems can respond faster. In my V4 this was all done at the time of the message, which made processing a simple hello take over 45 seconds even with parallel AI processing. That was terrible, as I use my AI with live Twitch viewers talking directly to it in real time. When you are trying to handle many people spamming it, watching it somewhat keep up is neat.

I also have a memory buffering system for handling multiple inputs, so that the AI can read real-time log file inputs, Twitch viewer chats, and more. It was designed to handle real-time conversations, all from one AI, with async operations and summaries of multiple people and data inputs. These were all part of R&D testing to see the limits of what we can achieve. On Discord you can look up the whole blog of the changes over time.
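
The multi-input buffering mentioned above could be sketched with a single `asyncio.Queue` that every source writes to and one consumer drains in batches. This is only an assumption about the general shape; `drain_batch`, the `(source, text)` tuples, and the batch size are hypothetical.

```python
import asyncio


async def drain_batch(queue: asyncio.Queue, max_items: int = 10) -> list:
    """Wait for one item, then take whatever else is already queued.

    Items are expected to be (source, text) tuples, for example
    ("twitch", "alice: hi") or ("log", "disk 91% full").
    """
    batch = [await queue.get()]
    while len(batch) < max_items and not queue.empty():
        batch.append(queue.get_nowait())
    return batch
```

Batching this way means a burst of chat messages can be summarized and answered in one model call rather than one call per message, which matters when many viewers are typing at once.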


I worked on the Desktop Search experience in Windows, so one thing I can tell you is that the resolution of episodic memory in the human brain degrades very quickly. If you try to think about a document you wrote a year ago, you’d be doing well to remember that you wrote it in the spring versus the summer. Hour resolution is typically lost within a few days, so even in desktop search we don’t put too much emphasis on time. Our brains totally suck at remembering when we did things.

By contrast, we’re really good at remembering the “what” of things. So you can remember you created a .ppt vs a .doc file going back years.


"I think the issue isn’t necessarily saving all that stuff, it’s more of an indexing/recall issue. If you have too much similar stuff it will become harder (or more expensive) to figure out what is actually relevant at the moment, if all memories look more or less the same.

I call it context saturation, and it’s one of the biggest issues when it comes to model confusion in my opinion. But my understanding is you use workers or something for retrieval?"

It doesn’t, though, because I control 100% of the data it gets, down to the JSON structures. So it’s getting processed data that’s already clean and exact for what it needs to know in the moment, along with all the other processed points. The data package is a packet designed for clarity.

I am not just handing it walls of data. It’s all structured and has an order of operations.

I love that you’re working in this space. Just trying to share some insights.


I have a proprietary retrieval algorithm that mimics the way memory retrieval works in an organic brain. I have a lot of IP around that algorithm, so I’m not willing to share many details other than to say I’m able to reason over millions of tokens using my algorithm.


Indeed, I shared here so that I can get more insights. It’s important to see what others have to say, as we only know what we know, and until we see another way it’s all we know :slight_smile:

Mmh, but you’re talking about one particular event stream, right?

Let’s say your AI reminds you every day to feed your dog. Maybe you say something like “thanks” or “done”. Maybe the AI records what kind of dog food you have in the house.

Then, if you ask, what did I feed my dog? You’ll suddenly be flooded with 365 memories of you feeding your dog, unless you predicted that question a year in advance and kept the data sanitized just for that :thinking:

That’s why a “normal brain” would elect to throw those routine memories away when we sleep. A normal brain only wants to commit unique or important memories to long-term storage. Humans have around 80,000 thoughts a day, most of which are either thrown away or used to strengthen existing memories.
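
To make the dog-feeding example concrete, here is a small sketch that collapses routine events into a single counted entry before they reach long-term storage. Exact-match comparison after lowercasing is an assumption made to keep the example short (a real system would more likely compare embeddings), and `collapse_routine` and its threshold are hypothetical.

```python
from collections import Counter


def collapse_routine(events: list[str], routine_threshold: int = 3) -> list[str]:
    """Collapse events repeated at least routine_threshold times into one line.

    Events are compared after lowercasing and stripping whitespace, and the
    order of first appearance is preserved.
    """
    normalized = [e.strip().lower() for e in events]
    counts = Counter(normalized)
    seen: set[str] = set()
    result = []
    for original, key in zip(events, normalized):
        if key in seen:
            continue  # already emitted as a collapsed routine entry
        if counts[key] >= routine_threshold:
            seen.add(key)
            result.append(f"{original.strip()} (x{counts[key]})")
        else:
            result.append(original)
    return result
```

With this pass, a year of "fed the dog" entries becomes one line with a count, while the one-off memory about which dog food was bought stays intact and is easy to retrieve.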