Exactly. And unlike with ChatGPT, OpenAI has no incentive here to minimize the conversation loaded into the model on every iteration. They limit gpt-4-turbo output to 4k tokens because generation is what actually costs them money and compute time, not the loading of a conversation that's billed per input token anyway.
OpenAI doesn’t describe any technique, such as an embedding database, that could extend the illusion of memory; they only say they’ll truncate the conversation once it no longer fits into the model’s context window.
You could pull down the thread occasionally, truncate it by token count, and send it back as a new thread, so you don’t spend 16k (or 128k) tokens on every question, but then what’s the point of their system anyway?
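If you wanted to do that by hand, it'd look roughly like the sketch below (openai Python client plus tiktoken). The 8k budget, the text-only flattening, and ignoring pagination are all my own simplifications, not anything OpenAI prescribes:

```python
# Minimal sketch: copy the newest messages of an Assistants API thread
# into a fresh thread, keeping only as many as fit in a token budget.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by gpt-4-turbo

def truncated_copy(thread_id: str, budget: int = 8_000) -> str:
    # Newest first, so the most recent messages survive the cut.
    # (Only the first page is fetched here; a real version would paginate.)
    page = client.beta.threads.messages.list(thread_id=thread_id, order="desc")
    kept, used = [], 0
    for msg in page.data:
        # Flatten text content blocks; images are ignored in this sketch.
        text = "".join(
            block.text.value for block in msg.content if block.type == "text"
        )
        used += len(enc.encode(text))
        if used > budget:
            break
        kept.append({"role": msg.role, "content": text})
    kept.reverse()  # restore chronological order
    # Note: depending on the API version, thread creation may only accept
    # user-role messages, so assistant turns might need different handling.
    return client.beta.threads.create(messages=kept).id
```

Which works, but at that point you're re-implementing the context management the threads abstraction was supposed to handle for you.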