Hi all, since the token limit has grown so much, any chance we’ll be able to send ‘diff’ requests instead of full requests? So instead of a 3000-token request, I’d just send the delta from the previous one?
Now that we’re dealing with chat models, which receive information as a “messages” list, it would be great to just send the appended message and not the whole thing.
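Roughly what I mean — the second shape is purely hypothetical, there’s no such parameter in the API today, I’m just illustrating the idea:

```python
# What gets sent today: the full message list on every call.
full_request = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "First question..."},
        {"role": "assistant", "content": "First answer..."},
        {"role": "user", "content": "Follow-up question"},
    ],
}

# What a "diff" request could look like (hypothetical, not a real API):
delta_request = {
    "model": "gpt-3.5-turbo",
    "session": "abc123",  # the server would look up the stored history
    "append": [{"role": "user", "content": "Follow-up question"}],
}
```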
Welcome to the community.
IMO such functionality would require a storage solution on OpenAI’s side. I think you can implement it on your end by creating a proxy with a custom storage solution that gives each session an ID.
Also, the chat completions endpoint can be used for plain completions as well, so implementing diff infrastructure would cost a lot in storage for no good reason. Not to mention that it would add to the token cost.
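Something like this, just as a rough sketch of the proxy idea — an in-memory dict as the “custom storage solution” and a made-up `send_delta` helper; the forwarded call is the normal chat completions one:

```python
# Sketch of a client-side "diff" proxy: the caller sends only the new message
# plus a session ID, and the proxy keeps the full history and forwards it to
# the regular chat completions endpoint.
from openai import OpenAI

client = OpenAI()
sessions: dict[str, list[dict]] = {}  # session_id -> full message history

def send_delta(session_id: str, new_message: dict, model: str = "gpt-3.5-turbo") -> dict:
    history = sessions.setdefault(session_id, [])
    history.append(new_message)  # only the delta crosses the wire to the proxy
    response = client.chat.completions.create(model=model, messages=history)
    reply = {"role": "assistant", "content": response.choices[0].message.content}
    history.append(reply)  # keep the assistant turn so the next delta is enough
    return reply
```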
The objective would be reducing round-trip time by not sending the entire payload every time. If I don’t need those tokens, I’d just send another message.
So implementing this on my side won’t help much.
By far the most time is spent on inference, not on sending data. Also, with a ‘diff’ scheme, I’m not sure how I would control the history that gets sent through the transformer network. What if I wanted the AI to suddenly forget something in the recent past, for example when the topic changes? Not controlling this would drive me nuts!
This is all true, but use cases vary. Also, there are a number of ways to allow diffs, starting with a purely API-level approach with caching on OpenAI’s side, which wouldn’t reduce inference time but would still reduce call time, and going as far as saving partial states of the model, which would definitely save some inference time.
So there are approaches to reduce inference time as well.
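For the “partial states” part, I imagine it would look roughly like key/value cache reuse. A toy illustration using an open model as a stand-in (obviously I don’t know how OpenAI serves models internally):

```python
# Reuse the KV cache for the shared prefix so only the new tokens are
# processed on the follow-up call, instead of recomputing the whole history.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

history = "User: hello, how are you? Assistant: I'm fine, thanks."
delta = " User: what did I just ask you?"

with torch.no_grad():
    # First call: process the full history once and keep the key/value cache.
    hist_ids = tok(history, return_tensors="pt").input_ids
    past = model(hist_ids, use_cache=True).past_key_values

    # Later call: feed only the new tokens, reusing the cached prefix states.
    delta_ids = tok(delta, return_tensors="pt").input_ids
    out = model(delta_ids, past_key_values=past, use_cache=True)
```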
A less naive approach could be detecting similarities in the prompt on OpenAI’s side, rather than on the user’s side, and reusing the computation across sequential calls, to some degree.
As for use cases, I see quite a lot of people building chat agents with a large context window. The window is obviously sliding, but one could design a simple architecture that slides the actual window in jumps, and if diffs were supported, that would reduce the round trip.
Not a trivial implementation, but for chat applications I think it could be useful.
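Roughly what I have in mind — a made-up `trim_in_jumps` helper, with word counts standing in for a real tokenizer just to keep the sketch self-contained:

```python
# Slide the context window "in jumps": instead of dropping the oldest message
# on every turn (which changes the prefix on every call), keep the history
# stable until it exceeds a budget, then drop a whole chunk at once. The
# prefix then stays identical across many calls, which is what would make a
# diff/caching scheme worthwhile.

MAX_TOKENS = 3000    # assumed budget for the conversation history
JUMP_TOKENS = 1000   # how much to drop when the budget is exceeded

def count_tokens(msg: dict) -> int:
    return len(msg["content"].split())  # crude stand-in for a real tokenizer

def trim_in_jumps(messages: list[dict]) -> list[dict]:
    """Drop the oldest messages in one big jump, keeping the system prompt."""
    system, history = messages[:1], messages[1:]
    if sum(count_tokens(m) for m in messages) <= MAX_TOKENS:
        return messages  # prefix unchanged -> previous call's work is reusable
    dropped = 0
    while history and dropped < JUMP_TOKENS:
        dropped += count_tokens(history.pop(0))
    return system + history
```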
This is actually a huge problem, but I think it’s likely unsolvable due to the nature of LLMs.
I’ve tried a number of things, and my conclusion is that reasoning degrades significantly when the model is tasked with thinking in ‘diffs’. Models work much better when they can do next-word prediction on complete snippets.
If anyone has found a way to make this work as well as regular prompts, that would be great. I’d love to be proven wrong here, because the lack of diffs makes chatting with / utilizing content in LLMs a lot more complicated.