Does setting stream=true (using the streaming API) with the chat/completions APIs increase the output token limit? For example, models like gpt-3.5-turbo have a 4k token limit shared between input tokens and output completion tokens. Does each chunk of streamed data count toward the total 4k token limit, or is each chunk counted separately with its own token limit? Thanks a lot. I assume it is a total limit, since the API docs don't mention otherwise and I couldn't find any related information.
Hi and welcome to the developer forum!
Setting the stream parameter to true has no effect on the generated content or its length; the output is simply fed to you as it is generated rather than all at once at the end.
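To illustrate the point, here is a minimal sketch (the generator is a stand-in, not the real API): consuming a stream chunk by chunk reconstructs exactly the same text a non-streaming call would hand you at the end.

```python
def fake_model_generate():
    """Stand-in for the model: yields tokens one at a time,
    the way stream=True delivers chunks as they are produced."""
    for token in ["The", " answer", " is", " 42", "."]:
        yield token

# Non-streaming: wait for everything, receive one blob.
full_response = "".join(fake_model_generate())

# Streaming: receive the same tokens as they arrive.
streamed = []
for chunk in fake_model_generate():
    streamed.append(chunk)  # a real client might print(chunk, end="") here

# Identical content either way -- streaming only changes *when* you get it.
assert "".join(streamed) == full_response
```

The same tokens are produced in the same order; streaming just lets you act on them earlier.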
The way these models work, the token limit is the size of the context window held in GPU memory, where the model both reads the previous text and generates the next token (and then the next, and the next).
For technical reasons, making this window larger is quite expensive.
The generation of the 100th token needs just as much prior context as the generation of the first, so all of the prompt and all previously generated tokens count against the limit, whether you stream them out or receive them in a batch. When you reach the end of the reserved token window, generation hard-stops.
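A sketch of that shared budget, assuming a 4k context window (the numbers are illustrative): prompt tokens and completion tokens draw from the same pool, so the room left for generation shrinks as the prompt grows, streaming or not.

```python
CONTEXT_WINDOW = 4096  # e.g. gpt-3.5-turbo's 4k limit

def max_completion_tokens(prompt_tokens: int) -> int:
    """Tokens left for generation once the prompt fills part of the window.
    Streamed or batched, generation hard-stops when the window is full."""
    remaining = CONTEXT_WINDOW - prompt_tokens
    return max(remaining, 0)

print(max_completion_tokens(1000))  # 3096 tokens of room to generate
print(max_completion_tokens(4096))  # 0 -- the prompt alone fills the window
```

This is why a very long prompt can truncate the completion even when you asked for more output.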