Since the token limit for an API call is 4,096, we have to split long documents anyway. How does the 32k model make a difference compared to the 8k model?
Where did you get the idea that the token limit is 4,096?
From the documentation:
Depending on the model used, requests can use up to 4097 tokens shared between prompt and completion. If your prompt is 4000 tokens, your completion can be 97 tokens at most.
The limit is currently a technical limitation, but there are often creative ways to solve problems within the limit, e.g. condensing your prompt, breaking the text into smaller pieces, etc.
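The "breaking the text into smaller pieces" approach from the quoted docs can be sketched as below. This is only an illustration: token counts are approximated here by whitespace-separated words, which undercounts real tokens; for exact counts you would run the text through a tokenizer such as tiktoken with the model's encoding.

```python
def chunk_document(text: str, max_tokens: int = 3000) -> list[str]:
    """Split `text` into chunks of at most `max_tokens` "tokens".

    Words are used as a rough stand-in for tokens (assumption: real
    token counts are usually higher, so leave headroom under the
    model's actual limit).
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks
```

Each chunk is then sent as its own prompt, leaving the remainder of the context window for the completion.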
That documentation was written in the era of the GPT-3 models, e.g. text-davinci-003, which have a token limit of 4,097. That part of the documentation apparently hasn't been updated yet.
The token limit for gpt-4 is 8,192.
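The point of the larger windows is the shared budget: prompt and completion tokens come out of the same context window, so a bigger window leaves more room for both. A minimal sketch, using the window sizes mentioned in this thread:

```python
# Context windows discussed in this thread (tokens shared between
# prompt and completion).
CONTEXT_WINDOWS = {
    "text-davinci-003": 4097,   # GPT-3-era limit quoted in the docs
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
}

def max_completion_tokens(model: str, prompt_tokens: int) -> int:
    """Tokens left for the completion after the prompt is counted."""
    return max(CONTEXT_WINDOWS[model] - prompt_tokens, 0)
```

With a 4,000-token prompt, text-davinci-003 leaves only 97 tokens for the completion (the exact example from the quoted docs), while gpt-4-32k leaves 28,768.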