My task is to process a very big text (kind of a book). What is the best way to do that with completions? My basic plan is to split it into chunks and process them one by one. But how do I estimate the maximum size of chunk I can provide? Or is it easier to handle the error that the text is too long and split it only in that case?
You can measure the encoded length of a text passage with the tiktoken library. Then accumulate the sections you propose to send (split by sentence or paragraph) until adding the next one would push you past your token threshold.
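A minimal sketch of that idea, assuming tiktoken and the cl100k_base encoding; splitting on blank lines and the 2,000-token budget are illustrative choices, not requirements:

```python
import tiktoken

def chunk_by_paragraph(text: str, max_chunk_tokens: int = 2000) -> list[str]:
    """Greedily pack paragraphs into chunks that stay under a token budget."""
    enc = tiktoken.get_encoding("cl100k_base")
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = (current + "\n\n" + paragraph) if current else paragraph
        if len(enc.encode(candidate)) > max_chunk_tokens and current:
            # The next paragraph would overflow the budget: close this chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

You could split on sentences instead of paragraphs if your paragraphs are long; the packing logic stays the same.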
Token usage of English writing can be estimated at roughly 4 characters per token (about three-quarters of a word), with that compression getting worse the further you stray from Germanic and Romance languages. So a 400,000-character book is on the order of 100,000 tokens.
Each model has a context length, and the response must also fit within it.
Newer OpenAI models also cap the response itself at 4k tokens. So if your task is “improve this text”, you’ll probably want to limit each chunk to around 2k tokens; even 1k will help keep the quality consistent across the length generated.
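For example, with the openai Python SDK you could cap the response per chunk like this (a sketch only; the model name, system prompt, and the 2,000-token cap are illustrative assumptions):

```python
from openai import OpenAI

client = OpenAI()

def improve_chunk(chunk: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Improve the writing of the user's text."},
            {"role": "user", "content": chunk},
        ],
        max_tokens=2000,  # keep the response well under the model's output cap
    )
    return response.choices[0].message.content
```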
tiktoken shows me around 1 token per English word, or am I missing something? Also, about the 4k limit: what about GPT-4 Turbo with its 128k tokens?
All current OpenAI language models use the same token encoder.
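You can check this yourself with tiktoken (a small sketch; the model names listed are just examples of current chat models):

```python
import tiktoken

for model in ("gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"):
    # encoding_for_model maps a model name to its tokenizer
    print(model, tiktoken.encoding_for_model(model).name)
# Each line should print "cl100k_base", i.e. the same encoder.
```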