Counting Tokens on the Rendered Content of HTML (Not the Tags)

So I am curious whether anybody has a recommendation for handling token counting faster. I sometimes have contexts of around 300k characters, which include the HTML tags plus all of the actual text content. Currently we are removing the HTML tags, which takes time as well; maybe LangChain can help with this (not sure), or some other library or tool. Briefly, I need to count the tokens and truncate the content, basic truncation (not meaning-aware), so that only part of the text goes into the prompt because of prompt token limits. This token counting takes incredibly long on large content of 300k characters. That is what I am trying to solve, basically.

Counting tokens faster than what?

There is no need to ever tokenize a 300k-character document in full. You can't submit anything near that size to an AI, so you can truncate the file much shorter first. About 5 characters per token is a good best-case estimate for English, so 4096 × 5 ≈ 20k characters covers a full gpt-3.5-turbo context. For reference, your question above is 126 tokens and 625 characters.
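
A minimal sketch of that pre-truncation idea in Node; the constants and the helper name are just examples, not from any library:

```js
const CHARS_PER_TOKEN = 4;   // conservative; ~5 chars/token is the best case for plain English
const TOKEN_BUDGET = 4096;   // whatever context size you are targeting

function preTruncate(text, tokenBudget = TOKEN_BUDGET, safetyFactor = 1.5) {
  // keep a safety margin and let the exact tokenizer do the final trim later
  const maxChars = Math.floor(tokenBudget * CHARS_PER_TOKEN * safetyFactor);
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}
```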

You can also chunk the file at likely token cut points, such as a newline followed by text (the text after a newline will almost always start a new token), or the end of a run of digits. Then you can spin up multiple counting threads.
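
A rough sketch of the splitting step (the function name and chunk size are arbitrary):

```js
// Split a large string into pieces that can be token-encoded independently,
// preferring to cut just after a newline so we don't split mid-token.
function splitAtTokenBoundaries(text, targetChunkChars = 20000) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + targetChunkChars, text.length);
    if (end < text.length) {
      // back up to the nearest newline before the cut point, if there is one
      const lastNewline = text.lastIndexOf("\n", end);
      if (lastNewline > start) end = lastNewline + 1;
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}
```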

In Node.js, I am using gpt-3-encoder, which OpenAI recommended at the end of this page.

  • We also have content in many different languages. I can do something basic to reduce it to 1/5 or less, and I am aware of that, but it would have worked better with English-only content.

  • 300k characters easily overloads the CPU and is slow to process with gpt-3-encoder, on top of removing the HTML tags (a fast tag-stripping sketch follows below).
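
If the HTML stripping itself is a bottleneck, a crude regex pass is usually fast enough when the text is only needed for token counting. This is a sketch, not a real HTML parser, so it will mangle edge cases:

```js
// Crude but fast HTML-to-text pass: drop <script>/<style> bodies, strip the
// remaining tags, then collapse whitespace.
function stripHtml(html) {
  return html
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, " ") // remove script/style contents
    .replace(/<[^>]+>/g, " ")                              // remove remaining tags
    .replace(/&nbsp;/g, " ")                               // most common entity
    .replace(/\s+/g, " ")                                  // collapse whitespace
    .trim();
}
```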

You must use a tokenizer supporting cl100k_base, which is used by all current OpenAI models.
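
For reference, counting with the WASM tiktoken port mentioned later in the thread (dqbd/tiktoken) looks roughly like this; the exact import style can differ between its builds:

```js
import { get_encoding } from "@dqbd/tiktoken";

// reuse a single cl100k_base encoder; call enc.free() when you are done with it
const enc = get_encoding("cl100k_base");

function countTokens(text) {
  return enc.encode(text).length;
}
```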

Why would you want to “count” a 300,000-character document? You know it won’t fit into any AI context. You need to split it into reasonably sized pieces first, pieces that you expect to meet your truncation requirements, and then you can measure the tokens to refine.

I am aware that gpt-3-encoder does not support GPT-4’s cl100k_base encoding. I only used it because it was recommended by OpenAI on their token-counting page / cookbook, and I wasn’t sure which of the other available libraries to trust. They have just updated that page to recommend dqbd/tiktoken; when I shared it with you in my previous message, it still said gpt-3-encoder. I think it was good enough for my usage, but I will change it, especially now that this is what they recommend.

Clearly I am aware of the token limit. I wasn’t trying to use the 300k characters in the prompt, but there are many different pieces of content with varying lengths, anywhere from 0 to 500k characters. So what I usually do is count the tokens and truncate the content accordingly before passing it in the prompt. I have since changed to counting tokens in chunks to reduce the performance impact, so we only use what we need and ignore the rest of the tokens in the 300k characters. I was curious whether there is any other way or tool.
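
A sketch of that chunked counting and truncation, reusing the countTokens and splitAtTokenBoundaries helpers sketched above: walk the chunks, keep a running total, and stop once the budget is hit, so most of a 300k-character document is never tokenized at all.

```js
// Basic, non-meaningful truncation to a token budget, as described above.
function truncateToTokenBudget(text, tokenBudget) {
  const kept = [];
  let used = 0;
  for (const chunk of splitAtTokenBoundaries(text)) {
    const n = countTokens(chunk);
    if (used + n > tokenBudget) break; // stop early; ignore the rest of the document
    kept.push(chunk);
    used += n;
  }
  return kept.join("");
}
```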

I think that’s the big headline: OpenAI finally updated their token-counting page. Thanks for discovering that.

What is the actual application for the large documents? Do you want the AI to answer questions about them? Do you want them rewritten or summarized?

Since there’s no way to pass such a large document to an AI model, counting its tokens as a whole is kind of pointless. If you’re dealing with augmenting an AI’s knowledge with large documents, you’ll likely want to consider the chunking that goes along with a vector database, where a semantic search done with an embedding engine can actually get your AI the documents relevant to a question.
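
For what it’s worth, a minimal sketch of that retrieval idea, using the OpenAI Node SDK for embeddings and plain cosine similarity standing in for a real vector database (the model name and helper names are only examples):

```js
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// cosine similarity between two embedding vectors
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// embed the question together with the document chunks, then return the
// k chunks most semantically similar to the question
async function topChunks(chunks, question, k = 3) {
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: [question, ...chunks],
  });
  const [q, ...docs] = res.data.map((d) => d.embedding);
  return docs
    .map((e, i) => ({ chunk: chunks[i], score: cosine(q, e) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.chunk);
}
```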

You can make an autochunker that takes your document, splits it at likely token divisions (numbers are good, as are the ends of a series of carriage returns), and then spins off a multithreaded queue of parallel token encoders. This would mostly be to ensure that your estimations from counting words and/or characters haven’t gone astray.
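
A sketch of that parallel queue with Node’s worker_threads, assuming chunks from a splitter like the one above and dqbd/tiktoken installed. One worker per chunk is shown for brevity; a real version would cap concurrency with a small pool.

```js
// Single-file CommonJS sketch: the main thread fans chunks out to workers,
// each worker encodes one chunk with cl100k_base and reports its token count.
const { Worker, isMainThread, parentPort, workerData } = require("worker_threads");

if (isMainThread) {
  module.exports.countChunksParallel = function countChunksParallel(chunks) {
    return Promise.all(
      chunks.map(
        (chunk) =>
          new Promise((resolve, reject) => {
            const worker = new Worker(__filename, { workerData: chunk });
            worker.once("message", resolve);
            worker.once("error", reject);
          })
      )
    ).then((counts) => counts.reduce((sum, n) => sum + n, 0));
  };
} else {
  const { get_encoding } = require("@dqbd/tiktoken");
  const enc = get_encoding("cl100k_base");
  parentPort.postMessage(enc.encode(workerData).length);
  enc.free();
}
```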
