How to cut input text based on OpenAI tokens

Hi, I have this call to the OpenAI API, and I use tiktoken to count the number of tokens before I send the request. The problem is that when a text exceeds the 16k-token limit I try to cut the text, but at the moment it doesn’t work correctly. For example, the code currently cuts a 30k-token text not to the maximum of 16k, but to about 3.5k tokens. Is there a way in Python to cut the text correctly based on tokens? This is the code:

            num_tokens = self.count_tokens_from_text(text, "cl100k_base")
            print("Tokens:", num_tokens)
            model = 'gpt-3.5-turbo-16k'
            max_tokens = 16000  # Maximum tokens allowed by OpenAI API
            if num_tokens > max_tokens:
                # Crop the text to max_tokens
                text = text[:max_tokens + 1]  # Note the +1 here


            print(self.count_tokens_from_text(text, "cl100k_base"))

            input_text = "Develop a 5-sentence script appropriate for a YouTube-style video, "\
                         "using the provided Wikipedia article as the primary source of information."\
                         "I will provide you with a list of image descriptions."\
                         "These are the images that will be used in the video and "\
                         "I want the script to be based on these images."\
                         "Avoid unnecessary details such as suggestions about the pictures in [] brackets "\
                         "and the 'Narrator:' part before each paragraph. "\
                         "Deliver the text as one cohesive string and most importantly the script "\
                         "should be no longer than 5 sentences no matter the length of the article provided."\
                         "This is the provided Wikipedia article: "\
                         f"{text} These are the image descriptions: {image_titles}"

            input_text_tokens = self.count_tokens_from_text(input_text, "cl100k_base")
            print("Input script tokens:", input_text_tokens)

            completion = client.chat.completions.create(
                model=model,
                messages=[
                    {
                        "role": "system",
                        "content": "Develop a 5-sentence script appropriate for a YouTube-style video, "
                                   "using the provided Wikipedia article as the primary source of information."
                                   "I will provide you with a list of image descriptions."
                                   "These are the images that will be used in the video and "
                                   "I want the script to be based on these images."
                                   "Avoid unnecessary details such as suggestions about the pictures in [] brackets "
                                   "and the 'Narrator:' part before each paragraph. "
                                   "Deliver the text as one cohesive string and most importantly the script "
                                   "should be no longer than 5 sentences no matter the length of the article provided."
                    },
                    {
                        "role": "user",
                        "content": f"These are the image descriptions: {image_titles}."
                                   f"And this is the provided Wikipedia article: {text}"
                    },
                ],
            )

            message = completion.choices[0].message
            answer = message.content
            output_text_tokens = self.count_tokens_from_text(answer, "cl100k_base")
            print("Output script tokens:", output_text_tokens)
            return answer

Encoding the whole text to get a token count takes a bit of computation time, so you’ll want to minimize how often you do it.

First, do an informed search: compare the string length to the token count, which gives you a measured characters-per-token ratio. Cut near the character position that ratio predicts, then count tokens on that prefix to see whether you landed higher or lower than your maximum.
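A minimal sketch of that first informed cut. Here `count_tokens` is a stand-in for your own tiktoken-based counter (e.g. a wrapper around `tiktoken.get_encoding("cl100k_base").encode`); the function name is illustrative:

```python
def informed_first_cut(text, count_tokens, max_tokens):
    """Use one full token count to estimate characters per token,
    then cut near the character position that should land on
    roughly max_tokens tokens."""
    total_tokens = count_tokens(text)
    if total_tokens <= max_tokens:
        return text  # already under the limit, nothing to cut
    chars_per_token = len(text) / total_tokens
    return text[: int(max_tokens * chars_per_token)]
```

This gets you close in a single extra slice, but the result can still be slightly over or under the limit, which is why the bisection step follows.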

Then do a binary search on the character position, re-counting tokens on each candidate prefix, to home in on the cut point closest to the token count you want.
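The bisection step can be sketched like this; `count_tokens` is again a placeholder for your tiktoken counter, and the search finds the longest character prefix that stays within the token budget:

```python
def crop_to_max_tokens(text, count_tokens, max_tokens):
    """Binary-search for the longest character prefix whose
    token count does not exceed max_tokens."""
    if count_tokens(text) <= max_tokens:
        return text
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if count_tokens(text[:mid]) <= max_tokens:
            lo = mid  # prefix fits, try a longer one
        else:
            hi = mid - 1  # prefix is too long, shrink
    return text[:lo]
```

This needs only about log2(len(text)) encode calls, each on a prefix, instead of repeatedly re-encoding by trial and error.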

Finally, split naturally around that point, at a word or sentence boundary.