Hi, I have a call to the OpenAI API and I use tiktoken to count the number of tokens before I send the request. The problem is that when a text exceeds the 16k token limit, I try to cut the text, but at the moment this doesn't work correctly. For example, the code currently cuts a 30k-token text not to the maximum of 16k tokens but to about 3.5k tokens. Is there a way in Python to cut the text correctly based on tokens? This is the code:
num_tokens = self.count_tokens_from_text(text, "cl100k_base")
print("Tokens:", num_tokens)
model = 'gpt-3.5-turbo-16k'
max_tokens = 16000 # Maximum tokens allowed by OpenAI API
if num_tokens > max_tokens:
    # Crop the text to max_tokens
    text = text[:max_tokens + 1]  # Note the +1 here
print(self.count_tokens_from_text(text, "cl100k_base"))
input_text = "Develop a 5-sentence script appropriate for a YouTube-style video, "\
    "using the provided Wikipedia article as the primary source of information. "\
    "I will provide you with a list of image descriptions. "\
    "These are the images that will be used in the video and "\
    "I want the script to be based on these images. "\
    "Avoid unnecessary details such as suggestions about the pictures in [] brackets "\
    "and the 'Narrator:' part before each paragraph. "\
    "Deliver the text as one cohesive string and most importantly the script "\
    "should be no longer than 5 sentences no matter the length of the article provided. "\
    "This is the provided Wikipedia article: "\
    f"{text} These are the image descriptions: {image_titles}"
input_text_tokens = self.count_tokens_from_text(input_text, "cl100k_base")
print("Input script tokens:", input_text_tokens)
completion = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "Develop a 5-sentence script appropriate for a YouTube-style video, "
                       "using the provided Wikipedia article as the primary source of information. "
                       "I will provide you with a list of image descriptions. "
                       "These are the images that will be used in the video and "
                       "I want the script to be based on these images. "
                       "Avoid unnecessary details such as suggestions about the pictures in [] brackets "
                       "and the 'Narrator:' part before each paragraph. "
                       "Deliver the text as one cohesive string and most importantly the script "
                       "should be no longer than 5 sentences no matter the length of the article provided."
        },
        {
            "role": "user",
            "content": f"These are the image descriptions: {image_titles}. "
                       f"And this is the provided Wikipedia article: {text}"
        },
    ],
)
message = completion.choices[0].message
answer = message.content
output_text_tokens = self.count_tokens_from_text(answer, "cl100k_base")
print("Output script tokens:", output_text_tokens)
return answer
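
The slice `text[:max_tokens + 1]` counts characters, not tokens, which would explain the result: about 16,000 characters of English text comes out at roughly 3.5k tokens. Below is a minimal sketch of cropping by tokens instead: encode the text, slice the token list, and decode it back. The helper name `truncate_to_tokens` and its optional `encoding` parameter are made up for illustration. `tiktoken.get_encoding`, `encode` and `decode` are the tiktoken calls it relies on.

```python
def truncate_to_tokens(text, max_tokens, encoding=None):
    """Return `text` cut to at most `max_tokens` tokens (hypothetical helper)."""
    if encoding is None:
        # Imported lazily so the helper can also be tried with a stand-in encoder.
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")

    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        # Already within budget: return the text unchanged.
        return text

    # Slice the token list (not the string), then turn the tokens back into text.
    return encoding.decode(tokens[:max_tokens])
```

In the code above, this would replace the slicing line, for example `text = truncate_to_tokens(text, max_tokens)`. Two caveats, as far as I understand them: the model's context window is shared by the system prompt, the user message and the reply, so the article probably needs a budget somewhat below 16,000 tokens. Also, cutting in the middle of a multi-byte character can leave a stray replacement character at the end of the decoded text, so the re-counted total may differ by a token or so.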