Token Counting with sentence

nhtkid · April 8, 2023, 12:59am

Hi,

I have a long text coming off a Whisper transcription. Before it can be sent to GPT, the text needs to be chunked into smaller pieces to fit the max token limit.

I have been using the tiktoken library and it has been working fine, dividing the text into chunks of 4096 tokens.

What do you use if you want to divide by sentences? I also need each chunk to begin with the last sentence of the previous chunk to provide a bit more context.

Thanks.

nhtkid · April 8, 2023, 8:29am

I don’t program, so it took me a lot of prompt engineering to get our GPT friend to do the right thing. That was mainly because ChatGPT couldn’t understand that I wanted the second chunk to start with the last sentence of the first chunk, so that the chunks overlap.

But finally, we got there.

It uses tiktoken to count the tokens and the re library to split the text into sentences.

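import re
import tiktoken

# Assumption: the encoding for gpt-3.5-turbo; swap in the one for your model
encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')

# Split the text into sentences (text holds the full transcript)
sentences = re.split(r'(?<=[.!?]) +', text)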

# Group the sentences into chunks
max_chunk_size = 4000 # Below 4096 to leave room for the prompt's own tokens
chunks = []
current_chunk = ''
current_chunk_tokens = 0

for i, sentence in enumerate(sentences):
    sentence_tokens = len(encoding.encode(sentence))
    if current_chunk and current_chunk_tokens + sentence_tokens > max_chunk_size:
        # Add the current chunk to the list of chunks
        chunks.append((current_chunk_tokens, current_chunk))
        # Start a new chunk with the last sentence of the previous chunk
        current_chunk = sentences[i-1] + ' ' + sentence
        current_chunk_tokens = len(encoding.encode(current_chunk))
    else:
        # Add the sentence to the current chunk, or start the first chunk with it
        if current_chunk:
            current_chunk += ' ' + sentence
            current_chunk_tokens += sentence_tokens
        else:
            current_chunk = sentence
            current_chunk_tokens = sentence_tokens

# Add the last chunk to the list of chunks
chunks.append((current_chunk_tokens, current_chunk))

# Print each chunk and its token count on a new line
for i, (tokens, chunk) in enumerate(chunks):
    print(f'Chunk [{i}]: {chunk}\nTokens: {tokens}\n')
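
For anyone who wants to reuse this, the same idea can be wrapped in a function. This is only a rough sketch, not a tested implementation: the names chunk_by_sentence and word_count are made up for illustration, and word_count is a stand-in that counts whitespace-separated words so the example runs without tiktoken installed. With tiktoken you would presumably pass something like lambda s: len(encoding.encode(s)) as the counter instead.

```python
import re
from typing import Callable, List, Tuple


def chunk_by_sentence(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[Tuple[int, str]]:
    """Group sentences into chunks, each new chunk starting with the
    last sentence of the previous chunk (hypothetical helper)."""
    # Split after ., ! or ? followed by one or more spaces
    sentences = [s for s in re.split(r'(?<=[.!?]) +', text.strip()) if s]
    chunks: List[Tuple[int, str]] = []
    current: List[str] = []  # sentences in the chunk being built

    for sentence in sentences:
        candidate = ' '.join(current + [sentence])
        if current and count_tokens(candidate) > max_tokens:
            # Close the current chunk and record its token count
            chunk_text = ' '.join(current)
            chunks.append((count_tokens(chunk_text), chunk_text))
            # Carry the last sentence over as overlap
            current = [current[-1], sentence]
        else:
            current.append(sentence)

    # Add whatever is left as the final chunk
    if current:
        chunk_text = ' '.join(current)
        chunks.append((count_tokens(chunk_text), chunk_text))
    return chunks


def word_count(s: str) -> int:
    """Stand-in counter (assumption): words, not real model tokens."""
    return len(s.split())


sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for tokens, chunk in chunk_by_sentence(sample, 6, word_count):
    print(f'{tokens}: {chunk}')
```

One limitation to keep in mind: if the overlap sentence plus the next sentence is already over the limit, that chunk will still exceed max_tokens, so very long sentences would need extra handling.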
