Token Counting by Sentence

Hi,

I have a long text coming off a Whisper transcription. Before it can be sent to GPT, the text needs to be chunked into smaller pieces to fit the max token limit.

I have been using the tiktoken library and it has been working fine, dividing the text into chunks of 4096 tokens.

What do you use if you want to divide by sentences instead? I also need each chunk to begin with the last sentence of the previous chunk, to provide a bit more context.

Thanks.

I don’t program, so it took me a lot of prompt engineering to get our GPT friend to do the right thing. That was mainly because ChatGPT couldn’t understand that I wanted the second chunk to start with the last sentence of the first chunk, so that the chunks overlap.

But finally, we got there.

It uses tiktoken to count the tokens and the re library to split the text into sentences.
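
For anyone curious what the sentence split does on its own, here is a small sketch using the same kind of lookbehind pattern as the script below. The sample string is made up for illustration. It also shows one limitation: abbreviations such as "Dr." are treated as sentence endings.

```python
import re

# Split on one or more spaces that directly follow ., ! or ?
# The lookbehind keeps the punctuation attached to its sentence.
SENTENCE_END = re.compile(r'(?<=[.!?]) +')

sample = "That went well. Did it work? Yes! Dr. Smith agreed."  # made-up text
sentences = SENTENCE_END.split(sample)
print(sentences)
# ['That went well.', 'Did it work?', 'Yes!', 'Dr.', 'Smith agreed.']
```

For transcripts that is usually good enough, since a stray split only makes one "sentence" shorter and never loses text.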

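import re
import tiktoken

# Assumption: the target model is gpt-3.5-turbo; change this to match your model
encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')

# Split the text into sentences (text holds the Whisper transcript)
sentences = re.split(r'(?<=[.!?]) +', text)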

# Group the sentences into chunks
max_chunk_size = 4000 # Leaves room under the 4096 limit for the prompt's own tokens
chunks = []
current_chunk = ''
current_chunk_tokens = 0

for i, sentence in enumerate(sentences):
    sentence_tokens = len(encoding.encode(sentence))
    # Only close a chunk that already has content (avoids an empty first chunk)
    if current_chunk and current_chunk_tokens + sentence_tokens > max_chunk_size:
        # Add the current chunk to the list of chunks
        chunks.append((current_chunk_tokens, current_chunk))
        # Start a new chunk with the last sentence of the previous chunk (the overlap)
        current_chunk = sentences[i-1] + ' ' + sentence
        current_chunk_tokens = len(encoding.encode(current_chunk))
    elif current_chunk:
        # Add the sentence to the current chunk
        # (the running count is approximate: the joining space is not re-tokenized)
        current_chunk += ' ' + sentence
        current_chunk_tokens += sentence_tokens
    else:
        # Very first sentence: start the first chunk
        current_chunk = sentence
        current_chunk_tokens = sentence_tokens

# Add the last chunk to the list of chunks
chunks.append((current_chunk_tokens, current_chunk))

# Print each chunk and its token count on a new line
for i, (tokens, chunk) in enumerate(chunks):
    print(f'Chunk [{i}]: {chunk}\nTokens: {tokens}\n')
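
If you want to reuse this, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the name chunk_by_sentences and the count_tokens parameter are made up for illustration, and the word-count function in the demo is a stand-in so the example runs without tiktoken installed. It recounts the whole candidate chunk each time, which is slower but keeps the count exact, and it drops the overlap sentence when overlap plus the new sentence would not fit.

```python
import re
from typing import Callable, List


def chunk_by_sentences(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Group sentences into chunks of at most max_tokens tokens.

    Each new chunk starts with the last sentence of the previous chunk.
    A single sentence longer than max_tokens is kept as its own
    oversized chunk rather than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built

    for sentence in sentences:
        candidate = ' '.join(current + [sentence])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(' '.join(current))
            # Overlap: carry the previous chunk's last sentence forward.
            current = [current[-1], sentence]
            if count_tokens(' '.join(current)) > max_tokens:
                # Overlap does not fit next to this sentence, so drop it.
                current = [sentence]
        else:
            current.append(sentence)

    if current:
        chunks.append(' '.join(current))
    return chunks


def word_count(s: str) -> int:
    """Stand-in token counter for the demo: counts whitespace-separated words."""
    return len(s.split())


demo_text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
demo_chunks = chunk_by_sentences(demo_text, max_tokens=6, count_tokens=word_count)
for chunk in demo_chunks:
    print(word_count(chunk), chunk)
```

To count real tokens, pass count_tokens=lambda s: len(encoding.encode(s)), where encoding is the tiktoken encoding used in the script above.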