What is the best way to break text into logical blocks?

To work with embeddings, I need to break the text into blocks. What is the best way to split text into logical blocks?

Are you asking about chunking text for embeddings or something different?

Yes, I’m asking about chunking text.

The goal is to reduce the amount of text in each chunk without losing context. If the content is organized under headings, one strategy is to keep the relevant headings with each chunk and drop the text that falls outside it.
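Here is a minimal sketch of the heading-retention idea, assuming the section has already been separated into a heading and a list of paragraphs. The function name `chunk_with_heading` and the character-based `max_chars` limit are illustrative assumptions, not part of any library.

```python
def chunk_with_heading(heading: str, paragraphs: list[str], max_chars: int = 500) -> list[str]:
    """Group paragraphs into chunks and prefix every chunk with its heading.

    Hypothetical helper: max_chars is a crude size limit measured in
    characters of paragraph text (the heading itself is not counted).
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        # Close the current chunk when this paragraph would push it over the limit.
        if current and size + len(para) > max_chars:
            chunks.append(heading + "\n\n" + "\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append(heading + "\n\n" + "\n\n".join(current))
    return chunks


if __name__ == "__main__":
    parts = chunk_with_heading("Setup", ["a" * 300, "b" * 300, "c" * 100])
    for part in parts:
        print(repr(part[:20]), len(part))
```

Each chunk carries its heading, so a retrieved chunk still says which section it came from even though the rest of the section was dropped.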

Thanks for the answer!

There aren’t always headings. For example, some large texts are divided only into paragraphs: a title may be followed by 5 to 7 long paragraphs.

What other strategies can you advise?

With GPT-4, a very large amount of text can fit into the prompt (system/user roles), so you don’t have to split up your body of text into quite as small pieces as GPT-3 requires. I’d divide your text by headings, as @wfhbrian suggested, and count the tokens under each heading. You might be pleasantly surprised that no further splitting is needed. If further splitting is needed, use sub-headings if available, paragraph breaks, and/or semantic breaks (where the text naturally starts a new thought). I hope that’s helpful.
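To make that workflow concrete, here is a rough sketch: split on headings, count tokens per section, and fall back to paragraph breaks only for sections that are over budget. It assumes Markdown-style `#` headings, and `count_tokens` is a stand-in that counts whitespace-separated words; for real budgets you would swap in the tokenizer that matches your model. All function names and the `max_tokens` default are made up for illustration.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: approximate tokens by whitespace-separated words.
    return len(text.split())


def split_by_headings(text: str) -> list[tuple[str, str]]:
    """Split text into (heading, body) pairs on Markdown-style '#' headings."""
    sections: list[tuple[str, str]] = []
    heading, body = "", []

    def flush() -> None:
        joined = "\n".join(body).strip()
        if heading or joined:  # skip empty leading sections
            sections.append((heading, joined))

    for line in text.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()
            heading, body = line.strip(), []
        else:
            body.append(line)
    flush()
    return sections


def chunk_document(text: str, max_tokens: int = 200) -> list[tuple[str, str]]:
    """Return (heading, chunk) pairs, splitting further only when needed."""
    chunks: list[tuple[str, str]] = []
    for heading, body in split_by_headings(text):
        if count_tokens(body) <= max_tokens:
            chunks.append((heading, body))  # no further splitting needed
            continue
        # Section is over budget: pack whole paragraphs into chunks.
        # A single paragraph larger than max_tokens is kept as one oversized chunk.
        current, used = [], 0
        for para in re.split(r"\n\s*\n", body):
            n = count_tokens(para)
            if current and used + n > max_tokens:
                chunks.append((heading, "\n\n".join(current)))
                current, used = [], 0
            current.append(para)
            used += n
        if current:
            chunks.append((heading, "\n\n".join(current)))
    return chunks


if __name__ == "__main__":
    doc = "# Intro\n\nshort text here\n\n# Details\n\nw w w w w w\n\nx x x x x x\n"
    for heading, chunk in chunk_document(doc, max_tokens=8):
        print(heading, "|", count_tokens(chunk), "tokens")
```

Semantic breaks (where the text starts a new thought) are not covered here; they usually need either manual judgment or a similarity-based method on top of the paragraph split.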
