What is the best way to break text into logical blocks?

Kostarica · March 16, 2023, 3:29pm

To work with embeddings, you need to break the text into blocks. What is the best way to break text into logical blocks?

wfhbrian · March 16, 2023, 3:39pm

Are you asking about chunking text for embeddings or something different?

Kostarica · March 16, 2023, 5:12pm

Yes, I’m asking about text fragmentation.

wfhbrian · March 16, 2023, 5:19pm

The goal is to reduce the size of each chunk without losing context. If the content is organized with headings, one strategy is to keep the relevant headings with each chunk while removing the text that falls outside it.
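
As a rough illustration of the heading-retention idea, here is a minimal sketch. It assumes the text uses Markdown-style `#` headings, which may not match your documents; the function name `chunk_by_headings` and the `headings`/`text` keys are made up for this example.

```python
import re

# Assumption: headings look like Markdown ("# Title", "## Subtitle", ...).
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_headings(text):
    """Split text at headings; each chunk keeps the trail of headings above it."""
    chunks = []
    trail = []   # (level, title) pairs forming the current heading path
    body = []    # lines collected under the current heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            path = " > ".join(title for _, title in trail)
            chunks.append({"headings": path, "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            # Drop headings at the same or a deeper level than the new one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n# FAQ\nAnswers."
for chunk in chunk_by_headings(doc):
    print(chunk["headings"], "|", chunk["text"])
```

Each chunk can then be embedded together with its heading trail, so a passage under "Setup" still carries the context that it belongs to "Guide".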

Kostarica · March 16, 2023, 5:28pm

Thanks for the answer!

Headings are not always present. For example, some large texts are divided only into paragraphs: a title may be followed by 5 to 7 lengthy paragraphs.

What other strategies can you advise?

lmccallum · March 16, 2023, 5:47pm

With GPT-4, a very large amount of text can fit into the prompt (system/user roles), so you don’t have to split your body of text into pieces as small as GPT-3 requires. I’d divide your text by headings, as @wfhbrian suggested, and count the tokens under each heading. You might be pleasantly surprised that no further splitting is needed. If further splitting is needed, use sub-headings if available, paragraph breaks, and/or semantic breaks (where the text naturally starts a new thought). I hope that’s helpful.
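
The count-then-split step might look something like the sketch below. It is only an illustration: `rough_token_count` is a crude whitespace word count standing in for a real tokenizer (you would pass your model's tokenizer as `count_tokens`), and `split_to_budget` and `max_tokens` are names invented for this example.

```python
import re

def rough_token_count(text):
    # Assumption: whitespace-separated words as a crude proxy for tokens.
    # Real token counts differ; swap in your model's tokenizer here.
    return len(text.split())

def split_to_budget(section, max_tokens, count_tokens=rough_token_count):
    """Keep a section whole if it fits; otherwise pack paragraphs greedily."""
    if count_tokens(section) <= max_tokens:
        return [section.strip()]

    # Paragraph breaks are taken to be blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", section) if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: start a new chunk.
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    # Note: a single paragraph larger than max_tokens is kept whole here;
    # it would need a finer split (sentences or semantic breaks).
    return chunks

section = "one two three\n\nfour five\n\nsix seven eight nine"
print(split_to_budget(section, max_tokens=5))
```

Sections that already fit the budget come back untouched, so the paragraph-level split only applies where it is actually needed.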
