Is it bad to cut in the middle of a sentence with Embeddings?

adam.winther · June 30, 2023, 9:21am

Hi.

I am converting PDFs to text, which I then pass to the embeddings API.
I am using CharacterTextSplitter to split the document into sections. One possible problem is that it often cuts sentences in the middle. I am not too familiar with embeddings, so I wanted to ask whether this could cause problems, such as the model misunderstanding a sentence or section, or missing part of its context.

An example could be:
Section 1: “For example, bacteria will spoil milk in two or three hours if the milk is left out on the kitchen counter at”
Section 2: “the kitchen counter at room temperature. However, by reducing the temperature of the milk,”

EDIT: Just to clarify, my sections are substantially longer than the examples. I use a chunk size of 1000 with a chunk overlap of 200.

Foxalabs · June 30, 2023, 9:47am

Welcome to the forum!

So long as you have overlap there should always be a chunk with the full sentence in it.

adam.winther · June 30, 2023, 11:19am

Thanks for the quick reply!
Often there is no chunk with the full sentence. Much like in the example, a chunk can contain part of the first sentence but not the whole sentence. Maybe I should use a larger overlap?
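
To check how often this happens for a given setting, here is a minimal sketch. It assumes a plain fixed-width sliding window over characters, which is a simplification and not necessarily how CharacterTextSplitter behaves internally. The helper names (`chunk_text`, `split_sentences`, `uncovered_sentences`) are made up for illustration, and the sentence splitter is deliberately naive.

```python
import re


def chunk_text(text, chunk_size=1000, overlap=200):
    """Fixed-width sliding window (an assumed, simplified splitter).

    Each chunk starts (chunk_size - overlap) characters after the previous one.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def split_sentences(text):
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def uncovered_sentences(text, chunks):
    """Return the sentences that do not appear whole in any chunk."""
    return [s for s in split_sentences(text) if not any(s in c for c in chunks)]


if __name__ == "__main__":
    sample = " ".join(
        f"Sentence number {i} talks about milk and bacteria." for i in range(50)
    )
    for overlap in (0, 20, 60):
        missing = uncovered_sentences(sample, chunk_text(sample, 100, overlap))
        print(f"overlap={overlap}: {len(missing)} sentence(s) never appear whole")
```

With a window like this, a sentence is guaranteed to appear whole in some chunk only when it is no longer than about the overlap, so an overlap of 200 characters cannot protect sentences longer than that.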

Foxalabs · June 30, 2023, 11:21am

For sure. I usually run 66-75% overlap; embeddings and storage are cheap compared to the value gained from accurate results.
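
As a rough way to see what a larger overlap costs, the sketch below counts how many chunks (and therefore embedding calls and stored vectors) a document produces. It assumes a simple fixed-width sliding window over characters; `num_chunks` is a hypothetical helper, and the 100,000-character document length is just an example figure.

```python
def num_chunks(text_len, chunk_size, overlap):
    """Number of chunks a fixed-width sliding window produces (assumed model)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if text_len <= chunk_size:
        return 1
    step = chunk_size - overlap
    # One initial chunk, plus enough steps to reach the end (integer ceiling).
    return 1 + -(-(text_len - chunk_size) // step)


if __name__ == "__main__":
    for overlap in (0, 200, 700):
        count = num_chunks(100_000, 1000, overlap)
        print(f"overlap={overlap}: {count} chunks")
    # overlap=0: 100 chunks
    # overlap=200: 125 chunks
    # overlap=700: 331 chunks
```

Under these assumptions, going from 20% to 70% overlap is roughly 2.6 times as many chunks to embed and store.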

adam.winther · June 30, 2023, 11:36am

Thank you very much for the help.
