Is it bad to cut in the middle of a sentence with Embeddings?

Hi.

I am converting PDFs to text, which I then pass to the embeddings API.
I am using CharacterTextSplitter to split the document into sections. One possible problem is that it often cuts sentences in the middle. I am not too familiar with embeddings, so I wanted to ask whether this could cause problems, such as the model misunderstanding a sentence or section, or missing its full context.

An example could be:
Section 1: “For example, bacteria will spoil milk in two or three hours if the milk is left out on the kitchen counter at”
Section 2: “the kitchen counter at room temperature. However, by reducing the temperature of the milk,”

EDIT: Just to clarify, my sections are substantially longer than the examples. I use a chunk size of 1000 with a chunk overlap of 200.


Welcome to the forum!

So long as you have overlap there should always be a chunk with the full sentence in it.
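
How far this holds can be checked with a little arithmetic, assuming a plain character sliding window (which may not match exactly what CharacterTextSplitter does). Under that assumption, a sentence is guaranteed to appear whole in some chunk only if it is at most `overlap + 1` characters long; longer sentences can be cut in every chunk, depending on where they start. The helper name below is hypothetical.

```python
def sentence_fits_in_some_chunk(start: int, length: int,
                                chunk_size: int, overlap: int) -> bool:
    """Return True if the span [start, start + length) lies entirely inside
    at least one sliding window of size `chunk_size` with the given overlap.

    Assumes windows begin at 0, step, 2*step, ... with step = chunk_size - overlap.
    """
    step = chunk_size - overlap
    # The latest window that begins at or before the sentence start is the
    # one that extends furthest to the right, so it is the only one to check.
    window_start = (start // step) * step
    return start + length <= window_start + chunk_size


# Worst case for chunk_size=1000, overlap=200: the sentence starts on the
# last character before the next window begins (position 799).
print(sentence_fits_in_some_chunk(799, 201, 1000, 200))  # 201 = overlap + 1
print(sentence_fits_in_some_chunk(799, 202, 1000, 200))  # one character longer
```

So with an overlap of 200, sentences up to about 200 characters are always safe under this model, while longer ones depend on where they happen to start.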


Thanks for the quick reply!
Often there is no chunk with the full sentence. Much like in the example, a chunk can contain part of the first sentence but not the whole sentence. Maybe I should use a larger overlap?

For sure. I usually run 66-75% overlap; embeddings and storage are cheap compared to the value gained from accurate results.
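
To get a rough feel for the cost side of that trade-off, here is a small sketch that counts how many chunks a plain character sliding window would produce at different overlaps. The function name and the 100,000-character document length are assumptions for illustration, and a real splitter may produce somewhat different counts.

```python
def chunk_count(text_len: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover `text_len` characters.

    Hypothetical helper: windows start at 0, step, 2*step, ... and stop once
    a window reaches the end of the text.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if text_len <= 0:
        return 0
    if text_len <= chunk_size:
        return 1
    step = chunk_size - overlap
    # Integer ceiling of (text_len - chunk_size) / step, plus the first window.
    return -(-(text_len - chunk_size) // step) + 1


doc_len = 100_000  # assumed document length in characters
for overlap in (200, 660, 700, 750):
    n = chunk_count(doc_len, 1000, overlap)
    print(f"overlap={overlap}: {n} chunks to embed")
```

Under these assumptions, going from a 200-character overlap to 700 raises the chunk count from 125 to 331, so roughly 2.6 times as many embeddings to compute and store, in exchange for much longer spans that are guaranteed to appear intact in at least one chunk.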


Thank you very much for the help :)
