Reasonable text length for embedding

Hi Guys,
I wonder if anyone has thoughts or experience on what a reasonable length is for a text you want to get an embedding for.
I am thinking about splitting longer texts into parts for semantic search, and I'm wondering how long such a part should be.
I would assume that the longer the paragraph I embed, the more “blurry” the embedding gets, since a longer paragraph covers more semantic concepts than a shorter one would.
Does that make sense? Any thoughts on that?

It depends on what you are going to do with the results of the search.

If you are going to use a result to ask GPT a follow-up question, you want it to be about half the token limit of the model you are going to use (e.g. Davinci is 4096, so 2048 would be my target). That leaves room for the question and the answer.

But if you want to combine the top two or three results and then ask a question, I would make them 700 to 1000 tokens long.
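
To make the token targets above concrete, here is a minimal sketch of packing consecutive paragraphs into chunks under a budget. The function name `chunk_paragraphs` and the default budget of 800 are just for illustration, and the whitespace word count is only a rough stand-in for a real tokenizer (actual token counts are usually higher than word counts), so treat the numbers as approximate.

```python
def chunk_paragraphs(paragraphs, max_tokens=800, count_tokens=None):
    """Greedily pack consecutive paragraphs into chunks of at most max_tokens.

    A single paragraph that is longer than max_tokens is kept whole
    as its own oversized chunk rather than being cut mid-paragraph.
    """
    if count_tokens is None:
        # Assumption: whitespace-separated words as a crude token estimate.
        # Swap in a real tokenizer's counting function for accurate budgets.
        count_tokens = lambda text: len(text.split())

    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        # Flush the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Passing your own `count_tokens` lets you match the budget to whatever model you end up using.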

If you are going to just output what you find, then any length would be OK, and it would depend on the relevance of the text to the end user once it is found

Your comment about blurriness is correct. But sometimes (not always) the paragraphs before and after can still add to the score in a useful way, even though they are not a direct hit.

Thanks for your answer, very helpful. I am playing around with a code of law, since I thought its paragraphs are by definition semantically autonomous.
So I want to put the embeddings into Pinecone and then, when I ask a question, embed the question and pull the related paragraphs back from Pinecone.
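
As a rough illustration of that flow (embed the paragraphs, embed the question, return the closest paragraphs), here is a self-contained sketch. It does not call Pinecone or any real embedding model: `toy_embed` is a hypothetical bag-of-words stand-in and the "index" is just a Python list, so only the shape of the retrieval step carries over.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, paragraphs, k=2):
    """Return the k (score, paragraph) pairs most similar to the query."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(p)), p) for p in paragraphs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


paragraphs = [
    "Theft is the taking of property without consent.",
    "A contract requires offer and acceptance.",
    "Speed limits apply on public roads.",
]
best = top_k("what makes a contract valid", paragraphs, k=1)
```

With a real setup, the embedding call and the vector store query would replace `toy_embed` and the list scan, but the ranking by cosine similarity works the same way.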

When you say “code of law”, do you mean legal laws and procedures?

Do the paragraphs refer to each other at all (e.g. refer to sub clause 2)c) etc.)?

If they do, then it becomes a lot trickier to gather all the context you need. I would go with a smaller token size per embedding and use logic to pull out these references, so you can fetch the original source text they refer to.
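
The "logic to pull out these references" could be as simple as a regular expression, sketched below. The reference format is an assumption (phrases like "section 14" or "sub clause 2(c)"); a real code of law will have its own numbering style, so `REF_PATTERN` and `find_references` are hypothetical and would need adapting.

```python
import re

# Assumed reference format: "section 14", "clause 2(c)", "sub clause 2(c)".
# This is a hypothetical pattern, not a general legal citation parser.
REF_PATTERN = re.compile(
    r"\b(?:sub[\s-]?)?(?:section|clause)\s+(\d+)(?:\s*\(?([a-z])\))?",
    re.IGNORECASE,
)


def find_references(text):
    """Return referenced clause ids such as '14' or '2(c)' found in text."""
    refs = []
    for match in REF_PATTERN.finditer(text):
        number, letter = match.group(1), match.group(2)
        refs.append(number + (f"({letter.lower()})" if letter else ""))
    return refs


refs = find_references("Subject to sub clause 2(c) and section 14, the tenant may sublet.")
```

The extracted ids can then be used as keys to look up the referenced paragraphs and add them to the context alongside the search hit.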

I haven’t done it - but I imagine that’s how you might go about it.

I am trying it with quite basic content. The paragraphs define different laws, so each of them describes a separate legal concept or aspect. In what I am testing there is not much cross-referencing.
And the token size is way lower than what you were talking about.

I'd say between 10 and 70 words or so.
