Reasonable text length for embedding

heiko · January 11, 2023, 7:36pm

Hi Guys,
I wonder if there are any thoughts or experiences on what the most reasonable length is for a text to embed.
I am thinking about splitting longer text into parts for semantic search and I'm wondering how long each part should be.
I would assume the longer the paragraph I embed, the more “blurry” the embedding gets, since a longer paragraph covers more semantic concepts than a shorter one would?
Does that make sense? Any thoughts on that?

raymonddavey · January 11, 2023, 8:39pm

It depends on what you are going to do with the results of the search.

If you are going to use it to ask GPT a follow-on question, you want it to be about half the token limit of the model you are going to use (e.g. Davinci is 4096, so 2048 would be my target).

But if you want to combine the top two or three results and then ask a question, I would make them 700 to 1000 tokens long.

If you are just going to output what you find, then any length would be OK, and it would depend on how relevant the text is to the end user once it is found.

Your comment about blurriness is correct. But sometimes (not always) the paragraph before and after can still add to the score in a useful way, even though it is not a direct hit.
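
A minimal sketch of the sizing idea above, assuming paragraphs are packed greedily into chunks under a token budget. The helper names `estimate_tokens` and `chunk_paragraphs` are made up for illustration, and the "4 tokens per 3 words" ratio is only a rough rule of thumb for English text; a real tokenizer would give exact counts.

```python
import math


def estimate_tokens(text: str) -> int:
    """Crude token estimate: assumes roughly 4 tokens per 3 English words."""
    words = len(text.split())
    return math.ceil(words * 4 / 3)


def chunk_paragraphs(paragraphs, max_tokens):
    """Greedily pack consecutive paragraphs into chunks of at most max_tokens.

    A single paragraph larger than max_tokens becomes its own oversized chunk.
    """
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n = estimate_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With a budget of 2048 this would target the "half of Davinci" case, and a budget of 700 to 1000 would target the "combine the top two or three results" case.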

heiko · January 11, 2023, 9:23pm

Thanks for your answer, very helpful. I am playing around with a code of law, since I thought that the paragraphs are by definition semantically autonomous.
So I want to put the embeddings into Pinecone and try to retrieve the relevant paragraphs when I ask a question: embed the question and pull related docs from Pinecone.
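
A rough sketch of that flow (embed each paragraph, embed the question, rank by similarity), assuming a toy stand-in for the embedding model. `toy_embed` is a bag-of-words counter used only so the example runs on its own; in the real setup the vectors would come from an embedding model and the ranking would be done by the vector store.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, paragraphs, k=3):
    """Return the k paragraphs most similar to the question, best first."""
    q = toy_embed(question)
    scored = [(cosine(q, toy_embed(p)), p) for p in paragraphs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Calling `top_k("what is theft", paragraphs, k=1)` would return the single best-scoring paragraph together with its similarity score.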

raymonddavey · January 11, 2023, 9:38pm

When you say “code of law”, do you mean legal laws and procedures?

Do the paragraphs refer to each other at all (e.g. refer to sub clause 2)c) etc.)?

If they do, then it becomes a lot trickier to gather all the context you need. I would go with a smaller token size per embedding and use logic to pull out these references to get the original source text they refer to.

I haven’t done it - but I imagine that’s how you might go about it.
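
Something along these lines might work as a starting point, assuming references look like "section 14" or "sub clause 2(c)". The pattern, the ID format, and the `source_by_id` lookup are all hypothetical; a real legal text would need a pattern fitted to its own citation style.

```python
import re

# Hypothetical citation format: "section 14", "clause 3", "sub clause 2(c)".
REF_PATTERN = re.compile(
    r"\b(?:section|clause|sub clause)\s+(\d+)(?:\(([a-z])\))?",
    re.IGNORECASE,
)


def find_references(text):
    """Return referenced IDs such as '14' or '2(c)', in order of appearance."""
    refs = []
    for match in REF_PATTERN.finditer(text):
        number, letter = match.group(1), match.group(2)
        refs.append(f"{number}({letter.lower()})" if letter else number)
    return refs


def expand_with_references(paragraph, source_by_id):
    """Return the paragraph followed by the text of each referenced clause.

    References that are not present in source_by_id are skipped.
    """
    extra = [source_by_id[r] for r in find_references(paragraph) if r in source_by_id]
    return [paragraph] + extra
```

The expanded list could then be passed along as context instead of the matched paragraph alone.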

heiko · January 11, 2023, 9:41pm

I am trying with quite basic content. The paragraphs define different laws, so each of them describes a separate legal concept or aspect. Where I am testing, there is not much cross-referencing.
And the token size is way lower than what you were talking about.

I'd say between 10 and 70 words or so.
