I’ve been building embedding models for semantic search, and as I continue to build, I’m mindful of optimal data practices. What is the optimal token-size range for text fed into an embedding model?
For example, if I have a 1000-word document, what is the optimal chunk size to split it into, assuming it can be split evenly? Keeping related information together is important, of course, but I’m simply looking for a guideline.
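For concreteness, here is a minimal sketch of the “split evenly” part of the question: divide a document into the fewest near-equal word-count chunks that stay under a target size. The helper name and the target of 400 words are my own choices for illustration, not a recommendation from this thread.

```python
def split_evenly(text: str, target_words: int) -> list[str]:
    """Split text into the smallest number of near-equal chunks,
    none exceeding target_words words."""
    words = text.split()
    if not words:
        return []
    n_chunks = -(-len(words) // target_words)  # ceiling division
    base = len(words) // n_chunks
    extra = len(words) % n_chunks  # first `extra` chunks get one extra word
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks

# Stand-in for a 1000-word document
doc = " ".join(f"w{i}" for i in range(1000))
chunks = split_evenly(doc, 400)
print([len(c.split()) for c in chunks])  # → [334, 333, 333]
```

In practice you would split on paragraph or sentence boundaries rather than raw word counts, so related information stays together, but the even-split arithmetic is the same.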
Hi @pbergin11 and welcome to a very busy OpenAI community.
I don’t know the optimal upper limit, but I do know the lower limit is critical, because short phrases and text snippets produce very poor-quality embedding vectors.
Someone here recently (@raymonddavey, as I recall) confirmed the lower limit should be around 300-500 (words or tokens, I forget which, sorry).
I found it easier to create a test UI and experiment with vector searches to find the optimal text sizes. For example, I have a DB full of completions, and I search it using various methods, including OpenAI embedding vectors, then review the quality of the results.
From a lot of testing, I can state with some authority that short phrases give very poor results for vector-based searches. Traditional DB searches perform much better for short text lengths and keywords, and of course a properly configured full-text DB search is also very good for these short text-length queries.
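The vector-search side of a test rig like that boils down to ranking stored chunk embeddings by cosine similarity against a query embedding. Here is a self-contained toy version; the three-dimensional vectors and chunk names are made up for illustration (real embeddings come from an embeddings endpoint and have far more dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical store of pre-computed chunk embeddings
store = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.2, 0.8, 0.1],
    "chunk_c": [0.1, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of the search query

# Rank chunks by similarity to the query, best first
ranked = sorted(store, key=lambda k: cosine(query, store[k]), reverse=True)
print(ranked[0])  # → chunk_a
```

With a rig like this you can embed the same source text at several chunk sizes and compare which size ranks the relevant chunks highest, which is essentially the experiment described above.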