I’ve been building embedding models for semantic search, and as I continue to build, I’m mindful of optimal data practices. What is the optimal token-size range for text fed into an embedding model?
For example, if I have a 1000-word document, what is the optimal chunk size to split it into, assuming it can be split evenly? Keeping related information together is important, of course, but I’m simply looking for a guideline.
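For concreteness, here is a minimal sketch of the “split evenly” part of the question: divide a document into the fewest near-equal word-count chunks that stay under a target size. The helper name and the target of 400 words are my own choices for illustration, not a recommendation from this thread.

```python
def split_evenly(text: str, target_words: int) -> list[str]:
    """Split text into the smallest number of near-equal chunks,
    none exceeding target_words words."""
    words = text.split()
    if not words:
        return []
    n_chunks = -(-len(words) // target_words)  # ceiling division
    base = len(words) // n_chunks
    extra = len(words) % n_chunks  # first `extra` chunks get one extra word
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks

# Stand-in for a 1000-word document
doc = " ".join(f"w{i}" for i in range(1000))
chunks = split_evenly(doc, 400)
print([len(c.split()) for c in chunks])  # → [334, 333, 333]
```

In practice you would split on paragraph or sentence boundaries rather than raw word counts, so related information stays together, but the even-split arithmetic is the same.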
Hi @pbergin11 and welcome to a very busy OpenAI community.
I don’t know the optimal upper limit, but I do know the lower limit is critical, because short phrases and text snippets produce very poor-quality embedding vectors.
Someone here recently (@raymonddavey, as I recall) confirmed the lower limit should be around 300-500 (words or tokens, I forget which, sorry).
I found it easier to create a test UI and experiment with vector searches to find the optimal text sizes. For example, I have a DB full of completions, and I search it using various methods, including OpenAI embedding vectors, then review the quality of the results.
From a lot of testing, I can state with some authority that short phrases give very poor results for vector-based searches. Traditional DB searches perform much better for short text lengths and keywords, and of course a properly configured full-text DB search is also very good for these short text-length queries.
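The vector-search side of a test rig like that boils down to ranking stored chunk embeddings by cosine similarity against a query embedding. Here is a self-contained toy version; the three-dimensional vectors and chunk names are made up for illustration (real embeddings come from an embeddings endpoint and have far more dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical store of pre-computed chunk embeddings
store = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.2, 0.8, 0.1],
    "chunk_c": [0.1, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of the search query

# Rank chunks by similarity to the query, best first
ranked = sorted(store, key=lambda k: cosine(query, store[k]), reverse=True)
print(ranked[0])  # → chunk_a
```

With a rig like this you can embed the same source text at several chunk sizes and compare which size ranks the relevant chunks highest, which is essentially the experiment described above.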