Token Limit For Embeddings vs. text-davinci-003

@SomebodySysop from my linguistics background I can say that a chunk of almost 4k tokens will contain several ideas, because a complete idea usually fits in one paragraph, or at most three, which works out to roughly 500-600 tokens per idea at most. The goal of vector search is to match one idea (the query) to another idea (the source chunk) as closely as possible. Embedding more than one idea per chunk dilutes the precision of the vector search (the concept match) and makes a near-perfect match almost unachievable.
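
To make that intuition concrete, here is a minimal sketch of how the query-to-chunk match is scored. It assumes numpy and a hypothetical `embed()` helper (not any specific API); the commented usage lines are illustrative only:

```python
# Minimal sketch: scoring a query against a chunk by cosine similarity.
# embed() is a hypothetical function returning one embedding vector per text.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# query_vec = embed("When was the contract signed?")      # one idea
# chunk_vec = embed(paragraph_about_the_signing)          # one idea
# doc_vec   = embed(whole_4k_token_document)              # many ideas blended
# In line with the argument above, cosine_similarity(query_vec, chunk_vec)
# will typically beat cosine_similarity(query_vec, doc_vec), because the
# long document's vector averages several unrelated concepts.
```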

Embedding chunks of text that big makes sense when you need vectors for clustering or classifying entire documents, or for subject-level search. But when you need to search for facts inside the documents, you need precision, and in that case it doesn't make sense to me to vectorize texts longer than one "idea" (1-3 paragraphs, or roughly 200-600 tokens).
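
If it helps, here is a rough sketch of what I mean by chunking per "idea". It assumes paragraphs separated by blank lines and uses tiktoken for token counting; the 600-token budget is just illustrative, not a fixed recipe:

```python
# Sketch: group consecutive paragraphs into ~one-idea chunks under a token budget.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 600  # roughly one "idea": 1-3 paragraphs

def chunk_by_idea(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split text on blank lines, then pack paragraphs until the budget is hit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n = len(ENC.encode(para))
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The point of packing whole paragraphs rather than cutting at an arbitrary token count is that a paragraph boundary is usually also an idea boundary, so each chunk you embed stays close to a single concept.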

I would revisit your approach and check whether that is the underlying issue…
