Embedding - text length vs accuracy?

AgusPG · March 29, 2023, 11:44am

Thanks for the feedback @alden. Those are all amazing questions:

The classification is done with a fine-tuned Ada model to keep cost and latency under control. I trained the classifier on around 2k samples of generic and specific questions, taken from the data my customers submitted to my app so that it’s tailored to my domain. Basically, I had text-davinci-003 label these questions to generate the training data, and then used that data to fine-tune the classifier. If you don’t have enough training data, you can also generate a synthetic dataset using a high-quality model.
Yeah, vector costs are not an issue. Storing vectors is usually very cheap, so I can double the number of docs and still not run into trouble.
About the pre-processing: that’s an extremely interesting question. In my experience, proper preprocessing improves semantic search results dramatically. There are tons of suitable strategies here. For me, augmenting each chunk with off-chunk context works really well, for instance by including metadata about the chunk (title of the document the chunk comes from, author, keywords extracted via NER, a short chunk/document summary, etc.). I explained this idea here: The length of the embedding contents - #7 by AgusPG
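
A minimal sketch of the metadata idea above, assuming the metadata has already been extracted elsewhere. The `ChunkMetadata` class, the `augment_chunk` function, and the "Title:/Author:/Keywords:/Summary:" header layout are all hypothetical names chosen for illustration, not something taken from the linked post. The returned string is what you would send to the embedding model instead of the raw chunk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChunkMetadata:
    """Off-chunk info to prepend before embedding (hypothetical structure)."""
    title: str = ""
    author: str = ""
    keywords: List[str] = field(default_factory=list)  # e.g. entities from NER
    summary: str = ""


def augment_chunk(chunk_text: str, meta: ChunkMetadata) -> str:
    """Build the text to embed: a metadata header followed by the chunk itself."""
    lines = []
    if meta.title:
        lines.append(f"Title: {meta.title}")
    if meta.author:
        lines.append(f"Author: {meta.author}")
    if meta.keywords:
        lines.append("Keywords: " + ", ".join(meta.keywords))
    if meta.summary:
        lines.append(f"Summary: {meta.summary}")
    # Empty fields are skipped so the header never adds noise to the embedding.
    lines.append(chunk_text.strip())
    return "\n".join(lines)


if __name__ == "__main__":
    meta = ChunkMetadata(title="Refund policy", keywords=["refund", "billing"])
    print(augment_chunk("Refunds are issued within 14 days.", meta))
```

It is probably worth storing the raw chunk alongside the vector, so that the text you later put into a prompt does not repeat the metadata header.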

Hope it helps
