Embedding - text length vs accuracy?

Thanks for the feedback @alden. Those are all amazing questions:

  • The classification is done with a fine-tuned Ada model, to keep costs and latencies under control. I trained this classifier on around 2k samples of generic and specific questions, drawn from the questions my customers submitted to my app, to ensure it's tailored to my domain. To generate the training labels, I had text-davinci-003 classify these questions for me, and then fine-tuned the Ada classifier on that data (see the first sketch after this list). If you don't have enough real training data, you can also generate a synthetic dataset using a high-quality model.

  • Yeah, vector costs are not an issue. Storing vectors is usually very cheap, so I could double the number of docs without running into trouble.

  • About the pre-processing: that's an extremely interesting question. In my experience, proper preprocessing improves semantic-search results dramatically. There are tons of suitable strategies here. For me, augmenting each chunk with off-chunk info works really well: for instance, including metadata about the chunk (title of the document the chunk comes from, author, keywords extracted via NER, a short chunk/document summary, etc.). See the second sketch after this list. I explained this idea here: The length of the embedding contents - #7 by AgusPG
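
Here's a minimal sketch of that labelling step, in case it's useful. I'm using the current `openai` Python client for illustration (the original workflow used the legacy text-davinci-003 completions endpoint), and the prompt, model name, label names, and file path are all made up for the example:

```python
# Sketch: use a strong model to label questions as "generic" vs "specific",
# producing JSONL training data for fine-tuning a small classifier.
# Assumes the openai Python package (>= 1.0); prompt and paths are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_question(question: str) -> str:
    """Ask a high-quality model for a one-word label."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; the original post used text-davinci-003
        messages=[
            {"role": "system",
             "content": "Classify the user's question as 'generic' or 'specific'. "
                        "Answer with exactly one of those two words."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

questions = [
    "What is a vector database?",
    "Why does my invoice from March show two charges?",
]

# Write prompt/completion pairs in JSONL, ready for a fine-tuning job.
with open("classifier_train.jsonl", "w") as f:
    for q in questions:
        f.write(json.dumps({"prompt": q, "completion": label_question(q)}) + "\n")
```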
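
And a sketch of the chunk-augmentation idea. The `Chunk` fields, the formatting, and the embedding model are my own assumptions for the example, not the exact pipeline from the post:

```python
# Sketch: prepend off-chunk metadata (document title, author, keywords,
# a short summary) to each chunk before embedding, so the vector carries
# document-level context in addition to the chunk text itself.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class Chunk:
    text: str
    doc_title: str
    author: str
    keywords: list[str]   # e.g. extracted via NER
    doc_summary: str      # short summary of the chunk's source document

def augment(chunk: Chunk) -> str:
    """Build the string that actually gets embedded."""
    return (
        f"Document: {chunk.doc_title}\n"
        f"Author: {chunk.author}\n"
        f"Keywords: {', '.join(chunk.keywords)}\n"
        f"Summary: {chunk.doc_summary}\n"
        f"---\n{chunk.text}"
    )

def embed(chunk: Chunk) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # any embedding model works here
        input=augment(chunk),
    )
    return resp.data[0].embedding
```

The payoff is that a query like "what did AgusPG say about chunk length" can match on the metadata header even when the chunk body never mentions those terms.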

Hope it helps :slight_smile:
