Thanks for the feedback @alden. Those are all amazing questions:
- The classification is done with a fine-tuned Ada model, to keep costs and latency under control. I trained this classifier on around 2k samples of generic and specific questions. To build the dataset, I used the questions that my customers submitted to my app, so that it's tailored to my domain. Basically, I got text-davinci-003 to classify these questions for me, which gave me the training data, and then I used that data to fine-tune the classifier. If you don't have enough training data, you can also generate a synthetic dataset using a high-quality model.
- Yeah, vector costs are not an issue. Storing vectors is usually very cheap, so I can double the number of docs without running into trouble.
- About the pre-processing: that's an extremely interesting question. In my experience, proper preprocessing improves semantic search results dramatically. There are tons of suitable strategies here. For me, augmenting the context of each chunk with off-chunk info works really well, for instance by including metadata about the chunk (title of the document the chunk comes from, author, keywords extracted via NER, a short chunk/document summary, etc.). I explained this idea here: The length of the embedding contents - #7 by AgusPG
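To make the labeling step from the first point a bit more concrete, here is a rough sketch of turning raw user questions into fine-tuning records. It is only an illustration under assumptions, not the exact pipeline from the app: `label_fn` is a hypothetical stand-in for whatever call asks the stronger model (e.g. text-davinci-003) to answer "generic" or "specific", and the prompt/completion JSONL layout with a `\n\n###\n\n` separator follows the convention commonly used for legacy fine-tuning of base models, so adjust it to whatever your fine-tuning endpoint expects.

```python
import json
from typing import Callable, Dict, Iterable, List

# The two classes mentioned in the post.
LABELS = ("generic", "specific")

# Assumed separator marking the end of the prompt (a common legacy convention).
PROMPT_SEPARATOR = "\n\n###\n\n"


def build_finetune_records(
    questions: Iterable[str],
    label_fn: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Label each question with `label_fn` (hypothetical wrapper around a
    strong model) and return prompt/completion records for fine-tuning."""
    records = []
    for question in questions:
        label = label_fn(question).strip().lower()
        if label not in LABELS:
            # Skip answers that are not one of the two expected classes.
            continue
        records.append(
            {
                "prompt": question.strip() + PROMPT_SEPARATOR,
                # Leading space in the completion: assumed tokenization-friendly format.
                "completion": " " + label,
            }
        )
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(record) for record in records)
```

In practice you would pass the questions collected from your app plus a `label_fn` that calls the high-quality model, write the result of `to_jsonl` to a file, and spot-check a sample of the labels by hand before fine-tuning.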
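The chunk augmentation idea from the last point can be sketched like this. It is a minimal example under assumptions: the `Chunk` fields and the `augment_chunk` helper are hypothetical names, the header layout is just one possible choice, and the keywords and summary are assumed to have been produced earlier by some NER or summarization step that is not shown here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """A text chunk plus optional off-chunk metadata (all fields hypothetical)."""
    text: str
    title: str = ""
    author: str = ""
    keywords: List[str] = field(default_factory=list)
    summary: str = ""


def augment_chunk(chunk: Chunk) -> str:
    """Prepend the available metadata to the chunk text, so the string that
    gets embedded carries context from outside the chunk itself."""
    header = []
    if chunk.title:
        header.append(f"Title: {chunk.title}")
    if chunk.author:
        header.append(f"Author: {chunk.author}")
    if chunk.keywords:
        header.append("Keywords: " + ", ".join(chunk.keywords))
    if chunk.summary:
        header.append(f"Summary: {chunk.summary}")
    if not header:
        # No metadata available: embed the raw chunk text unchanged.
        return chunk.text
    return "\n".join(header) + "\n\n" + chunk.text
```

You would then embed the output of `augment_chunk(chunk)` instead of `chunk.text`, while still storing the original text separately so that only the raw chunk is shown to the user or passed to the completion prompt.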
Hope it helps