I currently have a setup using the Ada-002 text embeddings and then querying from there with GPT-3.5; it searches over a BUNCH of PDFs containing product specifications.
I’ve noticed the model makes some simple, understandable mistakes when finding the right information. I think this comes from the text formatting itself; the model doesn’t quite understand the content in the way it’s visually presented.
How could I fine-tune/customize/improve the model to be more accurate and better understand the data set?
I am aware that one cannot fine-tune the embedding models; are there ways to improve the quality/accuracy of the results?
Right now that’s not possible. Do you think the problem is related to formatting, or to words/concepts that the models don’t have enough knowledge about?
Q&A retrieval performance may also be improved with techniques like HyDE, in which questions are first transformed into hypothetical answers before being embedded. Similarly, GPT can also potentially improve search results by automatically transforming questions into sets of keywords or search terms.
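To make the HyDE idea concrete, here is a minimal sketch assuming the OpenAI Python client (v1 style); the prompt wording, model names, and the `hyde_embedding` helper are illustrative choices, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()

def hyde_embedding(question: str) -> list[float]:
    # 1. Ask the chat model to write a plausible (hypothetical) answer.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Write a short passage that answers the user's question "
                        "as if it came from a product specification sheet."},
            {"role": "user", "content": question},
        ],
    )
    hypothetical_answer = completion.choices[0].message.content

    # 2. Embed the hypothetical answer instead of the raw question, then use
    #    that vector for the nearest-neighbour search over your PDF chunks.
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=hypothetical_answer,
    )
    return response.data[0].embedding
```

The point is that the hypothetical answer usually looks more like the spec-sheet passages you indexed than the question does, so the embedding lands closer to the right chunks.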
When you send your content to the embeddings endpoint, you send it as a single raw string, so you would need to preprocess your PDF files first. Unless I’m mistaken, you can’t send files, and the endpoint doesn’t assume anything about the content it receives.
Are you using some sort of intermediate service to process it for you?
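For reference, a minimal sketch of that preprocessing step, assuming the pypdf and openai packages; the fixed-size character chunking and the `embed_pdf` helper are just placeholders, not a recommended pipeline:

```python
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

def embed_pdf(path: str, chunk_size: int = 1500) -> list[tuple[str, list[float]]]:
    # Extract plain text page by page. Tables and visual layout get flattened
    # here, which is often exactly where spec sheets lose their meaning.
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Naive fixed-size chunking; overlap or layout-aware splitting usually helps.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    embedded = []
    for chunk in chunks:
        response = client.embeddings.create(
            model="text-embedding-ada-002",
            input=chunk,
        )
        embedded.append((chunk, response.data[0].embedding))
    return embedded
```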