Making embeddings more accurate?

I currently have a setup that embeds documents with the Ada-002 text embeddings and then answers queries using GPT-3.5. It searches over a large number of PDFs containing product specifications.

I’ve noticed the model makes some simple and understandable mistakes when finding the right information. I think this is a result of the text formatting itself: the model doesn’t quite understand the text the way it’s visually presented.

How could I fine-tune, customize, or otherwise improve the model so that it is more accurate and better understands the data set?

I am aware that one cannot fine-tune the embedding models. Are there ways to improve the quality/accuracy of the results?


Visually represented? How did you embed your PDFs?

I meant the difference between what a PDF looks like visually versus how the model can read it.

In particular, the organization of the rows. See below.

I currently have it held in a blob and I am reading it via ML client.

Right now it’s not possible. Do you think the problem is related to formatting, or to words/concepts that the models do not know enough about?

I am aware that one cannot fine-tune the embedding models. Are there ways to improve the quality/accuracy of the results?

Q&A retrieval performance may also be improved with techniques like HyDE, in which questions are first transformed into hypothetical answers before being embedded. Similarly, GPT can also potentially improve search results by automatically transforming questions into sets of keywords or search terms.
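To make the HyDE idea concrete, here is a minimal sketch. The search function takes two callables, `generate` and `embed`, which in a real setup would wrap your chat model and your embedding endpoint. The `toy_generate` and `toy_embed` stubs, the vocabulary, and the sample chunks are all made up for illustration, so that the example runs without any API calls.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    question: str,
    chunks: List[str],
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> List[str]:
    """Rank chunks against a hypothetical answer instead of the raw question."""
    hypothetical = generate(question)   # step 1: question -> hypothetical answer
    query_vec = embed(hypothetical)     # step 2: embed the answer, not the question
    scored = [(cosine(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


# --- Hypothetical stand-ins so the sketch runs offline ---
VOCAB = ["voltage", "rated", "weight", "kg", "v", "warranty"]


def toy_embed(text: str) -> Vector:
    """Bag-of-words counts over a tiny fixed vocabulary (not a real embedding)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]


def toy_generate(question: str) -> str:
    """Canned hypothetical answer; a real version would call a chat model."""
    return "The rated voltage is 230 V."


chunks = ["Weight: 12 kg", "Rated voltage: 230 V", "Warranty: 2 years"]
best = hyde_search("How many volts does it need?", chunks, toy_generate, toy_embed, top_k=1)
print(best)
```

The question itself shares no vocabulary with the spec row, but the hypothetical answer does, and that is the gap HyDE tries to close. Whether it helps on your data is something you would have to measure.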

I’m still not sure what you mean.

When you send your content to the embedding endpoint you send it as a single raw string. You would need to preprocess your PDF file first. Unless I’m mistaken you can’t send files and the endpoint does not assume anything of the content it receives.
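As a rough illustration of what that preprocessing could look like for spec tables: once some tool has extracted the rows, you can turn each row into one self-describing string before embedding it, so a value never gets separated from its column name. This is only a sketch. I haven’t seen your PDFs, so the column names, the product name, and the `rows_to_chunks` helper are all hypothetical.

```python
from typing import Iterable, List, Sequence


def rows_to_chunks(
    product: str,
    header: Sequence[str],
    rows: Iterable[Sequence[str]],
) -> List[str]:
    """Turn each table row into one string that carries its own column labels."""
    chunks = []
    for row in rows:
        # Pair every cell with its column name and skip empty cells.
        pairs = [f"{name}: {value}" for name, value in zip(header, row) if value.strip()]
        if pairs:  # drop rows that were entirely empty
            chunks.append(f"{product} | " + "; ".join(pairs))
    return chunks


# Hypothetical example data standing in for rows extracted from a PDF table.
header = ["Parameter", "Value", "Unit"]
rows = [
    ["Rated voltage", "230", "V"],
    ["Weight", "12", "kg"],
    ["", "", ""],
]

for chunk in rows_to_chunks("Pump X100", header, rows):
    print(chunk)
```

Each resulting string is what you would send to the embedding endpoint, instead of the raw text in whatever order the PDF extractor emitted it.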

Are you using some sort of intermediate service to process it for you?

Yes, I’m processing it via Azure’s urifolder to faiss processor.

It takes a folder via URI, processes it, and then uses Ada.

pipeline_job = urifolder_to_faiss(parameters etc)

Would you recommend a different text processor?

Ah. This would be an issue with your processor and not the embedding model. I haven’t used it before & can’t help you any further, sorry.

You may want to see how it’s processing your PDFs.

As you mentioned, it’s difficult to respect the visual boundaries of a PDF. Some bake in text, some don’t.

I would recommend an OCR service like Google Document AI. Their models are pretty good and can be further trained.