Making embeddings more accurate?

BiglyBrainBear · August 14, 2023, 4:41pm

I currently have a model that uses the Ada-002 text embeddings and then queries the results with GPT-3.5. The model searches over a large number of PDFs containing product specifications.

I’ve noticed the model makes some simple and understandable mistakes when finding the right information. I think this is a result of the text formatting itself: the model doesn’t quite understand the text the way it is visually presented.

How could I fine-tune, customize, or improve the model so that it is more accurate and better understands the data set?

I am aware that one cannot fine-tune the embedding models. Are there ways to improve the quality/accuracy of the results?

Thanks!

anon10827405 · August 14, 2023, 5:02pm

Visually represented? How did you embed your PDFs?

BiglyBrainBear · August 14, 2023, 5:28pm

I meant the difference between what a PDF looks like visually versus how the model can read it.

In particular the organization of the rows. See below

I currently have it held in a blob and I am reading it via ML client.

kevin6 · August 14, 2023, 6:05pm

Right now it’s not possible. Do you think the problem is related to formatting, or to words/concepts that the models do not know enough about?

I am aware that one cannot fine-tune the embedding models. Are there ways to improve the quality/accuracy of the results?

Q&A retrieval performance may also be improved with techniques like HyDE, in which questions are first transformed into hypothetical answers before being embedded. Similarly, GPT can also potentially improve search results by automatically transforming questions into sets of keywords or search terms.
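A minimal sketch of the HyDE idea described above, assuming the text generator and the embedder are passed in as plain functions. In practice `generate` would presumably wrap a chat model such as GPT-3.5 and `embed` would wrap the Ada-002 embedding call; here both are toy stand-ins (hypothetical names, hypothetical data) so the example runs without any API access.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    question: str,
    chunks: List[str],
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> List[str]:
    """Embed a hypothetical answer instead of the raw question, then rank chunks."""
    hypothetical_answer = generate(question)
    query_vector = embed(hypothetical_answer)
    scored = [(cosine(query_vector, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


# --- Toy stand-ins, for illustration only (not real model calls) ---
VOCAB = ["voltage", "rated", "weight", "kg", "v", "input"]


def toy_embed(text: str) -> List[float]:
    """Bag-of-words counts over a tiny fixed vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [float(tokens.count(word)) for word in VOCAB]


def toy_generate(question: str) -> str:
    """Pretend an LLM drafted a plausible answer to the question."""
    return "The rated input voltage is 230 V."


chunks = ["Rated input voltage: 230 V", "Net weight: 4.2 kg"]
best = hyde_search("What power supply does it need?", chunks, toy_generate, toy_embed, top_k=1)
print(best)
```

The question itself shares no vocabulary with the matching chunk, but the drafted answer does, which is the effect HyDE relies on. The hypothetical answer does not need to be correct; it only needs to look like the kind of text stored in the index.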

anon10827405 · August 14, 2023, 6:11pm

I’m still not sure what you mean.

When you send your content to the embedding endpoint, you send it as a single raw string, so you would need to preprocess your PDF file first. Unless I’m mistaken, you can’t send files, and the endpoint does not assume anything about the content it receives.
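As a rough illustration of that preprocessing step, the sketch below turns table rows into self-describing strings before they are embedded, so each value stays attached to its column name. It assumes the table has already been extracted from the PDF into a header and a list of rows by some other tool; the function name `rows_to_records` and the sample data are hypothetical.

```python
from typing import List


def rows_to_records(header: List[str], rows: List[List[str]], title: str = "") -> List[str]:
    """Render each table row as 'column: value' pairs, skipping empty cells."""
    records = []
    prefix = f"{title} | " if title else ""
    for row in rows:
        pairs = [
            f"{name}: {value.strip()}"
            for name, value in zip(header, row)
            if value.strip()
        ]
        records.append(prefix + "; ".join(pairs))
    return records


# Hypothetical product-spec table, for illustration only.
header = ["Model", "Voltage", "Weight"]
rows = [
    ["X100", "230 V", "4.2 kg"],
    ["X200", "110 V", ""],
]

records = rows_to_records(header, rows, title="Pump specifications")
for record in records:
    print(record)
```

Each resulting string can then be sent to the embedding endpoint on its own, which avoids relying on the visual row and column layout that is lost when a PDF is flattened to text.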

Are you using some sort of intermediate service to process it for you?

BiglyBrainBear · August 14, 2023, 6:32pm

Yes, I’m processing it via Azure’s urifolder to faiss processor.

It takes a folder via URI, processes it, and then uses Ada.

pipeline_job = urifolder_to_faiss(parameters etc)

Would you recommend a different text processor?

anon10827405 · August 14, 2023, 6:53pm

Ah. This would be an issue with your processor and not the embedding model. I haven’t used it before & can’t help you any further, sorry.

You may want to see how it’s processing your PDFs.

As you mentioned, it’s difficult to respect the visual boundaries of a PDF. Some bake in text, some don’t.

I would recommend an OCR service like Google Document AI. Their models are pretty good and can be further trained.
