Converting PDF File Text into Embeddings

Hi! I have a bunch of pdf files and I am trying to create embeddings from it to allow users to search for things from these files. I have taken a look at the API and found two different cases: api-reference/embeddings/create and examples/get_embeddings_from_dataset (can’t include links for some reason). I am not sure if I should use the first one or the second one. If I use the second one, I’d have to turn the content into a dataset and I’m not sure if that’s a good approach. Any ideas or suggestions as to how I can accomplish this? I also want to store these files in a central location so whenever a user has a question, they can ask it and the app should search through all the files to find the relevant information. Thank you!


This should give you a compass direction. It is not intended as an answer, but as a place to start from. There are also other topics there that make use of PDFs as a data source.

Note: I have not tried what I suggested, but it is where I would start.

HTH

You might also find these of use.

Whether your search function can give good results will depend on the quality of the data you send for embedding. In your case, that depends on your PDF extractor. You can send the raw extracted text straight to the embeddings API, store the result in any database, and fetch it whenever it is needed later.
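To make that concrete, here is a minimal sketch of "send text to the embeddings API, store the result in a database." It uses only the standard library against the OpenAI REST embeddings endpoint, with SQLite as the example store; the model name `text-embedding-ada-002` and the table layout are illustrative assumptions, not part of the original post.

```python
# Sketch: embed extracted PDF text via the OpenAI REST endpoint and
# persist the vectors in SQLite. Model name and schema are assumptions.
import json
import sqlite3
import urllib.request

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"

def build_request(texts, model="text-embedding-ada-002"):
    """Build the JSON payload for the embeddings endpoint."""
    return {"model": model, "input": texts}

def fetch_embeddings(texts, api_key):
    """POST extracted text to the embeddings API; return one vector per text."""
    req = urllib.request.Request(
        EMBEDDINGS_URL,
        data=json.dumps(build_request(texts)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]

def store(db_path, chunks, vectors):
    """Persist chunk text alongside its vector (as JSON) for later retrieval."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS chunks (text TEXT, vec TEXT)")
    con.executemany(
        "INSERT INTO chunks VALUES (?, ?)",
        [(c, json.dumps(v)) for c, v in zip(chunks, vectors)],
    )
    con.commit()
    con.close()
```

In a real app you would call `fetch_embeddings` once per batch of chunks at indexing time, then query the stored vectors at search time.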

I do not know exactly what you are looking at, but reading what you want to do, it is close to what I do now, so I can share my process. I'm not saying this is the only way or the best way, just what I've been doing for the past several months:

  1. Organize your PDFs.
  2. Extract text from the PDF files. I use PDF Software for Windows | FineReader PDF, but any PDF-to-text extractor will do.
  3. Chunk your texts. I use my own process of semantic chunking, https://www.youtube.com/watch?v=w_veb816Asg, but the basic LangChain method is to chunk by size. Here is some conversation on that: The length of the embedding contents - #21 by klcogluberk
  4. Embed your content. Here you can vectorize the chunks yourself using OpenAI’s embedding model. I use Weaviate’s text2vec-openai module, which has been working well for me. I believe Pinecone is regarded as the gold standard in this field.
  5. Use cosine similarity (or a similar method) to search your embeddings. Again, I use Weaviate’s query system since I am using their vector store, but if you keep the vectors in your own database, you can run the cosine-similarity searches locally.
  6. Link search results back to the original PDFs. This is optional, but it’s what I do. Remember that you exported your PDFs to text files, chunked them, then embedded them? I also upload the original PDFs to my website, where users run queries (cosine-similarity searches against the vector store). The links that come back from those searches don’t point to the text files, but back to the original PDFs.
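Steps 3 and 5 above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the poster's actual code: `chunk_by_size` is the basic size-based split mentioned in step 3 (not the semantic chunking from the video), and `search` is a local cosine-similarity scan over records that also carry a source-PDF field, so results can link back to the originals as in step 6. The chunk size, overlap, and record layout are all assumptions.

```python
# Sketch of size-based chunking (step 3) and a local cosine-similarity
# search (steps 5-6). Vectors would come from the embeddings API.
import math

def chunk_by_size(text, size=1000, overlap=100):
    """Naive fixed-size chunking with overlap (the basic LangChain-style split)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

def search(query_vec, records, top_k=3):
    """records: list of (chunk_text, vector, source_pdf) tuples.

    Returns the top_k best-scoring chunks; each hit keeps its source_pdf
    so the UI can link back to the original document.
    """
    scored = [
        (cosine_similarity(query_vec, vec), text, source_pdf)
        for text, vec, source_pdf in records
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

If you use a hosted vector store like Weaviate or Pinecone instead, their query API replaces `search`, but the idea is the same: nearest-neighbor lookup by cosine similarity, with metadata linking each chunk back to its PDF.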

Here is an overview of the basic process that I like to recommend (because it comes with a handy flowchart!): https://www.youtube.com/watch?v=Ix9WIZpArm0&ab_channel=Chatwithdata

Good luck!
