Converting PDF File Text into Embeddings

Hi! I have a bunch of pdf files and I am trying to create embeddings from it to allow users to search for things from these files. I have taken a look at the API and found two different cases: api-reference/embeddings/create and examples/get_embeddings_from_dataset (can’t include links for some reason). I am not sure if I should use the first one or the second one. If I use the second one, I’d have to turn the content into a dataset and I’m not sure if that’s a good approach. Any ideas or suggestions as to how I can accomplish this? I also want to store these files in a central location so whenever a user has a question, they can ask it and the app should search through all the files to find the relevant information. Thank you!


This should give you a compass direction. It is not intended as an answer, but as a place to start from. There are also other topics there that make use of PDFs as a data source.

Note: I have not tried what I suggested, but it is where I would start.

HTH

You might also find these of use.

Whether your search function can give good results will depend on the quality of the data you send for embedding. In your case, that depends on your PDF extractor. You can send the raw extracted text straight to the embeddings API, store the result in any database, and fetch it whenever it is needed later.
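To make that concrete, here is a minimal sketch of "send text to the embeddings API, store the result in a database." It uses only the standard library against the OpenAI REST embeddings endpoint, with SQLite as the example store; the model name `text-embedding-ada-002` and the table layout are illustrative assumptions, not part of the original post.

```python
# Sketch: embed extracted PDF text via the OpenAI REST endpoint and
# persist the vectors in SQLite. Model name and schema are assumptions.
import json
import sqlite3
import urllib.request

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"

def build_request(texts, model="text-embedding-ada-002"):
    """Build the JSON payload for the embeddings endpoint."""
    return {"model": model, "input": texts}

def fetch_embeddings(texts, api_key):
    """POST extracted text to the embeddings API; return one vector per text."""
    req = urllib.request.Request(
        EMBEDDINGS_URL,
        data=json.dumps(build_request(texts)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]

def store(db_path, chunks, vectors):
    """Persist chunk text alongside its vector (as JSON) for later retrieval."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS chunks (text TEXT, vec TEXT)")
    con.executemany(
        "INSERT INTO chunks VALUES (?, ?)",
        [(c, json.dumps(v)) for c, v in zip(chunks, vectors)],
    )
    con.commit()
    con.close()
```

In a real app you would call `fetch_embeddings` once per batch of chunks at indexing time, then query the stored vectors at search time.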

I do not know exactly what you are looking at, but reading what you want to do, it is close to what I do now, so I can share my process. I'm not saying this is the only way or the best way, just what I've been doing for the past several months:

  1. Organize your PDFs.
  2. Extract text from the PDF files. I use PDF Software for Windows | FineReader PDF, but any PDF-to-text extractor will do.
  3. Chunk your texts. I use my own process of semantic chunking, https://www.youtube.com/watch?v=w_veb816Asg, but the basic LangChain method is to chunk by size. Here is some conversation on that: The length of the embedding contents - #21 by klcogluberk
  4. Embed your content. Here you can vectorize the chunks yourself using OpenAI’s embedding model. I use Weaviate’s text2vec-openai module, which has been working well for me. I believe Pinecone is regarded as the gold standard in this field.
  5. Use cosine similarity (or a similar method) to search your embeddings. Again, I use Weaviate’s query system since I am using their vector store, but if you keep the vectors in your own database, you can run the cosine-similarity searches locally.
  6. Link search results back to the original PDFs. This is optional, but it’s what I do. Remember that you exported your PDFs to text files, chunked them, then embedded them? I also upload the original PDFs to my website, where users run queries (cosine-similarity searches against the vector store). The links that come back from those searches don’t point to the text files, but back to the original PDFs.
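Steps 3 and 5 above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the poster's actual code: `chunk_by_size` is the basic size-based split mentioned in step 3 (not the semantic chunking from the video), and `search` is a local cosine-similarity scan over records that also carry a source-PDF field, so results can link back to the originals as in step 6. The chunk size, overlap, and record layout are all assumptions.

```python
# Sketch of size-based chunking (step 3) and a local cosine-similarity
# search (steps 5-6). Vectors would come from the embeddings API.
import math

def chunk_by_size(text, size=1000, overlap=100):
    """Naive fixed-size chunking with overlap (the basic LangChain-style split)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

def search(query_vec, records, top_k=3):
    """records: list of (chunk_text, vector, source_pdf) tuples.

    Returns the top_k best-scoring chunks; each hit keeps its source_pdf
    so the UI can link back to the original document.
    """
    scored = [
        (cosine_similarity(query_vec, vec), text, source_pdf)
        for text, vec, source_pdf in records
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

If you use a hosted vector store like Weaviate or Pinecone instead, their query API replaces `search`, but the idea is the same: nearest-neighbor lookup by cosine similarity, with metadata linking each chunk back to its PDF.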

Here is an overview of the basic process that I like to recommend (because it comes with a handy flowchart!): https://www.youtube.com/watch?v=Ix9WIZpArm0&ab_channel=Chatwithdata

Good luck!
