Converting PDF File Text into Embeddings

I don't know exactly what you're working with, but reading what you want to do, this is the process I use now, so I can share it. I'm not saying this is the only way or the best way, just what I've been doing for the past several months:

  1. Organize your PDFs.
  2. Extract text from the PDF files. I use FineReader PDF, but any PDF-to-text extractor will do.
  3. Chunk your texts. I use my own semantic chunking process, https://www.youtube.com/watch?v=w_veb816Asg, but the basic LangChain method is to chunk by size. There is some discussion of that here: The length of the embedding contents - #21 by klcogluberk
  4. Embed your content. Here, you can vectorize it yourself using OpenAI’s embedding model. I use Weaviate’s text2vec-openai transformer, which has been working well for me. I believe Pinecone is regarded as the gold standard in this field.
  5. Use cosine similarity (or a similar method) to search your embeddings. Again, I use Weaviate’s query system since I am using their vector store, but if you vectorize your content into your own database, you can run the cosine similarity searches locally.
  6. Link search results back to the original PDFs. This is optional, but it’s what I do. Remember that you exported your PDFs to text files, then chunked them, then embedded them? Well, I also upload the original PDFs to my website, where users query (run cosine similarity searches against) the vector store. The links that come back from these searches don’t go to the text files, but back to the original PDFs.
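To make step 3 concrete, here is a minimal sketch of the basic size-based chunking that LangChain popularized (split on a fixed character count with a small overlap). The function name and parameters are illustrative, not LangChain's actual API:

```python
def chunk_by_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with a small overlap, so a sentence
    cut at a boundary still appears (at least partially) in both chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


# Example: a 1000-character text yields overlapping ~500-character chunks.
chunks = chunk_by_size("x" * 1000, chunk_size=500, overlap=50)
```

In practice you would tune `chunk_size` to your embedding model's token limits; semantic chunking (splitting at topic boundaries instead of a byte count) tends to give cleaner retrieval, which is why I prefer it.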
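And for step 6, the trick is simply to store source metadata alongside each chunk when you index it, so a search hit can link back to the original PDF rather than the extracted text file. A hypothetical sketch (field names and the URL scheme are my own, not from any particular vector store):

```python
def make_records(pdf_name: str, chunks: list[str], base_url: str) -> list[dict]:
    """Wrap each chunk with metadata pointing back at its source PDF,
    ready to be inserted into a vector store alongside its embedding."""
    return [
        {
            "text": chunk,          # the chunk itself, to be embedded
            "source_pdf": pdf_name, # original file the chunk came from
            "chunk_index": i,       # position within that file
            "url": f"{base_url}/{pdf_name}",  # link returned to the user
        }
        for i, chunk in enumerate(chunks)
    ]


records = make_records("report.pdf", ["first chunk", "second chunk"],
                       "https://example.com/pdfs")
```

When a query matches a chunk, you return `url` (and optionally `chunk_index` to deep-link to a page) instead of the text file.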

This is an overview of the basic process that I like to recommend (because it comes with a handy flowchart!): https://www.youtube.com/watch?v=Ix9WIZpArm0&ab_channel=Chatwithdata

Good luck!
