Converting PDF File Text into Embeddings

I don't know exactly what you're working with, but reading what you want to do, this is the process I use now, so I can share it. I'm not saying this is the only way or the best way, just what I've been doing for the past several months:

  1. Organize your PDFs.
  2. Extract text from the PDF files. I use FineReader PDF, but any PDF-to-text extractor will do.
  3. Chunk your texts. I use my own semantic chunking process, https://www.youtube.com/watch?v=w_veb816Asg, but the basic LangChain method is to chunk by size. There is some discussion of that here: The length of the embedding contents - #21 by klcogluberk
  4. Embed your content. Here, you can vectorize it yourself using OpenAI’s embedding model. I use Weaviate’s text2vec-openai transformer, which has been working well for me. I believe Pinecone is regarded as the gold standard in this field.
  5. Use cosine similarity (or a similar method) to search your embeddings. Again, I use Weaviate’s query system since I am using their vector store, but if you vectorize your content into your own database, you can run the cosine similarity searches locally.
  6. Link search results back to the original PDFs. This is optional, but it’s what I do. Remember that you exported your PDFs to text files, then chunked them, then embedded them? Well, I also upload the original PDFs to my website, where users query (run cosine similarity searches against) the vector store. The links that come back from these searches don’t go to the text files, but back to the original PDFs.
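To make step 3 concrete, here is a minimal sketch of the basic size-based chunking that LangChain popularized (split on a fixed character count with a small overlap). The function name and parameters are illustrative, not LangChain's actual API:

```python
def chunk_by_size(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with a small overlap, so a sentence
    cut at a boundary still appears (at least partially) in both chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


# Example: a 1000-character text yields overlapping ~500-character chunks.
chunks = chunk_by_size("x" * 1000, chunk_size=500, overlap=50)
```

In practice you would tune `chunk_size` to your embedding model's token limits; semantic chunking (splitting at topic boundaries instead of a byte count) tends to give cleaner retrieval, which is why I prefer it.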
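And for step 6, the trick is simply to store source metadata alongside each chunk when you index it, so a search hit can link back to the original PDF rather than the extracted text file. A hypothetical sketch (field names and the URL scheme are my own, not from any particular vector store):

```python
def make_records(pdf_name: str, chunks: list[str], base_url: str) -> list[dict]:
    """Wrap each chunk with metadata pointing back at its source PDF,
    ready to be inserted into a vector store alongside its embedding."""
    return [
        {
            "text": chunk,          # the chunk itself, to be embedded
            "source_pdf": pdf_name, # original file the chunk came from
            "chunk_index": i,       # position within that file
            "url": f"{base_url}/{pdf_name}",  # link returned to the user
        }
        for i, chunk in enumerate(chunks)
    ]


records = make_records("report.pdf", ["first chunk", "second chunk"],
                       "https://example.com/pdfs")
```

When a query matches a chunk, you return `url` (and optionally `chunk_index` to deep-link to a page) instead of the text file.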

This is an overview of the basic process that I like to recommend (because it comes with a handy flowchart!): https://www.youtube.com/watch?v=Ix9WIZpArm0&ab_channel=Chatwithdata

Good luck!
