I don’t know exactly what you are looking at, but what you describe wanting to do is what I do now, so I can share my process. I’m not saying this is the only way or the best way, just what I’ve been doing for the past several months:
Organize your PDFs.
Extract text from the PDF files. I use FineReader PDF, but any PDF-to-text extractor will do.
Chunk your texts. I use my own process of semantic chunking, https://www.youtube.com/watch?v=w_veb816Asg , but the basic LangCh…
Update: I now use a number of text extractors: PyMuPDF, AWS Textract, pdftotext, Solr Tika, Marker… I sometimes even use models to extract text from images and/or PDFs: Gemini 1.5 Pro and GPT-4o.
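For anyone who wants to see what the extraction step looks like in code, here is a minimal sketch using PyMuPDF, one of the extractors listed above. It is only an illustration, not my exact setup: the function names `normalize_text` and `extract_pdf_text` are made up for this example, and the cleanup rules are assumptions you will probably want to adjust for your own documents.

```python
import re


def normalize_text(raw: str) -> str:
    """Tidy extracted text (hypothetical helper, not part of PyMuPDF)."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # rejoin "exam-\nple" -> "example"
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)         # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)       # keep at most one blank line
    return text.strip()


def extract_pdf_text(pdf_path: str) -> str:
    """Return the cleaned-up text of every page in a PDF, using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so normalize_text works without it

    with fitz.open(pdf_path) as doc:
        pages = [page.get_text() for page in doc]
    return normalize_text("\n\n".join(pages))
```

Calling `extract_pdf_text("some_file.pdf")` (a placeholder path) gives you one string per document that you can pass on to chunking. Scanned PDFs with no text layer will come back mostly empty with this approach, which is where OCR tools or the vision models come in.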
My semantic chunking process has also evolved significantly; see "Using gpt-4 API to Semantically Chunk Documents" (post #166 by SomebodySysop).
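If you just want a starting point for the chunking step, here is a naive baseline sketch. To be clear, this is not my semantic chunking process: it simply groups paragraphs up to a character budget and repeats the last paragraph or so as overlap. The name `chunk_text` and the parameters `max_chars` and `overlap` are assumptions for illustration only.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 1) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of each chunk are repeated at the start
    of the next one. A single paragraph longer than max_chars is kept whole,
    so chunks can exceed the budget.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Budget exceeded: close the current chunk and carry the overlap forward.
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries at least avoids cutting sentences in half, which is the main weakness of plain fixed-size splitting. Each chunk it returns is what you would then send to your embedding model.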
If all of this is too much, here is a tutorial I think covers the basic nuts and bolts of the embedding process: https://www.youtube.com/watch?v=Ix9WIZpArm0