OpenAI Embeddings - Search through ~1000 PDFs

Update: I now use a number of text extractors: PyMuPDF, AWS Textract, pdftotext, Solr Tika, Marker… I sometimes even use models to extract text from images and/or PDFs: Gemini 1.5 Pro and GPT-4o.
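For anyone wondering how several extractors can be combined, here is a minimal sketch of one possible approach: try each extractor in turn and keep the first non-empty result. The fallback order and the helper names (`extract_text`, `extract_with_pymupdf`, `extract_with_pdftotext`) are assumptions made up for illustration, not a description of my actual pipeline. It assumes PyMuPDF is installed and the `pdftotext` command-line tool is on the PATH.

```python
import subprocess
from typing import Callable, Iterable


def extract_with_pymupdf(path: str) -> str:
    """Extract text page by page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the rest works without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdftotext(path: str) -> str:
    """Extract text with the pdftotext CLI, writing the output to stdout ('-')."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_text(path: str, extractors: Iterable[Callable[[str], str]]) -> str:
    """Hypothetical helper: return the first non-empty extraction result."""
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            # This extractor failed (missing tool, unreadable file, ...); try the next.
            continue
        if text.strip():
            return text
    return ""  # nothing worked; the file may be a scan that needs OCR or a vision model
```

Usage would look like `extract_text("report.pdf", [extract_with_pymupdf, extract_with_pdftotext])`, where `report.pdf` is a placeholder file name. An empty result is a reasonable signal to hand the file to an OCR service or a multimodal model instead.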

My semantic chunking process has also evolved significantly; see "Using gpt-4 API to Semantically Chunk Documents" (post #166 by SomebodySysop).

If all of this is too much, here is a tutorial I think covers the basic nuts and bolts of the embedding process: https://www.youtube.com/watch?v=Ix9WIZpArm0
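If you prefer code to video, the basic embedding search loop can be sketched roughly as follows: split the text into chunks, embed each chunk, embed the query, and rank the chunks by cosine similarity. This is only a sketch under stated assumptions. `toy_embed` is a hashed bag-of-words stand-in so the example runs offline, not a real embedding model; in practice you would pass in a function that calls your embedding provider. `chunk_text` uses plain fixed-size word windows, not semantic chunking. All function names here are hypothetical.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def toy_embed(text: str, dims: int = 64) -> Vector:
    """Stand-in embedding (assumption): hash each word into one of `dims` buckets."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: str,
    chunks: Sequence[str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, chunk) pairs, most similar first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

For ~1000 PDFs you would embed the chunks once and store the vectors (in a database or vector store) rather than re-embedding them on every query as this sketch does.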
