OpenAI Embeddings - Search through ~1000 PDFs

SomebodySysop · August 28, 2024, 7:28am

Update: I now use a number of text extractors: PyMuPDF, AWS Textract, pdftotext, Solr Tika, Marker… I sometimes even use models to extract text from images and/or PDFs: Gemini 1.5 Pro and GPT-4o.

My semantic chunking process has also evolved significantly: Using gpt-4 API to Semantically Chunk Documents - #166 by SomebodySysop

If all of this is too much, here is a tutorial I think covers the basic nuts and bolts of the embedding process: https://www.youtube.com/watch?v=Ix9WIZpArm0
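
For anyone who wants to see the basic flow in code first, here is a minimal sketch of the chunk, embed, and search loop. It is only an illustration under some loud assumptions: `toy_embed` is a hypothetical stand-in (hashed word counts) so the example runs offline, and in practice you would swap in a call to a real embedding model; `chunk_text` is plain fixed-size windowing, not the semantic chunking linked above; and the file names and texts are made up. The text is assumed to have already been extracted from the PDFs.

```python
import math
import re
import zlib


def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word windows (crude fixed-size chunking)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]


def toy_embed(text, dims=1024):
    """ASSUMPTION: placeholder for a real embedding model call.

    Hashes each lowercase token into one of `dims` buckets and counts hits.
    """
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs, embed=toy_embed):
    """Embed every chunk of every document once; keep (name, chunk, vector)."""
    index = []
    for name, text in docs.items():
        for chunk in chunk_text(text):
            index.append((name, chunk, embed(chunk)))
    return index


def search(index, query, embed=toy_embed, top_k=3):
    """Embed the query and return the top_k (score, name, chunk) matches."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), name, chunk) for name, chunk, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical extracted text, keyed by source file name.
docs = {
    "invoice.pdf": "invoice payment due within thirty days of receipt",
    "cats.pdf": "cats purr when they are happy and well fed",
}
index = build_index(docs)
for score, name, chunk in search(index, "payment due date", top_k=1):
    print(f"{score:.3f}  {name}  {chunk}")
```

With ~1000 PDFs the same structure applies, but you would normally store the vectors in a vector database or a NumPy array instead of a Python list, and embed each chunk only once.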
