OpenAI Embeddings - Search through ~1000 PDFs

dejan_ai · November 10, 2023, 7:50pm

Hey everyone, I’m new to the AI world and I’m a bit unsure if I’m in the right spot, but here goes.

I’ve got this puzzle I’m working on with a pile of small PDFs—like 1000 of them. They’re short, around 5-7 pages each.

What I want to do is use some embedding magic to turn these documents into vectors and stick them in a database. The plan is, when I toss a question into the mix, I can scan this vector database to find the closest match to my query. This extra context should help OpenAI serve up a spot-on answer.

I have two questions regarding this:

1. I want the response to include the document name or title where the answer was snatched from. (Is this even possible?)
2. I’m curious if I can cut down on the embedding expenses by creating these vectors only once and saving them in a database. That way, when I fire off a new question, I can simply embed the prompt and hunt for the closest match in my pre-existing vector stash, and then use this result as context for my question to the LLM?
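
For readers following along, here is a minimal sketch of the workflow described in the two questions above: embed each chunk once, store the vector together with the document name, then embed only the question at query time and look up the nearest chunks. It is an illustration under stated assumptions, not anyone's actual implementation from this thread. `VectorIndex`, `build_prompt`, and `openai_embed` are hypothetical names, the model name `text-embedding-3-small` is an assumption (any embedding model could be swapped in), and the in-memory list stands in for a real vector database.

```python
import json
import math
from typing import Callable, List, Tuple

Vector = List[float]
EmbedFn = Callable[[List[str]], List[Vector]]


def openai_embed(texts: List[str], model: str = "text-embedding-3-small") -> List[Vector]:
    """Embed texts via the OpenAI API (assumes `pip install openai` and OPENAI_API_KEY)."""
    from openai import OpenAI  # imported lazily so the rest works without the package

    client = OpenAI()
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorIndex:
    """Toy in-memory store: one row per chunk, tagged with its source document."""

    def __init__(self, embed: EmbedFn):
        self.embed = embed
        self.rows: List[Tuple[str, str, Vector]] = []  # (doc_name, chunk, vector)

    def add(self, doc_name: str, chunks: List[str]) -> None:
        # Documents are embedded here, once; searching never re-embeds them.
        for chunk, vector in zip(chunks, self.embed(chunks)):
            self.rows.append((doc_name, chunk, vector))

    def search(self, question: str, k: int = 3) -> List[Tuple[float, str, str]]:
        # Only the question is embedded at query time.
        q = self.embed([question])[0]
        scored = [(cosine(q, vec), name, chunk) for name, chunk, vec in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]

    def dumps(self) -> str:
        """Serialize the stored rows so the vectors can be reused later."""
        return json.dumps(self.rows)

    @classmethod
    def loads(cls, embed: EmbedFn, data: str) -> "VectorIndex":
        index = cls(embed)
        index.rows = [(name, chunk, vec) for name, chunk, vec in json.loads(data)]
        return index


def build_prompt(question: str, hits: List[Tuple[float, str, str]]) -> str:
    """Prefix each retrieved chunk with its document name so the answer can cite it."""
    context = "\n\n".join(f"[{name}] {chunk}" for _, name, chunk in hits)
    return (
        "Answer using only the context below and cite the document name in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In this sketch, `VectorIndex(openai_embed)` would be filled once with `add("some_file.pdf", chunks)` per document and saved with `dumps()`. Because the document name travels with every chunk, the first question comes down to putting that name into the prompt returned by `build_prompt`, which is then sent to the chat model of your choice.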

vickyanco · August 27, 2024, 12:45pm

Hi! I need to develop something similar. I was wondering how you ended up doing it. Thanks

SomebodySysop · August 28, 2024, 7:28am

Update: I use a number of text extractors now: PyMuPDF, AWS Textract, PdftoText, Solr Tika, Marker… I even use models sometimes to extract text from images and/or PDFs: Gemini 1.5 Pro and GPT-4o.

My semantic chunking process has also evolved significantly: Using gpt-4 API to Semantically Chunk Documents - #166 by SomebodySysop

If all of this is too much, here is a tutorial I think covers the basic nuts and bolts of the embedding process: https://www.youtube.com/watch?v=Ix9WIZpArm0

nadav.kavalerchik · August 28, 2024, 8:34am

You can use this tool with OpenAI embedding model
