Hey everyone, I’m new to the AI world and I’m a bit unsure if I’m in the right spot, but here goes.
I’ve got this puzzle I’m working on with a pile of small PDFs—like 1000 of them. They’re short, around 5-7 pages each.
What I want to do is use some embedding magic to turn these documents into vectors and stick them in a database. The plan is, when I toss a question into the mix, I can scan this vector database to find the closest match to my query. This extra context should help OpenAI serve up a spot-on answer.
I have two questions regarding this:
I want the response to include the document name or title where the answer was snatched from. (Is this even possible?)
I’m curious if I can cut down on the embedding expenses by creating these vectors only once and saving them in a database. That way, when I fire off a new question, I can simply embed the prompt and hunt for the closest match in my pre-existing vector stash, and then pass that match to the LLM as context for my question?
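To make both questions concrete, here is a rough sketch of the flow I have in mind. It is only an illustration under assumptions: `toy_embed` is a hypothetical stand-in for a real embedding API call (it just hashes words into a small vector), and the `chunks` table, `build_index`, and `search` are names made up for this example. The idea is that each vector is stored once next to its document name, so a query only needs to embed the question and the best match comes back with the name of the PDF it came from.

```python
import hashlib
import json
import math
import sqlite3

DIM = 256  # toy dimensionality, chosen arbitrarily for this sketch


def toy_embed(text):
    """Hypothetical stand-in for a real embedding call: hashed bag of words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so dot product = cosine


def build_index(conn, docs):
    """Embed each document once and store the vector next to its name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks "
        "(doc_name TEXT PRIMARY KEY, text TEXT, vector TEXT)"
    )
    for name, text in docs.items():
        seen = conn.execute(
            "SELECT 1 FROM chunks WHERE doc_name = ?", (name,)
        ).fetchone()
        if seen is None:  # already-embedded documents are not embedded again
            conn.execute(
                "INSERT INTO chunks VALUES (?, ?, ?)",
                (name, text, json.dumps(toy_embed(text))),
            )
    conn.commit()


def search(conn, question, top_k=1):
    """Embed only the question and rank stored vectors by cosine similarity."""
    q = toy_embed(question)
    scored = []
    rows = conn.execute("SELECT doc_name, text, vector FROM chunks")
    for name, text, vector in rows:
        score = sum(a * b for a, b in zip(q, json.loads(vector)))
        scored.append((score, name, text))
    scored.sort(reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, {
        "invoice_policy.pdf": "invoices are paid within thirty days",
        "travel_policy.pdf": "flights must be booked in economy class",
    })
    score, name, text = search(conn, "when are invoices paid")[0]
    # The document name travels with the retrieved text into the prompt.
    print(f"Context from {name}: {text}")
```

With a real setup, `toy_embed` would be replaced by the embedding model and the SQLite table by a proper vector database, and longer PDFs would probably be split into chunks that each keep their document name.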
Update: I use a number of text extractors now: PyMuPDF, AWS Textract, pdftotext, Solr Tika, Marker… I even use models sometimes to extract text from images and/or PDFs: Gemini 1.5 Pro and GPT-4o.