I’m currently experimenting with a vector database and data extracted from PDFs. Those PDFs are mostly product information papers.
For example, I want to ask questions like these:
- “Which papers haven’t been updated since 2018?”
- “How many different material providers are there?”
The current vector database is fed with a few hundred PDFs. But of course, for those types of questions, the vector database is not the right approach. What else do I have to do to achieve that?
Embeddings/vector databases are generally the right approach for chatting with your data.
Those questions seem to be more about the metadata of the papers than about their content. You’ll need to find a way to generate proper embeddings for that type of content, or detect whether the question is about the metadata rather than the content and then give the model a CSV of the metadata, or have it write a SQL query, or something along those lines.
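As a rough sketch of the SQL route: assuming the metadata has already been extracted into a table, both example questions become plain queries. The table name `papers`, its columns, and the sample rows below are made up for illustration, and the queries are hand-written here, whereas in a real setup the model would generate them from the user's question.

```python
import sqlite3

# Hypothetical metadata rows: (filename, material provider, last update date).
# Dates are stored as ISO strings so that string comparison matches date order.
ROWS = [
    ("valve_a.pdf", "Acme Polymers", "2017-05-02"),
    ("valve_b.pdf", "Acme Polymers", "2021-11-30"),
    ("seal_c.pdf", "Borealis Metals", "2018-09-14"),
    ("pump_d.pdf", "Cobalt Supply", "2023-01-08"),
]


def build_db(rows):
    """Create an in-memory SQLite database holding one row per PDF."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE papers ("
        "filename TEXT PRIMARY KEY, provider TEXT, last_updated TEXT)"
    )
    conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", rows)
    return conn


def not_updated_since(conn, year):
    """Return filenames whose last update falls in `year` or earlier."""
    cutoff = f"{year + 1}-01-01"
    cur = conn.execute(
        "SELECT filename FROM papers WHERE last_updated < ? ORDER BY filename",
        (cutoff,),
    )
    return [row[0] for row in cur]


def provider_count(conn):
    """Return the number of distinct material providers."""
    return conn.execute("SELECT COUNT(DISTINCT provider) FROM papers").fetchone()[0]


conn = build_db(ROWS)
stale = not_updated_since(conn, 2018)
n_providers = provider_count(conn)
print(stale, n_providers)
```

The vector database would still handle questions about the content itself; a table like this only covers the counting and filtering questions that similarity search answers poorly.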
Thanks for your reply.
How would it be possible to detect questions about the metadata? There are no obvious “keywords” for that, and requiring users to type special keywords wouldn’t be very user-friendly anyway…
Regarding the CSV, would you create one CSV file containing all the data? The problem I see here is that the CSV file would be large, since we are talking about ~200k PDF documents.
I forgot to mention another problem with the CSV file: all the PDF files are laid out differently, meaning that any “date” or other necessary data sits in a different place in each file.
So for a question like “Which papers haven’t been updated since 2018?” this wouldn’t work unless you created the CSV manually (which I won’t do).
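One possible way around the layout problem, as a rough sketch only: instead of relying on where the date sits, scan the extracted text of each PDF for anything that looks like a date and take the most recent one as a guess for “last updated”. The two date formats, the helper names, and the sample text below are assumptions for illustration. Real documents would likely need more patterns and some manual spot checks.

```python
import re
from datetime import date

# Assumed date formats: ISO (2018-09-14) and day.month.year (14.09.2018).
# Each pattern is paired with the order of its captured groups.
DATE_PATTERNS = [
    (re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"), ("y", "m", "d")),
    (re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b"), ("d", "m", "y")),
]


def find_dates(text):
    """Return every valid date found anywhere in the text."""
    found = []
    for pattern, order in DATE_PATTERNS:
        for match in pattern.finditer(text):
            parts = dict(zip(order, map(int, match.groups())))
            try:
                found.append(date(parts["y"], parts["m"], parts["d"]))
            except ValueError:
                # Looks like a date but is not one (e.g. month 13); skip it.
                continue
    return found


def guess_last_updated(text):
    """Heuristic: treat the latest date in the document as its last update."""
    dates = find_dates(text)
    return max(dates) if dates else None


# Hypothetical text as it might come out of a PDF extraction step.
sample = (
    "Product sheet X-200. Issued 03.04.2016. "
    "Revision 2018-09-14 replaces all earlier versions."
)
guess = guess_last_updated(sample)
print(guess)
```

The guessed dates could then be written into the metadata file automatically, one row per PDF, rather than by hand.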