Question regarding "own" data model

0xr · August 28, 2023, 1:52pm

Hi,

I’m currently experimenting with a vector database and data extracted from PDFs. Those PDFs are mostly product information papers.

For example, I want to ask questions like these:

“Which papers haven’t been updated since 2018?”
“How many different material providers are there?”

The vector database currently holds a few hundred PDFs. But of course, for those types of questions, the vector database is not the right approach. What else do I have to do to achieve that?

Best regards

novaphil · August 28, 2023, 2:37pm

Embeddings/vector databases are generally the right approach for chatting with your data.

Those questions seem to be more about the metadata of the papers than about their content. You’ll need to find a way to generate proper embeddings for that type of content, or detect whether the question is about the metadata rather than the content and give it a CSV of metadata, or have it write a SQL query, or something along those lines.
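
To make the SQL option concrete, here is a minimal sketch, assuming the metadata has already been extracted into one row per PDF. The table name `papers`, its columns, and the sample rows are all hypothetical, and "not updated since 2018" is read here as "last update in 2018 or earlier".

```python
import sqlite3

# Hypothetical metadata rows: (file name, year of last update, material provider).
ROWS = [
    ("pump_a.pdf", 2016, "Provider A"),
    ("valve_b.pdf", 2021, "Provider B"),
    ("seal_c.pdf", 2017, "Provider A"),
]

def build_db(rows):
    """Create an in-memory SQLite table holding one metadata row per PDF."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE papers ("
        "filename TEXT PRIMARY KEY, updated_year INTEGER, provider TEXT)"
    )
    conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", rows)
    return conn

def not_updated_since(conn, year):
    """Return file names whose last update is in `year` or earlier."""
    cur = conn.execute(
        "SELECT filename FROM papers WHERE updated_year <= ? ORDER BY filename",
        (year,),
    )
    return [row[0] for row in cur.fetchall()]

def count_providers(conn):
    """Return the number of distinct material providers."""
    return conn.execute("SELECT COUNT(DISTINCT provider) FROM papers").fetchone()[0]

conn = build_db(ROWS)
stale_papers = not_updated_since(conn, 2018)
provider_count = count_providers(conn)
```

Questions like these need exact filtering and counting over every document, which a SQL query does directly, while a vector search only returns the few chunks most similar to the question.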

0xr · August 30, 2023, 7:16am

Hi @novaphil,

Thanks for your reply.

How would it be possible to detect questions about the metadata? There are no such “keywords” for that, and besides, it wouldn’t be very user-friendly to require special keywords…

Regarding the CSV, would you create a single CSV file with all the data? The problem I see here is that the CSV file would be large, since we are talking about ~200k PDF documents.

Kind regards
Oliver

0xr · August 30, 2023, 7:22am

I forgot to mention another problem with the CSV file: all the PDF files are different, meaning that the “date” or any other necessary data is located in a different place in each one.

For the question “Which papers haven’t been updated since 2018?” this wouldn’t work unless you created the CSV manually (I won’t do that).
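
One possible way around manual entry, sketched under strong assumptions: if the text of each PDF has already been extracted, a regular expression can pick out every four-digit year wherever it appears, and the latest one can serve as a rough proxy for "last updated". The function name `latest_year` and the year range are hypothetical choices, and the heuristic will misfire on part numbers or standards that happen to look like years.

```python
import re

# Matches standalone four-digit years from 1980 to 2099 (an assumed range).
YEAR_PATTERN = re.compile(r"\b(19[89]\d|20\d{2})\b")

def latest_year(text):
    """Return the most recent year mentioned in `text`, or None if there is none."""
    years = [int(match) for match in YEAR_PATTERN.findall(text)]
    return max(years) if years else None

# Usage on a hypothetical snippet of extracted PDF text.
sample = "Rev. 3, issued 2015. Updated March 2017."
sample_year = latest_year(sample)
```

The value found this way could fill a metadata column automatically, with a spot check on a sample of documents to see how often the guess is wrong.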
