Reducing a PDF to a feature vector (or "data extraction")

Hi all!

I am currently working on a use case where we reduce a PDF to a feature vector through an LLM pipeline, much like a RAG architecture. We use LangChain, vector stores, chunking, etc. The goal is to stuff a prompt with relevant data and then have an LLM reduce that data to a specific feature vector. I guess you could call it data extraction, because we are looking for certain data points within the PDFs.
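
As a rough illustration of the pipeline shape described above (not the actual implementation), here is a minimal sketch using only the Python standard library. The field names in `FIELDS` are hypothetical, the keyword-overlap `score` function is a simple stand-in for a vector store lookup, and the LLM call itself is left out: `parse_features` only assumes the model replies with a JSON object.

```python
import json
import re

# Hypothetical target fields; a real pipeline would define its own.
FIELDS = ["invoice_number", "total_amount"]


def chunk_text(text, size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def score(chunk, query):
    """Count query words present in the chunk (a stand-in for vector similarity)."""
    words = set(re.findall(r"\w+", chunk.lower()))
    return sum(1 for w in re.findall(r"\w+", query.lower()) if w in words)


def retrieve(chunks, query, k=2):
    """Return the k chunks with the highest score for the query."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]


def build_prompt(context_chunks, fields):
    """Stuff the retrieved chunks into a single extraction prompt."""
    context = "\n---\n".join(context_chunks)
    return (
        "Extract the following fields from the context and answer with JSON only.\n"
        f"Fields: {', '.join(fields)}\n"
        f"Context:\n{context}\n"
    )


def parse_features(llm_output, fields):
    """Turn the model's JSON reply into a fixed-order feature vector.

    Missing fields and unparseable replies yield None entries.
    """
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return [data.get(f) for f in fields]
```

The chunk size and overlap are exactly the kind of parameters we would want to tune, which is why they are exposed as arguments here.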

We are thinking of evaluating the prompt design and optimal chunk size with RAGAS. I do have some questions:

  • Would you consider this a RAG architecture, or is it something else?
  • How to evaluate the output – are there any good alternatives to RAGAS?
  • Do you know of any similar use cases I could have a look at?

Thank you.

Kind regards,
Linus Östlund

Hi,

  • Would you consider this a RAG architecture, or is it something else?

RAG is a broad umbrella term covering any number of approaches that augment a prompt or generate its context, so sure, you could include it under that.

  • How to evaluate the output – are there any good alternatives to RAGAS?

You might take a look at the OpenAI Evals framework for evaluation ideas: openai/evals on GitHub, a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
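
For extraction into a fixed set of fields, a simpler starting point than a full framework may be per-field exact match against a small hand-labelled set. This is only a sketch of that idea, not part of RAGAS or openai/evals; the function name and the dictionary layout are assumptions made for illustration.

```python
def field_accuracy(predictions, gold, fields):
    """Return per-field exact-match accuracy over paired documents.

    `predictions` and `gold` are lists of dicts, one dict per document,
    mapping field name to extracted value (hypothetical layout).
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    hits = {f: 0 for f in fields}
    for pred, ref in zip(predictions, gold):
        for f in fields:
            # A field counts as correct only if the values match exactly.
            if pred.get(f) == ref.get(f):
                hits[f] += 1
    n = len(gold)
    return {f: (hits[f] / n if n else 0.0) for f in fields}
```

Running this once per chunk size or prompt variant gives a per-field number to compare, which may be enough to pick between configurations before investing in a heavier evaluation setup.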

  • Do you know of any similar use cases I could have a look at?

I do not know of any personally; hopefully others can contribute other cases.

Great reply!

I had a look at openai/evals, and it seems quite complex. But it is great to know about official resources from OpenAI, so thank you :pray:
