RAG is not really a solution


How do you handle tables in PDFs? That is one of the hardest parts for me. For complex tables it fails.

Same for us. The parser is trained to pre-split cells with a tab character, but it often fails. So a second parser adds pipe characters instead of tabs in the formatter output. Then all blocks containing pipes (table rows) are set apart and grouped into sequential runs (full tables). Then all pipes are removed and the raw table text is sent to a “table formatter”, a separate model trained on tables that returns tables as line-break-separated rows with pipe-separated cells.
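The grouping step above can be sketched roughly like this (a minimal sketch; the function names are made up, and the real system uses trained models around these steps rather than plain string handling):

```python
# Hypothetical sketch of the pipe-based table grouping described above.
# Lines containing pipes are treated as table rows; consecutive runs of
# such lines are grouped into whole tables, then the pipes are stripped
# before the raw table text goes to the separate "table formatter" model.

def group_tables(lines):
    """Split parser output into alternating ('text', ...) and ('table', ...) runs."""
    blocks, current, mode = [], [], None
    for line in lines:
        kind = "table" if "|" in line else "text"
        if kind != mode and current:
            blocks.append((mode, current))
            current = []
        mode = kind
        current.append(line)
    if current:
        blocks.append((mode, current))
    return blocks

def strip_pipes(table_lines):
    """Remove pipe markers so the raw table text can be sent to the formatter model."""
    return [" ".join(line.split("|")).strip() for line in table_lines]
```

This only captures the bookkeeping; the hard part (merged cells, ragged rows) is exactly what the downstream model has to absorb.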

But it is still shaky because of merged cells. I think there is a Google API specifically for tables that is worth testing (at least I will at some point).

Is anyone beside me using Grobid as their PDF parser?


No, but it looks interesting. What is your experience with it, and have you tried Google’s solution?

I’m using the GPU version in Docker, on arXiv PDFs. It reliably gets author, title, and abstract, chunks intelligently on section boundaries, parses most tables, etc. Once I found it about a year ago I moved on to other things and haven’t revisited. Paper ingest is a core part of my research assistant (Owl), but there are lots of other pieces too, like information extraction, (l)ontology construction, …

To me, RAG sounds promising in theory, but achieving it in practice can be a problem.

While retrieval and aggregation can be improved with techniques like chunking and re-ranking, generating accurate text from the aggregated information remains a challenge. Even with carefully tuned prompt design, there’s still limited control over the final output.
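To make the chunking and re-ranking part concrete, here is a minimal sketch. The word-overlap scorer below is a stand-in assumption; a real pipeline would use a learned re-ranker (e.g. a cross-encoder) in its place:

```python
# Toy sketch of chunk-then-rerank. The lexical overlap score is only a
# placeholder for a trained re-ranking model.

def chunk(text, size=40, overlap=10):
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def rerank(query, chunks, top_k=2):
    """Order chunks by word overlap with the query (proxy for a learned scorer)."""
    query_words = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: -len(query_words & set(c.lower().split())))
    return scored[:top_k]
```

Note that even a perfect re-ranker only controls what goes into the prompt; as the comment above says, what the model generates from those chunks is where control runs out.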

In short, RAG is a great concept, but its practical implementation with current technology seems uncertain. Ultimately, the success of RAG hinges on the continued development of large language models.