How do you handle the tables in the PDF? That is one of the most difficult parts for me. For complex tables it fails.
Same for us. The parser is trained to pre-split cells with a tab character, but it often fails. So the second parser adds pipe characters instead of tabs in the formatter output. Then all blocks containing pipes (table rows) are set apart and grouped by sequential paths (a full table). Then all pipes are removed and the raw table text is sent to a "table formatter": a separate model trained on tables that returns tables as line-break-separated rows with pipe-separated cells.
It is still shaky because of merged cells, though. I think there is a Google API specific to tables that is worth testing (at least I will at some point).
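To make the grouping step concrete, here is a minimal sketch of how pipe-containing blocks could be collected into tables. It is not the actual parser: the `(path, text)` block shape, the colon-separated path format, and the function names are assumptions for illustration.

```python
from typing import List, Tuple

Block = Tuple[str, str]  # assumed shape: (path, text), e.g. ("000:004", "Name | Qty")


def _parent_and_index(path: str) -> Tuple[str, int]:
    """Split "000:004" into ("000", 4)."""
    parent, _, last = path.rpartition(":")
    return parent, int(last)


def group_table_blocks(blocks: List[Block]) -> List[List[Block]]:
    """Group pipe-containing blocks whose paths are consecutive siblings."""
    tables: List[List[Block]] = []
    current: List[Block] = []
    for block in blocks:
        path, text = block
        if "|" not in text:
            # A non-table block closes any table being collected.
            if current:
                tables.append(current)
                current = []
            continue
        if current:
            prev_parent, prev_index = _parent_and_index(current[-1][0])
            parent, index = _parent_and_index(path)
            # A gap in the sequence or a new parent starts a new table.
            if parent != prev_parent or index != prev_index + 1:
                tables.append(current)
                current = []
        current.append(block)
    if current:
        tables.append(current)
    return tables


def raw_table_text(table: List[Block]) -> str:
    """Remove the pipes; this raw text is what would go to the table formatter."""
    rows = (" ".join(text.replace("|", " ").split()) for _, text in table)
    return "\n".join(rows)
```

The table-formatter model itself is out of scope here; the sketch only covers setting the rows apart and preparing its input.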
Is anyone besides me using Grobid as their PDF parser?
No, but it looks interesting. What is your experience with it, and have you tried Google's solution?
I'm using the GPU version in Docker, on arXiv PDFs. It reliably gets author, title, and abstract, chunks intelligently on section boundaries, parses most tables, etc. Once I found it about a year ago I moved on to other things and haven't revisited. Paper ingest is a core part of my research asst (Owl), but there are lots of other pieces too - like information extraction, (l)ontology construction, ...
To me, RAG sounds promising in theory, but achieving it in practice might be a problem.
Retrieving and aggregating information can be improved with techniques like chunking and re-ranking, but generating accurate text from the aggregated information remains a challenge. Even with a finely tuned prompt design, there's still limited control over the final output.
In short, RAG is a great concept, but its practical implementation with current technology seems uncertain. Ultimately, the success of RAG hinges on the continued development of LLMs (Large Language Models).
I tend to agree that RAG is a square-peg-in-a-round-hole type of issue. I think there's a fundamental disconnect in using LLMs in these scenarios, as they are probabilistic and with RAG we are often (not always) trying to force them into deterministic behavior. So, I'm not sure that any algo will fully solve for this.
When it does work it's pretty magical, but when it doesn't it's quite frustrating. And there are so many variables that go into the equation (size of document, format of document, number of documents, prompt used, model used, embeddings used, chunking used, similarity algo used, etc.) that any casual developer is unlikely to ever get it to work with high reliability on their own. So after many months of experimentation we ultimately decided to wait it out and let the experts (OpenAI etc.) continue to improve it, while still expecting it to never be perfect. While it is highly impressive in many scenarios, if I were a lawyer or doctor I wouldn't use it, as it can be misleading a significant % of the time...
Personally finding RAG pretty effective!
That's great to hear. What is the volume of your documents? I would really love to know your approach with respect to chunking and retrieval.
I authored and maintain a Chatbot for forum software. Admittedly the data is naturally chunked by posts, as you might imagine, both in terms of size and semantic focus. It seems to work well for > 150,000 posts. I've yet to observe a limit. I can use user group and tag re-ranking to promote some posts.
This technique is very effective at knowledge retrieval and can occasionally be "sublime".
RAG is basically what ChatGPT is and it works well.
Yes, RAG is not the way to AGI, but imho, neither is whatever ChatGPT currently is.
Forget RAG - the experts are a little split, but a mounting intellectual weight is falling behind the idea that LLMs in general are not the path to AGI.
Doesn't make LLMs invalid as a tool in the interim though ...
I see them more like a brick in the AGI building: the "thought transformer" - the thing that takes a thought and transforms it into text and back.
That's what they are for, after all? A tool to apply language rules more complex than grammar to an idea or plan.
AGI on the other hand would be the system that generates the ideas and plans.
RAG here would be an equivalent of memory where you can use associations (vectors) to retrieve the data.
But to get to AGI you still need all the other parts of the brain and the entity: logic, motivations, emotions (learning enforcers), patterns, character traits (like curiosity, to update the motivations based on new data) and self-awareness with morals (morals are needed to keep the system within bounds; BTW, who will define the bounds, and how can we guarantee the AGI will not update its bounds on the fly?).
Still like the other 99% of things to do to get there.
unless you have sound logic and solid code behind it to control what's going on in your workflows...
Here is an excerpt of your document passed through LAWXER's internal API (#20 Money Breaks):
Outline:
20. Money Breaks (Union Proposal No. 19; Producers' Proposal No. 9)
a. Looping. Retakes, Etc. (Union Proposal No. 19.A.)
Modification Clause for Section 58 of the Television Agreement :
Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers :
Compensation Rates for V2 Hour Programs
1 Compensation Rate for 1-Hour Program
Payment Rate for Services Exceeding One Hour
b. Fittings (Union Proposal No. 19.B.)
Modification Directive: CBA, Schedule A, Section 16. A. (2) :
Compensation for Day Performers' Fittings Prior to Work Day
c. Prepaid Looping Day for a Weekly Performer Employed on a Theatrical Motion Picture at a Salary of at Least $ 10,000 Per Week (Producers' Proposal No. 9)
Addition of Final Paragraph to Section 27.A of Schedule C of the Codified Basic Agreement :
Prepaid Looping Day Clause for Weekly Performers Earning $10,000 or More
JSON format of the section:
{
"index": 0,
"title": "20. Money Breaks (Union Proposal No. 19; Producers\u2019 Proposal No. 9)",
"name": "",
"content": "",
"type": "container",
"path": "000",
"children": [
{
"index": 0,
"title": "a. Looping. Retakes, Etc. (Union Proposal No. 19.A.)",
"name": "",
"content": "",
"type": "container",
"path": "000:000",
"children": [
{
"index": 0,
"title": "",
"name": "Modification Clause for Section 58 of the Television Agreement",
"content": "Modify Section 58 of the Television Agreement as follows:",
"type": "container",
"path": "000:000:000",
"children": [
{
"index": 0,
"title": "",
"name": "Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers",
"content": "\"The performer\u2019s contract shall not include guarantees for looping, retakes, added scenes, process transparencies, trick shots, trailers, changes or foreign versions (subject to availability) outside the period of consecutive employment, except for Schedule F performers and except that advance payment for looping, retakes, etc. is permitted as to Schedule C performers whose salaries equal or exceed:",
"type": "container",
"path": "000:000:000:000",
"children": [
{
"index": 0,
"title": "",
"name": "Compensation Rates for V2 Hour Programs",
"content": "Program Length Per Episode or Program\nV2 hour $5,000\n($6,500 for contracts entered into on or after [the first Sunday following the AMPTP_ \u2019s receipt of notice of ratification]}",
"type": "body",
"path": "000:000:000:000:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Compensation Rate for 1-Hour Program",
"content": "1 hour $7,500\n($10,000 for contracts entered into on or after [the first Sunday following the AMPTP_ \u2019s receipt of notice of ratification]}",
"type": "body",
"path": "000:000:000:000:001",
"children": []
},
{
"index": 2,
"title": "",
"name": "Payment Rate for Services Exceeding One Hour",
"content": "more than 1 hour $10,000\n($12,500 for contracts entered into on or after [the first Sunday following the AMPTP \u2019s receipt of notice of ratification]]]",
"type": "body",
"path": "000:000:000:000:002",
"children": []
}
]
}
]
}
]
},
{
"index": 1,
"title": "b. Fittings (Union Proposal No. 19.B.)",
"name": "",
"content": "",
"type": "container",
"path": "000:001",
"children": [
{
"index": 0,
"title": "",
"name": "Modification Directive: CBA, Schedule A, Section 16. A. (2)",
"content": "Modify CBA, Schedule A, Section 16. A. (2) as follows:",
"type": "body",
"path": "000:001:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Compensation for Day Performers' Fittings Prior to Work Day",
"content": "\u201c(2) Fittings on a day prior to work: \u201cWhen a day performer is fitted on a day prior to the day on which he works, he shall be entitled to one (1) hour minimum pay for each call. Additional time shall be paid for in fifteen (15) minute units. Day performers receiving over $ 1,200 $1,400 per day (over $ 1,400 $ 1,500 per day with respect to contracts entered into on or after [the first Sunday after the AMPTP *5 receipt of notice of ratification] July 1, 2020) shall not be entitled to any compensation for such fittings.\u201d",
"type": "body",
"path": "000:001:001",
"children": []
}
]
},
{
"index": 2,
"title": "c. Prepaid Looping Day for a Weekly Performer Employed on a Theatrical Motion Picture at a Salary of at Least $ 10,000 Per Week (Producers\u2019 Proposal No. 9)",
"name": "",
"content": "",
"type": "container",
"path": "000:002",
"children": [
{
"index": 0,
"title": "",
"name": "Addition of Final Paragraph to Section 27.A of Schedule C of the Codified Basic Agreement",
"content": "Add the following as the last paragraph of Section 27. A. of Schedule C of the Codified Basic Agreement:",
"type": "container",
"path": "000:002:000",
"children": [
{
"index": 0,
"title": "",
"name": "Prepaid Looping Day Clause for Weekly Performers Earning $10,000 or More",
"content": "\u201cThe Producer may bargain with any weekly performer employed on a theatrical motion picture at a salary of $ 10.000 or more per week to include one (1) prepaid looping day in the performer\u2019s compensation. The performer\u2019s employment contract shall contain a separate provision to that effect and a box must be provided next to the prepayment provision for the performer to initial to indicate acceptance.\u201d",
"type": "body",
"path": "000:002:000:000",
"children": []
}
]
}
]
},
{
"index": 3,
"title": "",
"name": "",
"content": "",
"type": "other",
"path": "000:003",
"children": [
{
"index": 0,
"title": "",
"name": "",
"content": "00303313.DOCX; 4 51",
"type": "other",
"path": "000:003:000",
"children": []
}
]
}
]
}
RAG engine storable elements
Here is what I would store in my RAG engine to work with this data:
[
{
"class": "Element",
"properties": {
"title": "20. Money Breaks (Union Proposal No. 19; Producers\u2019 Proposal No. 9)",
"name": "",
"content": "",
"outline": "20. Money Breaks (Union Proposal No. 19; Producers\u2019 Proposal No. 9)\n a. Looping. Retakes, Etc. (Union Proposal No. 19.A.)\n Modification Clause for Section 58 of the Television Agreement :\n b. Fittings (Union Proposal No. 19.B.)\n Modification Directive: CBA, Schedule A, Section 16. A. (2) :\n Compensation for Day Performers' Fittings Prior to Work Day\n c. Prepaid Looping Day for a Weekly Performer Employed on a Theatrical Motion Picture at a Salary of at Least $ 10,000 Per Week (Producers\u2019 Proposal No. 9)\n Addition of Final Paragraph to Section 27.A of Schedule C of the Codified Basic Agreement :",
"path": "000",
"parentPath": "",
"parentName": "",
"document": "abcd-1234",
"order": 0,
"type": "container"
}
},
{
"class": "Element",
"properties": {
"title": "a. Looping. Retakes, Etc. (Union Proposal No. 19.A.)",
"name": "",
"content": "",
"outline": "a. Looping. Retakes, Etc. (Union Proposal No. 19.A.)\n Modification Clause for Section 58 of the Television Agreement :\n Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers :",
"path": "000:000",
"parentPath": "000",
"parentName": "20. Money Breaks (Union Proposal No. 19; Producers\u2019 Proposal No. 9)",
"document": "abcd-1234",
"order": 0,
"type": "container"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Modification Clause for Section 58 of the Television Agreement",
"content": "Modify Section 58 of the Television Agreement as follows:",
"outline": "Modification Clause for Section 58 of the Television Agreement :\n Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers :\n Compensation Rates for V2 Hour Programs\n 1 Compensation Rate for 1-Hour Program\n Payment Rate for Services Exceeding One Hour",
"path": "000:000:000",
"parentPath": "000:000",
"parentName": "a. Looping. Retakes, Etc. (Union Proposal No. 19.A.)",
"document": "abcd-1234",
"order": 0,
"type": "container"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers",
"content": "\"The performer\u2019s contract shall not include guarantees for looping, retakes, added scenes, process transparencies, trick shots, trailers, changes or foreign versions (subject to availability) outside the period of consecutive employment, except for Schedule F performers and except that advance payment for looping, retakes, etc. is permitted as to Schedule C performers whose salaries equal or exceed:",
"outline": "Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers :\n Compensation Rates for V2 Hour Programs\n 1 Compensation Rate for 1-Hour Program\n Payment Rate for Services Exceeding One Hour",
"path": "000:000:000:000",
"parentPath": "000:000:000",
"parentName": "Modification Clause for Section 58 of the Television Agreement",
"document": "abcd-1234",
"order": 0,
"type": "container"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Compensation Rates for V2 Hour Programs",
"content": "Program Length Per Episode or Program\nV2 hour $5,000\n($6,500 for contracts entered into on or after [the first Sunday following the AMPTP_ \u2019s receipt of notice of ratification]}",
"outline": "",
"path": "000:000:000:000:000",
"parentPath": "000:000:000:000",
"parentName": "Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers",
"document": "abcd-1234",
"order": 0,
"type": "body"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Compensation Rate for 1-Hour Program",
"content": "1 hour $7,500\n($10,000 for contracts entered into on or after [the first Sunday following the AMPTP_ \u2019s receipt of notice of ratification]}",
"outline": "",
"path": "000:000:000:000:001",
"parentPath": "000:000:000:000",
"parentName": "Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers",
"document": "abcd-1234",
"order": 1,
"type": "body"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Payment Rate for Services Exceeding One Hour",
"content": "more than 1 hour $10,000\n($12,500 for contracts entered into on or after [the first Sunday following the AMPTP \u2019s receipt of notice of ratification]]]",
"outline": "",
"path": "000:000:000:000:002",
"parentPath": "000:000:000:000",
"parentName": "Exclusion of Guarantees for Additional Work Outside Consecutive Employment Period with Exceptions for Schedule F and High-Salary Schedule C Performers",
"document": "abcd-1234",
"order": 2,
"type": "body"
}
},
{
"class": "Element",
"properties": {
"title": "b. Fittings (Union Proposal No. 19.B.)",
"name": "",
"content": "",
"outline": "b. Fittings (Union Proposal No. 19.B.)\n Modification Directive: CBA, Schedule A, Section 16. A. (2) :\n Compensation for Day Performers' Fittings Prior to Work Day",
"path": "000:001",
"parentPath": "000",
"parentName": "20. Money Breaks (Union Proposal No. 19; Producers\u2019 Proposal No. 9)",
"document": "abcd-1234",
"order": 1,
"type": "container"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Modification Directive: CBA, Schedule A, Section 16. A. (2)",
"content": "Modify CBA, Schedule A, Section 16. A. (2) as follows:",
"outline": "",
"path": "000:001:000",
"parentPath": "000:001",
"parentName": "b. Fittings (Union Proposal No. 19.B.)",
"document": "abcd-1234",
"order": 0,
"type": "body"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Compensation for Day Performers' Fittings Prior to Work Day",
"content": "\u201c(2) Fittings on a day prior to work: \u201cWhen a day performer is fitted on a day prior to the day on which he works, he shall be entitled to one (1) hour minimum pay for each call. Additional time shall be paid for in fifteen (15) minute units. Day performers receiving over $ 1,200 $1,400 per day (over $ 1,400 $ 1,500 per day with respect to contracts entered into on or after [the first Sunday after the AMPTP *5 receipt of notice of ratification] July 1, 2020) shall not be entitled to any compensation for such fittings.\u201d",
"outline": "",
"path": "000:001:001",
"parentPath": "000:001",
"parentName": "b. Fittings (Union Proposal No. 19.B.)",
"document": "abcd-1234",
"order": 1,
"type": "body"
}
},
{
"class": "Element",
"properties": {
"title": "c. Prepaid Looping Day for a Weekly Performer Employed on a Theatrical Motion Picture at a Salary of at Least $ 10,000 Per Week (Producers\u2019 Proposal No. 9)",
"name": "",
"content": "",
"outline": "c. Prepaid Looping Day for a Weekly Performer Employed on a Theatrical Motion Picture at a Salary of at Least $ 10,000 Per Week (Producers\u2019 Proposal No. 9)\n Addition of Final Paragraph to Section 27.A of Schedule C of the Codified Basic Agreement :\n Prepaid Looping Day Clause for Weekly Performers Earning $10,000 or More",
"path": "000:002",
"parentPath": "000",
"parentName": "20. Money Breaks (Union Proposal No. 19; Producers\u2019 Proposal No. 9)",
"document": "abcd-1234",
"order": 2,
"type": "container"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Addition of Final Paragraph to Section 27.A of Schedule C of the Codified Basic Agreement",
"content": "Add the following as the last paragraph of Section 27. A. of Schedule C of the Codified Basic Agreement:",
"outline": "Addition of Final Paragraph to Section 27.A of Schedule C of the Codified Basic Agreement :\n Prepaid Looping Day Clause for Weekly Performers Earning $10,000 or More",
"path": "000:002:000",
"parentPath": "000:002",
"parentName": "c. Prepaid Looping Day for a Weekly Performer Employed on a Theatrical Motion Picture at a Salary of at Least $ 10,000 Per Week (Producers\u2019 Proposal No. 9)",
"document": "abcd-1234",
"order": 0,
"type": "container"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "Prepaid Looping Day Clause for Weekly Performers Earning $10,000 or More",
"content": "\u201cThe Producer may bargain with any weekly performer employed on a theatrical motion picture at a salary of $ 10.000 or more per week to include one (1) prepaid looping day in the performer\u2019s compensation. The performer\u2019s employment contract shall contain a separate provision to that effect and a box must be provided next to the prepayment provision for the performer to initial to indicate acceptance.\u201d",
"outline": "",
"path": "000:002:000:000",
"parentPath": "000:002:000",
"parentName": "Addition of Final Paragraph to Section 27.A of Schedule C of the Codified Basic Agreement",
"document": "abcd-1234",
"order": 0,
"type": "body"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "",
"content": "",
"outline": "",
"path": "000:003",
"parentPath": "000",
"parentName": "20. Money Breaks (Union Proposal No. 19; Producers\u2019 Proposal No. 9)",
"document": "abcd-1234",
"order": 3,
"type": "other"
}
},
{
"class": "Element",
"properties": {
"title": "",
"name": "",
"content": "00303313.DOCX; 4 51",
"outline": "",
"path": "000:003:000",
"parentPath": "000:003",
"parentName": "",
"document": "abcd-1234",
"order": 0,
"type": "other"
}
}
]
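For anyone wondering how the flat list above relates to the nested section JSON, here is a rough sketch of the conversion. The exact rules are not spelled out in the post, so these are assumptions read off the sample data: containers get an outline made of their own label plus two levels of descendants, other element types get an empty outline, and `order` is the node's `index`. The function names are hypothetical.

```python
from typing import Any, Dict, List, Optional

Node = Dict[str, Any]


def label(node: Node) -> str:
    """A node is labelled by its title, or by its name when the title is empty."""
    return node.get("title") or node.get("name") or ""


def outline(node: Node, depth: int = 2, indent: int = 0) -> List[str]:
    """Own label plus the labels of `depth` levels of descendants (assumed rule)."""
    lines = [" " * indent + label(node)] if label(node) else []
    if depth > 0:
        for child in node.get("children", []):
            lines.extend(outline(child, depth - 1, indent + 1))
    return lines


def flatten(node: Node, document: str, parent: Optional[Node] = None) -> List[Node]:
    """Turn the nested tree into a flat, depth-first list of storable elements."""
    is_container = node["type"] == "container"
    element = {
        "class": "Element",
        "properties": {
            "title": node["title"],
            "name": node["name"],
            "content": node["content"],
            "outline": "\n".join(outline(node)) if is_container else "",
            "path": node["path"],
            "parentPath": parent["path"] if parent else "",
            "parentName": label(parent) if parent else "",
            "document": document,
            "order": node["index"],
            "type": node["type"],
        },
    }
    elements = [element]
    for child in node.get("children", []):
        elements.extend(flatten(child, document, node))
    return elements
```

Calling `flatten(section, "abcd-1234")` on the section JSON should give one element per node in the same order as the list above, though the outline text will not match character for character.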
Make sure the property labels are not included in the vectors, and skip the property values in the vector for:
- path
- parentPath
- document
- order
- type
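In Weaviate this is usually done in the class definition. Below is a sketch of a class definition built as a plain Python dict; the vectorizer module name (`text2vec-openai`) and the choice of data types are my assumptions, so check the `moduleConfig` keys against the vectorizer module and Weaviate version you actually run.

```python
from typing import Any, Dict

VECTORIZER = "text2vec-openai"  # assumption: swap in the module you actually use
VECTORIZED = ["title", "name", "content", "outline", "parentName"]
SKIPPED = {"path": "text", "parentPath": "text", "document": "text",
           "order": "int", "type": "text"}


def make_property(name: str, data_type: str, skip: bool) -> Dict[str, Any]:
    """One property definition; labels are never vectorized, values only if not skipped."""
    return {
        "name": name,
        "dataType": [data_type],
        "moduleConfig": {
            VECTORIZER: {
                "skip": skip,                    # leave the value out of the vector
                "vectorizePropertyName": False,  # leave the label out of the vector
            }
        },
    }


def element_class() -> Dict[str, Any]:
    """Class definition for the "Element" objects listed above."""
    properties = [make_property(n, "text", skip=False) for n in VECTORIZED]
    properties += [make_property(n, t, skip=True) for n, t in SKIPPED.items()]
    return {
        "class": "Element",
        "vectorizer": VECTORIZER,
        "moduleConfig": {VECTORIZER: {"vectorizeClassName": False}},
        "properties": properties,
    }
```

The resulting dict is what you would hand to your Weaviate client when creating the class.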
RAG engine query (Weaviate):
{
  Get {
    Element(
      nearText: {
        concepts: ["Here goes your question to retrieve the context for"],
        moveTo: {
          concepts: ["sample of the text you are looking for, the goal is to show the RAG engine what it might look like in the text body", "you may also add some keywords to be even more specific", "and some other stuff too if needed"],
          force: 0.63
        },
        certainty: 0.698
      },
      where: {
        path: ["document"],
        operator: Equal,
        valueText: "THE_DOCUMENT_ID_IF_YOU_WANT_FROM_A_SPECIFIC_DOC"
      },
      autocut: 3,
      limit: 15
    ) {
      content,
      title,
      name,
      outline,
      path,
      parentPath,
      parentName,
      document,
      order,
      type,
      _additional {
        id,
        certainty
      }
    }
  }
}
- The where clause is optional and targets a specific document.
- Autocut is a cool Weaviate feature that limits the number of vector clusters your engine wants to consider (here: take 3 groups of samples from the top of the results).
- Limit is the maximum number of samples to return. I never needed to go beyond 15 samples because of the quality of this beast, but I suppose in some cases you might need more.
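If you build this query from code, a small helper keeps the optional where clause tidy. This is only a sketch: the function name and parameters are made up for illustration, and the `force` and `certainty` values are simply copied from the query above.

```python
import json
from typing import List, Optional

FIELDS = "content title name outline path parentPath parentName document order type"


def build_query(question: str, examples: List[str],
                document_id: Optional[str] = None,
                limit: int = 15, autocut: int = 3) -> str:
    """Return the GraphQL query text; the where clause is added only when needed."""
    where = ""
    if document_id is not None:
        where = (
            'where: {path: ["document"], operator: Equal, '
            f"valueText: {json.dumps(document_id)}}}"
        )
    # json.dumps takes care of quoting and escaping the user-supplied strings.
    return f"""
{{
  Get {{
    Element(
      nearText: {{
        concepts: [{json.dumps(question)}],
        moveTo: {{concepts: {json.dumps(examples)}, force: 0.63}},
        certainty: 0.698
      }},
      {where}
      autocut: {autocut},
      limit: {limit}
    ) {{
      {FIELDS}
      _additional {{ id certainty }}
    }}
  }}
}}
"""
```

The returned string can be sent to the GraphQL endpoint with whatever client you already use.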
@SomebodySysop note the stored containers with outlines (e.g. path 000), where the vector represents the general idea of a container (your document) vs. the "atomic idea" of a chunk. This gives you the ability to zoom out of an atomic-idea search and run searches over documents or sections.
Say you need to find all documents talking about a subject: you run the query with the subject vector, limiting the results to elements with a path length of 3 characters.
The same applies if you want to find sections talking about a specific subject; the only difference is how many levels deep you want to go into your documents: length = (level * 3 characters) + (level * 1 character for the separators) - (1 character for the separator trimmed on the right side), i.e. 4 * level - 1.
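The path-length rule is easy to get wrong by one, so here is the arithmetic as a tiny sketch. The helper names are hypothetical; the 3-character segments and the `:` separator come from the paths shown earlier.

```python
from typing import List


def path_length(level: int, width: int = 3, sep: str = ":") -> int:
    """Length of a path at a given depth: `level` segments joined by `level - 1` separators."""
    if level < 1:
        raise ValueError("level starts at 1 (the document root)")
    return level * width + (level - 1) * len(sep)


def at_level(paths: List[str], level: int) -> List[str]:
    """Keep only the paths that sit exactly at the requested depth."""
    wanted = path_length(level)
    return [p for p in paths if len(p) == wanted]
```

So level 1 ("000") is 3 characters, level 2 ("000:000") is 7, and so on.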
So for complex searches you might go with the following logic:
- find docs/sections talking about this
- then find elements within those specific documents
- check my found samples if they answer the question or add the info my model needs
- prepare the prompt
- answer the question
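The steps above could be wired together roughly as follows. This is a sketch under assumptions: `search` is a hypothetical callable wrapping your vector query (question, optional document id, optional path length), and the "check if the samples answer the question" step is left as a simple non-empty-content filter where a model check would normally go.

```python
from typing import Callable, Dict, List, Optional

Element = Dict[str, str]
# Hypothetical signature: search(question, document_id, path_len) -> elements
Search = Callable[[str, Optional[str], Optional[int]], List[Element]]


def retrieve_context(question: str, search: Search,
                     section_level: int = 1) -> List[Element]:
    """Zoom out to matching docs/sections first, then collect elements inside them."""
    path_len = 4 * section_level - 1  # 3 characters per level plus separators
    sections = search(question, None, path_len)
    samples: List[Element] = []
    seen = set()
    for section in sections:
        for hit in search(question, section["document"], None):
            key = (hit["document"], hit["path"])
            in_section = hit["path"].startswith(section["path"])
            # Placeholder check: keep non-empty, not-yet-seen hits from this section.
            if in_section and hit["content"] and key not in seen:
                seen.add(key)
                samples.append(hit)
    return samples


def build_prompt(question: str, samples: List[Element]) -> str:
    """Prepare the prompt from the retrieved samples."""
    context = "\n\n".join(s["content"] for s in samples)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The final "answer the question" step is then just a call to your model with the prompt.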
Where to get access to this thing
The API will be released soon under a different brand name /si'mantics/ - the meaning as you hear it™
"Soon" here means a month or two. It will be available at https://www.simantiks.com (just bought that one), so stay tuned on LinkedIn: https://www.linkedin.com/in/sergeliatko/
I wouldn't say RAG is brittle, at least not if you do it right. There are certainly limitations - after all, you can only have so many functions.
If anything, RAG is what's going to give the AI field life for at least the next few years. The promises of what is functionally science-fiction AI taking over jobs and doing XYZ have fallen flat, and investors aren't known for having lots of patience (ask the VR guys). RAG at least allows us to use Assistants as a new form of user interface to existing services.
FlexiAI is a flexible RAG (the father of RAGs).
What? How is that possible, your repo is 3 weeks old?
Work and sleep and work... and I have to work more to build more mechanisms.
I tried to give users the power to bend the Multi Agent System using Assistant Instructions however they want, and each Agent can access the RAG.
I hope this project will help someone to build
The concept of RAG as a mechanism to ground the model and perform in-context learning is great. It's key to making the models useful. What doesn't really work are the current approaches associated with RAG that attempt to compress the retrieved data into the model's context window. Basically it's the chunking of text into the context window that's broken on so many different levels...
I view most of the current RAG techniques, including Graph RAG, as short-term hacks to try and work around the limited context window of models. If there wasn't a context length limit and issues like lost-in-the-middle were solved, would you really jump through all the hoops needed for Graph RAG? Nope...
The issues with context length and lost-in-the-middle will both be solved soon (I'm already using solutions to both in my work) and token costs are plummeting. I predict that most of the current RAG techniques won't even be a thing in a year. There are simply better ways to approach the problem of context.
Well, there is also the problem that once a model is trained, adding new information is kind of hard. So it is not only the context window but also the need to train it constantly on new data - basically in real time.
And preferably times three because of