I am currently using the OpenAI API to retrieve some key information from a legal document and return it in a JSON file. The Python script I am running preprocesses the legal document to extract the PDF as text and feeds that text, along with a prompt from a separate text file, to the API. The issue I am encountering is that although I provide explicit instructions in the prompt to review the entire extracted text for the required information, the LLM only reads and provides information from the first two pages. Any help or advice on how to get the LLM to read and provide the required information from all the pages I give it would be greatly appreciated. Thanks!
Welcome to the Forum!
Without knowing what type of information you are trying to extract, it's a bit tricky to give a definitive answer. That said, as a starting point I would check whether a RAG-based approach might work for your use case. If you have very specific information needs whereby the information you are looking for sits in clearly defined places in the document, then this should work.
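To illustrate the idea, here is a minimal sketch of such a retrieval step, assuming the document text has already been split into a `chunks` list; the helper names and the choice of text-embedding-3-small are my own assumptions, not a prescription:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Embed a batch of texts with the OpenAI embeddings endpoint
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    # Rank document chunks by cosine similarity to the query
    chunk_vecs = embed(chunks)
    query_vec = embed([query])[0]
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in sims.argsort()[::-1][:k]]
```

Only the top-ranked chunks, rather than the full document, are then passed to the chat model together with the extraction prompt, e.g. `top_k_chunks("Who is the Trustee?", chunks)`.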
However, if your use case is more complex than that, then it would be helpful if you could first share more information before continuing the exploration of potential other options.
@jr.2509 might "max_tokens" need adjustment when passing the text string to the LLM?
Unless the user has set a very restrictive max_tokens value, this should normally not have a material bearing. The max_tokens parameter only caps the output tokens; it has no impact on the input tokens.
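For illustration, a minimal call (the `long_document_text` variable is a placeholder): max_tokens caps only how long the completion may be, while the input is always read in full up to the context window:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": long_document_text}],  # input size unaffected by max_tokens
    max_tokens=4096,  # caps the completion length only
)
```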
Appreciate the offer to help here! Here is the full code that I am currently using and running from a Python notebook:
Install Libraries
```python
%pip install pdfminer.six
%pip install matplotlib
%pip install openai
%pip install PyPDF2
%pip install tenacity
%pip install PyMuPDF
```
Import Libraries
```python
import os
import re
import json
import openai
from pathlib import Path
from openai import OpenAI
from PyPDF2 import PdfReader
from tenacity import retry, wait_random_exponential, stop_after_attempt
import fitz  # PyMuPDF
```
Connect to OpenAI
```python
model_ID = 'gpt-4o'
client = OpenAI(api_key='')
```
Extract Metadata
```python
def read_prompt(prompt_path: str):
    """
    Read the prompt for CLO indenture parsing from a text file.
    """
    with open(prompt_path, "r") as f:
        return f.read()


def extract_text_from_pdf(pdf_path: str):
    """
    Extract text content from a PDF file using the PyMuPDF library.
    """
    # Open the PDF file
    document = fitz.open(pdf_path)
    text = []
    # Iterate through each page
    for page_num in range(len(document)):
        page = document.load_page(page_num)
        text.append(page.get_text())
    # Close the document
    document.close()
    return "\n".join(text)
```
Document Parsing
````python
@retry(wait=wait_random_exponential(min=1, max=120), stop=stop_after_attempt(10))
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)


def extract_metadata(content: str, prompt_path: str, model_id: str):
    """
    Use GPT model to extract metadata from the Indenture based on the given prompt.
    """
    # Read the prompt
    prompt_data = read_prompt(prompt_path)
    try:
        response = completion_with_backoff(
            model=model_id,
            messages=[
                {"role": "system", "content": prompt_data},
                {"role": "user", "content": content}
            ],
            temperature=0.2,
        )
        response_content = response.choices[0].message.content
        if not response_content:
            print("Empty response from the model")
            return {}
        # Remove any markdown code block indicators
        response_content = re.sub(r'```json\s*', '', response_content)
        response_content = re.sub(r'\s*```', '', response_content)
        # Attempt to parse JSON
        try:
            return json.loads(response_content)
        except json.JSONDecodeError as e:
            print(f"Failed to parse JSON: {e}")
            print(f"Raw response: {response_content}")
            # Attempt to extract JSON from the response
            match = re.search(r'\{.*\}', response_content, re.DOTALL)
            if match:
                try:
                    return json.loads(match.group(0))
                except json.JSONDecodeError as jde:
                    print(f"Failed to extract valid JSON from the response: {jde}")
            return {}
    except Exception as e:
        print(f"Error calling OpenAI API: {e}")
        return {}


def process_indenture(pdf_path: str, prompt_path: str,
                      output_folder: str, model_id: str):
    """
    Process a single Indenture through the entire pipeline.
    """
    print(f"Processing CLO Indenture: {pdf_path}")
    try:
        # Step 1: Extract text content from the PDF
        content = extract_text_from_pdf(pdf_path)
        print(f"Extracted text content from PDF: {pdf_path}")
        # Step 2: Extract metadata using GPT model
        metadata = extract_metadata(content, prompt_path, model_id)
        if not metadata:
            print(f"Failed to extract metadata for {pdf_path}")
            return
        print(f"Extracted metadata using {model_id} for {pdf_path}")
        # Step 3: Save the result as a JSON file
        output_filename = Path(pdf_path).stem + '.json'
        output_path = os.path.join(output_folder, output_filename)
        with open(output_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved metadata to {output_path}")
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")
````
Example for a single document
```python
pdf_path = r"C:\Definitions Sections\CLO_VII_Indenture_Cover_and_Definitions_Pages.pdf"
prompt_path = r"C:\Parse Indentures to JSON Prompt.txt"
output_folder = r"C:\Parsed Definitions\JSON files"

process_indenture(pdf_path, prompt_path, output_folder, model_ID)
```
Here is the associated prompt:
CLO Indenture Document:
{document}

You are an expert in analyzing Indentures for Collateralized Loan Obligations (CLOs). Please carefully read the provided Indenture above and extract the following key information.

Extract these six (6) properties from the Indenture:

- Issuer: The full name of the Issuer in the document
- AgreementDate: The "as of" date the document is dated
- Trustee: The full name of the Trustee in the document
- Definitions: A list of dictionaries, where each dictionary contains the following keys for each definition:
  - DefinitionTerm: The full term of the definition
  - DefinitionWording: The full wording of the definition

Guidelines:

- Manually locate each definition on every page of the document, from terms A through Z.
- Manually parse each and every definition term and definition wording completely.
- The extracted information should be factual and accurate to the document.
- The extracted entities should be self-contained and easily understood without the rest of the agreement.
- If any property is missing from the document, please leave the field empty rather than guessing.

Answer in JSON format. The JSON should contain 4 keys: "Issuer", "AgreementDate", "Trustee", and "Definitions". The "Definitions" value should be a list of dictionaries as described above.
Thank you for sharing.
As a starting point, I have focused my attention on your prompt, which I believe is the source of at least some of the challenges you are experiencing.
The dominant issue is that you are asking the model to perform too many tasks at the same time. Specifically, the models are not "wired" to return exhaustive responses for tasks that involve extracting information from every single page of a document.
If I were in your place, I would separate the extraction of the issuer, agreement date, and trustee from the extraction of the definitions. The former should be pretty straightforward. The latter, in my view, requires a more bespoke approach whereby you chunk the contents of the document into smaller parts and then iteratively extract the definitions, including the term and wording, from the individual chunks. As part of that, I would also make clearer by way of a few examples (i.e. few-shot prompting) what constitutes a definition in your case, including edge cases.
While this may initially seem like a less efficient approach, given that you only submit one document chunk along with a relatively short prompt at a time, the incremental cost should be minimal while your results should improve markedly.
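To make this more concrete, here is a minimal sketch of the chunked extraction loop, reusing the `completion_with_backoff` helper from your notebook; the chunk size, the system prompt wording, and the use of response_format to force JSON output are assumptions you would want to tune:

```python
def chunk_text(text: str, max_chars: int = 12000) -> list[str]:
    # Split the document into chunks on paragraph boundaries
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if len(current) + len(paragraph) > max_chars and current:
            chunks.append(current)
            current = ""
        current += paragraph + "\n\n"
    if current:
        chunks.append(current)
    return chunks

def extract_definitions(text: str, model_id: str) -> list[dict]:
    system_prompt = (
        "Extract every defined term from the excerpt below. "
        'Return JSON: {"Definitions": [{"DefinitionTerm": ..., '
        '"DefinitionWording": ...}]}. Return an empty list if there are none.'
    )
    definitions = []
    for chunk in chunk_text(text):
        response = completion_with_backoff(
            model=model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": chunk},
            ],
            temperature=0.2,
            response_format={"type": "json_object"},
        )
        definitions.extend(
            json.loads(response.choices[0].message.content).get("Definitions", [])
        )
    return definitions
```

Definitions that straddle a chunk boundary can be caught by overlapping consecutive chunks slightly and de-duplicating the results on DefinitionTerm.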
Thanks jr! I will give these tips a try. With the current code the LLM is successfully fetching all of the requested fields, including Issuer, Trustee, Date, and Definitions, and returning a JSON file. So it seems to be taking in all of the directions in the prompt, but it's stopping after the first two pages of the document. The whole document is about 80,000 tokens, which is well within the context window of GPT-4 Turbo.
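(For reference, I estimated the token count with a quick tiktoken sketch; `encoding_for_model` in recent tiktoken versions picks the right encoding per model:)

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")  # selects the encoding for the model
print(len(encoding.encode(content)))  # ~80,000 tokens for the full document
```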
Sounds good. For these types of tasks it is less about the input being within the limits of the context window and more about the fact that the model does not systematically go through all of the content. At this point, this is not something we can directly address through prompting. Chunking remains one of the most straightforward ways to "force" the model to systematically consider all content. In any case, I hope the tests work out!
UPDATE
To test our theory that we have been overwhelming the LLM with our original approach, I experimented with three different scenarios. All three scenarios used an adjusted prompt that asked only for the Definitions to be returned and none of the previously requested data points:
Scenario 1 - New scaled-down prompt + original document = 3 pages parsed (1 more than the original setup)
Scenario 2 - New scaled-down prompt + only 5 pages of definitions (no cover) = 5 pages parsed (all provided definitions successfully captured)
Scenario 3 - New scaled-down prompt + all definitions with no cover page = 3 pages parsed (same success rate as Scenario 1)
Scenario 2 did roughly 2x better than the other two scenarios.
We have that feature at https://www.lawxer.ai. Would you like a quick demo? The solution will also be available as a stand-alone API, ready to be integrated into third-party platforms.