Using OpenAI API to parse legal documents

I am currently using the OpenAI API to help retrieve some key information from a legal document and return it as a JSON file. The Python script I am running preprocesses the legal document by extracting the PDF as text and feeds that text, along with a prompt from a separate text file, to the API. The issue I am encountering is that, although I provide explicit instructions in the prompt to review the entire extracted text for the information required, the LLM only reads and provides information for the first two pages. Any help and advice on how to get the LLM to read and provide the required information from all the pages I provide would be greatly appreciated. Thanks!


Welcome to the Forum!

Without knowing what type of information you are trying to extract, it’s a bit tricky to give a definite answer. That said, as a starting point I would check whether a RAG-based approach might work for your use case. If you have very specific information needs whereby the information you are looking for sits in clearly defined places in the document, then this should work.

However, if your use case is more complex than that, it would be helpful if you could share more information before we continue exploring other options.
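For reference, here is a minimal retrieval-style sketch of what I mean, assuming the PDF text has already been extracted to a string. The chunk size, embedding model, and helper names are illustrative assumptions, not a recommendation:

```python
# Minimal RAG-style sketch (illustrative only): embed chunks of the extracted
# text, then send only the chunks most similar to the query to the chat model.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_text(text: str, chunk_size: int = 2000) -> list:
    # Naive fixed-size character chunks; a real splitter would respect sections.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(texts: list) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_chunks(text: str, query: str, k: int = 5) -> list:
    chunks = chunk_text(text)
    chunk_vecs = embed(chunks)
    query_vec = embed([query])[0]
    # Cosine similarity between the query and every chunk
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best = sims.argsort()[::-1][:k]
    return [chunks[i] for i in best]
```

The top-ranked chunks would then be passed to the chat model together with the extraction prompt, instead of the full document.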


@jr.2509 might "max_tokens" need adjustment when passing the text string to the LLM?

Unless the user has set a very restrictive max_tokens value, this should normally not have a material bearing. The max_tokens value only impacts the output tokens. It has no impact on the input tokens.
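To illustrate, a minimal sketch (the `client` and `long_document_text` names here are placeholders): max_tokens only caps the completion, so even a small value would not cause the model to ignore later pages of the input.

```python
# max_tokens caps the completion length only; the full input is still read.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Summarize the document."},
        {"role": "user", "content": long_document_text},  # limited by the context window, not max_tokens
    ],
    max_tokens=500,  # at most ~500 tokens of output; the input above is unaffected
)
```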


Appreciate the offer to help here! Here is the full code that I am currently using and running from a python notebook:

Install Libraries

%pip install pdfminer.six
%pip install matplotlib
%pip install openai
%pip install PyPDF2
%pip install tenacity
%pip install PyMuPDF

Import Libraries

import os
import re
import json
import openai
from pathlib import Path
from openai import OpenAI
from PyPDF2 import PdfReader
from tenacity import retry, wait_random_exponential, stop_after_attempt
import fitz # PyMuPDF

Connect to OpenAI

model_ID = 'gpt-4o'

client = OpenAI(api_key='')

Extract Metadata

def read_prompt(prompt_path: str):
    """
    Read the prompt for CLO indenture parsing from a text file.
    """
    with open(prompt_path, "r") as f:
        return f.read()

def extract_text_from_pdf(pdf_path: str):
    """
    Extract text content from a PDF file using the PyMuPDF library.
    """
    # Open the PDF file
    document = fitz.open(pdf_path)
    text = []

    # Iterate through each page
    for page_num in range(len(document)):
        page = document.load_page(page_num)
        text.append(page.get_text())

    # Close the document
    document.close()

    return "\n".join(text)

Document Parsing

@retry(wait=wait_random_exponential(min=1, max=120), stop=stop_after_attempt(10))
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

def extract_metadata(content: str, prompt_path: str, model_id: str):
    """
    Use GPT model to extract metadata from the Indenture based on the given prompt.
    """
    # Read the prompt
    prompt_data = read_prompt(prompt_path)

    try:
        response = completion_with_backoff(
            model=model_id,
            messages=[
                {"role": "system", "content": prompt_data},
                {"role": "user", "content": content}
            ],
            temperature=0.2,
        )

        response_content = response.choices[0].message.content
        if not response_content:
            print("Empty response from the model")
            return {}

        # Remove any markdown code block indicators
        response_content = re.sub(r'```json\s*', '', response_content)
        response_content = re.sub(r'\s*```', '', response_content)

        # Attempt to parse JSON
        try:
            return json.loads(response_content)
        except json.JSONDecodeError as e:
            print(f"Failed to parse JSON: {e}")
            print(f"Raw response: {response_content}")

            # Attempt to extract JSON from the response
            match = re.search(r'\{.*\}', response_content, re.DOTALL)
            if match:
                try:
                    return json.loads(match.group(0))
                except json.JSONDecodeError as jde:
                    print(f"Failed to extract valid JSON from the response: {jde}")

            return {}

    except Exception as e:
        print(f"Error calling OpenAI API: {e}")
        return {}

def process_indenture(pdf_path: str, prompt: str,
                      output_folder: str, model_id: str):
    """
    Process a single Indenture through the entire pipeline.
    """
    print(f"Processing CLO Indenture: {pdf_path}")

    try:
        # Step 1: Extract text content from the PDF
        content = extract_text_from_pdf(pdf_path)
        print(f"Extracted text content from PDF: {pdf_path}")

        # Step 2: Extract metadata using GPT model
        metadata = extract_metadata(content, prompt, model_id)
        if not metadata:
            print(f"Failed to extract metadata for {pdf_path}")
            return
        print(f"Extracted metadata using {model_id} for {pdf_path}")

        # Step 3: Save the result as a JSON file
        output_filename = Path(pdf_path).stem + '.json'
        output_path = os.path.join(output_folder, output_filename)

        with open(output_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved metadata to {output_path}")

    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

Example for a single document

pdf_path = r"C:\Definitions Sections\CLO_VII_Indenture_Cover_and_Definitions_Pages.pdf"
prompt_path = r"C:\Parse Indentures to JSON Prompt.txt"
output_folder = r"C:\Parsed Definitions\JSON files"

process_indenture(pdf_path, prompt_path, output_folder, model_ID)

Here is the associated prompt:

CLO Indenture Document:


{document}


You are an expert in analyzing Indentures for Collateralized Loan Obligations (CLOs). Please carefully read the provided Indenture above and extract the following key information:

Extract these six (6) properties from the Indenture:

  • Issuer: The full name of the Issuer in the document

  • AgreementDate: The as of date the document is dated

  • Trustee: The full name of the Trustee in the document

  • Definitions: A list of dictionaries, where each dictionary contains the following keys for each definition:

      • DefinitionTerm: The full term of the definition

      • DefinitionWording: The full wording of the definition

Guidelines:

  • Manually locate each definition on every page of the document, from terms A through Z.

  • Manually parse each and every single definition term and definition wording completely.

  • The extracted information should be factual and accurate to the document.

  • The extracted entities should be self-contained and easily understood without the rest of the agreement.

  • If any property is missing from the document, please leave the field empty rather than guessing.

Answer in JSON format. The JSON should contain 4 keys: "Issuer", "AgreementDate", "Trustee", and "Definitions". The "Definitions" value should be a list of dictionaries as described above.


Thank you for sharing.

As a starting point, I have focused my attention on your prompt, which I believe is the source of at least some of the challenges you are experiencing.

The dominant issue is that you are asking the model to perform too many tasks at the same time. Specifically, the models are not “wired” to return exhaustive responses for tasks that involve extracting information from every single page of a document.

If I were in your place, I would separate the extraction of the issuer, agreement date and trustee from the extraction of the definitions. The former should be pretty straightforward. The latter, in my view, requires a more bespoke approach whereby you chunk the contents of the document into smaller parts and then iteratively extract the definitions, including the term and wording, from the individual chunks. As part of that I would also make clearer, by way of a few examples (i.e. few-shot prompting), what constitutes a definition in your case, including edge cases.

While this may initially seem like a less efficient approach, given that you only submit one document chunk along with the relatively short prompt at a time, the incremental cost should be minimal while your results should markedly improve.
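If it is useful, here is a minimal sketch of that chunked approach, reusing the completion_with_backoff helper from the code earlier in the thread. The chunk size, overlap, and prompt wording are assumptions to adapt, not a drop-in replacement for your pipeline:

```python
# Sketch of chunked definition extraction: split the extracted text into
# overlapping chunks, extract definitions from each chunk separately, then
# merge and de-duplicate the results.
import json
import re

CHUNK_SIZE = 8000   # characters per chunk (assumption; tune to your documents)
OVERLAP = 500       # overlap so a definition split across a boundary is not lost

DEFINITIONS_PROMPT = (
    "You are analyzing an excerpt of a CLO Indenture. Extract every defined "
    "term in this excerpt as a JSON list of objects with the keys "
    "'DefinitionTerm' and 'DefinitionWording'. If the excerpt contains no "
    "definitions, return an empty list []."
)

def chunk_text(text: str) -> list:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + CHUNK_SIZE])
        start += CHUNK_SIZE - OVERLAP
    return chunks

def extract_definitions(content: str, model_id: str) -> list:
    all_definitions, seen_terms = [], set()
    for chunk in chunk_text(content):
        response = completion_with_backoff(
            model=model_id,
            messages=[
                {"role": "system", "content": DEFINITIONS_PROMPT},
                {"role": "user", "content": chunk},
            ],
            temperature=0.2,
        )
        raw = response.choices[0].message.content or "[]"
        raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
        try:
            definitions = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip chunks that did not come back as valid JSON
        for d in definitions:
            term = str(d.get("DefinitionTerm", "")).strip()
            if term and term not in seen_terms:  # de-duplicate across overlapping chunks
                seen_terms.add(term)
                all_definitions.append(d)
    return all_definitions
```

The merged list could then be combined with the separately extracted Issuer, AgreementDate and Trustee fields before writing the JSON file.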


Thanks jr! I will give these tips a try. With the current code the LLM is successfully fetching all of the fields requested, including Issuer, Trustee, Date, and Definitions, and returning a JSON file. So it seems to be taking in all of the directions in the prompt, but it’s stopping after the first two pages of the document. The whole document is about 80,000 tokens, which is well within the context window of GPT-4 Turbo.
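As a quick sanity check on that figure, here is a sketch of how the token count of the extracted text could be measured with tiktoken, assuming a recent tiktoken release that includes the o200k_base encoding used by gpt-4o:

```python
# Rough token count of the extracted text (pip install tiktoken).
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o
content = extract_text_from_pdf(pdf_path)
print(f"Extracted text is ~{len(encoding.encode(content))} tokens")
```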


Sounds good. For these types of tasks it is less about the input being within the limits of the context window and more about the fact that the model does not systematically go through all of the content. At this point, it is not something we can directly address through prompting. Chunking remains one of the most straightforward ways to “force” the model to systematically consider all content. In any case, I hope the tests work out!


UPDATE

To test our theory that we have been overwhelming the LLM with our original approach, I experimented with three different scenarios. All three scenarios used an adjusted prompt that only asked for the Definitions to be returned and none of the previously requested data points:

Scenario 1 - New scaled-down prompt + original document = 3 pages parsed (1 more than the original scenario)

Scenario 2 - New scaled-down prompt + only 5 pages of definitions (no cover) = 5 pages parsed (all provided definitions successfully captured)

Scenario 3 - New scaled-down prompt + all definitions with no cover page = 3 pages parsed (same success rate as Scenario 1)

Scenario 2 did 2x better than the other two scenarios.

We have that feature at https://www.lawxer.ai. Would you like a quick demo? The solution will also be available as a stand-alone API, ready to be integrated into third-party platforms.
