Extracting long text from document

Hi there,

I am using ChatGPT (GPT-4 Turbo) to extract information from a legal contract. The input contract is in Japanese and is about 23k tokens long.

My goal is to extract the full content of any term in the contract that relates to a specific item. However, ChatGPT returns only part of the term's content.

For example, if the contract is formatted as follows:

...

Term 1: When this Agreement is terminated due to the expiration of the lease period, cancellation ...
    (1) Some sub-term ....
    (2) Some sub-term ....
    (3) Some sub-term ....
    (4) Some sub-term ....

Term 2: If there is a change in the person responsible ...
    (1) Some sub-term ....
    (2) Some sub-term ....
    (3) Some sub-term ....
    (4) Some sub-term ....

ChatGPT returns:

Term 1: When this Agreement is terminated due to the expiration of the lease period, cancellation ...
    (1) Some sub-term ....
    (2) Some sub-term ....

Sub-terms (3) and (4) are often missing from the extraction result. My prompt instructs ChatGPT to extract the whole content of the term, but the result is not as expected.

Can anybody help me with this? Thank you.

Can you share the actual prompt you are using?

Provided that the content related to a specific item or multiple items is actually within the output token limit of 4,096, you could try a couple of options:

  1. Ask it specifically to return the content verbatim
  2. Reinforce wording around extracting ALL of the associated sub-terms

There are a couple of caveats though:

  1. In practice, the number of output tokens that a GPT model returns is frequently well below the 4,096-token limit. A typical response is between 700 and 1,200 tokens. Depending on the prompt and task, you may get to 2,000 or higher, but often that can’t be achieved consistently.

  2. In my experience, tasks that involve returning output verbatim are particularly computation-heavy, and you often need to break them down into smaller chunks to achieve this effectively.
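One way to break the input into smaller chunks is sketched below. This is only an illustration under assumptions: the function name `chunk_paragraphs` and the 4,000-character budget are made up for this example, paragraphs are assumed to be separated by blank lines, and a character count stands in for a real token count.

```python
def chunk_paragraphs(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk stays at or below max_chars, except when a single
    paragraph is longer than max_chars; that paragraph is kept whole
    as its own oversized chunk rather than being cut mid-term.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = current + "\n\n" + para if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent in its own request. This only keeps a term intact if the term and its sub-terms form one paragraph, as in the example layout in the question.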

Personally, I would probably try to go for a hybrid approach whereby you use GPT-4 to help identify the relevant content and then use local code to perform the extraction of the actual content.
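A minimal sketch of that hybrid idea follows. It assumes the model is asked to return only the numbers of the relevant terms (a short output), and that every top-level term starts a line with "Term <number>:" as in the example layout. The names `TERM_HEADING`, `split_terms` and `extract_terms` are hypothetical, and a real Japanese contract would need a different heading pattern.

```python
import re

# Assumption: each top-level term starts a line with "Term <number>:".
TERM_HEADING = re.compile(r"^Term\s+(\d+):", re.MULTILINE)


def split_terms(contract: str) -> dict[str, str]:
    """Map each term number to its full text, including all sub-terms."""
    matches = list(TERM_HEADING.finditer(contract))
    terms: dict[str, str] = {}
    for i, match in enumerate(matches):
        # A term runs until the next heading, or to the end of the contract.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(contract)
        terms[match.group(1)] = contract[match.start():end].rstrip()
    return terms


def extract_terms(contract: str, term_ids: list[str]) -> list[str]:
    """Return the verbatim text of the terms the model identified."""
    terms = split_terms(contract)
    return [terms[term_id] for term_id in term_ids if term_id in terms]
```

Because the text is copied from the source by local code, sub-terms cannot be dropped or paraphrased; the model only has to get the term numbers right.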

Thank you for your answer.

I am using LangChain as the framework for the extraction job, with JSON-formatted output. My work is for a commercial product, so I cannot share the exact prompt, but it looks something like this:

You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
"""
{"description": "Type4 contains the question that extract the terms/clauses", "properties": {"Other expenses": {"title": "Other Expenses", "description": "The term related to Expenses, other expenses, miscellaneous expenses, consumption tax, expenses to be borne by Party B, rent and facility planning and operation expenses, use of services. The result MUST include the full content of the term, including all its inner sub-terms, clauses and sections.", "type": "array", "items": {"type": "string"}}}}
"""
### <THE_INPUT_CONTRACT_HERE>

“Personally, I would probably try to go for a hybrid approach whereby you use GPT-4 to help identify the relevant content and then use local code to perform the extraction of the actual content.” → Do you think a technique like RAG can help in this case?

Thanks for sharing. However, based on the information provided, I am not yet sure I sufficiently understand the logic of your prompt and/or what it is you are asking the model to do, which limits my ability to provide meaningful guidance.

As a general suggestion, you might want to consolidate your instructions in one place in the prompt, e.g. at the top rather than having them in multiple places. The order could be: Instructions, JSON Schema, Input contract, each with sub-headers so that these different inputs are more clearly delineated.

In your instructions, be specific about what you want the model to do, and as part of that emphasize that it needs to return all relevant sub-terms verbatim.
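As a rough illustration of that layout, the prompt could be assembled like this. The function name `build_prompt`, the header text and the instruction wording are all placeholders for this sketch, not a tested prompt.

```python
import json


def build_prompt(schema: dict, contract: str) -> str:
    """Assemble the prompt as Instructions, JSON Schema, Input contract."""
    # Placeholder wording: all instructions live in one place, at the top.
    instructions = (
        "Extract every term related to the requested item from the contract. "
        "Return each matching term verbatim and in full, including ALL of its "
        "sub-terms, clauses and sections. "
        "If no term matches an attribute, return null for that attribute. "
        "Format the output as a JSON instance that conforms to the schema."
    )
    sections = [
        "### Instructions\n" + instructions,
        # ensure_ascii=False keeps any Japanese text in the schema readable.
        "### JSON Schema\n" + json.dumps(schema, ensure_ascii=False),
        "### Input contract\n" + contract,
    ]
    return "\n\n".join(sections)
```

With this structure the schema descriptions only need to say what each field is, since the "full content, verbatim" requirement is stated once in the instructions.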

To your other question: RAG could be an option - but here again I don’t think I have sufficient information to provide meaningful input. Sorry!