Using gpt-4 API to Semantically Chunk Documents

jr.2509 · April 14, 2024, 6:55am

Ok, so reporting back from some of my own testing over the past few hours. I started with a version of this approach from above using gpt-4-turbo-2024-04-09.

After applying it to my documents, I realized two things: (1) the split was not yet granular enough for my purposes; (2) the model failed to correctly identify the last words of a chunk in more than 50% of cases.

This led me to a series of refinements. In my latest iterations I used the following prompt along with the example JSON schema:

Prompt

{
    "model": "gpt-4-turbo-2024-04-09",
    "response_format": { "type": "json_object" },
    "messages": [
        {
            "role": "system",
            "content": "You are tasked with performing a document analysis for the purpose of breaking it down into its logical units. You assume that each document consists of multiple sections with each section in turn consisting of sub-sections, which are further broken down into logical units. Logical units represent the smallest hierarchical unit in a document and consists of multiple sentences that are logically interlinked and convey an information or idea. Each section must be analyzed for its structure individually and may consist of one or multiple sub-sections and logical units. Your output consists of a JSON of the document's outline that follows the logic of the JSON schema provided. You must strictly consider all of the document's content in your analysis. JSON schema: '''JSON_Schema_'''"
        },
        {
            "role": "user",
            "content": "Document_Text"
        }
    ],
    "temperature": 0,
    "max_tokens": 4000,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0
}
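For anyone who wants to reproduce this, here is a minimal Python sketch of how the two placeholders (JSON_Schema_ and Document_Text) could be filled in before sending. The function name build_request and its arguments are my own illustration, not part of the original setup; the parameter values are simply copied from the request above.

```python
import json


def build_request(system_template: str, schema: dict, document_text: str) -> dict:
    """Fill the JSON_Schema_ placeholder and attach the document as the user message.

    `system_template` is assumed to be the system prompt text containing the
    literal placeholder "JSON_Schema_" (hypothetical helper, for illustration).
    """
    system_content = system_template.replace(
        "JSON_Schema_", json.dumps(schema, indent=2)
    )
    return {
        "model": "gpt-4-turbo-2024-04-09",
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": document_text},
        ],
        "temperature": 0,
        "max_tokens": 4000,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }


if __name__ == "__main__":
    payload = build_request(
        "Return an outline. JSON schema: '''JSON_Schema_'''",
        {"document_outline": {}},
        "Some document text.",
    )
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized and posted to the Chat Completions endpoint with whatever HTTP client or SDK you already use.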

JSON Schema (not exhaustive)

{
    "document_outline": {
        "title": "document title",
        "sections": {
            "section_1": {
                "section_title": "section title (verbatim)",
                "sub_sections": {
                    "sub_section_1": {
                        "sub_section_title": "title of sub-section (verbatim, N/A in case of no title)",
                        "logical_units": {
                            "logical_unit_1": "first five words in the logical unit (verbatim)",
                            "logical_unit_2": "first five words in the logical unit (verbatim)",
                            "logical_unit_3": "first five words in the logical unit (verbatim)"
                        }
                    },
                    "sub_section_2": {
                        "sub_section_title": "title of sub-section (verbatim, N/A in case of no title)",
                        "logical_units": {
                            "logical_unit_1": "first five words in the logical unit (verbatim)",
                            "logical_unit_2": "first five words in the logical unit (verbatim)",
                            "logical_unit_3": "first five words in the logical unit (verbatim)"
                        }
                    }
                }
            },
            "section_2": {
                "section_title": "section title (verbatim)",
                "sub_sections": {
                    "sub_section_1": {
                        "sub_section_title": "title of sub-section (verbatim, N/A in case of no title)",
                        "logical_units": {
                            "logical_unit_1": "first five words in the logical unit (verbatim)",
                            "logical_unit_2": "first five words in the logical unit (verbatim)",
                            "logical_unit_3": "first five words in the logical unit (verbatim)"
                        }
                    },
                    "sub_section_2": {
                        "sub_section_title": "title of sub-section (verbatim, N/A in case of no title)",
                        "logical_units": {
                            "logical_unit_1": "first five words in the logical unit (verbatim)",
                            "logical_unit_2": "first five words in the logical unit (verbatim)",
                            "logical_unit_3": "first five words in the logical unit (verbatim)"
                        }
                    }
                }
            }
        }
    }
}
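To actually cut the document into chunks from such an outline, one option is to look up each "first five words" marker in the source text and slice from one marker to the next. The following is a rough sketch under a few assumptions: the model returns document_outline as a nested object with title and sections keys, the markers really are verbatim, and the function names (iter_unit_markers, split_by_markers) are made up for illustration.

```python
def iter_unit_markers(outline: dict):
    """Yield (section_title, sub_section_title, marker) in document order."""
    for section in outline["document_outline"]["sections"].values():
        for sub in section["sub_sections"].values():
            for marker in sub["logical_units"].values():
                yield section["section_title"], sub["sub_section_title"], marker


def split_by_markers(text: str, outline: dict):
    """Slice `text` at each marker; return (chunks, markers_not_found)."""
    found, missing = [], []
    cursor = 0
    for sec, sub, marker in iter_unit_markers(outline):
        # Search forward only, so repeated phrases map to later occurrences.
        pos = text.find(marker, cursor)
        if pos == -1:
            missing.append(marker)  # marker was not verbatim; needs review
            continue
        found.append((pos, sec, sub))
        cursor = pos + len(marker)

    chunks = []
    for i, (start, sec, sub) in enumerate(found):
        # Each chunk runs until the next located marker (or end of document).
        end = found[i + 1][0] if i + 1 < len(found) else len(text)
        chunks.append(
            {"section": sec, "sub_section": sub, "text": text[start:end].strip()}
        )
    return chunks, missing
```

Two caveats: any heading that sits between two markers ends up at the tail of the preceding chunk, and a non-empty list of missing markers is a quick signal that the model paraphrased instead of quoting.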

Based on a few tests (a short 8-page document and a 28-page document), this has so far yielded the most precise breakdown into logical units for my purposes.

That said, it is not yet the final solution, and more work is required for the analysis to be exhaustive and more accurate. Also, because of the example JSON schema, the model is now overly focused on returning three logical units (despite the system instructions), so I need to account for that.
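One simple way to quantify the three-unit bias is to count how many logical units each sub-section in a returned outline contains. This is only a diagnostic sketch: it assumes the same nested outline structure as the example schema, and unit_count_distribution is a hypothetical helper name.

```python
from collections import Counter


def unit_count_distribution(outline: dict) -> Counter:
    """Map 'number of logical units' -> 'how many sub-sections have that many'."""
    counts = Counter()
    for section in outline["document_outline"]["sections"].values():
        for sub in section["sub_sections"].values():
            counts[len(sub["logical_units"])] += 1
    return counts
```

If nearly every sub-section lands on exactly 3 across documents of different lengths, that would suggest the model is copying the shape of the example rather than following the text.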

I have a few things running in parallel but hopefully I can further advance the testing in the coming days.
