How can I split text into paragraphs?

ddesyllas · November 28, 2024, 11:53am

An alternative approach is to use the Chat Completions API with some prompt engineering and a structured (JSON schema) response. To that end I wrote this Python module (named extract_paragraphs):

from openai import OpenAI
import json

def extractParagraphs(client: OpenAI, text: str):
    text = text.strip()

    if text == "":
        raise ValueError("String should not be an empty string")

    prompt = """
        You are a tool that splits incoming texts and messages into paragraphs and extracts any title from the text.
        Do not alter the incoming message; just output it as JSON with the paragraphs split.

        The text is coming from PDF and DOCX files, therefore omit any page numbers page headers and footers.


        The Json output should be the following:
        ```
        {
          "text_title":string,

          "paragraphs":[
            {
              "title":string,
              "paragraph":string
            }
          ]
        }
        ```

        * "text_title" is the title of the incoming text
        * "paragraphs" is an array with the split paragraphs. For each paragraph:
          * "title" is the paragraph title; if there is none, set it to an empty string
          * "paragraph" is the paragraph content

        Feel free to trim any excess or unwanted whitespace and multiple newlines, and do not pretty-print the JSON.
        Replace multiple tabs and spaces in the incoming text with a single space character.
        The output should be raw JSON that is NOT wrapped in Markdown markup.
    """

    response_format={
        "type":"json_schema",
        "json_schema":{
            "name": "paragraph_response",
            "strict": True,
            "schema": {
                "type": "object",
                "properties":{
                    "text_title":{
                        "type":"string"
                    },
                    "paragraphs":{
                        "type": "array",
                        "items": {
                            "type":"object",
                            "properties":{
                                "title":{ "type":"string"},
                                "paragraph":{"type":"string"}
                            },
                            "required": ["title", "paragraph"],
                            "additionalProperties": False
                        }
                    }
                },
                "required": ["text_title","paragraphs"],
                "additionalProperties": False
            }
        }
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
        response_format=response_format,
    )

    content = extractChatCompletionMessage(response)

    return json.loads(content)

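def extractChatCompletionMessage(response):
    # content can be None (for example if the model refuses), so fail with a clear error
    content = response.choices[0].message.content
    if content is None:
        raise ValueError("The model returned no content")
    return content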

The idea is to use a structured response with a fixed JSON schema, and to describe in the system message that I want the text split into paragraphs.
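Even with a fixed schema, it may be worth checking the parsed result before using it. Below is a minimal sketch of such a check; `validate_paragraph_response` is a hypothetical helper of mine, not part of the OpenAI library, and it only verifies the shape described in the prompt above.

```python
def validate_paragraph_response(data):
    """Return data unchanged if it matches the expected shape, else raise ValueError."""
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    if not isinstance(data.get("text_title"), str):
        raise ValueError('"text_title" must be a string')

    paragraphs = data.get("paragraphs")
    if not isinstance(paragraphs, list):
        raise ValueError('"paragraphs" must be a list')

    for index, item in enumerate(paragraphs):
        if not isinstance(item, dict):
            raise ValueError(f"paragraphs[{index}] must be an object")
        # Both keys are required by the schema and must be strings.
        for key in ("title", "paragraph"):
            if not isinstance(item.get(key), str):
                raise ValueError(f'paragraphs[{index}]["{key}"] must be a string')

    return data
```

It could be applied to the dictionary returned by `extractParagraphs` before the paragraphs are stored or processed further.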

The library can then be used like this:

from pypdf import PdfReader
from openai import OpenAI
from extract_paragraphs import extractParagraphs

def getTextFromPDF(fileName):
    text = ""
    reader = PdfReader(fileName)
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text

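path = "mypdf.pdf"

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

content = getTextFromPDF(path)
# extractParagraphs expects the client as its first argument
paragraphs = extractParagraphs(client, content)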

print(paragraphs)
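Since the prompt asks the model not to alter the text, one check I could run on the result is whether each returned paragraph still appears in the source once whitespace is normalized. This is only a rough sketch: `normalize_whitespace` and `find_altered_paragraphs` are hypothetical helpers, and a paragraph that spans a removed page number or header would be flagged even though the model did what it was asked, so flagged entries are candidates for manual review rather than proven errors.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def find_altered_paragraphs(source_text, result):
    """Return indexes of paragraphs that do not appear verbatim in the source."""
    source = normalize_whitespace(source_text)
    altered = []
    for index, item in enumerate(result["paragraphs"]):
        paragraph = normalize_whitespace(item["paragraph"])
        # Empty paragraphs are skipped; anything else must be a substring of the source.
        if paragraph and paragraph not in source:
            altered.append(index)
    return altered


# Illustrative data only, not real model output.
source_text = "Intro\n\nFirst   paragraph here.\n\nSecond paragraph."
result = {
    "text_title": "Intro",
    "paragraphs": [
        {"title": "", "paragraph": "First paragraph here."},
        {"title": "", "paragraph": "Second paragraf."},
    ],
}
print(find_altered_paragraphs(source_text, result))  # prints [1]
```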

Have you used this approach?
I would like to know about any known pitfalls compared to using a custom model.
