How can I split text into paragraphs?

An alternative approach is to use the chat completions API with some prompt engineering and a structured (JSON schema) response. I wrote this Python module (named extract_paragraphs) for it:

from openai import OpenAI
import json

def extractParagraphs(client: OpenAI, text: str):
    text = text.strip()

    if (text == ""):
        raise ValueError("Text should not be an empty string")

    prompt = """
        You are a tool that splits incoming texts and messages into paragraphs and extracts any title from the text.
        Do not alter the incoming message, just output it as JSON with the paragraphs split.

        The text comes from PDF and DOCX files, therefore omit any page numbers, page headers and footers.


        The JSON output should be the following:
        ```
        {
          "text_title":string,

          "paragraphs":[
            {
              "title":string,
              "paragraph":string
            }
          ]
        }
        ```

        * "text_title" is the title of the incoming text
        * "paragraphs" is an array of the split paragraphs; for each paragraph:
          * "title" is the paragraph title; if there is none, set it to an empty string
          * "paragraph" is the paragraph content

        Feel free to trim any excess or unwanted whitespace and multiple newlines, and do not pretty-print the JSON.
        Replace multiple tabs and spaces in the incoming text with a single space character.
        The output should be raw JSON that is NOT wrapped in markdown markup.
    """

    response_format={
        "type":"json_schema",
        "json_schema":{
            "name": "paragraph_response",
            "strict": True,
            "schema": {
                "type": "object",
                "properties":{
                    "text_title":{
                        "type":"string"
                    },
                    "paragraphs":{
                        "type": "array",
                        "items": {
                            "type":"object",
                            "properties":{
                                "title":{ "type":"string"},
                                "paragraph":{"type":"string"}
                            },
                            "required": ["title", "paragraph"],
                            "additionalProperties": False
                        }
                    }
                },
                "required": ["text_title","paragraphs"],
                "additionalProperties": False
            }
        }
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
        response_format=response_format,
    )

    content = extractChatCompletionMessage(response)

    return json.loads(content)

def extractChatCompletionMessage(response):
    return response.choices[0].message.content

The idea is to use a structured response with a fixed schema, and to describe in the system message that I want the text split into paragraphs.
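Even with a fixed schema, the reply can still be unusable, for example when it is cut off at the token limit or when the model refuses. Below is a minimal sketch of a stricter variant of extractChatCompletionMessage. The name `extract_checked_content` and the `fake` stand-in object are hypothetical, made up for illustration; `finish_reason` and `message.refusal` are fields of the chat completion response, and `refusal` is read with `getattr` on the assumption that it may be missing in older SDK versions.

```python
import json
from types import SimpleNamespace


def extract_checked_content(response):
    """Return the parsed JSON body, or raise if the reply is unusable."""
    choice = response.choices[0]
    # "length" means the token limit was hit, so the JSON is likely cut off.
    if choice.finish_reason == "length":
        raise RuntimeError("Response was truncated before the JSON was complete")
    # With structured outputs the model may refuse instead of answering.
    refusal = getattr(choice.message, "refusal", None)
    if refusal:
        raise RuntimeError(f"Model refused: {refusal}")
    if not choice.message.content:
        raise RuntimeError("Response has no content")
    return json.loads(choice.message.content)


# Stand-in object shaped like a chat completion, for illustration only.
fake = SimpleNamespace(choices=[SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(
        refusal=None,
        content='{"text_title": "T", "paragraphs": []}',
    ),
)])
result = extract_checked_content(fake)
```

Raising here, instead of passing a truncated string to json.loads, makes the failure easier to tell apart from a plain parsing bug.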

Then I can use it like this:

from pypdf import PdfReader
from openai import OpenAI
from extract_paragraphs import extractParagraphs

def getTextFromPDF(fileName):
    text = ""
    reader = PdfReader(fileName)
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text

path="mypdf.pdf"

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

content = getTextFromPDF(path)
paragraphs = extractParagraphs(client, content)

print(paragraphs)
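Since the prompt asks the model not to alter the text, but the schema cannot enforce that, the result could be checked locally against the extracted text. This is only a sketch: `normalize` and `find_altered_paragraphs` are hypothetical helper names, and the sample data is made up. It only verifies that each returned paragraph appears verbatim (ignoring whitespace) in the source; it says nothing about text the model dropped, and words hyphenated across lines in the PDF could produce false positives.

```python
import re


def normalize(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def find_altered_paragraphs(source_text, result):
    """Return the paragraphs whose text does not appear verbatim in the source."""
    haystack = normalize(source_text)
    return [
        item["paragraph"]
        for item in result["paragraphs"]
        if normalize(item["paragraph"]) not in haystack
    ]


# Made-up sample: the second paragraph was reworded by the model.
source = "Intro\n\nFirst   paragraph here.\n\nSecond paragraph."
sample_result = {
    "text_title": "Intro",
    "paragraphs": [
        {"title": "", "paragraph": "First paragraph here."},
        {"title": "", "paragraph": "Second paragraph, reworded."},
    ],
}
altered = find_altered_paragraphs(source, sample_result)
```

With the real data this would be called as `find_altered_paragraphs(content, paragraphs)`.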

Has anyone used this approach?
I would like to know about any known pitfalls compared to using a custom model.