An alternative approach is to use chat completions with some prompt engineering and a structured (JSON schema) response. To that end I wrote this Python module (named extract_paragraphs):
from openai import OpenAI
import json


def extractParagraphs(client: OpenAI, text: str):
    text = text.strip()
    if text == "":
        raise ValueError("String should not be an empty string")

    # System prompt describing the task and the expected JSON shape.
    prompt = """
    You are a tool that splits the incoming texts and messages into paragraphs and extracts any title from the text.
    Do not alter the incoming message, just output it as a json with split paragraphs.
    The text is coming from PDF and DOCX files, therefore omit any page numbers, page headers and footers.
    The Json output should be the following:
    ```
    {
        "text_title":string,
        "paragraphs":[
            {
                "title":string,
                "paragraph":string
            }
        ]
    }
    ```
    * "text_title" is the title of the incoming text
    * "paragraphs" is an array with the split paragraphs; for each paragraph:
        * "title" is the paragraph title, if there's none set it as an empty string
        * "paragraph" is the paragraph content
    Feel free to trim any excess or unwanted whitespaces and multiple newlines and do not pretty print the json.
    Replace multiple tabs and spaces in the incoming text with a single space character.
    The output should be raw json that is NOT wrapped in markdown markup.
    """

    # Structured output: the model must return JSON matching this schema.
    response_format = {
        "type": "json_schema",
        "json_schema": {
            "name": "paragraph_response",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "text_title": {
                        "type": "string"
                    },
                    "paragraphs": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "title": {"type": "string"},
                                "paragraph": {"type": "string"}
                            },
                            "required": ["title", "paragraph"],
                            "additionalProperties": False
                        }
                    }
                },
                "required": ["text_title", "paragraphs"],
                "additionalProperties": False
            }
        }
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text}
        ],
        response_format=response_format
    )

    content = extractChatCompletionMessage(response)
    return json.loads(content)


def extractChatCompletionMessage(response):
    content = response.choices[0].message.content
    # content can be None (for example when the model refuses), and
    # json.loads(None) would fail with a less helpful TypeError.
    if content is None:
        raise ValueError("The model returned no message content")
    return content

The idea is to use a structured response with a fixed schema, and to describe in the system message that I want the text split into paragraphs.
Then I could use it as:
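from pypdf import PdfReader
from openai import OpenAI
from extract_paragraphs import extractParagraphs


def getTextFromPDF(fileName):
    # Concatenate the extracted text of every page, one page per line break.
    text = ""
    reader = PdfReader(fileName)
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text


path = "mypdf.pdf"
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

content = getTextFromPDF(path)
# extractParagraphs expects the client as its first argument.
paragraphs = extractParagraphs(client, content)
print(paragraphs)
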
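Since the returned value is a plain dict in the shape of the schema above, the calling code could consume it roughly like this. This is only a minimal sketch: the helper name `summarizeParagraphs` and the sample data are made up for illustration and are not part of my library or of the OpenAI SDK, and the sample is hand-written, not real model output.

```python
def summarizeParagraphs(result: dict) -> list[str]:
    """Hypothetical helper: one line per paragraph, 'title: first 40 characters'."""
    # Guard against a payload that lacks the expected top-level keys.
    for key in ("text_title", "paragraphs"):
        if key not in result:
            raise KeyError(f"Missing expected key: {key}")

    lines = []
    for item in result["paragraphs"]:
        # An empty title means no heading was found for this paragraph.
        title = item["title"] or "(untitled)"
        lines.append(f"{title}: {item['paragraph'][:40]}")
    return lines


# Hand-written sample in the shape of the schema (an assumption, not model output).
sample = {
    "text_title": "Annual report",
    "paragraphs": [
        {"title": "Introduction", "paragraph": "This report covers the year."},
        {"title": "", "paragraph": "Sales grew in every region."},
    ],
}

for line in summarizeParagraphs(sample):
    print(line)
```

The key check is deliberately shallow; with `"strict": True` the schema should already be enforced on the API side, so this only guards against passing in something that did not come from `extractParagraphs`.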
Have you used this approach?
I would like to know about any known pitfalls compared to using a custom model.