Skip Nullable Fields When Using Function Calling For Text Extraction

I’m using a very large response model with many nullable fields, and I want the model to skip generating the ones that are null. I listed the mandatory fields under “required”, and this correctly skips the nullable fields when returning a single object. The issue arises when I try to return a list of those objects: the model no longer respects “required” and returns every field, whether it’s null or not. The problem is not only the cost of generating all of the unnecessary null fields but also the time it takes to generate them.

When using gpt-3.5-turbo-16k-0613, 90% of the time it explicitly returns all the null values like so:

"{\n  \"tickets\": [\n    {\n      \"ticketId\": \"1\",\n      \"offenderName\": \"{redacted}\",\n      \"lawEnforcementAgency\": \"Alameda County\",\n      \"violationDate\": \"03032018\",\n      \"docketNumber\": \"WWM0001451534\",\n      \"primaryViolationDescription\": \"CVC 216555b I\",\n      \"caseNumber\": \"\",\n      \"violationIds\": [],\n      \"totalFineAmount\": \"\",\n      \"dueDate\": \"\",\n      \"status\": \"\",\n      \"disposition\": [],\n      \"ocPayNumber\": \"\",\n      \"pleaDate\": \"\",\n      \"lawEnforcementOfficer\": \"\",\n      \"retainedAttorney\": \"\",\n      \"sentenceDate\": \"\",\n      \"bail\": \"\",\n      \"bonds\": \"\",\n      \"caseReport\": \"\",\n      \"nextCourtDate\": \"\",\n      \"agency\": \"\",\n      \"drNumber\": \"\",\n      \"arrestDate\": \"\",\n      \"charge\": \"\",\n      \"custodyStatus\": \"\",\n      \"citationFilingType\": \"\",\n      \"citationFilingDate\": \"\",\n      \"orderedBail\": \"\",\n      \"postedBail\": \"\",\n      \"nextAction\": \"\",\n      \"warrantType\": \"\",\n      \"probationType\": \"\",\n      \"sentenceConvictedDate\": \"\",\n      \"fineAndPenalty\": \"\",\n      \"restitutionFine\": \"\",\n      \"chargeSeverity\": \"\",\n      \"chargeDescription\": \"\",\n      \"probationStatus\": \"\",\n      \"relatedCases\": \"\",\n      \"otherCases\": [],\n      \"actions\": [],\n      \"fineInformation\": []\n    },\n    {\n      \"ticketId\": \"2\",\n      \"offenderName\": \"{redacted}\",\n      \"lawEnforcementAgency\": \"Alameda County\",\n      \"violationDate\": \"12152022\",\n      \"docketNumber\": \"FHJ0002159428\",\n      \"primaryViolationDescription\": \"CVC 22349a I\",\n      \"caseNumber\": \"\",\n      \"violationIds\": [],\n      \"totalFineAmount\": \"\",\n      \"dueDate\": \"\",\n      \"status\": \"\",\n      \"disposition\": [],\n      \"ocPayNumber\": \"\",\n      \"pleaDate\": \"\",\n      \"lawEnforcementOfficer\": \"\",\n      \"retainedAttorney\": 
\"\",\n      \"sentenceDate\": \"\",\n      \"bail\": \"\",\n      \"bonds\": \"\",\n      \"caseReport\": \"\",\n      \"nextCourtDate\": \"\",\n      \"agency\": \"\",\n      \"drNumber\": \"\",\n      \"arrestDate\": \"\",\n      \"charge\": \"\",\n      \"custodyStatus\": \"\",\n      \"citationFilingType\": \"\",\n      \"citationFilingDate\": \"\",\n      \"orderedBail\": \"\",\n      \"postedBail\": \"\",\n      \"nextAction\": \"\",\n      \"warrantType\": \"\",\n      \"probationType\": \"\",\n      \"sentenceConvictedDate\": \"\",\n      \"fineAndPenalty\": \"\",\n      \"restitutionFine\": \"\",\n      \"chargeSeverity\": \"\",\n      \"chargeDescription\": \"\",\n      \"probationStatus\": \"\",\n      \"relatedCases\": \"\",\n      \"otherCases\": [],\n      \"actions\": [],\n      \"fineInformation\": []\n    }\n  ]\n}"

When using GPT-4 I get what I want (only the non-null fields):

"{\n  \"tickets\": [\n    {\n      \"offenderName\": \"{redacted}\",\n      \"ticketId\": \"DN35168\",\n      \"docketNumber\": \"WWM0001451534\",\n      \"violationDate\": \"03032018\",\n      \"violationIds\": [\"CVC 216555b\"],\n      \"status\": \"I\"\n    },\n    {\n      \"offenderName\": \"{redacted}\",\n      \"ticketId\": \"JQ73467\",\n      \"docketNumber\": \"FHJ0002159428\",\n      \"violationDate\": \"12152022\",\n      \"violationIds\": [\"CVC 22349a\"],\n      \"status\": \"I\"\n    }\n  ]\n}"

Is there something I can do in my config to consistently get only the non-null fields with gpt-3.5-turbo?

Here’s my config:

import openai  # legacy (pre-1.0) SDK, which exposes openai.ChatCompletion

messages = [
    {
        "role": "system",
        "content": "You are a text extraction machine.  You will be given a large body of text and I need you to parse the given text and extract the traffic ticket information. There may be one or many tickets.  Each field is nullable.  Do not return null fields/arguments.",
    },
    {
        "role": "user",
        "content": htmlText,  # the raw page text to extract tickets from
    },
]
response = openai.ChatCompletion.create(
        model="gpt-4-0613",
        messages=messages,
        functions=[
            {
                "name": "get_traffic_ticket_data",
                "description": "return list of traffic tickets",
                "parameters": {
                    "properties": {
                        "tickets": {
                            "items": {
                                "properties": {
                                    "ticketId": {"type": "string", "description": "ID of the ticket."},
                                    "offenderName": {"anyOf": [
                                        {
                                            "type": "string"
                                        },
                                        {
                                            "type": "null"
                                        }
                                    ], "description": "Name of the offender."},
                                    "lawEnforcementAgency": {"type": "string", "description": "Name of the law enforcement agency."},
                                    "court": {"type": "object", "description": "Court information."},
                                    "violationDate": {"type": "string", "description": "Date of the violation."},
                                    "dueDate": {"type": "string", "description": "Due date for the ticket."},
                                    "status": {"type": "string", "description": "Status of the ticket."},
                                    "disposition": {"type": "array", "items": {"type": "string"}, "description": "array of dispositions."},
                                    "docketNumber": {"type": "string", "description": "Docket number."},
                                    "violationIds": {"type": "array", "items": {"type": "string"},
                                                     "description": "array of violation IDs."},
                                    "totalFineAmount": {"type": "string", "description": "Total amount of the fine."},
                                    "primaryViolationDescription": {"type": "string",
                                                                    "description": "Description of the primary violation."},
                                    "caseNumber": {"type": "string", "description": "Case number."},
                                    "ocPayNumber": {"type": "string", "description": "OC Pay number."},
                                    "pleaDate": {"type": "string", "description": "Date of the plea."},
                                    "lawEnforcementOfficer": {"type": "string", "description": "Name of the law enforcement officer."},
                                    "retainedAttorney": {"type": "string", "description": "Name of the retained attorney."},
                                    "sentenceDate": {"type": "string", "description": "Date of the sentence."},
                                    "bail": {"type": "string", "description": "Amount of bail."},
                                    "bonds": {"type": "string", "description": "Details about bonds."},
                                    "caseReport": {"type": "string", "description": "Case report details."},
                                    "nextCourtDate": {"type": "string", "description": "Next date for court hearing."},
                                    "agency": {"type": "string", "description": "Name of the agency."},
                                    "drNumber": {"type": "string", "description": "DR Number."},
                                    "arrestDate": {"type": "string", "description": "Date of arrest."},
                                    "charge": {"type": "string", "description": "Details about the charge."},
                                    "custodyStatus": {"type": "string", "description": "Status of custody."},
                                    "citationFilingType": {"type": "string", "description": "Type of citation filing."},
                                    "citationFilingDate": {"type": "string", "description": "Date of citation filing."},
                                    "orderedBail": {"type": "string", "description": "Amount of ordered bail."},
                                    "postedBail": {"type": "string", "description": "Amount of posted bail."},
                                    "nextAction": {"type": "string", "description": "Details of the next action."},
                                    "warrantType": {"type": "string", "description": "Type of warrant."},
                                    "probationType": {"type": "string", "description": "Type of probation."},
                                    "sentenceConvictedDate": {"type": "string", "description": "Date of sentence conviction."},
                                    "fineAndPenalty": {"type": "string", "description": "Details about fine and penalty."},
                                    "restitutionFine": {"type": "string", "description": "Amount of restitution fine."},
                                    "chargeSeverity": {"type": "string", "description": "Severity of the charge."},
                                    "chargeDescription": {"type": "string", "description": "Description of the charge."},
                                    "probationStatus": {"type": "string", "description": "Status of probation."},
                                    "relatedCases": {"type": "string", "description": "Details about related cases."},
                                    "otherCases": {
                                        "type": "array",
                                        "items": {
                                            "type": "object",
                                            "properties": {
                                                "caseNumber": {"type": "string"},
                                                "filedDate": {"type": "string"},
                                                "charges": {"type": "string"},
                                                "nextHearing": {"type": "string"},
                                                "jurisdiction": {"type": "string"},
                                                "status": {"type": "string"}
                                            }
                                        },
                                        "description": "array of other related cases."
                                    },
                                    "actions": {
                                        "type": "array",
                                        "items": {
                                            "type": "object",
                                            "properties": {
                                                "actionDate": {"type": "string"},
                                                "actionText": {"type": "string"},
                                                "disposition": {"type": "string"},
                                                "hearingType": {"type": "string"}
                                            }
                                        },
                                        "description": "array of actions related to the ticket."
                                    },
                                    "fineInformation": {
                                        "type": "array",
                                        "items": {
                                            "type": "object",
                                            "properties": {
                                                "dateToPay": {"type": "string"},
                                                "firstPayment": {"type": "string"},
                                                "priorNSF": {"type": "string"},
                                                "paymentAmount": {"type": "string"},
                                                "lastPayment": {"type": "string"},
                                                "fineNumber": {"type": "string"},
                                                "fineType": {"type": "string"},
                                                "fineDescription": {"type": "string"},
                                                "originalAmount": {"type": "string"},
                                                "paidToDate": {"type": "string"},
                                                "currentDueTotal": {"type": "string"}
                                            }
                                        },
                                        "description": "array of fine information related to the ticket."
                                    }


                                },
                                "description": "properties of a given traffic ticket.  Only include non null fields.",
                                "title": "TicketItem",
                                "type": "object",
                                "required": []
                            },
                            "title": "Tickets",
                            "type": "array",
                            "required": []
                        }
                    },
                    "title": "TrafficTicketData",
                    "type": "object",
                    "required": []
                }
            }
        ],
        function_call={
            "name": "get_traffic_ticket_data",
        },
        stream=False
    )

If I understand correctly you’re getting empty values and not nulls (lmk if I misunderstood), which is different (and actually kind of worse behavior) but maybe you can simply try changing the system message’s phrasing to refer to the empty values and not null.

For example, replace “Do not return null fields/arguments.” with “Omit empty values.”, or try playing with it a bit…
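Independent of the prompt wording, the empty values can also be stripped on the client after the function-call arguments have been parsed. The following is only a minimal sketch: `prune_empty` is a hypothetical helper, not part of any SDK, and it does not reduce the tokens or time the model spends generating the empty fields. It only cleans up the result.

```python
import json


def prune_empty(value):
    """Recursively drop None, "", [] and {} from parsed JSON data."""
    empties = (None, "", [], {})
    if isinstance(value, dict):
        cleaned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in empties}
    if isinstance(value, list):
        cleaned = [prune_empty(item) for item in value]
        return [item for item in cleaned if item not in empties]
    return value


# A shortened version of the gpt-3.5-turbo output from the question.
raw_arguments = (
    '{"tickets": [{"ticketId": "1", "docketNumber": "WWM0001451534", '
    '"caseNumber": "", "violationIds": [], "otherCases": []}]}'
)
pruned = prune_empty(json.loads(raw_arguments))
print(json.dumps(pruned, indent=2))
```

With the legacy SDK used in the question, the string to parse would come from the `arguments` field of the returned `function_call` message.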

I’d guess it can be difficult to get very consistent, reliable output. I’d suggest using Promptotype to try to stabilize it (full disclosure: it’s a tool I created), but it doesn’t support argument types this complex yet; that may be coming soon.

Edit: another option could be to switch to an instructional prompt (outputting JSON) instead of a function call. In my experience it can sometimes be easier to control, especially in such complex cases.
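As a rough illustration of the instructional-prompt route, the sketch below builds the messages and parses the reply without using `functions` at all. The field list is a small subset chosen for illustration, and `build_messages` and `parse_reply` are hypothetical helper names. The exact prompt wording is an assumption and would likely need tuning.

```python
import json

# Subset of the ticket fields, for illustration only.
FIELDS = ["ticketId", "offenderName", "docketNumber", "violationDate"]


def build_messages(html_text, fields=FIELDS):
    """Build chat messages that ask for plain JSON instead of a function call."""
    system = (
        "You are a text extraction machine. Extract every traffic ticket "
        "from the text. Reply with JSON only, in the form "
        '{"tickets": [...]}. Allowed keys per ticket: '
        + ", ".join(fields)
        + ". Omit any key whose value is missing or empty."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": html_text},
    ]


def parse_reply(content):
    """Parse the model's reply, tolerating stray text around the JSON object."""
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(content[start:end + 1])


messages = build_messages("<html>...ticket text...</html>")
example = parse_reply('Sure: {"tickets": [{"ticketId": "1"}]}')
```

The `messages` list would be passed to the chat completion call as in the question, with `functions` and `function_call` left out, and the assistant message content handed to `parse_reply`.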


So that others can see what’s going on…

{
  "tickets": [
    {
      "ticketId": "1",
      "offenderName": "{redacted}",
      "lawEnforcementAgency": "Alameda County",
      "violationDate": "03032018",
      "docketNumber": "WWM0001451534",
      "primaryViolationDescription": "CVC 216555b I",
      "caseNumber": "",
      "violationIds": [],
      "totalFineAmount": "",
      "dueDate": "",
      "status": "",
      "disposition": [],
      "ocPayNumber": "",
      "pleaDate": "",
      "lawEnforcementOfficer": "",
      "retainedAttorney": "",
      "sentenceDate": "",
      "bail": "",
      "bonds": "",
      "caseReport": "",
      "nextCourtDate": "",
      "agency": "",
      "drNumber": "",
      "arrestDate": "",
      "charge": "",
      "custodyStatus": "",
      "citationFilingType": "",
      "citationFilingDate": "",
      "orderedBail": "",
      "postedBail": "",
      "nextAction": "",
      "warrantType": "",
      "probationType": "",
      "sentenceConvictedDate": "",
      "fineAndPenalty": "",
      "restitutionFine": "",
      "chargeSeverity": "",
      "chargeDescription": "",
      "probationStatus": "",
      "relatedCases": "",
      "otherCases": [],
      "actions": [],
      "fineInformation": []
    },
    {
      "ticketId": "2",
      "offenderName": "{redacted}",
      "lawEnforcementAgency": "Alameda County",
      "violationDate": "12152022",
      "docketNumber": "FHJ0002159428",
      "primaryViolationDescription": "CVC 22349a I",
      "caseNumber": "",
      "violationIds": [],
      "totalFineAmount": "",
      "dueDate": "",
      "status": "",
      "disposition": [],
      "ocPayNumber": "",
      "pleaDate": "",
      "lawEnforcementOfficer": "",
      "retainedAttorney": "",
      "sentenceDate": "",
      "bail": "",
      "bonds": "",
      "caseReport": "",
      "nextCourtDate": "",
      "agency": "",
      "drNumber": "",
      "arrestDate": "",
      "charge": "",
      "custodyStatus": "",
      "citationFilingType": "",
      "citationFilingDate": "",
      "orderedBail": "",
      "postedBail": "",
      "nextAction": "",
      "warrantType": "",
      "probationType": "",
      "sentenceConvictedDate": "",
      "fineAndPenalty": "",
      "restitutionFine": "",
      "chargeSeverity": "",
      "chargeDescription": "",
      "probationStatus": "",
      "relatedCases": "",
      "otherCases": [],
      "actions": [],
      "fineInformation": []
    }
  ]
}

The problem is that properties that are deeper than the first nesting level get neither a description, nor do they get an “optional” marker added to them when passed to the AI. gpt-3.5-turbo is the one actually following the specification.

You’ll need to put all the properties that are to be optionally filled at the root level of the JSON, and have the function return them one ticket at a time.
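One way to read this suggestion is to drop the outer “tickets” array and make a single ticket’s fields the root parameters of the function, then call the model once per ticket. The sketch below only builds such a function definition. `build_single_ticket_function` is a hypothetical helper, only three of the fields are shown, and how the input text is split into per-ticket chunks is left open.

```python
def build_single_ticket_function(ticket_properties, required=()):
    """Return a function definition with one ticket's fields at the root level."""
    return {
        "name": "get_traffic_ticket_data",
        "description": (
            "Return one traffic ticket. "
            "Omit fields that are not present in the text."
        ),
        "parameters": {
            "type": "object",
            # Ticket fields sit directly under the root "properties",
            # not inside a nested "tickets" array.
            "properties": dict(ticket_properties),
            "required": list(required),
        },
    }


# A few of the fields from the original schema, for illustration.
ticket_properties = {
    "ticketId": {"type": "string", "description": "ID of the ticket."},
    "docketNumber": {"type": "string", "description": "Docket number."},
    "status": {"type": "string", "description": "Status of the ticket."},
}

function_def = build_single_ticket_function(
    ticket_properties, required=["ticketId"]
)
```

The resulting `function_def` would replace the single entry in the `functions` list from the question. Whether the extra calls cost less overall than generating the empty fields depends on how many tickets there are and how long the input is.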
