I have been comparing gpt-3.5-turbo-0613 and gpt-4-0613 function call syntax against the non function call sysntax for compound text processing tasks (e.g. do many things to input text in one api call) that output json. I was hoping to switch to the function call syntax using gpt-3.5-turbo-0613 due to it’s lower cost and speed, but so far the function call API call produces unreliable results.
Specifically, I pass in a voice transcript (from Whisper) and ask the model to perform 4 tasks and return the results in a json object:
- Generate a title based on the contents
- Split the transcript into paragraphs
- Extract one-word categories from the transcript
- Extract action items if the transcript contains todos/action items
- What happens fairly reliably with the function call syntax is the model outputs the json object, but with empty values for all fields except for the transcript. It returns the input transcript as is without breaking it into paragraphs. The same prompt without the function call syntax produces the expected results. Note: I’m not using the system prompt for 3.5.
- In these requests,
temperature=0
,function_call="auto"
Here’s the functions array I’m passing:
"functions": [
{
"name": "processEntryData",
"parameters": {
"type": "object",
"properties": {
"transcript": {"type": "string", "description": "The transcript text separated into paragraphs by two `\n\n` newline"},
"title": {"type": "string", "description": "The generated title from the transcript contents"},
"action_items": {"type": "array", "items": {"type": "string"}},
"categories": {"type": "array", "items": {"type": "string"}, "description": "Categories describing the note"}
},
"required": ["transcript", "title", "categories"]
}
}
],
- In contrast, gpt-4-0613, the function call syntax works as expected, producing the desired results reliably, in a json object.
- Similarly, I get the desired results when removing the function call syntax with gpt-3.5-turbo
I imagine these sorts of compound tasks are maybe not what the model is trained to do, compared to a query as in the examples like “what’s the weather in boston?”.
I wanted to share this experience to serve as a guide to others, and also curious if anyone else is experiencing similar results.