New models are incapable of proper function calling

Yes, if you don’t need the parallel tool calls that newer models can emit (for example, calling four simultaneous instances of “dice(roll:d6)” for a dungeon game), and you won’t be using document retrieval that requires that capability, then gpt-4-0613, trained simply on functions, should work for you.
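To illustrate what that parallel format looks like on the client side, here is a minimal sketch. The `dice` tool schema, the stand-in roll logic, and the assistant message are all hypothetical and hand-built (no API call is made); the point is only the shape of the `tool_calls` list and the per-call `tool` replies you send back.

```python
import json

# Hypothetical schema for the "dice" tool mentioned above.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "dice",
        "description": "Roll a die, e.g. d6",
        "parameters": {
            "type": "object",
            "properties": {
                "roll": {"type": "string", "description": "die spec, e.g. 'd6'"}
            },
            "required": ["roll"],
        },
    },
}]

def dice(roll: str) -> int:
    """Deterministic stand-in for a real roll (returns the max face)."""
    return int(roll.lstrip("d"))

def handle_tool_calls(tool_calls):
    """Dispatch each parallel tool call and build one 'tool' reply per call."""
    replies = []
    for call in tool_calls:
        args = json.loads(call["function"]["arguments"])
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must echo the id from the assistant message
            "content": str(dice(**args)),
        })
    return replies

# Simulated assistant message: four simultaneous dice(roll:"d6") calls,
# the shape newer models emit under the parallel tool-call format.
tool_calls = [
    {"id": f"call_{i}", "type": "function",
     "function": {"name": "dice", "arguments": '{"roll": "d6"}'}}
    for i in range(4)
]

replies = handle_tool_calls(tool_calls)
```

A legacy `functions`-trained model like gpt-4-0613 would instead emit a single `function_call` per assistant turn, so the loop above would only ever see one entry.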

The greater ability and lower perplexity of the non-turbo gpt-4 model (even at the same temperature) should also carry over to producing the tokens of function arguments.

I faced the same issue. After experimentation, I found that GPT-4-turbo is better than GPT-3.5-turbo at function calling.

If you have budget constraints, then GPT-3.5-turbo-16k is a much better option.

That being said, experiment for your use case.