How consistent is GPT-4 via the API?

We’re using GPT-4 via the API to parse some data. We’ve noticed that while it’s sometimes spot on, very slight modifications to the input data (with no semantic difference) cause it to parse very incorrectly. We’re using temperature 0. Is this variation to be expected? We’d expect the model to produce the same output every time, even if the input is slightly different (i.e., different line items deleted).
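
For anyone who wants to put numbers on this, here is a rough sketch of how the two effects could be measured separately: run-to-run variation on identical input, and sensitivity to deleting a single line item. It makes several assumptions. `parse_fn` is a hypothetical wrapper around whatever API call you already make, returning a flat dict of parsed fields, and `fake_parse` is only a local stand-in so the sketch runs without network access. None of these names come from a real library.

```python
import json
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical type: your own wrapper that sends `text` to the model
# (e.g. at temperature 0) and returns the parsed fields as a flat dict.
ParseFn = Callable[[str], Dict[str, str]]


def drop_one_line_variants(text: str) -> List[str]:
    """Return copies of `text`, each with exactly one non-empty line removed."""
    lines = text.splitlines()
    return [
        "\n".join(lines[:i] + lines[i + 1:])
        for i, line in enumerate(lines)
        if line.strip()
    ]


def repeat_agreement(parse_fn: ParseFn, text: str, runs: int = 5) -> float:
    """Fraction of runs that match the most common output for identical input."""
    # Serialize with sorted keys so equal dicts compare equal as strings.
    outputs = [json.dumps(parse_fn(text), sort_keys=True) for _ in range(runs)]
    return Counter(outputs).most_common(1)[0][1] / runs


def field_agreement(baseline: Dict[str, str], other: Dict[str, str]) -> float:
    """Fraction of fields present in both results whose values are identical."""
    shared = baseline.keys() & other.keys()
    if not shared:
        return 0.0
    return sum(baseline[k] == other[k] for k in shared) / len(shared)


def deletion_sensitivity(parse_fn: ParseFn, text: str) -> List[float]:
    """Field agreement with the full-input parse, one score per deleted line."""
    baseline = parse_fn(text)
    return [
        field_agreement(baseline, parse_fn(variant))
        for variant in drop_one_line_variants(text)
    ]


def fake_parse(text: str) -> Dict[str, str]:
    """Stand-in parser (NOT a model call): reads 'name: value' lines."""
    return dict(
        line.split(": ", 1) for line in text.splitlines() if ": " in line
    )


if __name__ == "__main__":
    sample = "widget: 3\ngadget: 7\nsprocket: 1"
    print(repeat_agreement(fake_parse, sample))
    print(deletion_sensitivity(fake_parse, sample))
```

Swapping `fake_parse` for the real wrapper would show whether the bad parses come from nondeterminism on the same input (low `repeat_agreement`) or from the deleted line items themselves (low scores from `deletion_sensitivity`).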

I’m seeing something similar: GPT-4 becoming dumber sometimes, for a while