And here is my question…
I sometimes see the AI respond incredibly well, and sometimes it seems to become less intelligent. I just want to confirm: is this purely because of my prompt and data, or does OpenAI actually make minor updates to their system without notifying us?
Does anyone know?
Looking forward to your response!
Constantly. Without any documentation or changelog. It's frustrating because there is no stable fallback model for function calling.
“2023-09-10: we quantized the poetry submodel another bit, disabled some attention heads, and put in another set of 2000 user denials based on eval cases” - you'll probably never see anything like that published.
There are always changes being made in the backend of the model, and that information usually does not reach the end user. I have a fixed set of test cases I run every morning to check whether performance on them has stayed the same, roughly along the lines sketched below.
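A minimal sketch of that kind of daily regression check, assuming the openai Python package (v1+ client) and an `OPENAI_API_KEY` in the environment; the prompts and expected substrings here are placeholders, not my actual test set.

```python
# Daily regression sketch: run a fixed prompt set at temperature 0 and flag
# any case whose output no longer contains the expected marker string.
from openai import OpenAI

client = OpenAI()

TEST_CASES = [  # hypothetical cases; replace with your own
    {"prompt": "Return only the JSON object {\"ok\": true}.", "expect": "\"ok\": true"},
    {"prompt": "What is 17 * 23? Answer with the number only.", "expect": "391"},
]

def run_suite(model: str = "gpt-4") -> None:
    failures = 0
    for case in TEST_CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,  # keep sampling as deterministic as possible
        )
        text = resp.choices[0].message.content or ""
        if case["expect"] not in text:
            failures += 1
            print(f"FAIL: {case['prompt'][:40]!r} -> {text[:80]!r}")
    print(f"{len(TEST_CASES) - failures}/{len(TEST_CASES)} cases passed")

if __name__ == "__main__":
    run_suite()
```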
The base models are not updated other than on major version changes, e.g. GPT-3 → GPT-4; the weights and biases contained within them are static. On top of that subsystem sits a fine-tuned moderation layer, plus RLHF and other fine-tuning methods that get adjusted as new exploits and issues are discovered and patched, and those can change at any time.
“Base models are not updated” may be an over-reaching statement; there is contrary anecdotal evidence, based on observable criteria, from people who use the models regularly. There are also signs of architectural changes and efficiency measures being put in place, beyond anything that would just be a re-weighting through re-training.
Take, for example, a large change through May: gpt-4's token generation rate went up roughly 50% and comprehension went down, overnight, with the click of a button. “They made gpt-4 faster” was the universal opinion. Same model name. No communication.
And as I privately shared previously, we can detect instant, persistent performance changes that appear to correlate with cognitive quality.
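For the speed side of that claim, here is a rough way to put a number on “they made gpt-4 faster” instead of relying on feel: a sketch assuming the openai Python package (v1+ client), counting streamed chunks as a crude proxy for tokens; the prompt is a placeholder.

```python
# Rough tokens-per-second probe: time a streamed completion and count chunks.
import time
from openai import OpenAI

client = OpenAI()

def generation_rate(model: str = "gpt-4",
                    prompt: str = "Write 300 words about tides.") -> float:
    start = time.monotonic()
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Skip empty deltas (e.g. the initial role-only chunk).
        if chunk.choices and chunk.choices[0].delta.content:
            chunks += 1
    elapsed = time.monotonic() - start
    return chunks / elapsed  # approximate tokens per second

if __name__ == "__main__":
    print(f"~{generation_rate():.1f} chunks/sec")
```

Run it daily with the same prompt and a jump or drop in the rate shows up as a number you can date, even when the model name never changes.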
Changes to the inferencing algorithm do not affect the base model it operates on; performance gains can be algorithmic, hardware-based, or a combination of both. Your perceived drop in comprehension has not been borne out by any benchmarks or evals I have seen. I'm not saying it hasn't happened, but it needs a quantifiable method of evaluation.
We don’t get the previous model iteration to evaluate in retrospect. That is a failing of the Stanford paper: it incorrectly treats the -0301 model as a representation of March performance.
Scripts that stop working overnight because system instructions are no longer followed with the same level of comprehension are good evidence, though.
Presumably the results from HumanEval, code evals, and the like are stored and can be used as a reference for performance. If there were a significant reduction in those metrics, then yes, I'd agree a given change has had a negative impact; I've just not seen any that indicate a lower score. It could be that those tests don't cover the things you are seeing, so it would be worth generating an eval set that includes performance tests on the topics you value, along the lines sketched below.
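A small sketch of what such a personal eval set could look like, assuming the openai Python package (v1+ client); the tasks, graders, and the `eval_history.jsonl` log file are placeholders for illustration, not any official benchmark. The point is to persist a dated score so a drop on the topics you care about shows up as a number rather than an impression.

```python
# Personal eval sketch: run graded tasks at temperature 0 and append the
# mean score with a date, so runs on different days can be compared.
import json
import time
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Hypothetical tasks: each has a prompt and a grader returning 0.0 to 1.0.
EVAL_TASKS = [
    {
        "name": "arithmetic",
        "prompt": "Compute 144 / 12 + 7. Reply with the number only.",
        "grade": lambda out: 1.0 if out.strip().startswith("19") else 0.0,
    },
    {
        "name": "instruction_following",
        "prompt": "Reply with exactly the word: banana",
        "grade": lambda out: 1.0 if out.strip().lower() == "banana" else 0.0,
    },
]

def run_evals(model: str = "gpt-4", log_path: str = "eval_history.jsonl") -> float:
    scores = []
    for task in EVAL_TASKS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task["prompt"]}],
            temperature=0,
        )
        output = resp.choices[0].message.content or ""
        scores.append(task["grade"](output))
    mean_score = sum(scores) / len(scores)
    # Append a dated record so today's score can be compared with past runs.
    record = {"date": time.strftime("%Y-%m-%d"), "model": model, "score": mean_score}
    with Path(log_path).open("a") as f:
        f.write(json.dumps(record) + "\n")
    return mean_score

if __name__ == "__main__":
    print(f"mean score: {run_evals():.2f}")
```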