Okay - and are the prompts unchanged from your earlier tests that proclaimed ChatGPT a failure when multiple tables are involved? Saying the “exact same prompt strategy” is not necessarily the same as saying the “exact same prompts,” right? I just want to understand what changed between your first experiments and the latest video.
Also, I have never seen a single successful SQL inference come in under 256 tokens. Why did you fix the limit at 150? More to the point, have you benchmarked outcomes at higher and lower values?
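For context, here is a minimal sketch of the setting I am asking about, assuming the standard OpenAI Python client; the model name and prompt text are placeholders rather than anything from your video, and 256 is simply the floor I have found workable.

```python
# Minimal sketch of where the token cap sits in a completion call (assumed
# OpenAI Python client; model and prompt text are placeholders, not yours).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[
        {"role": "system", "content": "Translate the request into a SQL query."},
        {"role": "user", "content": "stocks in industry Packaged Foods"},
    ],
    max_tokens=256,  # the cap in question; I have not seen reliable SQL below 256
    temperature=0,   # deterministic output makes repeated benchmarking easier
)
print(response.choices[0].message.content)
```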
Help me understand - are these queries the actual prompts you are sending to ChatGPT (e.g., “stocks in industry Packaged Foods”)? If not, could you share the prompt that would be generated from that query? I also notice mixed case in the queries; why the inconsistency?
And how do you actually know this is the case?
When I test outcomes, I run hundreds of them in an automated process that I can easily repeat. Each outcome is validated by a human and ranked for accuracy and performance. I’ve learned that evaluating success one result at a time, without a formal test protocol or metrics, is unreliable and generally biased.
Most importantly, any change to the prompts or other settings requires a full test battery and reassessment. This approach tells me how each prompt version compares with the others; we then rank the versions to find the most successful one. A single prompt-development cycle might involve dozens of versions and thousands of tests.
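For what it’s worth, below is a stripped-down sketch of the kind of harness I mean, assuming hypothetical helper names (run_prompt, load_cases) and a local CSV of human-verified test cases; it only illustrates the shape of the loop (run everything, score, rank), not my actual tooling.

```python
# Sketch of an automated prompt-evaluation loop. run_prompt and load_cases are
# hypothetical placeholders; exact-match scoring stands in for human review.
import csv

def run_prompt(prompt_version: str, question: str) -> str:
    """Placeholder: send `question` through the given prompt version, return SQL."""
    return ""  # replace with the real model call

def load_cases(path: str) -> list[dict]:
    """Each CSV row pairs a natural-language question with its verified SQL."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def evaluate(prompt_versions: list[str], cases: list[dict]) -> dict[str, float]:
    """Run every case against every prompt version and score each version."""
    scores: dict[str, float] = {}
    for version in prompt_versions:
        hits = sum(
            run_prompt(version, case["question"]).strip().lower()
            == case["expected_sql"].strip().lower()
            for case in cases
        )
        scores[version] = hits / len(cases)
    return scores

if __name__ == "__main__":
    cases = load_cases("test_cases.csv")  # hypothetical file name
    results = evaluate(["v1", "v2", "v3"], cases)
    # Rank prompt versions from best to worst, as described above.
    for version, accuracy in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{version}: {accuracy:.1%}")
```

In practice the human reviewers also score latency and rank each result for accuracy, but the point of the sketch is the rank-ordering of prompt versions at the end: that ordering, not any single impressive result, is what tells you whether a prompt change actually helped.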