Earlier this year, I used GPT-3.5-turbo via the API to retrieve disease-related information for viruses. This approach let me collect disease data for over 15,000 of the more than 40,000 viruses in my dataset.
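For context, the setup looked roughly like the sketch below, using the OpenAI Python SDK's Chat Completions endpoint. The system prompt and the few-shot example pair here are hypothetical placeholders, not my actual prompts:

```python
import os


def build_messages(virus_name: str) -> list[dict]:
    """Assemble the system prompt plus a few-shot example pair.

    Both the instruction text and the example are illustrative
    stand-ins for the real prompts used in the experiment.
    """
    return [
        {
            "role": "system",
            "content": (
                "You are a virology assistant. For the given virus, "
                "list the diseases it is known to cause, or reply 'unknown'."
            ),
        },
        # One illustrative input/output example (hypothetical):
        {"role": "user", "content": "Rabies lyssavirus"},
        {"role": "assistant", "content": "rabies"},
        # The actual query:
        {"role": "user", "content": virus_name},
    ]


def fetch_diseases(virus_name: str, model: str = "gpt-3.5-turbo") -> str:
    """Query the Chat Completions API; requires OPENAI_API_KEY to be set."""
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK

    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep outputs as reproducible as possible
        messages=build_messages(virus_name),
    )
    return resp.choices[0].message.content


if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    print(fetch_diseases("Zika virus"))
```

Switching models for the later runs only meant changing the `model` argument (e.g. to "gpt-4o" or "gpt-4o-mini"); everything else stayed the same.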
Over the past few days, I have attempted to repeat this experiment using the GPT-4o and GPT-4o-mini models. However, to my surprise, the results have been significantly worse, even when using the same system prompt and input/output examples.
I also tried re-running the experiment with GPT-3.5-turbo itself, but the outcomes were similarly poor. I have tested several variations: switching between models, reusing the original prompts, refining them, and changing the reference examples. None of these approaches has worked, not even for entries where I successfully gathered data on the first attempt.
Could this be due to some filtering applied on OpenAI’s end? Is it possible that the disease information is being flagged as “dangerous”?