Hey, I created an assistant meant for a specific data extraction task. When I was testing it in the Playground, the results were very promising, at worst satisfactory.
Recently, I moved on to prototyping with node.js and realized that the quality of responses from the API is drastically worse. The results returned by the API are wrong or useless in more than 50% of cases.
I could only achieve comparable output by switching from gpt-4o-mini-2024-07-18 to gpt-4o-2024-08-06 in my API calls. This is not an optimal solution for my use case, both financially and because of the much longer inference time.
I made sure not to override any settings in the run call; a simplified sketch of my setup is below.
I am sending a stringified JSON object with the classified data to extract from, with the fields described and referred to in the system prompt and the json_schema.
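Roughly, the call looks like this (my prototype is in node.js, but the equivalent is sketched here in Python; the payload fields and the assistant id are placeholders, not my real data):

```python
import json
from openai import OpenAI

client = OpenAI()

# Placeholder payload; the real object holds the classified data to extract from.
payload = {"document_id": "doc-123", "text": "...classified content..."}

thread = client.beta.threads.create()

# The data goes in as a stringified JSON object; the fields are described
# in the assistant's system prompt and json_schema.
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=json.dumps(payload),
)

# No instructions, model, temperature, etc. overridden here; the run uses
# whatever the assistant was configured with.
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id="asst_placeholder",
)
```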
I went into Dashboard -> Threads. Both the run instructions and the messages are identical between the two calls.
So we really need to dig deep. There must be some very subtle difference here.
I am wondering why you are using this approach? It is not inherently wrong, but it does raise some concerns and can introduce bugs that are very hard to catch.
Let’s build a very simplified example of calling the OpenAI endpoint without keeping any state.
In fact, if you’re comfortable with it, let’s just use an easy-to-investigate language like Python in Jupyter. You can bootstrap an API call via the Assistants API quickly.
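Something along these lines should be enough as a starting point (the assistant id and the test payload are placeholders; it is just a sketch of a bare-bones call that relies entirely on the assistant's own default settings):

```python
import json
import time
from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_placeholder"  # your assistant's id

# A fresh thread per call keeps the experiment stateless.
thread = client.beta.threads.create()

client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=json.dumps({"example_field": "example value"}),  # placeholder payload
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=ASSISTANT_ID,
)

# Poll until the run finishes.
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

# Print the assistant's reply (the newest message in the thread).
messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```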
Thank you for your help, the issue turned out to be a very subtle whitespace difference within the stringified JSON, which I missed at first. It turns out the model understands human-formatted JSON much better.
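Concretely, the difference came down to something like compact versus pretty-printed output (illustrated in Python with placeholder data; my actual payload differs):

```python
import json

payload = {"invoice_number": "INV-001", "total": 42.5}  # placeholder data

compact = json.dumps(payload)           # '{"invoice_number": "INV-001", "total": 42.5}'
pretty = json.dumps(payload, indent=2)  # human-formatted, one field per line

# Semantically identical JSON, but in my case the extraction quality differed
# noticeably depending on which formatting was sent to the assistant.
```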