Differences in Assistants playground behavior between GPT-4 and GPT-3.5 based assistants

My task is to extract certain facts from user-supplied text and return them in a structured format. I’ve written code that uses the Assistants API to accomplish this. My code is meant to run in batch-like mode (no end-user interaction). At present the assistant is not working as well as I’d like, so I’m using the Assistants playground to improve accuracy.
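To give a sense of the batch flow, it looks roughly like this (a simplified sketch using the Python SDK; the assistant ID is a placeholder and error handling is minimal):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ASSISTANT_ID = "asst_..."  # placeholder for my extraction assistant's ID

def extract_events(input_text: str) -> str:
    # One fresh thread per input so runs can't bleed into each other.
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=input_text
    )
    # create_and_poll blocks until the run reaches a terminal state.
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=ASSISTANT_ID
    )
    if run.status != "completed":
        raise RuntimeError(f"run ended with status {run.status}")
    # Messages are listed newest first; the top one is the assistant's reply.
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    return messages.data[0].content[0].text.value
```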

My process is: in the Assistants playground, I tweak the instructions, enter my input text, and examine the model’s response. I’ve tried working with both GPT-3.5 and GPT-4. If I can get this working with GPT-3.5, there are significant cost savings.

There is a significant difference between how I interact with GPT-4 vs. GPT-3.5 in the Assistants playground:

  • Using GPT-4, after I enter the initial user message and receive the initial assistant message, I can have a back-and-forth follow-up conversation with the assistant, asking about its reasoning and pointing out the mistakes it made. The model then recognizes the mistake, and I can ask how I should change the instructions so that it doesn’t make the same mistake again. If I want to provide new input text, I tell the model “let’s try a new input” and it happily processes it.

  • Using GPT-3.5, the model tries to extract facts from every user message (whereas GPT-4 extracts only from the first user message), so I can’t have a back-and-forth to try to figure out what this model is thinking.

Can someone shed some light on (1) why this behavior is different, (2) whether I can control this behavior, and if so, (3) how?

My assistant instructions:

You are an assistant that extracts information about events from my input. There can be any number of events in my input. You will extract these fields from each event: description of the event, start time and end time. Your response to my input should contain the event information in structured json output without using markdown code blocks.

When the input contains time zone information after the timestamp, please include it in the output.

When the input contains a pair of timestamps next to each other separated by a dash or minus sign or hyphen, these timestamps are the start time of the event and end time of the event.

When the input contains a pair of timestamps next to each other separated by a pipe symbol or a forward slash character, the first timestamp is the start time and the second timestamp should be ignored.

When the input contains multiple timestamps separated only by spaces, treat the first timestamp as the start time and ignore the other timestamps unless the text clearly indicates they represent a different event or an end time.

My initial user message contents:

I woke up at 6:00 AM EST / 11 AM GMT. I brushed my teeth 6:15 AM EDT - 6:20 AM EDT. At 6:30 AM GMT I started breakfast. The alarm went off at 6:25 AM EDT 9:25 AM PDT. I turned off the oven at 10:00 AM EDT | 3:00 PM GMT.
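Applying the rules above, the kind of output I’m aiming for looks roughly like this (the field names are illustrative, since my instructions don’t actually pin them down):

```json
[
  {"description": "woke up", "start_time": "6:00 AM EST", "end_time": null},
  {"description": "brushed my teeth", "start_time": "6:15 AM EDT", "end_time": "6:20 AM EDT"},
  {"description": "started breakfast", "start_time": "6:30 AM GMT", "end_time": null},
  {"description": "the alarm went off", "start_time": "6:25 AM EDT", "end_time": null},
  {"description": "turned off the oven", "start_time": "10:00 AM EDT", "end_time": null}
]
```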

You should at the very least add a lot more instructions about the JSON you would like to receive, especially since you have a lot of different options (one event, more than one event). You say structured JSON, but you don’t tell it about the structure.
Since you have 1 to n events, you might also consider letting it call a function to store your events, as sketched below. That might be easier since you are already running a batch process.
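Roughly like this when you create the assistant (the function name and field names are just an example, not prescriptive):

```python
from openai import OpenAI

client = OpenAI()

# Illustrative tool definition; the assistant can call store_events
# with structured arguments instead of emitting free-form text.
assistant = client.beta.assistants.create(
    model="gpt-4",
    instructions="Extract events and store them via the store_events function.",
    tools=[{
        "type": "function",
        "function": {
            "name": "store_events",
            "description": "Store the events extracted from the user's input.",
            "parameters": {
                "type": "object",
                "properties": {
                    "events": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "start_time": {"type": "string"},
                                "end_time": {"type": "string"},
                            },
                            "required": ["description", "start_time"],
                        },
                    },
                },
                "required": ["events"],
            },
        },
    }],
)
```

The run then pauses with a requires_action status and hands you typed arguments to process, rather than text you have to parse.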

And I would not even try this with the 3.5 models. It will be too hard to get right. Go with GPT-4.

In another GPT-3.5-based assistant, I was able to ask follow-up questions (my initial problem was that I wasn’t able to ask follow-up questions), and I asked the model how it decides whether my input is a question about previous user messages or new data to be analyzed and extracted. It responded with something like "I determine whether to treat your input as a question or to extract information from it based on the content and structure of the input."

So GPT-3.5 tries to figure it out, but it seems to get it wrong sometimes. It also seems wonky at other times - in fact, it started responding to my questions in JSON format.

@jlvanhulst - both the GPT-3.5 and GPT-4 models do a surprisingly good job of creating well-structured, relevant, and usable JSON output “out of the box”, so I did not have to specify the exact structure. Both models also respond with a JSON array when multiple events are extracted from one user message. As for your suggestion to create a function to store events, I plan to do this by taking the responses from the assistant messages in the thread, roughly as shown below; I think a function might not be warranted here.
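A sketch of what I mean (continuing from my earlier snippet; error handling omitted):

```python
import json
from openai import OpenAI

client = OpenAI()

def parse_events(thread_id: str) -> list:
    # The newest message comes first in the default listing order.
    messages = client.beta.threads.messages.list(thread_id=thread_id)
    reply = messages.data[0].content[0].text.value
    # Plain json.loads works because my instructions forbid markdown code blocks.
    return json.loads(reply)
```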

But if you don’t specify the names of the attributes to be used in the JSON, you WILL get different attribute names across requests. So that is the very least you will have to add to the prompt.
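For example, appending something like this to the instructions (field names are just an example):

```
Respond with a JSON array of event objects. Each object must have exactly
these keys: "description" (string), "start_time" (string), and
"end_time" (string, or null when there is no end time).
```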