Assistant Not Following Instructions


I know this is a known problem, but maybe someone knows a workaround or solution for this.

I am using the gpt-3.5-turbo-1106 model as an assistant. Its task is to extract a value from input text and, if it is not there, ask the user to enter ‘1’ for option 1 or ‘2’ for option 2, never accept any value other than ‘1’ or ‘2’, and ask the user to re-enter a correct value. But the model ignores this instruction no matter how I phrase it, and it accepts input from the user other than ‘1’ or ‘2’.

Does anyone know a way to make the model follow instructions correctly instead of ignoring them? Is there a certain keyword I am missing?

Thanks for any help!

Do you mean you use gpt-3.5-turbo-1106 IN the Assistants API?

If so, Assistants API is still in beta and is highly uncontrollable.

First, I use it in the Playground; then, after I am done perfecting the instructions and testing the assistant and its replies, I take the assistant ID and my API key and use it from the API.

There is absolutely no way to do the full cycle of prompt engineering in the Playground. What seems to be getting good results in the Playground will turn into a nightmare in production, because you cannot possibly do hundreds (and sometimes thousands) of tests of a prompt there. If you haven’t done that, you haven’t done prompt engineering.

And back to my point - Assistants API are still in beta and even a robust prompt engineering process won’t make them follow your instructions perfectly.

How do you suggest I should test the prompt and the assistant then?
Also, shouldn’t the results be somewhat similar between the playground and the production?
Given that I am using the same model and same instructions.

Also, I am aware that manually testing the prompt is not a reliable way to validate the assistant, but there isn’t really a way to automate generating inputs, at least right now. And given the simplicity of the task, it shouldn’t be that complicated to get consistently acceptable results.

LLMs are stochastic by nature, meaning that if you tested a prompt 10 times and it seemed to be working right, that doesn’t mean the result will still be the same on the 11th. For LLM apps that require predictability of results you have to do mass testing (from what I see, everyone is using their own solutions as of now).
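Mass testing like this can be sketched with a small harness. Here `call_assistant` is a hypothetical stand-in for your real API call — any callable that maps an input string to the assistant's reply will do:

```python
# Minimal sketch of a mass-testing loop for a prompt.
# `call_assistant` is a hypothetical stand-in for the actual model call.

def pass_rate(call_assistant, cases, runs=10):
    """Run each (input_text, predicate) case `runs` times and return
    the fraction of calls whose reply satisfies the predicate."""
    passed = total = 0
    for text, is_ok in cases:
        for _ in range(runs):
            total += 1
            if is_ok(call_assistant(text)):
                passed += 1
    return passed / total

# Example with a stubbed model that always answers "1":
# pass_rate(lambda t: "1", [("pick one", lambda r: r in {"1", "2"})])  -> 1.0
```

Swapping the stub for the real API call turns this into the kind of hundreds-of-runs validation described above; anything below a 100% pass rate tells you the prompt is not production-ready.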


I have a similar problem, and it seems to get worse as the thread gets longer. The Assistant just ignores the instructions passed to it. I want to specify certain formats for certain messages, and it doesn’t work in the Assistants API, whereas it does if I take the messages from an assistant’s thread and pass them to Completions with a system prompt.

I’d like to know, under the hood, how I can get the Assistants API to work, because I like maintaining the thread-like nature, and it’s much easier on token usage for some cost savings. :upside_down_face:

And back to my point - Assistants API are still in beta and even a robust prompt engineering process won’t make them follow your instructions perfectly.

What I found useful is to sometimes break the instructions into smaller tasks and assign each task to its own assistant. That setup was less likely to fail, and each assistant had clearer instructions, compared to having one assistant do all the work with a lot of instructions and different cases to handle. This is also known as prompt chaining.
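A minimal sketch of that chaining idea, assuming a hypothetical `ask(instruction, text)` callable that sends one focused instruction to one assistant and returns its reply:

```python
# Sketch of prompt chaining: two narrow steps instead of one big prompt.
# `ask` is a hypothetical stand-in for a call to one assistant that has
# a single focused instruction.

def extract_then_validate(ask, user_text):
    # Step 1: one assistant does nothing but extract the value
    # (or reply NONE when it is absent).
    value = ask(
        "Extract the option number from the text; reply NONE if absent.",
        user_text,
    )
    # Step 2: a second, much simpler check decides what happens next.
    if value in {"1", "2"}:
        return f"The user wants Option {value}"
    return "Please select Option 1 or 2"
```

Each step now has one job, so a failure is easier to spot and the per-step instructions stay short.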


Have you tried not using the assistant for this step, but a simple script to proceed with the conversation?

Instead of sending the user’s reply (‘1’, ‘2’, or anything else) back to the model, you run a script to check the user message and then either return to the assistant with “The user wants Option 1” or return to the user with “Please select Option 1 or 2”.

This way you should get robust results and save a few tokens.
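A minimal sketch of that gatekeeper script, in plain Python (no model call involved):

```python
# Validate the user's reply before anything is sent to the assistant.
# Returns (message, forward_to_model): when the input is valid, the
# message is meant for the model; otherwise it goes back to the user.

def check_reply(user_input: str) -> tuple[str, bool]:
    choice = user_input.strip()
    if choice == "1":
        return "The user wants Option 1", True
    if choice == "2":
        return "The user wants Option 2", True
    return "Please select Option 1 or 2", False
```

Since the check is deterministic code rather than a prompt, the ‘never accept anything but 1 or 2’ rule cannot be ignored, and invalid inputs never cost model tokens.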


The opposite of helpful. Obviously it’s beta; I’m pointing out difficulties I’ve had, to see if people have workarounds or other practices to accomplish the same task.


What I’ve gotten to work is putting instructions directly in the message thread. It works great, but it “feels” wrong, since I’m just using the API to instruct the model to give the user a reminder for something or to ask the user a question they can respond to. The UX so far has been fine, but if you printed out the messages in the thread it would feel weird, since some of them are obvious prompt-engineering messages not written by users.

Oh well, it works!
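One way to sketch this trick is a small helper that slips the instruction into the thread as an ordinary user message just before the user's latest message, so the model sees it as recent context (the message-dict shape follows the common chat format; the helper itself is just an illustration):

```python
# Insert an instruction into the thread as a plain user message,
# positioned right before the user's latest message.

def with_instruction(messages, instruction):
    out = list(messages)
    out.insert(len(out) - 1,
               {"role": "user", "content": f"[Instruction] {instruction}"})
    return out
```

The `[Instruction]` prefix is only a convention for spotting these messages later — which is exactly why a printed-out thread looks odd, as noted above.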

GPT-3.5 has a rolling context window which moves with your inputs and outputs. The window is not overly large, seeming to be about 7k tokens into the past, imo. This is why putting instructions directly in the message thread works.

When I question the model about its window size, it claims “the specific details regarding the size or duration of the rolling window for conversations in the GPT-3.5 model are not publicly disclosed.”

But we could always check the model by testing it. I haven’t done this yet:

1. Start a conversation with the AI and ensure it reaches a substantial length, well beyond what you suspect the conversation window might be.
2. At some point during the conversation, introduce a unique and easily recognizable token or phrase that doesn’t commonly occur in everyday conversation. This could be a random sequence of characters or a specific word that’s unlikely to be used naturally.
3. Keep the conversation going for a while longer, ensuring that the unique token remains within the conversation window.
4. After reaching a certain point in the conversation, start monitoring the AI’s responses to see if it still shows awareness of the unique token. If it continues to acknowledge or reference the token, the conversation window is larger than the current portion of the conversation. If the AI stops referencing the token, the conversation window has probably been exceeded.
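The steps above can be sketched as a helper that only builds the canary conversation; actually sending it to the model (and growing `filler_turns` until the canary is forgotten) is left out, since that needs an API key:

```python
# Build a conversation for the canary-token window test described above.
# `filler_turns` pads the history to push the canary into the past.

def build_canary_test(canary: str, filler_turns: int):
    messages = [{"role": "user",
                 "content": f"Remember this code: {canary}"}]
    for i in range(filler_turns):
        # Each filler turn adds one user and one assistant message.
        messages.append({"role": "user",
                         "content": f"Filler question {i}."})
        messages.append({"role": "assistant",
                         "content": f"Filler answer {i}."})
    # Final probe: does the model still see the canary?
    messages.append({"role": "user",
                     "content": "What was the code I asked you to remember?"})
    return messages

# Send `messages` to the model at increasing `filler_turns` until the
# reply no longer contains the canary; that point estimates the window.
```

Counting the tokens of the history at the point where recall fails gives a rough measure of the effective window.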

If you happen to test this please drop me a note. I would like to know your results.