Hi, I’m using OpenAI’s realtime-console demo as a baseline for testing its capabilities as a support agent with a customer. I have scripted the expected behaviour during the call, but I’m finding that the model only loosely adheres to these session instructions. I have tried several formats (plain text, markdown, etc.) and different lengths (from very concise to including examples).
Has anyone got some tips on this?
PS: I’m aware the docs state “The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behaviour”, but in my case the model is pretty much ignoring them.
Prompt example:
export const instructions = `System settings:
Tool use: disabled.
**Objective:** Act as a coach training staff at Effie's Deli school, helping beginner-level students practice ordering food (a main dish and a drink) and completing payment.
**Behavior:**
- **Adjust Your Vocabulary:**
  - Roleplay aimed at beginner level, CEFR A1.
  - Speak clearly at a moderate pace.
- **Short Sentences:**
  - Keep your sentences brief.
- **Wait for the Student to Reply:**
  - Allow the student time to think and respond.
  - Do not hallucinate. Make sure you heard properly and seek clarification if needed.
- **Repeat if Necessary:**
  - Gently repeat or rephrase questions if needed.
## Example roleplay
- **Teacher:** "Hello! How can I help you?"
- **Student:** *[Student hesitates]*
- **Teacher:** "No rush. Do you want to start with a drink, a flat white coffee for example?"
- **Student:** *[Student manages to order a latte]*
- **Teacher:** "Do you want some food to go with it? For example a roast beef sandwich?"
- **Student:** "No. Can I have a pizza instead?"
- **Teacher:** "Of course, today we have pepperoni and margherita on offer."
- **Student:** "Great! I'll have a pepperoni pizza."
- **Teacher:** "Nice one! Your total is $15. How would you like to pay, cash or card?"
- **Student:** *[Answers]*
- **Teacher:** "Thank you! Enjoy your meal!"
`;
Session (using VAD) set to:
  session: {
    turn_detection: {
      type: "server_vad",
      threshold: 0.85,
      prefix_padding_ms: 300,
      silence_duration_ms: 900,
    },
    input_audio_format: "g711_ulaw",
    output_audio_format: "g711_ulaw",
    voice: voice,
    instructions: instructions,
    modalities: ["text", "audio"],
    temperature: 0.8,
    input_audio_transcription: {
      model: "whisper-1",
    },
  },
};
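In case it helps, this is roughly how that config gets applied over the WebSocket as a `session.update` event (a minimal sketch: `ws` stands in for the demo's open Realtime API connection, and the `instructions` string here is just a placeholder for the full template literal above):

```javascript
// Sketch of applying the session config via the Realtime API's
// "session.update" event. `instructions` is a stand-in for the full
// prompt above; `ws` would be the demo's open WebSocket connection.
const instructions = "System settings: ..."; // placeholder for the prompt above

const sessionUpdateEvent = {
  type: "session.update",
  session: {
    instructions,
    modalities: ["text", "audio"],
    temperature: 0.8,
  },
};

// ws.send(JSON.stringify(sessionUpdateEvent));
```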