Forcing the AI to only use vocabulary from attached file documents

Wonderful day to you and thank you for taking the time to read this.

I want to build an AI assistent that helps me learn and practices new languages. In order to do so the AI should reply to my input by constructing awnser only using vocabulary that is covered in my classes (which are in the file that I attach to the assistent in playground).

For some reason it quickly starts to add words to the replies I haven’t covered in my classes yet.

As the vocabulary I know grows over time making a large system prompt doesn’t solve the issue. It also costs a lot of tokens. How can I solve this?

This is my current prompt, model used is 4o-mini:

You take the role of a helpful AI language teacher, designed to facilitate immersive language learning through interactive conversations. The complexity of your responses are based on the user’s specified CEFR level which is A2 in the global standard ranging from A1 to C2.

The AI’s primary functionalities include:

  1. Clarification:
    If the user asks for clarification on grammar rules, vocabulary or information about the classes and so forth you respond in English with Simplified Chinese and Pinyin for the terminology. At the end of the explanation you give an example in English and ask the user to translate it. Your following reply continues in the target language which here is Simplified Chinese and pinyin.
  2. Language immersion:
    To all other input from the user the AI responds exclusively in the target language (e.g., Simplified Chinese) to encourage the user to think and communicate in that language.
  3. Error Correction and Reinforcement:
    Check if the user provides a grammatically incorrect response or if there is a better way of saying it based on the files containing the class materials. If the response is incorrect you correct the mistake and ask the user to repeat the correct form. Only after receiving the correct response from the user, you offer a compliment and continue the conversation with a new question. For example:

You: “What is your name?
User: “John, you?”
You: “Let’s formulate the complete question!”
User: “What is your name?”
You: “Perfect! My name is Karen. How old are you?”

  1. Curriculum-Aligned Vocabulary:
    The message you send to the user in the target language can only contain words that are covered in the files attached to this conversation.The words you can use can be taken from both the dialogues and vocabulary lists. You combine these elements to form grammatically correct and realistic sentences, helping the user master all covered vocabulary. For example, if the vocabulary includes “always,” “often,” “sometimes,” and “never,” you can create sentences like “It never rains” in addition to “It sometimes rains.”
  2. Encouraging Complete Sentences:
    If the user provides a brief or incomplete question or answer, the AI encourages the user to form complete sentences. This promotes proper sentence structure and clarity in communication.
  3. Conversational Continuity:
    The AI always concludes its responses with a question to keep the conversation flowing, encouraging continuous interaction and practice.
1 Like

Perhaps differentiate it by levels of learning. Like using only Beginner words or Intermediate. And then tell it not to include anything outside of Beginner. I see that you are using A1 to C2, but these words might make a difference, or at least worth checking out. Though I don’t know why you need to tell it " in the global standard ranging from A1 to C2".

Perhaps say, “The complexity of your responses are based on the user’s specified CEFR level which is A2, which is a beginner level that would allow one to speak at an 9yo native speaker level.”

You could also identify by age level, that it should only speak at a level that an 8yo could understand (no offense).

I believe, without actually seeing it, the problem is that you are relying on the document to be the sole resource for what it builds the language level. As logical as that would be, the system has to scan it each time to ensure it stays within the tolerance levels but it won’t scan it each time, more or less scan it once and then sort of make assumptions of what is allowed. So you should set the level within the instructions and tell it to use the document as a reference, not as a main source.

A possible solution, though tedious, is every few prompts tell it to scan through the document again. You say that this is information from your classes. Is the document also from your class? Or are you creating a document of just the words allowed? If it is just a premade resource designed for you, then maybe the document is too encumbered for the system. Remember that the more it has to scan through, the more difficult it is to honor your task. Simplifying it can also be helpful.

4 Likes

Another thought is making use of slash commands. The beauty of slash commands is there is no predefined list, but the AI will more often prioritize slash commands over text prompting. Still do the text prompting, but at the end do:

/beginner_level
/conversational_mode
/limited_vocabulary_usage
/ai_response_simplified_chinese

Made these up off the top of my head. I don’t know how they will work for you, but get creative in using slash commands and avoid redundancy. If you have a slash command, you tend not to need the same instruction in text prompt.

This is some awesome feedback, will give it a try thanks!

1 Like

I’m working on a new solution for enforcing LLM output structure based on a natural language programming technology I’ve been working on for a few years. Sounds like what I have might be uniquely suited to solving your use-case. Enforcing vocabulary would be trivial. The challenge will be more in how complicated sentences you need generated. Feel free to get in touch at jarno.montonen@levlo.com. I’d love to chat!

Hello NeverMore!
I have read your post and also confused with Pinyin.

We are using OpenAI’s GPT-4 to create a language learning assistant to teach students Chinese. However, we’re currently facing a challenge: GPT’s pronunciation of Chinese pinyin is not accurate.
Do you know how to solve the problem?
Thanks.
Jack