How do you teach end-users how to prompt engineer?

I have created a RAG application using the GPT-4 API. The database consists of state-wide real estate regulations and related publications, and the anticipated user base will be real estate students, agents, brokers, firms, publications, and lawyers. These will be busy people with (relatively speaking) limited computer skills, looking to quickly get a good answer to issues they are dealing with. What I mean to say is that most of these people will probably NOT have used ChatGPT or similar tools before.

That said, how do you quickly teach people like this the prompting skills they need to get the best answers possible? I think it’s safe to say that we all agree there really isn’t a one-prompt-fits-all model, even when you have a dataset that is focused in one area of expertise.

Surprisingly, the app does very well (when using GPT-4) at answering most questions. So far, most of the real estate exam questions I have fed it (when I was able to get it to understand the question) have been answered correctly.

And there is the rub: these exams are designed for humans capable of critical thinking, with a basic knowledge of the world in general and of real estate law in particular, which the AI doesn’t possess. It can only regurgitate what it is fed.

So, the person asking the question really needs to be fully aware of this: the manner in which the question is formed is almost as important as the question itself.

In the documentation I have created so far, I try to stress these points:

  • be as specific and clear as you can
  • do not assume the AI knows anything, because it doesn’t
  • the AI doesn’t think, it reacts
  • do not depend on keywords because the AI evaluates based upon context (meaning, ideas), not keywords
  • if you don’t get an answer the first time around, try re-phrasing, re-wording or even re-thinking your question

Any other suggestions? There must be more than a few of you out there with similar query systems providing technical documentation in various professional areas. I’m sure you are running up against the problem of very simple questions not being answered correctly, not because the information was not found, but because the AI didn’t understand the question.

How are you dealing with this?

Any and all responses appreciated.

3 Likes

What if I want to depend on keywords?

Lesson 1: The AI is always on a knife’s edge of not working. Change a single inconsequential word and the behavior can change completely.

Therefore, context warmup is important, along with fully defining the scope of an answer. Especially with a vector retrieval system, the user input must be robust beyond the system programming prompt in order to return information from the correct domain (and the database lookup must be contextually programmed to understand the state of the conversation).

The AI knows tons of stuff - it just doesn’t know it knows. It can’t think, reflect, or refine internally; the only thing it does is generate output, and it can’t ponder that output until it has been produced.

Users should also be trained on hallucination, especially in professional applications like this, where Australian and UK real estate knowledge and internet randos will be intermingled with the knowledge augmentation. Do not trust the convincing AI.

The big user failing, one that can often be seen on this forum, is instructions or specifications that a human also couldn’t figure out how to answer.

3 Likes

Hah! You’re telling me?

Add one word, and everything changes.

The question remains, how do we explain this to the end-user for whom RAG means something they wash dishes with or wrap around their heads?

What does this mean?

Yes, but how do you explain this to the user? How do you tell someone just looking for a simple answer to a simple question, “your input must be robust beyond the system programming prompt”?

My tip has always been to pretend you’re a kindergarten teacher talking to children. Don’t use big words. Use simple sentences. Write clearly. Don’t assume it can read your mind. Don’t use complicated formatting.

Don’t treat it like you’d treat a robot. It doesn’t really understand computer-speak, and you shouldn’t give it formatting like JSON and Markdown unless you have to. Some people use markers like START and END; don’t do that.

Also, something completely unintuitive: LLMs are bad at numbers. So assume that the child has dyscalculia too.

3 Likes

Warmup: preloading the context with the topic background, asking general and progressively more specialized questions that serve as multi-shot examples, before asking the ultimate question or proceeding to the task.

For example, you want to program using Vue.js? Get the AI to describe the framework, outline the modules and libraries you’ll be using, and enhance that knowledge yourself in the conversation history before just dumping your code block and asking for an improvement.

You want to ask exam questions about title law? Ask questions about the framework of local law and administrative rules, guarantees, encumbrances, and diligence before going right to the quiz.
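In API terms, the warmup is just extra turns of conversation history built up before the real question. A minimal sketch, assuming the current openai Python client (the model name and questions are invented placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Questions that build up domain context before the real ask (invented examples).
warmup_questions = [
    "In general terms, what counts as an encumbrance on real property?",
    "How do recorded encumbrances affect title guarantees?",
]

messages = [{"role": "system", "content": "You are a real estate law study assistant."}]

for q in warmup_questions:
    messages.append({"role": "user", "content": q})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# Only now ask the real question, with the topic already warmed up in history.
messages.append({"role": "user", "content": "Exam question: Which of the following is NOT an encumbrance? ..."})
final = client.chat.completions.create(model="gpt-4", messages=messages)
print(final.choices[0].message.content)
```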

2 Likes

Yeah, that’s tough on it. It can still handle it, especially GPT-4, but it seems to get confused more easily that way.

Note that the AI actually converts the JSON into human language at the start so it can process it properly. It seems to do this with things it has trouble understanding, which is probably a hint about what format it prefers.

I usually replace START and END with something like ===========, which doesn’t carry the kind of meaning that START/END does.

2 Likes

So do I.

Thanks @smuzani and @_j ! Great ideas!

I think the overall goal would be to not need to teach users how to craft a prompt.

Think back to the early days of search. To get really good results you needed to make heavy use of Boolean operators and all sorts of filtering options[1].

Now, the vast majority of users can find what they’re looking for with deeply broken syntax, misspelled words, and factual inaccuracies in their query.

My advice would be to invisibly handle the necessary prompt-crafting on the backend to make for a better user experience.


  1. Note: Google still offers Advanced Search ↩︎
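In code, that invisible handling can be as small as one cheap rewrite call in front of the existing pipeline. A minimal sketch, assuming the current openai Python client; rag_answer is a hypothetical stand-in for whatever retrieval-and-answer flow already exists:

```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(raw_input: str) -> str:
    """Silently turn a messy user query into a clear, specific question."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's text as one clear, specific, self-contained "
                "question about real estate regulations. Fix spelling and grammar. "
                "Output only the rewritten question."
            )},
            {"role": "user", "content": raw_input},
        ],
    )
    return resp.choices[0].message.content

def answer(raw_input: str) -> str:
    cleaned = rewrite_query(raw_input)  # the user never sees this step
    return rag_answer(cleaned)          # rag_answer: hypothetical existing RAG flow
```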

3 Likes

I’ve been trying to make it easier for end users without prompt whispering knowledge…

https://community.openai.com/t/a-tale-of-two-prompts-wake-up-in-morning/326888/1

I know, I know… but, it’s extremely difficult when you have large, complex datasets with hundreds and eventually thousands of users with as many styles of posing questions.

One thing I have done is to drill a question down to its core “concept”. Another is the creation of a “standalone question”, which combines the concept with the chat history. I’ve also done some work summarizing related texts in the background, with the end result of providing more context to accompany the retrieved documents.
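For anyone who hasn’t built that step, this is roughly what the standalone-question rewrite looks like; a sketch only, assuming the current openai Python client:

```python
from openai import OpenAI

client = OpenAI()

def standalone_question(chat_history: list[dict], followup: str) -> str:
    """Rewrite a follow-up so it can be answered with no prior context."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in chat_history)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": (
                "Given a conversation and a follow-up question, rewrite the "
                "follow-up as a single standalone question that needs no prior "
                "context. Output only the question."
            )},
            {"role": "user", "content": f"Conversation:\n{transcript}\n\nFollow-up: {followup}"},
        ],
    )
    return resp.choices[0].message.content
```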

But trying to anticipate every single possibility is just far easier said than done. Not to mention the additional cost per query.

I mean, I see your point. Trying to teach complex Boolean operations to most folks is a losing proposition. But I think it helps to continually hammer the idea into users: this is an AI, it’s not human; keep your questions clear, specific, and detailed. Like was said earlier, pretend you are talking to a child in kindergarten – with dyscalculia. Over time I think this will sink in with at least some users, certainly the ones who use it the most. They won’t be perfect, but they will at least have a fighting chance at getting better answers.

Getting different output from gpt-turbo and making “prompt engineering” easy for end users…

1 Like

Yes.

I drill each question down to its core “concept”. Another step is the creation of a “standalone question”, which combines the concept with the chat history. I’ve also done some work summarizing related texts in the background, with the end result of providing more context to accompany the retrieved documents.

But, I have found that sometimes you’ve got to let the user enter his own prompt in his own way, with the details the AI needs to find the correct response, especially when the user knows way more about the documents being retrieved than the AI or you do.

1 Like

Yeah, it’s definitely about finding a balance, I think.

I think we’ll see even more specialized tools rather than Swiss-Army-knife, this-does-everything tools…

One thing you might consider: use gpt-3.5-turbo as your intermediary, with a bunch of embedded examples of bad/weak prompts and corresponding good/strong prompts.

Basically, take whatever their input is, create an embedding for it, search your vector DB for example prompts similar to the user’s, and inject them as few-shot examples to gpt-3.5-turbo with a good system message directing it to act as a prompt-editor.
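A sketch of that flow, assuming the current openai Python client; find_similar_examples is a hypothetical nearest-neighbor lookup in your vector DB that returns (weak prompt, strong prompt) pairs:

```python
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def improve_prompt(user_input: str) -> str:
    # find_similar_examples is hypothetical: a nearest-neighbor lookup in your
    # vector DB that returns (weak_prompt, strong_prompt) pairs.
    examples = find_similar_examples(embed(user_input), k=3)

    messages = [{"role": "system", "content": (
        "You are a prompt editor. Rewrite the user's question so it is clear, "
        "specific, and answerable from real estate regulations. Output only "
        "the improved question."
    )}]
    for weak, strong in examples:  # inject the retrieved pairs as few-shot examples
        messages.append({"role": "user", "content": weak})
        messages.append({"role": "assistant", "content": strong})
    messages.append({"role": "user", "content": user_input})

    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return resp.choices[0].message.content
```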

As users continue to use your system, keep adding their bad prompts and the improved prompts into your database. So, the more they use it the smarter it gets.

Periodically, you might want to manually review/edit some of the pairs in order to maximize example quality, but I think it’s doable.

I’d probably need to see some examples of the inputs your users are submitting and what ideal inputs would look like to give any more advice.

1 Like

In actuality, so far in testing with a very small group, gpt-4 has been performing remarkably well. But this is a very interesting idea. I do keep track of all queries and responses, although I don’t currently have a way of telling which are good and which are bad – though the bad ones are normally the ones that could not be answered, and they all have similar responses based upon the system message directive.

This sounds like an interesting add-on optional feature – one that would probably be even more useful for gpt-3.5-turbo as the query model.

This is my current query flowchart:

Where would you insert this prompt-editor query?

1 Like

I would put it right at the beginning—as soon as they submit the question.

First, I would try to figure out what their goal is, e.g. what do they really want to know, etc.—similar to the step where you try to identify the “concept.”

Then, once you have that, I would prompt the model with something like:

system: You are a language model designed to help users ask clear, concise, meaningful questions which are more likely to be answered correctly the first time, without requiring follow-up questions.

[Perhaps some details about the scope of possible questions, what the service is, etc here]

When a user submits a question you will distill it down to its most direct, concise form and identify the user’s ultimate goal.

[Perhaps a few-shot example here.]

user: Q

Then, once you get the response, which should be a “better” form of the question, you could do a few things.

  1. You could continue as you have been, but with a better question.
  2. You could use the cheaper gpt-3.5-turbo to generate a plausible answer, generate an embedding for the plausible answer, and use that to identify documents for retrieval. (This is HyDE, and it has been shown to dramatically improve the quality of retrieval in many settings; see the sketch after this list.)
  3. Get an embedding of the new question (or use extracted keywords) to retrieve some documents, then use gpt-3.5-turbo to generate a hypothetical document response, then embed that response to pull (hopefully better) documents into context, then generate the final response with gpt-4.
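Option 2 in sketch form, assuming the current openai Python client; search_documents is a hypothetical nearest-neighbor query against your document store:

```python
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(question: str, k: int = 5):
    # 1. Let the cheap model write a plausible (possibly wrong) answer.
    hypothetical = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer as if you were a real estate law reference. Guessing is fine."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer rather than the raw question.
    vector = client.embeddings.create(
        model="text-embedding-ada-002", input=hypothetical
    ).data[0].embedding

    # 3. search_documents is hypothetical: nearest-neighbor search over the corpus.
    return search_documents(vector, k=k)
```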

Another approach you might take is to do some initial probing of your users.

  1. Question comes in
  2. Use the cheap model to generate what you consider a higher quality question.
  3. Respond back to the user something like, “If I understand correctly, you want to know [better question]. Is that correct?”
  4. If the user corrects the model, use the new information to generate a new, better question. [Iterate 3 and 4 if you like, but one Q back to the user should be sufficient.]
  5. Continue as before or use one of the HyDE methods I described above.

You could even make step 3 optional: either ask the model to determine whether it needs to ask for verification, or do a keyword/embedded-question/HyDE document retrieval first and check whether the best matches fall below some threshold, so the model only verifies the enhanced question when there aren’t any high-confidence retrievals.

The vast majority of these processes can be done with gpt-3.5-turbo and an embedding model, so they’re very cheap by comparison. In fact, I think you might find that you don’t even need gpt-4 in most cases. If you happen to get one or more very strong matches for your retrievals, you might find gpt-3.5-turbo can generate a response on par with gpt-4.

Then you’d only need to bust out the expensive model when there is a complicated question with weaker retrievals from disparate documents and some reasoning might be required.
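That routing decision can be tiny. A sketch, where scored_docs holds (similarity, text) pairs from the vector search and 0.85 is an arbitrary threshold you’d tune:

```python
def pick_model(scored_docs: list[tuple[float, str]], threshold: float = 0.85) -> str:
    """Route to the cheap model when retrieval looks strong, gpt-4 otherwise."""
    best = max((score for score, _ in scored_docs), default=0.0)
    return "gpt-3.5-turbo" if best >= threshold else "gpt-4"
```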

In any event, if you have any means of determining the quality of responses (polling the user for instance) you can add the new Q/A pair to your vector DB for retrieval later.

1 Like

Thanks. Looks like I’ve got a little reading up to do, but might be able to work something like this in. It was the reason I designed the query/response log in the first place, although my thought was to use it to fine-tune once that became possible for gpt-3.5/gpt-4. But, embeddings will work also!

Good idea.

1 Like

I completely agree with your points, SomebodySysop. In my experience, the best way to teach is often to show. Therefore, creating some tutorial videos demonstrating how to effectively prompt the AI could be extremely helpful for your user base. These videos could provide real-time examples of how to phrase questions, rephrase if necessary, and get the most out of the AI. This could be a valuable resource for your users, especially those who may not be as tech-savvy.

1 Like

How can you guide the user to get the best experience from your fine-grained knowledge retrieval system? I have seen one of your videos, and it’s a really pragmatic approach to handling knowledge in the context of LLMs.

But at the end of the day the issue is that a language model has a very broad API, technically speaking. The user can still go ahead and ask about recipes or anything else (hopefully domain related).

Now, in fact, the same problem can arise with every other product and service, but usually from the perspective of the developer adding more and more features until the core value is hidden behind a large pile of options.

In both cases the solution is often to reduce choices by design. Imagine three buttons: explore topic, detailed insight, create connection. And assume that most users will have one of these three problems that they want to solve using your service.
Then you would take whatever input they provide, categorize it, and call a function to get the matching prompt that you know will get the best results for that type of question, then append the user query.
You have effectively created the buttons.
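A sketch of that categorize-then-template step, assuming the current openai Python client; the three labels and templates here are just the hypothetical buttons from above:

```python
from openai import OpenAI

client = OpenAI()

# One known-good prompt template per "button" (hypothetical examples).
TEMPLATES = {
    "explore topic":     "Give a structured overview of this topic for a newcomer: {q}",
    "detailed insight":  "Answer precisely, citing the relevant regulation: {q}",
    "create connection": "Explain how the things mentioned here relate to each other: {q}",
}

def route(user_input: str) -> str:
    label = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": (
                "Classify the request as exactly one of: explore topic, "
                "detailed insight, create connection. Output only the label."
            )},
            {"role": "user", "content": user_input},
        ],
    ).choices[0].message.content.strip().lower()

    template = TEMPLATES.get(label, TEMPLATES["detailed insight"])
    return template.format(q=user_input)
```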

Of course this concept would have to be adjusted to the specifics of your case. This is a generalized solution approach applied to the case of LLM services.

Hope this helps!

2 Likes

Mea Culpa! Look at the examples I posted here: How do you teach end-users how to prompt engineer? - #3 by SomebodySysop

First off, you can’t even post a multiple-choice question in my default configuration, because my “concepts” logic (my version of a HyDE implementation) will rewrite it.

The only way to make this possible is to give the user the option to “turn off” the “concepts” logic. One more in the pile of options.

But note the subtle difference between the prompt that failed to find an answer and the prompt that succeeded (though barely). As a human being with many years of training in critical thinking and a good sense of the world around me, it was pretty obvious that “appraisal” is really jargon for “appraisal report”. The LLM doesn’t have this ability.

So, this sort of thing could repeat itself in a thousand more situations, with hundreds if not thousands (if we are lucky) of different users.

Because of the size and randomness of the possibilities, not to mention the technical nature of the content being analyzed, I just don’t see how prompting, in this use case, could be improved by a three-button solution.