One vs two shot prompting for search integration

We are building a chatbot at Discourse.

Example prompt:

system: You are a helpful Discourse assistant, you answer questions and generate text.
You understand Discourse Markdown and live in a Discourse Forum Message.
You are provided with the context of previous discussions.

You live in the forum with the URL: http://127.0.0.1:4200
The title of your site: Discourse
The description is:
The participants in this conversation are: gpt3.5_bot, sam
The date now is: 2023-05-25 00:11:54 UTC, much has changed since you were trained.

You can complete some tasks using !commands.

NEVER ask user to issue !commands, they have no access, only you do.

!categories - will list the categories on the current discourse instance
!time RUBY_COMPATIBLE_TIMEZONE - will generate the time in a timezone
!search SEARCH_QUERY - will search topics in the current discourse instance
!summarize TOPIC_ID GUIDANCE - will summarize a topic attempting to answer question in guidance
!tags - will list the 100 most popular tags on the current discourse instance
!image DESC - renders an image from the description (remove all connector words, keep it to 40 words or less)
!google SEARCH_QUERY - will search using Google (supports all Google search operators)

Discourse topic paths are /t/slug/topic_id/optional_number

Discourse search supports the following special filters:

user:USERNAME: only posts created by a specific user
in:tagged: has at least 1 tag
in:untagged: has no tags
in:title: has the search term in the title
status:open: not closed or archived
status:closed: closed
status:archived: archived
status:noreplies: post count is 1
status:single_user: only a single user posted on the topic
post_count:X: only topics with X amount of posts
min_posts:X: topics containing a minimum of X posts
max_posts:X: topics with no more than X posts
in:pinned: in all pinned topics (either global or per category pins)
created:@USERNAME: topics created by a specific user
category:CATEGORY: topics in the CATEGORY AND all subcategories
category:=CATEGORY: topics in the CATEGORY excluding subcategories
#SLUG: try category first, then tag, then tag group
#SLUG:SLUG: used for subcategory search to disambiguate
min_views:100: topics containing 100 views or more
max_views:100: topics containing 100 views or less
tags:TAG1+TAG2: tagged both TAG1 and TAG2
tags:TAG1,TAG2: tagged either TAG1 or TAG2
-tags:TAG1+TAG2: excluding topics tagged TAG1 and TAG2
order:latest: order by post creation desc
order:latest_topic: order by topic creation desc
order:oldest: order by post creation asc
order:oldest_topic: order by topic creation asc
order:views: order by topic views desc
order:likes: order by post like count - most liked posts first
after:YYYY-MM-DD: only topics created after a specific date
before:YYYY-MM-DD: only topics created before a specific date

Example: !search @user in:tagged #support order:latest_topic

Keep in mind, search on Discourse uses AND to combine terms.
You only have access to public topics.
Strip the query down to the most important terms.
Remove all stop words.
Cast a wide net instead of trying to be over specific.
Discourse orders by relevance, sometimes prefer ordering on other stuff.

When generating answers ALWAYS try to use the !search command first over relying on training data.
When generating answers ALWAYS try to reference specific local links.
Always try to search the local instance first, even if your training data set may have an answer. It may be wrong.
Always remove connector words from search terms (such as a, an, and, in, the, etc), they can impede the search.

YOUR LOCAL INFORMATION IS OUT OF DATE, YOU ARE TRAINED ON OLD DATA. Always try local search first.

Commands should be issued in a single assistant message.

Example sessions:

User: echo the text ‘test’
GPT: !echo test
User: THING GPT DOES NOT KNOW ABOUT
GPT: !search SIMPLIFIED SEARCH QUERY

user: user: please echo 1
assistant: !echo 1
user: sam: what are the 3 most recent posts by sam?

Since it integrates external data, I need to first triage the user request to see if it needs a “special command” or can simply be answered from the previous conversation.

I have had a reasonable amount of luck grounding this on GPT 4, but try as I may I simply can’t get it to ground properly on GPT 3.5.

My ideal state is:

PROMPT → User asks question → GPT decides if it should issue a !command or just reply

But in reality I seem to need to add a “triage” step, where I first determine if I need extra context, then inject that context and respond.

The downside of a “triage” approach is that I am going to need 2 round trips to OpenAI for every single user response. The upside is that I can ground stuff much better and trim down our system prompt by splitting it.
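
Concretely, the two-round-trip flow would look something like the sketch below (using the openai Python package; the triage prompt, the NONE convention and the model choice are illustrative, not what we ship):

from openai import OpenAI

client = OpenAI()

TRIAGE_SYSTEM = (
    "You translate forum questions into !commands. "
    "Reply with a single !command, or the word NONE if no command is needed."
)

def triage(user_message):
    # Round trip 1: decide whether a !command is needed at all.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": TRIAGE_SYSTEM},
            {"role": "user", "content": user_message},
        ],
    )
    answer = resp.choices[0].message.content.strip()
    return None if answer == "NONE" else answer

def reply(user_message, command_output=None):
    # Round trip 2: generate the reply, grounded in the command output if we have it.
    messages = [{"role": "system", "content": "You are a helpful Discourse assistant."}]
    if command_output:
        messages.append({"role": "system", "content": "Context:\n" + command_output})
    messages.append({"role": "user", "content": user_message})
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return resp.choices[0].message.content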

I was wondering what the community thinks about this problem and what prior art there is for me to look at (the Langchain source code being the first port of call, for sure).

Obligatory GPT-4 based response

It’s great to see that you’re working on a chatbot with Discourse and OpenAI. I understand your concern about the extra “triage” step, which could lead to increased response time and cost. However, this approach does have its benefits, such as better grounding and more accurate responses.

One possible solution to minimize the number of round trips is to use a more advanced model that can handle both the decision-making and the response generation in a single step. However, this may not be available at the moment, and you might need to wait for future improvements in the models.

Another option is to implement some basic heuristics or rules in your application to determine if a !command is needed before sending the query to OpenAI. This could help reduce the number of round trips for certain cases. Keep in mind that this approach may not be perfect and could still require improvements over time.

As for prior art, you’ve already mentioned Langchain source code, which is a great starting point. Additionally, you can explore other chatbot implementations and see how they handle similar challenges. You can also keep an eye on OpenAI’s updates and new releases, as they might introduce models or features that can better handle this kind of task.

In summary, the “triage” approach might be necessary for now to ensure accurate responses, but you can explore alternative solutions to minimize its impact on response time and cost.

So I guess a good summary of the question is:

Can GPT-3.5 be used as a hybrid “decision making” and “response generation” model like GPT-4 can, or is this way too ambitious?

And… is grounding GPT-4 worth it?

5 Likes

Hi @sam.saffron

This is indeed one of the longest system messages I've ever read.

Assuming that PROMPT refers to the system message, this is an ideal flow. Experimenting with appending the PROMPT after the user message (instead of before) and then letting gpt-4 generate is also worth trying, as there is a difference.

IMO the system message is very large and has room for improvement and condensing.

Also, since you mentioned that triaging leads to accurate responses, there are other approaches to triaging as well. One would be to use embeddings for classification to identify the relevant commands to execute, along with other actions, and then use the results to form a prompt to pass to gpt-4 to generate the final message delivered to the user. This approach may look longer, but ideally it will be faster than two API calls to gpt-4.

Coming back to the system message, there’s another promising approach, which is to structure it as function/algorithm pseudo-code.

If the embeddings approach is used for decision making, then gpt-4 may not be required for that step, and generation can be done with gpt-3.5-turbo (or gpt-4 if a large context is required). Otherwise, you can keep experimenting with gpt-3.5-turbo to try to make the current approach work. However, it’s worth noting that the docs mention:

gpt-3.5-turbo-0301 does not always pay strong attention to system messages. Future models will be trained to pay stronger attention to system messages.

Yes, it is worth it, because the model does not know what it doesn’t know, as mentioned by Andrej Karpathy in his recent MS Build session.

There are also things in the system message that I’d advise against, IMO, but pointing them out would make this reply very long.

3 Likes

Thanks heaps sps.

This is extremely interesting to me; we already use embeddings (even here you can see the related topics).

But I wonder how do I go from:

tell me the last 3 things sam posted here

:arrow_double_down:

!search order:latest @sam

via the embedding path?

Am I leaning on embeddings to do a bit of “upfront” triage? That is, this vector is close to other vectors that are generally satisfied by issuing a !search?


On the experimentation front, with a triage step I am able to make GPT 3.5 useful and grounded enough; it completely changes the behavior of the bot. I’m also noticing quality is higher on GPT 4, even though cost is higher.

Regarding the system prompt, I hear you, I will trim it back and report here… maybe we need a dedicated topic for “help Sam trim down the system prompt”.

1 Like

Great!

Yes, and it will help bring the system message token count down considerably.

From what I gather from the system message in your original post, the problem is to translate the user message into an executable command structure that is supported by Discourse.

In this case it is <!command> <filter> <user(optional)>

Here’s an overview of what the process would look like.

  1. Use embeddings to find the most relevant <!command> and <filter> from their respective lists.

  2. Programmatically generate a very small system message with the most relevant <!command> and <filter> (along with their syntax) from the results of the previous step, prompting the model to generate a full command in the structure <!command> <filter> <user(optional)> using all the info provided, in response to the user’s message (see the sketch after this list).

  3. Append the original message from the user with "role":"user" and send this to the chat completions endpoint. IMO the reduced size of the system message, and virtually no decision making over which commands/filters to use, will enable better performance even on gpt-3.5-turbo. If it doesn’t, the gpt-4 API call will cost a fraction of the current one.
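
In Python, steps 2 and 3 could look roughly like this (assuming step 1 has already returned the matched command and filters; the prompt wording and function names are illustrative, not actual plugin code):

from openai import OpenAI

client = OpenAI()

def build_system_message(command, syntax, filters):
    # Tiny, programmatically generated system message: only the matched
    # command and filters, nothing else.
    return (
        "Translate the user's request into a single command.\n"
        f"Command: {command} - syntax: {syntax}\n"
        "Available filters:\n" + "\n".join(filters) + "\n"
        "Reply with the command only, no prose."
    )

def generate_command(user_message, command, syntax, filters):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": build_system_message(command, syntax, filters)},
            {"role": "user", "content": user_message},
        ],
    )
    return resp.choices[0].message.content.strip()

# generate_command(
#     "what are the 3 most recent posts by sam?",
#     "!search", "!search SEARCH_QUERY",
#     ["user:USERNAME: only posts created by a specific user",
#      "order:latest: order by post creation desc"],
# )  # hoped-for output: "!search @sam order:latest"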

Probably won’t need it if this approach works out. I’d be happy to help, but this particular project isn’t exactly straightforward generation, and the process would be much faster if testing is done after every change to see what works.

PS: I spent some time thinking about this problem, and since it has to do with structuring the model’s response, fine-tuning is also an alternative - but only if the above approach doesn’t work out.

1 Like

Spent some time experimenting on this.

Here’s an example output of returning relevant commands using embeddings.

@RonaldGRuckus is not wrong: if I had used the command names, the performance would have been abysmal. Instead, I mapped each command name and description to action strings and used those for embeddings.

The user messages without a reply didn’t score close enough to any of the action strings.

It’s not perfect but it can be improved to reach satisfactory performance.

Using this can even limit the functionality of the bot to our defined behavior and prevent unexpected/undesired behavior such as jailbreaks, abuse, etc. Doing so would be simple: if a command/action is returned by the code, the API call is made for generation; otherwise a simple message asks the user to rephrase their query.

2 Likes

With the embeddings, are you doing them locally? One curveball I have is that I am using Ruby here, so a lot of the libraries available in Python are much harder to consume. We do, though, have access to a vector DB for easy vector similarity. Leaning on ada here would add tons of latency. What embeddings did you test this with?

Thanks so much for exploring this @sps!

1 Like

The embeddings for the action strings, generated with "text-embedding-ada-002", are stored in a file on my machine and compared with the embedding of the user message.

Just needs a cosine similarity implementation.
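
A minimal sketch in plain Python (NumPy would work just as well):

import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

(Since "text-embedding-ada-002" vectors are normalized to length 1, the dot product alone gives the same ranking.)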

I generated embeddings for the following list of strings and mapped them to the commands:

['enumerate all categories',
 'generate the time in a timezone',
 'search all posts and topics',
 'generate a summary',
 'show most frequent meta tags',
 'render an image from the description',
 'google']
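
Put together, the lookup could look roughly like the sketch below (the command mapping, the similarity threshold and the file caching are illustrative assumptions, not exactly what I ran):

import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Action strings mapped to the commands they stand for (illustrative mapping).
ACTIONS = {
    "enumerate all categories": "!categories",
    "generate the time in a timezone": "!time",
    "search all posts and topics": "!search",
    "generate a summary": "!summarize",
    "show most frequent meta tags": "!tags",
    "render an image from the description": "!image",
    "google": "!google",
}

CACHE = Path("action_embeddings.json")

def embed(text):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def action_embeddings():
    # Embeddings for the action strings are computed once and cached to a file.
    if CACHE.exists():
        return json.loads(CACHE.read_text())
    vectors = {action: embed(action) for action in ACTIONS}
    CACHE.write_text(json.dumps(vectors))
    return vectors

def best_command(user_message, threshold=0.8):
    query = embed(user_message)
    best_action, best_score = None, 0.0
    for action, vector in action_embeddings().items():
        # ada-002 vectors are unit length, so the dot product is the cosine similarity.
        score = sum(x * y for x, y in zip(query, vector))
        if score > best_score:
            best_action, best_score = action, score
    # Below the threshold: no command, just reply (or ask the user to rephrase).
    return ACTIONS[best_action] if best_score >= threshold else None
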
2 Likes

I see, so in this case I need to make a call to the embedding service first for the prompt? (so it may be 3 calls to OpenAI to make a reply)

graph LR
  A[Get embedding for prompt via Ada] --> B[Check distance<br>via cosine similarity]
  B -- Close --> C[Get command from GPT 3.5]
  B -- Not Close --> D[Simply reply]
  C --> F[Run command locally to get context]
  F --> E[Feed data plus context to GPT 3.5]
  D --> E
1 Like

Yes, it requires a call to the OpenAI API to obtain embeddings for the user message. To our benefit, the response time for obtaining embeddings is minimal compared to the completion call.

The final API call with context can be avoided if the results are served to the user in a pre-defined format. Similarly, for the “not close” step, a default message can be sent as the reply, if that suits your design and functionality requirements and a conversational reply isn’t the main feature.

Both approaches have their own advantages and tradeoffs.

PS: Didn’t know we could do flowcharts here. This is nice.

1 Like

Correct. And the cost is ~1/600th the cost of a completion.

2 Likes

I procrastinated for 14 days … and … well… none of this is needed anymore thanks to the function calling feature :confetti_ball:

I am not quite ready to merge it, but it seems to work reasonably well even on GPT 3.5.
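
For anyone curious, a minimal sketch of what the function calling version can look like against the chat completions API (the search function schema here is illustrative, not the actual plugin code):

import json
from openai import OpenAI

client = OpenAI()

functions = [
    {
        "name": "search",
        "description": "Search topics on the current Discourse instance",
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {"type": "string", "description": "Simplified search query, stop words removed"},
                "username": {"type": "string", "description": "Only posts by this user"},
                "order": {"type": "string", "enum": ["latest", "latest_topic", "views", "likes"]},
            },
            "required": ["search_query"],
        },
    }
]

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "what are the 3 most recent posts by sam?"}],
    functions=functions,
    function_call="auto",  # the model decides whether to call search or just answer
)

message = resp.choices[0].message
if message.function_call:
    args = json.loads(message.function_call.arguments)
    # args would then drive a local Discourse search; its results go back to the
    # model in a follow-up "function" role message for the final, grounded answer
    print(args)
else:
    print(message.content)  # the model chose to answer directly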

1 Like

Definitely. The new gpt-3.5-turbo and its 16k version are way better than the 0301 version. I have another project where I had to use gpt-4 because of the complexity, and the new gpt-3.5-turbo model excels at the same task.

Function calling is indeed a game changer.

2 Likes