OpenAI lets you augment a user's prompt with a set of function descriptions so that the GPT model can decide which tool to use.
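For concreteness, a minimal sketch of what such a function description looks like in the Chat Completions `tools` payload (the tool name and fields here are hypothetical, not part of any real system):

```python
# Illustrative tool description in the OpenAI "tools" format.
# "get_weather" and its parameters are made-up examples.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    },
]

# The model sees the user prompt plus these descriptions and decides
# which tool (if any) matches the user's intent, roughly:
#   client.chat.completions.create(model=..., messages=..., tools=tools)
```

The benchmark question is then: given many such prompts and tool sets, how often does the model pick the right tool?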
I am looking for a benchmark dataset that would let me evaluate how well GPT matches user intent to those functions (as a baseline), and then compare my own system against that baseline on the same evaluation set.
Does anyone know of such a benchmark dataset?
Context:
In dialogue systems, accurately mapping a user's query to an appropriate action from a set of possible options is crucial. Given a user prompt p, the goal is to select the action from a set of actions A = {a_1, ..., a_n} that is most likely to be correct:

a* = argmax_{a_i ∈ A} P(a_i | p)
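The natural metric for such a benchmark is top-1 selection accuracy: the fraction of prompts for which the chosen action equals the gold action. A minimal sketch (the action names below are hypothetical):

```python
def selection_accuracy(predicted, gold):
    """Fraction of prompts where the predicted action matches the gold action."""
    assert len(predicted) == len(gold), "one prediction per prompt"
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical run: three prompts, the model picked the right action twice.
gold = ["book_flight", "get_weather", "set_alarm"]
pred = ["book_flight", "get_weather", "cancel_alarm"]
acc = selection_accuracy(pred, gold)  # 2 of 3 correct
```

Both the GPT baseline and my own system would be scored this way on the same evaluation set.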
I spent this past weekend searching for a benchmark dataset to evaluate and potentially expand the training dataset for the system I developed, but I couldn’t find anything relevant. It seems strange that such a dataset wouldn’t be publicly available.
The closest datasets I found are:
- ATIS: specific to airline travel, with only 26 intent categories.
- Intent Classification from Amazon Alexa and IoT: Limited to Amazon-specific contexts.
- CLINC150: 150 intent classes, but spanning only 10 general domains.