gpt-3.5-turbo-1106 hangs occasionally

When I send the request below, it occasionally takes an exceptionally long time to complete. Setting a client-side timeout works around it, but it's hard to know how long the timeout should be. This does not happen for me with gpt-3.5-turbo-16k. Here's the request I'm sending (a sketch of the timeout workaround follows the payload):

{
    "model": "gpt-3.5-turbo-1106",
    "messages": [
        {
            "role": "user",
            "content": "Summarize the following text: Abstract “... Given an utterance and recent dialogue context containing past 3 utterances (wherever available), output ‘Yes’ if the utterance contains the small-talk strategy, otherwise output ‘No’. Small-talk is a cooperative negotiation strategy. It is used for discussing topics apart from the negotiation, to build a rapport with the opponent.” How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce S UPER -NATURAL I NSTRUCTIONS,1 a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions— training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones. Furthermore, we build Tk-I NSTRUCT, a transformer model trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples). Our experiments show that Tk-I NSTRUCT outperforms existing instruction-following models such as InstructGPT by over 9% on our benchmark despite being an order of magnitude smaller. We further analyze generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes. We hope our dataset and model facilitate future progress towards more general-purpose NLP models.2 • Input: “Context: … ‘That's fantastic, I'm glad we came to something we both agree with.’ Utterance: ‘Me too. I hope you have a wonderful camping trip.’” • Explanation: “The participant engages in small talk when wishing their opponent to have a wonderful trip.” Negative Examples • Input: “Context: … ‘Sounds good, I need food the most, what is your most needed item?!’ Utterance: ‘My item is food too’.” • Output: “Yes” • Explanation: “The utterance only takes the negotiation forward and there is no side talk. Hence, the correct answer is ‘No’.” • Input: “Context: … ‘I am excited to spend time with everyone from camp!’ Utterance: ‘That’s awesome! I really love being out here with my son. Do you think you could spare some food?’ ” • Expected Output: “Yes” Figure 1: An example task from S UP -NAT I NST adopted from Chawla et al. (2021). A successful model is expected to use the provided instructions (including task definition and demonstration examples) to output responses to a pool of evaluation instances. The NLP community has witnessed great progress in building models for generalization to unseen tasks via in-context instructions (Mishra et al., 1 S UPER -NATURAL I NSTRUCTIONS represents a supersized expansion of NATURAL I NSTRUCTIONS (Mishra et al., 2022b) which had 61 tasks. The dataset, models, and a leaderboard can be found at https:// instructions.apps.allenai.org. ♢ Co-first authors ♣ Co-second authors 2022b; Sanh et al., 2022; Wei et al., 2022) using large pretrained language models (Raffel et al., 2020; Brown et al., 2020). As remarkable as models like InstructGPT (Ouyang et al., 2022) are, the contribution of various design choices to their success is opaque. In particular, the role of supervised data has remained understudied due to limited data released by the corporate entities behind major models. 
In addition, it is nearly impossible for the research community to extend and re-train these gigantic models. Addressing these two chal- Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085 - 5109 December 7-11, 2022 ©2022 Association for Computational Linguistics Table 1: A comparison of S UP -NAT I NST to a few notable datasets in the field. We obtain the number of tasks, instructions, and task types of other datasets from their original paper. “–” indicates the fields are not applicable or unknown. Standards for categorizing task types vary across different datasets (see Fig. 2). *P ROMPT S OURCE does not provide task type annotation for all their tasks, for which we report only the 13 task types annotated for training T0 (Sanh et al., 2022) instead. Figure 2: Compared to other datasets, S UP -NAT I NST covers a more diverse range of task types. InstructGPT reports a very coarse categorization of their task types. Bubble size represents the number of tasks of each type in log scale. lenges necessitates the availability of large-scale public benchmarks of a broad range of NLP tasks and their instructions to facilitate developing and evaluating models that can generalize to unseen tasks. In this paper, we construct a meta-dataset (i.e., dataset of datasets; Triantafillou et al., 2019) that consists of a wide variety of NLP tasks with their instructions, and train a model that can perform a new task given the instruction, outperforming InstructGPT (which uses 16× more parameters). Our dataset, S UPER -NATURAL I NSTRUCTIONS (S UP -NAT I NST for short), is a large benchmark of 1,616 NLP tasks and their natural language instructions. It brings in a diverse variety of tasks—76 broad task types spanning 55 different languages. Each task is paired up with an instruction that consists of the task definition for mapping an input text to a task output and several examples for demon- strating the desired or undesired output (see Fig.1 as an example task). These tasks and their instructions are contributed by 88 NLP practitioners, in response to our public call. These contributions are consolidated after several rounds of peer-review and crowdsourced feedback to ensure quality. Having this diverse and large-scale data enables us to carefully split the tasks into training and test sets and systematically study how state-of-the-art methods perform on them. Table 1 and Figure 2 highlight properties of S UP -NAT I NST compared to relevant benchmarks, emphasizing the diversity of tasks and instruction types in our benchmark. Our model, Tk-I NSTRUCT, is a generative model for transforming task inputs given declarative in-context instructions (task definition or kshot examples). It is built by multi-task training of the T5 model (Raffel et al., 2020) over all the task instructions in our training set, and is eval- uated on unseen tasks in the test set. Interestingly, an 11B-parameter Tk-I NSTRUCT can outperform the 175B-parameter InstructGPT model by 9.9 ROUGE-L points when evaluated on 119 unseen English tasks, and the multilingual variant mTk-I NSTRUCT outperforms InstructGPT by 13.3 points on 35 non-English tasks (§6.1). According to human evaluation, Tk-I NSTRUCT generates responses at least as well as the ground truth for 77% of the testing instances (§6.2), confirming its strong generalization to unseen tasks. 
The compelling empirical performance of TkI NSTRUCT confirms the importance of super-sized meta datasets such as our S UP -NAT I NST to facilitate research towards generalizable NLP models. We conduct extensive analysis to understand the important factors for this generalization (§7). Our analysis shows that scaling up the diversity of training tasks and the model size are both important for strong generalization to unseen tasks. Finally, we estimate performance upper bounds, suggesting further room for improvement. Language instructions are a versatile way of defining goals, which is why they have been studied in the context of a variety of applications, such as instructions in grounded environments (Shridhar et al., 2020; Stepputtis et al., 2020; Min et al., 2022b; Weir et al., 2022) and database commands (Kim et al., 2020). Here, we focus on applications of instructions for general NLP tasks. Recent literature has been motivated by building models that are generalizable across a variety of NLP tasks, when prompted with either a few examples (Ye and Ren, 2021; Bragg et al., 2021) or language definitions (Efrat and Levy, 2020; Weller et al., 2020; Zhong et al., 2021; Mishra et al., 2022b,a; Parmar et al., 2022). Our work is related to the existing benchmarks in this line of work, as delineated in Table 1 along various dimensions. Our benchmark extends NAT I NST (Mishra et al., 2022b) with 26× more tasks and greater variety of task types (Fig. 2). While C ROSS F IT (Ye et al., 2021) focuses on benchmarking with a few in-context examples, our benchmark also offers task instructions. (Bach et al., 2022) is another benchmark of tasks and their language instructions (prompts). An important distinction between this benchmark and ours is the phrasing of the task definitions: while P ROMPT S OURCE task definitions are relatively concise, our task definitions are collected with the intention of providing a complete definition of each task and therefore are longer (24 tokens vs. 56 tokens on average; Table 1). More recently, B IG B ENCH (Srivastava et al., 2022) introduces a collection of 204 tasks and also provides short task descriptions and input prefixes that can be used for prompting LMs. With little overlap to our collection of tasks, they focus more on finding challenging tasks that can be used to test different behaviors of current LMs. Nevertheless, we believe that all these efforts in collecting different tasks as well as the task instructions are complementary, and the community will benefit from considering different benchmarks. Finally, the well-adopted InstructGPT model (Ouyang et al., 2022) is partially enabled by a large dataset of prompts that are collected via various synthetic data augmentation which, unfortunately, is not publicly available. Beyond cross-task generalization, our benchmark can also be used to study multi-task learning more broadly, which is a longstanding goal for AI (Caruana, 1997). Traditionally, this literature focuses on setups that involve evaluation on tasks that are observed during training (Collobert and Weston, 2008; Hashimoto et al., 2017). More recent studies show promise that large-scale multi-task learning can enable strong generalization to similar tasks via unified encoding (Khashabi et al., 2020; Xie et al., 2022) or better finetuning results on downstream tasks (McCann et al., 2018; Aribandi et al., 2022). Our proposed benchmark provides diverse tasks for studying multi-tasking at a massive scale. 
S UPER -NATURAL I NSTRUCTIONS is a metadataset (Triantafillou et al., 2019) consisting of a variety of NLP tasks (see Fig. 2a) and instructions that describe them in plain language. Instruction schema. All task instructions follow the same uniform schema (see Fig. 1) which is composed of the following parts: • D EFINITION defines a given task in natural language. This is a complete definition of how an input text (e.g., a sentence or a document) is expected to be mapped to an output text. • P OSITIVE E XAMPLES are samples of inputs and their correct outputs, along with a short explanation for each. • N EGATIVE E XAMPLES are samples of inputs and their incorrect/invalid outputs, along with a short explanation for each. The above schema is based on that of Mishra et al. (2022b), though it is simplified. See Appendix C for the comparison. Task instances. Given the instructions for each task, a model is expected to solve instances of that task. We use a unified format to organize the instances of all our tasks. More precisely, each instance consists of a textual input and a list of acceptable textual outputs. We limit the number of instances in each task to 6.5K to avoid an imbalance of instances between tasks. Benchmark collection. The benchmark was collected through a large community effort on GitHub.3 Tasks were collected and contributed by NLP practitioners who were either responding to our public invitation4 or students who were encouraged to contribute as part of their class project.5 Contributors were encouraged to be creative and source the tasks from several resources: (a) existing public NLP datasets, (b) available intermediate annotations in crowdsourcing experiments (e.g., paraphrasing questions or rating their quality during crowdsourcing a QA dataset), or (c) synthetic tasks that can be communicated to an average human in a few sentences (e.g., basic algebraic operations like number comparison, finding the longest palindrome substring, etc.). When using existing datasets or crowdsourcing annotations, contributors were encouraged to adopt the instructions used to create this dataset whenever available. This was done to ensure that the instructions were sufficient to define the tasks to average human readers. Tasks along with instructions and other meta information were contributed as JSON files via GitHub pull requests, which were reviewed by automated checks and peers. We had 88 contributors from diverse locations and backgrounds contribute to our repository. Quality control. Controlling the quality of this community-contributed data was done in several phases: (1) Upon creating a GitHub pull request of the proposed task, it immediately went through an automatic test. This process verified that the introduced file contained the expected fields and adhered to our desired properties (e.g., no duplicate 3 https:// github.com/ allenai/ natural-instructions https:// blog.allenai.org/ 9d3f24d5a9db CSE 576 “Topics in NLP” course, Arizona State Univ. 4 instances, the output labels are not heavily imbalanced, etc.) and (2) The proposed task was then peer-reviewed by 1–2 other expert contributors to ensure the clarity and sufficiency of instruction content. The review process was done iteratively until the reviewers were content with the quality of the proposed instruction. Specifically, reviewers were asked to verify that the instruction is clear and sufficient for an average language speaker to solve the underlying task (evaluation instances) while being grammatical, fluent, and concise. 
On average, the review of each GitHub pull request took about 4– 6 iterations over the span of multiple days before being merged. (3) Lastly, the added tasks were presented to crowdworkers in order to collect feedback on the quality of the provided instructions, such as typos, clarity, or other issues (details in §A). Subsequently, one of the authors used this feedback to improve the task definitions of the instances. This feedback was done only for English tasks, as finding high-quality crowdworkers in other languages is nontrivial (Pavlick et al., 2014). Diversity of tasks. Collecting tasks for S UP NAT I NST was carefully supervised to cover a wide variety of natural language understanding tasks, domains, and languages. To better understand this diversity, we comprehensively categorize tasks along three different dimensions: • TASK T YPE defines the nature of the mapping from instance inputs to outputs (e.g., question answering, classification, etc.). • L ANGUAGE indicates the language(s) of the instances. • D OMAIN indicates the domain(s) to which the text of the tasks belong to (e.g., politics, medicine, dialogue, etc.). These different measures of categorization can be used to study different senses of generalization. In our empirical studies (§5), we study generalization along the axis of task types. We refer the reader to Fig. 10 in the appendix for the distribution of tasks among different task types, languages, and domains. Statistics. Table 2 shows various statistics for the benchmark. In total, the dataset includes 1616 tasks and 5M instances. On average, each instruction is paired with 2.8 positive and 2.4 negative examples. The average definition length is 56.6 in words. Here we provide our recommended recipe for benchmarking generalization via S UP -NAT I NST. 5.1 Defining Generalization to Unseen Tasks. Each task t is defined via its natural language instruction It , and each task has a set of input/output instances (Xt , Yt ). A model M is expected to produce the output y, given the input x and the task instruction It : M (It , x) = y, for (x, y) ∈ (Xt , Yt ). In particular, we would like to evaluate model M on tasks that are not observed (i.e., their instances were not used for training M ). The only source of signal for learning the task at inference time is in-context instructions It that contain a definition and demonstration examples of the task. Tk-I NSTRUCT. We introduce Tk-I NSTRUCT, a model that is meta-trained on S UP -NAT I NST for solving tasks given their in-context instructions. Previous work has shown the effectiveness of such meta-training in improving model’s ability to do incontext learning with either prompts (Zhong et al., 2021; Sanh et al., 2022) or demonstration examples (Min et al., 2022a). Because of the large variety of tasks in S UP -NAT I NST, we are able to do this multi-task meta-training at a larger scale than before. We conduct our experiments and analysis based on the T5 model (Raffel et al., 2020). Since each instruction It consists of multiple elements as described in our instruction schema (§3), we map these elements to textual format and append them before the input instance. Fig. 8 in the appendix shows how we encode the full instructions. We study different combinations of these instruction elements in §7.2. By default, we will use our most effective instruction elements (i.e., task definition and two positive examples) unless otherwise specified. 
In the same manner, we train the multilingual variant mTk-I NSTRUCT based on the mT5 model (Xue et al., 2021). An Evaluation Split of Unseen Tasks. We split the large collection of tasks in S UP -NAT I NST into two subsets: one for evaluation and the other for supervision. For evaluation tasks, we fix a manuallyselected collection of 12 categories that represent 154 tasks. The large variety of tasks in S UP NAT I NST enables us to choose a diverse set of tasks for evaluation – such as those at word, sentence, and document levels, covering both classification and generation formats. Appendix G lists our evaluation tasks with examples for representative tasks. For an efficient evaluation, we sample a maximum of 100 instances for each task, which results in 15,310 testing instances in total. The remaining tasks are used for training models.6 Divided Tracks for English and X-lignual Tasks. S UP -NAT I NST consists of tasks across multiple languages, which enables evaluating the model’s generalization to unseen tasks not only in English but also in other languages. Therefore, we divide our evaluation tasks into two tracks: one for English cross-task generalization (119 tasks) and the other for cross-lingual cross-task generalization (35 tasks). To the best of our knowledge, this is the first study in cross-lingual cross-task generalization (i.e., generalization to unseen tasks in different languages). Fig. 11 and Fig. 12 in the appendix contain the evaluation tasks for each track. Evaluation Metrics. Due to the diversity of our tasks and the open-ended generation nature of our formulation,7 we adopt ROUGE-L (Lin, 2004) for reporting aggregated performance results. This is a soft string overlap metric that can be applied to a wide range of text generation tasks. We show that the ranking from this metric correlates well with accuracy for classification tasks in Appendix E. We also conduct a human evaluation in §6.2. 6 To avoid data leakage, we exclude tasks from the training set if they are sourced from the same dataset as any test task. This results in 757 training tasks for the English track and 1271 training tasks for the cross-lingual track. 7 Unlike Sanh et al. (2022) and Wei et al. (2022), who evaluate their models on classification tasks via option ranking (i.e., scoring the correct answer(s) higher than other candidate answers), we evaluate our models in an open-ended generation setting with no task-specific assumptions. We believe this is a more realistic measure of generalization to unseen tasks. Here we discuss a variety of baselines and competitive models for our target application. See Appendix D for implementation details. Heuristic baselines. We first evaluate the following heuristics to evaluate the possible shortcuts in the data. Copying Demo Output copies the output of a random demonstration example. Since we balance the labels for our test tasks, the performance of this baseline will roughly equal a random guess or a majority baseline for classification tasks. Copying Instance Input copies the given instance input. This strategy performs well on tasks where the target output largely overlaps with the input (e.g., question rewriting, grammar error correction). Off-the-shelf pretrained language models. We evaluate existing LMs that are not fine-tuned with instruction-specific data. Specifically, we evaluate the 11B-parameter T5 (Raffel et al., 2020) as a direct counterpart of Tk-I NSTRUCT. 
Due to the infilling pretraining objective of the original T5 model, it cannot continue text well. Therefore, we evaluate its “LM-adapted” version, which is further trained with a language modeling objective (Lester et al., 2021). Additionally, we evaluate GPT-3 (Brown et al., 2020), a 175B-parameter autoregressive LM that has shown remarkable ability in following demonstrations provided in its prompt. Instruction-tuned models. In addition to our TkI NSTRUCT (§4), we evaluate existing models that are fine-tuned to follow language instructions. In particular, we evaluate InstructGPT (Ouyang et al., 2022) which uses reinforcement learning to incorporate human preferences into a GPT-3 pretrained model, and T0 (Sanh et al., 2022) which finetunes T5 on a collection of task prompts in P ROMPTS OURCE (Bach et al., 2022). Upper bound estimates. We estimate an upper bound on models’ generalization to unseen tasks by fine-tuning an oracle model on the tasks’ labeled instances. Since this model observes the hidden instances of the evaluation tasks, it is, by definition, an estimated upper bound to our generalizationbased models. Specifically, we fine-tune a T5-11B model on the 119 English evaluation tasks, and a mT5-13B model on the 35 non-English tasks, with 1K random training instances per task, without overlap with the evaluation instances. Table 3: The overall performance of different methods on unseen tasks in the test set of S UP -NAT I NST (§6.1). We report ROUGE-L here as our aggregated metric. Models that leverage instructions show stronger generalization to unseen tasks. In particular, our model that is fine-tuned on a diverse set of tasks outperforms InstructGPT and T0 by a large margin. Figure 3: Human evaluation vs. ROUGE-L for several methods (§6.2). The trends of these two metrics are highly correlated with a Pearson coefficient of 0.998. Table 3 summarizes our overall benchmarking results. We use the same input encoding that contains the most effective instructional elements (task definition and two positive examples without the negative examples and explanations) for all the methods. To better understand models’ generalization to different tasks, we also break down the performance according to the task categories in Fig. 4. We refer the reader to Appendix H for more detailed analysis on each individual task. Instruction-tuning enables stronger generalization to unseen tasks. Generally instruction-tuned models perform better compared to their untuned LM counterparts (Tk-I NSTRUCT vs. T5-LM, InstructGPT vs. GPT-3) and heuristic baselines. This indicates models do learn to follow instructions by finetuning on instruction data, and this can generalize to new instructions for unseen tasks. T0 is an exception, which is only slightly better than Figure 4: Performance per evaluation task type. Tk-I NSTRUCT consistently performs better than other generalizationbased methods on all task types, while there is still a sizable gap compared to supervised training. T5-LM. We suspect this is because the style of prompting in T0’s training data is very different from our style of instructions. Our Tk-I NSTRUCT outperforms InstructGPT. Our Tk-I NSTRUCT and mTk-I NSTRUCT models, which are trained with a variety of tasks, generalize best to unseen tasks for both English and non-English tasks in all evaluation task categories. InstructGPT also shows a great extent of generalization to our evaluation tasks. 
However, we want to note it is not clear if InstructGPT’s training data overlaps with our evaluation tasks since their data is unavailable. There is a sizable gap for improvement. Despite the impressive performance of current models, there is a sizable gap between the generalization of instruction-based models and the supervised training approach, leaving more room for improvement. 6.2 Human Evaluation For language generation tasks, automatic metrics are only an approximation of human judgments; we conduct a human evaluation to confirm the findings so far. Specifically, we ask crowdworkers to indicate if they prefer the predicted answer by the model or the ground truth outputs for each instance with ties being allowed (see Appendix B for details). The resulting human evaluation metric indicates how often model predictions were rated as at least as good as our ground truth labels. The theoretical upper bound of this metric is 100% when the model is rated at least as good as the ground truth for all the instances. The results of human evaluation (shown in Fig. 3) align quite well with our automatic metrics and confirm the human-perceived quality of our models. We conduct further analysis to understand the important factors for models to generalize across tasks. Due to the computational cost, this analysis is done on the English track and using the T5-3B checkpoint, except for the experiments on model sizes. 7.1 We study Tk-I NSTRUCT’s generalization performance with respect to three scaling factors: the number of training tasks, the number of instances per task, and the model sizes. Fig. 5 presents the performance change by scaling each of them. More observed tasks improve the generalization. We fine-tune Tk-I NSTRUCT with different numbers of tasks that are randomly sampled from the whole training set (Fig. 5a). The model generalization performance grows log-linearly8 as we increase the set of tasks used for training. Previous work (Mishra et al., 2022b; Sanh et al., 2022; Wei et al., 2022) has made similar observations on a much smaller scale, while we show that this trend holds even with 757 diverse training tasks. A large number of training instances do not help generalization. We then vary the number of instances per task that are used for finetuning (Fig. 5b). While the conventional wisdom in supervised learning is that more training instances usually helps (Banko and Brill, 2001; Sun et al., 2017; Hestness et al., 2017), in our setup, the model’s performance saturates when only 64 instances per task are used for training. A large number of training instances would instead lead to longer training time and risk overfitting to the training tasks. 8 A linear function of an exponential increase of parameters, i.e., growth at a constant multiplicative rate. Figure 5: Scaling trends of models performance (§7.1) as a function of (a) the number of training tasks; (b) the number of instances per training task; (c) model sizes. x-axes are in log scale. The linear growth of model performance with exponential increase in observed tasks and model size is a promising trend. Evidently, the performance gain from more instances is limited. Testing Encoding → Table 4: Performance (ROUGE-L) of models trained and evaluated with various encodings. Diagonal numbers (underlined) represent performances of models trained and evaluated with the same instruction encoding. Each encoding is a combination of the elements in the instructions (Fig. 1). 
Task ID is a short string composed of dataset name and task category; Def represents the task definition; Pos (k) represents k positive examples; Neg (k) represents k negative examples; Expl represents explanation. These results (a) show the gains from various instructional elements, and (b) indicate surprising reliability of the models to various input encoding. A model trained with definition and positive examples (i.e., the last row) remains robust for different encodings. Tuning larger models with instructions consistently lead to gains. We study the effect of model scaling by initializing Tk-I NSTRUCT from different sizes of pretrained T5 checkpoints, including the small, base, large, xl and xxl sizes (Fig. 5c). We found that increasing the model sizes consistently bring significant improvement (log-linearly with parameter size). This finding contradicts the claim in Xu et al. (2022) that “model size has little impact on performance with an extremely large number of tasks.” Combining Fig. 5(a) and Fig. 5(c), one can create a correspondence between model size and task size. For example, a T5-large model trained with 757 tasks can achieve comparable performance (48.0 ROUGE-L) to the T5-3B model trained with 128 tasks (48.4 ROUGE-L), indicating that increasing the diversity of training tasks is an alternative to scaling model sizes. We evaluate the performance of Tk-I NSTRUCT under different instructional elements. Benefit of different instructional elements. As shown in Fig. 1, S UP -NAT I NST provides multiple elements for instructing a task. We train multiple models with different combinations of these elements. The diagonal cells of Table 4 show the performance of our models when trained and evaluated on a particular instruction encoding. Based on the diagonal numbers, including the task definition consistently helps the model to generalize better. Moreover, combining the task definition with positive demonstration examples yields further improvement."
        }
    ]
}
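
For reference, this is roughly how I'm applying the timeout workaround. It's a minimal sketch using the official Python SDK; the 60-second timeout, the retry count, and the placeholder text variable are assumptions I picked for illustration, not values I can vouch for.

# Minimal sketch of the timeout workaround (openai Python SDK >= 1.0).
# The 60 s timeout and 2 retries are arbitrary placeholder values.
from openai import OpenAI, APITimeoutError

client = OpenAI(timeout=60.0, max_retries=2)  # client-wide defaults

text_to_summarize = "..."  # the long paper text from the request above

try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[
            {
                "role": "user",
                "content": f"Summarize the following text: {text_to_summarize}",
            }
        ],
        # A per-request override is also possible via client.with_options(timeout=...)
    )
    print(response.choices[0].message.content)
except APITimeoutError:
    # The request hung past the deadline; retry or fall back here.
    print("Request timed out")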

Here’s the response, which only came back after about 4 minutes:


I have this issue too.
What’s worse is that I get billed for it.

I also have this issue. I’m making several calls to several bots, but at some point one of them just stops. I check the server logs and no errors are thrown; the request is just sitting there. No idea what’s going on.
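
In case it helps with debugging, something like the sketch below is what I’d try next: give every call a hard deadline and log the elapsed time, so a hung request at least surfaces as a timeout in the logs instead of silently sitting there. This assumes the Python SDK; the 120-second deadline and the call_bot helper are purely illustrative.

import logging
import time

from openai import OpenAI, APITimeoutError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("bot-calls")

client = OpenAI()

def call_bot(messages, timeout_s=120.0):
    """Illustrative helper: one chat call with a hard deadline and timing logs."""
    start = time.monotonic()
    try:
        # with_options applies a per-call timeout without changing client defaults
        response = client.with_options(timeout=timeout_s).chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=messages,
        )
        log.info("call finished in %.1fs", time.monotonic() - start)
        return response.choices[0].message.content
    except APITimeoutError:
        log.warning("call hung for more than %.0fs and was aborted", timeout_s)
        raise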

I’m assuming this also started for you guys after DevDay?

More info: here’s another long-running request: