[Paper] Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

Abstract

Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4’s capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model’s out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.

Summary and Conclusions

We presented background, methods, and results of a study of the power of prompting to unleash top-performing specialist capabilities of GPT-4 on medical challenge problems, without resorting to special fine-tuning or relying on human specialist expertise for prompt construction. We shared best practices for evaluating performance, including the importance of evaluating model capabilities on an eyes-off dataset. We reviewed a constellation of prompting strategies and showed how they could be studied and combined via a systematic exploration. We found a significant amount of headroom for boosting specialist performance by steering GPT-4 with a highly capable and efficient prompting strategy.

We described the composition of a set of prompting methods into Medprompt, the best-performing prompting strategy we found for steering GPT-4 on medical challenge problems. We showed how Medprompt can steer GPT-4 to handily top existing leaderboards for all standard medical question-answering datasets, including the performance of Med-PaLM 2, a specialist model built via fine-tuning with specialist medical data and guided with handcrafted prompts authored by expert clinicians. Medprompt unlocks specialist skills on MedQA, delivering significant gains in accuracy over the best-performing model to date and surpassing 90% on the benchmark for the first time.

During our exploration, we found that GPT-4 can be tasked with authoring sets of custom-tailored chain-of-thought prompts that outperform hand-crafted expert prompts. We pursued insights about the individual contributions of the distinct components of the Medprompt strategy via ablation studies that demonstrate the relative importance of each component. We set aside eyes-off evaluation case libraries to avoid overfitting and found that the strong results achieved by Medprompt are not due to overfitting. We explored the generality of Medprompt by studying its performance on a set of competency evaluations in fields outside of medicine, spanning electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology. The findings in these disparate fields suggest that Medprompt and its derivatives will be valuable in unleashing specialist capabilities of foundation models for numerous disciplines. We see further possibilities for refining prompts to unleash specialist capabilities from generalist foundation models, particularly in adapting the general Medprompt strategy to questions that are not multiple choice. For example, we see an opportunity to build on the Medprompt strategy of using GPT-4 to compose its own powerful chain-of-thought examples and then employing them in prompting. Research directions moving forward include further investigation of the abilities of foundation models to reflect on and compose few-shot examples and to weave these into prompts.
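To make the self-generated chain-of-thought step concrete, here is a minimal Python sketch of how a development set could be turned into a library of model-authored exemplars with a label-based correctness filter. This is an illustration under stated assumptions, not the authors' implementation: `call_gpt4` and `embed` are hypothetical stand-ins for a chat-completion client and a sentence-embedding model, and `Exemplar`, `COT_TEMPLATE`, and `extract_answer_letter` are names introduced here for clarity.

```python
import re
from dataclasses import dataclass

# Hypothetical sketch of Medprompt-style self-generated chain-of-thought (CoT)
# preprocessing. `call_gpt4(prompt) -> str` and `embed(text) -> list[float]`
# are assumed to be supplied by the caller; they are not real library calls.

@dataclass
class Exemplar:
    question: str            # kept so it can be formatted back into few-shot prompts
    embedding: list[float]   # v_q in Algorithm 1
    chain_of_thought: str    # C_q
    answer: str              # A_q

# Illustrative prompt template in the spirit of the paper's CoT format;
# the exact wording used by the authors may differ.
COT_TEMPLATE = (
    "## Question: {question}\n"
    "{choices}\n"
    "## Answer\n"
    "Think step by step, then finish with: 'Therefore, the answer is <letter>.'"
)

def extract_answer_letter(completion: str) -> str | None:
    """Pull the final answer letter out of text like '... the answer is (B).'"""
    match = re.search(r"answer is \(?([A-E])\)?", completion)
    return match.group(1) if match else None

def build_exemplar_library(dev_set, call_gpt4, embed):
    """Ask the model to author its own CoT for each development question and
    keep only exemplars whose final answer matches the reference label."""
    library = []
    for item in dev_set:  # item: {"question": str, "choices": [str], "label": "A".."E"}
        letters = "ABCDE"[: len(item["choices"])]
        options = "\n".join(f"({l}) {c}" for l, c in zip(letters, item["choices"]))
        completion = call_gpt4(
            COT_TEMPLATE.format(question=item["question"], choices=options)
        )
        predicted = extract_answer_letter(completion)
        if predicted == item["label"]:  # label check filters out incorrect reasoning chains
            library.append(
                Exemplar(item["question"], embed(item["question"]), completion, predicted)
            )
    return library
```

Filtering on the ground-truth label is a simple automatic check that discards reasoning chains that arrive at the wrong answer, without requiring an expert to review the generated explanations.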

While our investigation focuses on exploring the power of prompting generalist models, we believe that fine-tuning and other methods of making parametric updates to foundation models are important research avenues to explore and may offer synergistic benefits to prompt engineering. We maintain that both approaches should be judiciously explored for unleashing the potential of foundation models in high-stakes domains like healthcare.

Algorithm

Algorithm 1 Algorithmic specification of Medprompt, corresponding to the visual representation of the strategy in Figure 4.

1: Input: Development data \mathcal{D}, test question \mathcal{Q}
2: Preprocessing:
3: for each question q in \mathcal{D} do
4:     Get an embedding vector v_q for q.
5:     Generate a chain-of-thought C_q and an answer A_q with the LLM.
6:     if the answer A_q is correct then
7:         Store the embedding vector v_q, chain-of-thought C_q, and answer A_q.
8:     end if
9: end for
10:
11: Inference Time:
12: Compute the embedding v_Q for the test question \mathcal{Q}.
13: Select the 5 most similar examples \{(v_{Q_i}, C_{Q_i}, A_{Q_i})\}_{i=1}^{5} from the preprocessed development data using KNN, with the cosine distance as the distance function: dist(v_q, v_Q) = 1 - \frac{v_q \cdot v_Q}{\|v_q\| \|v_Q\|}.
14: Format the 5 examples as context \mathcal{C} for the LLM.
15: for k = 1, \dots, 5 do
16:     Shuffle the answer choices of the test question.
17:     Generate a chain-of-thought C_k^* and an answer A_k^* with the LLM and context \mathcal{C}.
18: end for
19: Compute the majority vote of the generated answers \{A_k^*\}_{k=1}^{5}:

    A_{\text{Final}} = \text{mode}(\{A_k^*\}_{k=1}^{5}),

    where \text{mode}(X) denotes the most common element in the multiset X.
20: Output: Final answer A_{\text{Final}}.
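As a companion to the preprocessing sketch above, here is a hedged Python sketch of the inference-time procedure (lines 11-20 of Algorithm 1): retrieve the five nearest self-generated exemplars by cosine distance, run a small choice-shuffling ensemble, and take a majority vote. Again, this is illustrative rather than the authors' code; `call_gpt4`, `embed`, `Exemplar`, and `extract_answer_letter` carry over from the earlier sketch, and the shuffled letters are mapped back to option text before voting.

```python
import random
from collections import Counter

import numpy as np

def cosine_distance(u, v):
    """dist(v_q, v_Q) = 1 - (v_q . v_Q) / (||v_q|| ||v_Q||), as in line 13."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def answer_with_medprompt(question, choices, library, call_gpt4, embed,
                          k=5, ensemble_size=5, seed=0):
    """Illustrative Medprompt inference: kNN few-shot selection + choice-shuffling ensemble."""
    rng = random.Random(seed)
    v_q = embed(question)

    # Line 13: pick the k exemplars whose questions are nearest in embedding space.
    nearest = sorted(library, key=lambda ex: cosine_distance(ex.embedding, v_q))[:k]

    # Line 14: format the retrieved exemplars as few-shot context.
    # (For brevity, the exemplars' own answer choices are omitted from the context.)
    context = "\n\n".join(
        f"## Question: {ex.question}\n## Answer\n{ex.chain_of_thought}" for ex in nearest
    )

    votes = []
    for _ in range(ensemble_size):
        # Lines 16-17: shuffle answer options to reduce position bias, then
        # ask for a fresh chain of thought and final answer.
        shuffled = choices[:]
        rng.shuffle(shuffled)
        letters = "ABCDE"[: len(shuffled)]
        options = "\n".join(f"({l}) {c}" for l, c in zip(letters, shuffled))
        completion = call_gpt4(f"{context}\n\n## Question: {question}\n{options}\n## Answer\n")
        letter = extract_answer_letter(completion)  # defined in the preprocessing sketch
        if letter is not None and letter in letters:
            votes.append(shuffled[letters.index(letter)])  # vote on option text, not position

    # Lines 19-20: majority vote over the ensemble's answers.
    return Counter(votes).most_common(1)[0][0] if votes else None
```

Voting over the underlying option text, rather than the shuffled letter, makes the majority vote invariant to the per-call shuffles.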

My only issue with this paper is that I would have loved to see results of their Medprompt approach applied to the specialized models.
