I’ve been tinkering with prompts for parsing research paper abstracts into their separate parts, but haven’t found a reliable way to reformat them, and Elicit doesn’t have a prompt for this either. Anyone want to give it a shot or suggest an alternative approach?
Paper abstracts tend to follow a standard structure: Background, Data, Methods, Results, Conclusion. Some journals require these sections to be explicit (and there can be a whole zoo of separate ‘topic’ headings), but even when they are not explicit, they are typically still followed. When the structure is only implicit, the abstract can be hard to skim. Take this typical example:
In tumor-bearing mice, cyclic fasting or fasting-mimicking diets (FMDs) enhance the activity of antineoplastic treatments by modulating systemic metabolism and boosting antitumor immunity. Here we conducted a clinical trial to investigate the safety and biological effects of cyclic, five-day FMD in combination with standard antitumor therapies. In 101 patients, the FMD was safe, feasible, and resulted in a consistent decrease of blood glucose and growth factor concentration, thus recapitulating metabolic changes that mediate fasting/FMD anticancer effects in preclinical experiments. Integrated transcriptomic and deep-phenotyping analyses revealed that FMD profoundly reshapes anticancer immunity by inducing the contraction of peripheral blood immunosuppressive myeloid and regulatory T-cell compartments, paralleled by enhanced intratumor T-helper 1/cytotoxic responses and an enrichment of interferon-gamma and other immune signatures associated with better clinical outcomes in cancer patients. Our findings lay the foundations for phase II/III clinical trials aimed at investigating FMD antitumor efficacy in combination with standard antineoplastic treatments.
This is very far from the worst offender, but your eyes quickly glaze over. It could instead be written like this, adding a few newlines:
In tumor-bearing mice, cyclic fasting or fasting-mimicking diets (FMDs) enhance the activity of antineoplastic treatments by modulating systemic metabolism and boosting antitumor immunity.
Here we conducted a clinical trial to investigate the safety and biological effects of cyclic, five-day FMD in combination with standard antitumor therapies.
In 101 patients, the FMD was safe, feasible, and resulted in a consistent decrease of blood glucose and growth factor concentration, thus recapitulating metabolic changes that mediate fasting/FMD anticancer effects in preclinical experiments. Integrated transcriptomic and deep-phenotyping analyses revealed that FMD profoundly reshapes anticancer immunity by inducing the contraction of peripheral blood immunosuppressive myeloid and regulatory T-cell compartments, paralleled by enhanced intratumor T-helper 1/cytotoxic responses and an enrichment of interferon-gamma and other immune signatures associated with better clinical outcomes in cancer patients.
Our findings lay the foundations for phase II/III clinical trials aimed at investigating FMD antitumor efficacy in combination with standard antineoplastic treatments.
More readable already. Now imagine reading through thousands of these… It’d be nice if I could run all of the paper abstracts for things I cite on Gwern.net through a ‘topicfyer’ as a first automatic rewrite pass. It would save me time when editing them by hand, and would make all of the abstracts I automatically pull down from Arxiv etc. a lot more readable.
Splitting is not that hard, and GPT-3 understands scientific prose well enough to do it. The obvious prompt is something like:
```
Add newlines to split this abstract:
"$AN_ABSTRACT"
to: "$FORMATTED_ABSTRACT"
Add newlines to split this abstract:
"$USER_INPUT"
to: "...
```
The problem is that even with instruction-series davinci, this is highly unreliable as a prompt. Sometimes it will do it right; other times it’ll copy the abstract without modifying it; other times it’ll change some words or drop some sentences (extremely undesirable if it’s going to run fully automated!).
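One thing that can at least be automated regardless of which prompt wins: checking that the model only added newlines and didn’t touch the words. A minimal sketch (the function name is mine, not from any library):

```python
import re

def only_added_newlines(original: str, formatted: str) -> bool:
    """True if `formatted` is `original` with (at most) newlines inserted,
    i.e. the two texts are identical after collapsing all whitespace."""
    def normalize(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()
    return normalize(original) == normalize(formatted)
```

A fully-automated pipeline could retry or fall back to the unsplit abstract whenever this check fails, which would at least contain the word-changing/sentence-dropping failure modes.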
How to fix?
I’ve played around with the instructions but haven’t found any improvements. Using zero before/after examples didn’t work well for me either. Maybe someone else can come up with a better prompt.
Abstracts can be lengthy, so I run out of context quickly if I try to include more than 1 example (each few-shot example costs ~2 abstract-lengths, before and after, and you have to leave another ~2 abstract-lengths for the actual abstract to be processed plus its output), so just adding more examples to few-shot it doesn’t work.
Finetuning an engine might work. But I am not sure Curie is smart enough to do the job well, and I haven’t used finetuning before, so it would be real work to create a finetuned engine just to see whether it works. I’m hoping to avoid that.
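For what it’s worth, the training data would be easy to generate mechanically: the ‘after’ is just the ‘before’ with newlines added, so hand-splitting a few hundred abstracts gives you pairs. A sketch of writing them in OpenAI’s JSONL prompt/completion finetuning format (the separator and stop-token conventions follow their finetuning guide; the example pair here is invented):

```python
import json

# Hypothetical before/after pair; "after" is the same text with newlines added.
examples = [
    ("In mice, X improves Y. Here we test X in humans. It worked. X merits further study.",
     "In mice, X improves Y.\nHere we test X in humans.\nIt worked.\nX merits further study."),
]

with open("abstracts.jsonl", "w") as f:
    for raw, split in examples:
        # One JSON object per line; a fixed separator marks the end of the
        # prompt, and a stop sequence marks the end of the completion.
        f.write(json.dumps({"prompt": raw + "\n\n###\n\n",
                            "completion": " " + split + " END"}) + "\n")
```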
The need to copy the abstract literally and exactly is a major constraint, and I’ve wondered if there’s some clever way to use the Search or Classification endpoints instead. Could individual sentences be ‘classified’ by topic, with linebreaks inserted at topic changes? Sounds potentially very inefficient, though.
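The inefficiency aside, the mechanics of the classify-then-split idea are simple, and it sidesteps the copying problem entirely since the model never rewrites the text. A sketch, where `classify` stands in for whatever per-sentence topic labeler (e.g. a Classification-endpoint call) one plugs in:

```python
def split_at_topic_changes(sentences, classify):
    """Join consecutive same-topic sentences into paragraphs, breaking
    wherever the topic label changes. `classify` is any sentence -> label
    function; the original sentences are passed through verbatim."""
    paragraphs, current, last_label = [], [], None
    for s in sentences:
        label = classify(s)
        if current and label != last_label:
            paragraphs.append(" ".join(current))
            current = []
        current.append(s)
        last_label = label
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)
```

The cost is one classification per sentence, but since the text is reassembled locally, the exact-copy guarantee holds by construction.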
Or something with the logprob of a newline at each sentence ending? (Checking a few abstracts’ sentence-ends in the Playground, I don’t see any obvious logprob boosts for \n, but maybe something relative could be done there.)
Could possible sets of breaks be searched? Generate every possible set of linebreaks, and run Search against the original to find the one which ‘looks most natural’? For short abstracts, there’d be only a handful of possible break-sets. That won’t scale to long abstracts with many sentences if one searches all possible breaks exhaustively, but one could do an adaptive search and ascend instead: score n initial candidates, then keep the best and search its neighbors.
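The search scheme above can be sketched concretely. `score` stands in for whatever naturalness scorer one uses (e.g. a Search-endpoint query against the original); the function enumerates all 2^(n−1) break-sets for short abstracts and falls back to greedy single-break hill-climbing for long ones. All names here are mine:

```python
from itertools import combinations

def best_breaks(sentences, score, max_exhaustive=10):
    """Search over sets of paragraph breaks, returning the rendering that
    maximizes `score(text)`. Exhaustive for short abstracts; greedy ascent
    over single-break flips for long ones."""
    n = len(sentences)
    gaps = range(n - 1)  # a break may go after sentence i, for i < n-1

    def render(breakset):
        paragraphs, current = [], []
        for i, s in enumerate(sentences):
            current.append(s)
            if i in breakset:
                paragraphs.append(" ".join(current))
                current = []
        if current:
            paragraphs.append(" ".join(current))
        return "\n".join(paragraphs)

    if n <= max_exhaustive:
        # 2^(n-1) candidates: feasible for short abstracts.
        candidates = [set(c) for k in range(n) for c in combinations(gaps, k)]
        return render(max(candidates, key=lambda b: score(render(b))))

    # Adaptive ascent: start with no breaks, flip whichever single gap
    # improves the score most, stop when no flip helps.
    breaks = set()
    while True:
        best = max((breaks ^ {g} for g in gaps),
                   key=lambda b: score(render(b)))
        if score(render(best)) <= score(render(breaks)):
            return render(breaks)
        breaks = best
```

The number of Search calls is the obvious cost: 2^(n−1) in the exhaustive case, and O(n) per ascent step in the greedy case.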