Structuring paper abstracts by topic?

I’ve been tinkering with prompts for parsing research paper abstracts into their separate parts, but I haven’t found a reliable way to reformat them, and Elicit doesn’t have a prompt for this either. Anyone want to give it a shot or suggest an alternate way?


Paper abstracts tend to follow a standard structure: Background, Data, Methods, Results, Conclusion. Some journals require these sections to be explicit (and there can be a whole zoo of separate ‘topic’ sections), but even when they are not explicit, the structure is typically still followed. When the structure is only implicit, the abstract can be hard to skim. Take this typical example:

In tumor-bearing mice, cyclic fasting or fasting-mimicking diets (FMDs) enhance the activity of antineoplastic treatments by modulating systemic metabolism and boosting antitumor immunity. Here we conducted a clinical trial to investigate the safety and biological effects of cyclic, five-day FMD in combination with standard antitumor therapies. In 101 patients, the FMD was safe, feasible, and resulted in a consistent decrease of blood glucose and growth factor concentration, thus recapitulating metabolic changes that mediate fasting/FMD anticancer effects in preclinical experiments. Integrated transcriptomic and deep-phenotyping analyses revealed that FMD profoundly reshapes anticancer immunity by inducing the contraction of peripheral blood immunosuppressive myeloid and regulatory T-cell compartments, paralleled by enhanced intratumor T-helper 1/cytotoxic responses and an enrichment of interferon-gamma and other immune signatures associated with better clinical outcomes in cancer patients. Our findings lay the foundations for phase II/III clinical trials aimed at investigating FMD antitumor efficacy in combination with standard antineoplastic treatments.

This is very far from the worst offender, but your eyes quickly glaze over. It could instead be written like this, adding a few newlines:

In tumor-bearing mice, cyclic fasting or fasting-mimicking diets (FMDs) enhance the activity of antineoplastic treatments by modulating systemic metabolism and boosting antitumor immunity.

Here we conducted a clinical trial to investigate the safety and biological effects of cyclic, five-day FMD in combination with standard antitumor therapies.

In 101 patients, the FMD was safe, feasible, and resulted in a consistent decrease of blood glucose and growth factor concentration, thus recapitulating metabolic changes that mediate fasting/FMD anticancer effects in preclinical experiments. Integrated transcriptomic and deep-phenotyping analyses revealed that FMD profoundly reshapes anticancer immunity by inducing the contraction of peripheral blood immunosuppressive myeloid and regulatory T-cell compartments, paralleled by enhanced intratumor T-helper 1/cytotoxic responses and an enrichment of interferon-gamma and other immune signatures associated with better clinical outcomes in cancer patients.

Our findings lay the foundations for phase II/III clinical trials aimed at investigating FMD antitumor efficacy in combination with standard antineoplastic treatments.

More readable already. Now imagine you are reading through thousands of these… It’d be nice if I could run all of the paper abstracts for stuff I cite on Gwern.net through a topicfyer as the first automatic rewrite pass. It would save me time when editing them by hand, and would make all of the automatic abstracts that I pull down from Arxiv etc. a lot more readable.

Splitting is not that hard, and GPT-3 understands scientific prose well enough to do it. The obvious prompt is to do something like:

Add newlines to split this abstract:
"$AN_ABSTRACT"
to: "$FORMATTED_ABSTRACT"

Add newlines to split this abstract:
"$USER_INPUT"
to: "...
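(For concreteness, this is roughly how such a few-shot call might look through the Completions API; the engine name, parameters, and placeholder examples below are illustrative, not a tested recipe:)

import openai  # legacy openai-python Completions interface

EXAMPLE_BEFORE = "..."  # a hand-written abstract mashed into one paragraph
EXAMPLE_AFTER = "..."   # the same abstract with blank lines between its sections

def split_abstract(abstract: str) -> str:
    prompt = (
        "Add newlines to split this abstract:\n"
        f'"{EXAMPLE_BEFORE}"\n'
        f'to: "{EXAMPLE_AFTER}"\n\n'
        "Add newlines to split this abstract:\n"
        f'"{abstract}"\n'
        'to: "'
    )
    response = openai.Completion.create(
        engine="davinci-instruct-beta",  # instruction-series davinci
        prompt=prompt,
        max_tokens=1024,                 # rough budget for the reformatted abstract
        temperature=0,
        stop='"',
    )
    return response["choices"][0]["text"].strip()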

The problem is that even with instruction-series davinci, this prompt is highly unreliable. Sometimes it will do it right; other times it’ll copy the abstract without modifying it; other times it’ll change some words or drop some sentences (extremely undesirable if it’s going to run fully automated!).

How to fix?

I’ve played around with the instructions but haven’t found any improvements. Going zero-shot, with no before/after examples, didn’t work well for me either. Maybe someone else can come up with a different prompt.

Abstracts can be lengthy, so I run out of context quickly if I try to include more than 1 example (each before/after example costs ~2 abstract-lengths, and you have to reserve another ~2 for the actual abstract being processed and its output), so just putting in more examples to few-shot it doesn’t work.

Finetuning an engine might work. I am not sure if Curie is smart enough to do the job well, and I haven’t used finetuning before, so it would be work to create a finetuned engine just to see if it worked. I’m hoping to avoid that.

The need to copy the abstract literally and exactly is a major constraint, and I’ve wondered if there’s some clever way to use the Search or Classification endpoint instead. Could individual sentences be ‘classified’ by topic and linebreaks inserted at topic-changes? Sounds potentially very inefficient.

Or something using the logprob of a newline at each sentence ending? (Checking a few abstracts’ sentence-ends in the Playground, I don’t see any obvious logprob boost for \n, but maybe something relative could be done there.)
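(If someone wanted to poke at that systematically rather than eyeballing the Playground, here is a sketch, assuming the legacy Completions endpoint with logprobs and a naive regex sentence-splitter:)

import re
import openai

def newline_logprobs(abstract: str):
    """For each sentence boundary, report the best logprob of any newline-starting
    token among the model's top-5 next-token candidates (sketch only)."""
    sentences = re.split(r'(?<=[.!?])\s+', abstract.strip())
    scores = []
    for i in range(1, len(sentences)):
        prefix = " ".join(sentences[:i])
        resp = openai.Completion.create(
            engine="davinci",
            prompt=prefix,
            max_tokens=1,
            temperature=0,
            logprobs=5,   # top-5 candidate next tokens
        )
        top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
        newline_lp = max((lp for tok, lp in top.items() if tok.startswith("\n")),
                         default=float("-inf"))
        scores.append((i, newline_lp))
    return scores  # break after sentence i wherever the score clears some relative threshold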

Could possible sets of breaks be searched? Generate every possible set of linebreaks, and do Search against the original to find the one which ‘looks most natural’? For short abstracts, there’d be only a couple possible sets of linebreaks… It might not work with long abstracts with many sentences if one tries to search all possible breaks, but one could do an adaptive hill-climbing search instead: score n initial candidates, then keep the best and search its neighbors.
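(The brute-force version is easy enough to sketch; for k sentences there are 2^(k-1) possible break sets, so exhaustive enumeration only makes sense for short abstracts, and naturalness below is just a placeholder for whatever Search/similarity scoring one would plug in:)

import re
from itertools import combinations

def candidate_splits(abstract: str):
    """Yield every possible paragraphing of the abstract: 2^(k-1) candidates for k sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', abstract.strip())
    boundaries = range(1, len(sentences))   # positions where a paragraph break could go
    for r in range(len(sentences)):
        for breaks in combinations(boundaries, r):
            cuts = list(breaks) + [len(sentences)]
            paragraphs, start = [], 0
            for cut in cuts:
                paragraphs.append(" ".join(sentences[start:cut]))
                start = cut
            yield "\n\n".join(paragraphs)

def best_split(abstract: str, naturalness) -> str:
    """Pick the candidate that a scoring function (e.g. a Search-endpoint call) likes most."""
    return max(candidate_splits(abstract), key=naturalness)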

Are you familiar with the Allen Institute’s Semantic Search tools? GitHub: allenai/s2orc (“S2ORC: The Semantic Scholar Open Research Corpus”, https://www.aclweb.org/anthology/2020.acl-main.447/). I think they have a lot of what you need. I am not sure OpenAI is the best tool for all of the pieces of what you are doing.


I’ve used their search tools a bit and know they have a parsed corpus, but as far as I know they don’t have any reformatting tools that could be used like I suggest here, or anything that attempts to automatically parse out implicit sections. (Looking up entries in their corpus isn’t relevant: where my data sources, like Pubmed, include section formatting, they come already paragraph-separated and I just have to avoid stripping it, which I do. The problem is the many, many sources - particularly Arxiv! - where everyone just mashes the paragraphs together even though they still implicitly follow the general Background/Methods/Results/Conclusion schema.)

Their corpus (as well as many others) could, of course, be used for training a model (obvious approach: seq2seq, where a corrupted version with the section identifiers stripped is paired with the original), but training a large language model is what I’m trying to avoid. A working prompt would be so much easier to develop & use, and it would potentially be reusable for other similar abstract-processing tasks.
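(Purely as a sketch of how the training pairs would be constructed from a structured-abstract corpus; sections here is a hypothetical ordered mapping of section label to section text:)

def make_seq2seq_pair(sections: dict) -> tuple:
    """Build a (corrupted, original) training pair: the input strips section
    identifiers and paragraph breaks; the target keeps the paragraph structure."""
    # e.g. sections = {"Background": "...", "Methods": "...", "Results": "...", "Conclusions": "..."}
    target = "\n\n".join(sections.values())    # sections kept as separate paragraphs
    corrupted = " ".join(sections.values())    # labels dropped, paragraphs mashed together
    return corrupted, target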

My use case for legal rules presents a very similar problem. Unfortunately, I think a big component of the problem is data cleaning. I’d use some other tool to parse by sentence, then use GPT-3 for classification and show it a few hundred examples, as recommended in the documentation, of Background, Data, Methods, Results, and Conclusion, e.g. train it to learn the semantics of each type of sentence.


Hm… Yes, that was what I was thinking of when I referred to the Classification endpoint and per-sentence classification: generate a set of example sentences for ‘Background’, ‘Data’, ‘Methods’, etc., then classify each $CURRENT_SENTENCE against that set to get a list of classifications like [Background, Background, Data, Data, Data, Data, Results, Results, Conclusion], and then insert \n\n after sentences #2, #6, & #8, where there is a topic transition.
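(The mechanical half of that is trivial; a sketch of turning a per-sentence label list into paragraph breaks, however the labels end up being produced:)

def insert_paragraph_breaks(sentences, labels):
    """Join the sentences back together, starting a new paragraph wherever the topic label changes."""
    out = []
    for i, sentence in enumerate(sentences):
        out.append(sentence)
        if i + 1 < len(sentences):
            out.append("\n\n" if labels[i + 1] != labels[i] else " ")
    return "".join(out)

# e.g. labels = ["Background", "Background", "Data", "Data", "Data", "Data",
#                "Results", "Results", "Conclusion"]
# -> paragraph breaks inserted after sentences #2, #6, & #8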

I didn’t like this idea for a few reasons: I’m a bit skeptical that a sentence in isolation can be classified reliably; classification errors or mixed-up writing make it hard to detect ‘changes’ (what if you get Data, Methods, Data, Methods, Methods? Is this actually Data, Data, Data, Methods, Methods, or is it reporting multiple experiments, or really just a mish-mash?); and, having never used the Classification endpoint, I thought it would be very expensive to make a request per sentence that includes a large number of examples for each possible category.

How does the cost work for Classification? The API parameters/guide don’t clarify this for me: if I upload, say, 1000 labeled examples (e.g. n=200 for each of 5 categories) amounting to 10k BPEs and I try to Classify a 10-BPE sentence, does this cost 10 BPEs, or 10k+10 BPEs, or something else entirely? If it’s the former, then there’s no problem cost-wise: it’s probably even cheaper than a working reformatting prompt, and only requires one pass. If it’s the latter…

Feature request for OpenAI: teach GPT-3 how to clean different types of datasets, then provide an engine/endpoint so that users can input their raw data and get cleaned data back.

Still trying to figure out the pricing here, which is clear as mud. As I’m reading the FAQ (seems like the wrong place compared to the API docs), each sentence would cost me… >$0.05 to classify? Can that possibly be right?

Number of tokens in all of your documents
+ (Number of documents + 1) * 14
+ (Number of documents + 1) * Number of tokens in your query

So as I’m reading this, if I go with my scenario of 5 topics with n=200 examples each (1000 total), assume something like 30 BPEs per example/sentence (since scientific abstracts are fairly jargon- and notation-intense), and use ada’s $0.0008/1000-BPE pricing, that would translate to:

ndocs <- 1000          # 5 topics * 200 labeled examples
avgTokensPerDoc <- 30  # assumed BPEs per labeled example
queryTokens <- 30      # assumed BPEs per query sentence
((ndocs * avgTokensPerDoc) +
 (ndocs + 1) * 14 +
 (ndocs + 1) * queryTokens) / 1000 * 0.0008
# [1] 0.0592352

(I’m going to ignore the cost of the additional regular completion because I can’t figure that out from the FAQ. Come on guys.)

Does this sound about right? It’s higher than I expected, but another user seemed to get high costs too… At $0.06 a sentence, a single abstract could easily be a buck. I have a current set of ~6.5k and I expect to get at least that many in the future, so that’s a bit steep. (Multiplying it out suggests a total cost of ~$1.9k to parse topics.)

The main cost here seems to come from the ndocs parameter blowing up the total number of tokens passed into Search. Is everyone using this with relatively few classified documents? If I imagine instead only n=5 examples per topic, then it’s much more feasible, coming in at more like $50 total:

R> ndocs <- 5*5; avgTokensPerDoc <- 30
R> perSentenceCost <- (((ndocs*avgTokensPerDoc) + (ndocs + 1) * 14 + (ndocs + 1) * 30) / 1000) * 0.0008
R> 6500 * 5 * perSentenceCost   # ~6.5k abstracts x ~5 sentences each
[1] 49.244

This is fairly reasonable but a little bit discouraging. Splitting by topic is but one of many transformations I might want to apply: I was looking into it because I thought it would be one of the easiest to get working, but it’s already proving to be a bit daunting. Considering that, I may be better off trying to finetune the complete set of edits as before/after pairs instead of doing multiple phases in a pipeline.

Some quick testing of InstructGPT (text-davinci-001) suggests that the newer versions are starting to work for parsing by section.

I realized that there is a very simple way to verify that instruct-Davinci GPT-3 copied exactly in an insert-newlines prompt: simply strip the newlines & check equality! That guarantees that the copied version did not change any words or do anything but insert newlines to create paragraphs.
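(Stripping newlines alone can fail when a newline replaces a space, so the safer form of the check is to compare with all whitespace normalized. A minimal sketch:)

def only_whitespace_changed(original: str, formatted: str) -> bool:
    """True iff the two texts contain exactly the same words in the same order,
    i.e. the model did nothing except insert or move whitespace."""
    return " ".join(original.split()) == " ".join(formatted.split())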

With that resolved, a simple "Split into paragraphs:\n"$INPUT"\nTo:\n"" prompt works reasonably well. The main problems are that it will often not split at all, and that it handles any HTML, Unicode, or complex Markdown very poorly - but neither of those is an unfamiliar problem to API users… If it doesn’t do anything on half the inputs, well, that’s better than nothing. (GPT-3 really could use training on richer text than the stripped-down WET files. HTLM/CM3 are well worth a look, and the capabilities would be highly valuable, particularly for business users I imagine.)
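(Putting it together, roughly what the pass looks like; this is a sketch rather than the actual paragraphizer.py, and the engine name, token budget, and fallback policy are illustrative:)

import openai

PROMPT = 'Split into paragraphs:\n"{abstract}"\nTo:\n"'

def paragraphize(abstract: str) -> str:
    """Ask the model to insert paragraph breaks; keep the original text unless the
    output is a genuine split that changed nothing but whitespace."""
    resp = openai.Completion.create(
        engine="text-davinci-001",
        prompt=PROMPT.format(abstract=abstract),
        max_tokens=1024,     # rough budget for the reformatted abstract
        temperature=0,
        stop='"',
    )
    candidate = resp["choices"][0]["text"].strip()
    unchanged = " ".join(candidate.split()) == " ".join(abstract.split())
    actually_split = "\n" in candidate
    return candidate if (unchanged and actually_split) else abstract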

I have put up a simple Python script paragraphizer.py and integrated it into Gwern.net to run on new annotations. I’ve also been running it on my existing annotations, although it’s blowing through my monthly billing limit which is unfortunate.
