Factual completions

Curious what approaches the community has come up with that keep completions factual whilst allowing for a higher temperature.

Many prompts I’ve designed that rely on responses being factual and truthful result in me having to drop the temperature and/or top_p to very low values. This is fine, however it restricts responses to a very deterministic style.

Allowing for higher temperatures gives the model space to create more interesting, less robotic-sounding responses, but it also lets very low log-probability tokens through, leading to some rubbish made-up responses that are far from factual.

Any thoughts?


There are several approaches I can think of. The simplest would be to prompt the model to answer ‘I don’t know’ in cases where it’s uncertain, or to track the completion’s logprobs (i.e. the sentence’s mean logprob) to see how confident it is.
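If you go the logprob route, here’s a minimal sketch of the mean-logprob check, assuming the GPT-3-era Completions endpoint with `logprobs` enabled (the threshold value is arbitrary and would need tuning):

```python
import openai

def confident_completion(prompt, threshold=-1.0):
    """Generate a completion and flag it when the mean token logprob is low."""
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        temperature=0.7,
        max_tokens=100,
        logprobs=1,  # return the logprob of each sampled token
    )
    choice = response["choices"][0]
    token_logprobs = choice["logprobs"]["token_logprobs"]
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    # Below the threshold, treat the completion as too uncertain to use
    if mean_logprob < threshold:
        return None, mean_logprob
    return choice["text"], mean_logprob
```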

The best option might be to finetune with factual completions, though, and see how it affects the model. GPT-3’s pretraining data wasn’t 100% factual, so maybe a 100% factual finetune dataset will bias the model more towards facts.


Unfortunately a response of “I don’t know” wouldn’t be applicable in this scenario. Fine-tuning sounds about right and we are waiting on a response for our davinci fine-tune request. Was just curious if anyone has any little prompt hacks to steer it in the right direction.


Finetuning davinci sounds interesting, though how will you deal with davinci latency? The only prompt hack I can think of is to describe the model as being factual and answering as if it’s a scientist (or whatever other specialty) and see what effect that has. Imo curating finetune examples might be more important though.

Another option might be to finetune a small/fast classifier model that labels how factual the completions are, and re-generate if they aren’t.
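Something like this loop is what I have in mind; `generate` and `classify` are placeholders for the higher-temperature completion call and the small classifier respectively:

```python
def generate_factual(prompt, generate, classify, max_attempts=3):
    """Regenerate until the (placeholder) classifier labels the completion factual."""
    last = None
    for _ in range(max_attempts):
        completion = generate(prompt)   # e.g. a higher-temperature davinci call
        if classify(completion):        # small/fast model says it looks factual
            return completion
        last = completion               # keep the latest attempt as a fallback
    return last
```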

Latency thankfully is not an issue in our use-case. However, that last point you made is something we didn’t consider. Definitely something I’ll look into, thanks for the suggestion!

You’ll probably have to break it up into multiple prompts. Use the low temp prompt to get the answer and a higher temp prompt to explain it eloquently. I ran into the same thing with NLCA (Natural Language Cognitive Architecture) when attempting to use CURIE for factual information. I think I’ll first need to classify queries as EPISODIC or DECLARATIVE. But it sounds like you only want declarative knowledge.
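To illustrate the split (the prompt wording here is just illustrative, not from NLCA): a near-deterministic call pins down the fact, then a higher-temperature call only rephrases it:

```python
import openai

def answer(question):
    # Stage 1: low temperature, just get the fact
    fact = openai.Completion.create(
        engine="davinci",
        prompt=f"Answer the question factually and concisely.\nQ: {question}\nA:",
        temperature=0.0,
        max_tokens=60,
    )["choices"][0]["text"].strip()

    # Stage 2: higher temperature, reword the fact without adding to it
    return openai.Completion.create(
        engine="davinci",
        prompt=f"Rewrite the following answer in a friendly, engaging tone "
               f"without changing the facts:\n{fact}\n\nRewritten:",
        temperature=0.8,
        max_tokens=80,
    )["choices"][0]["text"].strip()
```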

My first solution was to load all of Wikipedia into an offline search index (SOLR) and then just use GPT-3 to generate search strings. It is excellent at generating search queries. From there, I’d pull the relevant articles from SOLR and use QA to generate answers. The benefit of this system is that you can put any articles into the search index. News, books, case files, medical history, etc. Plus it’s very fast.
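Roughly this shape, assuming a local SOLR core full of Wikipedia articles (field names, prompts, and the engine are illustrative):

```python
import openai
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/wikipedia")  # your offline index

def answer_from_index(question):
    # 1. Let GPT-3 turn the question into a search string
    query = openai.Completion.create(
        engine="davinci",
        prompt=f"Write a concise search query for the following question:\n{question}\nSearch query:",
        temperature=0.0,
        max_tokens=20,
    )["choices"][0]["text"].strip()

    # 2. Pull the most relevant article text from SOLR
    results = solr.search(f'text:"{query}"', rows=1)
    passage = next(iter(results), {}).get("text", "")

    # 3. QA over the retrieved passage only
    return openai.Completion.create(
        engine="davinci",
        prompt=f"Passage:\n{passage}\n\nQuestion: {question}\nAnswer:",
        temperature=0.0,
        max_tokens=100,
    )["choices"][0]["text"].strip()
```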


Very cool idea. Sounds like it would eat up tokens though for a single use-case, which results in eating up our margins :joy: but a great idea nonetheless. I’ll look into it.

It really depends on the domain. Is there a reason you want to extract factual information from a transformer? That’s not really what it was designed for.


Yes, it’s because the result is shown to students (https://www.slidespace.ai/), with some human intervention by teachers.

Hmm, this is a pretty controlled environment. You’re basically transmuting existing material and regurgitating it with a transformer. It would be ideal to load the base material into a DB and then just use a combination of search and QA to generate the content. You’re right that you’ll be shelling out for tokens with your current strategy.

Furthermore, you can probably save more by storing each transformer prompt/completion in a database and searching that first with semantic similarity before even going to the transformer. Eventually you’ll build up a huge corpus of questions and answers that students ask, and those will be free for you: your most valuable IP. From those questions you can help teachers generate high-quality content that anticipates students’ questions. Combine that with some other education-oriented ML and you can model what is actually in each student’s head.
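A rough sketch of the check-the-database-first idea; `embed` stands in for whatever sentence-embedding model you pick, and the similarity threshold is arbitrary:

```python
import numpy as np

# cache of (embedding, question, completion) triples built up from past transformer calls
cache = []

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(question, embed, threshold=0.9):
    """Return a stored completion if a past question is similar enough, else None."""
    if not cache:
        return None
    q = embed(question)
    score, completion = max((cosine(e, q), c) for e, _, c in cache)
    return completion if score >= threshold else None

def remember(question, completion, embed):
    """Store a new prompt/completion pair so future lookups can reuse it for free."""
    cache.append((embed(question), question, completion))
```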


You’re on a roll and you’re absolutely right. We know how students study/learn, we know how they perform, and we have hundreds of thousands of questions along with students’ performance on those questions. From that we can start to predict how students will perform on their assessments, and even predict the average grade of a subject given only which students are enrolled in it. A lot of amazing insights to be extracted.

Sounds like you’ve already got your fine-tuning data then. Just plug in the questions as the prompts and the answers as the completions. Unless you still need to generate the answers? I’ve got some good experience generating training data. CURIE should be good enough for this kinda thing. I did something similar here:

You might also get a lot out of this dataset. GitHub - Guzpenha/MANtIS: MANtIS - a multi-domain information seeking dialogues dataset


Very cool stuff. A lot of the issue is that we aren’t trying to generate only questions, but also the correct answers to those questions, and in some cases the incorrect answers. It’s hard to generate fine-tune datasets of over 200-300 samples that generalise across a variety of niche topics like endocrinology, recursive programming, blockchains, climate change etc., simply because we can’t be certain the answers are factually correct. Not that it’s impossible; it just takes a lot of time. We are currently resorting to crowdsourcing our training data by having students provide feedback on whether the resources we generated are good vs bad, as they have more domain knowledge than us.

That doesn’t sound particularly challenging to me, TBH.

There is plenty of factual data out there - no need to generate it. So start with tidbits of facts, taken from any number of sources like ScienceDaily and Reddit, then use my work to generate the questions that go with the data. Simple as that. It’s a matter of the cart and the horse: I think you might have started backwards. As an older data scientist once told me, getting good data is ALWAYS the hardest problem. Well, you’ve got plenty of good data out there.

As far as different domains, it would make sense to curate a dataset for each class/domain/topic and then use that to train fine-tuned models.
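For reference, the fine-tuning format is just prompt/completion pairs in JSONL, so a per-domain dataset could be written out like this (paths and field contents are illustrative; the fine-tuning guide recommends a leading space and a stop sequence on completions):

```python
import json

def write_finetune_file(qa_pairs, path):
    """qa_pairs: list of (question, answer) tuples for one class/domain/topic."""
    with open(path, "w") as f:
        for question, answer in qa_pairs:
            record = {
                "prompt": f"Q: {question}\nA:",
                "completion": f" {answer}\n",  # leading space + newline stop sequence
            }
            f.write(json.dumps(record) + "\n")

# e.g. write_finetune_file(climate_change_pairs, "climate_change.jsonl")
```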

Take this article, for example, found on Reddit: Climate Change: Permafrost Thaw in Siberia Creates a Ticking 'Methane Bomb' of Greenhouse Gases, Scientists Warn (Smithsonian Magazine)

This points to the study: https://www.pnas.org/content/118/32/e2107632118

So now you’ve got a primary source with no domain expertise required. This is Grade A data here.

So then you just ask Davinci to write QA pairs with INSTRUCT. From here, you can start to create more data that can be used to cultivate finetuning datasets so that you can easily generate millions of Question/Answer pairs from any arbitrary article:

Given the following passage, generate question and answer pairs for college students.

Passage:
Anthropogenic global warming may be accelerated by a positive feedback from the mobilization of methane from thawing Arctic permafrost. There are large uncertainties about the size of carbon stocks and the magnitude of possible methane emissions. Methane cannot only be produced from the microbial decay of organic matter within the thawing permafrost soils (microbial methane) but can also come from natural gas (thermogenic methane) trapped under or within the permafrost layer and released when it thaws. In the Taymyr Peninsula and surroundings in North Siberia, the area of the worldwide largest positive surface temperature anomaly for 2020, atmospheric methane concentrations have increased considerably during and after the 2020 heat wave. Two elongated areas of increased atmospheric methane concentration that appeared during summer coincide with two stripes of Paleozoic carbonates exposed at the southern and northern borders of the Yenisey-Khatanga Basin, a hydrocarbon-bearing sedimentary basin between the Siberian Craton to the south and the Taymyr Fold Belt to the north. Over the carbonates, soils are thin to nonexistent and wetlands are scarce. The maxima are thus unlikely to be caused by microbial methane from soils or wetlands. We suggest that gas hydrates in fractures and pockets of the carbonate rocks in the permafrost zone became unstable due to warming from the surface. This process may add unknown quantities of methane to the atmosphere in the near future.

In a warming world, the release of CO2 and methane from thawing permafrost to the atmosphere may lead to a positive feedback by increasing the concentration of greenhouse gases (1–3). Methane is particularly critical because of its high global warming potential per mass unit. In review articles on this subject, the focus is mainly on organic matter stored in frozen soils and its microbial decay and release as microbial methane upon thawing (1–3). However, thermogenic methane, i.e., natural gas from the deeper subsurface, may also contribute to the feedback. A proportion of thermogenic methane in addition to the dominant microbial methane was found in gas emission craters in Western Siberia (4). For the subsea permafrost in the East Siberian Arctic Shelf, it was argued that thawing can make the permafrost layer permeable for gas stored as hydrates or as free gas within the permafrost layer and also for subpermafrost gas (5). Isotopic signatures of methane released in the East Siberian Arctic Shelf are consistent with an origin as old, deep, and likely thermogenic methane.

Question and Answer Pairs:
Q: How will permafrost thawing liberate methane?
A: Methane can be produced by the decay of thawed organic matter and from methane trapped beneath the permafrost.

Q: What kind of methane was found in Siberia?
A: Thermogenic methane was found in Siberian gas emission craters.

I gave it one example and you see it can generalize pretty quickly.
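Wiring that prompt up is straightforward; a rough sketch (engine choice, parameters, and parsing are up to you, and in practice you’d prepend the worked example above as the few-shot demonstration):

```python
import openai

INSTRUCTION = (
    "Given the following passage, generate question and answer pairs "
    "for college students.\n\nPassage:\n{passage}\n\nQuestion and Answer Pairs:\n"
)

def qa_pairs_from_article(passage):
    text = openai.Completion.create(
        engine="davinci-instruct-beta",  # the INSTRUCT series mentioned above
        prompt=INSTRUCTION.format(passage=passage),
        temperature=0.7,
        max_tokens=300,
        stop=["Passage:"],
    )["choices"][0]["text"]
    # Parse "Q: ... / A: ..." lines into (question, answer) tuples
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs
```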


Unfortunately, I wish it were as easy as you’ve described. There are many, many restrictions and requirements in place, e.g. you don’t want to teach students non-examinable content, content provided by an educator may be an abstraction of some outside information, content comes in the form of badly-formatted lectures, and the list goes on.

If we were just creating questions from an article such as the one you sent, then it would be more than applicable to simply run through the process you suggested. However, there are so many more complexities that I can’t get into unfortunately.

This is a tiny example from a bigger picture. Understandably those few dot points alone don’t seem too useful and perhaps an overcomplication, but that’s because they’re presented without the context of the rest of our business model.


Well, without examples of what you really need, it will be difficult to provide any help. The homepage you sent says that any teaching materials will do, including articles. Then it shows flashcards with QA pairs. So I hope you understand my confusion.
