That doesn’t sound particularly challenging to me, TBH.
There is plenty of factual data out there, so there is no need to generate it. Start with tidbits of fact, taken from any number of sources like ScienceDaily and Reddit, then use my work to generate the questions that go with the data. Simple as that. Generating the data first is putting the cart before the horse; I think you might have started backwards. As an older data scientist once told me: getting good data is ALWAYS the hardest problem. Well, you’ve got plenty of good data out there.
As far as different domains, it would make sense to curate a dataset for each class/domain/topic and then use that to train fine-tuned models.
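As a rough sketch of what per-domain curation could look like: the snippet below groups question/answer records by a domain label and writes one JSONL file per domain. The record keys (`domain`, `question`, `answer`) and the `prompt`/`completion` output fields are assumptions for illustration, so adjust them to whatever your fine-tuning tool expects.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def group_by_domain(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Bucket Q/A records by their (hypothetical) 'domain' label."""
    grouped = defaultdict(list)
    for rec in records:
        # Assumed output shape: one prompt/completion pair per row.
        grouped[rec["domain"]].append(
            {"prompt": rec["question"], "completion": rec["answer"]}
        )
    return dict(grouped)


def write_domain_files(grouped: Dict[str, List[dict]], out_dir: str) -> List[Path]:
    """Write one <domain>.jsonl file per domain and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for domain, rows in sorted(grouped.items()):
        path = out / f"{domain}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths
```

Each resulting file can then serve as the training set for one fine-tuned model, which keeps the domains from bleeding into each other.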
Take this article, for example, found on Reddit Climate Change: "Permafrost Thaw in Siberia Creates a Ticking 'Methane Bomb' of Greenhouse Gases, Scientists Warn" (Smart News, Smithsonian Magazine)
This points to the study: https://www.pnas.org/content/118/32/e2107632118
So now you’ve got a primary source with no domain expertise required. This is Grade A data here.
So then you just ask Davinci (the INSTRUCT version) to write QA pairs. From there, you can build up fine-tuning datasets, so that you can easily generate millions of question/answer pairs from any arbitrary article. The prompt looks like this:
Given the following passage, generate question and answer pairs for college students.
Passage:
Anthropogenic global warming may be accelerated by a positive feedback from the mobilization of methane from thawing Arctic permafrost. There are large uncertainties about the size of carbon stocks and the magnitude of possible methane emissions. Methane cannot only be produced from the microbial decay of organic matter within the thawing permafrost soils (microbial methane) but can also come from natural gas (thermogenic methane) trapped under or within the permafrost layer and released when it thaws. In the Taymyr Peninsula and surroundings in North Siberia, the area of the worldwide largest positive surface temperature anomaly for 2020, atmospheric methane concentrations have increased considerably during and after the 2020 heat wave. Two elongated areas of increased atmospheric methane concentration that appeared during summer coincide with two stripes of Paleozoic carbonates exposed at the southern and northern borders of the Yenisey-Khatanga Basin, a hydrocarbon-bearing sedimentary basin between the Siberian Craton to the south and the Taymyr Fold Belt to the north. Over the carbonates, soils are thin to nonexistent and wetlands are scarce. The maxima are thus unlikely to be caused by microbial methane from soils or wetlands. We suggest that gas hydrates in fractures and pockets of the carbonate rocks in the permafrost zone became unstable due to warming from the surface. This process may add unknown quantities of methane to the atmosphere in the near future.
In a warming world, the release of CO2 and methane from thawing permafrost to the atmosphere may lead to a positive feedback by increasing the concentration of greenhouse gases (1–3). Methane is particularly critical because of its high global warming potential per mass unit. In review articles on this subject, the focus is mainly on organic matter stored in frozen soils and its microbial decay and release as microbial methane upon thawing (1–3). However, thermogenic methane, i.e., natural gas from the deeper subsurface, may also contribute to the feedback. A proportion of thermogenic methane in addition to the dominant microbial methane was found in gas emission craters in Western Siberia (4). For the subsea permafrost in the East Siberian Arctic Shelf, it was argued that thawing can make the permafrost layer permeable for gas stored as hydrates or as free gas within the permafrost layer and also for subpermafrost gas (5). Isotopic signatures of methane released in the East Siberian Arctic Shelf are consistent with an origin as old, deep, and likely thermogenic methane
Question and Answer Pairs:
Q: How will permafrost thawing liberate methane?
A: Methane can be produced by the decay of thawed organic matter and from methane trapped beneath the permafrost.
Q: What kind of methane was found in Siberia?
A: Thermogenic methane was found in Siberian gas emission craters.
I gave it one example and you see it can generalize pretty quickly.
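If you want to script this instead of pasting into the Playground, here is a minimal sketch of the two mechanical pieces: assembling the prompt in the layout shown above, and pulling the Q/A pairs back out of the text. The function names are made up for illustration, and the actual model call is deliberately left out since it depends on which client and version you use.

```python
import re
from typing import List, Tuple

INSTRUCTION = (
    "Given the following passage, generate question and answer pairs "
    "for college students."
)


def build_prompt(passage: str, example_pairs: List[Tuple[str, str]]) -> str:
    """Assemble instruction, passage, and seed Q/A pairs into one prompt."""
    lines = [INSTRUCTION, "Passage:", passage.strip(), "Question and Answer Pairs:"]
    for question, answer in example_pairs:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append("Q:")  # left open so the model continues with a new question
    return "\n".join(lines)


def parse_pairs(text: str) -> List[Tuple[str, str]]:
    """Extract (question, answer) tuples from 'Q: ... / A: ...' formatted text."""
    pattern = re.compile(
        r"Q:\s*(.+?)\s*\nA:\s*(.+?)\s*(?=\nQ:|\Z)", re.DOTALL
    )
    return [(q.strip(), a.strip()) for q, a in pattern.findall(text)]
```

Because the prompt ends with an open "Q:", the completion will start mid-pair, so prepend "Q:" to the returned text before handing it to `parse_pairs`.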