Training data for summarizing web articles

Hi, wondering if anyone can help with this…

What is the best way to train / fine-tune GPT-3 to summarise web articles? Data would be needed from different sources - mainly blogs and websites with content about business, marketing, or the food industry, for example. Are there training datasets already available, or are there companies that help create such datasets for use in fine-tuning?



What would your article summary look like? How long should it be?

Hi Serge,

Summaries would be the key points from an article (reducing it to ~20-30% of its original size). For an article like this - 3 Food Innovations Changing How the World Eats | by USAID | U.S. Agency for International Development | Medium - the summary would pick out the headings and summarise the text below each one, e.g.:

3 Food Innovations Changing How the World Eats

Cricket-raising farms, coffee flour, and mold-killing technology are among the winners of a recent USAID LAUNCH food challenge - an open competition to find innovative ideas in food.

Cricket-Raising Farms

Crickets could offer a surprising but promising solution to food demands. In 2013, the United Nations Food and Agriculture Organization proposed using insects as an alternative protein source for the growing global population. Insect-farming company Entomo Farms can supply up to 5,000 pounds of raw crickets per week at an acceptable price point. They provide cricket powder to food producers, who include it in protein bars, shakes, pasta, and more.

Coffee Flour

Key points here etc.


Oh yeah, you should definitely be able to do something like this. I will make a video later today to show you how you might want to mess around with this idea.

I am at work currently (the OpenAI project hasn’t quite taken off for me yet :stuck_out_tongue: ), but once I am home I will record a short video showing how you can train this in the short term - meaning simply in the OpenAI Playground. Then I will also give you ideas for how you can fine-tune your model, using at least 300-500 examples of desired completions.

If you can’t wait for the video, here is what you do: ‘stack’ your articles and successful summaries inside the Playground. Separate each article/summary pair with ‘###’ or any other stop symbol you please; the stop symbol marks the end of one prompt/completion pair.
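The stacking described above can be sketched in Python. This is just an illustration (the function name `build_prompt` and the placeholder texts are my own; the ‘###’ separator and the “3 main points” task wording come from the post):

```python
# Sketch of "stacking" article/summary pairs into one few-shot prompt.
# The "###" separator doubles as the stop sequence, as described above.

SEPARATOR = "\n###\n"

def build_prompt(examples, new_article):
    """examples: list of (article, summary) pairs you wrote by hand.
    new_article: the article you want GPT-3 to summarise."""
    parts = []
    for article, summary in examples:
        parts.append(
            f"Article: {article}\n"
            f"Task: Summarize the 'Article' into 3 main points.\n"
            f"Summary: {summary}"
        )
    # The final block leaves Summary: empty for the model to complete.
    parts.append(
        f"Article: {new_article}\n"
        f"Task: Summarize the 'Article' into 3 main points.\n"
        f"Summary:"
    )
    return SEPARATOR.join(parts)

prompt = build_prompt(
    [("First example article...", "1. point one 2. point two 3. point three")],
    "The article you actually want summarised...",
)
```

In the Playground you would paste the resulting text directly and set ‘###’ as the stop sequence, so the model stops after writing one summary.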

I haven’t tested the prompt below, but I’m 99.9% sure it will give you a very juicy completion - especially if you stack 3-4 articles on top of each other. You’ll find that 60-70% of your article summaries are just insanely good, but you’ll never really conquer the remaining 30-40% without ‘fine-tuning’ your model on 500+ examples.
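For the fine-tuning step itself, the examples need to be in OpenAI’s JSONL format: one JSON object per line with `prompt` and `completion` keys. A minimal sketch (the file name is arbitrary, and the ‘###’ separator and trailing ` END` token follow OpenAI’s fine-tuning conventions for marking where the prompt ends and the completion stops):

```python
import json

# Write (article, summary) pairs to the JSONL format used for GPT-3
# fine-tuning: one JSON object per line. A fixed separator ends each
# prompt, and each completion starts with a space and ends with a
# stop token, per OpenAI's fine-tuning data guidelines.

def write_finetune_file(pairs, path="summaries.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for article, summary in pairs:
            record = {
                "prompt": f"{article}\n\n###\n\n",
                "completion": f" {summary} END",
            }
            f.write(json.dumps(record) + "\n")

write_finetune_file([("Some article text...", "1. ... 2. ... 3. ...")])
```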

How do you get 500+ examples? 3 ways.

  1. Make them yourself using OpenAI (yawn).
  2. Pay a data scientist to create the summaries for you. They will craft these summaries exactly the way the user and the AI want to see them (crazy expensive).
  3. The hardest way, but definitely the BEST: build an application that people can use to summarize articles. Give it away for free or charge for it - the important part is that you save ALL of their prompts and completions into a database, so you can use them to train your model. Woo-hoo! Almost free data :smiley:
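The database part of option 3 can be as simple as a single SQLite table. A sketch, assuming you just want to capture every article/summary pair your app produces (the table and column names are purely illustrative):

```python
import sqlite3

# Sketch of option 3: persist every article/summary pair your app
# produces so the pairs can later be exported as fine-tuning data.
# Table and column names are illustrative, not prescriptive.

conn = sqlite3.connect("summaries.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS completions (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           article TEXT NOT NULL,
           summary TEXT NOT NULL,
           created_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def log_pair(article, summary):
    # "with conn" commits the insert (or rolls back on error).
    with conn:
        conn.execute(
            "INSERT INTO completions (article, summary) VALUES (?, ?)",
            (article, summary),
        )

log_pair("An article about cricket farms...", "1. ... 2. ... 3. ...")
rows = conn.execute("SELECT article, summary FROM completions").fetchall()
```

From there, exporting the table to the JSONL format OpenAI expects is a simple SELECT plus a loop.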

Oh by the way, we are both in EdTech - we should connect.



Article: [paste an example article here]

Task: Summarize the ‘Article’ into 3 main points.

Summary: [your hand-written 3-point summary]

###

Article: [paste the article you want summarised here]

Task: Summarize the ‘Article’ into 3 main points.

Summary: [Let OpenAI complete this one for you :smiley:]


Great, thanks very much Devin - very helpful!
Will connect - would love to chat more about #3, as that’s the challenge for sure! :wink: