Fake News ML Guardian 🧙🏻‍♂️

In Brazil, unfortunately, we have the bad habit of sharing false news, and this ended up changing the course of the last 3 elections in the country – from mayors to president. I’m in the ideation and prototyping phase of a web app that will check fake news sites (via URL/server parsing) using machine learning. This will be my 2022 project!

I would like to know if anyone, whether an engineer or just someone curious, would be interested in participating in this build using GPT-3! What do you, the reader, think about joining in?

I’m open to feedback on the idea too. Whatever the result of this effort, I know it will be much better if it is built together, to improve the democratic process in my country and, who knows, the entire world!

6 Likes

Yes! Check out my idea. We have started building a list of datasets. Let’s collaborate!

1 Like

How are you determining your source of ‘truth’ for the app?

7 Likes

Great question! I have been thinking about two approaches. The first uses information about the hosting, domain, security certificates, protocols, etc. of the site the user wants to check; the app would use this data to filter out potentially “dirty” sources of information. But I’ve actually been considering a second approach as well: machine learning over hundreds or maybe thousands of fake-site URLs to create clusters.
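Roughly, the first approach could start with a sketch like this (the feature set is just an assumption on my part, not a final design):

```python
# Sketch of the first approach: collect technical metadata about the site
# behind a URL. Feature choices here are illustrative assumptions only.
import socket
import ssl
from urllib.parse import urlparse


def collect_source_features(url: str) -> dict:
    """Gather basic technical signals about the site hosting a given URL."""
    host = urlparse(url).hostname or ""
    features = {
        "host": host,
        "uses_https": url.lower().startswith("https://"),
        "has_valid_cert": False,
        "cert_issuer": None,
    }
    if features["uses_https"]:
        try:
            ctx = ssl.create_default_context()
            with socket.create_connection((host, 443), timeout=5) as sock:
                with ctx.wrap_socket(sock, server_hostname=host) as tls:
                    cert = tls.getpeercert()
                    features["has_valid_cert"] = True
                    issuer = dict(item[0] for item in cert.get("issuer", ()))
                    features["cert_issuer"] = issuer.get("organizationName")
        except (ssl.SSLError, OSError):
            pass  # a failed handshake is itself a useful signal
    return features
```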

what do you think @asabet ?

I’ve already visited your topic and I was seriously impressed, haha!

I would love to collaborate with you on what you are bringing to the community.

My idea is closely tied to the problem we are currently living through in my country, and it could even be a kind of MVP for something bigger, like a lie detector. What do you think?

1 Like

I think that would be great. I personally would love to have something where you just grab a link for a news article and paste it into a service (or an app) and it uses NLP/GPT-3 to parse the claims in the article, check for accuracy, and summarize. Because what do I do if I come across something that I want to verify? I just google for similar articles to get the cross-check and cross-validation. But not everyone does that.

Perhaps the biggest problem is that some people want to stay in their echo chambers. So how do you use GPT-3 to expose people to information and ideas that they find objectionable? Does their free will figure into it? These are bigger questions that may or may not figure into an MVP.

1 Like

First issue I see is that classifying the entire text from a labelled misinformation site may not be entirely accurate, as news from most sources has elements of truth and falsehood in it (i.e. bias). If you’re training a classifier on sources of misinformation, it might give better results to label misinformation at a more granular level (i.e. sentences and paragraphs); otherwise you might see high false-positive rates due to unrelated factors like writing style.
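As a rough illustration of what I mean by granular labelling (the splitter is deliberately naive, and `score_statement` is a stand-in for whatever finetuned classifier you end up training):

```python
# Sentence-level labelling instead of whole-article classification.
import re


def split_into_statements(article_text: str) -> list:
    """Naive sentence splitter; a real pipeline would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", article_text) if s.strip()]


def label_article(article_text: str, score_statement) -> list:
    """Return (sentence, misinformation_probability) pairs for human review."""
    return [(s, score_statement(s)) for s in split_into_statements(article_text)]
```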

There are also cases where obtaining unbiased expert insight is difficult, e.g. for evaluating medical misinformation, and the community’s consensus on what is ‘true’ can change over time. It might help to create a transparent, community-driven database for tracking and discussing misinformation sources, then have a review process for the sources of text that are used to train a potential misinformation classifier. If you can figure out a process that reliably produces ‘ground truth’, training a classifier becomes much easier, in my opinion.
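One possible shape for records in such a database (every field name here is just an assumption for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class ClaimRecord:
    claim_text: str                # the statement being tracked
    source_url: str                # where the claim was published
    verdict: str = "under_review"  # e.g. "true", "false", "misleading"
    evidence_urls: list = field(default_factory=list)
    reviewer_votes: int = 0        # community review before training use


def is_ground_truth(record: ClaimRecord, min_votes: int = 3) -> bool:
    """Only reviewed, settled records should feed a training set."""
    return record.verdict != "under_review" and record.reviewer_votes >= min_votes
```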

2 Likes

The problem is that you’re thinking of this as a traditional NLP task, e.g. classification. Using GPT-3 for that task would just be silly. What we’re talking about is more like automatically generating Snopes articles.

1 Like

? Perhaps you need me to clarify my post? In the first paragraph I point out that classification of an entire text would be difficult (‘silly’ in your words), but give concrete reasons as to why. However, if you break a text down into statements (i.e. sentences), then it may be possible to accurately evaluate each statement with a finetuned classifier (or a classifier paired with a search model). Since language models can act as knowledge bases (Petroni et al., 2019), a finetuned model can plausibly classify individual statements as misinformation, if it’s regularly trained on accurate ground truth. It’s unproductive to make unqualified assertions about a specific method, as it depends on which assumptions you make for your task, and on actual experimental results.

In the second paragraph I point out that, regardless of your approach, constructing an accurate wiki-style database is the more important task. The success of whatever you do downstream (i.e. training a classifier, or a Snopes generator) depends most on the ground-truth dataset itself.

Regardless, I’m directly addressing @marcelx’s questions about ‘checking’ and ‘filtering’ sources, which isn’t mutually exclusive with whatever you’re saying :pray:.
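To make the statement-level idea concrete, here is a rough few-shot sketch with the GPT-3 Completion API (the prompt wording and label set are assumptions on my part; a finetuned model would likely be more reliable):

```python
import openai  # 2021-era openai-python API; assumes OPENAI_API_KEY is set

FEW_SHOT = """Classify each statement as SUPPORTED, UNSUPPORTED, or UNVERIFIABLE.

Statement: Water boils at 100 degrees Celsius at sea level.
Label: SUPPORTED

Statement: The moon landing was staged in a film studio.
Label: UNSUPPORTED

Statement: {statement}
Label:"""


def classify_statement(statement: str) -> str:
    resp = openai.Completion.create(
        engine="davinci",
        prompt=FEW_SHOT.format(statement=statement),
        max_tokens=5,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()
```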

I disagree with a lot of your assumptions, but it’s okay to disagree.

Friends @asabet and @daveshapautomator, regardless of the direction this idea takes, I think our discussion is heading the right way, because the subject is indeed thorny; it would be easier to apply AI to sales, marketing, or even grammar rules. Developing and training a machine to deal with lies means also doing the same for the truth, and that is a paradox that, from my point of view, few still have the courage to discuss the way we are doing here, so I am grateful for that.

Now, about the body of the project: I like the suggestion of having a database and text analysis at granular levels, but I think this complexity needs to be treated very seriously to avoid creating a failed fake-news machine. I still think “attacking” the SOURCES is the best way to find the white rabbit.

My approach: a list (DB) of thousands of “Pink Slime” sites used for machine training could generate really interesting output. To avoid creating something deterministic about truth and lies, I would present percentages to the end user, computed from all the parameters (SSL, protocols, manual reports, and more data): something like “The Bot learned there is an 89% chance this source is fake” rather than a binary true/false.
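Something like this toy sketch (the features and training rows are placeholders; the real DB would have thousands of entries):

```python
# Toy sketch: probability output instead of a binary verdict.
from sklearn.linear_model import LogisticRegression

# features per site: [has_valid_ssl, uses_https, manual_report_count]
X_train = [
    [1, 1, 0],  # known-legitimate sites
    [1, 1, 1],
    [0, 0, 9],  # known "Pink Slime" sites
    [0, 1, 6],
]
y_train = [0, 0, 1, 1]  # 0 = legitimate, 1 = fake

model = LogisticRegression().fit(X_train, y_train)


def describe_source(features: list) -> str:
    p_fake = model.predict_proba([features])[0][1]
    return f"The Bot learned there is a {p_fake:.0%} chance this source is fake"
```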

Can you help me see any flaw in this idea?

2 Likes

Here are some interesting guidelines on how a user can check a source by hand. This technical part would be done by the machine (GPT-3) over and over again, until it checks sources better than any human, journal, or fact-checking site (a sketch of automating one of these checks follows after the list):

Get Technical

Different formats of media come with their own conventions. Most fake news that you will encounter disguises itself as legitimate online news, so knowing the conventions of legitimate online news sources will help you understand how fake news differs.

  • Check the domain name.

    • Legitimate news sources usually have a professional domain name that matches the name of their organization. For instance, the website for CBC news is http://www.cbc.ca/news. Fake news URLs are less likely to be professional in nature or identifiable as a distinct news organization.
    • Identify the top-level domain of a URL, as this will tell you the country where the site is hosted (e.g. .ca, .au) or the purpose of the site (.edu, .com). Fake news sites sometimes use URLs that mimic legitimate sites but with a different top-level domain: for instance, http://www.cbc.ca.co/news.
    • Site names ending in “lo,” such as Newslo, are also conventionally fake.
  • Check for an About Us page, a Contact Us page, or other information pages. All legitimate news sites have pages like this, although the names may differ.

  • Check the links. Broken links happen to the best of us, including legitimate news sources. However, most links on a news article should work, and these links should take the reader to other, legitimate sources.

  • Have a look at the web design. Examples of poor web design include sites with too many colours or fonts, poor use of white space, and numerous animated gifs. Good web design is a sign of credibility, and legitimate news sources will prioritize having a proper website. A news organization like the CBC can afford to hire a web designer; they cannot afford to have a site that is unpleasant to visit. This is not to say that all sites with good web design are legitimate.

  • Learn to recognize paid content. Many legitimate news sources include advertising on their site, often in the form of native advertising that blends in with regular articles. Paid advertising like this does not meet the standards of true journalism. Some examples of native advertising are available on this Milton Academy Library Guide.

  • Check who owns the domain. If you’re curious about who owns a website, try looking it up on https://www.whois.net/. For instance, a search for cbc.ca will show that it is owned by the Canadian Broadcasting Corporation.

  • Install a browser extension to warn you when you are visiting a fake news site, such as the Fake News Alert for Chrome.

  • Research the images. If an image used in a news article looks suspicious to you, try using TinEye or Google reverse image search to find out if the image has previously been used elsewhere. If it has, check if it has since been edited. If the image is legitimate, searching for other images of the same scene might provide you with more context.
    source: Identifying Fake News - Fake News - Research Guides at Ryerson University Library

more here: Identifying Fake News Sources - Evaluating Websites - MaxGuides at Bridgewater State University
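To make one of these concrete, here is a rough sketch of automating the domain-mimicry check from the list above (the trusted-domain list is a toy placeholder; a real system would maintain a much larger one):

```python
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"cbc.ca", "bbc.com", "reuters.com"}


def looks_like_mimicry(url: str) -> bool:
    """Flag URLs like http://www.cbc.ca.co/news that embed a trusted
    domain name but actually resolve under a different one."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    if host in TRUSTED_DOMAINS:
        return False
    return any(trusted in host for trusted in TRUSTED_DOMAINS)


print(looks_like_mimicry("http://www.cbc.ca/news"))     # False
print(looks_like_mimicry("http://www.cbc.ca.co/news"))  # True
```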

This is a great conversation. On point! Ten years ago or so, I wanted to build a similar system. Yet much more rudimentary.

As you know, at some point we have to surrender our knowledge to the folks that know way more than us as a collective in a particular subject. Take Science for an example, we/they love to prove others wrong via confirming observations or theory. That’s how we get our best answers whether right or wrong, and move forward. Reality always reveals itself eventually.

Here was the idea: say you’re using Chrome or FF and you’re reading an article, a post, a research paper, whatever, and you want to verify something written as truth. You would highlight the copy, right-click (through a plugin/add-on), and choose a new context item saying “verify content”, which calls a service with the logic to confirm the accuracy of any postulate by sourcing related context from objective sources (like you mention in Get Technical). That’s straightforward work.
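The service side could be as simple as this sketch (the route and field names are my own assumptions; the verdict logic is a stub):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/verify", methods=["POST"])
def verify_content():
    highlighted = request.get_json().get("text", "")
    # Stub: here you would source related context from objective sources
    # and score the highlighted claim against it.
    return jsonify({
        "claim": highlighted,
        "confidence_fake": None,
        "related_sources": [],
    })


if __name__ == "__main__":
    app.run(port=8000)
```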

YET,

How do you promote trust in your system to be developed, when the tribalism has gotten so fierce? Wrong or not, trusting the collective objectivity of a statement - trust as best as humans can provide - somehow has to be learned by the folks who will not even communicate with people who don’t deliver info that reassures their confirmation bias. Then we’re back to square one. That’s the challenge here. And I don’t have the brain power to solve it.

Think of self driving cars, the tech is so simple and very safe, yet how do we get everyone to drive that car, and get them to take their hands off the wheel? It’s not gonna be easy. But it has to happen eventually.

@marcelx, good luck - it’s brilliant work that can change the world, literally. Just keep pushing; the “VIX” on false information is growing exponentially every day.

My nonsense aside, I did design a logo and did some concept work on mascots back then. I’ll attach for a laugh.

2 Likes

the other :slight_smile:

[attachment: Logo-YarnDog_SmallIcon]

3 Likes

Wow! Really informative video. Thanks for sharing!

1 Like

@marcelx @asabet and @daveshapautomator

This is a really cool project. Happy to throw my hat into it. I have built a Telegram bot for the same purpose. It’s pretty rudimentary in that sense, but do check it out & tell me what you think :slightly_smiling_face:

and here’s a hackernoon article talking about its architecture:

1 Like

I think the scope of its usefulness would be limited to low-hanging fruit like spam posts. Here in the US our own government officials lie to us regularly, and when they are caught, they just ignore the subject and redirect to something else, or come up with logically invalid defenses (those highly confidential documents with alarming information that accidentally get leaked? Uhh, we cannot comment on it as it’s confidential and part of an ongoing investigation) until people stop talking about it. Your proposed AI would probably end up automatically ranking these as top-level sources. In some cases this may do even more damage than benefit, as there are already plenty of folks who accept statements from the government as gospel. The last thing they need is an independent fact checker that reassures them that the government would never lie. :innocent:

Being skeptical and doing your own thorough research is necessary if you wish to be well informed on a topic. Think about it for a moment: being reliant on 3rd-party AIs to provide you with what they deem to be the most appropriate content for you. It’s almost as if… they WANT you to look no further and see nothing else.

1 Like

I like the idea, but I think you’re looking at the problem from the wrong perspective. The news is one thing and the people consuming it are another.

For example, if a well-known fitness expert says that a certain protein is good and another protein sucks, how do we confirm whether the claim is true or false?

Is there scientific literature that we can look at? Are there studies done on this subject?

The average person won’t bother to look into it; they’ll just take the fitness expert’s word for it. Why? Because it causes friction to have to search, read, and interpret the information. It’s much easier for a so-called “expert” to tell us.

So the idea isn’t just to spot fake news, but to convince the general public that your solution is trustworthy, so that they check in with your solution to see whether a story is fake or not.

Think Kelley Blue Book, a site that tells you the value of the car you’re trying to sell, as well as the car you’re trying to buy.

Before KBB you had to find the value of a car by cross-referencing different sites, dealers, etc., and then do the calculation to see what the average price was, in order to determine if you were selling or buying at the right price.

Now KBB is trusted even by dealerships, because if KBB says that my car is worth $5000 and you’re offering me $3000, I know for a fact I’m being lowballed.

Now let’s use the news in this context. The president of Brazil says that covid cases are going down, but the data from hospitals says otherwise.

I see the news and might think “are cases really going down?” I go and check your web app, which tells me there’s an 80% chance it’s not true, with cited sources that I can see for myself.

So the aim is to create trust with your solution because you’re asking the general public to trust your solution more than the news and the government.

2 Likes

This is the key point here. In the information age, finding reliable information, finding discussions, and determining the trustworthiness of a source are easy AF. As you pointed out, people just don’t do it. Instead, we have people self-selecting into echo-chambers where everyone looks for confirmation bias.

No one who wants to believe that COVID is a hoax is going to check reliable news sources. They are only going to go straight to their preferred propaganda station.

What’s really needed is a more psychological approach to the problem. This is called “infodemiology”, so you might want to explore that, @marcelx. Use something like GPT-3 not just to verify a single source, but to track down where the information is traveling to and from.

1 Like

I cross-check everything and to be honest I don’t like how much time it consumes.

So I had an idea a while back, before I discovered GPT-3, of crawling through the first page of Google, checking for duplicate information (since a lot of sites just copy and paste their info from another source) and then summarizing it.

That way I didn’t have to go through most of the search results in order to cross reference.
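The duplicate check itself is simple; something like this (fetching the results and the GPT-3 summarization are left as placeholders):

```python
from difflib import SequenceMatcher


def deduplicate(articles: list, threshold: float = 0.8) -> list:
    """Keep one representative of each cluster of near-identical texts."""
    unique = []
    for text in articles:
        if all(SequenceMatcher(None, text, kept).ratio() < threshold
               for kept in unique):
            unique.append(text)
    return unique
```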

Now I see you mention that you would like something where you can use a link to parse the claims, check for accuracy, and summarize.

A Chrome extension makes sense; it could do those checks for you when you land on the article. The Honey Chrome extension came to mind: you go to a site that has something for sale and Honey checks if there are any discount codes you can use.

In this case the extension would crawl the site, check for accuracy and provide a summary.

[quote] So how do you use GPT-3 to expose people to information and ideas that they find objectionable? Does their free will figure into it? These are bigger questions that may or may not figure into an MVP.
[/quote]

This is a great question, but I wouldn’t bother exposing them to it. I would rather keep people informed with the truth and let that spread. Eventually they’ll get exposed to it and have no choice but to question their beliefs.

People challenged the claim that the earth was round and then humans verified that it was indeed round, so when people came out again claiming the world was flat, they looked stupid because the truth had spread far enough.

1 Like