How is GPT-4 getting this information from a URL?

I have a Python function that sends the URL of a news article to GPT-4, along with prompts asking for information about it.

[
    {
        "role": "system",
        "content": 'You help people get information from articles'
    },
    {
        "role": "user",
        "content": f'Here is a URL: {url}'
    },
    {
        "role": "user",
        "content": 'Using JSON format, tell what "country" is the article about, what is the "article date" and provide a 100 word "summary" of it.'
    },
]

I understand that GPT-4 through the API doesn’t have internet access, based on the response I get for some URLs:

Response: I'm sorry, but as an AI, I'm currently unable to directly access or retrieve information from external URLs or websites. However, I can help you understand how to extract information from an article. If you provide the text or key details from the article, I'd be more than happy to help summarize or analyze it.

However, for some URLs I do get the response I want, and it looks as if the model is able to go to the URL and read the article rather than just scanning the URL string.

For example, this URL: www.leadership.ng/4000mw-electricity-generation-shameful-unacceptable-adelabu gives me this response:

Response: {
  "country": "Nigeria",
  "article_date": "2021-10-13",
  "summary": "The governorship candidate of the All Progressives Congress (APC) in Oyo State, Chief Adebayo Adelabu, has described Nigeria's 4000MW electricity generation as shameful and unacceptable. He stated this during a lecture at the University of Ibadan, where he emphasized the need for Nigeria to improve its power sector. Adelabu also highlighted the importance of electricity in driving economic growth and development. He further suggested that Nigeria should explore alternative sources of energy, such as solar, wind, and nuclear, to boost its electricity generation."
}

This is exactly what I wanted. The article_date is wrong, but the summary contains more information than is in the URL string, so it appears the model has gone to the URL.

Can anyone shed some light on what is happening here?

GPT-4 only has internet access via Bing search. Search Bing for the URL it can’t access (in quotes). If the page isn’t indexed, and so doesn’t show up in that search, it doesn’t exist as far as GPT is concerned. You could write a script that wraps Selenium commands as an ad hoc API for GPT to control, giving it internet access beyond Bing.

The API models cannot access the Internet.

It appears that requesting JSON output bypasses the anti-hallucination training that would otherwise stop the model from generating content for such links.

The country can be inferred from the top-level domain (.ng) and from the person’s name. The rest is spelled out in the URL, so there is enough information for a creative writing assignment.
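To make this concrete, here is a rough sketch of how much can be read off the URL string alone, without any network access. The helper name url_clues is made up for illustration; it only uses the standard library's urllib.parse.

```python
from urllib.parse import urlparse


def url_clues(url: str) -> dict:
    """Return the parts of a URL string that can be read without fetching it.

    url_clues is a hypothetical helper for illustration only.
    """
    # urlparse only recognises the host when a scheme is present, so add one.
    if "://" not in url:
        url = "http://" + url
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # The last label of the host is the top-level domain, e.g. "ng" for Nigeria.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    # The last path segment (the "slug") usually carries the headline words.
    slug = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    words = [w for w in slug.split("-") if w]
    return {"host": host, "tld": tld, "slug_words": words}


clues = url_clues(
    "www.leadership.ng/4000mw-electricity-generation-shameful-unacceptable-adelabu"
)
print(clues["tld"])         # country hint from the domain
print(clues["slug_words"])  # headline words the model can build a story from
```

Everything in the generated summary beyond these few words (the lecture, the venue, the date) has no source in the request, which is consistent with it being invented.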

The AI will fabricate an article from nothing more than the text within a URL, simply writing the text most likely to follow the request:

Summarize the article: www.leadership.ng/american-ai-scams-grandmother-Kano

An American grandmother has been scammed out of thousands of dollars by a man posing as a US soldier serving in Syria. The man contacted the grandmother on a dating website, claiming to be in love, and began asking for money for various reasons, including a sick child and a visa to enter the US. The grandmother, who lives in Kano, Nigeria, sent over $200,000 to the man before realizing she had been scammed. Police in Kano have arrested a woman suspected of being an accomplice to the scammer, and are continuing to investigate.
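Since the API model only ever sees the characters in the prompt, one way around this is to download the page yourself and send the article text instead of the URL. The following is a minimal sketch using only the Python standard library, not a definitive implementation. The function names and the 12000-character cut-off are assumptions for illustration, and the text extraction is deliberately crude (it keeps menus and footers too).

```python
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Crude HTML-to-text conversion (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_article_text(url: str, timeout: float = 10.0) -> str:
    """Download the page ourselves; the model never touches the network."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return html_to_text(resp.read().decode(charset, errors="replace"))


def build_messages(article_text: str, max_chars: int = 12000) -> list:
    """Build the chat messages with the article text in place of the URL.

    max_chars is an arbitrary assumption to keep the prompt short.
    """
    return [
        {"role": "system",
         "content": "You help people get information from articles. "
                    "Use only the article text provided."},
        {"role": "user",
         "content": "Here is the article text:\n" + article_text[:max_chars]},
        {"role": "user",
         "content": 'Using JSON format, tell what "country" is the article about, '
                    'what is the "article date" and provide a 100 word "summary" of it.'},
    ]


sample_html = ("<html><head><style>p{}</style><script>var x=1;</script></head>"
               "<body><h1>Title</h1><p>Hello world</p></body></html>")
sample_text = html_to_text(sample_html)
messages = build_messages(sample_text)
```

The list returned by build_messages(fetch_article_text(url)) can then be passed as the messages in the existing GPT-4 call, so any summary that comes back is based on the real page rather than on the URL string.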

Thanks, much appreciated. That’s really helpful to understand. I’m not an expert on GPT-4, so I just wondered what it was actually doing here. It’s quite frightening that, in the absence of any real information, it just made it up, especially when that wasn’t explicitly asked for.

Are there rate limits on using Bing’s GPT-4 search? I will have close to 200 URLs to get this information from.