Unstable output from GPT: refuses to reproduce previous successes

I’m using the OpenAI API to generate content about any website entered into a web form. I’m testing against the same website every time, a simple sports & entertainment blog, and I’ve run over 100 tests. About 10% of the time GPT successfully delivers the output; the other 90% it returns something like “I’m sorry, I can’t assist with this request.” The prompt is always the same. Why is GPT so unstable at generating this output?

What tool are you using to get the content of the website? You might consider using your own ‘function’ if it’s about simple scraping. A lot of websites have their robots.txt set to block OpenAI’s crawlers. Here’s my version:


import html2text
import requests

def webScrape(info=None):
    # Called with no arguments, return the function-definition template
    # to register with the Assistant.
    if info is None:
        return {
            "name": "webscrape",
            "description": "Get the text content of a webpage. If 'ignore links' is true, links will be removed from the text.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL of the website to scrape"},
                    "ignore links": {"type": "boolean", "description": "Ignore links in the text. Use 'False' to receive the URLs of nested pages to scrape."},
                    "max length": {"type": "integer", "description": "Maximum length of the text to return"}
                },
                "required": ["url", "ignore links"]
            }
        }
    # Strip links by default unless the caller asks for them.
    ignore = info.get("ignore links", True)
    # Configure the HTML-to-text converter.
    text = html2text.HTML2Text()
    text.ignore_links = ignore
    text.bypass_tables = False
    url = info["url"]
    # A browser-like User-Agent gets past many trivial bot blocks.
    header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'}
    if not url.startswith('http'):
        url = 'https://' + url
    try:
        h = requests.get(url, headers=header, allow_redirects=True, timeout=5)
    except requests.RequestException:
        # Return an empty string on network errors so the model gets a clean "no content" signal.
        return ""
    print('successful webscrape ' + url + ' ' + str(h.status_code))
    # Optionally truncate the converted text.
    if "max length" in info:
        return text.handle(h.text)[0:info["max length"]]
    return text.handle(h.text)

If you call this function without parameters, you get the function-definition template you need to add to the Assistant.
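To make that concrete, here’s a minimal usage sketch (the URL and length are just placeholder values):

# Get the JSON template to register as the function definition:
template = webScrape()

# Run an actual scrape; the dict keys match the template above:
page_text = webScrape({"url": "example.com", "ignore links": True, "max length": 4000})
print(page_text[:200])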


@jlvanhulst Thanks so much for the tip! Your suggestion got me closer. Essentially, I’ve realized the best way forward is to develop my own scraping solution and use the OpenAI API only for generating content based on the scraped content. GPT can browse the Internet, but it’s limited and unreliable at the moment.
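Roughly what I mean is this (a minimal sketch using the webScrape helper above and the Chat Completions endpoint; the model name and prompts are just placeholders):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_content(url):
    # Scrape locally, so GPT never has to browse on its own.
    page_text = webScrape({"url": url, "ignore links": True, "max length": 8000})
    if not page_text:
        return "Could not retrieve the page."
    # Pass the scraped text to the model purely as generation context.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you have access to
        messages=[
            {"role": "system", "content": "You write content about websites."},
            {"role": "user", "content": "Write a short summary of this site:\n\n" + page_text},
        ],
    )
    return response.choices[0].message.content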

Just curious: how would you go about scraping dynamic data (usually delivered by websites with JS) without relying on 3rd-party services?


You could write a similar scraper based on Selenium.
But you would only need that if you were ‘interacting’ with the data on the site, like filling out a form or navigating to reach the place you want to scrape.
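Something along these lines (an untested sketch; it assumes the selenium package plus a local Chrome/chromedriver install, and the function name is just illustrative):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import html2text

def webScrapeJS(url, ignore_links=True):
    # Headless Chrome renders the page, including JS-generated content.
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        rendered = driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()
    # Convert the rendered HTML to text, same as the requests version.
    text = html2text.HTML2Text()
    text.ignore_links = ignore_links
    return text.handle(rendered)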
