Unstable output from GPT: refuses to reproduce previous successes

I’m using the OpenAI API to generate content about any website entered into a web form. I’m testing against the same website every time, a simple sports & entertainment blog, and I’ve run over 100 tests. About 10% of the time GPT successfully delivers the output; the other 90% it returns something like “I’m sorry, I can’t assist with this request.” The prompt is always the same. Why is GPT so unstable at generating this output?

What tool are you using to get the content of the website? You might consider using your own ‘function’ if it’s about simple scraping. A lot of websites have their robots.txt set to block OpenAI’s crawlers. Here’s my version:


import html2text
import requests

def webScrape(info=None):
    # Called with no arguments, return the function-definition template
    # to register with the Assistant.
    if info is None:
        return {
            "name": "webscrape",
            "description": "Get the text content of a webpage. If 'ignore links' is true, links will be removed from the text.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL of the website to scrape"},
                    "ignore links": {"type": "boolean", "description": "Ignore links in the text. Use 'False' to receive the URLs of nested pages to scrape."},
                    "max length": {"type": "integer", "description": "Maximum length of the text to return"}
                },
                "required": ["url", "ignore links"]
            }
        }
    # Strip links by default unless the caller asks for them.
    ignore = info.get("ignore links", True)
    # Configure the HTML-to-text converter.
    text = html2text.HTML2Text()
    text.ignore_links = ignore
    text.bypass_tables = False
    url = info["url"]
    # A browser-like User-Agent gets past many trivial bot blocks.
    header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'}
    if not url.startswith('http'):
        url = 'https://' + url
    try:
        h = requests.get(url, headers=header, allow_redirects=True, timeout=5)
    except requests.RequestException:
        # Return an empty string on network errors so the model gets a clean "no content" signal.
        return ""
    print('successful webscrape ' + url + ' ' + str(h.status_code))
    # Optionally truncate the converted text.
    if "max length" in info:
        return text.handle(h.text)[0:info["max length"]]
    return text.handle(h.text)

If you call this function without parameters, you get the function-definition template you need to add to the Assistant.
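To make that concrete, here’s a minimal usage sketch (the URL and length are just placeholder values):

# Get the JSON template to register as the function definition:
template = webScrape()

# Run an actual scrape; the dict keys match the template above:
page_text = webScrape({"url": "example.com", "ignore links": True, "max length": 4000})
print(page_text[:200])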


@jlvanhulst Thanks so much for the tip! Your suggestion got me closer. Essentially, I’ve realized the best way forward is to develop my own scraping solution and use the OpenAI API only for generating content based on the scraped content. GPT can browse the Internet, but it’s limited and unreliable at the moment.
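Roughly what I mean is this (a minimal sketch using the webScrape helper above and the Chat Completions endpoint; the model name and prompts are just placeholders):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_content(url):
    # Scrape locally, so GPT never has to browse on its own.
    page_text = webScrape({"url": url, "ignore links": True, "max length": 8000})
    if not page_text:
        return "Could not retrieve the page."
    # Pass the scraped text to the model purely as generation context.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you have access to
        messages=[
            {"role": "system", "content": "You write content about websites."},
            {"role": "user", "content": "Write a short summary of this site:\n\n" + page_text},
        ],
    )
    return response.choices[0].message.content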

Just curious: how would you go about scraping dynamic data (usually delivered by websites with JS) without relying on 3rd-party services?


You could write a similar scraper based on Selenium.
But you would only need that if you were ‘interacting’ with the data on the site, like filling out a form or navigating to reach the place you want to scrape.
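Something along these lines (an untested sketch; it assumes the selenium package plus a local Chrome/chromedriver install, and the function name is just illustrative):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import html2text

def webScrapeJS(url, ignore_links=True):
    # Headless Chrome renders the page, including JS-generated content.
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        rendered = driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()
    # Convert the rendered HTML to text, same as the requests version.
    text = html2text.HTML2Text()
    text.ignore_links = ignore_links
    return text.handle(rendered)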
