I made several attempts on different days and with different connections.
From 3G to fiber, but the result is always the same…
A few times I even gave it a site, and inside the context menu it tried to ‘click’ on another site that had nothing to do with the one I had passed it.
Whereas if I use the WebPilot plugin, I get results.
It would be nice if the plugin were able to keep an index of disallowed sites and exclude them from the search so the model would eventually only have good links.
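Something along these lines could work. This is just a rough sketch in Python, assuming a hypothetical plugin that can filter search results before handing them to the model; the function names and the cache are my own invention, not anything OpenAI has shipped:

```python
# Rough sketch of a disallow index for a hypothetical browsing plugin.
# The function names and the cache are illustrative assumptions.
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "ChatGPT-User"          # user agent OpenAI documents for browsing
_disallowed_hosts: set[str] = set()  # the persistent "bad link" index

def is_fetchable(url: str) -> bool:
    """Check robots.txt once per host; remember hosts that block us."""
    host = urlparse(url).netloc
    if host in _disallowed_hosts:
        return False                 # known to block the agent; skip it
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{host}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True                  # no readable robots.txt; assume open
    if not rp.can_fetch(USER_AGENT, url):
        _disallowed_hosts.add(host)  # index it so search excludes it next time
        return False
    return True

def filter_results(urls: list[str]) -> list[str]:
    """Drop search results the model would fail to open anyway."""
    return [u for u in urls if is_fetchable(u)]
```

With a cache like this, the model would only ever be handed links it can actually open, instead of burning a turn on a refusal.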
I think we’ll see both web browsing plugins emerge which ignore robots.txt, and sites like archive.is sprout up specifically for LLMs to interact with, which will greatly improve the experience for users.
I tried to summarize 50 sites with different content.
But the result is always the same…
“I am sorry, but I cannot access the page at the moment. It is possible that the site has problems or that it does not allow automatic systems like me to access its contents.”
I understand OpenAI is respecting each website’s robots.txt file (which is right), but in many cases it makes browsing unusable.
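For anyone unfamiliar with the mechanism: a site opts out with a couple of lines in its robots.txt. For example, a file like this blocks only the browsing agent (ChatGPT-User is the user agent OpenAI documents for the browsing feature) while leaving other crawlers alone:

```
# Blocks only OpenAI's browsing agent; everything else may crawl
User-agent: ChatGPT-User
Disallow: /
```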
I honestly think the opposite. I think robots.txt will soon be enforceable by law, and websites like the Wayback Machine will be beaten into submission. Who would want other, usually paid-for, services to leech off their content? Instead, I think sites will open paid API endpoints so that whoever gathers the data has a legal right to it. Unless you’re the product, you will need to pay.
I mean, look at Facebook’s code. ReactJS already does a pretty good job of preventing simple scraping, but they also use many honeypots, obfuscations, hidden tricks, and even an AI to catch bots.
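For what it’s worth, a typical honeypot is as simple as a link that humans never see but naive scrapers happily follow. This is a generic illustration, not Facebook’s actual markup:

```html
<!-- Generic honeypot illustration: the link is invisible to humans, so
     any client requesting /bot-trap is almost certainly an automated
     scraper and can be flagged or banned. -->
<a href="/bot-trap" style="display:none" aria-hidden="true">catalog</a>
```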
robots.txt will never be enforceable by law. It would completely undermine the fabric of the internet as we know it.
Additionally, it would make it impossible for any new company to enter the search space.
Beyond that, imagine if it were.
Google (for instance) could then, in theory, promote sites with which they had exclusivity arrangements and even de-list those with which they didn’t.
Creating a bifurcated internet.
No, I think the unintended consequences of a legally enforceable robots.txt file would be far too grievous to ever even entertain that as a possibility.
Besides, website terms of service have already been found to be unenforceable, and what is robots.txt but a very terse terms of service?
Google does do the things you mention. I don’t understand how it would undermine the fabric of the internet. Google follows and abides by robots.txt. The tradeoff of putting your data-rich links in robots.txt is that they aren’t included in Google results. Obviously.
Google does promote certain sites. Google Ads?
Terms of Service may be hard to enforce against users, but not against commercial services. Digital trespassing means a whole lot more when a paid-for commercial service is blatantly abusing it.
If you don’t see how all the big players are locking up and putting a price on their public data, then I can understand how you can think that none of this will happen.
Robots.txt plays a huge part in the digital space. It’s not simply a terms of service. Data scraping is a growing industry that represents potential lost profit.
Browsing was better in GPT-3.5 mode because it could browse through multiple sites quickly, but now, with the slowness of GPT-4, browsing mode is completely unusable because it times out after one click.
“Google does do the things you mention. I don’t understand how it would undermine the fabric of the internet. Google follows and abides by robots.txt.”
You misunderstood. I’m not talking about what Google does or doesn’t do or even what they could do so much as what they could compel other sites to do.
There’s a difference if robots.txt becomes a legal document. In that case, there is an incentive for a player like Google to force sites to weaponize robots.txt.
Imagine if three quarters of the sites on the internet were compelled to set up their robots.txt to disallow everything, whitelisting only Google.
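Mechanically that would be trivial to express; a robots.txt like this admits Googlebot and shuts out every other crawler:

```
# Hypothetical robots.txt whitelisting only Google's crawler
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```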
This may not be a particularly likely outcome, but a legally enforceable robots.txt file could lead to a world where people need to use 2, 3, or 15 different search engines to find relevant information.
“Terms of Service may be hard to enforce against users, but not against commercial services. Digital trespassing means a whole lot more when a paid-for commercial service is blatantly abusing it.”
I have yet to see any evidence of that.
“If you don’t see how all the big players are locking up and putting a price on their public data, then I can understand how you can think that none of this will happen.”
How none of what happens?
“Robots.txt plays a huge part in the digital space. It’s not simply a terms of service. Data scraping is a growing industry that represents potential lost profit.”
Sure, but… a legally enforceable robots.txt isn’t the answer, it’s unlikely to actually be enforced, and it’s not something I think anyone is actively considering.
Why would Google have a monopoly on, or even so much power over, robots.txt? It’s becoming clear that search engines are changing, and more people are relying on LLMs for information.
No evidence? Public commercial data-scraping services are and have been getting sued; a quick Google search would show this. It’s a clear indication that these companies want to set a precedent against data scraping.
It’s not even about search engine results. It’s about companies harvesting other companies’ data and then packaging it as a service or product.
You’re right, maybe it’s not exactly robots.txt that becomes the legal document. It is very obvious though that something will, and that data will become harder and harder to legally use.
You, again, seem to have misunderstood. You claimed, “Terms of Service may be hard to enforce against users, but not against commercial services.”
I’ve seen no evidence of that, nor have you provided any. You also fail to make a distinction between a blanket ToS attached to a website (similar to robots.txt) and a ToS an entity affirmatively agrees to through an account creation process, which is at issue in the case you cited (and which has not even started, let alone been resolved).
I’m not arguing that some large players wouldn’t want ToS and robots.txt to be legally enforceable because they feel they’re in a position to ultimately benefit from the situation that would create. I’m saying there’s no evidence that a non-explicitly agreed to terms of service is enforceable or that there is any significant current push for that to happen.
I’m also saying that most people are not in a hurry to enact sweeping changes to the core rules of the internet, which would undoubtedly lead to far-reaching unforeseen and unintended consequences.
Fair points. I’m not going to argue the nuances of ToS agreements.
What I am arguing is against your point: there will be regulations on what data can be harvested, especially if these services begin selling the data for commercial use. Maybe it won’t be robots.txt, but there will definitely be laws on how data is collected and used, and robots.txt would obviously be an ideal method to tell any crawlers what they can and can’t extract.
The purpose of the link was to demonstrate that big companies are already trying to set precedents. Those precedents are not legally binding (yet), but they certainly help in cases where commercial companies wilfully ignore them for profit.
I apologize if I am not understanding you completely. I see where you are coming from, but I think it’s silly to think that data will become more accessible. Robots.txt is very important and is already respected by major search tools for a reason.
Laws already exist which largely deal with this, though they are undoubtedly due for an overhaul.
Data has been held to be unprotectable almost universally; any change to that would be a tectonic shift with huge and largely unknown ramifications.
I don’t think, though, that I ever suggested data would become more accessible; I just don’t think it’s ever going to be much less accessible, simply because, as the rallying cry goes, “information wants to be free.”
I think a large shift is happening. OpenAI has sent a black hole flying through every industry and the internet, destroying and rebuilding. Now everything is falling into its trail, creating new galaxies and stars.
There are rallying cries for regulation. Lots will change (in my opinion). I think you have very good points and reinforced them well. I can appreciate how the big players advocate for open source.
Exactly what will happen is fun to speculate about.
Thanks for the talk.
I received the following answer when I asked: “What are some websites that allow ChatGPT’s new browsing feature to get information from?”
Google
Bing
Yahoo
Ask
DuckDuckGo
Wolfram Alpha
Quora
Stack Overflow
Wikipedia
Reddit
What about for case law? Or is there a plugin for that?
Yes, there is a plugin available for finding case law. Several websites provide this information, such as Google Scholar, Justia, and Casetext.
That’s not scraping, but I often find the answers to my Google queries inside those questions with a drop-down button. It works fine for me for anything I would just need ChatGPT for. Don’t worry… we’re getting closer and closer to autonomy, guys!
In fact, the WebPilot plugin will keep performing much better as long as OpenAI’s web crawler is unable to support JavaScript and continues to strictly adhere to the robots.txt protocol.
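To make the JavaScript point concrete, here’s a minimal sketch contrasting the two approaches; the URL is a placeholder, and it assumes the third-party requests and playwright packages:

```python
# A plain HTTP fetch versus a headless-browser fetch that runs the
# page's JavaScript first. Assumes `pip install requests playwright`
# (plus `playwright install chromium`); the URL is a placeholder.
import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com"

# A simple crawler only receives the initial HTML payload; anything a
# framework like React injects client-side is missing from this string.
static_html = requests.get(URL, timeout=10).text

# A headless browser executes the JavaScript, so the rendered DOM
# includes the client-side content a plain fetch never sees.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL)
    rendered_html = page.content()
    browser.close()

print(len(static_html), len(rendered_html))
```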