I made several attempts on different days and with different connections.
From 3G to fiber, but the result is always the same…
A few times I even gave it a site, and inside the context menu it tried to ‘click’ on another site that had nothing to do with the one I had passed it.
Whereas if I use the WebPilot plugin, I get results.
It would be nice if the plugin were able to keep an index of disallowed sites and exclude them from the search so the model would eventually only have good links.
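Something along these lines could work. This is just a rough sketch in Python, assuming a hypothetical plugin that can filter search results before handing them to the model; the function names and the cache are my own invention, not anything OpenAI has shipped:

```python
# Rough sketch of a disallow index for a hypothetical browsing plugin.
# The function names and the cache are illustrative assumptions.
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "ChatGPT-User"          # user agent OpenAI documents for browsing
_disallowed_hosts: set[str] = set()  # the persistent "bad link" index

def is_fetchable(url: str) -> bool:
    """Check robots.txt once per host; remember hosts that block us."""
    host = urlparse(url).netloc
    if host in _disallowed_hosts:
        return False                 # known to block the agent; skip it
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{host}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True                  # no readable robots.txt; assume open
    if not rp.can_fetch(USER_AGENT, url):
        _disallowed_hosts.add(host)  # index it so search excludes it next time
        return False
    return True

def filter_results(urls: list[str]) -> list[str]:
    """Drop search results the model would fail to open anyway."""
    return [u for u in urls if is_fetchable(u)]
```

With a cache like this, the model would only ever be handed links it can actually open, instead of burning a turn on a refusal.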
I think we’ll see both web browsing plugins emerge which ignore robots.txt, and sites like archive.is sprout up specifically for LLMs to interact with, which will greatly improve the experience for users.
I tried to summarize 50 sites with different content.
But the result is always the same…
“I am sorry, but I cannot access the page at the moment. It is possible that the site has problems or that it does not allow automatic systems like me to access its contents.”
I understand OpenAI is respecting each website’s robots.txt file (which is right), but in many cases it makes browsing unusable.
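For anyone unfamiliar with the mechanism: a site opts out with a couple of lines in its robots.txt. For example, a file like this blocks only the browsing agent (ChatGPT-User is the user agent OpenAI documents for the browsing feature) while leaving other crawlers alone:

```
# Blocks only OpenAI's browsing agent; everything else may crawl
User-agent: ChatGPT-User
Disallow: /
```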
I honestly think the opposite. I think robots.txt will soon be enforceable by law, and websites like the Wayback Machine will be beaten into submission. Who would want other, usually paid-for, services to leech off their content? Instead, I think sites will open paid API endpoints so that whoever gathers the data has a legal right to it. Unless you’re the product, you will need to pay.
I mean, look at Facebook’s code. ReactJS already does a pretty good job of preventing simple scraping, but they also use many honeypots, obfuscations, hidden tricks, and even an AI to catch bots.
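For what it’s worth, a typical honeypot is as simple as a link that humans never see but naive scrapers happily follow. This is a generic illustration, not Facebook’s actual markup:

```html
<!-- Generic honeypot illustration: the link is invisible to humans, so
     any client requesting /bot-trap is almost certainly an automated
     scraper and can be flagged or banned. -->
<a href="/bot-trap" style="display:none" aria-hidden="true">catalog</a>
```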
robots.txt will never be enforceable by law. It would completely undermine the fabric of the internet as we know it.
Additionally, it would make it impossible for any new company to enter the search space.
Beyond that, imagine if it were.
Google (for instance) could then, in theory, promote sites with which they had exclusivity arrangements and even de-list those with which they didn’t.
Creating a bifurcated internet.
No, I think the unintended consequences of a legally enforceable robots.txt file would be far too grievous to ever even entertain that as a possibility.
Besides, website terms of service have already been found to be unenforceable, and what is robots.txt but a very terse terms of service?
Google does do the things you mention. I don’t understand how it would undermine the fabric of the internet. Google follows and abides by robots.txt. The tradeoff of putting your data-rich links in robots.txt is that they aren’t included in Google results. Obviously.
Google does promote certain sites. Google Ads?
Terms of Service may be hard to enforce against users, but not against commercial services. Digital trespassing means a whole lot more when a paid-for commercial service is blatantly abusing it.
If you don’t see how all the big players are locking up and putting a price on their public data, then I can understand how you can think that none of this will happen.
Robots.txt plays a huge part in the digital space. It’s not simply a terms of service. Data scraping is a growing industry that represents potential lost profit.
Browsing was better in GPT-3.5 mode because it could browse through multiple sites quickly, but now, with the slowness of GPT-4, browsing mode is completely unusable because it times out after one click.
“Google does do the things you mention. I don’t understand how it would undermine the fabric of the internet. Google follows and abides by robots.txt.”
You misunderstood. I’m not talking about what Google does or doesn’t do or even what they could do so much as what they could compel other sites to do.
There’s a difference if robots.txt becomes a legal document. In that case, there is an incentive for a player like Google to force sites to weaponize robots.txt.
Imagine if three quarters of the sites on the internet were compelled to set up their robots.txt to disallow everything, whitelisting only Google.
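Mechanically that would be trivial to express; a robots.txt like this admits Googlebot and shuts out every other crawler:

```
# Hypothetical robots.txt whitelisting only Google's crawler
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```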
This may not be a particularly likely outcome, but a legally enforceable robots.txt file could lead to a world where people need to use 2, 3, or 15 different search engines to find relevant information.
“Terms of Service may be hard to enforce against users, but not against commercial services. Digital trespassing means a whole lot more when a paid-for commercial service is blatantly abusing it.”
I have yet to see any evidence of that.
“If you don’t see how all the big players are locking up and putting a price on their public data, then I can understand how you can think that none of this will happen.”
How none of what happens?
“Robots.txt plays a huge part in the digital space. It’s not simply a terms of service. Data scraping is a growing industry that represents potential lost profit.”
Sure, but… a legally enforceable robots.txt isn’t the answer, it’s unlikely to actually be enforced, and it’s not something I think anyone is actively considering.
Why would Google have a monopoly on, or even so much power over, robots.txt? It’s becoming clear that search engines are changing, and more people are relying on LLMs for information.
No evidence? Public commercial data-scraping services are and have been getting sued; a quick Google search would show this. It’s a clear indication that these companies want to set a precedent against data scraping.
It’s not even about search engine results. It’s about companies harvesting other companies’ data and then packaging it as a service or product.
You’re right, maybe it’s not exactly robots.txt that becomes the legal document. It is very obvious though that something will, and that data will become harder and harder to legally use.
You, again, seem to have misunderstood. You claimed, “Terms of Service may be hard to enforce against users, but not against commercial services.”
I’ve seen no evidence of that, nor have you provided any. You also fail to make a distinction between a blanket ToS attached to a website (similar to robots.txt) and a ToS an entity affirmatively agrees to through an account creation process, which is at issue in the case you cited (and which has not even started, let alone been resolved).
I’m not arguing that some large players wouldn’t want ToS and robots.txt to be legally enforceable because they feel they’re in a position to ultimately benefit from the situation that would create. I’m saying there’s no evidence that a non-explicitly agreed to terms of service is enforceable or that there is any significant current push for that to happen.
I’m also saying that most people are not in a hurry to enact sweeping changes to the core rules of the internet, which would undoubtedly lead to far-reaching unforeseen and unintended consequences.
Fair points. I’m not going to argue the nuances of ToS agreements.
What I am arguing is against your point: there will be regulations on what data can be harvested, especially if these services begin selling the data for commercial use. Maybe it won’t be robots.txt, but there will definitely be laws on how data is collected and used, and robots.txt would obviously be an ideal method to tell any crawlers what they can and can’t extract.
The purpose of the link was to demonstrate that big companies are already trying to set precedents. Those precedents are not legally binding (yet), but they certainly help in cases where commercial companies wilfully ignore them for profit.
I apologize if I am not understanding you completely. I see where you are coming from, but I think it’s silly to think that data will become more accessible. Robots.txt is very important and is already respected by major search tools for a reason.
Laws already exist which largely deal with this, though they are undoubtedly due for an overhaul.
Data has been held to be unprotectable almost universally; any change to that would be a tectonic shift with huge and largely unknown ramifications.
I don’t think, though, that I ever suggested data would become more accessible; I just don’t think it’s ever going to be much less accessible, simply because, as the rallying cry goes, “information wants to be free.”
I think a large shift is happening. OpenAI has sent a black hole flying through every industry and the internet, destroying and rebuilding. Now everything is falling into its trail, creating new galaxies and stars.
There are rallying cries for regulation. Lots will change (in my opinion). I think you have very good points and reinforced them well. I can appreciate how the big players advocate for open source.
Exactly what will happen is fun to speculate about.
Thanks for the talk.
I received the following answer when I asked: “What are some websites that allow ChatGPT’s new browsing feature to get information from?”
Google
Bing
Yahoo
Ask
DuckDuckGo
Wolfram Alpha
Quora
Stack Overflow
Wikipedia
Reddit
What about for case law? Or is there a plugin for that?
Yes, there is a plugin available for finding case law. Several websites provide this information, such as Google Scholar, Justia, and Casetext.
That’s not scraping, but I often find the answers to my Google queries inside those questions with a drop-down button. It works fine for me for anything I would just need ChatGPT for. Don’t worry… we’re getting closer and closer to autonomy, guys!
In fact, the WebPilot plugin will keep performing much better as long as OpenAI’s web crawler is unable to support JavaScript and continues to strictly adhere to the robots.txt protocol.
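To make the JavaScript point concrete, here’s a minimal sketch contrasting the two approaches; the URL is a placeholder, and it assumes the third-party requests and playwright packages:

```python
# A plain HTTP fetch versus a headless-browser fetch that runs the
# page's JavaScript first. Assumes `pip install requests playwright`
# (plus `playwright install chromium`); the URL is a placeholder.
import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com"

# A simple crawler only receives the initial HTML payload; anything a
# framework like React injects client-side is missing from this string.
static_html = requests.get(URL, timeout=10).text

# A headless browser executes the JavaScript, so the rendered DOM
# includes the client-side content a plain fetch never sees.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL)
    rendered_html = page.content()
    browser.close()

print(len(static_html), len(rendered_html))
```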