I made several attempts on different days and with different connections.
From 3G to fiber, but the result is always the same…
A few times I even gave it a site, and in its browsing context it tried to 'click' on another site that had nothing to do with the one I had passed to it.
Whereas if I use the WebPilot plugin, I get results.
It would be nice if the plugin were able to keep an index of disallowed sites and exclude them from the search so the model would eventually only have good links.
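The suggestion above could be sketched in a few lines. This is a hypothetical illustration, not how any actual plugin works: cache the domains that have refused automated access, then filter them out of future search results so only reachable links remain. All names here (`mark_disallowed`, `filter_results`, the example URLs) are made up.

```python
from urllib.parse import urlparse

# Domains that have previously refused automated access.
disallowed_domains: set[str] = set()

def mark_disallowed(url: str) -> None:
    """Record the domain of a URL that blocked automated access."""
    disallowed_domains.add(urlparse(url).netloc)

def filter_results(urls: list[str]) -> list[str]:
    """Keep only links whose domains have not previously refused us."""
    return [u for u in urls if urlparse(u).netloc not in disallowed_domains]

mark_disallowed("https://blocked.example.com/article")
print(filter_results([
    "https://blocked.example.com/other",
    "https://open.example.org/page",
]))  # ['https://open.example.org/page']
```

In practice the cache would need an expiry, since a site's robots.txt can change, but the filtering idea is the same.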
I think we’ll see both web-browsing plugins emerge that ignore robots.txt, and sites like archive.is sprout up specifically for LLMs to interact with, which will greatly improve the experience for users.
I tried to make a summary of 50 sites with different contents.
But the result is always the same…
“I am sorry, but I cannot access the page at the moment. It is possible that the site has problems or that it does not allow automatic systems like me to access its contents.”
I understand OpenAI is respecting each website’s robots.txt file (which is right), but in many cases this makes browsing unusable.
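For anyone curious what “respecting robots.txt” looks like mechanically, here is a minimal sketch using Python’s standard-library parser. The rules and the user-agent string (`ChatGPT-User`) are illustrative assumptions, not the actual policy any vendor applies:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everything is allowed except /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler checks each URL before fetching it.
print(parser.can_fetch("ChatGPT-User", "https://example.com/private/page"))  # False
print(parser.can_fetch("ChatGPT-User", "https://example.com/public/page"))   # True
```

When a site’s robots.txt disallows everything (`Disallow: /`), a compliant tool has to refuse, which is exactly the error message quoted above.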
I honestly think the opposite. I think robots.txt will soon be enforceable by law, and websites like the Wayback Machine will be beaten into submission. Who would want other, usually paid-for services to leech off their content? Instead I think they’ll open paid API endpoints so that others can gather the data and have a legal right to it. Unless you’re the product, you will need to pay.
I mean, look at Facebook’s code. ReactJS already does a pretty good job of preventing simple scraping, but they also use many honeypots, obfuscations, hidden tricks, and even an AI to catch bots.
Google does do the things you mention. I don’t understand how it would undermine the fabric of the internet. Google follows and abides by robots.txt. The tradeoff is that data-rich links disallowed in robots.txt are not included in Google results. Obviously.
Google does promote certain sites. Google Ads?
Terms of Service may be hard to enforce against users, but not against commercial services. Digital trespassing means a whole lot more when a paid commercial service is blatantly abusing it.
If you don’t see how all the big players are locking up their public data and putting a price on it, then I can understand how you can think that none of this will happen.
Robots.txt plays a huge part in the digital space. It’s not simply a terms of service. Data scraping is a growing industry that represents potential lost profit.
Browsing was better in GPT-3.5 mode because it could browse through multiple sites quickly, but now with the slowness of GPT-4, browsing mode is completely unusable because it times out after one click.
Why would Google have a monopoly or even have so much power over robots.txt? It’s becoming clear that search engines are changing, and more people are relying on LLMs for information.
No evidence? Public commercial data scraping services are and have been getting sued. A quick Google search would show this. It shows a clear indication that these companies want to set a precedent against data scraping.
You, again, seem to have misunderstood. You claimed,
I’ve seen no evidence of that, nor have you provided any. You also fail to make a distinction between a blanket ToS attached to a website (similar to robots.txt) and a ToS an entity affirmatively agrees to through an account creation process, which is at issue in the case you cited (and which has not even started, let alone been resolved).
I’m not arguing that some large players wouldn’t want ToS and robots.txt to be legally enforceable because they feel they’re in a position to ultimately benefit from the situation that would create. I’m saying there’s no evidence that a terms of service not explicitly agreed to is enforceable, or that there is any significant current push for that to happen.
I’m also saying that most people are not in a hurry to enact sweeping changes to the core rules of the internet, which will undoubtedly lead to far-reaching unforeseen and unintended consequences.
Fair points. I’m not going to argue the nuances of ToS agreements.
What I am arguing against is your point. There will be regulations on what data can be harvested, especially if these services begin selling the data for commercial use. Maybe it won’t be robots.txt, but there will definitely be laws on how data is collected and used, and robots.txt would obviously be an ideal mechanism to tell crawlers what they can and can’t extract.
The purpose of the link was to demonstrate that big companies are already trying to set precedents. They are not legal nets (yet), but they certainly help in cases when commercial companies wilfully ignore them for profit.
I apologize if I am not understanding you completely. I see where you are coming from, but I think it’s silly to think that data will become more accessible. Robots.txt is very important and is already respected by major search tools for a reason.
Laws already exist which largely deal with this, though they are undoubtedly due for an overhaul.
Data has largely been held to be unprotectable almost universally, any change to that would be a tectonic shift with huge and largely unknown ramifications.
I don’t think, though, that I ever suggested data would become more accessible. I just don’t think it’s ever going to be much less accessible, simply because, as the rallying cry goes, “information wants to be free.”
I think a large shift is happening. OpenAI has sent a black hole flying through every industry and the internet, destroying and rebuilding. Now everything is falling into its wake, creating new galaxies and stars.
There are rallying cries for regulation. Lots will change (in my opinion). I think you have very good points and reinforced them well. I can appreciate how the big players advocate for open source.
It’s fun to speculate about what exactly will happen.
Thanks for the talk.
I received the following answer when I asked: What are some websites that allow ChatGPT’s new browsing feature to get information from?
What about for case law? Or is there a plugin for that?
Yes, there is a plugin available for finding case law. Several websites provide this information, such as Google Scholar, Justia, and Casetext.
That’s not scraping, but I often find the answers to my Google inquiries inside those drop-down questions. It works fine for me for anything I would just need ChatGPT for. Don’t worry… we’re getting closer and closer to autonomous, guys!