Any tools out there to pull/scrape complete website data and feed it into GPT?

I am trying to convert a website into chatbot version with FAQs. I want to extract text from all the site links and then auto-categorize and feed into GPT. Anyone out there who has done this?

I have used Python tools like Beautiful Soup and Selenium before.

1 Like

Yes, there are tools available to scrape website data and feed it into GPT. Some popular options include BeautifulSoup, Scrapy, and Selenium. These tools allow you to extract text from websites and organize it into a format that can be easily fed into GPT for training or use in a chatbot. You may also want to consider using a web crawler to automatically navigate through the website and gather data from multiple pages. However, be sure to check the website’s terms of service before scraping any data, as some sites may prohibit this practice.
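For example, a minimal extraction pass with requests and BeautifulSoup might look like this (the URL is a placeholder, and JavaScript-heavy pages would need Selenium or similar instead):

```python
# Minimal text-extraction sketch with requests + BeautifulSoup.
# The URL is a placeholder; JS-rendered pages need a real browser driver.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/faq", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()                       # drop non-content elements
text = soup.get_text(separator="\n", strip=True)
print(text[:500])                         # preview the extracted text
```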

(this is an answer from an AI trained to answer this very question…)

You can also just let loose a wget session on a site. The “feeding” of an entire site can go into an embeddings vector database if you just want to add semantic search instead of making the AI run functions itself.
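A rough sketch of that feeding step, assuming the OpenAI embeddings endpoint and a plain Python list standing in for a real vector database (both choices are illustrative):

```python
# Rough sketch: chunk scraped page text and embed each chunk.
# Model name and chunk size are illustrative; use a real vector DB
# (FAISS, pgvector, etc.) instead of the plain list shown here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunks(text: str, size: int = 1000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

pages = {"https://example.com/faq": "scraped page text goes here"}
store = []  # (url, chunk, embedding) triples
for url, text in pages.items():
    for piece in chunks(text):
        emb = client.embeddings.create(model="text-embedding-3-small",
                                       input=piece).data[0].embedding
        store.append((url, piece, emb))
```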

(answer from me)

2 Likes

Right. Thanks for sharing.

Are there no APIs we can use that do all the scraping? I think there is a limit to what we can feed into GPT, so that would just make it harder if we’re trying to feed it a website with 20+ pages/site links.

Yes, you cannot simply “give a website” to a language model. The amount of custom input an AI can accept is limited. You must use techniques that provide only the parts of the knowledge relevant to the current user input. Your search keywords are “embeddings vector database retrieval augmented generation”.
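Mechanically, the retrieval half of that looks something like the sketch below: embed the user’s question, rank previously embedded chunks by cosine similarity, and put the top matches into the prompt. The model name and the (text, vector) store layout are my assumptions:

```python
# Retrieval half of RAG (sketch): assumes page chunks were already
# embedded and stored as (text, vector) pairs.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    out = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.asarray(out.data[0].embedding)

def retrieve(question: str, store: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    q = embed(question)
    score = lambda v: float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(store, key=lambda tv: score(tv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# The retrieved chunks then go into the chat prompt as context, e.g. a
# system message like "Answer using:\n" + "\n".join(retrieved_chunks).
```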

You can also provide a function that can browse the data, much as ChatGPT with Bing Browse can get search results and go after a site’s contents directly.
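In API terms, that means exposing a tool the model can call. A minimal sketch, where the function name, schema and model are all illustrative:

```python
# Minimal sketch of exposing a "browse" function via tool calling.
# Function name, schema and model are illustrative, not a fixed API.
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Fetch a web page and return its visible text",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "What does example.com say about pricing?"}],
    tools=tools,
)
# If the model decides to browse, resp.choices[0].message.tool_calls holds a
# fetch_page call whose arguments you execute server-side, then you send the
# result back in a "tool" role message.
```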

1 Like

@fak500, you have a few options. As the previous users mentioned, you can try out these tools to get started.

However, for any modestly complex application, you’ll soon realize that these tools are not sufficient. If you scrape a considerable portion of a website, you’ll get blocked if the site is protected by an anti-DDoS/anti-scraping service like Cloudflare. This is especially the case if you are coming from a public cloud provider like AWS, OVH or GCP.

You can attempt to use a proxy provider/scraping API like Brightdata, which will mitigate the problem. However, the drawback to this solution is that you have to pay for unblocking, which can be costly and adds latency.

At ReframeAI (Founder here), we are building an execution framework that enables you to crawl websites at scale, driven by your dataframes. The Reframe execution engine is open source. It enables you to create executable workflows that link Large Language Models (LLMs), prompts and Python functions together in a directed acyclic graph. With Reframe, you can create complex workflows that operate on data tables, thereby taking advantage of the similarities and co-dependencies within the data.
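To make the DAG idea concrete, here is the general shape of such a workflow in plain Python. This is a simplified illustration, not Reframe’s actual API:

```python
# Simplified illustration of a workflow DAG linking Python functions and
# LLM-prompt steps. Node names and structure are illustrative only,
# not Reframe's actual API.
from graphlib import TopologicalSorter

def scrape(url: str) -> str:
    return f"<text scraped from {url}>"   # stand-in for a real scraper

def categorize(text: str) -> str:
    return "FAQ"                          # stand-in for an LLM prompt step

dag = {
    "scrape":     {"fn": scrape,     "deps": []},
    "categorize": {"fn": categorize, "deps": ["scrape"]},
}

def run(dag: dict, seed: str) -> dict:
    results = {}
    order = TopologicalSorter({k: set(v["deps"]) for k, v in dag.items()}).static_order()
    for name in order:                    # execute nodes in dependency order
        deps = dag[name]["deps"]
        args = [results[d] for d in deps] if deps else [seed]
        results[name] = dag[name]["fn"](*args)
    return results

print(run(dag, "https://example.com"))
```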

Benefits of using ReframeAI:

  • Overcome basic scraping blockers and issues.
  • Execution engine handles cases where a lot of the content is similar, deciphering interdependencies among data and only extracting key pieces of information.
  • Flexible, open-source execution engine which you can run on your own servers, or use our hosted, managed solution.

:globe_with_meridians: Site
:octopus: Github
:speech_balloon: Discord

1 Like

The fact that sites deploy this kind of protection should be enough to indicate that these people do not want their websites to be scraped.

@fak500

I implore you to check whether the web host offers a direct API, or even to contact them for the data first. Most of them are happy to provide it, even for a small fee.

There are, but it’s very hard for a general-purpose tool to respect all the nuances of every website. If the web host doesn’t offer any solution, you may want to consider just using a service like Fiverr. Seriously, you can probably get all of this information for less than $15 USD.

2 Likes

Echoing what @RonaldGRuckus said, you are putting yourself in an adversarial relationship with the sites you are scraping. Even proxy providers like Brightdata will block your access if they notice you scraping portions of sites that the creators intended to keep private and/or ignoring robots.txt.

Your options are:

  1. Hire someone from Fiverr/Upwork/Mechanical Turk
  2. Use API calls. At ReframeAI, we scrape sites only as a last resort, utilizing API calls or third-party databases before attempting to scrape websites.

What kind of sites are you looking to scrape anyway, and at what scale?

3 Likes

Just to add to my question - we will be asking users for their consent before scraping their data. If their provider blocks us then that’s a completely different story.

There is a ChatGPT plugin called BrowserOp that does it really well, and I wonder how they did it.

2 Likes

I am going to build one “scraper GPT” myself this weekend.

Mind you: I am scraping my own sites :slight_smile: so that’s allowed.


Workflow

So I will create something like this (a server-side sketch follows the list):

  1. Conversation starter button with GoFetch!
  2. Value after GoFetch! will be the URL / URI
  3. GoFetch! https://foo.bar/hello/world.htm will trigger the API
  4. The GPT will send the URL to my server
  5. Server will cURL the URL it received from GPT
  6. Server sends back scraped data to GPT
  7. GPT can do whatever it wants with the data *

  • In my case: translating, rewriting and creating headlines from the article itself
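
For anyone wanting to try the same thing, steps 4-6 could be as small as the sketch below. The /gofetch route, the JSON payload shape, and the use of requests + BeautifulSoup are just my placeholders, not a definitive implementation:

```python
# Minimal server-side sketch (steps 4-6): receive a URL from the GPT
# action, fetch it, and return the visible page text. Route name and
# payload shape are placeholders.
from flask import Flask, request, jsonify
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)

@app.post("/gofetch")
def gofetch():
    url = request.json.get("url")            # URL forwarded by the GPT action
    resp = requests.get(url, timeout=10)     # stand-in for the cURL step
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):    # drop non-content elements
        tag.decompose()
    return jsonify({"url": url,
                    "text": soup.get_text(separator="\n", strip=True)})
```

The GPT action schema then just needs a single string parameter for the URL.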
1 Like

You can use some free tools that scrape a website so you can paste the output into a GPT.

Like which ones? Have you tried any?

You might want to have a look at github.com/BuilderIO/gpt-crawler - would be curious how that one works for you!

Just a general FYI for the thread, there’s a scraping API service called ZenRows that gets past protections like Cloudflare without any additional setup. It’s a paid service, but very handy if you don’t want to get into building a complete crawler/scraper yourself.

1 Like

I was looking for an answer to this too but didn’t find anything fitting, so I built my own web scraper with GPT + Vision + Playwright. It uses the Assistants API to iteratively identify HTML elements and write code to interact with them in a headless browser environment.
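Stripped to its core, the Playwright half of that loop looks roughly like this; the URL and the model-proposed action are placeholders, and the Assistants/Vision calls are elided:

```python
# Core of the browsing loop: open a headless browser, capture a screenshot
# and the HTML for the model to reason over, then execute whatever
# interaction the model proposes. URL and the click action are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    screenshot = page.screenshot()   # image bytes for the vision model
    html = page.content()            # raw HTML for element identification
    # ...send screenshot/html to the Assistants API, get back an action,
    # e.g. page.click("text=Next"), then repeat until done...
    browser.close()
```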

If anyone is interested in building something similar I made a full write-up here:

4 Likes

Can you share your code? I was wondering if this could be done for extracting images.

1 Like

I have a website (and database) with 2,737,365 posts in 201,814 topics by 95,090 members on the subject of having babies and looking after them in the early years. I was thinking this dataset might be a good candidate for training an AI model. Does anyone have advice on how best to achieve this?

1 Like

I’ve used something called OutwitHub before. Probably outdated by now, but just throwing it out there.

Awesome work. I have tried several approaches as well. Connecting an endpoint to the Chrome API helps.

You can also include the Web Scraper GPT action from https://gpt-auth.com/ in your GPT to give it the ability to scrape content dynamically.