Any tools out there to pull/scrape complete website data and feed it into GPT?

I am trying to convert a website into chatbot version with FAQs. I want to extract text from all the site links and then auto-categorize and feed into GPT. Anyone out there who has done this?

I have used Python tools like Beautiful Soup and Selenium before.

1 Like

Yes, there are tools available to scrape website data and feed it into GPT. Some popular options include BeautifulSoup, Scrapy, and Selenium. These tools allow you to extract text from websites and organize it into a format that can be easily fed into GPT for training or use in a chatbot. You may also want to consider using a web crawler to automatically navigate through the website and gather data from multiple pages. However, be sure to check the website’s terms of service before scraping any data, as some sites may prohibit this practice.
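For example, a minimal extraction pass with requests and BeautifulSoup might look like this (the URL is a placeholder, and JavaScript-heavy pages would need Selenium or similar instead):

```python
# Minimal text-extraction sketch with requests + BeautifulSoup.
# The URL is a placeholder; JS-rendered pages need a real browser driver.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/faq", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()                       # drop non-content elements
text = soup.get_text(separator="\n", strip=True)
print(text[:500])                         # preview the extracted text
```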

(this is an answer from an AI trained to answer this very question…)

You can also just let loose a wget session on a site. The “feeding” of an entire site can go into an embeddings vector database if you just want to add semantic search instead of making the AI run functions itself.
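A rough sketch of that feeding step, assuming the OpenAI embeddings endpoint and a plain Python list standing in for a real vector database (both choices are illustrative):

```python
# Rough sketch: chunk scraped page text and embed each chunk.
# Model name and chunk size are illustrative; use a real vector DB
# (FAISS, pgvector, etc.) instead of the plain list shown here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunks(text: str, size: int = 1000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

pages = {"https://example.com/faq": "scraped page text goes here"}
store = []  # (url, chunk, embedding) triples
for url, text in pages.items():
    for piece in chunks(text):
        emb = client.embeddings.create(model="text-embedding-3-small",
                                       input=piece).data[0].embedding
        store.append((url, piece, emb))
```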

(answer from me)

2 Likes

Right. Thanks for sharing.

Are there no APIs we can use that do all the scraping? I think there is a limit to what we can feed into GPT, so that would just make it harder if we’re trying to feed it a website with 20+ pages/site links.

Yes, you cannot simply “give a website” to a language model. The amount of custom input an AI can accept is limited. You must use techniques that provide only the parts of the knowledge relevant to the current user input. Your search keywords are “embeddings vector database retrieval augmented generation”.
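Mechanically, the retrieval half of that looks something like the sketch below: embed the user’s question, rank previously embedded chunks by cosine similarity, and put the top matches into the prompt. The model name and the (text, vector) store layout are my assumptions:

```python
# Retrieval half of RAG (sketch): assumes page chunks were already
# embedded and stored as (text, vector) pairs.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    out = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.asarray(out.data[0].embedding)

def retrieve(question: str, store: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    q = embed(question)
    score = lambda v: float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(store, key=lambda tv: score(tv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# The retrieved chunks then go into the chat prompt as context, e.g. a
# system message like "Answer using:\n" + "\n".join(retrieved_chunks).
```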

You can also provide a function that can browse the data, much as ChatGPT with Bing Browse can get search results and go after a site’s contents directly.
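In API terms, that means exposing a tool the model can call. A minimal sketch, where the function name, schema and model are all illustrative:

```python
# Minimal sketch of exposing a "browse" function via tool calling.
# Function name, schema and model are illustrative, not a fixed API.
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Fetch a web page and return its visible text",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "What does example.com say about pricing?"}],
    tools=tools,
)
# If the model decides to browse, resp.choices[0].message.tool_calls holds a
# fetch_page call whose arguments you execute server-side, then you send the
# result back in a "tool" role message.
```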

1 Like

@fak500, you have a few options. As the previous users mentioned, you can try out these tools to get started.

However, for any modestly complex application, you’ll soon realize that these tools are not sufficient. If you scrape a considerable portion of a website, you’ll get blocked if the site is protected by an anti-DDoS/anti-scraping service like Cloudflare. This is especially the case if you are coming from a public cloud provider like AWS, OVH or GCP.

You can attempt to use a proxy provider/scraping API like Brightdata, which will mitigate the problem. However, the drawback to this solution is that you have to pay for unblocking, which can be costly and adds latency.

At ReframeAI (Founder here), we are building an execution framework that enables you to crawl websites at scale, driven by your dataframes. The Reframe execution engine is open source. It enables you to create executable workflows that link Large Language Models (LLMs), prompts and Python functions together in a directed acyclic graph. With Reframe, you can create complex workflows that operate on data tables, thereby taking advantage of the similarities and co-dependencies within the data.
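To make the DAG idea concrete, here is the general shape of such a workflow in plain Python. This is a simplified illustration, not Reframe’s actual API:

```python
# Simplified illustration of a workflow DAG linking Python functions and
# LLM-prompt steps. Node names and structure are illustrative only,
# not Reframe's actual API.
from graphlib import TopologicalSorter

def scrape(url: str) -> str:
    return f"<text scraped from {url}>"   # stand-in for a real scraper

def categorize(text: str) -> str:
    return "FAQ"                          # stand-in for an LLM prompt step

dag = {
    "scrape":     {"fn": scrape,     "deps": []},
    "categorize": {"fn": categorize, "deps": ["scrape"]},
}

def run(dag: dict, seed: str) -> dict:
    results = {}
    order = TopologicalSorter({k: set(v["deps"]) for k, v in dag.items()}).static_order()
    for name in order:                    # execute nodes in dependency order
        deps = dag[name]["deps"]
        args = [results[d] for d in deps] if deps else [seed]
        results[name] = dag[name]["fn"](*args)
    return results

print(run(dag, "https://example.com"))
```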

Benefits of using ReframeAI:

  • Overcome basic scraping blockers and issues.
  • Execution engine handles cases where a lot of the content is similar, deciphering interdependencies among data and only extracting key pieces of information.
  • Flexible, open-source execution engine which you can run on your own servers, or use our hosted, managed solution.

:globe_with_meridians: Site
:octopus: Github
:speech_balloon: Discord

1 Like

The fact that sites deploy this kind of protection should be enough to indicate that these people do not want their websites to be scraped.

@fak500

I implore you to check whether the web host offers a direct API, or even to contact them for the data first. Most of them are happy to provide it, even for a small fee.

There are, but it’s very hard for a general-purpose tool to respect all the nuances of every website. If the web host doesn’t offer any solution, you may want to consider just using a service like Fiverr. Seriously, you can probably get all of this information for less than $15 USD.

2 Likes

Echoing what @RonaldGRuckus said, you are putting yourself in an adversarial relationship with the sites you are scraping. Even proxy providers like Brightdata will block your access if they notice you scraping portions of sites that the creators intended to keep private and/or ignoring robots.txt.

Your options are:

  1. Hire someone from Fiverr/Upwork/Mechanical Turk
  2. Use API calls. At ReframeAI, we scrape sites only as a last resort, utilizing API calls or third-party databases before attempting to scrape websites.

What kind of sites are you looking to scrape anyway, and at what scale?

3 Likes

Just to add to my question - we will be asking users for their consent before scraping their data. If their provider blocks us then that’s a completely different story.

There is a ChatGPT plugin called BrowserOp that does it really well, and I wonder how they did it.

2 Likes

I am going to build one “scraper GPT” myself this weekend.

Mind you: I am scraping my own sites :slight_smile: so that’s allowed.


Workflow

So I will create something like this (a server-side sketch follows the list):

  1. Conversation starter button with GoFetch!
  2. Value after GoFetch! will be the URL / URI
  3. GoFetch! https://foo.bar/hello/world.htm will trigger the API
  4. The GPT will send the URL to my server
  5. Server will cURL the URL it received from GPT
  6. Server sends back scraped data to GPT
  7. GPT can do whatever it wants with the data *

  • In my case: translating, rewriting and creating headlines from the article itself
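
For anyone wanting to try the same thing, steps 4-6 could be as small as the sketch below. The /gofetch route, the JSON payload shape, and the use of requests + BeautifulSoup are just my placeholders, not a definitive implementation:

```python
# Minimal server-side sketch (steps 4-6): receive a URL from the GPT
# action, fetch it, and return the visible page text. Route name and
# payload shape are placeholders.
from flask import Flask, request, jsonify
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)

@app.post("/gofetch")
def gofetch():
    url = request.json.get("url")            # URL forwarded by the GPT action
    resp = requests.get(url, timeout=10)     # stand-in for the cURL step
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):    # drop non-content elements
        tag.decompose()
    return jsonify({"url": url,
                    "text": soup.get_text(separator="\n", strip=True)})
```

The GPT action schema then just needs a single string parameter for the URL.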
1 Like

You can use some free tools that scrape a website so you can paste the output into a GPT.

Like which ones? Have you tried any?

You might want to have a look at github.com/BuilderIO/gpt-crawler - would be curious how that one works for you!

Just a general FYI for the thread, there’s a scraping API service called ZenRows that gets past protections like Cloudflare without any additional setup. It’s a paid service, but very handy if you don’t want to get into building a complete crawler/scraper yourself.

1 Like

I was looking for an answer to this too but didn’t find anything fitting, so I built my own web scraper with GPT + Vision + Playwright. It uses the Assistants API to iteratively identify HTML elements and write code to interact with them in a headless browser environment.
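Stripped to its core, the Playwright half of that loop looks roughly like this; the URL and the model-proposed action are placeholders, and the Assistants/Vision calls are elided:

```python
# Core of the browsing loop: open a headless browser, capture a screenshot
# and the HTML for the model to reason over, then execute whatever
# interaction the model proposes. URL and the click action are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    screenshot = page.screenshot()   # image bytes for the vision model
    html = page.content()            # raw HTML for element identification
    # ...send screenshot/html to the Assistants API, get back an action,
    # e.g. page.click("text=Next"), then repeat until done...
    browser.close()
```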

If anyone is interested in building something similar I made a full write-up here:

4 Likes

Can you share your code? I was wondering if this could be done for extracting images.

1 Like

I have a website (and database) with 2,737,365 posts in 201,814 topics by 95,090 members on the subject of having babies and looking after them in the early years. I was thinking this dataset might be a good candidate for training an AI model. Does anyone have advice on how best to achieve this?

1 Like

I’ve used something called OutwitHub before. Probably outdated by now, but just throwing it out there.

Awesome work. I have tried several approaches as well. Connecting an endpoint to the Chrome API helps.

You can also include the Web Scraper GPT action from https://gpt-auth.com/ in your GPT to give it the ability to scrape content dynamically.