How is ChatGPT able to extract webpages so quickly?

I am trying to make my own chatbot app but am struggling to integrate webpage scraping. Here is how I am doing it now:

  1. Use an LLM to generate web search queries when needed
  2. For each web search query, run SerpAPI
  3. Use ScrapingBee on the webpages of the top 3 organic results from SerpAPI
  4. Store the text from the scraping results in a vector database
  5. Retrieve relevant webpage text chunks from the vector database using vector search
  6. Return the chunks to the LLM to generate the output for the user

However, each run of this process takes 40-60 seconds, which is way too long compared to ChatGPT and Perplexity. Does anyone know what I am doing wrong?

It sounds like you’ve put a lot of effort into creating your chatbot! Honestly, I’d recommend asking this question to ChatGPT—maybe your bot can take a break and learn a thing or two from a more experienced sibling. :smile:

Probably using Bing cache.

There are multiple things you can do.

Instead of using ScrapingBee, build a few fetch functions of your own and expose them through function calling. The AI is smart enough to use each function when necessary.

You can first try a simple fetch, where you download the HTML and convert it to readable text with Mozilla's Readability library. Most of the time, this works pretty well.

If a simple fetch doesn’t work (which happens because of website restrictions), fall back to fetching the page and parsing it with Cheerio. This is a pretty fast approach, but it doesn’t execute JavaScript, so it might not be ideal for some websites.

If Cheerio also fails for some reason, let the AI use Puppeteer, which is what ScrapingBee is also using. It drives a full Chrome instance, so it can scrape pages that only render with JavaScript.
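To make the "cheapest first, heavier only on failure" idea concrete, here is a minimal TypeScript sketch. It hard-codes the order of the strategies instead of letting the model pick one through function calling, and it takes the actual fetchers as injected functions, so nothing here is the real Readability, Cheerio, or Puppeteer API. The names `FetchStrategy`, `fetchWithFallback`, and the `minLength` threshold are all made up for illustration.

```typescript
// Hypothetical shape: each strategy wraps one way of getting readable text
// from a URL (simple fetch, Cheerio parse, Puppeteer render, ...).
type FetchStrategy = {
  name: string;
  run: (url: string) => Promise<string>;
};

type FetchResult = { strategy: string; text: string };

// Try each strategy in the order given (cheapest first) and return the
// first result that looks usable. A strategy "fails" if it throws or
// returns less than minLength characters (an assumed threshold).
async function fetchWithFallback(
  url: string,
  strategies: FetchStrategy[],
  minLength: number = 200,
): Promise<FetchResult> {
  const errors: string[] = [];
  for (const strategy of strategies) {
    try {
      const text = (await strategy.run(url)).trim();
      if (text.length >= minLength) {
        return { strategy: strategy.name, text };
      }
      errors.push(`${strategy.name}: only ${text.length} characters`);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${strategy.name}: ${message}`);
    }
  }
  throw new Error(`All strategies failed for ${url}: ${errors.join("; ")}`);
}
```

You would pass the strategies as something like `[simpleFetch, cheerioFetch, puppeteerFetch]`, so the full browser only starts when the cheaper options come back empty or blocked.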

Don’t vectorize the content, though. Just clean the HTML with something like Mozilla's Readability and send up to 40k characters directly to the AI, which is more than enough most of the time. With GPT-4o it’s cheap enough, and this is the fastest approach without anything fancy.
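A rough sketch of the "clean and truncate" step, in TypeScript. The regex-based `htmlToText` below is only a crude stand-in for a proper Readability-style cleaner (it strips tags but does not find the main article), and both function names are made up for this example. The 40,000 character cap is the number from above.

```typescript
const MAX_CHARS = 40000; // cap suggested above; tune for your model and budget

// Crude stand-in for a real content extractor: drops script/style blocks,
// strips the remaining tags, and collapses whitespace.
function htmlToText(html: string): string {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/&nbsp;/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Clean the page and keep at most maxChars characters, ready to paste
// into the prompt with no vector database in between.
function buildContext(html: string, maxChars: number = MAX_CHARS): string {
  return htmlToText(html).slice(0, maxChars);
}
```

The returned string goes straight into the prompt alongside the user's question, which removes the embedding, storage, and retrieval round trips from each request.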
