GPT for scraping (extracting from) unstructured web pages

Does anyone have experience using the OpenAI API to extract data from web pages?

Imagine wanting to grab news article or blog post data (title, URL, summary, date, picture URL) from an index page that doesn’t itself offer an RSS or other XML feed.

I’m familiar with web scraping and know that you can use Puppeteer, Python, etc. to extract data, but basically only when you know the paths (selectors) of the elements to get. If those change, extraction fails. If you want to extract from a large set of pages, that would be problematic.

  1. I could just throw the whole index page at GPT, but that seems expensive in terms of token usage.
  2. Another idea would be to first convert the index page to plain text and send only that.
  3. Still another idea would be to have GPT look at the index page only periodically, say weekly, and generate the required element paths for a dedicated extraction library such as Puppeteer. This would cut down on token usage.
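
To make ideas 2 and 3 a bit more concrete, here is a rough sketch in Python using only the standard library. Everything in it is an assumption made up for illustration: `html_to_text`, `SelectorCache`, the `[link: ...]` / `[image: ...]` markers, and the `generate` callback, which stands in for whatever GPT call would return element paths. It makes no real API request and is not meant as a definitive implementation.

```python
import time
from html.parser import HTMLParser
from typing import Callable, Dict


class _TextExtractor(HTMLParser):
    """Collects visible text and keeps link/image URLs as inline markers."""

    # Tags whose contents are unlikely to help with article extraction.
    SKIP = {"script", "style", "noscript", "svg", "head"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.parts.append(f"[link: {attributes['href']}]")
        elif tag == "img" and attributes.get("src"):
            self.parts.append(f"[image: {attributes['src']}]")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Collapse runs of whitespace so they do not cost extra tokens.
            self.parts.append(" ".join(data.split()))


def html_to_text(html: str) -> str:
    """Idea 2: reduce a page to compact text before sending it to the model."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


class SelectorCache:
    """Idea 3: ask the model for element paths only when the cached ones are stale.

    `generate` is a hypothetical callable (for example a wrapper around a GPT
    request) that takes the page HTML and returns a dict of field -> selector.
    """

    def __init__(
        self,
        generate: Callable[[str], Dict[str, str]],
        max_age_seconds: float = 7 * 24 * 3600,  # "say, weekly"
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._generate = generate
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries: Dict[str, dict] = {}

    def get(self, site: str, html: str) -> Dict[str, str]:
        now = self._clock()
        entry = self._entries.get(site)
        if entry is None or now - entry["fetched_at"] >= self._max_age:
            # Only this branch would spend tokens.
            entry = {"selectors": self._generate(html), "fetched_at": now}
            self._entries[site] = entry
        return entry["selectors"]
```

The selectors returned by `SelectorCache.get` would then be handed to Puppeteer or a similar extraction library, so the model is only consulted when the cached paths are older than the chosen interval. A natural extension would be to also refresh early when an extraction run comes back empty, since that probably means the page layout changed.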