GPT for scraping (extracting from) unstructured web pages

parakeet · December 18, 2023, 8:37pm

Does anyone have experience using the OpenAI API to extract data from web pages?

Imagine wanting to grab news article or blog post data (title, URL, summary, date, picture URL) from an index page that has no RSS or XML feed of its own.

I’m familiar with web scraping, and know that you can use Puppeteer, Python, etc. to extract data, but basically only when you know the paths of the elements to get. If those change, extraction will fail. If you want to extract from a large set of sites, keeping those paths up to date becomes a problem.

I imagine just throwing the index page at GPT, but that seems expensive in terms of token use.
Another idea might be to first convert the index page to plain text and send only that.
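
To make the plain-text idea concrete, here is a rough sketch using only Python's standard-library `html.parser`. The class name `TextExtractor`, the helper `html_to_text`, and the `[link: ...]` / `[image: ...]` markers are my own assumptions, not anything GPT requires. It drops script and style content but keeps link and image URLs, since those are among the fields wanted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, keeping link and image URLs inline (hypothetical helper)."""

    # Tags whose contents are never useful to the model.
    SKIP = {"script", "style", "noscript", "svg"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        attrs = dict(attrs)
        # Keep URLs, because the article URL and picture URL are wanted fields.
        if tag == "a" and attrs.get("href"):
            self.parts.append(f"[link: {attrs['href']}]")
        elif tag == "img" and attrs.get("src"):
            self.parts.append(f"[image: {attrs['src']}]")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return a compact text rendering of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

The output of `html_to_text(page_html)` would then go into the prompt instead of the raw HTML. How many tokens this saves depends on the page, so it is worth measuring on a few real index pages first.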
Still another idea may be to have GPT look at the index page only periodically, say weekly, and generate the required element paths for a dedicated extraction tool like Puppeteer. This would cut down on token usage.
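
A minimal sketch of that weekly-refresh idea, just to show the control flow. Everything here is hypothetical: `ask_model` stands in for whatever function sends the page to GPT and returns a dict of element paths, `extract` stands in for the actual scraper (Puppeteer or a Python equivalent), and the JSON cache file layout is made up for illustration.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional

WEEK_SECONDS = 7 * 24 * 60 * 60

Selectors = Dict[str, str]  # e.g. {"title": "article h2 a", "date": "article time"}
Record = Dict[str, str]


def load_cache(path: Path) -> dict:
    """Read the selector cache from disk, or return an empty cache."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def get_selectors(
    site: str,
    html: str,
    cache_path: Path,
    ask_model: Callable[[str], Selectors],  # hypothetical wrapper around the GPT call
    now: Optional[float] = None,
    max_age: float = WEEK_SECONDS,
) -> Selectors:
    """Return cached selectors if fresh enough, otherwise ask the model again."""
    now = time.time() if now is None else now
    cache = load_cache(cache_path)
    entry = cache.get(site)
    if entry and now - entry["fetched_at"] < max_age:
        return entry["selectors"]  # no tokens spent
    selectors = ask_model(html)  # the only place tokens are spent
    cache[site] = {"selectors": selectors, "fetched_at": now}
    cache_path.write_text(json.dumps(cache, indent=2))
    return selectors


def scrape(
    site: str,
    html: str,
    cache_path: Path,
    ask_model: Callable[[str], Selectors],
    extract: Callable[[str, Selectors], List[Record]],  # hypothetical scraper
    now: Optional[float] = None,
) -> List[Record]:
    """Extract with cached selectors; if nothing comes back, refresh once and retry."""
    selectors = get_selectors(site, html, cache_path, ask_model, now)
    records = extract(html, selectors)
    if not records:
        # The page layout probably changed, so force a refresh (max_age=0).
        selectors = get_selectors(site, html, cache_path, ask_model, now, max_age=0)
        records = extract(html, selectors)
    return records
```

The empty-result retry is one possible answer to the "if those change, extractions will fail" problem: the model is only consulted on a schedule or when the cached paths stop matching anything.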
