I got frustrated with the time and effort required to code and maintain custom web scrapers, so I built a more generic LLM-based solution for data extraction from websites (and potentially other sources). AI should automate tedious and uncreative work, and web scraping definitely fits this description.
One of the killer use cases of large language models like GPT is reformatting information from any format X to any other format Y, so I leveraged that to generate web scrapers and data processing steps on the fly. The big advantage over traditional scraping is that it adapts to website changes and is basically maintenance-free.
Check it out at Kadoa.com and let me know what you think!
Here are some examples:
So you have access to my API key now? OpenAI said to keep it secret.
It gives me an error. No leak, I see.
Thanks for the feedback. Just changed the example and added a robots.txt scan.
As you can see in the network tab we never send your key to any other endpoint than OpenAI. Your example didn’t work because the description wasn’t specific enough. What fields are you trying to extract and from which specific site?
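For anyone curious what a robots.txt scan like the one mentioned above can look like, here is a minimal sketch using Python's standard library. It is only an illustration, not the code running behind the service; the `is_allowed` helper, the `example-bot` user agent, and the sample rules are all made up for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "example-bot") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used only to demonstrate the check.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_RULES, "https://example.com/products"))
    print(is_allowed(SAMPLE_RULES, "https://example.com/private/data"))
```

Against a live site you would normally call `set_url()` and `read()` on the parser instead of passing the text in by hand; parsing a string just keeps the sketch runnable offline.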
That was quick!
Good luck in your endeavor.
Looking forward to seeing the progress
Thanks! Let me know if you have any additional feedback
Can you also handle sites with pagination, infinite scroll, and search filters? E.g., if I would like to extract many hundreds of records?
Yes, the service has simple RPA capabilities like click automation and scrolling. This is not part of the public demo yet though.
Update: we removed the need for an OpenAI key, so you can now try it out for free
Tried it out. Honestly, I found it to be very slow and not any better than any of the other billion commercial off-the-shelf scrapers out there.
The initial scraper generation that is showcased on the playground is indeed quite slow. The cool thing is that the data extraction is basically fully autonomous after the first configuration and automatically adapts to any website changes. Current solutions require constant maintenance.
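To make the "slow first configuration, then autonomous" idea more concrete, here is a rough sketch of one way such a loop could work. It is an assumption for illustration, not the actual implementation: `SelfHealingExtractor`, `apply_config`, and `fake_generate` are hypothetical names, regexes stand in for whatever the real extraction config looks like, and `fake_generate` stands in for the slow LLM-based generation step.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical config format: field name -> regex with exactly one capture group.
Config = Dict[str, str]


def apply_config(html: str, config: Config) -> Optional[Dict[str, str]]:
    """Run every pattern against the page; return None if any field is missing."""
    record = {}
    for field, pattern in config.items():
        match = re.search(pattern, html)
        if match is None:
            return None
        record[field] = match.group(1)
    return record


class SelfHealingExtractor:
    """Reuses a cached config and regenerates it only when extraction fails."""

    def __init__(self, generate_config: Callable[[str], Config]):
        self._generate = generate_config  # stand-in for the slow generation step
        self._config: Optional[Config] = None
        self.generations = 0  # how often the slow step had to run

    def extract(self, html: str) -> Dict[str, str]:
        if self._config is not None:
            record = apply_config(html, self._config)
            if record is not None:
                return record  # fast path: cached config still matches
        # First run, or the page layout changed: regenerate and retry once.
        self._config = self._generate(html)
        self.generations += 1
        record = apply_config(html, self._config)
        if record is None:
            raise ValueError("generated config does not match the page")
        return record


def fake_generate(html: str) -> Config:
    """Pretend generator that picks a pattern based on the markup it sees."""
    if 'class="price"' in html:
        return {"price": r'<span class="price">([^<]+)</span>'}
    return {"price": r'<div data-price="([^"]+)"'}


if __name__ == "__main__":
    extractor = SelfHealingExtractor(fake_generate)
    old_page = '<span class="price">$10</span>'
    new_page = '<div data-price="$12"></div>'  # simulated redesign
    print(extractor.extract(old_page), extractor.generations)
    print(extractor.extract(old_page), extractor.generations)
    print(extractor.extract(new_page), extractor.generations)
```

The point of the design is that the expensive step only runs when the cached config stops matching, so routine extractions stay fast while a site redesign triggers a one-off regeneration instead of manual maintenance.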
Update: We’re now detecting all entities and their properties on a website, so you can conveniently select the data you want to extract from any website. We’ve also shipped some major performance improvements.