Website/domain classification

I need to find a way to classify websites/domains into several categories, e.g. to precisely answer the question “Is this website a web design agency?” with yes|no.

I have found that the results provided by the GPT-4o API aren’t reliable enough. I have tested two approaches: directly asking for a domain’s category, and providing the HTML META tags plus page text.
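To illustrate, the META+text approach can look roughly like this (a minimal sketch using the openai Python client; the prompt wording, model name, and truncation limit are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_site(meta: str, text: str) -> str:
    """Ask the model for a strict yes/no answer from META + page text."""
    prompt = (
        "Based on the following website metadata and text, answer strictly "
        "'yes' or 'no': Is this website a web design agency?\n\n"
        f"META:\n{meta}\n\nTEXT:\n{text[:4000]}"  # truncate to keep the prompt small
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```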

I just learned it is possible to fine-tune the GPT models. Is this the way to refine the results? Or should I use embeddings instead? Or a completely different technology/solution?
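For example, an embeddings-based setup could embed each sample’s meta+text and train a small classifier on top. A minimal sketch, assuming the openai client and scikit-learn (the model name and sample strings are placeholders):

```python
from openai import OpenAI
from sklearn.linear_model import LogisticRegression

client = OpenAI()


def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of meta+text samples."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]


# labeled samples: meta+text strings plus 1 (yes) / 0 (no) labels
train_texts = ["<meta+text of a web design agency>", "<meta+text of a bakery>"]
train_labels = [1, 0]

classifier = LogisticRegression(max_iter=1000)
classifier.fit(embed(train_texts), train_labels)

# classify a new domain's meta+text
prediction = classifier.predict(embed(["<meta+text of an unknown site>"]))[0]
print("yes" if prediction == 1 else "no")
```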

I can prepare a training dataset of, let’s say, 500 “yes” + 500 “no” meta+text website samples.
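If fine-tuning is the way to go, my understanding is that the training data would be JSONL in the chat format, something like this sketch (the system prompt and sample values are placeholders):

```python
import json

samples = [
    {"meta_text": "<meta+text of a web design agency>", "label": "yes"},
    {"meta_text": "<meta+text of an unrelated site>", "label": "no"},
]

# one JSON object per line, in the chat fine-tuning format
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        record = {
            "messages": [
                {
                    "role": "system",
                    "content": "Answer strictly 'yes' or 'no': is this website a web design agency?",
                },
                {"role": "user", "content": sample["meta_text"]},
                {"role": "assistant", "content": sample["label"]},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```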

When you ask about a domain directly, the model answers from its (older) training knowledge. I doubt the AI has been trained on a directory of millions of domain names rather than on useful world knowledge, so it will make things up.

I would write a web page scraper function, and then a more advanced Selenium-based one for cases where the plain scraper can’t read dynamic content or metadata (see the sketch below).
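Something along these lines (a minimal sketch: requests/BeautifulSoup first, falling back to Selenium when the static HTML yields almost no text; the length threshold and META filter are arbitrary placeholders):

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def scrape_static(url: str) -> str:
    """Fetch the page with plain requests and pull out META + visible text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    metas = " ".join(
        m.get("content", "")
        for m in soup.find_all("meta")
        if m.get("name") in ("description", "keywords")
    )
    return (metas + " " + soup.get_text(separator=" ", strip=True)).strip()


def scrape_dynamic(url: str) -> str:
    """Fall back to headless Selenium so JavaScript-rendered content is included."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return BeautifulSoup(driver.page_source, "html.parser").get_text(
            separator=" ", strip=True
        )
    finally:
        driver.quit()


def scrape(url: str) -> str:
    text = scrape_static(url)
    # arbitrary threshold: very short text suggests the page needs JS rendering
    return text if len(text) > 200 else scrape_dynamic(url)
```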

Or just use a search engine with a “site:mystery.ai” search term, which returns results only for that site, and read the page descriptions from those results. There’s a DuckDuckGo scraper package for Python.
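For example, with the duckduckgo_search package (a sketch; the exact class name and result fields may differ between package versions):

```python
from duckduckgo_search import DDGS


def site_descriptions(domain: str, max_results: int = 10) -> list[str]:
    """Collect result snippets restricted to a single domain via a site: query."""
    with DDGS() as ddgs:
        results = ddgs.text(f"site:{domain}", max_results=max_results)
        return [r["body"] for r in results]


print(site_descriptions("mystery.ai"))
```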