Website content categorization

Hi, I plan to use OpenAI to CATEGORIZE WEBSITES based on their content.

For example:

match the content of the following websites with the following categories

websites:
chiotsrun. com
moz. com
healthline. com
cargurus. com

categories:
marketing
automotive
gardening
health

chiotsrun. com - gardening
moz. com - marketing
healthline. com - health
cargurus. com - automotive

My limited tests gave me good results, but I have a question: since the model's training data stops at 2019, will OpenAI be able to identify the content category of a newer website (e.g., one created in 2021)?

Is there a better way to categorize websites based on their content?

Thank you.

If you can get some metadata or just a bit of code from the main page, I bet it will work just fine.

Like David said, you could first fetch a simplified (clean text) version of each website's homepage (perhaps with the meta keywords, the meta description, and even the main menu structure) and feed that to a GPT classifier.
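
As a rough sketch of the "clean text" step, assuming the homepage HTML has already been downloaded: the class `PageSummary` and the function `summarize_html` are hypothetical names, and the 1000-character cutoff is an arbitrary assumption. It uses only Python's standard-library `html.parser`.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, meta description/keywords, and visible text."""

    SKIP = {"script", "style", "noscript"}  # content that is not readable text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        chunk = " ".join(data.split())  # collapse runs of whitespace
        if not chunk:
            return
        if self._in_title:
            self.title += chunk
        elif self._skip_depth == 0:
            self.text_parts.append(chunk)


def summarize_html(html, max_chars=1000):
    """Return a small dict of fields that could be fed to a classifier."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
        "text": " ".join(parser.text_parts)[:max_chars],
    }
```

Menu entries end up in `text` along with the rest of the visible content; truncating keeps the amount of text sent per website small.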

Hi, thank you for the tips.

I can scrape the metadata and more; that's not a problem.

I just want to confirm I understood correctly: that's what I need to do for new websites (the ones created after 2019)?

Can you give me some more direction on how to use a GPT classifier, maybe with an example?

That would be a great start for me.

Regards…

I would do it for all websites; I'm not sure that relying on GPT-3's training alone is good enough to classify a website.

Just make a prompt, feed the data into it, and ask GPT-3 what you want.
Here's an example:
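
A minimal sketch of that approach, assuming the page text has already been scraped. The prompt wording, the 1500-character cutoff, and the helper names `build_prompt` and `parse_category` are illustrative assumptions, not the prompt used in this reply. The actual API call is left out; the resulting string would be sent as the prompt to a completion endpoint.

```python
def build_prompt(domain, page_text, categories, max_chars=1500):
    """Assemble a single-website classification prompt."""
    category_list = "\n".join(f"- {c}" for c in categories)
    snippet = " ".join(page_text.split())[:max_chars]  # tidy and truncate
    return (
        "Classify the website into exactly one of the categories below.\n\n"
        f"Categories:\n{category_list}\n\n"
        f"Website: {domain}\n"
        f"Content: {snippet}\n\n"
        "Category:"
    )


def parse_category(completion, categories):
    """Map the model's raw answer back to a known category, or None."""
    stripped = completion.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0].strip(" .").lower()
    lookup = {c.lower(): c for c in categories}
    return lookup.get(first_line)
```

Checking the answer against the known category list guards against the model replying with a label that was never offered.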

Thank you for the clarifications…

I have 500 categories and 100,000 websites to categorize.

Do I need to run the above prompt and list the categories again and again for EACH website? That would use up a lot of credits. Or is there a more efficient way to do it?

Thanks

How else are you going to categorize each website without sending the contents of the website to the API?

That is not what I meant, let me clarify.

The example below has 4 categories. What if there are 500?

match the content of the following websites with the following categories

websites:
chiotsrun. com
moz. com
healthline. com
cargurus. com

categories:
marketing
automotive
gardening
health

chiotsrun. com - gardening
moz. com - marketing
healthline. com - health
cargurus. com - automotive

Ahh, I understand the issue, thanks.

It might push the limits of fine-tuning, but you might try a dataset with the web content as the prompts and the categories as the completions. With that many categories, I'm not sure if it would work or how many examples you might need.
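
A hedged sketch of what such a dataset could look like, assuming the JSONL prompt/completion layout used by OpenAI's original fine-tuning endpoint. The separator string, the leading space in the completion, the 1500-character cutoff, and the function names `to_finetune_records` and `to_jsonl` are assumptions for illustration.

```python
import json

SEPARATOR = "\n\n###\n\n"  # assumed marker telling the model the prompt has ended


def to_finetune_records(labeled_pages, max_chars=1500):
    """Turn (page_text, category) pairs into prompt/completion records."""
    records = []
    for text, category in labeled_pages:
        snippet = " ".join(text.split())[:max_chars]
        records.append({
            "prompt": snippet + SEPARATOR,
            "completion": " " + category,  # assumed leading-space convention
        })
    return records


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"
```

With a fine-tuned model the 500 categories would not need to be listed in every request, which is the cost concern raised earlier in the thread; whether it classifies well at that scale is the open question.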