Web-Crawler Tutorial (OpenAI) 403

There are a lot of things I'm not understanding.
First of all, the code runs well with the following (I added spaces so I could post links):

# Define root domain to crawl

domain = "openai . com"
full_url = "ht*ps: // openai. com"

I get a list of the files it's crawling, but many say "can't parse". Why is that?

Secondly, if I change the code:

# Define root domain to crawl

domain = "mysite. com"
full_url = "ht*ps: // mysite. com"

then I get a straight 403 error. Why?

Thank you

Welcome to the forum.

Do you have a link to the tutorial? Some code?

Not all sites will allow you to scrape them…
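
For what it's worth, here is a rough way to check this before crawling. It is only a sketch using the Python standard library, not the tutorial's code: it tests a URL against the site's robots.txt rules, sends an explicit User-Agent header, and reports the HTTP status instead of crashing. The names USER_AGENT, is_allowed, build_request and fetch_status are made up for this example. Some servers appear to reject clients they identify as bots, so a 403 can still come back even when robots.txt allows the path.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name for this sketch; not taken from the tutorial.
USER_AGENT = "my-tutorial-crawler/0.1"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Build a request that identifies the crawler with an explicit User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_status(url: str) -> int:
    """Fetch url and return the HTTP status code, including errors such as 403."""
    try:
        with urlopen(build_request(url), timeout=10) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status (for example 403 Forbidden).
        return err.code
```

To use it, download the site's robots.txt text yourself, pass it to is_allowed together with the page you want, and only call fetch_status on pages that are allowed. If fetch_status still returns 403, the site is most likely refusing automated clients and it is better not to crawl it.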

Thanks Paul!

The tutorial was the one on the OpenAI website:
https://platform.openai.com/docs/tutorials/web-qa-embeddings

I hope this can help you: