How does the new web-search model determine which sites to look into?

How do the new web search model and tool determine which sites to look into?

I tried it for fetching the news agenda. Here are my observations:

When no specific site is mentioned in the message, it returns results from a limited set of news sources, fully annotated. Apparently the model favors these sources.

When I include one of the favored sources but also ask for news from others, only news from that one favored source is returned.

When none of the mentioned sources is favored, it still returns results, presumably from those sites, but this is hard to validate because no annotations are provided in this case.
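To make the three observations above easier to check, here is a minimal sketch that counts which domains are actually cited in a response and lists the requested ones that got no citation. It assumes the response has been converted to a plain dict where annotations appear under output → message → content → annotations as entries of type `url_citation` with a `url` field. That shape and the helper names `cited_domains` and `uncited` are my assumptions for illustration, so adjust them to whatever the tool really returns.

```python
from collections import Counter
from urllib.parse import urlparse


def cited_domains(response: dict) -> Counter:
    """Count url_citation annotations per domain in a response-like dict.

    The nesting used here is an assumed shape, not a documented contract.
    """
    counts = Counter()
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip search-call items and anything else
        for part in item.get("content", []):
            for ann in part.get("annotations", []):
                if ann.get("type") == "url_citation":
                    host = urlparse(ann["url"]).netloc.lower()
                    counts[host.removeprefix("www.")] += 1
    return counts


def uncited(requested: list[str], response: dict) -> list[str]:
    """Return the requested domains that have no citation in the response."""
    seen = cited_domains(response)
    return [domain for domain in requested if domain not in seen]


# Hypothetical response, hand-written for illustration only.
sample = {
    "output": [
        {"type": "web_search_call"},
        {
            "type": "message",
            "content": [
                {
                    "type": "output_text",
                    "text": "...",
                    "annotations": [
                        {"type": "url_citation", "url": "https://www.example.com/a"},
                        {"type": "url_citation", "url": "https://example.com/b"},
                    ],
                }
            ],
        },
    ]
}

print(cited_domains(sample))
print(uncited(["example.com", "example.org"], sample))
```

Running this over a batch of prompts (no site named, one favored site plus others, only non-favored sites) would show whether the citations really collapse to a small set of domains.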

This makes me think there must be some standard the agent looks for, like the good old sitemaps, robots.txt, etc., but for LLMs, and that only its early adopters are favored. However, I couldn’t find any mention of such a thing in the references.

Do you know how to get “indexed” by the web search tool? And is it possible to get the tool to search any web source effectively, providing references and annotations?

And some observations on content freshness…

If the page is already known to the web search model, i.e. it was fetched previously and cached, it is not fetched again. The model claims the information is recent, but it was actually more than a month old. The same is true for RSS feeds.
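For anyone who wants to reproduce the staleness, a rough sketch of how I would measure it: fetch the feed yourself, compute the age of its newest item, and compare that with the date of whatever the tool reported as "recent". This assumes an RSS 2.0 feed whose items carry RFC 822 style `pubDate` values (Atom feeds use different elements). The function name `newest_item_age_days` and the sample feed are made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET


def newest_item_age_days(rss_xml: str, now: datetime) -> float:
    """Age in days of the most recent <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    dates = []
    for item in root.iter("item"):
        raw = item.findtext("pubDate")
        if not raw:
            continue  # items without a date cannot be compared
        parsed = parsedate_to_datetime(raw.strip())
        if parsed.tzinfo is None:
            # Assumption: treat dates without an offset as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        dates.append(parsed)
    if not dates:
        raise ValueError("feed has no dated items")
    return (now - max(dates)).total_seconds() / 86400


# Hypothetical feed content, hand-written for illustration only.
feed = """<rss version="2.0"><channel><title>Example</title>
<item><title>Older</title><pubDate>Wed, 01 Jan 2025 12:00:00 +0000</pubDate></item>
<item><title>Newer</title><pubDate>Mon, 06 Jan 2025 12:00:00 +0000</pubDate></item>
</channel></rss>"""

now = datetime(2025, 1, 16, 12, 0, tzinfo=timezone.utc)
print(newest_item_age_days(feed, now))
```

If the live feed's newest item is a day old while the tool's answer matches an item from weeks earlier, that points to a cached copy rather than a fresh fetch.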

This renders it unusable for agenda monitoring. I couldn’t find a way around this; requesting a fresh fetch in the prompt does not seem to help.