How do the new web search model and tool determine which sites to look into?
I tried it for fetching the news agenda. Here are my observations:
When no specific site is mentioned in the message, it returns fully annotated results, but only from a limited set of news sources. Apparently the model favors these sources.
When I include one of the favored sources but also ask for news from others, only news from that one favored source is returned.
When none of the mentioned sources are favored, it returns results, presumably from these sites, but that is hard to validate because no annotations are provided this time.
This makes me think there must be a standard the agent is looking for, like the good old sitemaps, robots.txt, etc., but for LLMs, and that only its early adopters are favored. But I couldn’t find any mention of such a thing in the references.
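For comparison, this is roughly how a classic crawler decides whether a site permits it, using Python's standard `urllib.robotparser`. It is only a sketch of the robots.txt mechanism, not of how the web search tool actually selects sources, which I don't know. The robots.txt content, the `ExampleSearchBot` user agent, and the `is_allowed` helper are hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would read the site's own file.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleSearchBot", "OtherBot"):
        for path in ("/news/today", "/private/draft"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

If the tool honored something like this, a site blocking its crawler would simply never show up in results, which could look the same as not being "favored".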
Do you know how to get “indexed” by the web search tool? And is it possible to get the tool to search any web source effectively, providing references and annotations?