With hundreds of questions open on the general topic of “searching” I’m hoping OpenAI can share some insight into how search generally works so that we can set expectations and refine our approaches - not just in a response here, but as an ongoing policy.
For example:
When we ask ChatGPT to “search the web”, what is its exact order of operations?
- Search user chats first, if available?
- Reduce search concepts expressed in prompt to minimal keywords?
- Send query to Bing? Google? Local OpenAI storage?
Knowing that searches are keyword-based rather than semantic would compel a user to spend less time explaining intent and more time choosing keywords. If searches are semantic, how can we see the distilled search query that a model creates, so that we can better refine a prompt? If we knew that a search for “best BBQ in Texas” is reduced to “good rib restaurants”, we could take action to refine the search. Without that information we don’t know the basis for the search results, and thus the basis for an assistant response built on those results.
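We have no visibility into how (or whether) a prompt is distilled into a query server-side, but the kind of reduction described above can be illustrated with a naive client-side sketch. The stopword list and the extraction logic here are my own assumptions for illustration, not anything OpenAI has documented:

```python
import re

# Tiny stopword list for illustration only; any real distillation
# step (if one exists server-side) is opaque to us.
STOPWORDS = {
    "the", "a", "an", "in", "on", "of", "for", "to", "is", "are",
    "what", "what's", "i", "me", "my", "please", "find",
}

def distill_query(prompt: str) -> str:
    """Reduce a natural-language prompt to bare keywords."""
    words = re.findall(r"[a-z0-9']+", prompt.lower())
    return " ".join(w for w in words if w not in STOPWORDS)

distill_query("What's the best BBQ in Texas?")  # → "best bbq texas"
```

If users could see the actual distilled query the way this sketch exposes it, they could iterate on prompts with far less guesswork.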
Where exactly does OpenAI get its search data? How often is the web polled for freshness? If keyword searches are sent to Google and then cached, maybe we should be doing our own searches and returning our own refined results to an assistant for actual processing.
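If it turns out we are better served doing our own retrieval, the pattern is straightforward: run the query against an engine we choose, then hand the results to the model as explicit context. A minimal sketch of the formatting half of that pattern - the result fields and the “context block” framing are my assumptions; the results themselves would come from whatever search API you actually use (Google via a SERP API, Bing, your own index):

```python
def build_search_context(results: list[dict]) -> str:
    """Format caller-supplied search results into a context block
    that the assistant is asked to ground its answer in."""
    lines = ["Use ONLY the sources below to answer."]
    for i, r in enumerate(results, 1):
        lines.append(f"[{i}] {r['title']} ({r['url']})\n{r['snippet']}")
    return "\n\n".join(lines)

# Placeholder results standing in for your own engine's output.
results = [
    {"title": "Texas BBQ Guide", "url": "https://example.com/bbq",
     "snippet": "A survey of brisket and rib joints across Texas."},
]
context = build_search_context(results)
# `context` would then be sent to the model alongside the user's
# question, e.g. as a system or user message in a chat completion.
```

This way the developer, not an opaque pipeline, controls recall, freshness, and source diversity.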
Knowing the engine facilitates choices. For example: I am not fond of Bing and prefer Google. I have no idea what OpenAI does, but I suspect there are fewer results retrieved and cached. So I’d prefer the assistant use a Google search via a SERP API or similar. But maybe OpenAI queries both engines and others, then de-dupes and caches. With no knowledge of the processes I can’t make informed choices. I can only hope that search results are current, sufficiently diverse, and sourced from a rich pool of data.
If local searches performed first add bias to internet queries, then we’re facing the same problem we all have with search results that reinforce a world view based on prior searches. I don’t want that status quo; I want data from the broad pool of internet content, with bias-free filtering for what I asked for - not for what some entity thinks I want, or what some entity prefers that I see.
Related: if there’s no local influence or bias, the exact same prompt from any two people should return fairly deterministic results. This won’t be the case if OpenAI is sourcing from localized engines, but it might be if OpenAI sources and de-dupes from multiple localized engines (google.co.uk, google.in, bing.de, bing.com.au …).
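The de-duping step across localized engines is the kind of thing developers could also do client-side if the pipeline were open to us. A hedged sketch of URL canonicalization - which query parameters count as tracking noise is my own heuristic:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Common tracking parameters to strip; extend as needed (assumption).
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL so the same page surfaced by google.co.uk,
    bing.de, etc. compares equal: lowercase the host, drop 'www.',
    strip tracking params and any trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING])
    return urlunsplit((parts.scheme.lower(), host,
                       parts.path.rstrip("/"), query, ""))

def dedupe(urls: list[str]) -> list[str]:
    """Keep only the first occurrence of each canonical URL."""
    seen, out = set(), []
    for u in urls:
        key = normalize_url(u)
        if key not in seen:
            seen.add(key)
            out.append(u)
    return out
```

Whether OpenAI does anything like this is exactly the kind of detail we’re asking to have documented.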
It would help to have some insight into how OpenAI chooses (or doesn’t choose) data sources for live queries or storage.
We actually know that there is some bias in searches, because Settings tells us that user Memory can or will be used in searches. How can a user see the exact queries that have been posted to remote search engines, and the exact responses that came back?
This is important: someone doing a search for medical information for a family member will get results based on their own personal Memory data. The average user might just ask “what’s a good cure for a cold” without providing the important context that they’re asking for someone else. Knowing how searches are performed, and what data is sent and received, can be critical for a significant scope of use cases - personal, business, medical, technical, etc.
We can modify our ChatGPT searches with syntax like “site:example.com”. Does this tell us that ChatGPT uses Google? Or does it tell us that OpenAI has replicated this functionality for more specific searches? What are the limits to this syntax? Why do we need such syntax if we just specify the example.com site in a prompt? Should we be aware of other syntactical tricks like this (from Google or Bing documentation) to modify how ChatGPT does searches?
Are answers to these questions the same for the OpenAI web search API?
When a reasoning model searches and pulls back responses, then considers the data, how do we know what data is being used by the model behind the scenes to formulate its final response?
Do answers to any of these questions change with the model or based on a model’s training cut-off date?
Will Projects be enhanced with search queries processed to include project-specific instructions? This isn’t a question about future development; it’s intended to get OpenAI to include consideration and documentation of such things in all such updates.
I understand that this is a long note and that all questions can’t be answered here or perhaps elsewhere. What I’m trying to do is to establish a base for transparency on this topic. I’d like OpenAI to be more aware that the mystery of this significant component can be as much a liability as an asset.
Content-seeding for bots and web scrapers isn’t a widely recognized thing yet. But it will be soon. SEO and search engines have a love/hate relationship regarding how keywords and phrases can, should, and should not affect processing. There are, and will be more, websites that seed their content specifically for ChatGPT and other assistants, for their own purposes - sometimes nefarious. Imagine meta tags in web pages, directing people to harm themselves, being scraped by ChatGPT to process a self-help prompt. We are there. But we have no insight into what happens with our searches, so we don’t know how to handle this.
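Client-side, the least we can do today is strip the machine-directed channels - meta tags, scripts, hidden elements - from any page we scrape ourselves before the text reaches a model. A minimal sketch using the stdlib parser; what counts as “hidden” here is my own heuristic, and a real sanitizer would need to cover far more cases:

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect only human-visible text, dropping <head>, <script>,
    <style>, and elements styled display:none - the channels a
    content-seeding attack is most likely to use. <meta> payloads
    live in attributes, so they never reach handle_data anyway."""
    SKIP = {"script", "style", "head"}

    def __init__(self):
        super().__init__()
        self.skip_stack = []   # open tags whose subtrees we discard
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "") or ""
        hidden = "display:none" in style.replace(" ", "")
        if tag in self.SKIP or hidden:
            self.skip_stack.append(tag)

    def handle_endtag(self, tag):
        if self.skip_stack and self.skip_stack[-1] == tag:
            self.skip_stack.pop()

    def handle_data(self, data):
        if not self.skip_stack and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

This is only a stopgap; it does nothing for searches that run inside ChatGPT itself, which is why transparency about that pipeline matters.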
Let’s be proactive.
OpenAI - please help the developer community to understand how this works so that we can create better client-side tooling. Help us to reduce the chances of bad things happening when we accept text from average human beings, send it to you for processing, and then return text that we can only hope won’t get us all into trouble. And let’s help ChatGPT users to understand how this works too.
Thanks.