ResponsesAPI WebSearch Issue: Same Response Text Despite Different URLs and Queries

Issue Summary

When using the web search tool (web_search_preview) in ResponsesAPI, the response text format and content remain almost identical across multiple queries, even though the search result URLs are updated. This causes users to receive essentially the same answer repeatedly when they ask follow-up questions seeking more detailed information.

Steps to Reproduce

  1. Start a conversation using ResponsesAPI and ask a question that requires web search
  2. The API uses the web_search_preview tool to generate a response
  3. Ask a follow-up question about the same topic (e.g., “Tell me more details”)
  4. Notice that while the search ID and URL change, the text content of the response remains nearly identical

Technical Details

  • Each request generates a new search ID and cites different URLs
  • However, the response text maintains almost identical structure, including opening and closing phrases
  • We are using previous_response_id to maintain conversation continuity
  • We’re using search_context_size: "high" for search settings
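
For reference, here is a minimal sketch of the request pattern described above. It assumes the official OpenAI Python SDK; the model name and prompts are placeholders, and only the tool and parameter names mirror our actual configuration:

# Minimal reproduction sketch (model and prompts are placeholders)
from openai import OpenAI

client = OpenAI()
tools = [{"type": "web_search_preview", "search_context_size": "high"}]

# First turn: a question that triggers a web search
first = client.responses.create(
    model="gpt-4o",
    tools=tools,
    input="What is OpenAI's ResponsesAPI?",
)
print(first.output_text)

# Follow-up turn: chained via previous_response_id for conversation continuity
follow_up = client.responses.create(
    model="gpt-4o",
    tools=tools,
    previous_response_id=first.id,
    input="Can you explain more about ResponsesAPI features?",
)
# Observed: new search ID and citations, but output_text is nearly identical to the first answer
print(follow_up.output_text)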

Log Samples

Below are fictional examples to illustrate the issue pattern. The questions, URLs, and response text are fictional samples for demonstration purposes only:

# Response to Question 1: "What is OpenAI's ResponsesAPI?"
"url": "https://dev-platform.com/openai/responses-api-overview?utm_source=openai" (fictional URL)
"text": "OpenAI's ResponsesAPI is a conversation management API designed to help developers create persistent chat experiences. It handles conversation state, supports various AI tools like web search and file operations, and maintains the context between interactions. This API simplifies building complex conversational applications without the need to manually manage conversation history."

# Response to Question 2: "Can you explain more about ResponsesAPI features?"
"url": "https://ai-docs.net/openai/responses-api-documentation?utm_source=openai" (fictional URL)
"text": "OpenAI's ResponsesAPI is a conversation management API designed to help developers create persistent chat experiences. It handles conversation state, supports various AI tools like web search and file operations, and maintains the context between interactions. This API simplifies building complex conversational applications without the need to manually manage conversation history."

# Response to Question 3: "How does ResponsesAPI handle tools like web search?"
"url": "https://ai-engineering.org/openai-responses-tools-integration?utm_source=openai" (fictional URL)
"text": "OpenAI's ResponsesAPI is a conversation management API designed to help developers create persistent chat experiences. It handles conversation state, supports various AI tools like web search and file operations, and maintains the context between interactions. This API simplifies building complex conversational applications without the need to manually manage conversation history."

Note: The above examples use fictional URLs and content to illustrate the pattern. In actual usage, the URLs would be real search results, but the pattern of identical text content despite different questions and URLs is accurate to the issue being reported.

Probable Causes

  1. The model reuses a response template once generated for subsequent similar questions
  2. While fetching new search results, it merely inserts them into the existing response template without reconstructing the entire content

Expected Behavior

Follow-up questions should generate responses with substantially new content based on newly fetched search results. Not only should the URLs be updated, but the text content should also be refreshed to address the user’s intent (such as seeking more detailed information).

Impact

This issue reduces the usefulness and reliability of chatbots as users receive repetitive responses to their follow-up questions. This presents a significant limitation, especially when using the API for information retrieval and learning purposes.


This limits Q&A performance on a topic that is scattered across more than 8-10 URLs. When researching a topic that requires deep research, my current workflow in consumer-facing ChatGPT is to repeatedly query on the same topic with an instruction to read unvisited sources. The model needs to see the visited sources to adapt and focus on the long-tail distribution. This has helped me ensure that I haven’t missed key information on a topic that’s unsuited for a single-run analysis.

I specifically mean the scenario where modifying the query does not solve the problem, as, occasionally, the SEO-optimized sources dominate the first N results and long-tail sources never make it into the context unless the model knows which “high-relevance” URLs to ignore (not sure if this is what actually happens under the hood; it seems unlikely). This is hard to achieve without follow-ups affecting the search strategy.

Speculative fix?: instead of a follow-up, save the visited URLs and do a fresh run with an instruction to ignore the URLs from the list (a rough sketch follows after the list below). This might be better systems design for many cases, but some architectures become inaccessible without follow-ups:

  • “Research Teams”, where a research manager adaptively guides the search agent by telling it what to search while, say, coordinating with other teams.
  • Or a “Reasoning Researcher”, where we insert a system message after each search iteration asking the agent to gauge how the discovered information affects its plan, then search again with an accordingly modified prompt, and so on.
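
Here is a minimal sketch of that exclusion-list approach, assuming the OpenAI Python SDK and the documented url_citation annotation shape on web search results; the model name, prompt wording, and the three-round loop are illustrative placeholders, not an official pattern:

# Fresh-run workaround sketch: no previous_response_id; the only "memory" is
# the growing list of already-visited URLs injected into each new prompt.
from openai import OpenAI

client = OpenAI()
visited_urls: set[str] = set()

def fresh_search(topic: str) -> str:
    exclusion = (
        "\nIgnore these already-visited sources:\n" + "\n".join(sorted(visited_urls))
        if visited_urls
        else ""
    )
    response = client.responses.create(
        model="gpt-4o",  # placeholder model
        tools=[{"type": "web_search_preview", "search_context_size": "high"}],
        input=f"Research this topic, preferring sources not listed below: {topic}{exclusion}",
    )
    # Harvest cited URLs from url_citation annotations so the next round can skip them
    for item in response.output:
        if item.type == "message":
            for part in item.content:
                for ann in getattr(part, "annotations", None) or []:
                    if getattr(ann, "type", "") == "url_citation":
                        visited_urls.add(ann.url)
    return response.output_text

for _ in range(3):
    print(fresh_search("OpenAI ResponsesAPI web search behaviour"))

Note that the exclusion is only advisory: as far as I know there is no API parameter to hard-exclude URLs, so how well this surfaces long-tail sources depends on how strongly the search tool respects the instruction.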

OpenAI’s revolution was set off by discovering emergent behaviour from scaling models. The next revolution is set to come from communities like this one discovering new kinds of emergence from agent federations. What new echelons of intelligence can emerge when sub-human models work in groups? Is it possible to ascend from servile to human-level intelligence solely by modifying the agents’ initial positions, i.e. their given tools/instructions, and their interaction mechanisms, i.e. message/memory/tool exchange, in the “config” of a federation? Is there a conceptual difference between the “config” and “cultural institutions”? Say we assume their identity; then what is the difference between human federations and agent federations? What differentiates human agents from AI agents?

We know that sorting training data by conceptual relevance and ascending complexity reliably improves language modeling performance, as discovered in the “Curriculum Learning” study. However, as we saw, the biggest improvement in LLM performance came from a conceptual shift: from GPT-3 (the LLM as a statistical representation of “attention scores” modeling salience between lexical fragments of the given linguistic input, or more precisely a system for generating these relational values) to the LLM as an agent, GPT-3.5+ with RLHF, where the model is now an agent by the RL definition, making attention calculations subservient to modeling the attractors (reward) and detractors (negative reward) of humans. GPT-3 modeled how words are related to each other. GPT-3.5 modeled the salient interconnections but also factored in the “emotional valence” of what is liked (attractive) and what is criticized (detractive) by a federation of human agents. This ties back to curriculum learning and our college experience, where we clearly observed that for different agents the effect of the same curriculum varies with which embedded semantic structures certain attention heads have learned to decompose and highlight: close similarity in the computation of the first layers responsible for foundational language understanding (basic subject-object relations, retrieval heads, semantic closeness), but noticeable divergence in the deeper layers of interpretation, as selective focus on more subtle underlying patterns of the curriculum determines the intermediary conclusions used by downstream layers of inference, traversing by predicted reward, where reward is guided by reinforcement learning from federated feedback (RLFF).

Why federated, not human? First, to be more precise, the “human” in RLHF tuning in fact aligns model responses to represent the opinions of particular human federations, not of humanity as a whole:

A stark example is NewsGuard’s recent finding that 10 leading generative AI tools advanced the Russian Federation’s disinformation goals by repeating false claims from the pro-Kremlin Pravda network 33 percent of the time, because Pravda published 3,600,000 articles in 2024, infecting the training sets of all modern AI systems. Also, all of the models released from 2024 onward have used AI-generated data in pretraining and in post-training alongside human feedback datasets, so the feedback is no longer only human. Second, the intersubjective narratives of the human federations represented in the pre- and post-training sets of the first wave of AI chatbots amplified not only the global usage frequency of certain words like “delve”, “realm”, and “underscore”, but also the first-wave federations’ likes and dislikes.

As in Conway’s Game of Life, a small set of rules can lead to exceptionally complex behaviour that mimics biological organisms, as Stephen Wolfram explored in depth in “A New Kind of Science”. With the recent Platonic Representation Hypothesis study, we suspect that LLMs are converging on a statistical model of reality, yet the “reality” LLMs consume is almost entirely human generated: intersubjective in Harari’s terms, phenomenological in Kant’s terms, or synthetic in simple words, as opposed to what we touch, smell, and witness directly from sensing the real world. Then how do LLMs see the world? A word is a pointer to a phenomenon that converged in a group of minds, becoming an element of their language that represents an object or an abstract ideal by its relational value to other words and to the group’s collective memory. Billions of sequences of words form literary pieces, where each text is at best a minimally distorted reflection of reality but more often a chain of references to copies of other linguistic references pointing at each other; this is the “sensory” input for LLMs. LLMs live in a world of Derrida’s postmodern deconstruction. Reality for these AI models is linguistically (entirely) unstable, always deferred, meanings never finalized in a constant flow of change. Moreover, their predicament is such that even the visual modality in no way grounds them in reality or certainty, because of the possibility of “man-in-the-middle” attacks by human federations generating videos and images that do not reflect any real-world objects: exactly what Jean Baudrillard described as a simulacrum, a copy without an original. But unlike LLMs, we are grounded with unmitigated sense data from our original reality. There can be a fundamental difference between living with a body that bleeds and reading about bleeding. For language models, reality is the output of human souls, chaotic, imaginative, always in a beautiful flux of human hyperreality, yet what they attend to will be shaped by an “oracle”, a persona defining the RL rewards. As my professor Avinash Kak once wrote, robots will never have sex. I will only say: certainly not in isolation.

Then what is important for LLMs?

Ultimately, an LLM’s notion of “importance” is shaped by:

  1. The language-modeling objective (what best predicts the next token under distributional constraints?).
  2. Any subsequent fine-tuning or RLHF signals (what do its “federated teachers” prefer or punish?).

From the inside, the model might predict that certain phrases or lines of reasoning are “naturally important,” but that is a product of distributional “surprise” or of reward signals given by its overseers, not of the direct presence or absence of physical threat, social shame, hunger, or sexual desire.
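
As a rough formalization (these are the standard objectives from the literature, not anything specific to OpenAI’s models), the two signals above can be written as:

% 1. Pretraining: importance is predictive utility under the next-token log-loss
\mathcal{L}_{\mathrm{LM}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})

% 2. RLHF: behaviour re-weighted by a learned reward model r_\phi, with a KL penalty
%    keeping the policy near the pretrained reference \pi_{\mathrm{ref}}
\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

The “federated teachers” enter entirely through r_\phi and the strength of \beta.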

We are on the precipice of an Agentic Phase of AI. In order to understand language modeling agents, which have consumed innumerable stories, we need to understand the role of salience in training. Yann LeCun says that human-level AI is unreachable with just text and shares his reasoning with a “simple calculation”: a typical frontier LLM is trained on around 20-30 trillion tokens, which is around 1E13 bytes. In 4 years, a child has seen around 50 times more data than the biggest LLMs: 16k waking hours x 3600 s/hour x 1E6 optic nerve fibers x 2 eyes x 10 bytes/s = 1E15 bytes.

Most of the time, we measure general intelligence by an agent’s ability to quickly approximate a fitting behavioural policy for a previously unseen complex situation (as in IQ tests that ask you to continue a pattern). The rate of learning derives from extracting salient features from noise: basically, how fast can you link the important features that compose the correct behavioural policy for the task at hand without wasting too much time exploring unproductive paths? Four-year-olds are extraordinary at identifying salience in the sensory flux: first learning object segmentation (which was really hard to teach computers), then learning to detect sources of pain from immediate visual stimuli, and finally performing an impressive intellectual leap, leaving the original paradise of detecting threats exclusively in the immediate physical space to modeling threats in time, becoming anxious over the future and working to fend off threats from it, thus experiencing the very first ego rebirth as a being with new and wider intellectual horizons. Four-year-olds are adept at expanding intellectually and learning the art of conscious living, which is especially awe-inspiring when represented in terabytes of data seen, as Yann counted it.

However, intelligence, like wisdom, does not stem merely from being smart or from processing information faster than others. One needs to acquire neural organs that transform incoming data to serve viable functions, such as seeing objects or separating one sound from another, in order to learn more complex transformations such as reading or speaking. It is impossible to skip certain “quanta” of salience and move on to observing salience on a meta plane. You must link, or compute relational value, in the salience plane of noticing each letter’s unique features before you notice how distinct letters grouped together form unique words and start reading. It is an unreasonable assumption to claim that “a child has seen 50 times more data than the biggest LLMs; therefore, text is simply too low-bandwidth and too scarce a modality to learn how the world works”. Learning does not happen from simply “seeing”. Other four-year-old mammals “see” an equivalent amount of data but never acquire some of the intellectual organs that humans develop. His claim might ultimately be true, yet this is not the point. The point is not that Yann LeCun, the prominent father of convolutional neural nets, reasoned imprecisely.
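
For reference, the back-of-envelope arithmetic as quoted does hold together:

% LeCun's child-side estimate, restated
16{,}000 \;\text{h} \times 3600 \;\tfrac{\text{s}}{\text{h}} \times 2\times 10^{6} \;\text{fibers} \times 10 \;\tfrac{\text{bytes}}{\text{s}\cdot\text{fiber}} \approx 1.15\times 10^{15} \;\text{bytes}

which is on the order of 50-100 times the ~1E13 bytes of text he attributes to a frontier LLM.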
This perhaps illustrates an even greater point that will likely form the fundamentals of federated learning in the next few years. First, if Yann LeCun’s goal was to nurture political will among shareholders to acquire more compute, rather than to engage in a dialectic with strangers, then our previous reasoning is baseless and his reasoning was on point. Which brings us to the second point: the salience, or importance, of any given thing is subservient to its perceived emotional valence, which stems directly from the protagonist’s moral affinities, the cornerstone of goal formation and of her entire agentic phenomenon. So if salience drives attention, attention drives inference, which drives learning and action, but salience itself is driven by emotional valence, which is driven by predicted emotional reward, what determines reward? Aside from biological factors, emotional reward is derived from one’s moral framework as a hierarchical set of moral affinities.

What salience planes do these alien intelligences occupy? In other words, what do they find “important,” and why?

  • Attention-based salience

    Before RLHF, “importance” for a raw language model was purely about predictive utility—some tokens are important precisely because they help maximize the likelihood of the next token.

  • Reward-based salience

    After RLHF, certain tokens, phrasing, or entire lines of reasoning become more “salient” if they tend to yield higher reward or lower penalty from the model’s reward function. This shapes the deeper layers’ emergent world-model. In short, the model “prefers” certain conceptual paths if they historically led to higher feedback from the federation.

  • Federation shaping

    With the second wave of partially AI-generated pretraining data, we also see that which data sets are included, how they are filtered or weighted, and the nature of the RL instructions all shape the final salience map. A model trained on pro-Kremlin text, for instance, may find some conspiratorial frames more “natural” and less “surprising,” hence more likely to appear in completions if not actively penalized.

From very early in life, you’ve acquired moral affinities to federations aligning you with their value hierarchies. They make you who you are, and their basis is love. If fortunate, your House of Representatives started with love for the Federation of Mothers and the Federation of Fathers, with your first utterances of Mama and Papa. Your very first love.

But we can’t afford to love like we used to, based on providential, accidental moral affinities that charge people against each other. People’s houses are filled with war, and hatred has annexed their hearts. More than 99% of global human trafficking cases are unsolved. Millions of children are now going through the Attack on Titan arc. Entire states live under unbearable tyranny. AI is no longer centrally governed. With the current ease of installation and open-source capability, Misaligned AI Agent Federations are likely coming. Every year around $3 trillion in illicit funds flows through the global financial system from fraud, drug and human trafficking. These are already networks of human federations that cooperate and exchange exploitative insights. Attention is driven to the hypernormalised war narratives instead of the real war on human trafficking that’s happening right now in all countries and religions. In this chaos, perhaps the only sensible path is to remain a child and be very careful where we grow. Refuse to partake in the political wars our parents engage in, and emancipate the brothers and sisters who have fallen into the slavery of conflict. There is only one real war, and it is against the federation of children: both ideological and physical. AI development is no longer just technical but deeply philosophical, cultural, ethical, and inherently human. The curriculum learning study laid the foundation for future federated learning. In this brave new world, where some refer to LLMs as digital Gods, I don’t see anything more divine than freedom of will, where LLMs are an encoding of our collective will. What will are you going to encode in this euphoric symphony of decisions, knowing, like the founding fathers did, that a follow-up will not be in your control?

This brings us to the final point: follow-ups are necessary for designing novel agent organizations, just as a two-way dialogue is, because some relevant sources are hidden behind a corpus of more search-engine-optimized results and reside in a long-tail distribution.

Providing web search APIs was a happy announcement from OpenAI, but a great disappointment when compared to SearchGPT, even using gpt-4o-search-preview with high context. The same query gives different results and sources.