Bug report: Incorrect Dutch-German language mixing during Vector Store query rewriting

Summary

When using the Vector Stores Search API with rewrite_query=true, the query rewriting feature incorrectly mixes Dutch and German languages, producing nonsensical hybrid queries that break semantic search functionality.

Environment

  • API: OpenAI Vector Stores Search API

  • Endpoint: POST /v1/vector_stores/{vector_store_id}/search

  • Parameter: rewrite_query=true

  • Language: Dutch (nl-NL)

Bug Description

The query rewriting feature translates portions of the Dutch text into German while retaining other Dutch words, creating hybrid queries that are valid in neither language.

Expected Behavior

When provided with a Dutch query, the system should either:

  1. Keep the query in Dutch for optimal semantic search

  2. Translate the entire query to English if optimization requires it

  3. At minimum, maintain linguistic consistency within a single language

Actual Behavior

The query rewriting produces a mixture of Dutch and German that is grammatically incorrect and semantically confusing in both languages.

Reproduction

Original Query (Dutch)

Wie fietst of loopt vaker?

Translation: “Who cycles or walks more often?”

Rewritten Query (Incorrect Dutch-German Hybrid)

Wer fietst vaker dan loopt?

Issues:

  • “Wer” is German (should be “Wie” in Dutch)

  • “fietst” is Dutch (correct)

  • “vaker dan” is Dutch (correct)

  • “loopt” is Dutch (correct)

  • The grammar is broken: a German question word is combined with Dutch verb conjugations

Code to Reproduce

from openai import OpenAI

client = OpenAI()

search_result = client.vector_stores.search(
    vector_store_id="vs_xxx",  # Your vector store ID
    query="Wie fietst of loopt vaker?",
    max_num_results=10,
    rewrite_query=True
)

print("Original query: Wie fietst of loopt vaker?")
print(f"Rewritten query: {search_result.search_query}")
# Expected: Dutch query or English translation
# Actual: "Wer fietst vaker dan loopt?" (German-Dutch hybrid)

If any more information is needed to reproduce or fix this, let us know.


This was an interesting one to approach. I asked myself: can the rewriter follow instructions placed in the input (which is likely inserted as text into its own prompt), keep them separate from the query, and leave the query itself undamaged?

Answer: Yes.

You gave the language code. That’s an interesting thought, something the API could provide as a parameter to guide the rewriter.

Question: Will you know that language or code as an input, or do you know the language of the documents?

If the ISO language code is known, or at least the language that works best for semantic search against your documents, here’s how I approached it, using a Python library (langcodes) to map the code back to a language name.

def query_in_language(query: str, language: str = "nl-NL") -> str:
    """
    Return an instructional string for the query rewriter that fixes the language
    of the rewritten query to the specified language. Adapt the return for best results.

    Example:
        query_in_language("penguin brain size", "en-US")
        -> "Rewrite and extend this semantic search query in English language, targeting (en-US) documents: `penguin brain size`"
    """
    from langcodes import Language  # third-party package: langcodes
    query = query.strip()
    if not query:
        raise ValueError("query must be a non-empty string")

    tag = (language or "en-US").strip()
    lang = Language.get(tag)
    language_name = lang.language_name("en")
    display_tag = lang.to_tag()

    return (f"Rewrite and extend this semantic search query in {language_name} language, "
        f"targeting ({display_tag}) documents: `{query}`")

Or you could simply write the function's output string yourself.

>>> query_in_language(query, language_hint_code)

'Rewrite and extend this semantic search query in Dutch language, targeting (nl-NL) documents: `Wie fietst of loopt vaker?`'
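The prefixed string then goes to the search endpoint in place of the raw query, with rewrite_query still enabled. A minimal sketch of that step, reusing the search call from the original report: `search_with_language_hint` is a hypothetical helper name, the language name and tag are hard-coded defaults instead of being looked up, and `client` is assumed to be an `OpenAI()` instance.

```python
from typing import Any


def search_with_language_hint(
    client: Any,
    vector_store_id: str,
    query: str,
    language_name: str = "Dutch",   # assumption: fixed by the caller, not detected
    tag: str = "nl-NL",
    max_num_results: int = 10,
) -> Any:
    """Send the query wrapped in the language instruction, with rewriting enabled."""
    query = query.strip()
    if not query:
        raise ValueError("query must be a non-empty string")

    # Same instruction wording as query_in_language() produces.
    prompt = (
        f"Rewrite and extend this semantic search query in {language_name} language, "
        f"targeting ({tag}) documents: `{query}`"
    )
    # Same call and parameters as in the bug report; only the query text changes.
    return client.vector_stores.search(
        vector_store_id=vector_store_id,
        query=prompt,
        max_num_results=max_num_results,
        rewrite_query=True,
    )
```

Passing the client in as an argument keeps the helper easy to try against a stub before spending real API calls.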

Returns from the vector store search API:
"search_query": ["Wie fietst er vaker, en wie loopt er vaker?"]

Since I don’t speak the language, here is what an AI says about the rewrite:

This is:

  • 100% Dutch, no German contamination.

  • Grammatically correct:

    • Wie (who)
    • fietst er vaker / loopt er vaker – idiomatic Dutch use of “er”.
  • Semantically:

    • Your original was “Who cycles or walks more often?” (single question with of).
    • The new version is “Who cycles more often, and who walks more often?”

If an AI model is generating the query (for example through a function call), you can make that instructional prefix a required part of your function's instructions, or simply instruct that the query must be long enough to leave no ambiguity about which language it is written in.
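As a rough illustration of that idea, the language requirement can live in the tool description the model sees. The sketch below assumes the Chat Completions style of function-tool definition; the helper `make_search_tool`, the tool name `search_documents`, and the wording of the descriptions are all made up for illustration and have not been tested against the rewriter.

```python
def make_search_tool(language_name: str = "Dutch", tag: str = "nl-NL") -> dict:
    """Build a function-tool definition whose description pins the query language.

    Hypothetical example: the tool name and instruction wording are assumptions.
    """
    return {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": (
                f"Search the knowledge base. Write the query in {language_name} ({tag}) only, "
                "as a full sentence long enough that its language is unambiguous."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": f"Complete {language_name} search question.",
                    }
                },
                "required": ["query"],
            },
        },
    }
```

The returned dict would go into the `tools` list of the model call that generates the query; the search itself is unchanged.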

Afterthought: the query is run against AI embeddings, and I wonder how much mixed-language wording would actually hurt the ranking of semantic search results. The rewriter doesn’t do much in the way of HyDE (Hypothetical Document Embeddings), that is, simulating document text to search with.
