I'm building a multilingual chatbot (Italian, English, Spanish, etc.) that acts as a travel consultant for a specific city, using semantic caching with a vector database to reduce LLM API costs and latency.
## Current Architecture
Cached responses are stored with embeddings and language metadata:
```python
# English entry
{
    "embedding": [0.23, 0.45, ...],
    "metadata": {
        "question": "what are the best restaurants?",
        "answer": "The best restaurants are: Trattoria Roma, Pizzeria Napoli...",
        "language": "en"
    }
}

# Italian entry
{
    "embedding": [0.24, 0.46, ...],
    "metadata": {
        "question": "quali sono i migliori ristoranti?",
        "answer": "I migliori ristoranti sono: Trattoria Roma, Pizzeria Napoli...",
        "language": "it"
    }
}
```
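For reference, the write path looks roughly like this (a sketch; `vector_db.upsert` and `embed()` are placeholders for whatever vector DB client and embedding model you use):

```python
import uuid

def cache_answer(question: str, answer: str, language: str) -> None:
    """Store a question/answer pair along with its language tag (sketch)."""
    vector_db.upsert(
        id=str(uuid.uuid4()),
        embedding=embed(question),  # same embedding model used at query time
        metadata={
            "question": question,
            "answer": answer,
            "language": language,   # used later as a hard filter on lookups
        },
    )
```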
## The Problem
Since embeddings are semantic, “best restaurants” (English) and “migliori ristoranti” (Italian) have very similar vectors. Without proper filtering, an Italian user asking “ristoranti” might get the cached English response.
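To illustrate, a quick cross-lingual similarity check (a sketch assuming a multilingual sentence-transformers model; your embedding model may differ, but any multilingual model maps translations close together):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

en = model.encode("what are the best restaurants?")
it = model.encode("quali sono i migliori ristoranti?")

# Cosine similarity of the two translations; typically high enough
# to count as a cache hit if language isn't filtered.
print(util.cos_sim(en, it))
```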
My current approach: filter the vector search by language metadata:
```python
results = vector_db.query(
    embedding=embed(user_message),
    filter={"language": user_language},
    top_k=1
)
```
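In context, the full lookup is roughly this (a sketch; `SIMILARITY_THRESHOLD`, `call_llm`, and the result field names are placeholders for my actual setup, and `cache_answer` is the write helper sketched above):

```python
SIMILARITY_THRESHOLD = 0.85  # placeholder value, tuned empirically

def answer(user_message: str, user_language: str) -> str:
    """Return a cached answer if a close match exists in the user's language, else call the LLM."""
    results = vector_db.query(
        embedding=embed(user_message),
        filter={"language": user_language},
        top_k=1,
    )
    if results and results[0].score >= SIMILARITY_THRESHOLD:
        return results[0].metadata["answer"]

    # Cache miss: call the LLM and store the new answer for future lookups.
    response = call_llm(user_message, language=user_language)
    cache_answer(user_message, response, user_language)
    return response
```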
This works IF I can reliably detect the user’s language. But:
- Messages are often very short ("museums", "metro", "parking")
- Language detection libraries (langdetect, fastText) are unreliable below ~20 characters (see the sketch after this list)
- The chatbot is stateless (no conversation history, for caching efficiency)
- The platform is WhatsApp (no browser headers available)
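For example, here is the kind of behaviour I see on short inputs (a sketch using langdetect; exact outputs vary, since the detector is probabilistic on short strings):

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

for msg in ["museums", "metro", "parking", "ristoranti"]:
    try:
        print(msg, "->", detect(msg))
    except Exception as exc:  # very short strings can raise a detection error
        print(msg, "->", exc)
```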
What’s the recommended semantic caching strategy for multilingual chatbots when user language cannot be reliably detected from short messages?