Variability in RAG Performance Across Different Environments

We are implementing a Retrieval-Augmented Generation (RAG) architecture using LangChain, Qdrant, and OpenAI’s GPT-3.5-turbo 16k. However, we’ve observed that the performance of our RAG system varies significantly across our environments: Predev, Dev, QA, and PROD.

Has anyone encountered similar issues, and are there any recommendations for addressing and overcoming these performance inconsistencies?

Welcome to the community!

No, not really - but if I did, I’d assume it’s probably an SDLC (software development lifecycle) design issue. If you test the same things in every environment, how can the performance go up or down? If you don’t test the same things, what are you really testing, and why?

What do you think could be happening?

RAG should be more or less deterministic. Prompt results should be convergent. You should be able to design your systems (even involving LLMs) to be predictable and testable to a high degree.
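
To make that concrete, here is a rough sketch of how you might check whether two environments are really running "the same thing". It is only an illustration under assumptions: the setting names (`chunk_size`, `top_k`, `collection_point_count`, and so on), their values, and the helper functions are all hypothetical and not taken from the original setup, and it uses only the Python standard library rather than any LangChain or Qdrant API. The idea is to fingerprint every setting that can influence the output, diff the settings between environments, and measure how much the retrieved document IDs overlap for the same query.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Short, stable hash of every setting that can change RAG output."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every setting that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def retrieval_overlap(ids_a: list, ids_b: list) -> float:
    """Jaccard overlap of the document IDs retrieved for the same query."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical settings for two environments (illustrative values only).
dev = {
    "chat_model": "gpt-3.5-turbo-16k",
    "temperature": 0.0,
    "embedding_model": "example-embedding-model",
    "chunk_size": 1000,
    "chunk_overlap": 100,
    "top_k": 4,
    "collection_point_count": 12000,
}
prod = dict(dev, temperature=0.7, top_k=8, collection_point_count=48000)

# Different fingerprints mean the environments are not testing the same thing.
print(fingerprint(dev), fingerprint(prod))

# Lists exactly which settings drifted between the two environments.
print(diff_configs(dev, prod))

# Compare the IDs each environment retrieved for one fixed test query.
print(retrieval_overlap(["a", "b", "c", "d"], ["b", "c", "e", "f"]))
```

If the fingerprints match and the overlap on a fixed set of test queries is still low, the difference is more likely in the indexed data itself than in the configuration.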