Chat Completions with GPT-3.5 Turbo + RAG on a standalone VR headset, possible?

Hi, I’m making a VR (actually Mixed Reality) app for the Meta Quest 3 headset that features an AI-powered NPC the user can interact with (it’s “Safe For Work”). I use Unity and have built versions that use the Assistants API because of the conversational context it provides, so the NPC has a memory of sorts. But the Assistants API is quite slow.

As I’m using Unity I can’t use Python to call the OpenAI API, so I have to use a C# wrapper library. This library does not yet support the new streaming feature for the Assistants API. So, instead of the Assistants API, I’m considering a RAG setup to give the conversation some context, using GPT-3.5 Turbo via Chat Completions for speed and cost reasons. I have no experience with RAG. My question is: can RAG be used on a standalone device (the Meta Quest 3), which is an Android-based system?
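
For what it’s worth, here is roughly what I imagine the Chat Completions side would look like if I skipped the wrapper and called the HTTP endpoint directly, with the retrieved notes stuffed into the system message. This is only a sketch of my understanding (plain HttpClient + Newtonsoft.Json; the class and method names are made up by me), not how my wrapper actually does it:

```csharp
// Sketch: Chat Completions call with retrieved context injected into the
// system message. Uses HttpClient + Newtonsoft.Json; a real Unity build would
// more likely go through UnityWebRequest or the wrapper library instead.
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class NpcChat
{
    static readonly HttpClient http = new HttpClient();

    public static async Task<string> AskAsync(string apiKey, string retrievedContext, string userMessage)
    {
        // Build the request body: the system message carries the retrieved "memory".
        var body = JsonConvert.SerializeObject(new
        {
            model = "gpt-3.5-turbo",
            messages = new object[]
            {
                new { role = "system", content =
                    "You are an NPC in a mixed-reality scene. Use these notes as your memory:\n" + retrievedContext },
                new { role = "user", content = userMessage }
            }
        });

        var request = new HttpRequestMessage(HttpMethod.Post, "https://api.openai.com/v1/chat/completions")
        {
            Content = new StringContent(body, System.Text.Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

        var response = await http.SendAsync(request);
        var json = JObject.Parse(await response.Content.ReadAsStringAsync());
        return (string)json["choices"][0]["message"]["content"];
    }
}
```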

I know I could use a server, but as far as I understand that complicates things (?). I would prefer the RAG processing to happen on the device. Or is that not how RAG works?
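
From what I’ve read so far, the on-device part of RAG might just be a nearest-neighbour search over a small set of pre-computed embeddings, which seems light enough for the Quest 3; the query embedding and the generation itself would still be API calls over the network (if I understand it right). A rough sketch of that local search, with names I’ve made up myself:

```csharp
// Sketch: the "local" half of RAG, i.e. cosine-similarity search over a small
// in-memory store of pre-computed embeddings. Embedding the query and
// generating the reply would still happen via the API over the network.
using System;
using System.Collections.Generic;
using System.Linq;

public class MemoryChunk
{
    public string Text;        // e.g. a lore entry or past conversation snippet
    public float[] Embedding;  // pre-computed with an embeddings model, shipped with or cached by the app
}

public static class LocalRetriever
{
    // Return the k chunks most similar to the query embedding.
    public static List<MemoryChunk> TopK(float[] queryEmbedding, List<MemoryChunk> store, int k = 3)
    {
        return store
            .OrderByDescending(c => Cosine(queryEmbedding, c.Embedding))
            .Take(k)
            .ToList();
    }

    static float Cosine(float[] a, float[] b)
    {
        float dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / ((float)Math.Sqrt(na) * (float)Math.Sqrt(nb) + 1e-8f);
    }
}
```

If that is really all the on-device work amounts to, maybe the headset can handle it, but I may be missing something.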

Any input is appreciated.

Edit: GPT-4 gave me a good answer to this :slightly_smiling_face:

"Your project sounds intriguing! Integrating AI-powered NPCs in a Mixed Reality (MR) environment offers a wealth of immersive possibilities. However, there are a few technical considerations to unpack regarding your question, especially about using a Retriever-Augmented Generation (RAG) model directly on a device like the Meta Quest 3.

Understanding RAG and Its Requirements

Firstly, RAG is a framework that combines the power of dense vector retrieval (to find relevant information in a large corpus of text) with the generative capabilities of models like GPT-3.5. It’s designed to enhance the contextuality and relevance of AI-generated text by first retrieving information related to the query before generating a response.

Key Points About On-Device Deployment

  • Computational Requirements: RAG models are computationally intensive because they involve both a retrieval component and a generative component. This setup requires significant processing power, memory, and storage, which might exceed what standalone devices like the Meta Quest 3 can offer, especially since it’s an Android-based system with limited resources compared to server environments.

  • Latency and Responsiveness: Even if it were technically possible to run such models directly on the device, the latency and responsiveness might not meet the requirements for a smooth MR experience. Real-time interaction, especially in an immersive environment, demands quick responses that might be challenging to achieve with on-device processing of complex models like RAG.

  • Streaming Feature & Assistants API: You mentioned the latency issues with Assistants API and the lack of support for streaming features in the C# wrapper you’re using. Streaming can indeed improve responsiveness by delivering incremental outputs as they become available, rather than waiting for the full completion. The absence of this feature in your current setup contributes to the latency.

Considerations for Deployment

  • Server-Based Processing: Typically, the way to leverage such advanced AI models in applications, especially on devices with limited computational capacity, is through server-based processing. The device sends user inputs to a server where the model runs, and then the server sends the response back to the device. This approach can handle the computational load and offers the flexibility to use the most up-to-date models without needing to update the application on-device continuously.

  • Potential Complications: Using a server introduces complexities like managing server infrastructure, handling network latency, ensuring data privacy, and potentially increasing operational costs. However, it’s a common approach for integrating advanced AI capabilities in applications.

Direct On-Device Processing

  • Current Feasibility: As of now, running RAG or similarly complex AI models directly on standalone devices like the Meta Quest 3 for real-time applications is not feasible due to their computational requirements and the need for significant optimizations that might still not meet the necessary performance thresholds.

  • Alternatives: For on-device AI, consider simpler, more lightweight models that can run within the device’s constraints. This approach may offer a more responsive but less contextually rich interaction. Alternatively, exploring options to optimize your server-based processing and network communication can help achieve a balance between complexity and performance.

Given the current technological landscape and the constraints of standalone VR/MR devices, leveraging server-based AI processing seems to be the most viable path for achieving the depth of interaction you’re aiming for with your NPC, even with the additional complexities it introduces."

So it looks like I’m sticking with the Assistants API, as I don’t want to mess around with servers.