Consistent `Connection error` when using LlamaIndex w/RAG

I’m hoping to get some help with my problem.
I’ve managed to build a chat engine using RAG with a simple directory reader & a PG vector store.
When asking questions in a back-and-forth way (chat-engine style), there’s a very strange but consistent behavior:
when I send a first message, I get an answer from OpenAI, but when I send a second message, I run into connection errors:

INFO:     Loading index from storage...
INFO:httpx:HTTP Request: POST "HTTP/1.1 200 OK"
INFO:     Finished loading index from storage
INFO:llama_index.core.chat_engine.condense_plus_context:Condensed question: <condensed_question>
INFO:httpx:HTTP Request: POST "HTTP/1.1 200 OK"
/.venv/lib/python3.11/site-packages/vecs/ UserWarning: Query does not have a covering index for IndexMeasure.cosine_distance. See Collection.create_index
INFO: - "POST /api/chat/ HTTP/1.1" 200 OK
INFO:httpx:HTTP Request: POST "HTTP/1.1 200 OK"
INFO: - "POST /api/chat HTTP/1.1" 307 Temporary Redirect
INFO:     Loading index from storage...
INFO:httpx:HTTP Request: POST "HTTP/1.1 200 OK"
INFO:     Finished loading index from storage
INFO:openai._base_client:Retrying request to /chat/completions in 0.928694 seconds
INFO:openai._base_client:Retrying request to /chat/completions in 1.522838 seconds
INFO:openai._base_client:Retrying request to /chat/completions in 3.389680 seconds
ERROR:root:Error in chat generation: Connection error.
INFO: - "POST /api/chat/ HTTP/1.1" 500 Internal Server Error

I set up my chat engine the following way:

import os

from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI


def get_chat_engine():
    model = os.getenv("MODEL")
    llm = OpenAI(model, temperature=0)
    memory = ChatMemoryBuffer.from_defaults(token_limit=10000)

    return get_index().as_chat_engine(
        chat_mode="condense_plus_context", llm=llm, memory=memory
    )
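As I understand it, `ChatMemoryBuffer.from_defaults(token_limit=10000)` just keeps the most recent messages that fit under the token limit. A rough stdlib approximation of that trimming behavior (my own sketch with a crude whitespace token count, not LlamaIndex's actual implementation):

```python
def trim_to_token_limit(messages, token_limit):
    # Keep the most recent messages whose combined (approximate)
    # token counts fit under token_limit. Crude sketch: real token
    # counting would use the model's tokenizer, not str.split().
    kept = []
    total = 0
    for msg in reversed(messages):
        tokens = len(msg.split())  # stand-in for a real token count
        if total + tokens > token_limit:
            break
        kept.append(msg)
        total += tokens
    return list(reversed(kept))

# e.g. trim_to_token_limit(["a b", "c d e", "f"], 4) -> ["c d e", "f"]
```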

With get_index() defined the following way:

import logging
import os

from llama_index.core import StorageContext, VectorStoreIndex


def get_index():
    # check if storage already exists
    if not os.path.exists(STORAGE_DIR):
        raise Exception(
            "StorageContext is empty - call 'python app/engine/' to generate the storage first"
        )
    logger = logging.getLogger("uvicorn")
    # load the existing index
    vector_store = get_vector_store()
    logger.info(f"Loading index from {STORAGE_DIR}...")
    storage_context = StorageContext.from_defaults(
        persist_dir=STORAGE_DIR, vector_store=vector_store
    )
    index = VectorStoreIndex.from_documents(
        documents=get_documents(), storage_context=storage_context
    )
    logger.info(f"Finished loading index from {STORAGE_DIR}")
    return index
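Since `get_index()` runs on every request, I've been wondering whether I should load the index once and cache it, which might also explain the extra embedding requests I see on each message. A minimal sketch of the caching pattern, with a hypothetical stub loader standing in for the real `StorageContext`/`VectorStoreIndex` setup:

```python
import functools

load_count = 0  # just to show the loader only runs once


@functools.lru_cache(maxsize=1)
def get_index():
    # Hypothetical stub standing in for the real index loading;
    # only the memoization pattern matters here.
    global load_count
    load_count += 1
    return object()


first = get_index()
second = get_index()  # cached: same object, loader not re-run
```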

And I’m calling the OpenAI API in streaming mode:

from typing import List

from llama_index.core.chat_engine.types import BaseChatEngine
from tenacity import retry, stop_after_attempt, wait_fixed


@retry(stop=stop_after_attempt(5), wait=wait_fixed(3))
async def call_openai_api(
    chat_engine: BaseChatEngine, message: _Message, messages: List[_Message]
):
    try:
        response = await chat_engine.astream_chat(message, messages)
        return response
    except Exception as e:
        print(f"Error in API call: {e}")
        raise  # re-raise so @retry actually retries
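One thing I noticed while debugging: if the `except` block doesn't re-raise, the `@retry` decorator (which looks like tenacity's) never fires, because the wrapped function returns normally. A minimal stand-in decorator (my own sketch, not tenacity itself) demonstrating the difference:

```python
import functools


def retry(attempts=3):
    # Minimal stand-in for a tenacity-style @retry: re-runs the
    # function only when it raises; a normal return ends the loop.
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
            raise last_exc
        return wrapper
    return deco


calls = {"swallowed": 0, "reraised": 0}


@retry(attempts=3)
def swallowed():
    calls["swallowed"] += 1
    try:
        raise ConnectionError("boom")
    except Exception:
        return None  # swallowed: looks like success, so no retry


@retry(attempts=3)
def reraised():
    calls["reraised"] += 1
    try:
        raise ConnectionError("boom")
    except Exception:
        raise  # re-raised: the decorator runs the function again


swallowed()           # runs once, never retried
try:
    reraised()        # runs all 3 attempts, then raises
except ConnectionError:
    pass
```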


It’s been very consistent and systematic, and I don’t understand why it happens. A short-term workaround is to reboot the server, but that’s definitely not sustainable…
Would anyone know why?


Adding request IDs:

— First communication —

  • Successful embedding requests:
    • req_f5453dc74ec0972731cd922c6548a00d
    • req_b600dce8f70d7e5e0919cfab235bbb9b
  • Successful completion request:
    • req_4f7441a7793c496a4dd9bdfb8b62a9fe

— Second communication —

  • Successful embedding request:
    • req_ae6ff57945d69de09caf0a2d1a05d062
  • Failed completion request:
    • no request id