Consistent `Connection error` when using LlamaIndex w/RAG

Hi!
I’m hoping someone can help me with a problem.
I’ve built a chat engine using RAG with a simple directory reader and a PG vector store.
When I ask questions back and forth (chat-engine style), there’s a strange but very consistent behavior: the first message gets an answer from OpenAI, but the second message always fails with a `Connection error`:

INFO:     Loading index from storage...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:     Finished loading index from storage
INFO:llama_index.core.chat_engine.condense_plus_context:Condensed question: <condensed_question>
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
/.venv/lib/python3.11/site-packages/vecs/collection.py:502: UserWarning: Query does not have a covering index for IndexMeasure.cosine_distance. See Collection.create_index
  warnings.warn(
INFO:     127.0.0.1:59430 - "POST /api/chat/ HTTP/1.1" 200 OK
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:     127.0.0.1:59442 - "POST /api/chat HTTP/1.1" 307 Temporary Redirect
INFO:     Loading index from storage...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:     Finished loading index from storage
INFO:openai._base_client:Retrying request to /chat/completions in 0.928694 seconds
INFO:openai._base_client:Retrying request to /chat/completions in 1.522838 seconds
INFO:openai._base_client:Retrying request to /chat/completions in 3.389680 seconds
ERROR:root:Error in chat generation: Connection error.
INFO:     127.0.0.1:59442 - "POST /api/chat/ HTTP/1.1" 500 Internal Server Error

I set up my chat engine the following way:

import os

from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI


def get_chat_engine():
    model = os.getenv("MODEL")
    llm = OpenAI(model=model, temperature=0)
    memory = ChatMemoryBuffer.from_defaults(token_limit=10000)

    return get_index().as_chat_engine(
        similarity_top_k=3,
        memory=memory,
        chat_mode="condense_plus_context",
        llm=llm,
        verbose=False,
    )
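
For what it’s worth, the LLM above is left on the client defaults for timeout and retries. In case it helps with the diagnosis, making them explicit would look like this (just a sketch; timeout and max_retries are kwargs on llama-index's OpenAI wrapper that get passed through to the OpenAI client):

# Sketch only: same LLM as above, but with the client timeout and retry
# behaviour spelled out instead of relying on the defaults.
llm = OpenAI(
    model=os.getenv("MODEL"),
    temperature=0,
    timeout=60.0,     # seconds before the underlying HTTP request is abandoned
    max_retries=3,    # matches the three retry lines visible in the log above
)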

With get_index() defined as follows:

import logging
import os

from llama_index.core import StorageContext, VectorStoreIndex

# STORAGE_DIR, get_vector_store() and get_documents() are defined elsewhere in the app.


def get_index():
    # check if storage already exists
    if not os.path.exists(STORAGE_DIR):
        raise Exception(
            "StorageContext is empty - call 'python app/engine/generate.py' to generate the storage first"
        )
    logger = logging.getLogger("uvicorn")
    # load the existing index
    vector_store = get_vector_store()
    logger.info(f"Loading index from {STORAGE_DIR}...")
    storage_context = StorageContext.from_defaults(
        persist_dir=STORAGE_DIR, vector_store=vector_store
    )
    index = VectorStoreIndex.from_documents(
        documents=get_documents(), storage_context=storage_context
    )
    logger.info(f"Finished loading index from {STORAGE_DIR}")
    return index
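
One thing I noticed while writing this up: because get_index() goes through VectorStoreIndex.from_documents(), the documents are re-processed on every call, which I assume is the embeddings request that shows up between the two "Loading index" log lines on every request. For comparison, attaching directly to the already-populated vector store would look roughly like this (a sketch only; I’m still running the from_documents version above, and it assumes generate.py already ingested everything):

def get_index_from_existing_store():
    # Sketch: reuse the vector store that generate.py populated,
    # without re-reading and re-embedding the documents on each call.
    logger = logging.getLogger("uvicorn")
    vector_store = get_vector_store()
    logger.info("Loading index from the existing vector store...")
    index = VectorStoreIndex.from_vector_store(vector_store)
    logger.info("Finished loading index from the existing vector store")
    return index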

And I’m calling the OpenAI API in streaming mode:

from typing import List

from llama_index.core.chat_engine.types import BaseChatEngine
from tenacity import retry, stop_after_attempt, wait_fixed

# _Message is the request model defined elsewhere in the app.


@retry(stop=stop_after_attempt(5), wait=wait_fixed(3))
async def call_openai_api(
    chat_engine: BaseChatEngine, message: _Message, messages: List[_Message]
):
    try:
        response = await chat_engine.astream_chat(message, messages)
        return response
    except Exception as e:
        print(f"Error in API call: {e}")
        raise
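
The object returned by astream_chat is a streaming response, and it gets consumed through its async token generator, roughly along these lines (a simplified sketch of the consuming side; names here are illustrative, not the exact route code):

# Simplified sketch (FastAPI plumbing omitted): consume the streaming
# response by iterating over its async token generator.
async def stream_answer(chat_engine, message, messages):
    response = await call_openai_api(chat_engine, message, messages)
    async for token in response.async_response_gen():
        yield token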

The MODEL environment variable is set to:

MODEL=gpt-3.5-turbo-0125

It’s been very consistent and systematic, and I don’t understand why it happens. Restarting the server works as a short-term fix, but it’s definitely not sustainable…
Does anyone know why this happens?

EDIT:

Adding request IDs:

— First message —

  • Successful embedding requests:
    • req_f5453dc74ec0972731cd922c6548a00d
    • req_b600dce8f70d7e5e0919cfab235bbb9b
  • Successful completion request:
    • req_4f7441a7793c496a4dd9bdfb8b62a9fe

— Second message —

  • Successful embedding request:
    • req_ae6ff57945d69de09caf0a2d1a05d062
  • Failed completion request:
    • no request ID