Retrieval from Azure Index (as Vector Database)

Hi All,

I have been working on an Azure + OpenAI integration and on how to use an Azure Index as a large-scale vector database.

My use case:

  • Take PDF documents as input (doc size >= 25 pages)
  • Divide each document into suitable chunks (using LangChain)
  • Embed each chunk (using an OpenAI embedding model)
  • Upload each chunk and its embedding to the Azure AI Index
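
To make the chunking step concrete, here is a minimal sketch of a character-based splitter with overlap. It only stands in for the LangChain splitter I actually used; `chunk_text`, `chunk_size` and `overlap` are hypothetical names, and the default sizes are placeholders, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Assumption: a plain character window is good enough for illustration;
    a real splitter would usually prefer sentence or paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, which may matter for retrieval quality.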

The structure of the Azure Index is:

{
  id: "----",                         (string)
  title: "title-of-the-document",     (string)
  chunk: "content-of-chunk",          (string)
  chunk_embedding: "chunk-embedding"  (SingleCollection)
}
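
As a rough sketch, the records that go into this structure could be assembled like this. `build_records` and the `embed` callable are hypothetical: `embed` stands in for whatever call returns the OpenAI embedding of a chunk. The id scheme (a hash of the title plus the chunk position) is only an assumption that keeps ids unique and free of spaces or special characters.

```python
import hashlib
from typing import Callable, Sequence


def build_records(
    title: str,
    chunks: Sequence[str],
    embed: Callable[[str], list[float]],
) -> list[dict]:
    """Build one record per chunk with the fields id, title, chunk, chunk_embedding."""
    # Hypothetical id scheme: short hex digest of the title + chunk position.
    prefix = hashlib.sha1(title.encode("utf-8")).hexdigest()[:12]
    records = []
    for position, chunk in enumerate(chunks):
        records.append(
            {
                "id": f"{prefix}-{position}",
                "title": title,
                "chunk": chunk,
                "chunk_embedding": embed(chunk),  # list of floats
            }
        )
    return records
```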

Each chunk of every document is now stored in the index together with its embedding.

In the retrieval step:

  • I filter the chunks by document title.
  • To fetch the top-k relevant chunks for the user query, I am using search approaches such as Vector Search and Hybrid Search, but they do not return a good set of top-k chunks that I can use as context knowledge.
  • I have also tried a NumPy cosine similarity search, but the results are not satisfactory either.
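
For reference, the NumPy variant of the steps above looks roughly like the sketch below: filter by title first, then rank the remaining chunks by cosine similarity to the query embedding. `top_k_chunks` and its parameters are hypothetical names, and it assumes all embeddings are already loaded in memory as rows of one array.

```python
import numpy as np


def top_k_chunks(query_vec, chunk_vecs, titles, title, k=3):
    """Return (chunk_index, cosine_score) pairs for the k best chunks of one document."""
    chunk_vecs = np.asarray(chunk_vecs, dtype=np.float64)
    query_vec = np.asarray(query_vec, dtype=np.float64)

    # Step 1: keep only the chunks that belong to the requested document.
    candidates = np.flatnonzero(np.asarray(titles) == title)
    if candidates.size == 0:
        return []
    subset = chunk_vecs[candidates]

    # Step 2: cosine similarity = dot product divided by the product of norms.
    norms = np.linalg.norm(subset, axis=1) * np.linalg.norm(query_vec)
    scores = subset @ query_vec / np.maximum(norms, 1e-12)  # guard against zero vectors

    # Step 3: sort by descending score and keep the first k.
    order = np.argsort(-scores)[:k]
    return [(int(candidates[i]), float(scores[i])) for i in order]
```

The returned indices refer to rows of the full `chunk_vecs` array, so they can be used directly to look up the chunk text.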

Also, because of the document size, I cannot upload a whole document with a single embedding directly, since it exceeds the context limit.

Does anyone have a suggestion or an approach by which I can:

  • Use the content (chunks, embeddings) and the user query (content, embedding) to retrieve a proper set of top-k chunks that can then be used as context knowledge.

Thank you