Hi All,
I have been working on an Azure + OpenAI integration, specifically on using an Azure AI Search index as a large-scale vector database.
My use case:
- Take PDF documents as input (doc size >= 25 pages)
- Divide each document into suitable chunks (using LangChain)
- Embed each chunk (using an OpenAI embedding model)
- Upload each chunk and its embedding to the Azure AI Search index
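For context on the chunking step: it is essentially a sliding character window with overlap, which is roughly what LangChain's character splitters do (minus separator awareness). A minimal pure-Python sketch, where the chunk size and overlap values are assumptions, not my production settings:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows.

    Approximates what a LangChain character splitter produces;
    chunk_size/overlap here are illustrative defaults.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
    return chunks

doc = "x" * 2500
pieces = chunk_text(doc, chunk_size=1000, overlap=200)
# Windows start at 0, 800, 1600, 2400 -> 4 chunks
print(len(pieces))  # → 4
```

Each chunk is then embedded and uploaded as one document in the index.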
The structure of the Azure Index is:
{
  id: "----", (string)
  title: "title-of-the-document", (string)
  chunk: "content-of-chunk", (string)
  chunk_embedding: "chunk-embedding" (Collection(Edm.Single))
}
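For anyone reproducing this, the corresponding Azure AI Search index definition looks roughly like the following. This is a sketch: the index name, algorithm/profile names, and the 1536 dimensions (for text-embedding-ada-002) are assumptions, and exact property names vary by API version:

```json
{
  "name": "doc-chunks-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "title", "type": "Edm.String", "searchable": true, "filterable": true },
    { "name": "chunk", "type": "Edm.String", "searchable": true },
    {
      "name": "chunk_embedding",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "dimensions": 1536,
      "vectorSearchProfile": "default-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [ { "name": "hnsw-algo", "kind": "hnsw" } ],
    "profiles": [ { "name": "default-profile", "algorithm": "hnsw-algo" } ]
  }
}
```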
I have now stored every chunk of every document, along with its embedding, in the index.
In the retrieval step:
- I filter the chunks by document title.
- To fetch the top-k relevant chunks for the user query, I have tried search approaches such as Vector Search and Hybrid Search, but neither returns a proper top-k set of chunks that I can use as context knowledge.
- I have also tried a NumPy cosine similarity search, but the results are also unsatisfactory.
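To make the NumPy baseline concrete, this is the cosine top-k retrieval I am aiming for, as a minimal sketch with toy 2-D vectors standing in for real OpenAI embeddings. Normalizing both sides first makes the dot product equal the cosine similarity, which rules out magnitude-related ranking bugs:

```python
import numpy as np

def top_k_chunks(query_emb, chunk_embs, k=3):
    """Rank chunk embeddings by cosine similarity to the query embedding."""
    q = np.asarray(query_emb, dtype=float)
    m = np.asarray(chunk_embs, dtype=float)
    q = q / np.linalg.norm(q)                          # unit-length query
    m = m / np.linalg.norm(m, axis=1, keepdims=True)   # unit-length chunks
    sims = m @ q                   # cosine similarity per chunk
    order = np.argsort(-sims)[:k]  # indices of the k most similar chunks
    return order, sims[order]

# Toy example: chunk 1 points in the same direction as the query.
chunks = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
idx, scores = top_k_chunks([3.0, 4.0], chunks, k=2)
print(idx)  # → [1 2]
```

If this ranking looks right on small examples but the retrieved chunks are still poor, the problem is more likely in the chunking granularity or the query formulation than in the similarity math.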
Also, the size of the documents is why I cannot embed and upload a document whole: it exceeds the embedding model's context limit.
Does anyone have a suggestion or approach by which I can:
- Use the content (chunks and their embeddings) and the user query (text and its embedding) to retrieve a proper top-k set of chunks that I can then use as context knowledge?
Thank you