I’m using the OpenAI API to create a vector store for the new Assistant. While I can easily add files to the vector store, I haven’t found any information in the API documentation on how to control the chunk size used to vectorize a document when it is added.
I’m interested in exploring different chunking strategies. By controlling the chunk size, I hope to optimize the Vectorstore’s performance and retrieval accuracy.
If you have any guidance on how to manage chunk sizes when creating a vector store with the OpenAI API, I would greatly appreciate your help.
@vb I was able to see the file’s chunking strategy before the file was attached to the Vector Store. Do you know if changing the chunking strategy changes the vector store’s representation of the file? Do they keep it around in case the file needs to be disconnected from the vector store?
import os

from openai import OpenAI
from vector_db import get_files_uploaded_to_vector_store, get_vector_store_id

name = 'v_1'
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Look up the vector store by name, then list the files attached to it
vs_id = get_vector_store_id(client, name)
vs_files = get_files_uploaded_to_vector_store(client, vs_id)
print(vs_files)
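For reference, the two helpers imported from vector_db are just thin wrappers over the SDK’s vector store list endpoints. A minimal sketch of what they might look like (assuming client.vector_stores in the current Python SDK; on older versions these calls live under client.beta.vector_stores):

def get_vector_store_id(client, name):
    # Return the id of the first vector store whose name matches
    for vs in client.vector_stores.list():
        if vs.name == name:
            return vs.id
    raise ValueError(f"No vector store named {name!r}")

def get_files_uploaded_to_vector_store(client, vector_store_id):
    # Return the vector store file objects attached to this store
    return list(client.vector_stores.files.list(vector_store_id=vector_store_id))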
You would need to remove the file from the vector store and re-add it with new chunk size parameters. File extraction, chunking, and embedding are then performed again. You can then check the vector store file metadata to observe the change, using the file ID that is returned.
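As a rough sketch of that flow (assuming the static chunking_strategy option and the client.vector_stores namespace in the current Python SDK; older versions use client.beta.vector_stores), with placeholder IDs:

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

vs_id = "vs_..."      # your vector store id
file_id = "file-..."  # the previously uploaded file id

# Detach the file from the vector store; the underlying File object is kept
client.vector_stores.files.delete(file_id=file_id, vector_store_id=vs_id)

# Re-attach it with an explicit static chunking strategy
vs_file = client.vector_stores.files.create(
    vector_store_id=vs_id,
    file_id=file_id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 400,  # example value
            "chunk_overlap_tokens": 200,   # must be at most half the chunk size
        },
    },
)

# Inspect the vector store file metadata to confirm the new strategy
print(vs_file.chunking_strategy)

Note that the chunking strategy only affects how the file is indexed inside the vector store; the original uploaded file itself is unchanged.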