Question about data security and size of vectorised knowledge base

I want to understand the following:

  1. Does OpenAI retain the information it receives from these queries and documents? If the documents used to create the vectorized knowledge base are highly sensitive and private, is it advisable to use OpenAI and its APIs to help process that data?

  2. How large can this vectorized database be? Are there any limits? Will the LLM be able to search over a large pool of embedded data and still return correct information?

  1. OpenAI states in its documentation these days that it doesn’t save or use API data to train its models. However, if you want to be even more secure, it’s worth running an anonymiser over the input you send to GPT (ours is JSON, so we anonymise the values for the keys; see the sketch after this list).

  2. I don’t think there are hard limits per se, but there is a chance of some contextual loss when the vector database gets very large, since only the retrieved chunks ever reach the model (see the retrieval sketch below).
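
For point 1, here is a minimal sketch of what that anonymisation step can look like: mask sensitive JSON values with placeholders before the text goes to the model, then swap the real values back into the reply. The key names, placeholder format, and the stand-in "reply" are my own assumptions for illustration, not the original poster's code.

```python
import json

# Keys whose values we consider sensitive (assumption for this example).
SENSITIVE_KEYS = {"name", "email", "account_id"}

def anonymise(record: dict) -> tuple[dict, dict]:
    """Replace sensitive values with placeholders; return masked record and mapping."""
    mapping, masked = {}, {}
    for key, value in record.items():
        if key in SENSITIVE_KEYS:
            placeholder = f"<{key.upper()}_{len(mapping)}>"
            mapping[placeholder] = value
            masked[key] = placeholder
        else:
            masked[key] = value
    return masked, mapping

def deanonymise(text: str, mapping: dict) -> str:
    """Swap placeholders in the model's output back to the original values."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, str(value))
    return text

record = {"name": "Jane Doe", "email": "jane@example.com", "question": "What is my balance?"}
masked, mapping = anonymise(record)
prompt = json.dumps(masked)  # this masked JSON is what would be sent to the API
# ... send `prompt` to the model and receive `reply` ...
reply = f"Hello {masked['name']}, your balance is available."  # stand-in for a model reply
print(deanonymise(reply, mapping))
```

The model never sees the real values, and the mapping stays on your side, so de-anonymisation happens entirely locally.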
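
For point 2, the "contextual loss" concern comes from the fact that only the top-k most similar chunks are handed to the LLM, so in a very large store the relevant passage may not make the cut. This toy sketch (random embeddings, cosine similarity, an assumed k of 5; not any specific vector database's API) shows how few chunks actually reach the model out of the whole pool.

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k chunks most cosine-similar to the query embedding."""
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return list(np.argsort(scores)[::-1][:k])

# Toy data: 10,000 random "chunk" embeddings of dimension 128.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(10_000, 128))
query = rng.normal(size=128)

top = retrieve_top_k(query, chunks, k=5)
print("chunks sent to the LLM:", top)  # only these 5 of 10,000 ever reach the model
```

So the practical limit is less about how many vectors you can store and more about whether the retrieval step surfaces the right chunks within the model's context window.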