Hi and welcome to the Developer Forum!
Data sent to the API is not used for training, so it will not make its way into any training data. It is only retained for 30 days for legal compliance and then deleted.
There is a method called RAG (Retrieval-Augmented Generation) which takes advantage of a data storage technique called a vector database. These vector stores hold semantic vectors of text and can be searched by similarity: not on the words themselves, but on the underlying meaning those words carry. That makes them ideal for pulling only the relevant information out of a large corpus.
You can then pass that information as context to the AI to answer queries.
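Here is a rough sketch of that retrieve-then-answer step in Python, just to make the flow concrete. It assumes your chunks have already been embedded (an example of that step is after the docs link below) and uses a simple in-memory cosine similarity search plus an assumed model name, so treat it as illustrative rather than a drop-in solution:

```python
# Illustrative RAG sketch (openai>=1.0 Python library) — the model name,
# prompt wording and in-memory search are assumptions, not recommendations.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def retrieve(query_vector, chunk_vectors, chunks, k=3):
    # Cosine similarity: compares meaning, not exact wording.
    sims = chunk_vectors @ query_vector / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]

def answer(question, context_chunks):
    # Pass only the retrieved chunks as context, keeping the prompt small.
    context = "\n\n".join(context_chunks)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model, swap in whichever you use
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```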
The method for converting the text into those vectors is called embedding; details can be found here:
https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
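For completeness, the embedding step itself looks roughly like this. The model name is just one of the available embedding models, and the example strings are placeholders:

```python
# Rough example of the embedding step (openai>=1.0 Python library).
from openai import OpenAI

client = OpenAI()

chunks = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday to Friday, 9am to 5pm.",
]

resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
chunk_vectors = [d.embedding for d in resp.data]  # one vector per chunk
# Wrap in np.array(chunk_vectors) to use with the similarity sketch above.
print(len(chunk_vectors[0]))  # vector dimensionality, e.g. 1536 for this model
```

In practice you would store these vectors in a vector database rather than in memory, but the principle is the same.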
The amount of data you have may be too large for the current Assistants offering, which is a pre-built system that bundles all of the above elements into an easy-to-use API, so some experimentation may be required. Standard vector database methods have effectively unlimited storage, so there are options that will cover your use case.
Let me know if you need any more information.