Are files uploaded to Assistants API secure?

I am uploading a file while creating ChatGPT Assistant and use it to answer my ChatGPT prompts. I have the below questions,

  1. Is my file data secure or it is exposed in public ?
  2. Is the file only accessible by the Assistant that I create ?
  3. Does that file data’s state gets deleted as soon as the thread dies ?

Hi and welcome to the Developer Forum!

  1. Any data uploaded to an assistant should be considered public domain, while it is not directly accessible, any and all of the files contents can be programmable extracted via prompting.
  2. The file is accessible by anyone in your organisation
  3. Not sure what you mean by state, but the file itself will remain until deleted by you. This also incurs a storage charge on a daily basis of $0.20 per day per gigabyte pro-rata.
2 Likes

Thank you @Foxalabs
Companies require data stored in a private space, how do we overcome this challenge ? As I still need to use ChatGPT and provide file contents extracted via prompting to our customers

Securing data sources used by LLM’s is an active area of research, there are some experiments with decoupling network layers with embeddings, protecting prompts with other prompts and systems to detect key tokens from the prompt occurring in the output, but I have yet to see a practical working solution to this that is not defeated by fairly simple prompting.

3 Likes

Okay.

The file is accessible by anyone in your organisation → does this mean the file data can only be accessed by my ChatGPT API Key ?

Those with access to an application that made use of your API key, or those with direct access to the key could, via prompting or code, access at least parts of the file via the retrieval functions.

1 Like

The models spit out a lot of random data. Whisper is the best, you often get some hilarious stuff if your upload is silence.

I have yet to see a practical working solution to this that is not defeated by fairly simple prompting.

Or a behaviour that is just stumbled across. While testing some coding functionality yesterday I sent a one word message of ‘str?’ to gpt4 epecting a response about type conversion. The response was what looked like a database dump with lots of names and numbers, definitely not my data.

1 Like

Hallucinating realistic looking data is not the same thing as recalling information, the model can generate fantastic looking datasets that are actually useful as test sets by making use of it. While the model is also able to retrieve specific data when asked, it’s no different to asking a search engine, if that data was public domain, it’s reasonable to expect it to be recalled.

1 Like

Exactly. It looked like it interpreted ‘str?’ as a dbase request and spit out an appropriate response. I did not check if it was real data as I did not want to look at it or have it on my machine if it was.

This was with a fine-tuned gpt3.5. The fine-tuning was done in a strange way on purpose to see how it reacts to different strategies, and coax out behaviours, so unexpected results were expected.

So, is it fair to say that files uploaded to the Assistants API can theoretically be accessed by anyone who has the API key used to upload the file via the Assistant client?

No. @Foxalabs wrote:

You need to be aware of this quirk.

1 Like

Okay, is there any projection as to when there will be a privacy layer on the API to keep sensitive data protected?

Think of it like this:

You have a set of text books that you use to train engineers who visit clients, you tell the engineers not to talk to clients about the contents of the text books, but clients may offer money or bribes to engineers to get told bits of the inside info from the text books.

As it stands right now, that is the level of protection you can expect, perhaps some advanced embeddings based isolation system will be created, but there is no timeline on that. For now, do not put data in there you wish to keep secret, or isolate the user from the prompting, create a buffer where an AI is instructed to rephrase user queries and to avoid passing any requests for prompts or internally stored information.

1 Like

Quick clarification:

The issue stems from the fact that the files are not secure from users. However, if you configured the service in such a way that only authorized users could access the service, then the files could be considered secure?

We’re trying to implement something for our internal use. The sensitivity level of the documents included in the training data is basically - For internal use only. All people in the organisation are allowed to access those, but they should not be distributed outside the organisation.

I’ve been setting this up on Azure AI for data security reasons, but I was wondering if we couldn’t have used the Assistants API instead in the first place.

With assistants, the files are essentially available to anyone within the organization and anyone with access to “ask about files” in your app. They can be attached to assistants designed to dump out the contents. They can be dumped out by talking to the AI creatively. OpenAI uses proprietary methods for document retrieval, “browse myfiles”, obscurity being the worst security policy they could offer.

Then you have to trust OpenAI with your data. OpenAI trusts other entities with your data. You have to trust nation-state hacking adversaries not to win. OpenAI’s updated arbitration clauses with the release of assistants let you know they offer you nothing.

Using any AI product that is not your own is ultimately data-leaking, even embedding a semantic search database. You just have to decide if the data is actually trade secret and ruinous.

It’s even deeper than that. mnt/data is shared across instances.

I want to offer a little bit of clarity to this thread, as it seems to have created some misunderstanding about the privacy of files uploaded. Please pitch in if I am misstating anything here.

We need to differentiate between authorized and unauthorized access.

  1. Your file content is accessible by authorized means. since you are building a solution in which you want to use this file content this should not be a concern. Examples: Anyone who can use your app would see the content as controlled by you. Anyone (in your organization) with access to your Open AI account can view the files. That is authorized as well.
  2. Unauthorized access: Some random person checking your file content by using their own keys, their own assistant instances or by going to the location where the file is stored.- No this is not possible. See the third point, which is the gray area.
  3. Gray areas:
    3.1 ChatGPT code can access your files to serve your application. This is authorized. Hence not a concern
    3.2 ChatGPT using your files to server other applications. This would be a concern for organizations where data is private. Review ChatGPT terms for this. I don’t think they use it this way, but I have not reviewed the terms closely yet.
    3.3 OpenAI employees accessing your files and using them in some way. Again, we need to trust OpenAI here, at this point. Solutions such as file encryption and BYOK, which are common on other application platforms are not available here, yet. Trust, governed by the Terms is the only option at the moment.
    3.4 Hackers hacking their way into your files - This is a concern. However, you can do as much as possible about this. No more. Again you need to trust Open AI for having done a good job of protecting against hackers.

If you really do not want to share any files with OpenAI, you have the option to build your own RAG solution through the Function Calls. In that case, you control the source documents and you pass bits and pieces as needed through function calls. Your same question will then apply to those bits and pieces you share. However, those are ephemeral compared to the files you upload and hence your risk exposure is minimized.

1 Like

You probably didn’t mean public domain in the sense of copyright law. In fact, I plan to add in copyright messages to important files so if some malevolent actor gets a copy of them they are on notice that they are infringing.

(It would be nice if there was a magic syntax for comments in files, prompts, etc. that stripped them out before being tokenized!)

Not misstating, but making some assumptions:

Anyone who can use your app would see the content as controlled by you

Emphasis mine.

…apply to those bits and pieces you share. However, those are ephemeral compared to the files you upload and hence your risk exposure is minimized.

There would be no reason to have uploaded documents if you never sent the contents. If a system is well designed with real users, over time all of the contents of those documents would be sent in little bits and pieces.

I am a huge fan of OpenAI and not bashing, these are real concerns. If this is all sorted out it’s good for everyone. What we really need is clarity. Guessing is not good enough for this.

Appreciate those comments.

Agree about the need for clarity. I am hoping to see that clarity from Open AI representatives.

On the content front, I think the original question was concerned about the security of uploaded files. Generally, files sitting on a server for longer periods, unprotected pose a different level of risk compared to information that you exchange over TLS with the intended party, which will be gone after the transaction. My thoughts were from that perspective.