Can you really not add metadata to files in the vector database?

Or am I misreading the API documentation? I suspect I’m not, because older posts seem to indicate this, but I suppose I’m posting this here in case there’s been a feature release that I somehow could not find.

"A vector database that cannot set or save metadata on each file is indeed less versatile and potentially less useful compared to those that do support metadata. Metadata is crucial for context, filtering, and retrieval of vectors, especially in large-scale applications." - My friend ChatGPT on the topic

4 Likes

Not at this time.

But it’s a very good idea.

Note: you can add metadata to a vector store object, but not the files themselves.

I’m not entirely sure what the utility of this metadata key is, though (honestly, I haven’t used it myself), but I imagine you could include information about the files in the vector store in the vector store’s metadata.
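As a minimal sketch of that idea, here is what attaching metadata to the vector store object could look like. The SDK call shape is shown as a comment and is based on the OpenAI Python SDK; the store name and metadata keys are made up for illustration. The runnable part only validates a metadata dict against the general limits the platform documents for metadata (at most 16 key/value pairs, keys up to 64 characters, values up to 512 characters).

```python
# Sketch: metadata goes on the vector store *object*, not on its files.
# The SDK call is illustrative only (placeholder name and keys):
#
# from openai import OpenAI
# client = OpenAI()
# store = client.vector_stores.create(
#     name="product-docs",
#     metadata={"team": "support", "source": "baserow-export"},
# )

def validate_metadata(metadata: dict) -> dict:
    """Check a metadata dict against the documented platform limits:
    at most 16 key/value pairs, keys <= 64 chars, values <= 512 chars."""
    if len(metadata) > 16:
        raise ValueError("metadata allows at most 16 key/value pairs")
    for key, value in metadata.items():
        if len(key) > 64:
            raise ValueError(f"key too long: {key!r}")
        if len(str(value)) > 512:
            raise ValueError(f"value too long for key {key!r}")
    return metadata

payload = validate_metadata({"team": "support", "source": "baserow-export"})
```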

2 Likes

To my knowledge this is not possible. In the docs they say they plan on adding a feature for deterministic pre-search filtering using metadata in the coming months…

1 Like

Hi everyone,
I’m currently exploring the OpenAI API for a project, and I was wondering if there are any updates or plans regarding the availability of a file metadata feature. Specifically, a capability to upload or manage files with associated metadata would be incredibly useful for organizing and accessing file-related information programmatically.
Does anyone know if this feature is on the roadmap, or has there been any mention of its potential release? Any insights would be greatly appreciated.
Thank you!
J

1 Like

Metadata filtering has landed in the new Responses API: https://platform.openai.com/docs/guides/tools-file-search?lang=javascript#metadata-filtering
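For anyone skimming, here is roughly what the filter looks like, following the shape described in that guide. The vector store ID and the attribute keys/values below are placeholders; treat this as a sketch, not the authoritative schema.

```python
# Sketch of a file_search tool definition with an attribute filter,
# per the Responses API metadata-filtering guide. IDs/keys are placeholders.
file_search_tool = {
    "type": "file_search",
    "vector_store_ids": ["vs_123"],  # placeholder vector store ID
    "filters": {
        "type": "and",               # compound filter: all conditions must match
        "filters": [
            {"type": "eq", "key": "category", "value": "invoices"},
            {"type": "gte", "key": "year", "value": 2023},
        ],
    },
}

# This dict would then be passed as a tool in a Responses API call, e.g.:
# response = client.responses.create(
#     model="gpt-4.1",
#     input="Which invoices mention overdue payments?",
#     tools=[file_search_tool],
# )
```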

3 Likes

That’s cool, but there is no way to attach metadata to the File in Vector Store for now. :upside_down_face:

Has there been any update yet? Without the ability to attach metadata, it’s impractical to use. If we don’t control chunking, important data might never make it into the answer during retrieval.

For me, the biggest advantage of the OpenAI Vector Store is not having to handle evaluation, scoring, etc., but I need to be sure it has access to all the data. Otherwise it defeats the purpose of the simplification. Either let us control chunks or allow metadata—metadata is obviously better.

Yes, you can create a Markdown (MD) file per record with only the key data, fit everything into a single larger chunk, and hope it works out, but that’s still not guaranteed and feels like a half-measure.

Thank you in advance for the information.
Aleš

Just for you, I went in and added a metadata feature on the Responses endpoint.

When you attach a file to a vector store, you can provide “attributes”, an object with key:value pairs, up to 16.

You also will be able to control chunking size, another concern of yours that I addressed.

Then, when you use the vector store for file_search retrieval, you can write your own filter over the input files based on matching attribute metadata, if you can figure out how to apply that dynamically in a chat context to drop unmatched files from the vector store:
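A sketch of both pieces together, i.e. attaching a file with attributes and an explicit chunking strategy. The SDK call is shown as a comment with placeholder IDs; the runnable part just builds the request payload. The chunking bounds noted in the comments (100–4096 tokens per chunk, overlap at most half the chunk size) are taken from the docs, but double-check them for your SDK version.

```python
# Sketch: attach a file to a vector store with "attributes" (up to 16
# key/value pairs) and a static chunking strategy. IDs are placeholders.
#
# from openai import OpenAI
# client = OpenAI()
# vs_file = client.vector_stores.files.create(
#     vector_store_id="vs_123",
#     **attach_request,
# )

attach_request = {
    "file_id": "file_abc",  # placeholder file ID
    "attributes": {"record_id": "42", "table": "customers"},
    "chunking_strategy": {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 800,  # documented range: 100..4096
            "chunk_overlap_tokens": 400,   # must not exceed half the chunk size
        },
    },
}
```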

I hope you find this feature complement useful. (It’s been there for quite a while, actually).

I understand the attributes, but they’re not a replacement for classical metadata. If I find the closest match, the OpenAI store still can’t automatically pull in the chunks with the same attributes. So yes, it can be used as a standard vector store — but without proper metadata, meaning you’ll still have to filter by attributes after retrieval.

The biggest added value I see in the OpenAI vector store is simplicity: upload a file, nothing else to worry about, then get a response. But with longer texts this becomes an issue; for short rows that fit into a single chunk it’s great — no problem there. However, with AI instructions at the record or table level, there isn’t much space left and the text gets split. That means something important might unnecessarily get lost in the answer.

So if it were possible to either a) add proper metadata — i.e. record details, text, etc. — or b) have the system automatically fetch all chunks with the same attributes in the background and include them in the response, that would be fantastic and would really move the service forward. And technically, it probably isn’t even that complicated. Hopefully we’ll see it someday.
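Option b) can at least be approximated client-side today: each search result carries its file’s ID, so after retrieval you can group every returned chunk by file and stitch each file’s chunks back together before building the final context. A minimal sketch of that grouping step, over hypothetical result records (the `file_id`, `chunk_index`, and `text` field names are assumptions, not the API’s exact result schema):

```python
from collections import defaultdict

def stitch_by_file(results: list[dict]) -> dict[str, str]:
    """Group retrieved chunks by file_id and join each file's chunks in
    their original order, so a record's fields stay together."""
    grouped: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for r in results:
        grouped[r["file_id"]].append((r["chunk_index"], r["text"]))
    return {
        file_id: "\n".join(text for _, text in sorted(chunks))
        for file_id, chunks in grouped.items()
    }

# Hypothetical search results: record A's important field was split across
# two chunks and only one of them matched the query directly.
results = [
    {"file_id": "file_A", "chunk_index": 1, "text": "due date: 2024-01-31"},
    {"file_id": "file_A", "chunk_index": 0, "text": "record A: invoice #7"},
    {"file_id": "file_B", "chunk_index": 0, "text": "record B: invoice #9"},
]
stitched = stitch_by_file(results)
```

This is exactly the manual stitching the post complains about having to do, but it shows the gap is bridgeable with a few lines until the store does it natively.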

A file’s chunks are never returned if the file is filtered out of the query.

If you want a keyword-ranked knowledge database of discrete entries, you likely want an actual database exposed as a function, not an embeddings-based vector store presented as one.

Why? I think this is actually a legitimate request. Using semantics while also pulling metadata directly into the AI response. That would be a great feature and, above all, a big added value.

Aha: it is an opportunity and a vertical for you to exploit, a market window not fulfilled by simply consuming OpenAI products that every developer would then have turnkey access to.

Everything about “Responses” and internal tools simply mirrors the thought of “we’re just gonna give you ChatGPT if you can’t do better yourself”.

Take, for example, the injection message that comes right before every prompt when you employ file_search:

That should tell you right off the bat this vector store product is not suitable for more than a user asking about their documents.

Your actual need is hard to imagine, filtering “chunks”. I’m sure you’ll figure out how to do it.

1 Like

Hi, thanks for your reply. Well, I don’t think it’s that clear-cut how it should work, because logically not everyone is an experienced programmer, and people naturally look for the easiest solution. I’m not a programmer myself, but I enjoy following new tech and I work at a company with about 20 people, where I wanted to try building a RAG system on my own.

I managed to do it using Baserow as the data source, Qdrant as the VS database, and I built the whole flow in N8N, everything self-hosted on my own server. It all works, but since it’s not my core business, I do get a bit tired of managing query prompts, scoring, monitoring, etc. So I started looking for an easier solution, even a paid one.

ChatGPT Enterprise does have this option, but with big file limitations, and besides, it’s quite expensive to roll out for 20 people at a company our size. The money might be acceptable, but the file number limitation is restrictive. I don’t want to merge files for many reasons. So I thought, well, nothing to do about it — there just isn’t another solution than the one I already have.

And then I came across vector store, which basically does almost everything. Yes, it’s a bit of a black box, but I don’t mind that — if something works, it’s fine for me, since I’m not a programmer as I said. And because I already have all my data nicely organized in Baserow, I quickly spun up a test version using the store, and I ran into the issue that it found the record, but didn’t return the important field, simply because it was cut off in a bad chunk.

So yes, I can upload multiple types of source files per record into the vector store, and yes, that probably solves it. But still, it’s a pity, because very little would actually be needed: if every chunk carried its file’s file_id, then when returning a response, all chunks with that file_id could be retrieved. There could easily be a per-file context limit if we wanted to enforce that.

I can of course put this in attributes, but that doesn’t help me in this case. Or rather, it does, but then I’d need to stitch the answers together myself. It would just be one additional function that could make the store more attractive to more people. I understand that professionals don’t need it, and those who don’t understand the tech won’t care either. But I think there are quite a few people in between — people like me who enjoy it, but don’t want to spend time managing flows, scoring, and so on.