Does RAG in the Assistants API use filenames as semantic information?

When uploading files to a vector store to be used for RAG by an Assistant, I am wondering whether the filename of each file has an impact on retrieval.

Currently my files have random filenames (actually the filenames are some sort of ID, but they do not hold any semantic value). I am wondering if naming the files with a title that somehow summarizes their content might help for RAG.

Can anyone share some insight on that?

Standard chunking methods typically disregard the filename when splitting data. However, you can create a custom chunking strategy that appends metadata—like the filename—to each chunk.
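A minimal sketch of that idea in plain Python, assuming simple fixed-size character windows with overlap. The function name `chunk_with_filename`, the `Source:` prefix, and the default sizes are all illustrative choices, not part of any OpenAI or LangChain API. Note that this version puts the filename at the start of each chunk rather than the end.

```python
from pathlib import Path


def chunk_with_filename(text: str, filename: str,
                        chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows, each prefixed with the file's title."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    # Hypothetical convention: "sales_report-2023.txt" becomes "sales report 2023".
    title = Path(filename).stem.replace("_", " ").replace("-", " ")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        body = text[start:start + chunk_size]
        chunks.append(f"Source: {title}\n\n{body}")
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each returned string could then be embedded or written out as its own file, so the title travels with the content. This only helps if the filename is descriptive, and whether it improves retrieval in practice would need to be tested on your own data.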

Thx Mark0!

In the Assistant/VectorStore APIs I can only find stuff that lets me change the chunk sizes.

https://platform.openai.com/docs/api-reference/vector-stores/create

There is a metadata field but it seems to be a static map, equal for all files as it is only present in the “create vector store” API, but not in the “create file” API:

https://platform.openai.com/docs/api-reference/vector-stores-files/createFile

Could you provide a pointer to the part of the documentation you had in mind regarding defining a custom chunking strategy that will include the filename in a file’s metadata?

Unfortunately, I can’t. I’m using LangChain’s TextSplitter, which you can find here: Text Splitters | 🦜️🔗 LangChain

As for OpenAI’s vector store, I haven’t worked with it. From what I’ve seen in the documentation you linked, its capabilities seem fairly limited compared to LangChain.
