Does the Assistants API's RAG use filenames as a semantic signal?

deepwell · November 26, 2024, 7:49am

When uploading files to a vector store to be used for RAG by an Assistant, I am wondering whether the filename of each file has an impact on retrieval.

Currently my files have random filenames (they are actually IDs of some sort and hold no semantic value). I am wondering if naming each file with a title that summarizes its content might help RAG.

Can anyone share some insight on that?

MARK0 · November 26, 2024, 7:53am

Standard chunking methods typically disregard the filename when splitting data. However, you can create a custom chunking strategy that attaches metadata, such as the filename, to each chunk so that it becomes part of the text that gets embedded.
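To illustrate the idea, here is a minimal sketch in plain Python. It assumes a simple fixed-size character splitter with overlap and does not use the OpenAI or LangChain APIs; `Chunk`, `chunk_with_filename`, the `Source file:` header, and the default sizes are all hypothetical names and values chosen for the example. How the resulting chunks are embedded or stored afterwards is out of scope.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # filename the chunk came from
    text: str    # chunk text with the filename header prepended


def chunk_with_filename(text: str, filename: str,
                        chunk_size: int = 800, overlap: int = 100) -> list[Chunk]:
    """Split text into overlapping character windows and prefix each with the filename."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(text), step):
        body = text[start:start + chunk_size]
        # The header is an assumption: any format works, as long as the
        # filename ends up inside the text that is later embedded.
        chunks.append(Chunk(source=filename, text=f"Source file: {filename}\n\n{body}"))
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With this approach a descriptive filename only helps if it actually says something about the content; an opaque ID in the header would add noise rather than signal.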

deepwell · November 26, 2024, 8:06am

Thx Mark0!

In the Assistant/VectorStore APIs I can only find options that let me change the chunk sizes.

https://platform.openai.com/docs/api-reference/vector-stores/create

There is a metadata field but it seems to be a static map, equal for all files as it is only present in the “create vector store” API, but not in the “create file” API:

https://platform.openai.com/docs/api-reference/vector-stores-files/createFile

Could you provide a pointer to the part of the documentation you had in mind for defining a custom chunking strategy that includes the filename in a file’s metadata?

MARK0 · November 26, 2024, 8:32am

Unfortunately, I can’t. I’m using LangChain’s TextSplitter, which you can find here: Text Splitters | 🦜️🔗 LangChain

As for OpenAI’s vector store, I haven’t worked with it. From what I’ve seen in the documentation you linked, its capabilities seem fairly limited compared to LangChain.
