File Tree in Vector Storage

Is there a best practice for having a file tree in the vector store?

Some other resources online suggested appending a comment with the full file path for each file at the time of upload (which sounds really easy to implement). Has anyone tried that, and did it seem to help at all with the Agent understanding directory file structure?
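
A minimal sketch of the path-comment idea, assuming the files are source code whose comment syntax can be guessed from the extension. The function name `with_path_header` and the `COMMENT_PREFIX` table are hypothetical and not part of any OpenAI API:

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a line-comment prefix.
COMMENT_PREFIX = {".js": "//", ".ts": "//", ".py": "#", ".sh": "#"}


def with_path_header(rel_path: str, source: str) -> str:
    """Return `source` with a one-line comment naming its repo-relative path."""
    prefix = COMMENT_PREFIX.get(PurePosixPath(rel_path).suffix)
    if prefix is None:
        # Unknown file type (e.g. JSON, which has no comments): leave untouched.
        return source
    return f"{prefix} File: {rel_path}\n{source}"


print(with_path_header("src/routes/index.js", "export default 1;\n"))
```

One caveat with this approach: the header sits at the top of the file, so if the store splits a long file into several chunks, only a chunk that includes the top of the file will contain the path.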

Has anyone tried other methods or hacks to make the Agent recognize the file tree?

Could you explain what you are trying to achieve?

The AI is given a myfiles_browser tool, to which it can write search queries. It is a good idea to tell the AI what the myfiles_browser tool will return when the contents of the vector store are part of the assistant's behavior and operation.

However, in use the AI can only write a search query, and the chunks most similar to that query are returned. It cannot explore.

The top-ranked chunks have the source file and an index number after they come back. However, the AI has no way to specify or explore a subdomain of documents.

Therefore:

  • "recognize the file tree" doesn't have much meaning.

If the user is the one adding documents, which are returned by the same search query as any other vector stores in operation, then I can see it being a good idea to amend the AI's knowledge. The file name alone, for example in a post-prompt automatically appended to a message ("[I uploaded 2835.385.pdf]"), might be a good way to 'activate' more searching.

However, you can see that a file name alone could be less than useful. Imagine:

  • if your user interface uploader also had a dialog that asked for "Contents: (what's in the document you are adding)". Forcing it to be spelled out by the user, and not relying on the user input to mention the file, might improve the searching, but could be cumbersome.
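
The post-prompt and the optional "Contents" field could be combined in one small helper. This is only a sketch; `annotate_upload` and its parameters are hypothetical names, and the bracketed note format simply follows the example in this post:

```python
def annotate_upload(user_message: str, filename: str, contents: str = "") -> str:
    """Append a bracketed note about an uploaded file to the user's message.

    `contents` is the optional free-text description collected by the
    uploader dialog; it is omitted from the note when left empty.
    """
    note = f"[I uploaded {filename}"
    description = contents.strip()
    if description:
        note += f". Contents: {description}"
    note += "]"
    return f"{user_message}\n\n{note}"


print(annotate_upload("Summarize the report.", "2835.385.pdf"))
print(annotate_upload("Summarize the report.", "2835.385.pdf", "Q3 invoices"))
```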

I'm trying to let the AI differentiate between similarly named files with different file paths.

A real-world example could be a git repo for a Node.js website: index.html or index.js are "reserved" filenames, which represent the default entry point for a directory or the end of a route. There might be multiple index.js files in a single git repo, each in a different folder and each unique. They all have the same index.js filename, but the folders / file tree give each file a unique route/path/URL to differentiate it from the other index.js files.
Similarly, an API might have multiple route.js files, each at the end of a unique path of folders.

If I wanted the AI to answer questions about the code in a specific index.js file, I would need to differentiate that specific file from all of the others with the same name.
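
To see how much of a repo is affected, a quick check can list every filename that appears under more than one path. This is a sketch with a hypothetical helper name, `ambiguous_names`, and made-up example paths:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def ambiguous_names(paths: list[str]) -> dict[str, list[str]]:
    """Map each filename that occurs more than once to all of its paths."""
    by_name: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: sorted(found) for name, found in by_name.items() if len(found) > 1}


# Made-up repo layout for illustration only.
repo = ["pages/index.js", "pages/blog/index.js", "api/users/route.js", "README.md"]
print(ambiguous_names(repo))
```

Any name this returns cannot be told apart by filename alone, so those are the files where the path has to travel with the content.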

There are multiple things around the file itself which are way more interesting. There could be meeting protocols, or chats with ChatGPT which led to the production of the file.
You may also want to move the files, which would mean you'd have to rebuild the vector store.
Moving files around without touching them shouldn't lead to that.

So you should at least annotate the file path as metadata, but if possible also add the prompts that led to the creation of the code.
This could make it a lot easier for auto coders to work on the code.
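
One way to keep a move from forcing a rebuild is to key each stored file by a hash of its content and treat the path as metadata that can change. The sketch below only plans the work and makes no API calls; `content_id` and `plan_sync` are hypothetical names, and it assumes the store lets you update per-file metadata without re-uploading, which is not confirmed in this thread:

```python
import hashlib


def content_id(data: bytes) -> str:
    """Stable identifier derived from file content, independent of its path."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(indexed: dict[str, str], current: dict[str, bytes]):
    """Compare what is indexed with what is on disk.

    indexed: content id -> path recorded as metadata at upload time.
    current: path -> file bytes as they exist now.
    Returns (paths to upload, {content id: new path} to retag, ids to delete).
    Limitation: two files with identical bytes share one content id.
    """
    on_disk = {content_id(data): path for path, data in current.items()}
    to_upload = sorted(p for cid, p in on_disk.items() if cid not in indexed)
    to_retag = {cid: p for cid, p in on_disk.items()
                if cid in indexed and indexed[cid] != p}
    to_delete = sorted(cid for cid in indexed if cid not in on_disk)
    return to_upload, to_retag, to_delete


old = {content_id(b"a"): "src/index.js", content_id(b"b"): "src/old.js"}
now = {"app/index.js": b"a", "app/new.js": b"c"}
print(plan_sync(old, now))
```

In this example the moved file only needs its path metadata changed, while the new file is uploaded and the vanished one is deleted.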

I mean, I often just copy and paste a chat with a customer into ChatGPT and then start prompting…

I am adding all of the files programmatically during an initialization workflow. Later, on certain future event triggers, files will be programmatically updated / deleted / replaced / etc as needed.

For now what I've done is switched from the 'files' API to the 'uploads' API. That way I can change the filename during the upload. I am setting the filename on OpenAI to be/contain the entire file path.

That should at least create a semblance of the file tree or file structure; hopefully it will be effective for my use cases.
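
For anyone wanting to try the same thing, here is a rough sketch of deriving the upload filename from the path. `upload_name` is a hypothetical helper; whether the service accepts "/" inside a filename is not stated in this thread, so the separator is a parameter:

```python
from pathlib import PurePosixPath


def upload_name(repo_root: str, file_path: str, sep: str = "/") -> str:
    """Build an upload filename that carries the repo-relative path.

    Raises ValueError if `file_path` is not inside `repo_root`.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(repo_root))
    return sep.join(relative.parts)


print(upload_name("/repo", "/repo/pages/blog/index.js"))
print(upload_name("/repo", "/repo/pages/blog/index.js", sep="__"))
```

The result would then be passed as the filename of whatever upload call is in use. A separator such as "__" is a fallback in case slashes turn out to be rejected or stripped.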