I have uploaded a bunch of JSON artist profiles and added them to a vector store for my assistant.
When queried about the artists, the assistant finds a good group of profiles to respond with, and does so in markdown. So far, so good.
PROBLEM
For each artist in the list, it includes a photo, taken from a specific path the JSON that the assistant has been provided. This part usually works, but sometimes doesn’t.
When it doesn’t, it seems to have hallucinated the photo URL. That is, the URL does not exist at all in the source files.
Also, the broken images are usually on the same artist, but sometimes it actually shows the correct image for them, so it’s not like that file is just ‘broken’.
I have verified that there are no failed uploads to the vector store.
THEORY
Somehow the chunking of the file during upload to the vector store cut into the URL and had some ill effect in the embedding?
QUESTIONS
-
If it’s chunking, a different strategy probably wouldn’t help unless it’s larger than the biggest file?
-
Once the model has identified a file from an embedding, can it not go back to the source of truth (the actual file) for something like a long url that needs to be correct 100% of the time?