How to Map source_id to File Names in OpenAI Assistant Annotations?

Hi OpenAI Community,

I’ve uploaded files to a vector store using the OpenAI UI and integrated them with an assistant configured for file search. The assistant works fine, and annotations are included in the responses, but I’m unable to map the source_id in the annotations to the corresponding file names.

Example Problem:

Here’s an example of the annotations I’m receiving:

"annotations": [
    {
        "source_id": "8:3",
        "file_name": "Unknown",
        "quote": ""
    }
]

I’ve tried:

  1. Fetching file metadata using /v1/files, which returns file_id and filename, but there’s no clear link between source_id (e.g., 8:3) and file_id.
  2. Checking the annotation structure, but source_id doesn’t seem to directly match the uploaded file data.
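
For context, a minimal sketch of the lookup table the first attempt produces. It assumes the /v1/files list response is a JSON object with a data array whose entries carry id and filename; the helper name build_filename_index and the sample IDs are invented for illustration. The table is keyed by file_id, which is why a value like 8:3 finds nothing in it.

```python
from typing import Dict


def build_filename_index(files_payload: dict) -> Dict[str, str]:
    """Map file_id -> filename from a /v1/files list response body."""
    index: Dict[str, str] = {}
    for entry in files_payload.get("data", []):
        file_id = entry.get("id")
        filename = entry.get("filename")
        # Skip malformed entries rather than storing None keys or values.
        if file_id and filename:
            index[file_id] = filename
    return index


# Illustrative payload only; these file IDs are made up.
sample_payload = {
    "object": "list",
    "data": [
        {"id": "file-abc123", "object": "file", "filename": "PricingPlansSummary.md"},
        {"id": "file-def456", "object": "file", "filename": "news_and_articles.md"},
    ],
}

filename_index = build_filename_index(sample_payload)
```

Looking up filename_index.get("8:3") returns None here, which matches the missing link described above.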

Goal:

I’d like to resolve the source_id to the file names (e.g., PricingPlansSummary.md) to enrich my logs like this:

"annotations": [
    {
        "source_id": "8:3",
        "file_name": "PricingPlansSummary.md",
        "quote": "Unlimited scalability"
    }
]

Questions:

  1. How can I map source_id to file names for uploaded vector store files?
  2. Is there a specific endpoint or process I should use to retrieve this mapping?

Any help would be greatly appreciated!

Thanks in advance!

Update:

Since my initial post, I’ve made some progress and tried several approaches to resolve the issue of mapping source_id to file names. Here’s what I’ve done so far:

  1. Fetching Metadata via /v1/files:
  • Successfully retrieved metadata for all uploaded files, including file_id and file_name.
  • However, there’s no apparent relationship between the source_id (e.g., 8:3, 22:4) in the annotations and the file_id values from the metadata.
  2. Dynamic Matching:
  • Attempted to match source_id prefixes (e.g., 8 from 8:3) with portions of the file_name or file_id in the metadata cache.
  • Despite debugging, this approach didn’t yield reliable results, as no consistent connection between source_id and metadata attributes was identified.
  3. Logs and Results:
  • The metadata cache shows all file names and IDs correctly, but when resolving source_id to file names, we encounter errors like:
ERR+0000R: No file_id found for source_id prefix '22' in metadata.
  • Annotations in the chat log remain incomplete, with file_name still marked as "Unknown".
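
A rough sketch of the prefix-matching attempt described above, to show why it is unreliable. The function names split_source_id and match_prefix and the sample cache are hypothetical; the only assumption about source_id is the prefix:suffix shape seen in the examples (8:3, 22:4).

```python
from typing import Dict, Optional, Tuple


def split_source_id(source_id: str) -> Tuple[str, str]:
    """Split a value like '8:3' into ('8', '3')."""
    prefix, sep, suffix = source_id.partition(":")
    if not sep or not prefix or not suffix:
        raise ValueError(f"Unexpected source_id format: {source_id!r}")
    return prefix, suffix


def match_prefix(source_id: str, metadata_cache: Dict[str, str]) -> Optional[str]:
    """Return the first filename whose file_id or filename contains the prefix."""
    prefix, _ = split_source_id(source_id)
    for file_id, filename in metadata_cache.items():
        # Naive substring test: this is the guesswork that proved unreliable.
        if prefix in file_id or prefix in filename:
            return filename
    return None


# Hypothetical cache in the file_id -> filename shape; the ID is invented.
metadata_cache = {"file-abc123": "PricingPlansSummary.md"}
```

With this cache, match_prefix("22:4", metadata_cache) finds nothing, while match_prefix("1:0", metadata_cache) returns a file name only because the digit 1 happens to appear in the made-up file_id. Substring matching of this kind can produce both misses and accidental hits.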

Current Challenge:

The core issue is that we’re unable to identify how source_id (e.g., 22:4) relates to the uploaded files’ metadata (file_id or file_name). Without this relationship, we cannot enrich the annotations with the corresponding file names.

It appears that source_id refers to sections or parts of the uploaded files, but I haven’t found a direct way, or any clear explanation, of how to map it to the corresponding file_id or file name retrieved from /v1/files.

Questions:

  1. Is there any endpoint, API parameter, or metadata structure that directly links source_id to a specific file_id or file_name?
  2. Is the source_id generated in a predictable way, such as based on the file content or upload order? If so, how can we extract this relationship?
  3. Is it possible to adjust the data returned in annotations to include metadata like file_id or file_name directly?

Any insights or suggestions would be incredibly helpful as we’re stuck at this stage. Thank you!

I would just wait. Clearly, they have this ability internally and will hopefully release it as part of the Assistants API in its next release (i.e., we shouldn’t have to work this hard to get this kind of functionality).

So are we saying there is no way to get the link currently? Is there a way to include metadata like file_id or file_name directly into the chat response rather than the source_id?

I’ve tried it before and never got it to work—or at least it got to the point where I just thought they should make this way easier.

The head of API was asking for feature requests on X yesterday, and his response to “easy RAG” across all endpoints seems to suggest that a more robust solution is in the works. Didn’t they buy a vector CMS last year?

All this to say I’d just wait. The next two months will be excruciating for those of us who love to build, but I guess we will see everything in March.

OK, after a lot of trial and error, I finally resolved the issue of mapping source_id (e.g., 8:3) in OpenAI Assistant annotations to corresponding file names (e.g., PricingPlansSummary.md). I hope this guide helps anyone facing similar challenges!


Solution Summary

To achieve this, I combined a custom function in the OpenAI Assistant configuration and changes in my backend script.


Step 1: Add a Custom Function in OpenAI Assistant

I added the following function to my assistant configuration under “Tools” to resolve source_id to file names using metadata:

{
  "name": "resolve_source_id",
  "description": "Resolve a source_id to its corresponding file name using metadata and include file name as annotations",
  "strict": true,
  "parameters": {
    "type": "object",
    "required": ["source_id", "metadata"],
    "properties": {
      "source_id": {
        "type": "string",
        "description": "Unique identifier for the source file"
      },
      "metadata": {
        "type": "array",
        "description": "Array of metadata entries that include file name information",
        "items": {
          "type": "object",
          "properties": {
            "id": {
              "type": "string",
              "description": "The unique identifier for the metadata entry"
            },
            "file_name": {
              "type": "string",
              "description": "The file name corresponding to the source_id"
            }
          },
          "required": ["id", "file_name"],
          "additionalProperties": false
        }
      }
    },
    "additionalProperties": false
  }
}
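
When the assistant calls this function, the backend has to produce the result. Below is a minimal sketch of such a handler, assuming the arguments arrive as a JSON string that follows the schema above. The matching rule (a metadata entry whose id equals the source_id) is an assumption for illustration; the post does not spell out how entries are matched.

```python
import json


def resolve_source_id(arguments_json: str) -> dict:
    """Handle a resolve_source_id tool call and return the matched file name."""
    args = json.loads(arguments_json)
    source_id = args["source_id"]
    for entry in args["metadata"]:
        # Assumed rule: the metadata entry id is the source_id itself.
        if entry["id"] == source_id:
            return {"source_id": source_id, "file_name": entry["file_name"]}
    # Keep the same placeholder the original logs used when nothing matches.
    return {"source_id": source_id, "file_name": "Unknown"}


# Example arguments in the shape defined by the schema (values are illustrative).
example_arguments = json.dumps(
    {
        "source_id": "8:3",
        "metadata": [
            {"id": "8:3", "file_name": "PricingPlansSummary.md"},
            {"id": "14:3", "file_name": "news_and_articles.md"},
        ],
    }
)

result = resolve_source_id(example_arguments)
```

The returned dictionary can be serialized with json.dumps and sent back as the tool output.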

Step 2: Adjust Your Backend Script

On the backend, I modified my script to:

  1. Capture source_id values from the streamed AI responses.
  2. Use the resolve_source_id function to map each source_id to its corresponding file_name.
  3. Replace source_id in annotations with the file_name for better readability in chat logs.

Here’s an overview of the key functions I implemented:

  • capture_source_id_from_streamed_events: Extracts source_id from streamed AI responses.
  • resolve_source_to_file: Maps source_id to file names using metadata from OpenAI’s /v1/files API.
  • log_chat: Logs enriched annotations (with file_name and source_id) for debugging and transparency.
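
As an illustration of the enrichment step, here is a small sketch that fills in file_name on annotations shaped like the ones earlier in this thread. It assumes a source_id to file name mapping has already been resolved; enrich_annotations and the sample mapping are hypothetical names, not the actual functions listed above.

```python
from typing import Dict, List


def enrich_annotations(
    annotations: List[dict], source_to_file: Dict[str, str]
) -> List[dict]:
    """Return copies of the annotations with file_name filled in where known."""
    enriched = []
    for ann in annotations:
        updated = dict(ann)  # copy so the original annotation is not mutated
        fallback = ann.get("file_name", "Unknown")
        updated["file_name"] = source_to_file.get(ann.get("source_id"), fallback)
        enriched.append(updated)
    return enriched


# Illustrative input matching the annotation shape from the first post.
raw_annotations = [{"source_id": "8:3", "file_name": "Unknown", "quote": ""}]
resolved = {"8:3": "PricingPlansSummary.md"}

enriched_annotations = enrich_annotations(raw_annotations, resolved)
```

Annotations whose source_id is not in the mapping keep whatever file_name they already had, so unresolved entries stay visible as "Unknown" in the logs.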

Outcome

After implementing the above changes:

  1. I now see annotations like 【news_and_articles.md†14:3】 in my AI responses and logs, where the file name (news_and_articles.md) is included.
  2. This makes it easier to trace sources and validate the AI’s responses.
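
For anyone who wants to pull these markers back out of the response text, here is a short sketch. It assumes the marker format is exactly the one shown above (【file name†source_id】); the pattern and function name are illustrative.

```python
import re
from typing import List, Tuple

# Matches markers like 【news_and_articles.md†14:3】 and captures both parts.
CITATION_PATTERN = re.compile(r"【([^†】]+)†(\d+:\d+)】")


def extract_citations(text: str) -> List[Tuple[str, str]]:
    """Return (file_name, source_id) pairs for every marker found in the text."""
    return CITATION_PATTERN.findall(text)


sample_text = "See the latest coverage【news_and_articles.md†14:3】 for details."
citations = extract_citations(sample_text)
```

Each tuple holds the file name first and the source_id second, which is convenient for writing both into the chat log.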

I hope this helps anyone trying to solve the same problem. Feel free to ask if you need more details, and good luck with your implementation!