Feature request: Add document length metrics to Vector Store files

Problem

Currently, the vector store file object only exposes usage_bytes (the storage size), but developers need the actual content length of their documents for cost estimation, debugging token limits, and usage analytics.
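
For illustration, here is roughly what the API returns today (a minimal sketch using the OpenAI Python SDK; depending on SDK version, the resource may live under client.beta.vector_stores instead of client.vector_stores):

from openai import OpenAI

client = OpenAI()

# Retrieve a vector store file; the IDs are placeholders.
vs_file = client.vector_stores.files.retrieve(
    file_id="file-abc123",
    vector_store_id="vs_abc123",
)

# The only size-related field exposed today is the storage footprint.
print(vs_file.usage_bytes)  # e.g. 1234
# There is no token_count, character_count, or chunk_count to read.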

Proposed Solution

Add three simple fields to the vector store file object:

{
  "id": "file-abc123",
  "object": "vector_store.file",
  "usage_bytes": 1234,
  "created_at": 1698107661,
  "vector_store_id": "vs_abc123",
  "status": "completed",
  "last_error": null,
  "chunking_strategy": { /* ... */ },
  // New fields:
  "token_count": 2450,
  "character_count": 12500,
  "chunk_count": 4
}

Fields:

  • token_count: Number of tokens in the document (the same count used to enforce the 5M-token-per-file limit; a client-side approximation is sketched after this list)
  • character_count: Total character count of the processed text
  • chunk_count: Number of chunks the document was split into
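
Until such fields exist, a rough client-side approximation of the first two can be computed before upload (a sketch assuming tiktoken with an embedding-style encoding; the count the service computes after its own parsing, e.g. of a PDF, may differ):

import tiktoken

def estimate_counts(text: str, encoding_name: str = "cl100k_base") -> tuple[int, int]:
    # Approximate token_count and character_count for a document's text.
    # This counts the raw text held locally; the server-side count after
    # its own parsing may differ.
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text)), len(text)

# "report.txt" is a placeholder path.
tokens, chars = estimate_counts(open("report.txt").read())
print(tokens, chars)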

Use Cases

  • Cost Estimation: Know actual token counts for embedding cost calculations (worked example after this list)
  • Usage Analytics: Show meaningful document statistics to users
  • Debugging: Understand token limit issues and chunking behavior
  • Optimization: Make informed decisions about document processing
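
As a worked example of the cost-estimation case (the per-token price below is a placeholder, not current pricing):

# Hypothetical embedding price in USD per 1M tokens (placeholder).
PRICE_PER_1M_TOKENS = 0.02

token_count = 2_450  # the proposed field, from the example object above
embedding_cost = token_count / 1_000_000 * PRICE_PER_1M_TOKENS
print(f"${embedding_cost:.6f}")  # $0.000049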

This would be a backwards-compatible addition that provides essential information currently computed internally but not exposed to developers.

The information you describe would actually not be that useful.

Documents are chunked, and it is the chunks placed into model context by a search that incur costs, not the source document.

The default chunking strategy is 800 tokens per chunk with a 400-token overlap, which makes it easy to extrapolate the cost of placed retrieval results. Note: the parameter for changing the default number of ranked chunks from 20 was non-functional the last time I checked, which means you could be looking at 16,000 tokens (20 chunks × 800 tokens) per file search tool call if there is at least that much content to place and you don't use a score threshold.
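
To make that arithmetic concrete (a sketch using the defaults stated above: 800-token chunks, 400-token overlap, 20 ranked chunks placed per call):

import math

def chunk_count(total_tokens: int, chunk_size: int = 800, overlap: int = 400) -> int:
    # Approximate chunk count: each chunk after the first advances
    # by (chunk_size - overlap) tokens through the document.
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / step)

# Worst-case context placed by one file search call with the defaults:
print(20 * 800)             # 16000 tokens per tool call
print(chunk_count(10_000))  # a 10,000-token document -> 24 overlapping chunks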

The embedding model doesn't charge you for having files in a vector store; you only pay for the persistent storage, which is what usage_bytes indicates.
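
For instance, a back-of-envelope storage cost from usage_bytes (the free allowance and per-GB rate below are placeholder assumptions; check current pricing):

# Placeholder storage pricing assumptions, not authoritative.
FREE_GB = 1.0
USD_PER_GB_PER_DAY = 0.10

usage_bytes = 1234  # from the vector store file object
billable_gb = max(0.0, usage_bytes / 1e9 - FREE_GB)
print(f"${billable_gb * USD_PER_GB_PER_DAY:.2f}/day")  # $0.00: well under the free allowance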

What would be ultimately useful is the ability to retrieve the chunked, indexed document back, so one could observe parsing failures.
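
One partial workaround today: if your SDK version exposes the vector store search endpoint, you can dump search results and eyeball the chunk text directly (a sketch; the search method and result shape are assumptions based on newer openai-python releases):

from openai import OpenAI

client = OpenAI()

# Assumption: newer SDK releases expose a vector store search endpoint.
results = client.vector_stores.search(
    vector_store_id="vs_abc123",
    query="a broad query that should match the document",
)

# Print the returned chunk text to spot parsing or extraction failures.
for result in results:
    for part in result.content:
        print(part.text[:200])
        print("---")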