Feature request: Add document length metrics to Vector Store files

Problem

Currently, the vector store file object only exposes usage_bytes (the storage size), but developers need the actual content length of their documents for cost estimation, debugging token limits, and usage analytics.
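
For illustration, here is roughly what the API returns today (a minimal sketch using the OpenAI Python SDK; depending on SDK version, the resource may live under client.beta.vector_stores instead of client.vector_stores):

from openai import OpenAI

client = OpenAI()

# Retrieve a vector store file; the IDs are placeholders.
vs_file = client.vector_stores.files.retrieve(
    file_id="file-abc123",
    vector_store_id="vs_abc123",
)

# The only size-related field exposed today is the storage footprint.
print(vs_file.usage_bytes)  # e.g. 1234
# There is no token_count, character_count, or chunk_count to read.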

Proposed Solution

Add three simple fields to the vector store file object:

{
  "id": "file-abc123",
  "object": "vector_store.file",
  "usage_bytes": 1234,
  "created_at": 1698107661,
  "vector_store_id": "vs_abc123",
  "status": "completed",
  "last_error": null,
  "chunking_strategy": { /* ... */ },
  // New fields:
  "token_count": 2450,
  "character_count": 12500,
  "chunk_count": 4
}

Fields:

  • token_count: Number of tokens in the document (the same count used to enforce the 5M-token-per-file limit; a client-side approximation is sketched after this list)
  • character_count: Total character count of the processed text
  • chunk_count: Number of chunks the document was split into
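
Until such fields exist, a rough client-side approximation of the first two can be computed before upload (a sketch assuming tiktoken with an embedding-style encoding; the count the service computes after its own parsing, e.g. of a PDF, may differ):

import tiktoken

def estimate_counts(text: str, encoding_name: str = "cl100k_base") -> tuple[int, int]:
    # Approximate token_count and character_count for a document's text.
    # This counts the raw text held locally; the server-side count after
    # its own parsing may differ.
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text)), len(text)

# "report.txt" is a placeholder path.
tokens, chars = estimate_counts(open("report.txt").read())
print(tokens, chars)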

Use Cases

  • Cost Estimation: Know actual token counts for embedding cost calculations (worked example after this list)
  • Usage Analytics: Show meaningful document statistics to users
  • Debugging: Understand token limit issues and chunking behavior
  • Optimization: Make informed decisions about document processing
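
As a worked example of the cost-estimation case (the per-token price below is a placeholder, not current pricing):

# Hypothetical embedding price in USD per 1M tokens (placeholder).
PRICE_PER_1M_TOKENS = 0.02

token_count = 2_450  # the proposed field, from the example object above
embedding_cost = token_count / 1_000_000 * PRICE_PER_1M_TOKENS
print(f"${embedding_cost:.6f}")  # $0.000049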

This would be a backwards-compatible addition that provides essential information currently computed internally but not exposed to developers.

The information you describe would actually not be that useful.

Documents are chunked, and it is the chunks placed into model context by a search that incur costs, not the source document.

The default chunking strategy is 800 tokens per chunk with a 400-token overlap, which makes it easy to extrapolate the cost of placed retrieval results. Note: the parameter for changing the default number of ranked chunks from 20 was non-functional the last time I checked, which means you could be looking at 16,000 tokens (20 chunks × 800 tokens) per file search tool call if there is at least that much content to place and you don't use a score threshold.
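
To make that arithmetic concrete (a sketch using the defaults stated above: 800-token chunks, 400-token overlap, 20 ranked chunks placed per call):

import math

def chunk_count(total_tokens: int, chunk_size: int = 800, overlap: int = 400) -> int:
    # Approximate chunk count: each chunk after the first advances
    # by (chunk_size - overlap) tokens through the document.
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / step)

# Worst-case context placed by one file search call with the defaults:
print(20 * 800)             # 16000 tokens per tool call
print(chunk_count(10_000))  # a 10,000-token document -> 24 overlapping chunks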

The embedding model doesn't charge you for having files in a vector store; you only pay for the persistent storage, which is what usage_bytes indicates.
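
For instance, a back-of-envelope storage cost from usage_bytes (the free allowance and per-GB rate below are placeholder assumptions; check current pricing):

# Placeholder storage pricing assumptions, not authoritative.
FREE_GB = 1.0
USD_PER_GB_PER_DAY = 0.10

usage_bytes = 1234  # from the vector store file object
billable_gb = max(0.0, usage_bytes / 1e9 - FREE_GB)
print(f"${billable_gb * USD_PER_GB_PER_DAY:.2f}/day")  # $0.00: well under the free allowance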

What would be ultimately useful is the ability to retrieve the chunked, indexed document back, so one could observe parsing failures.
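
One partial workaround today: if your SDK version exposes the vector store search endpoint, you can dump search results and eyeball the chunk text directly (a sketch; the search method and result shape are assumptions based on newer openai-python releases):

from openai import OpenAI

client = OpenAI()

# Assumption: newer SDK releases expose a vector store search endpoint.
results = client.vector_stores.search(
    vector_store_id="vs_abc123",
    query="a broad query that should match the document",
)

# Print the returned chunk text to spot parsing or extraction failures.
for result in results:
    for part in result.content:
        print(part.text[:200])
        print("---")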