Problem
Currently, vector store files only return usage_bytes (storage size), but developers need to know the actual content length of their documents for cost estimation, debugging token limits, and usage analytics.
Proposed Solution
Add three simple fields to the vector store file object:
{
"id": "file-abc123",
"object": "vector_store.file",
"usage_bytes": 1234,
"created_at": 1698107661,
"vector_store_id": "vs_abc123",
"status": "completed",
"last_error": null,
"chunking_strategy": { /* ... */ },
// New fields:
"token_count": 2450,
"character_count": 12500,
"chunk_count": 4
}
Fields:
- token_count: Number of tokens in the document (same count used for 5M token limit)
- character_count: Total character count of the processed text
- chunk_count: Number of chunks the document was split into
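To illustrate how a client might consume these fields, here is a minimal sketch. It assumes the file object has already been parsed into a plain dict; the helper name content_stats is hypothetical and not part of any SDK. Because the fields are only proposed, the sketch treats them as optional. The tokens-per-chunk figure is a rough average that ignores any chunk overlap.

```python
from typing import Any, Optional


def content_stats(file_obj: dict[str, Any]) -> dict[str, Optional[float]]:
    """Derive simple ratios from the proposed fields, tolerating their absence."""
    tokens = file_obj.get("token_count")
    chars = file_obj.get("character_count")
    chunks = file_obj.get("chunk_count")
    return {
        # Rough average; ignores chunk overlap.
        "tokens_per_chunk": tokens / chunks if tokens is not None and chunks else None,
        "chars_per_token": chars / tokens if chars is not None and tokens else None,
    }


# Values taken from the example object in this proposal.
example = {
    "id": "file-abc123",
    "usage_bytes": 1234,
    "token_count": 2450,
    "character_count": 12500,
    "chunk_count": 4,
}
stats = content_stats(example)
```

An object without the new fields yields None for both ratios, so existing responses keep working with the same client code.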
Use Cases
- Cost Estimation: Know actual token counts for embedding cost calculations
- Usage Analytics: Show meaningful document statistics to users
- Debugging: Understand token limit issues and chunking behavior
- Optimization: Make informed decisions about document processing
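As a rough sketch of the cost estimation and debugging use cases, the snippet below shows what a developer could compute once token_count is exposed. The function names and the price parameter are assumptions for illustration only; no real pricing is implied, and actual embedding costs may differ if chunk overlap causes some tokens to be embedded more than once.

```python
TOKEN_LIMIT = 5_000_000  # the 5M token limit mentioned in this request


def estimate_embedding_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Estimate cost from a token count and a caller-supplied price (assumption)."""
    return token_count / 1_000_000 * price_per_million_tokens


def limit_headroom(token_count: int, limit: int = TOKEN_LIMIT) -> int:
    """Tokens remaining before the limit; a negative value means the limit is exceeded."""
    return limit - token_count


# Hypothetical price, chosen only to make the arithmetic easy to follow.
cost = estimate_embedding_cost(2_000_000, price_per_million_tokens=0.10)
headroom = limit_headroom(2450)
```

Without token_count in the response, both calculations require re-tokenizing the document on the client side, which may not match the count used internally.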
This would be a backwards-compatible addition that provides essential information currently computed internally but not exposed to developers.