Trouble with vector store for assistant, file is too large? It's under 100MB though


While creating a vector store with file_ids, the file is marked as failed in the OpenAI web UI. The file uploads without issue but can’t be processed.

I’m creating the vector store like this:

        remote_vs = self.client.beta.vector_stores.create(
            name="StableDiffusionData",
            file_ids=[file.id for file in self.vector_store.files],
            chunking_strategy={
                "type": "static",
                "static": {
                    "max_chunk_size_tokens": 2048,
                    "chunk_overlap_tokens": 256,
                },
            },
        )
        logger.info("Remote Vector Store Data: %s", remote_vs)

It is logged:

Remote Vector Store Data: VectorStore(id='vs_47QQ6T2PoUzwIi7kjkgjyj2c', created_at=1719528823, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=1, total=1), last_active_at=1719528823, metadata={}, name='StableDiffusionData', object='vector_store', status='in_progress', usage_bytes=0, expires_after=None, expires_at=None)

When I go and check later, it is marked as failed in the web UI.

The file is a JSON file that is 87,003,153 bytes (about 87 MB).

If I try using client.beta.vector_stores.files.create_and_poll() like this:

            attached = self.client.beta.vector_stores.files.create_and_poll(
                vector_store_id=remote_vs.id,
                file_id=uploaded_file.id,
            )
            logger.info("Vector file attachment status: %s", attached)

It also fails, but at least it gives me a reason.

This is logged:

Vector file attachment status: VectorStoreFile(id='file-MU2BzuNwxpTPsmSNBD4Rith2', created_at=1719529404, last_error=LastError(code='invalid_file', message='The file could not be parsed because it is too large.'), object='vector_store.file', status='failed', usage_bytes=0, vector_store_id='vs_zjv1pOXCfWeycOOhGoMktXYd', chunking_strategy=ChunkingStrategyStatic(static=ChunkingStrategyStaticStatic(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static'))

Is this file too big, despite the documentation saying the limit is 512 MB with up to 5M tokens per file?



Yes, a text-based file 87MB in size is quite likely to encode to far more than 5M tokens.
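As a rough sanity check (a pure-stdlib heuristic, assuming roughly 4 bytes per token for English-like text; JSON punctuation and keys often tokenize worse), you can estimate how far over the 5M-token-per-file limit a file is before uploading:

```python
import os

def estimate_tokens(path: str, bytes_per_token: float = 4.0) -> int:
    """Rough token estimate from file size (~4 bytes/token for English text)."""
    return int(os.path.getsize(path) / bytes_per_token)

# The failing file was 87,003,153 bytes:
# 87_003_153 / 4 is roughly 21.7M tokens, well over the 5M-token limit.
```

For an exact count you would run the file through a tokenizer such as tiktoken, but the size heuristic alone already shows this file cannot fit.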

You can apply your own chunking. Dicing the JSON into 87 separate 1 MB files yields knowledge chunks little different from the ~25k pieces you would likely get anyway if the whole file were processed at once; in practice, file search returns only around 20 of those mid-document chunks to inform the AI.
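If the file parses as a top-level JSON array, a minimal splitting sketch might look like this (`split_json_array` is a hypothetical helper, not part of the OpenAI SDK; adapt the size cap and the shape handling to your data):

```python
import json

def split_json_array(path, max_bytes=1_000_000):
    """Split a top-level JSON array into multiple smaller JSON files.

    Assumes the file parses as a list; each part stays under max_bytes
    (approximately -- sizes are estimated from the serialized items).
    Returns the number of part files written.
    """
    with open(path) as f:
        items = json.load(f)

    parts, current, current_size = [], [], 2  # 2 bytes for the "[]" wrapper
    for item in items:
        size = len(json.dumps(item)) + 2  # +2 for the ", " separator
        if current and current_size + size > max_bytes:
            parts.append(current)
            current, current_size = [], 2
        current.append(item)
        current_size += size

    if current:
        parts.append(current)

    for i, part in enumerate(parts):
        with open(f"{path}.part{i:03d}.json", "w") as out:
            json.dump(part, out)

    return len(parts)
```

Each part file can then be uploaded with `client.files.create()` and attached to the vector store individually, so one oversized part no longer fails the whole store.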


Thanks, this is literally the fix for this issue.
