Trouble with vector store for assistant, file is too large? It's under 100MB though

Hello,

When I create a vector store with file_ids, the file is marked as failed in the OpenAI web UI. The file uploads without issue, but it can't be processed.

I’m creating the vector store like this:

        remote_vs = self.client.beta.vector_stores.create(
            name=self.vector_store.name,
            file_ids=[file.id for file in self.vector_store.files],
            chunking_strategy={
                "type": "static",
                "static": {
                    "max_chunk_size_tokens": 2048,
                    "chunk_overlap_tokens": 256,
                },
            },
        )
        logging.info("Remote Vector Store Data: %s", remote_vs)

This is logged:

Remote Vector Store Data: VectorStore(id='vs_47QQ6T2PoUzwIi7kjkgjyj2c', created_at=1719528823, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=1, total=1), last_active_at=1719528823, metadata={}, name='StableDiffusionData', object='vector_store', status='in_progress', usage_bytes=0, expires_after=None, expires_at=None)

It is subsequently marked as failed in the web UI when I go and check.

The file is a JSON file of 87,003,153 bytes (about 87 MB).

Also:
If I try using client.beta.vector_stores.files.create_and_poll() like this:

            attached = self.client.beta.vector_stores.files.create_and_poll(
                vector_store_id=self.vector_store.id,
                file_id=file.id,
            )
            logging.info("Vector file attachment status: %s", attached)

It also fails, but at least it gives me a reason.

This is logged:

Vector file attachment status: VectorStoreFile(id='file-MU2BzuNwxpTPsmSNBD4Rith2', created_at=1719529404, last_error=LastError(code='invalid_file', message='The file could not be parsed because it is too large.'), object='vector_store.file', status='failed', usage_bytes=0, vector_store_id='vs_zjv1pOXCfWeycOOhGoMktXYd', chunking_strategy=ChunkingStrategyStatic(static=ChunkingStrategyStaticStatic(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static'))

Is this file too big, even though the documentation says files can be up to 512 MB and 5 million tokens each?

Reference:
https://platform.openai.com/docs/assistants/tools/file-search/creating-vector-stores-and-adding-files


Yes, a text-based file 87 MB in size is quite likely to encode to far more than 5M tokens. At a rough 3 to 4 bytes per token, 87 MB of JSON works out to somewhere over 20 million tokens.
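You can sanity-check that before uploading by counting tokens with tiktoken. A rough sketch, where the file path is a placeholder and cl100k_base is only assumed to be a reasonable proxy for whatever tokenizer file search uses:

    # Rough token estimate for the large JSON file (path is a placeholder).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = open("stable_diffusion_data.json", encoding="utf-8").read()

    # disallowed_special=() lets text that happens to contain special-token
    # strings be encoded as ordinary text instead of raising an error.
    tokens = enc.encode(text, disallowed_special=())
    print(f"Approximate tokens: {len(tokens):,}")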

You can apply your own chunking to that. Dicing the JSON up into 87 separate ~1 MB files produces knowledge chunks that differ little from the roughly 25k pieces you would likely get anyway if the whole file were processed at once, and in practice file search returns only about 20 of those mid-document chunks to inform the AI.
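A minimal sketch of that workaround, assuming the JSON is a top-level list of records (the file names, slice size, and vector store id are placeholders); file_batches.upload_and_poll is the batch helper in the openai Python SDK:

    import json
    from pathlib import Path

    from openai import OpenAI

    client = OpenAI()

    # Load the large JSON file (assumed here to be a top-level list of records).
    records = json.loads(Path("stable_diffusion_data.json").read_text(encoding="utf-8"))

    # Write the records back out in slices small enough to stay well under
    # the 5M-token-per-file limit.
    chunk_size = 5000  # records per file; tune for your data
    paths = []
    for i in range(0, len(records), chunk_size):
        part = Path(f"stable_diffusion_data_{i // chunk_size:03d}.json")
        part.write_text(json.dumps(records[i:i + chunk_size]), encoding="utf-8")
        paths.append(part)

    # Upload every piece and attach all of them to the vector store in one batch.
    batch = client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id="vs_...",  # your existing vector store id
        files=[p.open("rb") for p in paths],
    )
    print(batch.status, batch.file_counts)

This keeps each piece well below the per-file token limit, while retrieval behaves essentially the same since file search chunks each file internally anyway.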


Thanks, this is literally the fix for this issue.
