I am using the Batch API to embed a large number of documents. OpenAI has a total file size limit of 100 GB per organization, which limits how much data I can upload for processing at any one time.
How about the result files? Does their size count towards this limit? Can I simply upload 100 GB of documents for embedding and still retrieve the processed batch output (which is roughly 4 times the size of the uploaded documents)?
Welcome to the Forum!
There are a couple of points to be mindful of here:
- You would not upload the actual documents; instead, you would create a JSONL file containing the chunks of text from the documents to be embedded (see the sketch after this list). Each chunk must be within the token limit of the embedding model you are looking to use.
- Furthermore, the following constraint is in place for batches: "The file can contain up to 50,000 requests, and can be up to 100 MB in size." (Source: https://platform.openai.com/docs/api-reference/batch)
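As a rough illustration, here is a minimal sketch of how such a JSONL input file might be assembled, assuming you already have your text chunks in a Python list. The model name `text-embedding-3-small` and the `chunk-{i}` custom IDs are just placeholders, and the 50,000-request split mirrors the per-batch limit quoted above:

```python
import json

# Assumption: chunks is a list of text strings already split from your documents,
# each within the token limit of the embedding model you plan to use.
chunks = ["First chunk of text...", "Second chunk of text..."]

MAX_REQUESTS_PER_FILE = 50_000  # per-batch request limit from the docs

batch_file_index = 0
for start in range(0, len(chunks), MAX_REQUESTS_PER_FILE):
    path = f"embedding_batch_{batch_file_index}.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks[start:start + MAX_REQUESTS_PER_FILE], start=start):
            request = {
                "custom_id": f"chunk-{i}",  # used later to match results back to inputs
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {
                    "model": "text-embedding-3-small",  # placeholder; use your chosen model
                    "input": chunk,
                },
            }
            f.write(json.dumps(request) + "\n")
    # Note: you may also need to start a new file earlier if one approaches 100 MB.
    batch_file_index += 1
```

Each resulting JSONL file would then be uploaded as a file with purpose `batch` and submitted as its own batch job against the `/v1/embeddings` endpoint.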