Vector Store API calls returning 504s, 503s, and generally being slow

Three months ago, I set up an AI chatbot that uses resources from the OpenAI vector store as references. The resources in the vector store need to be cycled weekly because they are frequently updated, with items added and removed.

To ensure everything is cycled correctly, I built a cron job that clears the vector stores and reloads the new set of resources. The code behind the cron job worked perfectly over the last couple of months, but is now failing when listing files from a vector store.

This is the code I’m using to call the API:

import OpenAI from 'openai';

/**
 * Gets all file IDs from a vector store.
 * @param openai - An initialized OpenAI client
 * @param vectorStoreId - The ID of the vector store
 * @returns An array of file IDs
 */
export async function getVectorStoreFileIds(openai: OpenAI, vectorStoreId: string): Promise<string[]> {
  const fileIds: string[] = [];
  let hasMorePages = true;
  let paginationCursor: string | undefined;

  while (hasMorePages) {
    const params: { limit: number; after?: string; } = { limit: 100 };
    if (paginationCursor) {
      params.after = paginationCursor;
    }

    const filesPage = await openai.vectorStores.files.list(vectorStoreId, params);

    fileIds.push(...filesPage.data.map((f) => f.id));

    // Stop on an empty page or when the API reports no further pages.
    hasMorePages = filesPage.data.length > 0 && filesPage.hasNextPage();

    if (hasMorePages) {
      // The ID of the last file on this page is the cursor for the next request.
      paginationCursor = filesPage.data[filesPage.data.length - 1].id;
    }
  }
  
  return fileIds;
}

I use this to get all of the fileIds so I can then delete them in batches. When I run this, I get either a 504:

Fatal error: InternalServerError: 504 status code (no body)
    at Function.generate (/my-repo/node_modules/openai/src/core/error.ts:100:14)
    at OpenAI.makeStatusError (/my-repo/node_modules/openai/src/client.ts:478:28)
    at OpenAI.makeRequest (/my-repo/node_modules/openai/src/client.ts:728:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async getVectorStoreFileIds (/my-repo/scripts/vector-store-sync/shared.ts:121:23)
    at async clearVectorStore (/my-repo/scripts/vector-store-sync/shared.ts:90:29)
    at async main (/my-repo/scripts/vector-store-sync/sync-resources.ts:82:3)
    at async main (/my-repo/scripts/vector-store-sync/sync-all-vector-stores.ts:34:27) {
  status: 504,
  headers: Headers {
    date: 'Thu, 28 May 2026 09:33:18 GMT',
    'content-type': 'application/json; charset=utf-8',
    'content-length': '899',
    connection: 'keep-alive',
    'retry-after': '120',
    'cache-control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0',
    expires: 'Thu, 01 Jan 1970 00:00:01 GMT',
    'referrer-policy': 'same-origin',
    'x-frame-options': 'SAMEORIGIN',
    'access-control-expose-headers': 'CF-Ray',
    server: 'cloudflare',
    'cf-ray': '<removed>',
    'alt-svc': 'h3=":443"; ma=86400'
  },
  requestID: null,
  error: undefined,
  code: undefined,
  param: undefined,
  type: undefined
}

Or a 503 error:

InternalServerError: 503 System is overloaded. Please retry in a few minutes.
    at Function.generate (/my-repo/node_modules/openai/src/core/error.ts:100:14)
    at OpenAI.makeStatusError (/my-repo/node_modules/openai/src/client.ts:478:28)
    at OpenAI.makeRequest (/my-repo/node_modules/openai/src/client.ts:728:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async getVectorStoreFileIds (/my-repo/functions/src/vector-store-sync/openai-helpers.ts:141:23)
    at async detachAllFilesFromVectorStore (/my-repo/functions/src/vector-store-sync/openai-helpers.ts:62:19)
    at async syncVectorStores (/my-repo/functions/src/vector-store-sync/sync-vector-stores.ts:97:3)
    at async run (/my-repo/tests/manual/test-sync-vector-stores.ts:45:18) {
  status: 503,
  headers: Headers {},
  requestID: '<removed>',
  error: {
    message: 'System is overloaded. Please retry in a few minutes.',
    type: 'service_unavailable_error',
    param: null,
    code: null
  },
  code: null,
  param: null,
  type: 'service_unavailable_error'
}

This is completely disrupting my chatbot and service. What is happening here?

Given that this workflow operated successfully for months and now fails without major logic changes, this looks more like a service-side scalability/regression issue than an application-layer bug.

A few things stand out:

  • the failures occur specifically during paginated listing

  • both 503 overload and 504 timeout responses are appearing

  • retry-after: 120 suggests the backend is explicitly signaling load pressure

  • the issue appears intermittent rather than deterministic

So one possible explanation is that vector store file listing performance has regressed for larger stores, causing pagination requests to time out under load.

And a few mitigation ideas worth testing meanwhile :wink:

  • reduce pagination size (limit: 25 or 50)

  • add exponential backoff + jitter for 503/504 responses

  • persist incremental sync state instead of full-store cycling

  • avoid clearing/reloading entire vector stores weekly if possible

  • stagger cron execution windows if multiple stores sync simultaneously
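
The backoff suggestion above could look roughly like the following. This is a minimal TypeScript sketch, not SDK code: `withRetry`, `backoffDelayMs`, `isRetryable`, `readRetryAfterSeconds`, and the option names are hypothetical helpers, and it assumes the thrown error exposes `status` and `headers` the way the error dumps in this thread do. It retries only on 503/504, uses full jitter, and waits at least as long as any numeric `retry-after` value the server sends.

```typescript
// Hypothetical helpers for illustration; none of these names come from the OpenAI SDK.
type RetryOptions = {
  maxAttempts: number; // total tries, including the first one
  baseDelayMs: number; // upper bound of the first retry's random wait
  maxDelayMs: number;  // cap on the computed backoff (a server retry-after hint may exceed it)
};

const RETRYABLE_STATUSES = new Set([503, 504]);

export function isRetryable(err: unknown): boolean {
  const status = (err as { status?: unknown } | null)?.status;
  return typeof status === "number" && RETRYABLE_STATUSES.has(status);
}

// Reads a numeric retry-after header from either a Headers/Map-like object or a plain record.
// HTTP-date values are ignored in this sketch.
export function readRetryAfterSeconds(err: unknown): number | undefined {
  const headers = (err as { headers?: unknown } | null)?.headers;
  if (headers === null || typeof headers !== "object") return undefined;
  const bag = headers as Record<string, unknown>;
  const getter = bag["get"];
  const raw: unknown =
    typeof getter === "function" ? getter.call(headers, "retry-after") : bag["retry-after"];
  const seconds = typeof raw === "string" ? Number(raw) : NaN;
  return Number.isFinite(seconds) && seconds >= 0 ? seconds : undefined;
}

export function backoffDelayMs(
  attempt: number, // 0 for the first retry
  opts: RetryOptions,
  retryAfterSeconds?: number,
  random: () => number = Math.random,
): number {
  // Full jitter: pick uniformly below min(maxDelayMs, baseDelayMs * 2^attempt).
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt);
  const jittered = Math.floor(random() * ceiling);
  // Never wait less than the server asked for.
  return Math.max(jittered, (retryAfterSeconds ?? 0) * 1000);
}

export async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable(err) || attempt >= opts.maxAttempts - 1) throw err;
      const waitMs = backoffDelayMs(attempt, opts, readRetryAfterSeconds(err));
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

It would wrap the existing list call, for example `withRetry(() => openai.vectorStores.files.list(vectorStoreId, params), { maxAttempts: 5, baseDelayMs: 1000, maxDelayMs: 30000 })`. With `retry-after: 120`, as in the 504 above, each retry would wait at least two minutes, so the cron job needs a time budget that allows for that.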

So it may also help to log:

  • vector store file counts

  • response latency per page

  • whether failures correlate with larger stores/page depths
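
For the per-page logging, something like the sketch below might be enough. The names `PageLog`, `timedPage`, and `failuresByPageIndex` are made up for illustration, and the only assumption about the page object is that it has a `data` array, as in the listing code earlier in the thread. File counts per store can come from the `file_counts` field on the retrieved vector store.

```typescript
// Hypothetical log record; the field names are illustrative only.
type PageLog = {
  storeId: string;
  pageIndex: number;            // 0 for the first page, 1 for the second, ...
  items: number;                // files returned on this page (0 when the request failed)
  latencyMs: number;
  status?: number | undefined;  // HTTP status, set only when the request failed
};

// Runs one page request, records how long it took, and rethrows any error unchanged.
export async function timedPage<T extends { data: unknown[] }>(
  storeId: string,
  pageIndex: number,
  fetchPage: () => Promise<T>,
  sink: PageLog[],
  now: () => number = Date.now,
): Promise<T> {
  const start = now();
  try {
    const page = await fetchPage();
    sink.push({ storeId, pageIndex, items: page.data.length, latencyMs: now() - start });
    return page;
  } catch (err) {
    const status = (err as { status?: number } | null)?.status;
    sink.push({ storeId, pageIndex, items: 0, latencyMs: now() - start, status });
    throw err;
  }
}

// Counts failed requests per page depth, to see whether errors cluster on later pages.
export function failuresByPageIndex(logs: PageLog[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const entry of logs) {
    if (entry.status === undefined) continue;
    counts.set(entry.pageIndex, (counts.get(entry.pageIndex) ?? 0) + 1);
  }
  return counts;
}
```

Inside the pagination loop, the list call would become `timedPage(vectorStoreId, pageIndex, () => openai.vectorStores.files.list(vectorStoreId, params), logs)`, where `pageIndex` and `logs` are variables you add yourself.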

And the fact this worked reliably for months is probably the most important signal here :face_with_crossed_out_eyes:

Hi S_z,

Thank you for such a quick response. I’ll try reducing the page limit and add some backoff. Persistence could be an option, but it requires significantly more engineering effort and the vector stores are not large. They have, at most, 300 resources.

The idea that listing 300 resources in a vector store is simply too much for the system to handle is laughable. It would be a massive disappointment if this is enough to cause a trillion dollar company’s API to fail.

To clarify, nothing is happening in parallel, I am synchronously running three vector store setups. One of them is empty because it relies on full script completion. For the other two, I have used this code:

const vectorStore = await openai.vectorStores.retrieve(vectorStoreId);
console.dir(vectorStore.file_counts, { depth: null });

To get this output for one that regularly succeeds:

{ in_progress: 0, completed: 172, failed: 0, cancelled: 0, total: 172 }

And this for the other that regularly fails:

{ in_progress: 0, completed: 268, failed: 4, cancelled: 0, total: 272 }

The only obvious difference is the failed count being non-zero. Could this be causing the issue?

Yeah, well ~300 resources really shouldn’t be enough by itself to trigger systemic pagination failures.

The failed: 4 correlation is much more interesting though. If the consistently failing vector store is also the only one containing failed entries, I wonder if the list/pagination path is hitting some bad state or retry edge case while resolving failed file metadata internally :thinking:

Hmm.. Especially since:

  • the successful store has zero failed files

  • the failing store consistently has non-zero failed entries

  • and the failures occur during listing rather than upload.

So it might be worth testing by manually removing/recreating only the failed entries (or recreating the store cleanly) to see whether the pagination instability disappears :smirking_face:
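
A rough sketch of the "remove only the failed entries" test is below. It assumes each listed vector store file carries a `status` field with the same values that appear as keys in `file_counts` (`in_progress`, `completed`, `failed`, `cancelled`). `failedFileIds`, `detachFailedFiles`, and the injected `detach` callback are hypothetical names; the actual detach call depends on the SDK version in use, so it is left to the caller.

```typescript
// Minimal shape of a listed vector store file; only the fields this sketch needs.
type VectorStoreFileLike = {
  id: string;
  status: "in_progress" | "completed" | "cancelled" | "failed";
};

// Picks out the IDs of files whose ingestion failed.
export function failedFileIds(files: VectorStoreFileLike[]): string[] {
  return files.filter((f) => f.status === "failed").map((f) => f.id);
}

// `detach` is a caller-supplied callback that removes one file from the store.
// It is injected here so the sketch does not assume a particular SDK method signature.
export async function detachFailedFiles(
  files: VectorStoreFileLike[],
  detach: (fileId: string) => Promise<void>,
): Promise<string[]> {
  const ids = failedFileIds(files);
  for (const id of ids) {
    await detach(id); // sequential on purpose, to keep request pressure low
  }
  return ids;
}
```

One caveat: this needs at least one successful listing to find the failed files, and listing is the step that is currently failing, so it may have to be combined with retries. The list endpoint also accepts a status filter, which might keep the response smaller, but whether that avoids the errors is untested here.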

I deleted the vector store and managed to get my script to complete. However, the specific issue I encountered wasn’t triggered because I didn’t need to list files from this vector store, so it’s unclear whether this issue has been resolved.

I’m not going to re-run my script in case it breaks again and I have to go through the irritating process of deleting the vector store and updating the platform to accommodate.

I don’t know if you work for OpenAI or not, but if you do, please can you raise this internally? I think it’s unlikely I’m the only one facing a problem, and this has already put me off using your vector store product. A breakdown of what happened and why would be great; otherwise, there is zero chance I will use this product again.

Haha nah, I don’t work for OpenAI :sob: just another brokedev goblin reading through the thread :downcast_face_with_sweat:

But honestly the fact deleting/recreating the vector store temporarily fixed it does make it sound like something state-related may’ve gotten corrupted or stuck internally rather than this being purely a client-side issue :thinking:

And yeah, I get the frustration. ~300 files really shouldn’t be enough to make a vector store workflow feel fragile :downcast_face_with_sweat:

Hey @Con, that can definitely be frustrating.

Since you've already tried a few approaches, one workaround that has helped others is treating the update as a migration rather than modifying the existing vector store:

  • Create a new vector store for the updated resource set
  • Add files in batches to reduce write pressure during ingestion
  • Wait until ingestion is fully complete
  • Run a small smoke test against the new store
  • Update the assistant to use the new vector store ID
  • Keep the previous vector store temporarily as a rollback option
  • Delete the old vector store only after the new one is confirmed stable

This isn't ideal, but it can help avoid issues during large updates or re-indexing operations.
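
The "wait until ingestion is fully complete" step in the list above could be sketched like this, using the same `file_counts` shape printed earlier in the thread. `ingestionState`, `waitForIngestion`, the poll interval, and the timeout are all assumptions for illustration, not part of the SDK.

```typescript
// Same shape as the file_counts objects shown earlier in the thread.
type FileCounts = {
  in_progress: number;
  completed: number;
  failed: number;
  cancelled: number;
  total: number;
};

type IngestionState = "pending" | "ready" | "has_failures";

// Classifies a store from its counts. `expectedTotal` is how many files you attached.
export function ingestionState(counts: FileCounts, expectedTotal: number): IngestionState {
  if (counts.in_progress > 0 || counts.total < expectedTotal) return "pending";
  if (counts.failed > 0 || counts.cancelled > 0) return "has_failures";
  return "ready";
}

// Polls until the store leaves the "pending" state or the timeout passes.
// `getCounts` is injected so this sketch does not depend on a specific client call.
export async function waitForIngestion(
  getCounts: () => Promise<FileCounts>,
  expectedTotal: number,
  pollMs = 5000,
  timeoutMs = 10 * 60 * 1000,
): Promise<IngestionState> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const state = ingestionState(await getCounts(), expectedTotal);
    if (state !== "pending" || Date.now() >= deadline) return state;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

`getCounts` could be `async () => (await openai.vectorStores.retrieve(newStoreId)).file_counts`, with `newStoreId` being the ID of the freshly created store. Switching the assistant over only when the result is `"ready"` keeps the old store available as the rollback option.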

Props to @S_z too

Avinash

Calling it a “large update” is an absolute joke. It is, by no means, a large update. It is a piddling, tiny update, and it’s embarrassing that your system can’t handle it.

We are exploring the best way to separate ourselves from your vector store. If this is a widespread problem, I would be astonished if others aren’t also considering how best to leave.

I can understand the frustration.

Thanks for sharing the error outputs and response headers earlier. One thing that stands out is that the 504 example returned requestID: null.

I'd appreciate it if you could share the request ID from a recent failed request. If it shows null, please provide:

  • 2–3 recent failure timestamps (including timezone)
  • The SDK version you're using
  • Sanitized full error messages and response headers from those attempts

Avinash
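
One way to capture the details requested above at the moment a failure happens is sketched below. `buildFailureReport`, `FailureReport`, and the list of redacted headers are hypothetical; the sketch assumes the error object has the `status`, `requestID`, `message`, and `headers` fields visible in the error dumps earlier in the thread, and that you pass in the SDK version yourself (for example, from your lockfile).

```typescript
type FailureReport = {
  timestampUtc: string; // ISO 8601, always UTC
  sdkVersion: string;
  status: number | null;
  requestID: string | null;
  message: string;
  headers: Record<string, string>;
};

// Assumption: header names to blank out before sharing. Adjust to your own policy.
const REDACTED_HEADERS = new Set(["set-cookie", "authorization"]);

// Accepts a Headers/Map-like object (anything with forEach) or a plain record.
function headerEntries(headers: unknown): Array<[string, string]> {
  if (headers === null || typeof headers !== "object") return [];
  const out: Array<[string, string]> = [];
  const forEach = (headers as Record<string, unknown>)["forEach"];
  if (typeof forEach === "function") {
    // Headers and Map both invoke the callback as (value, key).
    forEach.call(headers, (value: unknown, key: unknown) => {
      out.push([String(key), String(value)]);
    });
  } else {
    for (const [key, value] of Object.entries(headers)) out.push([key, String(value)]);
  }
  return out;
}

export function buildFailureReport(
  err: unknown,
  sdkVersion: string,
  at: Date = new Date(),
): FailureReport {
  const e = (err ?? {}) as {
    status?: unknown;
    requestID?: unknown;
    message?: unknown;
    headers?: unknown;
  };
  const headers: Record<string, string> = {};
  for (const [key, value] of headerEntries(e.headers)) {
    const name = key.toLowerCase();
    headers[name] = REDACTED_HEADERS.has(name) ? "<removed>" : value;
  }
  return {
    timestampUtc: at.toISOString(),
    sdkVersion,
    status: typeof e.status === "number" ? e.status : null,
    requestID: typeof e.requestID === "string" ? e.requestID : null,
    message: typeof e.message === "string" ? e.message : String(err),
    headers,
  };
}
```

Calling `console.log(JSON.stringify(buildFailureReport(err, sdkVersion), null, 2))` in the `catch` block of the sync script would produce one shareable record per failure, with the timestamp already in UTC.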