Parallelism/scaling in embedding endpoint

LinqLover · November 29, 2023, 10:00pm

I wondered whether the embeddings endpoint uses parallelism or elastic scaling to process multiple documents from a single request in parallel. After a few short experiments, it seems that it does not. Consequently, when you are embedding a huge corpus, there is no need to maximize your chunk size and send a single request per minute. You can instead split the corpus into a larger number of smaller chunks to get smoother progress updates, as long as network overhead does not become significant. This observation, however, is only a snapshot of the current server load and the ada-002 model, and OpenAI might change this behavior in the future. Maybe this saves someone else a few minutes.
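As a rough illustration of the smaller-chunks approach, here is a minimal sketch that splits a corpus into fixed-size batches and reports progress after each one. The names `chunked`, `embed_corpus`, `embed_batch`, and `on_progress` are hypothetical; `embed_batch` stands in for whatever function actually sends one request to the embeddings endpoint and returns one vector per input string.

```python
from typing import Callable, Iterator, List, Optional, Sequence

Vector = List[float]


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_corpus(
    docs: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical: one API request
    batch_size: int = 16,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> List[Vector]:
    """Embed `docs` batch by batch, calling `on_progress(done, total)` after each batch."""
    vectors: List[Vector] = []
    for batch in chunked(docs, batch_size):
        vectors.extend(embed_batch(list(batch)))
        if on_progress is not None:
            on_progress(len(vectors), len(docs))
    return vectors
```

A smaller `batch_size` gives more frequent progress callbacks at the cost of more round trips, which is the trade-off described above.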

_j · November 30, 2023, 2:24am

You are likely referring to the embeddings endpoint’s ability to take not just a string, but also a list of strings (array), and return an embedding for each of them.

There is a clue that lets you understand how this works: the total across all the strings you send is still capped at 8k tokens.

If the endpoint were dispatching multiple inputs to multiple AI instances, this limitation on the total input would not make sense.

What does make sense, though, is that the entire embedding request is loaded into the AI's context, and the hidden embedding state at the end of each sequence is returned as that AI processes the sequences one by one. There are “resets” at each input boundary as it works through the context.
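If a shared token budget like this applies, the client has to pack its strings into requests itself. The following is only a sketch under that assumption: `pack_by_token_budget` and `approx_token_count` are hypothetical names, the 8000 default simply mirrors the figure mentioned above, and the whitespace-based counter is a crude stand-in for the model's real tokenizer.

```python
from typing import Callable, Iterable, List


def approx_token_count(text: str) -> int:
    """Crude stand-in: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def pack_by_token_budget(
    texts: Iterable[str],
    max_tokens: int = 8000,  # assumed per-request total, taken from the post above
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[List[str]]:
    """Greedily group texts, in order, so each group stays within `max_tokens`."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        if cost > max_tokens:
            raise ValueError("a single text exceeds the token budget")
        # Start a new batch when adding this text would overflow the current one.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Passing a real tokenizer-based counter as `count_tokens` would make the packing accurate; the greedy strategy itself stays the same.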

Interestingly though, with language inference, you get a significant speedup with n>1, which is asking for multiple outputs for the same prompt.

This has to be more than just precalculation of the state from input shared between instances, given the magnitude of the total token rate increase from such a job. Parallelism is indicated.

LinqLover · November 30, 2023, 1:30pm

Yes, you are right, thank you for the additional context. I need to look into making parallel requests, but honestly, this is something that the API should handle for me. I noticed the script at https://github.com/openai/openai-cookbook/blob/main/examples/api_request_parallel_processor.py but I will have to port it to the language I am working with.
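For reference, the basic idea of client-side parallel requests can be sketched in a few lines with a thread pool. This is not the cookbook script, only a minimal illustration: `embed_batches_concurrently` and `embed_batch` are hypothetical names, `embed_batch` stands in for a function that sends one request, and the sketch leaves out the rate-limit throttling and retries that a real client would need.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Vector = List[float]


def embed_batches_concurrently(
    batches: Sequence[List[str]],
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical: one API request
    max_workers: int = 4,
) -> List[Vector]:
    """Send one request per batch on a thread pool and return vectors in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `batches`,
        # even if the underlying requests finish out of order.
        results = list(pool.map(embed_batch, batches))
    # Flatten the per-batch results into one list of vectors.
    return [vector for batch_vectors in results for vector in batch_vectors]
```

Threads are enough here because the work is network-bound; `max_workers` caps how many requests are in flight at once.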

What are you referring to with language inference? Are we still talking about embeddings?

_j · November 30, 2023, 6:00pm

Inference → completion that deduces the best output → talking to a chatbot
