Embedding large number of sentences

andgomes · February 27, 2023, 3:39pm

I am trying to create embeddings for more than 15000 sentences, but when I run the code with more than 2048 sentences the embedding fails because of the token limit. The variable docs is a list containing the more than 15000 sentences. Is there a way to create such an embedding by changing something in my code?

response = openai.Embedding.create(
    input=docs,
    model="text-embedding-ada-002")
embeddings = response['data'][0]['embedding']
print(embeddings)

PaulBellow · February 27, 2023, 4:40pm

Is your billing up to date? Credit card attached?

What error is it failing with?

andgomes · February 27, 2023, 4:52pm

Yes, it only works when I use a sample of the data.

The error does not seem to be related to the problem at hand, since it works with a sample of the data below 2048 sentences.

The error is: InvalidRequestError
----> 2 response = openai.Embedding.create(
      3     input=docs[1000:3049],
      4     model="text-embedding-ada-002")

(THE DOCS) is not valid under any of the given schemas - 'input'

PaulBellow · February 27, 2023, 5:01pm

Are you aware there’s a limit on tokens for embedding?

curt.kennedy · February 27, 2023, 5:37pm

Indeed, it is 8191 tokens max, which is approx 6000 words. Assuming each sentence has 15 words, this is about 400 sentences. But the real limit is tokens.
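
The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the ratio of 0.75 words per token is a common rule of thumb and an assumption here, not a figure from the thread, and the function name is made up for illustration. It describes how many average sentences would fit if they were joined into one input.

```python
# Rough capacity estimate for a single embedding input.
TOKEN_LIMIT = 8191         # per-input token limit quoted in the thread
WORDS_PER_TOKEN = 0.75     # assumption: rule-of-thumb ratio for English text
WORDS_PER_SENTENCE = 15    # average sentence length assumed in the thread


def estimate_capacity(token_limit=TOKEN_LIMIT,
                      words_per_token=WORDS_PER_TOKEN,
                      words_per_sentence=WORDS_PER_SENTENCE):
    """Return (approx. words, approx. sentences) that fit in one input."""
    words = token_limit * words_per_token
    sentences = words / words_per_sentence
    return int(words), int(sentences)


if __name__ == "__main__":
    words, sentences = estimate_capacity()
    print(f"about {words} words, about {sentences} sentences")
```

An exact count would need the model's tokenizer; the numbers here only show where "approx 6000 words" and "400 sentences" come from.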

andgomes · February 27, 2023, 6:18pm

Yes, I know there is a limit on tokens for embedding.

However, when I used another embedding model, like "all-MiniLM-L6-v2", which has a token limit of 128 tokens, it worked, because the limit only applied within each sentence, and no sentence had more than the token limit.

When I apply the same algorithm with "text-embedding-ada-002", the limit seems to be applied to the number of sentences, not to the text within them. My problem is that, if that is true, I would have to combine the sentences, and that would not give the result I need, since I need to know which cluster each of the more than 15000 sentences belongs to.

curt.kennedy · February 27, 2023, 6:23pm

Yes, you get a single embedding vector for each chunk of text you pass to the API. So if you want vectors for each sentence, you have to split the data into sentences and feed them into the API one at a time.

People embed at different levels (word, sentence, paragraph, page) to form different contexts. The API provides this freedom.
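
A minimal sketch of getting one vector per sentence without sending 15000 separate requests. It assumes the legacy openai-python 0.x call shown in the original post, where a list passed as input returns one embedding per list entry, and it takes the 2048-item ceiling from the failures reported above. The helper names (chunked, openai_embed_batch, embed_sentences) are made up for illustration.

```python
from typing import Callable, Iterator, List, Sequence

# Assumption taken from the thread: requests with more than 2048 inputs failed.
MAX_INPUTS_PER_REQUEST = 2048


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def openai_embed_batch(batch: Sequence[str]) -> List[List[float]]:
    """Embed one batch using the legacy openai-python (0.x) call from the thread."""
    import openai  # imported lazily so the rest of the sketch runs without it

    response = openai.Embedding.create(
        input=list(batch),
        model="text-embedding-ada-002",
    )
    # Sort by the returned index so vectors line up with the input order.
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


def embed_sentences(
    sentences: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]] = openai_embed_batch,
    batch_size: int = MAX_INPUTS_PER_REQUEST,
) -> List[List[float]]:
    """Return one vector per sentence, in the same order as `sentences`."""
    embeddings: List[List[float]] = []
    for batch in chunked(sentences, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings
```

With this shape, embed_sentences(docs)[i] is the vector for docs[i], so each sentence can still be assigned to its own cluster afterwards.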

andgomes · February 27, 2023, 6:35pm

Yes, I made some tests and that seems to be the solution. I checked how the embedding worked with the other model and it did one for each sentence.

Thank you guys for the help, I will make some more tests and provide the code here if it all works correctly.

PaulBellow · February 27, 2023, 6:37pm

Good to hear. Best of luck with your project!

andgomes · March 3, 2023, 4:48pm

I was able to run the code but it is very slow, because it sends one request at a time in a loop. Is there a way to send all requests at once for openai, so that it can be parallelized?

jlqueguiner · March 4, 2023, 10:26am

use langchain for that: Embeddings — 🦜🔗 LangChain 0.0.100

it will automatically handle the split and parallelization

andgomes · March 4, 2023, 3:03pm

Wow it worked really well, 100x faster. Thanks a lot.

Do you know how it works? I tried reading the docs but it is not clear if it passes all sentences as one text and then splits them in the return, or it just sends each sentence in parallel, or something else.
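
The thread does not settle how LangChain does this internally, so the following is not a description of its implementation. It is only a sketch of one way to get a similar speed-up by hand: split the sentences into batches and send the batches concurrently with a thread pool. The function name embed_in_parallel, the embed_batch callable (any function that takes a list of sentences and returns one vector per sentence), and the default of 4 workers are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def embed_in_parallel(
    sentences: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 2048,   # assumption: the per-request ceiling seen in the thread
    max_workers: int = 4,     # assumption: arbitrary illustrative value
) -> List[List[float]]:
    """Embed batches concurrently and return vectors in input order."""
    batches = [
        sentences[start:start + batch_size]
        for start in range(0, len(sentences), batch_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `batches`,
        # even if the requests finish out of order.
        results = list(pool.map(embed_batch, batches))
    # Flatten the per-batch lists back into one vector per sentence.
    return [vector for batch_vectors in results for vector in batch_vectors]
```

Each sentence still gets its own vector; only the requests overlap in time. In real use the number of workers would need to stay within the API's rate limits.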

Gonchik · August 27, 2023, 12:30pm

Did you find any answers to your questions?
I ran into the same situation and am also wondering why that collection works.
