Embedding a large number of sentences

I am trying to create embeddings for more than 15,000 sentences. However, when I run the code with more than 2048 sentences, the embedding fails because of the token limit. The variable docs is a list containing the more than 15,000 sentences. Is there a way to create such an embedding by changing something in my code?

import openai

response = openai.Embedding.create(
    input=docs,
    model="text-embedding-ada-002")
# data[0] holds the embedding of the first input only
embeddings = response['data'][0]['embedding']
print(embeddings)

Is your billing up to date? Is a credit card attached?

What error is it failing with?

Yes, it only works when I use a sample of the data.

The error does not seem to be related to the problem at hand, since it works with a sample of the data below 2048 sentences.

The error is: InvalidRequestError
----> 2 response = openai.Embedding.create(
      3     input=docs[1000:3049],
      4     model="text-embedding-ada-002")

(THE DOCS) is not valid under any of the given schemas - 'input'
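
Note that the slice docs[1000:3049] contains 2049 items, one more than 2048. The following is a minimal sketch of a pre-flight check, assuming the endpoint rejects lists longer than 2048 items, non-string items, and empty strings. The names check_embedding_input and MAX_INPUTS are hypothetical helpers for illustration, not part of the openai library.

```python
MAX_INPUTS = 2048  # assumed per-request cap on the number of list items


def check_embedding_input(texts):
    """Return a list of problems found; an empty list means the input looks valid."""
    problems = []
    if len(texts) > MAX_INPUTS:
        problems.append(f"{len(texts)} items exceeds the cap of {MAX_INPUTS}")
    for i, text in enumerate(texts):
        if not isinstance(text, str):
            # e.g. a NaN float left over from a pandas column
            problems.append(f"item {i} is not a string: {type(text).__name__}")
        elif text == "":
            problems.append(f"item {i} is an empty string")
    return problems
```

Running check_embedding_input(docs[1000:3049]) before the request would flag the list length, and would also surface stray non-string entries if there are any.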

Are you aware there’s a limit on tokens for embedding?

Indeed, it is 8191 tokens max, which is approximately 6000 words. Assuming each sentence has 15 words, that is about 400 sentences. But the real limit is in tokens.

Yes, I know there is a limit on tokens for embedding.

However, when I used another embedding model, such as "all-MiniLM-L6-v2", which has a limit of 128 tokens, it worked, because the limit applied only within each sentence and no sentence exceeded it.

When I apply the same algorithm with "text-embedding-ada-002", the limit seems to apply to the number of sentences, not to the text within them. My problem is that, if that is true, I would have to combine the sentences, and that would not give the result I need, since I need to know which cluster each of the more than 15,000 sentences belongs to.

Yes, you get a single embedding vector for each chunk of text you pass to the API. So if you want vectors for each sentence, you have to split the data into sentences and feed them into the API one at a time.

People embed at different levels (word, sentence, paragraph, page) to form different contexts. The API provides this freedom.
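
As an alternative to one request per sentence, the sentences can be sent in batches. This is a minimal sketch, assuming the endpoint accepts a list of up to 2048 strings per request and returns one embedding per list item (consistent with the original post, where samples under 2048 sentences worked). The helpers embed_in_batches and openai_embed_batch are hypothetical names, and openai_embed_batch uses the same pre-1.0 openai call as the original post.

```python
MAX_INPUTS = 2048  # assumed per-request cap on the number of list items


def embed_in_batches(texts, embed_batch, batch_size=MAX_INPUTS):
    """Embed texts in batches and return one vector per text, in input order.

    embed_batch is any callable that maps a list of strings to a list of vectors.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("expected one vector per input text")
        vectors.extend(batch_vectors)
    return vectors


def openai_embed_batch(batch):
    """One API request for one batch (assumes the pre-1.0 openai package)."""
    import openai  # imported here so the helper above works without it

    response = openai.Embedding.create(
        input=batch, model="text-embedding-ada-002")
    # Each returned item carries the index of the input it belongs to.
    data = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in data]
```

With this, embeddings = embed_in_batches(docs, openai_embed_batch) would give a list where embeddings[i] belongs to docs[i], so each sentence can still be assigned to its own cluster.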

Yes, I ran some tests and that seems to be the solution. I checked how the embedding worked with the other model, and it produced one embedding per sentence.

Thank you guys for the help. I will run some more tests and post the code here if it all works correctly.

Good to hear. Best of luck with your project!

I was able to run the code, but it is very slow because it sends one request at a time in a loop. Is there a way to send all requests to OpenAI at once, so that they can be parallelized?

Use LangChain for that: see the Embeddings page of the LangChain 0.0.100 docs.

It will automatically handle the split and parallelization.

Wow, it worked really well, 100x faster. Thanks a lot.

Do you know how it works? I tried reading the docs but it is not clear if it passes all sentences as one text and then splits them in the return, or it just sends each sentence in parallel, or something else.
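
How LangChain does this internally is not established in this thread. Independently of that, one generic way to get a similar speed-up is to combine batching with a small thread pool. The sketch below assumes an embed_batch callable that turns a list of strings into a list of vectors (for example, a wrapper around openai.Embedding.create); embed_concurrently is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_concurrently(texts, embed_batch, batch_size=2048, max_workers=4):
    """Embed texts batch by batch on a thread pool, keeping input order."""
    batches = [
        list(texts[start:start + batch_size])
        for start in range(0, len(texts), batch_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the submitted batches,
        # even when the requests finish out of order.
        results = pool.map(embed_batch, batches)
        return [vector for batch_vectors in results for vector in batch_vectors]
```

Threads fit here because each request spends most of its time waiting on the network. Keeping max_workers small is a cautious default, since the API rate-limits requests.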

Did you find answers to your questions? I ran into the same situation and would also like to understand why that works.
