Please help: why did embedding take more than 500 minutes?

Dear Gents,
I am trying to follow the OpenAI tutorial at https://platform.openai.com/docs/tutorials/web-qa-embeddings/building-a-question-answer-system-with-your-embeddings; I made some changes so it works with the latest OpenAI methods.
I successfully ran it on a small website, and it gave very good answers.

But when I used it on a large website, the following Python cell took more than 500 minutes without finishing:
################################################################################
### Step 10
################################################################################

# Note that you may run into rate limit issues depending on how many files you try to embed
# Please check out our rate limit guide to learn more on how to handle this: https://platform.openai.com/docs/guides/rate-limits

df['embeddings'] = df.text.apply(
    # lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])
    lambda x: create_embedding(x, api_key="my api key oooooooooooooooooo"))
df.to_csv('processed/' + local_domain + '/embeddings.csv')
df.head()
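For context, create_embedding is just my small wrapper around the newer client; roughly something like this sketch (assuming the openai>=1.0 Python client, with the model name taken from the tutorial):

from openai import OpenAI

def create_embedding(text, api_key):
    # Requests a single embedding per data-frame row; for many rows you would reuse one client instance
    client = OpenAI(api_key=api_key)
    response = client.embeddings.create(input=text, model="text-embedding-ada-002")
    return response.data[0].embedding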

The size of the scraped.csv file is 17.393M; is this a huge file for embedding?
The total number of tokens inside the data frame is 6,769,846; is this a huge number of tokens?
I have $5.70 / $10.00 as my balance; sorry for sharing this info, but I want to avoid any surprises.

Any help please; I badly need to finish this embedding task.

Regards,
Omran

That much data will take a while, I’m afraid.

But be sure to store it in persistent storage, like some sort of database, as the tutorial is only meant as a quick demo, not for production. Here is the fine print:

[screenshot of the fine print at the bottom of the tutorial page]

2 Likes

Thanks Curt…
I did not get the point regarding the screenshot.
Is it a general screenshot, or is it a follow-up/tracking shot of my currently running embedding?

Thanks for your support.

The screenshot is the fine print at the bottom of the demo page you linked to.

Normally, you embed some stuff and store it in persistent storage. So if you put everything in your data frame and don’t save it, you lose all your embeddings and have to re-embed. :frowning_face:

What I do is save all my embeddings in a database, but extract the vectors into a file, like a Python pickle, and then load this pre-processed file into memory for correlation. The text is then looked up in the database via the hash corresponding to the highest-correlated vectors.
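Something like this minimal sketch, if it helps (the pickle layout, hash scheme, and function names are just illustrative; it assumes numpy and one embedding per chunk of text):

import hashlib
import pickle
import numpy as np

# Persist the vectors once, keyed by a hash of the text they came from
def save_vectors(chunks, embeddings, path="vectors.pkl"):
    store = {hashlib.sha256(c.encode()).hexdigest(): np.array(e) for c, e in zip(chunks, embeddings)}
    with open(path, "wb") as f:
        pickle.dump(store, f)

# Load the pre-processed file and return the hashes of the best-matching vectors
def top_hashes(query_vector, path="vectors.pkl", k=5):
    with open(path, "rb") as f:
        store = pickle.load(f)
    hashes = list(store.keys())
    matrix = np.array([store[h] for h in hashes])
    q = np.array(query_vector)
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))  # cosine similarity
    return [hashes[i] for i in np.argsort(sims)[::-1][:k]]

The text for the top hashes then comes back out of the database.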

Others find it easier to use a managed vector DB provider, like Weaviate. This is the part that “makes embedding hard”: deciding where to store the data and handling the related processing details.

But whatever you do, you don’t want to embed all that data over and over, which is what the demo does if that’s all you use.

So break your problem into stages. First do a big chunking/embedding phase … it will take a while, so store the results off so that you can re-use them.

Then comes your processing/correlation stage, which is best done in-memory for speed, or use a provider that will do this for you.

Finally, your prompting stage, where you craft your prompts to feed to the LLM. This usually includes recent history, some older history, and the relevant information retrieved from your correlations.
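The prompt assembly itself is usually nothing fancier than this kind of sketch (the variable names and template are just illustrative):

def build_prompt(question, retrieved_chunks, recent_history, older_history):
    # Stitch together retrieved context plus conversation history for the LLM
    context = "\n\n".join(retrieved_chunks)
    history = "\n".join(older_history[-5:] + recent_history)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Question: {question}\nAnswer:"
    )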

1 Like

I’ve been considering using a small open-source model to do embeddings, rather than cloud services, because many embedding use cases require you to get a vector for every paragraph of text (or whatever your smaller artifacts are), and that’s a lot of load/expense to do over a network.

This seems practical to me because you can use your own embedding solution completely independently of OpenAI, but still use OpenAI (or another cloud AI) to do the “heavy lifting” of actual question answering. Smaller LLMs with a lower parameter count seem like they might be good enough for creating vectors, simply because it’s just semantic meaning involved, and it doesn’t need the “cutting edge near-AGI” models that are used for actual reasoning.
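For example, a local embedding pass could be as small as this sketch (sentence-transformers and the all-MiniLM-L6-v2 model are my own assumptions here, not something from the tutorial):

from sentence_transformers import SentenceTransformer

# Small local model; runs on CPU and never sends your text over the network
model = SentenceTransformer("all-MiniLM-L6-v2")

paragraphs = ["First paragraph of a page...", "Second paragraph..."]
vectors = model.encode(paragraphs)  # one vector per paragraph, as a numpy array
print(vectors.shape)  # (2, 384) for this model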

I may be wrong, I’m just sharing some thoughts.

2 Likes

When you’re dealing with large data sets that are taking too long to process, breaking the data into smaller, more manageable chunks can significantly improve performance. This technique, often referred to as “chunking,” allows you to reduce the size of each individual data packet being processed at any given time.

Here’s a detailed explanation of the process:

  1. Chunking: First, you’ll need a mechanism (a chunker) to divide the large dataset into smaller segments. The goal is to split the data in such a way that each chunk is independent and can be processed without needing data from other chunks. This makes it easier to handle both in terms of memory and processing power.

  2. Asynchronous Processing: Running the chunks asynchronously means that you don’t have to wait for one chunk to finish processing before starting on another. As soon as a chunk is ready, it can be processed, even if other chunks are still being worked on. This is crucial for maximizing efficiency because it keeps all your resources busy without idle time.

  3. Parallel Processing: In addition to running tasks asynchronously, you can also process multiple chunks in parallel, especially if you have a multi-core processor or distributed computing resources. This means that several chunks of data are being processed at the same time, rather than sequentially.

  4. Maximizing Throughput: By splitting your data into chunks and processing them asynchronously and in parallel, you can significantly increase the amount of data you process per minute. This leads to faster overall results because you’re making full use of your computing resources, reducing idle times, and avoiding the bottlenecks that come with processing a large dataset as a single, monolithic block.

This approach allows you to manage and process large datasets more efficiently. By breaking the data down into smaller pieces, you make better use of your computing resources, reduce processing time, and achieve higher throughput. A minimal sketch of the pattern follows below.
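As a sketch only, assuming a create_embedding(text) helper like the one earlier in the thread (with the API key baked in) and Python’s concurrent.futures; the batch size and worker count are arbitrary and should respect your rate limits:

from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    # Split the full list of texts into independent, smaller batches
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_batch(texts):
    # Each batch is processed on its own; no batch depends on another
    return [create_embedding(t) for t in texts]

def embed_all(texts, batch_size=100, workers=4):
    batches = chunk(texts, batch_size)
    # Run several batches at the same time instead of one after another
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)
    return [vec for batch in results for vec in batch]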

1 Like

Thank you all for your support. I solved the problem based on your idea of dividing the task into sub-tasks… In each sub-task, I embedded 3,000 messages (each message about 500 tokens) of the overall 15,000 messages. It took 15-18 minutes for each sub-task…

1 Like

OK, next you want to make them async/parallel and run them as a batch at the same time, while keeping in mind your max rate limits. What this will do is make X messages take about the same processing time as the first one, or slightly more, reducing your time to about 45 seconds or less depending on your connection.
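A semaphore is one rough way to keep the concurrent requests under your rate cap. Just a sketch; the limit of 10 and the simulated embed_one call are placeholders to adapt to your own client:

import asyncio

async def embed_one(text, semaphore):
    async with semaphore:  # never more than the allowed number of requests in flight
        await asyncio.sleep(0.1)  # stand-in for the real async embedding call
        return [0.0]  # placeholder vector

async def embed_all(texts, max_in_flight=10):
    semaphore = asyncio.Semaphore(max_in_flight)
    tasks = [embed_one(t, semaphore) for t in texts]
    return await asyncio.gather(*tasks)

# vectors = asyncio.run(embed_all(messages))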

Enjoy, but watch your costs. With faster processing it’s easy to forget how much data you’re testing with sometimes, lol.