Embedding with large quantity of data

ha0n · September 22, 2023, 10:15am

I’m currently working on a Langchain app that reads multiple PDF and XLSX files. I’m running into an issue where I hit my token limit after only a couple of MB. If I want to feed the LLM, let’s say, 10-15 GB of data, would embedding be the right approach for this? Are there any good alternatives?

Foxalabs · September 22, 2023, 10:18am

Hi and welcome to the Developer Forum!

You might want to look at rate limiting your requests so that you stay within your current limits. Langchain adds extra tokens for its internal prompts, so working out your real usage may take some effort. If you have a large data-processing requirement, embedding can be of use, but it depends on how you subsequently use that data.
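As a rough illustration of the rate-limiting idea, here is a minimal sketch of a sliding-window token budget. Everything in it is an assumption made for illustration: the `TokenBudget` class is hypothetical (not part of Langchain or the OpenAI client), the limit and window are placeholders for whatever your account actually allows, and counting the tokens in each request (including any that Langchain adds internally) is left to the caller.

```python
import time
from collections import deque


class TokenBudget:
    """Hypothetical sliding-window limiter: at most `limit` tokens per `window` seconds."""

    def __init__(self, limit, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.events = deque()   # (timestamp, tokens) for requests still inside the window
        self.used = 0

    def _expire(self, now):
        # Forget requests that have left the window and give their tokens back.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def acquire(self, tokens):
        """Block until `tokens` fit into the current window, then record them."""
        if tokens > self.limit:
            raise ValueError("a single request exceeds the whole budget")
        while True:
            now = self.clock()
            self._expire(now)
            if self.used + tokens <= self.limit:
                self.events.append((now, tokens))
                self.used += tokens
                return
            # Budget is full: wait until the oldest recorded request expires.
            self.sleep(self.window - (now - self.events[0][0]))
```

You would call `budget.acquire(estimated_tokens)` right before each request, so the wait happens on your side instead of surfacing as a limit error from the API.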

ha0n · September 22, 2023, 11:24am

Thank you for the quick response. But if I’m already hitting a token limit with small datasets, I do wonder whether the embedding approach makes sense in general. Wouldn’t splitting the requests significantly increase the response time and potentially also lower the quality of the data returned? What about fine-tuning for larger datasets, or running something locally (a Hugging Face LLM, for example)?

N2U · September 22, 2023, 11:43am

Hey champ and welcome to the community forum!

10-15 GB is a lot of text; that definitely won’t fit inside the context window of any of OpenAI’s models. I think this sounds like an application for retrieval-augmented generation (RAG). You’ll find more information in this paper.
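To make the RAG idea concrete, here is a minimal sketch of the flow: split the documents into chunks, embed each chunk once, and at question time send only the few most similar chunks to the model. It is a toy under stated assumptions, not a production setup: `embed` is a bag-of-words stand-in for a real embedding model, the index is a plain list rather than a vector database, the sample documents are made up, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character windows (requires overlap < size)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, index, k=2):
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Only the retrieved chunks go into the prompt, not the whole corpus."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up sample documents standing in for text extracted from PDFs and spreadsheets.
documents = [
    "Invoices are archived for seven years in the finance share.",
    "The quarterly sales spreadsheet lists revenue per region.",
    "Office plants are watered every Monday morning.",
]

# Embed every chunk once, up front; this is the part that scales with corpus size.
index = [(chunk, embed(chunk)) for doc in documents for chunk in chunk_text(doc)]

question = "How long are invoices archived?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The point of the design is that the prompt size depends on `k` and the chunk size, not on how much data you have indexed, which is why this approach stays within the context window even for many gigabytes of source text.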
