Creating embeddings for large text fields from MongoDB

Hi, some context: I am using a MongoDB database where I have large texts as fields in my documents, and I want to generate embeddings for them. The issue is that the texts are longer than the maximum token limit for text-embedding-3-large, so I was looking for a solution.

I thought that I could split the text into multiple fields when I store it in the database, but I don’t really know whether an embedding can span multiple fields, or if that is even a good idea.

Another idea would be to chunk the texts when I build my dataframes for embedding, but that would mean creating multiple embeddings per document.

Lastly, I am wondering if there is another embedding model I haven’t considered that can handle large texts.

I am using this for my undergrad thesis and can’t really afford to spend too much money on the API, but this embedding model is the only one that works in my language, so I wanted to ask before I spend all my API credits searching for the best method.

Thank you!


Hi and welcome to the Developer Forum!

You are describing a technique called “chunking”, and it is widely used in embedding and document-retrieval pipelines.

You can either develop a chunking algorithm of your own, or use one of the many libraries that will do it for you; in fact, most popular vector databases have this feature built in.
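
For example, here is a minimal sketch of token-based chunking in Python, assuming the tiktoken and openai packages; the chunk size, overlap, and function names are illustrative choices, not fixed recommendations:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by text-embedding-3-large

def chunk_text(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens tokens each."""
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed every chunk; returns one vector per chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunks,  # the endpoint accepts a list, so this is a single request
    )
    return [item.embedding for item in response.data]

# Usage: one embedding per chunk, stored alongside the source document.
# vectors = embed_chunks(chunk_text(my_long_text))
```

Overlapping the chunks helps preserve context across chunk boundaries, and passing the chunks as a list keeps it to one API call. If you need a single vector per document, averaging the chunk vectors is a common (though lossy) fallback.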


Since the MongoDB team has apparently decided not to include their vector DB technology in the free version of their software, my plan is to use the vector DB extension for PostgreSQL when I get around to doing RAG work.
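
For reference, a minimal sketch of that setup, assuming the extension in question is pgvector used through psycopg 3 and the pgvector Python package; the connection string, table name, and sample data are hypothetical:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Hypothetical connection string; adjust to your own database.
conn = psycopg.connect("dbname=thesis", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg adapt numpy arrays to the vector type

# text-embedding-3-large returns 3072-dimensional vectors.
conn.execute(
    "CREATE TABLE IF NOT EXISTS doc_chunks ("
    "id bigserial PRIMARY KEY, content text, embedding vector(3072))"
)

# Store a chunk with a placeholder embedding (illustration only).
conn.execute(
    "INSERT INTO doc_chunks (content, embedding) VALUES (%s, %s)",
    ("example chunk", np.random.rand(3072).astype(np.float32)),
)

# Retrieve the nearest chunks to a query vector by cosine distance (<=>).
query_vec = np.random.rand(3072).astype(np.float32)
rows = conn.execute(
    "SELECT content FROM doc_chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_vec,),
).fetchall()
```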
