Creating embeddings for large text fields from MongoDB

Hi, some context: I am using a MongoDB database where I have large texts as fields in my documents, and I want to generate embeddings for them. The issue is that the texts are longer than the maximum token limit for text-embedding-3-large, so I was looking for a solution.

I thought that I could split the text into multiple fields when I store it in the database, but I don’t really know whether an embedding can span multiple fields, or if that is even a good idea.

Another idea would be to chunk the texts when I build my dataframes for embedding, but that would mean creating multiple embeddings per document.

Lastly, I am wondering if there is another embedding model I haven’t considered that can handle large texts.

I am using this for my undergrad thesis and can’t really afford to spend too much money on the API, but this embedding model is the only one that works in my language, so I wanted to ask before I spend all my API credits searching for the best method.

Thank you!


Hi and welcome to the Developer Forum!

You are describing a technique called “chunking”, and it is widely used in embedding and document-retrieval pipelines.

You can either develop a chunking algorithm of your own, or use one of the many libraries that will do it for you; in fact, most popular vector databases have this feature built in.
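
For example, here is a minimal sketch of token-based chunking in Python, assuming the tiktoken and openai packages; the chunk size, overlap, and function names are illustrative choices, not fixed recommendations:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by text-embedding-3-large

def chunk_text(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens tokens each."""
    tokens = enc.encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed every chunk; returns one vector per chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunks,  # the endpoint accepts a list, so this is a single request
    )
    return [item.embedding for item in response.data]

# Usage: one embedding per chunk, stored alongside the source document.
# vectors = embed_chunks(chunk_text(my_long_text))
```

Overlapping the chunks helps preserve context across chunk boundaries, and passing the chunks as a list keeps it to one API call. If you need a single vector per document, averaging the chunk vectors is a common (though lossy) fallback.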


Since the MongoDB team has apparently decided not to include their vector DB technology in the free version of their software, my plan is to use the vector DB extension for PostgreSQL when I get around to doing RAG work.
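
For reference, a minimal sketch of that setup, assuming the extension in question is pgvector used through psycopg 3 and the pgvector Python package; the connection string, table name, and sample data are hypothetical:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Hypothetical connection string; adjust to your own database.
conn = psycopg.connect("dbname=thesis", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg adapt numpy arrays to the vector type

# text-embedding-3-large returns 3072-dimensional vectors.
conn.execute(
    "CREATE TABLE IF NOT EXISTS doc_chunks ("
    "id bigserial PRIMARY KEY, content text, embedding vector(3072))"
)

# Store a chunk with a placeholder embedding (illustration only).
conn.execute(
    "INSERT INTO doc_chunks (content, embedding) VALUES (%s, %s)",
    ("example chunk", np.random.rand(3072).astype(np.float32)),
)

# Retrieve the nearest chunks to a query vector by cosine distance (<=>).
query_vec = np.random.rand(3072).astype(np.float32)
rows = conn.execute(
    "SELECT content FROM doc_chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_vec,),
).fetchall()
```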
