Hi,
I have a dataset that contains about four hundred rows, and I want to use the text-embedding-ada-002 model API to generate embeddings for some specific columns. Can somebody please give me guidance on how to handle this large amount of data and feed it to the model through the API? I know there are rate limits and other obstacles.
Thanks in advance
400 rows is not a large amount of data.
While there are rate limits, they are quite high for ada-002 if you use batching.
Currently, I batch 500 text snippets per request when generating embeddings. For you it would be similar, except you'd send column values instead of snippets.
Anyway – let’s say you have 5 columns per row, and 400 rows. If you want to send 500 items per batch, you’d get (5*400)/500 == 4 batches.
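In case it helps, here's a rough sketch of that batching loop in Python, using the openai package's v1 client. It assumes OPENAI_API_KEY is set in your environment; `embed_texts` and `BATCH_SIZE` are just illustrative names, not anything from the API:

```python
# Rough sketch of batched embedding requests, assuming the openai
# Python package (v1 client) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

BATCH_SIZE = 500  # items per request, as described above


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed a list of strings in batches, preserving input order."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start:start + BATCH_SIZE]
        response = client.embeddings.create(
            model="text-embedding-ada-002",
            input=batch,
        )
        # One embedding comes back per input; sort by index to be safe.
        for item in sorted(response.data, key=lambda d: d.index):
            embeddings.append(item.embedding)
    return embeddings


# Usage: flatten your column values into one list, embed, then map back.
# 400 rows * 5 columns = 2000 values = 4 requests at this batch size.
column_values = ["value 1", "value 2"]  # ...all cell values to embed
vectors = embed_texts(column_values)
```

If you ever do hit a rate limit, wrapping the `create` call in a retry with exponential backoff is the usual fix, but at 4 requests you almost certainly won't need it.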
Separately: What’s in these columns? Embeddings are useful for semantic closeness matching, but aren’t good for more traditional database queries like substring searches or value-range matches.
Thanks for your reply,
I am building a job recommendation system. I need to get embeddings for job descriptions and job requirements; they are two columns in the dataset.