Need directions to embed and query structured table data for a music recommendation system 🎸

Hi there community, I hope everyone is doing well :]

I'm using the ada-002 embedding model for a recommendation system (along with some other similarity-search features like generative playlists), so naturally a lot of questions have started to pop up.

Before going deeper, let me explain what I am building and how the data is structured:

The prototype is a music app with song recommendations based on user history and song metadata. I have a relational database on PostgreSQL with users, artists, and songs. Each of these tables has its own columns; for example, a song has genres, danceability, number of likes, etc.

There are also two more tables for history logs: a "history" table (relating users and songs) and a "session" table (a collection of tracks listened to within 25-minute sessions, to learn which song goes best with another), so I have better data to embed.

What I need is to vectorize the data, store it in a vector database, and then perform similarity searches.

However, this is where a lot of questions start to pop up: basically, I need to know how to format this data so it can be embedded and produce great results. For example, building a queue of next songs to play based on the song currently playing and the embedded history/session logs of the users, or making a generative playlist of songs that share similar parameters (genre, tempo, artists).

[Question 1 - storing data]: To build something like this, how should I store my data? I thought about storing it as key:value parameters, for example passing a sentence like the ones below to the embedding model (a rough sketch of how I'd embed this follows the examples). Would this later let me run a similarity search with a query like "a playlist based on song 2"?

user: User 1, songs listened in the session: [Song 1, Song 6, Song 2 (...)]

user: User 66, songs listened in the session: [Song 9, Song 2, Song 98 (...)]

and then also store the user and song information from my database as vectors, for better positioning in the embedding space somehow, like

user Id: 123, username: baby, userPreferences: [rock, metal], userCountry: Australia

song Id: 333, songName: cool song, songGenres: [punk, thrash metal], danceability: 0.9
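For reference, here's a minimal sketch of how I imagine embedding such a sentence on the fly in a backend resolver (the function name and data are placeholders I made up; the OpenAI client calls are the official Python SDK):

```python
# Minimal sketch: serialize a session row into a key:value sentence and embed it
# with ada-002. Everything besides the OpenAI calls is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_session(user_id: int, song_names: list[str]) -> list[float]:
    # Turn the structured row into the kind of sentence shown above
    text = f"user: User {user_id}, songs listened in the session: [{', '.join(song_names)}]"
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding  # 1536-dimensional vector

# vec = embed_session(1, ["Song 1", "Song 6", "Song 2"])
```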

[Question 2 - store just the logs, or the products and users as well for better "dimensioning"?]: Do I really need to also embed the song and user table information as vectors before storing the history/session vectors of a user's logs? Or can I simply store the session/history directly?

[Question 3 - ID everything instead of full text?]: Should I use userId and songId instead of the actual song names to cut embedding costs with fewer tokens? Will this affect queries? And would this also require storing the products and users? (The two variants I'm weighing are sketched below.)
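To illustrate what I mean, here are the two serializations side by side (values are just examples):

```python
# Name-based: more tokens, but carries semantic signal a text embedding model can use.
text_names = "song: cool song, genres: [punk, thrash metal], danceability: 0.9"

# ID-based: fewer tokens, but IDs are opaque strings with no semantic meaning
# for a text embedding model.
text_ids = "songId: 333, genreIds: [12, 47], danceability: 0.9"
```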

[Question 4 - best DB recommendation for my case]: When storing this in a vector database, is there a DB I should consider "better" for my key-value and metadata embeddings? I know about a few: Milvus, Chroma, Weaviate, Vespa, pgvector, and of course Pinecone. Which would best suit my needs at this point? Please recommend at least two that could work for me.
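For context, since my data already lives in PostgreSQL, I imagine something like pgvector would look roughly like this (table and column names are placeholders; `<=>` is pgvector's cosine-distance operator):

```python
# Rough pgvector sketch, assuming the extension is installed and psycopg 3 is used.
import psycopg

query_embedding = [0.0] * 1536  # placeholder; would come from the embedding call

with psycopg.connect("dbname=music") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS song_embeddings (
            song_id   integer PRIMARY KEY,
            embedding vector(1536)  -- ada-002 output dimension
        )
    """)
    # 10 nearest songs to a query vector, by cosine distance:
    cur.execute(
        "SELECT song_id FROM song_embeddings ORDER BY embedding <=> %s::vector LIMIT 10",
        (str(query_embedding),),
    )
    song_ids = [row[0] for row in cur.fetchall()]
```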

[Question 5 - querying similar songs based on many parameters: song, user, and history]: In a context where I've stored the embedded songs, the embedded users, and the embedded history/session logs, it's time to get some very cool query results. Suppose I now need to generate a playlist of songs that are similar in the genre "punk", or that many users listened to together in a session, or that have similar danceability. But I also need to take the user into consideration: what does he like, what has he listened to over time, and how can this song be used to surface new songs that this specific user would like? In other words, a multi-table query, not just for songs, but for songs related to a song, for a user.
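One idea I had for this (just a sketch; `vector_db.search` and the weights are hypothetical): blend the playing song's vector with the user's taste centroid, then filter by metadata:

```python
# Hypothetical sketch: combine the current song's embedding with the mean of the
# user's session embeddings, then run a filtered nearest-neighbor search.
import numpy as np

def blended_query(song_vec, history_vecs, song_weight=0.6):
    user_vec = np.mean(history_vecs, axis=0)       # the user's "taste centroid"
    q = song_weight * np.asarray(song_vec) + (1 - song_weight) * user_vec
    return q / np.linalg.norm(q)                   # renormalize for cosine search

# query = blended_query(current_song_vec, user_session_vecs)
# results = vector_db.search(query, filter={"genre": "punk"}, top_k=20)  # hypothetical API
```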

[Question 6 - how will this data be returned?]: In a context where I've made a successful query in my vector DB, I'd need a way to then look those results up in my database and actually serve the song to the frontend from my backend resolver. I would want to get back something like an array of song IDs that are exactly what I need. Is that possible?
A better example: suppose the query was to get the IDs of songs related to song 5 for user ID 777. Considering I've stored all the song history data as key:value pairs within the same sentence, would it be able to return just the song IDs I need in an array, like [song 5, song 888, song 11, song 54], so I can then query those in my actual SQL DB?
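What I'm picturing (the `index.query` call mimics Pinecone-style clients but is only illustrative; the key point is storing `song_id` as metadata next to each vector):

```python
# Sketch: get IDs back from the vector DB, then hydrate full rows from PostgreSQL.
# Assumes each vector was upserted with {"song_id": ...} as metadata.
def related_song_ids(index, query_embedding, top_k=10):
    matches = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return [m["metadata"]["song_id"] for m in matches["matches"]]

# ids = related_song_ids(index, query_embedding)                 # e.g. [5, 888, 11, 54]
# cur.execute("SELECT * FROM songs WHERE id = ANY(%s)", (ids,))  # hydrate from SQL
```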

[Question 7 - migrating models from ada-002 in the future]: Also, when OpenAI releases newer models (ada-003, 004, 005, etc.), will they be compatible with our old vectors, or will I need to re-embed everything? Same question if I want to use other embedding models from Hugging Face: would those embeddings be completely different from the current ones?

My goal is to eventually store other awesome data with songs, like instruments, engagement rates, and more.

Note: I really don't want to keep exporting CSVs and retraining a model every time a new song or user is added to my platform (as with older approaches like word2vec); that's why I thought about embedding on the fly with ada-002 in the backend resolvers. Also, just using a normal database is not the goal here, because I need this to work this way.

If you feel I'm talking total nonsense, please enlighten me :blush:

Thank you so much! <3


Hi, I'm searching for an answer to this too. Mine is about company data for a chatbot: data from PostgreSQL to be embedded and stored in Pinecone.

Ada-002 is mostly used for text. Have you considered embedding the audio files themselves? There are various options out there specifically for embedding music files.


Hey Curt, thanks for the idea!

But the problem here is a bit different: storing structured table data as vectors (coming from a PostgreSQL database, for example).

Column name: row value

For example, if you have columns Name and Date:

Name: Brian
Date: 2023-03-23

I have vectorized songs, and they can be similar to each other, OK. That's a normal product recommender, where products relate to each other.

In my case I need more: to consider multiple parameters. I have users and their histories/sessions as well. These are text lists of products, so I can get a better idea of which songs match each other best. This is sometimes referred to as matrix factorization.

I also have user data: where they are from, which genres they like the most.

The point is how to vectorize all of this and return formatted recommendations for a specific user based on his tastes.

If you could answer the questions above, it would be awesome.

I guess I would expose the high-level things (the genres they like most) as keywords in the search, and the songs as embeddings, then fuse the two together in some weighted manner using reciprocal rank fusion (RRF).

You have different types of correlation and search, and you can fuse them together to give the overall ranking and top recommendations.

If the keywords are very similar, you would de-prioritize the keyword stream somewhat, but still rank on recent audio embeddings.
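To make the fusion concrete, here's a minimal RRF sketch (k=60 is the smoothing constant from the original RRF paper; the weights express the de-prioritization mentioned above):

```python
# Reciprocal rank fusion: each input is a ranked list of song IDs, best first.
def rrf(rankings, k=60, weights=None):
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, song_id in enumerate(ranking, start=1):
            scores[song_id] = scores.get(song_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([keyword_hits, embedding_hits], weights=[0.5, 1.0])
```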

I think this is less related to OpenAI, because you don't have human-language text in the input data, nor do you have to produce any as the model output. It looks like a Keras/PyTorch recommendation system.
But if you want to solve it with OpenAI anyway, you can build a fine-tuned model with prompts like the following (for every user in the database):

SYSTEM: You are a helpful music recommendation assistant. 

USER: Build a queue of next songs to play based on a user profile, song that is currently playing and embedded history/sessions logs of the users. 
The user profile format is: name, address, origin, annual income, car make, monthly budget.
The song history format is: date, song name, artist, genre, danceability, number of links, number of likes, etc...
The output format should be: song name, confidence score.
The returned confidence score should be between 0..1.
Return up to 6 songs sorted by the confidence score.
You can return songs that the user has already listened to if you are confident they would like to listen to the same song again.
Today is: 24/11/2023
###
User profile: 
Jhon Jhon, LA, Spanish, 60000$, Lincoln, 100$
Song history:
Date1, Song1,  Artist1, ...
Date2, Song2,  Artist2, ...
Date3, Song3,  Artist3, ...
Date4, Song4,  Artist4, ...
Date5, Song5,  Artist5, ...

ASSISTANT: 
Song1111, 0.9
Song2222, 0.5
Song3333, 0.4
Song4444, 0.2

When building the fine-tuned model, add multiple records from the same user but with a truncated song history and an updated "Today" date. This way you can validate against the most recently listened songs, as with time-series dataset predictions.
You can ask OpenAI to explain its decision by adding to the prompt: "Build a chain of thought" or "Let's think step by step".
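Each training record would then be one JSONL line in OpenAI's chat fine-tuning format, something like this (the values are invented):

```python
# Sketch of a single fine-tuning record in OpenAI's chat JSONL format.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a helpful music recommendation assistant."},
        {"role": "user", "content": "User profile: Jhon Jhon, LA, ...\n"
                                    "Song history:\nDate1, Song1, Artist1, ...\n"
                                    "Today is: 24/11/2023"},
        {"role": "assistant", "content": "Song1111, 0.9\nSong2222, 0.5"},
    ]
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```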

As far as I can see, all the data is already in the database. Wouldn't it be simpler to do a fuzzy query? Fuzzy phrase similarity in SQL Server - Stack Overflow
Is any user interaction with the service expected? If yes, then AI can be interesting for understanding speech and taking some action. Otherwise, I don't see the point of using AI.