Hi there community, I hope everyone is doing well ::]
I’m using the ada-002 embedding model for a recommendation system (along with some other similarity search features like generative playlists), so naturally a lot of questions started to pop up.
Before going deeper, let me explain what I am building and how the data is structured:
The prototype is a music app with song recommendations based on users’ history and song metadata. I have a relational database on PostgreSQL with tables for users, artists and songs. Each of these tables has its own columns; for example, songs have genres, danceability, number of likes, etc.
There are also two more tables for history logs: a “history” table (relating users and songs) and a “session” table (a collection of tracks listened to in 25-minute sessions, to know which song goes well with another), so I have better data to embed.
What I need is to vectorize the data, store it in a vector database and then perform a similarity search.
However, this is where the questions started: basically I need to know how to format this data so it can be embedded and give good results. For example, building a queue of next songs to play based on the song that is currently playing and the embedded history/session logs of the users, or making a generative playlist with songs that have similar parameters (genre, tempo, artists).
[Question 1 - storing data]: To build something like this, how should I store my data? I thought about storing it as key:value parameters, for example passing sentences like the ones below to the embedding model. Would this work for a later similarity search with a query like “a playlist based on song 2”?
user: User 1, songs listened in the session: [Song 1, Song 6, Song 2 (...)]
user: User 66, songs listened in the session: [Song 9, Song 2, Song 98 (...)]
and then also store the user and song information from my database as vectors, to somehow get better positioning in the embedding space, like
user Id: 123, username: baby, userPreferences: [rock, metal], userCountry: Australia
song Id: 333, songName: cool song, songGenres: [punk, thrash metal], danceability: 0.9
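To make the idea concrete, here is a minimal sketch of how I imagine building those strings before sending them to the embedding model. The function names (`session_to_text`, `record_to_text`) and the dict layout are just illustrative assumptions on my side, not part of any library:

```python
def session_to_text(user, songs):
    """Serialize one listening session into a single key:value sentence."""
    return f"user: {user}, songs listened in the session: [{', '.join(songs)}]"


def record_to_text(record):
    """Serialize a flat dict (one user or song row) into 'key: value' pairs."""
    parts = []
    for key, value in record.items():
        # Lists such as genres or preferences are rendered as [a, b, c].
        if isinstance(value, (list, tuple)):
            value = "[" + ", ".join(str(v) for v in value) + "]"
        parts.append(f"{key}: {value}")
    return ", ".join(parts)


session_text = session_to_text("User 1", ["Song 1", "Song 6", "Song 2"])
user_text = record_to_text(
    {
        "user Id": 123,
        "username": "baby",
        "userPreferences": ["rock", "metal"],
        "userCountry": "Australia",
    }
)

print(session_text)
print(user_text)
```

Each resulting string would then be sent to the embedding model as one input, and the returned vector stored together with the id of the row it came from.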
[Question 2 - store just the logs, or the products and users as well for better “dimensioning”?]: Do I really need to also embed the song and user rows from my database as vectors before storing the vectors for a user’s history/session logs? Or can I simply store the session/history directly?
[Question 3 - ID everything instead of full text?]: Should I use userId and songId instead of the actual song name, to cut embedding costs by using fewer tokens? Will this affect queries? And would this require also storing the products and users?
[Question 4 - best db recommendation for my case]: When storing this in a vector database, is there a db I should consider “better” for my key-value and metadata embeddings? I know about a few, so at this point which ones would suit my needs best? Milvus, Chroma, Weaviate, Vespa, pgvector and of course Pinecone. Please recommend at least two that could work for me.
[Question 5 - querying similar songs based on many parameters - song, user and history]: Assuming I have stored the embedded songs, the embedded users and the embedded history/session logs, it’s time to get some very cool query results. Suppose I now need to generate a playlist based on songs similar to the genre “punk”, or songs that many users listened to together in a session, or even songs that have similar danceability. But I also need to take the user into consideration: what does he like, what has he listened to over time, and how can this song be used to show him new songs that this specific user would like? In other words, making a multi-table query not just for songs, but for songs related to a song, for a user.
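To illustrate what I mean by combining the song and the user in one query, here is a minimal sketch. It assumes (and this is only my guess, I don’t know if it is a good approach) that the current song’s embedding and some embedding of the user’s taste can be blended into a single query vector with a weighted average. `blend_vectors` and `song_weight` are hypothetical names, and the 2-dimensional vectors are toy values standing in for real embeddings:

```python
def blend_vectors(song_vector, user_vector, song_weight=0.7):
    """Weighted average of a song embedding and a user embedding."""
    if len(song_vector) != len(user_vector):
        raise ValueError("both vectors must come from the same embedding model")
    if not 0.0 <= song_weight <= 1.0:
        raise ValueError("song_weight must be between 0 and 1")
    user_weight = 1.0 - song_weight
    return [song_weight * s + user_weight * u for s, u in zip(song_vector, user_vector)]


# Toy vectors: the song points along the first axis, the user along the second.
query_vector = blend_vectors([1.0, 0.0], [0.0, 1.0], song_weight=0.5)
print(query_vector)
```

A higher `song_weight` would keep the results closer to the current song, a lower one closer to the user’s overall taste. Whether this gives good recommendations is exactly what I’m unsure about.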
[Question 6 - How will this data be returned?]: Assuming I made a successful query in my vector db, I would then need a way to query those results in my database and actually serve the songs to the frontend from my backend resolver. I would return something like an array of song ids that match exactly what I needed. Is that possible?
A better example: suppose the query was to get song IDs related to song 5 for user Id 777. Considering I have stored all the song history data as key:value pairs within a single sentence, would it be able to return just the song ids I need in an array, like [song 5, song 888, song 11, song 54], so I can then query those in my actual SQL db?
[Question 7 - migrating models from ada-002 in the future]: Also, when OpenAI releases more models like ada-003, 4, 5, etc., would they be compatible with our old vectors, or would I need to re-embed everything? Same question, but what if I want to use other embedding models from Hugging Face? Would the embeddings be completely different from the current ones?
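Here is a minimal sketch of the kind of result I’m hoping for, assuming each stored vector carries its song id next to it as metadata. Everything here is a toy stand-in: the 3-dimensional vectors replace real 1536-dimensional ada-002 embeddings, the in-memory `index` list replaces the vector db, and `top_k_song_ids` is a hypothetical helper name:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_song_ids(query_vector, index, k=3):
    """Return the song ids of the k entries most similar to query_vector."""
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
        reverse=True,
    )
    return [entry["song_id"] for entry in ranked[:k]]


# Each entry keeps the song id alongside its (toy) embedding.
index = [
    {"song_id": 5, "vector": [1.0, 0.0, 0.0]},
    {"song_id": 888, "vector": [0.9, 0.1, 0.0]},
    {"song_id": 11, "vector": [0.0, 1.0, 0.0]},
    {"song_id": 54, "vector": [0.7, 0.7, 0.0]},
]

song_ids = top_k_song_ids([1.0, 0.0, 0.0], index, k=3)
print(song_ids)
```

The returned ids are plain integers, so in principle they could go straight into a SQL query on the Postgres side (for example a `WHERE id = ANY(...)` filter on a songs table, where the table and column names are again assumptions).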
My goal is to eventually store some other awesome data with the songs, like instruments, engagement rates and more.
Note: I really don’t want to keep exporting CSVs and training a model every time a new song or user is added to my platform (for example with older models like word2vec); that’s why I thought about embedding this on the fly using ada-002 in the backend resolvers. Also, just using a normal database is not the goal here, because I need this to work this way.
If you feel I’m talking total nonsense, please enlighten me.
Thank you so much! <3