I have a database of articles and their authors. I am building a recommendation system that recommends authors who generally write about topics similar to those of a given input text. To do this, I calculate embeddings for all articles, then compute an embedding for each author by averaging the embeddings of that author's last 30 articles, and store these author embeddings in a database. To find authors who generally write about the same topic as the input text, I compare the embedding of the input text to the (aggregated) author embeddings.
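To make the aggregation step concrete, this is roughly what it looks like (the function and variable names here are illustrative placeholders, not my exact code):

```python
import numpy as np

def author_embedding(article_embeddings: list[np.ndarray], max_articles: int = 30) -> np.ndarray:
    """Aggregate an author's most recent article embeddings into a single vector.

    `article_embeddings` is assumed to be ordered newest-first; only the
    last `max_articles` (30 in my setup) are averaged.
    """
    recent = article_embeddings[:max_articles]
    avg = np.mean(recent, axis=0)
    # Normalize so cosine similarity against the stored vectors behaves consistently.
    return avg / np.linalg.norm(avg)
```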
It works quite well for non-niche topics, and less well for niche topics and texts that don't have a single well-defined topic. My theory is that averaging the embeddings of an author's work flattens the semantic representation a bit. The aggregated embedding captures very well whether the author writes about a broad category such as tech, politics, or science, but because it averages over everything the author has written, a tech author who wrote a few articles about a niche tech topic will often not rank above other tech authors when the input text is about that niche topic. Do you have any tips for improving my recommendation method?
I’m also curious whether there is a way to transform the embeddings so that they mostly capture topical relevance and not things like writing style. Currently the cosine similarity scores are very close to each other (mostly between 0.79 and 0.9, with the most relevant authors usually scoring around 0.87). It would be easier to define a “good enough match” cut-off point if the scores were spread further apart, and keeping only the dimensions related to the topics of a text might also make the embeddings smaller and thus take up less space.
I am using the cosine similarity function built into the Pinecone database.
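For context, the lookup itself is a plain Pinecone query against the stored author vectors, something like the sketch below (the index name, key handling, and top_k are simplified for illustration):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("author-embeddings")  # index created with metric="cosine"

def recommend_authors(input_text_embedding: list[float], top_k: int = 10):
    # Compare the input text embedding to the aggregated author embeddings
    # and return (author_id, cosine_similarity) pairs.
    result = index.query(vector=input_text_embedding, top_k=top_k)
    return [(m.id, m.score) for m in result.matches]
```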