Help me with embeddings! Interlinking my posts with Python and OpenAI embeddings

Hey, I'll get straight to the point. I'm trying to interlink my 2000-word article with other posts on my blog. I have:

  1. an article of about 2000 words
  2. a posts dictionary: {'post name 1': 'url', 'post name 2': 'url2', 'post name 3': 'url3', ...}
  3. a glimpse of the code:
    import re
    import numpy as np
    import openai

    article = "My 2000 words long article"
    sentence_threshold = 0.8
    post_threshold = 0.8

    # Split the article into sentences on ., ?, !, dashes, colons and semicolons
    sentences = re.split(r'[.?!]\s*|[-–—]\s*|[:;]\s*', article)

    # Encode the sentences into embeddings
    embeddings = []
    token_usage = 0
    # for sentence in sentences:
    result = openai.Embedding.create(
        input=article,  # the whole article as a single string
        model="text-embedding-ada-002"
    )
    token_usage += result['usage']['total_tokens']

    embedding = np.array(result['data'][0]['embedding'])
    embeddings.append(embedding)

    # Normalize the embeddings to unit length
    embeddings = [embedding / np.linalg.norm(embedding) for embedding in embeddings]

So here is the thing that doesn't work: if I do it in a for loop, sentence by sentence, it does work, but it's really slow; after around 2 minutes I get what I need. (Inside the sentence loop I'm checking the similarity score of each post name against a sentence using cosine_similarity(embeddings[i], post_embedding).)
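Since the embeddings in the snippet are normalized to unit length, cosine similarity reduces to a dot product. Here is a toy sketch of that calculation in plain NumPy (this is a stand-in for OpenAI's cosine_similarity utility, using made-up 2-D vectors rather than real 1536-dimensional embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    # For unit-normalized vectors the denominator is 1, so this is just a dot product.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors instead of real embeddings
v1 = np.array([1.0, 0.0])
v2 = np.array([1.0, 1.0])
score = cosine_similarity(v1, v2)  # cos(45°) ≈ 0.7071
```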

If I leave the API calls in a for loop, I get a list of embeddings the same length as the sentence list, which is 191 for the article I'm using. But if I do it like I wrote above, all I get is len(embeddings) == 1, and I know it should be 191, like this:

Here is what I get in embeddings without the for loop:

[array([ 0.0403676 ,  0.01448165,  0.00766105, ..., -0.00764166,
       -0.01843824, -0.05047889])]

Here is what I get in embeddings if I do it in a loop (truncated):

[..., array([ 0.01812073,  0.02399307,  0.00866932, ...,  0.01945957,
       -0.02510656, -0.02910983]), array([-0.00726626,  0.00626179,  0.01283683, ...,  0.00158162,
       -0.02055255,  0.00497129]), array([ 0.00157823, -0.01674882, -0.00096651, ..., -0.01041143,
       -0.02104307, -0.00485398]), array([ 0.00900183,  0.026613  ,  0.01325586, ...,  0.00724198,
       -0.0328168 , -0.01881395]), array([-0.00390678,  0.01128842,  0.00595922, ...,  0.00758623,
       -0.02097905, -0.03504735]), array([ 0.00855317,  0.00015971, -0.01988218, ..., -0.03186072,
       -0.00013428, -0.00918901]), array([ 0.01568328,  0.02055611, -0.00469747, ...,  0.01629708,
       -0.00105067, -0.03717888]), array([ 0.01740627,  0.01084242, -0.00490702, ...,  0.00754145,
       -0.0221292 , -0.01985661]), array([ 0.00928457,  0.00845914, -0.01899117, ...,  0.00409924,
       -0.00609455, -0.02991419]), array([-0.02048408, -0.01166611, -0.00076245, ..., -0.01609263,
       -0.01271837, -0.03743256]), array([ 0.00157823, -0.01674882, -0.00096651, ..., -0.01041143,
       -0.02104307, -0.00485398]), array([ 0.01812073,  0.02399307,  0.00866932, ...,  0.01945957,
       -0.02510656, -0.02910983]), array([ 0.0062621 ,  0.00690462, -0.01560698, ..., -0.01261076,
       -0.02283785,  0.00688465]), array([ 0.02010468,  0.01750666,  0.00929292, ...,  0.01320993,
       -0.00211339, -0.01927864]), array([ 0.00156904, -0.01674868, -0.00096192, ..., -0.01039911,
       -0.02101843, -0.00484476]), array([ 0.01360712,  0.02565403,  0.00674601, ...,  0.00331865,
       -0.0283908 , -0.02089665]), array([ 0.00446662,  0.01857241,  0.01074299, ...,  0.01560749,
        0.00375427, -0.04661712]), array([ 0.00637931,  0.00094998, -0.00290306, ..., -0.02211853,
        0.00171284,  0.01383757]), array([ 0.01568328,  0.02055611, -0.00469747, ...,  0.01629708,
       -0.00105067, -0.03717888]), array([ 0.00672113,  0.00181576, -0.00431672, ...,  0.00419525,
       -0.02026928, -0.02367033]), array([ 0.01861821,  0.01351544, -0.02494965, ...,  0.01967136,
       -0.01144675, -0.0279085 ]), array([-0.02048408, -0.01166611, -0.00076245, ..., -0.01609263,
       -0.01271837, -0.03743256]), array([ 0.00157823, -0.01674882, -0.00096651, ..., -0.01041143,
       -0.02104307, -0.00485398]), array([ 0.01812073,  0.02399307,  0.00866932, ...,  0.01945957,
       -0.02510656, -0.02910983]), array([ 0.00623226,  0.00691807, -0.01555401, ..., -0.01271087,
       -0.02282497,  0.00684483]), array([ 0.02010431,  0.01750634,  0.00932606, ...,  0.01316972,
       -0.00203674, -0.01925165]), array([ 0.00156904, -0.01674868, -0.00096192, ..., -0.01039911,
       -0.02101843, -0.00484476]), array([ 0.01852282,  0.03004047,  0.00095315, ...,  0.01771749,
       -0.0325204 , -0.0163369 ]), array([ 0.00167578,  0.00404019,  0.01134665, ...,  0.01454342,
       -0.01602177, -0.03636486]), array([ 0.04229578,  0.01394077,  0.0025843 , ..., -0.00790416,
       -0.01749173, -0.04013891]), array([ 0.01387062, -0.00419769, -0.01793063, ..., -0.00872837,
       -0.00889487, -0.02032565]), array([ 0.0004801 ,  0.00510512, -0.00260057, ..., -0.01248273,
       -0.01914018, -0.03881808]), array([ 0.02571821,  0.00874419, -0.00944906, ..., -0.00654703,
        0.00743605, -0.03200488]), array([ 0.01169694,  0.01528149, -0.00526362, ..., -0.01171581,
       -0.00042645, -0.01508025]), array([ 0.00156904, -0.01674868, -0.00096192, ..., -0.01039911,
       -0.02101843, -0.00484476])]

By the way, print(len(embeddings[0])) returns 1536 and I don't know why.

I hate numbers…

This is how it's supposed to look after sending the whole article in one API request. Does anyone have an idea how I can accomplish this? The rest of the code works fine with this array of arrays, and I can't be bothered making it work with a single one because I've already spent 3 days on it… thanks a lot to any genius who can help me out! :smiley:

I built a WordPress plugin that does this.

One thing I found is that embedding sentences wasn't as effective as embedding paragraphs. My thought was that the added context made for better matches.

However, it seems like your problem is more of a programming issue than an AI system strategy issue. And that, unfortunately, I cannot help you with directly.

But if you’re interested, I might consider publishing that plugin as open-source.

In the meantime, I recommend working on your article parsing strategy. Try parsing into paragraphs whilst including the relevant headings in each embedding input. Also try persisting your embeddings in a database so you can experiment with various calculations.
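Persisting the embeddings could be as simple as a SQLite table. The schema and helper names below are hypothetical, just one way to sketch it, with each vector stored as JSON text keyed by post ID:

```python
import json
import sqlite3

import numpy as np

def save_embedding(conn, post_id, embedding):
    # Store the vector as a JSON array of floats, one row per post.
    conn.execute(
        "INSERT OR REPLACE INTO embeddings (post_id, vector) VALUES (?, ?)",
        (post_id, json.dumps(np.asarray(embedding).tolist())),
    )
    conn.commit()

def load_embedding(conn, post_id):
    # Return the stored vector as a NumPy array, or None if absent.
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE post_id = ?", (post_id,)
    ).fetchone()
    return np.array(json.loads(row[0])) if row else None

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute("CREATE TABLE embeddings (post_id TEXT PRIMARY KEY, vector TEXT)")
```

With the vectors cached like this, you only pay for each embedding once and can rerun similarity experiments locally.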


Hi @makaroni94

@wfhbrian is right. There's something wrong in your code when you run it on the entire article. I cannot point it out because your code is ambiguous.

Can you run it on just the whole article and share the embedding output and the code you ran?


My code works on Polish-language content. Sorry, but there's no point in me sharing the code; I would have to paste everything here and it's too much.

The output I'm receiving (when looping over every sentence) is good for me. Here it is (it's in Polish, though):

{'Fotel gamingowy do 1000 zł': 'Fotel gamingowy do 200 kg\n\n<h2>Dlaczego fotel gamingowy dla graczy o wadze powyżej 200 kg jest ważny:0.8507698673064766', 
'Fotel gamingowy do 700 zł': 'Fotel gamingowy dla graczy o wadze powyżej 200 kg powinien zapewnić odpowiednie wsparcie dla ciała, aby zapobiec bólowi i dyskomfortowi podczas długiej sesji grania:0.859752636647331', 
'Fotel na kółkach dla inwalidów': 'Niektóre fotele gamingowe dla osób o wadze powyżej 200 kg mają dodatkowe poduszki na zagłówek i podparcie lędźwiowe, które również mogą zapewnić dodatkowe wsparcie i wygodę:0.8523221047838959', 
'Fotel gamingowy do 800 zł': 'Warto również pamiętać, że dobry fotel gamingowy dla osób o wadze powyżej 200 kg to inwestycja na długie lata:0.8509398427166032', 
'Fotel gamingowy do 100 zł': 'Fotel gamingowy dla osób o wadze powyżej 200 kg powinien umożliwiać regulację kąta nachylenia oparcia:0.8596210227265936', 
'Fotelik dziecięcy': 'Fotel dedykowany jest dla osób o wadze do 150 kg:0.8516857810737982', 
'Tablet do 1200 zł': 'około 1 500 PLN\n\nTabela porównawcza:0.8678484520434757'}

The numbers appended after the colons are the similarity scores obtained from OpenAI's cosine_similarity function.

I just need help accomplishing the same thing without looping over every sentence: one API call where I send the whole article.

Is it available in the WordPress plugin directory? If so, what's its name?

You are using embeddings to find a relevant match in the article for the post you are trying to link with your plugin, right?

I know there are different ways of doing what I'm trying to accomplish, like keyword extraction, but what I'm also trying to do is replace an existing string: send it to gpt-3.5-turbo, ask it to rephrase that string to include the post name, then retrieve it and replace it in the article.

This way I get pretty good sentences with links to relevant posts, all of this for the SEO benefits, of course…

My plugin isn't publicly available, and it doesn't automatically replace anything. Instead, it provides a dashboard with internal link recommendations.

But are you just asking how to batch the requests? If so, in your example, you should be able to replace the article string with an array of strings.
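Following that suggestion, here is a minimal sketch of the batched call, assuming the pre-1.0 `openai` SDK used in the question. The `parse_embedding_response` helper is a made-up name, and the fake response below only illustrates the shape of the result (the API returns one item in `data` per input string, tagged with its batch position via `index`):

```python
import numpy as np

def parse_embedding_response(result):
    # One item per input string; sort by "index" so the embeddings
    # come back in the same order as the input sentences.
    data = sorted(result["data"], key=lambda item: item["index"])
    return [np.array(item["embedding"]) for item in data]

# With the pre-1.0 openai SDK the batched call would look like this
# (not executed here, since it needs an API key):
#
# result = openai.Embedding.create(
#     input=sentences,               # a list of strings, not one string
#     model="text-embedding-ada-002",
# )
# embeddings = parse_embedding_response(result)

# Fake response with two 3-dimensional "embeddings" to show the shape:
fake = {"data": [
    {"index": 1, "embedding": [0.4, 0.5, 0.6]},
    {"index": 0, "embedding": [0.1, 0.2, 0.3]},
]}
embeddings = parse_embedding_response(fake)  # list of 2 arrays, in input order
```

Passing the whole article as one string returns a single embedding for the entire text, which is why the original code got `len(embeddings) == 1`; passing the sentence list gets all the per-sentence embeddings in one request.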

Hey, is it out for public use? Can you please share the plugin with me?

I might have another one for you to test, but it's not available in the wp.org repo. You can see details about it here: WordPress Content Suggestions (Related Posts) - need beta testers

Last month the website had 2.5M unique visitors, and with all the customisations based on user browsing history, the costs were: OpenAI API around 30 USD, Weaviate cloud service 78 USD…