Hey, I'll get straight to the point. I'm trying to interlink my 2000-word article with other posts on my blog. I have:
- a 2000-word article
- a posts dictionary {'post name 1': 'url', 'post name 2': 'url2', 'post name 3': 'url3', …}
- the code below, which is only an excerpt:
article = "My 2000 words long article"
sentence_threshold = 0.8
post_threshold = 0.8
sentences = re.split(r'[\.\?!][\s]*|[-–—][\s]*|[:][\s]*|[;][\s]*', article)
# Encode the sentences into embeddings
embeddings = []
token_usage = 0
#for sentence in sentences:
result = openai.Embedding.create(
input=article,
model="text-embedding-ada-002"
)
token_usage += result['usage']['total_tokens']
embedding = np.array(result['data'][0]['embedding'])
embeddings.append(embedding)
# Normalize the embeddings
embeddings = [embedding / np.linalg.norm(embedding) for embedding in embeddings]
Here is the part that doesn't work. If I make the API call in a for loop, sentence by sentence, it works, but it's really slow: it takes around 2 minutes to get what I need. (Inside the per-sentence loop I check the similarity score between each post name and the sentence using cosine_similarity(embeddings[i], post_embedding).)
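For reference, the per-sentence check described here can be done for all sentences at once when the vectors are unit length, because the cosine similarity of two normalized vectors is just their dot product. This is a minimal sketch using made-up 3-dimensional vectors in place of real embeddings; `normalize` and `best_matches` are hypothetical helper names, and the 0.8 default mirrors `sentence_threshold` from the excerpt.

```python
import numpy as np

def normalize(rows):
    """Scale each row vector to unit length."""
    rows = np.asarray(rows, dtype=float)
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def best_matches(sentence_embeddings, post_embedding, threshold=0.8):
    """Return (sentence index, score) pairs whose cosine similarity
    with the post embedding is at least `threshold`."""
    sentences = normalize(sentence_embeddings)
    post = normalize([post_embedding])[0]
    scores = sentences @ post  # one dot product per sentence
    return [(i, float(s)) for i, s in enumerate(scores) if s >= threshold]

# Toy vectors for illustration only, not real embeddings
toy_sentences = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
toy_post = [1.0, 0.0, 0.0]
matches = best_matches(toy_sentences, toy_post)
print([i for i, _ in matches])  # indices of sentences above the threshold
```

With these toy vectors, sentences 0 and 2 pass the threshold and sentence 1 does not, since it is orthogonal to the post vector.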
If I leave the API calls in a for loop, I get one embedding per sentence, which is 191 for the article I'm using. But if I do it the way I wrote above, all I get is len(embeddings) = 1, and I know it should be 191.
Here is what I get in embeddings without the for loop:
[array([ 0.0403676 , 0.01448165, 0.00766105, ..., -0.00764166,
-0.01843824, -0.05047889])]
Here is what I get in embeddings if I do it in a loop:
..............array([ 0.01812073, 0.02399307, 0.00866932, ..., 0.01945957,
-0.02510656, -0.02910983]), array([-0.00726626, 0.00626179, 0.01283683, ..., 0.00158162,
-0.02055255, 0.00497129]), array([ 0.00157823, -0.01674882, -0.00096651, ..., -0.01041143,
-0.02104307, -0.00485398]), array([ 0.00900183, 0.026613 , 0.01325586, ..., 0.00724198,
-0.0328168 , -0.01881395]), array([-0.00390678, 0.01128842, 0.00595922, ..., 0.00758623,
-0.02097905, -0.03504735]), array([ 0.00855317, 0.00015971, -0.01988218, ..., -0.03186072,
-0.00013428, -0.00918901]), array([ 0.01568328, 0.02055611, -0.00469747, ..., 0.01629708,
-0.00105067, -0.03717888]), array([ 0.01740627, 0.01084242, -0.00490702, ..., 0.00754145,
-0.0221292 , -0.01985661]), array([ 0.00928457, 0.00845914, -0.01899117, ..., 0.00409924,
-0.00609455, -0.02991419]), array([-0.02048408, -0.01166611, -0.00076245, ..., -0.01609263,
-0.01271837, -0.03743256]), array([ 0.00157823, -0.01674882, -0.00096651, ..., -0.01041143,
-0.02104307, -0.00485398]), array([ 0.01812073, 0.02399307, 0.00866932, ..., 0.01945957,
-0.02510656, -0.02910983]), array([ 0.0062621 , 0.00690462, -0.01560698, ..., -0.01261076,
-0.02283785, 0.00688465]), array([ 0.02010468, 0.01750666, 0.00929292, ..., 0.01320993,
-0.00211339, -0.01927864]), array([ 0.00156904, -0.01674868, -0.00096192, ..., -0.01039911,
-0.02101843, -0.00484476]), array([ 0.01360712, 0.02565403, 0.00674601, ..., 0.00331865,
-0.0283908 , -0.02089665]), array([ 0.00446662, 0.01857241, 0.01074299, ..., 0.01560749,
0.00375427, -0.04661712]), array([ 0.00637931, 0.00094998, -0.00290306, ..., -0.02211853,
0.00171284, 0.01383757]), array([ 0.01568328, 0.02055611, -0.00469747, ..., 0.01629708,
-0.00105067, -0.03717888]), array([ 0.00672113, 0.00181576, -0.00431672, ..., 0.00419525,
-0.02026928, -0.02367033]), array([ 0.01861821, 0.01351544, -0.02494965, ..., 0.01967136,
-0.01144675, -0.0279085 ]), array([-0.02048408, -0.01166611, -0.00076245, ..., -0.01609263,
-0.01271837, -0.03743256]), array([ 0.00157823, -0.01674882, -0.00096651, ..., -0.01041143,
-0.02104307, -0.00485398]), array([ 0.01812073, 0.02399307, 0.00866932, ..., 0.01945957,
-0.02510656, -0.02910983]), array([ 0.00623226, 0.00691807, -0.01555401, ..., -0.01271087,
-0.02282497, 0.00684483]), array([ 0.02010431, 0.01750634, 0.00932606, ..., 0.01316972,
-0.00203674, -0.01925165]), array([ 0.00156904, -0.01674868, -0.00096192, ..., -0.01039911,
-0.02101843, -0.00484476]), array([ 0.01852282, 0.03004047, 0.00095315, ..., 0.01771749,
-0.0325204 , -0.0163369 ]), array([ 0.00167578, 0.00404019, 0.01134665, ..., 0.01454342,
-0.01602177, -0.03636486]), array([ 0.04229578, 0.01394077, 0.0025843 , ..., -0.00790416,
-0.01749173, -0.04013891]), array([ 0.01387062, -0.00419769, -0.01793063, ..., -0.00872837,
-0.00889487, -0.02032565]), array([ 0.0004801 , 0.00510512, -0.00260057, ..., -0.01248273,
-0.01914018, -0.03881808]), array([ 0.02571821, 0.00874419, -0.00944906, ..., -0.00654703,
0.00743605, -0.03200488]), array([ 0.01169694, 0.01528149, -0.00526362, ..., -0.01171581,
-0.00042645, -0.01508025]), array([ 0.00156904, -0.01674868, -0.00096192, ..., -0.01039911,
-0.02101843, -0.00484476])]
print(len(embeddings[0])) returns 1536 and I don't know why.
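A note on that number: 1536 is the length of a single text-embedding-ada-002 vector, so len(embeddings[0]) counts the dimensions of the first embedding, while len(embeddings) counts how many inputs were embedded. The sketch below illustrates the difference with zero-filled placeholder arrays, not real API output.

```python
import numpy as np

DIMENSIONS = 1536  # the vector length reported by len(embeddings[0])

# Placeholder data: one vector for the whole article vs. one per sentence
whole_article = [np.zeros(DIMENSIONS)]
per_sentence = [np.zeros(DIMENSIONS) for _ in range(191)]

print(len(whole_article), len(whole_article[0]))  # 1 1536
print(len(per_sentence), len(per_sentence[0]))    # 191 1536

# Stacking the per-sentence list gives one row per sentence
matrix = np.stack(per_sentence)
print(matrix.shape)  # (191, 1536)
```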
I hate numbers…
The loop output above is how it's supposed to look after sending the whole article in one API request. Does anyone have an idea how I can accomplish this? The rest of the code works fine with this array of arrays, and I'd rather not rework it for a single array because I've spent 3 days on it. Thanks a lot to any genius who can help me out!
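One possible approach, sketched under assumptions: the legacy openai.Embedding.create call used in the excerpt accepts a list of strings as input and returns one entry per string in result['data'], each carrying an index. Passing the sentences list instead of article should therefore give one embedding per sentence from a single request. In the sketch below the request function is passed in as `create` so it runs offline; `embed_sentences` and `fake_create` are hypothetical names, and the fake response only mimics the shape of a real one. Empty strings are filtered out because re.split can produce them and I would expect the API to reject them.

```python
import numpy as np

def embed_sentences(sentences, create, model="text-embedding-ada-002"):
    """Embed all sentences in one request.

    `create` is the request function, e.g. openai.Embedding.create in the
    legacy SDK from the post. Returns (texts, unit-length matrix, tokens used).
    """
    texts = [s.strip() for s in sentences if s and s.strip()]
    result = create(input=texts, model=model)  # list in, one embedding per item out
    # Sort by the index field so that row i corresponds to texts[i]
    rows = sorted(result["data"], key=lambda item: item["index"])
    matrix = np.array([row["embedding"] for row in rows], dtype=float)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # normalize each row
    return texts, matrix, result["usage"]["total_tokens"]

def fake_create(input, model):
    """Offline stand-in that mimics the response shape (demo assumption)."""
    data = [{"index": i, "embedding": [float(len(t)), 1.0, 0.0]}
            for i, t in enumerate(input)]
    tokens = sum(len(t.split()) for t in input)
    return {"data": data, "usage": {"total_tokens": tokens}}

texts, embeddings, token_usage = embed_sentences(
    ["First sentence", "", "Second one here"], fake_create)
print(len(embeddings), len(texts))  # one row per non-empty sentence
```

With the real client the call would be embed_sentences(sentences, openai.Embedding.create), and embeddings[i] then lines up with texts[i], so the existing similarity code that indexes embeddings[i] should keep working.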