I recently went through the OpenAI tutorial on Web Q&A embeddings (Web Q&A - OpenAI API). Along the way I ran into several problems, most likely because the tutorial hasn't been updated since the release of version 1.0.0 of the Python library, but I managed to work through them. The only remaining hiccup is that the embedded data doesn't seem to be integrated properly. Below is the code I'm using to execute the call:
The current issue is that the code fails to recognize the latest embedding model, something that, according to the tutorial, should work out of the box. This is the unexpected difficulty I'm trying to resolve.
Thanks for the quick response. As a fresh account I could only attach one file to my post, so here is the output for the three example questions as posted in the tutorial:
OK, there seems to be a disconnect: the AI is not aware of your data. You need to use either the OpenAI retrieval system or your own vector database, and perform a retrieval step against it before building the prompt.
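To illustrate what that retrieval step looks like, here is a minimal sketch using a local in-memory "vector store". The texts and the 3-dimensional embeddings are made up for the example; in practice the vectors would come from an embedding model and the store would be a real database:

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, texts, k=2):
    """Return the k texts whose embeddings are most similar to the query."""
    # Cosine similarity between the query and every stored document
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
    )
    # Indices of the most similar documents, best first
    top = np.argsort(sims)[::-1][:k]
    return [texts[i] for i in top]

# Toy 3-dimensional "embeddings" standing in for real model output
texts = ["refund policy", "shipping times", "account deletion"]
doc_embs = np.array([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.1],
                     [0.1, 0.0, 1.0]])
query_emb = np.array([0.9, 0.2, 0.0])  # a query closest to "refund policy"

# The retrieved context is what you prepend to the prompt
# before calling the chat model
context = "\n\n###\n\n".join(retrieve_top_k(query_emb, doc_embs, texts))
```

The key point is that the model only "knows" your data through the context string you retrieve and pass in with each question.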
Have you followed every step to the letter in the example on Web Q&A - OpenAI API?
It looks like you may have missed out a large section of it.
I have followed the whole tutorial, but as it hasn't been updated since the release of 1.0.0, I had to change a few lines. Notably, embedding_utils no longer works, so I changed the create_context function a bit:
import numpy as np
import openai

def create_context(
    question, df, max_len=1800, size="ada"
):
    # Get the embedding for the question
    q_embeddings = openai.embeddings.create(
        input=question, model='text-embedding-ada-002'
    ).data[0].embedding

    # Calculate the distances from the embeddings
    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Cosine distance = 1 - cosine similarity, so smaller means more similar
    df['distances'] = 1 - np.array(
        [cosine_similarity(q_embeddings, emb) for emb in df['embeddings'].values]
    )

    returns = []
    cur_len = 0

    # Sort by distance (most similar first) and add the text to the
    # context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        # If the context is too long, break
        if cur_len > max_len:
            break
        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)
I have run the create_context function on its own and the result is:
Be careful with the sort direction. Cosine similarity increases as two vectors become more alike, so if you sort by the similarity itself you want ascending=False, adding the highest-similarity context first. Your distances column, however, stores 1 - similarity, i.e. a distance, so sorting it ascending already puts the most relevant rows first; the two orderings are equivalent. With the sorting confirmed, also double-check that the embedding call actually embeds the user's question rather than a hardcoded test string, otherwise the retrieved context will never match the query.
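The relationship between the two sort directions is easy to check on a toy example (made-up 2-dimensional vectors, pure numpy):

```python
import numpy as np

# Toy query and document embeddings
q = np.array([1.0, 0.0])
docs = np.array([[0.9, 0.1],
                 [0.1, 0.9],
                 [0.5, 0.5]])

# Cosine similarity of each document against the query
sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
dists = 1 - sims  # cosine distance

# Most similar document first:
by_similarity = np.argsort(sims)[::-1]  # similarity: sort descending
by_distance = np.argsort(dists)         # distance: sort ascending

# Both orderings agree
assert (by_similarity == by_distance).all()
```

So whether ascending=True or ascending=False is right depends entirely on whether the column holds distances or similarities.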