Web Q&A embeddings - tutorial

Hello everyone,

I recently worked through OpenAI's tutorial on Web Q&A embeddings (Web Q&A - OpenAI API). Along the way I ran into several issues, most likely because the tutorial hasn't been updated since the release of version 1.0.0 of the Python library, but I managed to work through them. The one remaining problem is that the embedded data doesn't seem to be integrated properly. I've attached the code I'm using to execute the call.

Concretely, the bot fails on the tutorial's own example question about the newest embedding model, something that, according to the tutorial, should work out of the box. This is what I'm trying to resolve.
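For reference, the embedding call itself is already migrated to the 1.0.0 style. A stripped-down sketch of it (not my full script; the model name is the one from the tutorial):

import openai

# Post-1.0.0 shape: openai.embeddings.create replaces openai.Embedding.create
response = openai.embeddings.create(
    input="Some text to embed",
    model="text-embedding-ada-002",
)
embedding = response.data[0].embedding  # list of floats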

Any help is appreciated, thanks in advance!

Hi and welcome to the Developer Forum!

Can you provide logs of the results? Try to include what was sent in, what was pulled back, and any error messages you received.

Hi Foxabilo,

Thanks for the quick response. As a fresh account I could only attach one file to my post; here is the output for the three example questions from the tutorial:

It seems like the model is not able to access the embedded data.

OK, there seems to be a disconnect: the AI is not aware of your data. You need to make use of either the OpenAI retrieval system or your own vector database, and perform a retrieval on it before you call the model.
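The end-to-end flow looks roughly like this. This is just a sketch, assuming the DataFrame from the tutorial (with text and embeddings columns); the chat model name here is only an example:

import numpy as np
import openai

def answer_question(question, df, chat_model="gpt-3.5-turbo"):
    # 1. Embed the question with the same model used for the stored chunks
    q_emb = openai.embeddings.create(
        input=question, model="text-embedding-ada-002"
    ).data[0].embedding

    # 2. Rank stored chunks by cosine similarity (higher = more similar)
    sims = [
        np.dot(q_emb, emb) / (np.linalg.norm(q_emb) * np.linalg.norm(emb))
        for emb in df["embeddings"].values
    ]
    context = "\n\n###\n\n".join(
        df.assign(sim=sims).sort_values("sim", ascending=False)["text"].head(5)
    )

    # 3. Hand the retrieved context to the model along with the question
    response = openai.chat.completions.create(
        model=chat_model,
        messages=[
            {"role": "system", "content": "Answer the question based on the context below."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

Unless you do something like this, the model never sees your embedded data; embeddings on their own don't get attached to a completion call.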

Have you followed every step to the letter in the example on Web Q&A - OpenAI API?

It seems like you may have missed out a large section of it.

I have followed the whole tutorial, but as it hasn't been updated since the release of 1.0.0, I had to change a few lines. Notably, the embedding_utils helpers no longer worked, so I changed the create_context function a bit:

import numpy as np
import openai

def create_context(question, df, max_len=1800, size="ada"):
    # Get the embedding for the question
    q_embeddings = openai.embeddings.create(
        input=question, model='text-embedding-ada-002'
    ).data[0].embedding

    # Score each stored chunk against the question
    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    df['distances'] = np.array([cosine_similarity(q_embeddings, emb) for emb in df['embeddings'].values])

    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():

        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4

        # If the context is too long, break
        if cur_len > max_len:
            break

        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)
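I'm calling it the way the tutorial does, with one of its example questions (df being the DataFrame built earlier in the tutorial):

context = create_context("What day is it?", df, max_len=1800)
print(context)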

I have run the create_context function on its own and the result is:

Any suggestions on what went wrong?

You need to change ascending to False: cosine similarity increases as texts become more similar, so you want to add the highest-similarity chunks to the context first, which means sorting the df in descending order on that measure.
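In other words, something like:

for i, row in df.sort_values('distances', ascending=False).iterrows():
    ...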
