How to load saved embeddings

Hi,

My problem, besides not knowing Python, is that I have saved embeddings that look like this:

0,0.0031115561723709106,0.00018902790907304734,-0.00190595886670053,-0.029547588899731636,-0.022286130115389824,0.018968993797898293,-0.029436087235808372,-0.0378822386264801,-0.02245337888598442,-0.018146678805351257,0.02940821275115013,-0.020348811522126198,-0.009881717152893543,-0.008892151527106762 …

Each row is an index followed by a vector, which looks OK. I saved the file after computing the embeddings, and if I use the computed ones directly everything works fine.

But to save time, of course, I want to load them from the file, and that doesn't work.

If I write the function like this:

def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:

    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c) for c in df.columns])
    return {
           (i): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }

then I get: ValueError: invalid literal for int() with base 10: 'Unnamed: 0'

I tried adding this after df = pd.read_csv(fname, header=0):

df = df.set_axis(['column1_name', 'column2_name', 'column3_name'], axis=1)

but then I get:
ValueError: invalid literal for int() with base 10: 'Unnamed: 0'

I can see that the file contains the vectors, so loading it this way does not seem right to me, but I don't know how else to do it. Sorry, I do not know Python and am just trying things out.

This is how I save them:

import os

saved_embeddings = "path to file"
if os.path.exists(saved_embeddings):
    document_embeddings = load_embeddings(saved_embeddings)
else:
    # document_embeddings = compute_embedding_with_backoff(df=df)
    document_embeddings = compute_doc_embeddings(df)
    pd.DataFrame(document_embeddings).T.to_csv(saved_embeddings)
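
For reference, the 'Unnamed: 0' name in the error can be reproduced with a tiny example. This is only a sketch that assumes the same save call as above and uses a made-up two-row dictionary in place of real embeddings: to_csv writes the row index as a first column with an empty header cell, and read_csv then gives that cell a placeholder name.

```python
import io
import pandas as pd

# Toy stand-in for the real embeddings (assumption): row index -> short vector.
document_embeddings = {0: [0.1, 0.2, 0.3], 1: [0.4, 0.5, 0.6]}

# Same save call as in the post, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
pd.DataFrame(document_embeddings).T.to_csv(buffer)
csv_text = buffer.getvalue()

# The first header cell is empty because the index has no name.
header_line = csv_text.splitlines()[0]
print(header_line)

# Reading the file back without index_col names that empty cell "Unnamed: 0",
# which is the string int() later fails on.
df = pd.read_csv(io.StringIO(csv_text), header=0)
print(list(df.columns))
```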

I had the same issue. I fixed it by adding the column names that you used for your dataframe index, i.e.:

df.set_index(["title", "heading"])

So I had to add title and heading as the first two values in the header of my embeddings file.

the top of my embeddings file looks something like this:
,,0,1,2,3,4,5,6,7,8,9,1

and you need it to be:
title,heading,0,1,2,3,4,5,6,7,8,9,1
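
An alternative to editing the header by hand might be to tell pandas that the first two columns are the index. The following is a sketch only; the sample rows are invented, and it assumes the file was written from a dataframe with a two-level index as described above.

```python
import io
import pandas as pd

# Invented sample with the same header shape as the file above (two empty cells).
sample = io.StringIO(
    ",,0,1,2\n"
    "Intro,Overview,0.1,0.2,0.3\n"
    "Intro,Details,0.4,0.5,0.6\n"
)

# index_col=[0, 1] makes the first two columns a two-level index, so their
# (empty) header cells never have to be converted with int().
df = pd.read_csv(sample, header=0, index_col=[0, 1])

# Each key is a (first, second) tuple; each value is the row's vector as a list.
embeddings = {key: row.tolist() for key, row in df.iterrows()}
print(list(embeddings))
```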

thanks for the response but …
I load the data as such:

df = pd.read_csv('luna_skills_copy.csv')

columns = ["name", "skill", "mastery"]

df = df.reindex(columns=columns)

print(f"{len(df)} rows in the data.")

df.sample(5)

because I only want to use those 3 columns. Then I compute the embeddings like this:

def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_embedding(r.skill + ' ' + r.mastery) for idx, r in df.iterrows()
    }

because I need to find people with a certain skill and sometimes also a certain level (this works).

But the first line of my embeddings file has only the first element missing:

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60, …

So if I set the index to the first column:

df.set_index(["name"])

I will get the error:

ValueError: invalid literal for int() with base 10: 'name'

Can you upload your luna_skills_copy.csv and the resulting embeddings csv?

skill file copy

embeddings

Here it is, thanks again!

It looks like you did not set an index when creating the embeddings file (which is fine). Therefore, the first column of the embeddings file is the index (row number) of the original CSV file. Once your embeddings file is generated, put a placeholder column name called idx in the header, like this:

idx,0,1,2,3,4,5,6,7,8,9,

and then when loading the file, you need to ignore that column header when computing the max_dim.

Here is my python code, let me know if you have any questions.

import numpy as np
import openai
import pandas as pd
import pickle
import tiktoken

COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

MAX_SECTION_LEN = 500
MAX_SECTIONS = 3
SEPARATOR = "\n* "
ENCODING = "gpt2"
encoding = tiktoken.get_encoding(ENCODING)
separator_len = len(encoding.encode(SEPARATOR))


def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c)
                  for c in df.columns if c != "idx"])
    return {
        (r.idx): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }


def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    """
    Fetch the most relevant document sections and assemble them into a prompt.
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(
        question, context_embeddings)

    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []

    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.
        document_section = df.loc[section_index]
        print(type(document_section))
        if hasattr(document_section, 'tokens'):
            chosen_sections_len += document_section.tokens + separator_len
            if chosen_sections_len > MAX_SECTION_LEN:
                break
        elif len(chosen_sections) >= MAX_SECTIONS:
            break

        print(document_section.skill + ' ' + document_section.mastery +
              ' ' + document_section.authorname)
        chosen_sections.append(
            SEPARATOR + document_section.skill + ' ' + document_section.mastery + ' ' + document_section.authorname)
        # print('-----------')
        chosen_sections_indexes.append(str(section_index))

    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))

    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""

    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"


def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    result = openai.Embedding.create(
        model=model,
        input=text
    )
    return result["data"][0]["embedding"]


def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.

    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        # idx: get_embedding(r.title) for idx, r in df.iterrows()
        idx: get_embedding(r.skill + ' ' + r.mastery) for idx, r in df.iterrows()
    }


def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.

    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))


def order_document_sections_by_query_similarity(query: str, contexts: dict[(str, str), np.array]) -> list[(float, (str, str))]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections.

    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)

    return document_similarities


COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}


def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[tuple[str, str], np.array],
    show_prompt: bool = False
) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
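    # Debug aid: print the prompt and stop here. Remove these two lines
    # to let the function go on and call the completions API.
    print(prompt)
    quit()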
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
        prompt=prompt,
        **COMPLETIONS_API_PARAMS
    )
    print(response["choices"])
    return response["choices"][0]["text"].strip(" \n")


df = pd.read_csv('luna_skills_copy.csv')
print(f"{len(df)} rows in the data.")
# print(df.sample(15))
print(df)

# save embeddings to file
'''
document_embeddings = compute_doc_embeddings(df)
example_entry = list(document_embeddings.items())[0]
print(
    f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")
pd.DataFrame(document_embeddings).T.to_csv(
    'luna_skills_copy_embeddings_small2.csv')

quit()
'''
print('loading ...')
document_embeddings = load_embeddings("luna_skills_copy_embeddings.csv")
print('loading done.')

answer = answer_query_with_context(
    "ASP.NET", df, document_embeddings)
print(answer)

datafile_attr = "xyz.csv"
df_a = pd.read_csv(datafile_attr)
df_a["Embedding"] = df_a["Embedding"].apply(eval).apply(np.array)

The Embedding column in the CSV is the output I get from text-embedding-ada-002. I have been using these with cosine similarity and it's been working well for me:

op = cosine_similarity(df.loc[i, 'Embedding'], l_e)
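
A self-contained version of that idea might look like the sketch below. The sample rows are invented, cosine_similarity here is a hypothetical stand-in written with NumPy (the post does not say where its own comes from), and ast.literal_eval is swapped in for eval because it only parses literals such as lists of numbers.

```python
import ast
import io

import numpy as np
import pandas as pd

# Invented sample: each embedding is stored as one quoted string per row.
sample = io.StringIO(
    'text,Embedding\n'
    'foo,"[1.0, 0.0]"\n'
    'bar,"[0.6, 0.8]"\n'
)
df_a = pd.read_csv(sample)

# Turn each string back into a list of floats, then into a NumPy array.
df_a["Embedding"] = df_a["Embedding"].apply(ast.literal_eval).apply(np.array)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Hypothetical helper: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


score = cosine_similarity(df_a.loc[0, "Embedding"], df_a.loc[1, "Embedding"])
print(score)
```

Storing one string per row keeps the CSV to a single embedding column, at the cost of parsing every string when the file is loaded.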

Thanks again for the effort. I replaced my functions with yours, including load_embeddings, and now I get this:

Cell In[258], line 3, in load_embeddings(fname)
      1 def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
      2     df = pd.read_csv(fname, header=0)
----> 3     max_dim = max([int(c)
      4                   for c in df.columns if c != "idx"])
      5     return {
      6         (r.idx): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
      7     }

Cell In[258], line 3, in <listcomp>(.0)
      1 def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
      2     df = pd.read_csv(fname, header=0)
----> 3     max_dim = max([int(c)
      4                   for c in df.columns if c != "idx"])
      5     return {
      6         (r.idx): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
      7     }

ValueError: invalid literal for int() with base 10: 'Unnamed: 0'

I haven't looked at the data you used, but the code I attached is purely for the embedding you get back from OpenAI for a text. Technically, since it is a list of floats, there should not be a string in there.

Did you update the first line in your embeddings file too?
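
If editing the header is not convenient, one possible alternative is to let pandas treat the first column as the index when loading. This is a minimal sketch, not the code from the thread: the source parameter and the sample data are assumptions, and it relies on the remaining columns already being in dimension order (0, 1, 2, ...), as they are in the files shown above.

```python
import io
import pandas as pd


def load_embeddings(source) -> dict[int, list[float]]:
    """Read a CSV whose first column is the row index and the rest are dimensions."""
    # index_col=0 keeps the first column out of df.columns, so its header
    # (empty, "idx", or anything else) never has to be parsed as an integer.
    df = pd.read_csv(source, header=0, index_col=0)
    return {idx: row.tolist() for idx, row in df.iterrows()}


# Invented sample with the same header shape as the file in this thread.
sample = io.StringIO(",0,1,2\n0,0.1,0.2,0.3\n1,0.4,0.5,0.6\n")
embeddings = load_embeddings(sample)
print(embeddings[1])
```

Called with a file path instead of the in-memory sample, the same function would read the saved embeddings file directly.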