Would embeddings give a performance & cost improvement for my use case?

Hello! I’m working on a personal project involving the police logs that my local city puts out in the paper. I have the scraping of all the logs completed, so now what I want to do is be able to categorize the status of a particular police incident throughout its movement in the court system. For example, someone gets arrested for a crime, then the crime goes to court, then they may plead guilty or not guilty, eventually they are found innocent or guilty, and if they’re found guilty there will be a sentencing event, and in some instances even an appeal event. Here are some example records from my data to demonstrate this:

  1. was sent to the County Correctional Facility without bail and is scheduled to appear at 10 a.m. Monday in City Court.
  2. entered a plea of not guilty through his attorney <attorney_name>. remains in Jail pending his next appearance in Court on Jan. 21.
  3. “ pled guilty on Feb. 4 to the charges in exchange for a prison sentence of 10 years, according to court documents.”
  4. was arraigned in Court and sent to Jail pending a court appearance today

So ideally what I would want is to assign a status of ‘arrested’ for 1, ‘plead_not_guilty’ for 2, ‘plead_guilty’ for 3, and ‘arrested’ for 4. (1, 2, and 3 are for the same case - 4 is for another, different offense, which I already have a way of tracking.) What I’m wondering is whether I could use embeddings for this use case and, if so, how. Right now what I’m doing to classify these is pretty simple: sending the following query to ChatGPT via the API: ‘Your task is to determine what stage of the criminal justice process the incident is in. The choices are . Please only respond with one of the choices. If the choice cannot be determined, respond ‘undetermined’. INCIDENT START: {incident.legal_actions} INCIDENT END.’
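For reference, the call I’m making looks roughly like this (the model name and the way I fill in the choices are just illustrative; the prompt is the one quoted above):

import openai

client = openai.Client(api_key='YOUR OPENAI KEY')

# Illustrative status list; 'undetermined' is the fallback
CHOICES = "'arrested', 'plead_not_guilty', 'plead_guilty', 'undetermined'"

def classify_with_chat(legal_actions):
    prompt = (
        "Your task is to determine what stage of the criminal justice process "
        f"the incident is in. The choices are {CHOICES}. Please only respond "
        "with one of the choices. If the choice cannot be determined, respond "
        f"'undetermined'. INCIDENT START: {legal_actions} INCIDENT END."
    )
    # Model name is illustrative; any chat model can be called the same way
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()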

It seems to me there must be a smarter, more performant, and cheaper way of approaching this. Would embeddings be a good fit here, and if so, how could I go about implementing them appropriately?

Thanks!

I think embeddings could work here.

One simple way to approach this would be to create embeddings of example records that are representative of the different status types: you create the embedding vector for each record and then add the associated status as a metadata value.

Whenever you then want to classify a new record, you just embed it and run a query against the stored records in your database. From the top match you can then retrieve the metadata value, i.e. the classification.

Depending on the diversity of the records for a given status type, I would embed a few different examples for each status type.
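In code, a rough sketch of this idea could look as follows (the embed() helper, the example records, and the statuses are just placeholders based on the records in your post):

import openai

client = openai.Client(api_key='YOUR OPENAI KEY')

def embed(text):
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding

# A few labeled example records per status type (placeholders)
labeled_examples = [
    ("was sent to the County Correctional Facility without bail", "arrested"),
    ("entered a plea of not guilty through his attorney", "plead_not_guilty"),
    ("pled guilty to the charges in exchange for a prison sentence", "plead_guilty"),
]

# Each embedding vector is stored together with its status as metadata
index = [{"vector": embed(text), "status": status} for text, status in labeled_examples]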


Thank you for the response! I understand what you’re saying at a high level, but not so much at an implementation level. I’m trying to use the cookbook mentioned below to get a start on the code I’d use to do this.

I understand that the equivalent of datafile_path would be something like a CSV of the records I already have, plus a column containing the status each one should be assigned. What I don’t understand is:

  1. how I would then create an embedding of that, and
  2. how I would use that newly created embedding to classify another record.

I’m using Python. Is there a tutorial I can use to figure this out a little better? Get embeddings from dataset | OpenAI Cookbook gives me an idea of how to create the embeddings themselves (after I add in the column that would contain the classification on the record itself), but not necessarily how to then use them.

Thanks again!


I personally use a vector DB, so the approach is a little different.

But just to break it down again:

Step 1: Create embeddings of a sample set of existing records. For that you should prepare a CSV file; I’d give it three columns: ID, record, classification. You then create embeddings for the records, and the created embeddings can be added as an additional column in the CSV file. In principle you can follow the steps outlined in the cookbook you referenced.
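Using the example records from your first post, the CSV could look like this before the embeddings are added:

ID,record,classification
1,"was sent to the County Correctional Facility without bail and is scheduled to appear at 10 a.m. Monday in City Court.",arrested
2,"entered a plea of not guilty through his attorney <attorney_name>. remains in Jail pending his next appearance in Court on Jan. 21.",plead_not_guilty
3,"pled guilty on Feb. 4 to the charges in exchange for a prison sentence of 10 years, according to court documents.",plead_guilty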

Step 2: Perform a similarity search of a new record against the embedded sample records. This involves first creating an embedding of the new record and then performing the similarity search. As a result you will obtain a list of the top k (e.g. top 3) embedding vectors that are most similar. In your case you really just need 1 result in return. The classification associated with the most similar vector would then be the most likely classification for your new record.

You can take a look at the following cookbook to get an idea of this step (see Section 2 on search).

I only have Python code for the creation of embeddings from a CSV file, i.e. Step 1, which I could share if helpful. However, as mentioned, for Step 2 I rely on a vector DB.
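That said, here is a minimal sketch of what Step 2 could look like in plain Python without a vector DB, reusing the embed() helper and index list from the sketch earlier in the thread and computing cosine similarity with numpy (the names and the example call are placeholders):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(new_record, index, top_k=1):
    # Embed the new record, then rank the stored example vectors by similarity
    query = embed(new_record)
    ranked = sorted(index, key=lambda item: cosine_similarity(query, item["vector"]), reverse=True)
    # The metadata of the closest match(es) is the most likely classification
    return [item["status"] for item in ranked[:top_k]]

# classify("was arraigned in Court and sent to Jail pending a court appearance today", index)
# should return ['arrested'] if the nearest stored example is an 'arrested' record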


Thank you again - I appreciate all the explanation! If you wouldn’t mind sharing the code that creates the embeddings, I’d appreciate it 🙂


No problem, here you go. I already made some minor adjustments to accommodate your case, but see the comments for where you still need to fill in your specific information. Also, if you want to use a different embedding model, just replace the model name accordingly.

Python script for embedding creation
import pandas as pd
import openai

# Replace with your actual OpenAI API key
api_key = 'YOUR OPENAI KEY'

# Replace with the actual file path
file_path = 'C:/Users/UserName/Desktop/Filename.csv'

client = openai.Client(api_key=api_key)

def generate_embeddings(text):
    try:
        response = client.embeddings.create(
            model="text-embedding-ada-002",
            input=text,
            encoding_format="float"
        )
        embedding = response.data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

def process_records(file_path):
    df = pd.read_csv(file_path)  

    # Replace Column_Name by name of column in which records are stored
    if 'Column_Name' not in df.columns:
        print("The expected column 'Column_Name' is not found in the CSV file.")
        return

    # Replace Column_Name by name of column in which records are stored
    embeddings = [generate_embeddings(text) for text in df['Column_Name']]
    df['Embeddings'] = embeddings

    output_path = file_path.rsplit('.', 1)[0] + '_with_embeddings.csv'  
    df.to_csv(output_path, index=False)
    print(f"Embeddings for records have been created. Output saved to '{output_path}'.")

process_records(file_path)
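One note for when you get to Step 2: pandas writes the embedding lists to the CSV as strings, so when you read the file back in, you need to parse them back into lists of floats first, e.g.:

import ast
import pandas as pd

df = pd.read_csv('Filename_with_embeddings.csv')
df['Embeddings'] = df['Embeddings'].apply(ast.literal_eval)  # string -> list of floats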

Thank you so much! I really appreciate it 🙂
