Would embeddings give a performance & cost improvement for my use case?

Hello! I’m working on a personal project involving the police logs that my local city puts out in the paper. I have the scraping of all the logs completed, so now what I want to do is be able to categorize the status of a particular police incident throughout its movement in the court system. For example, someone gets arrested for a crime, then the crime goes to court, then they may plead guilty or not guilty, eventually they are found innocent or guilty, and if they’re found guilty there will be a sentencing event, and in some instances even an appeal event. Here are some example records from my data to demonstrate this:

  1. was sent to the County Correctional Facility without bail and is scheduled to appear at 10 a.m. Monday in City Court.
  2. entered a plea of not guilty through his attorney <attorney_name>. remains in Jail pending his next appearance in Court on Jan. 21.
  3. “ pled guilty on Feb. 4 to the charges in exchange for a prison sentence of 10 years, according to court documents.”
  4. was arraigned in Court and sent to Jail pending a court appearance today

So ideally what I would want is to assign a status of ‘arrested’ for 1, ‘plead_not_guilty’ for 2, ‘plead_guilty’ for 3, and ‘arrested’ for 4. (1, 2, and 3 are for the same case - 4 is for another, different offense, which I already have a way of tracking.) What I’m wondering is whether I could use embeddings for this use case and, if so, how. Right now what I’m doing to classify these is pretty simple: sending the following query to ChatGPT via the API: ‘Your task is to determine what stage of the criminal justice process the incident is in. The choices are . Please only respond with one of the choices. If the choice cannot be determined, respond ‘undetermined’. INCIDENT START: {incident.legal_actions} INCIDENT END.’
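For reference, the call I’m making looks roughly like this (the model name and the way I fill in the choices are just illustrative; the prompt is the one quoted above):

import openai

client = openai.Client(api_key='YOUR OPENAI KEY')

# Illustrative status list; 'undetermined' is the fallback
CHOICES = "'arrested', 'plead_not_guilty', 'plead_guilty', 'undetermined'"

def classify_with_chat(legal_actions):
    prompt = (
        "Your task is to determine what stage of the criminal justice process "
        f"the incident is in. The choices are {CHOICES}. Please only respond "
        "with one of the choices. If the choice cannot be determined, respond "
        f"'undetermined'. INCIDENT START: {legal_actions} INCIDENT END."
    )
    # Model name is illustrative; any chat model can be called the same way
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()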

It seems to me there must be a smarter, more performant, and cheaper way of approaching this. Would embeddings be a good fit here, and if so, how could I go about implementing them appropriately?

Thanks!

I think embeddings could work here.

One simple way to approach this would be to create embeddings of example records that are representative of the different status types: you create the embedding vector for each record and then add the associated status as a metadata value.

Whenever you then want to classify a new record, you just embed it and run a query against the stored records in your database. From the top match you can then retrieve the metadata value, i.e. the classification.

Depending on the diversity of the records for a given status type, I would embed a few different examples for each status type.
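In code, a rough sketch of this idea could look as follows (the embed() helper, the example records, and the statuses are just placeholders based on the records in your post):

import openai

client = openai.Client(api_key='YOUR OPENAI KEY')

def embed(text):
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding

# A few labeled example records per status type (placeholders)
labeled_examples = [
    ("was sent to the County Correctional Facility without bail", "arrested"),
    ("entered a plea of not guilty through his attorney", "plead_not_guilty"),
    ("pled guilty to the charges in exchange for a prison sentence", "plead_guilty"),
]

# Each embedding vector is stored together with its status as metadata
index = [{"vector": embed(text), "status": status} for text, status in labeled_examples]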


Thank you for the response! I understand what you’re saying at a high level, but not so much at an implementation level. I’m trying to use the cookbook mentioned below to get a start on the code I’d use to do this.

I understand that the equivalent of datafile_path would be something like a CSV of the records I already have, plus a column containing the status each one should be assigned. What I don’t understand is:

  1. how I would then create an embedding of that, and
  2. how I would use that newly created embedding to classify another record.

I’m using Python. Is there a tutorial I can use to figure this out a little better? Get embeddings from dataset | OpenAI Cookbook gives me an idea of how to create the embeddings themselves (after I add in the column that would contain the classification on the record itself), but not necessarily how to then use them.

Thanks again!


I personally use a vector DB, so the approach is a little different.

But just to break it down again:

Step 1: Create embeddings of a sample set of existing records. For that you should prepare a CSV file; I’d give it three columns: ID, record, classification. You then create embeddings for the records, and the created embeddings can be added as an additional column in the CSV file. In principle you can follow the steps outlined in the cookbook you referenced.
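Using the example records from your first post, the CSV could look like this before the embeddings are added:

ID,record,classification
1,"was sent to the County Correctional Facility without bail and is scheduled to appear at 10 a.m. Monday in City Court.",arrested
2,"entered a plea of not guilty through his attorney <attorney_name>. remains in Jail pending his next appearance in Court on Jan. 21.",plead_not_guilty
3,"pled guilty on Feb. 4 to the charges in exchange for a prison sentence of 10 years, according to court documents.",plead_guilty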

Step 2: Perform a similarity search of a new record against the embedded sample records. This involves first creating an embedding of the new record and then performing the similarity search. As a result you will obtain a list of the top k (e.g. top 3) embedding vectors that are most similar. In your case you really just need 1 result in return. The classification associated with the most similar vector would then be the most likely classification for your new record.

You can take a look at the following cookbook to get an idea of this step (see Section 2 on search).

I only have Python code for the creation of embeddings from a CSV file, i.e. Step 1, which I could share if helpful. However, as mentioned, for Step 2 I rely on a vector DB.
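That said, here is a minimal sketch of what Step 2 could look like in plain Python without a vector DB, reusing the embed() helper and index list from the sketch earlier in the thread and computing cosine similarity with numpy (the names and the example call are placeholders):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(new_record, index, top_k=1):
    # Embed the new record, then rank the stored example vectors by similarity
    query = embed(new_record)
    ranked = sorted(index, key=lambda item: cosine_similarity(query, item["vector"]), reverse=True)
    # The metadata of the closest match(es) is the most likely classification
    return [item["status"] for item in ranked[:top_k]]

# classify("was arraigned in Court and sent to Jail pending a court appearance today", index)
# should return ['arrested'] if the nearest stored example is an 'arrested' record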


Thank you again - I appreciate all the explanation! If you wouldn’t mind sharing the code that creates the embeddings, I’d appreciate it 🙂


No problem, here you go. I already made some minor adjustments to accommodate your case, but see the comments for where you still need to fill in your specific information. Also, if you want to use a different embedding model, just replace the model name accordingly.

Python script for embedding creation
import pandas as pd
import openai

# Replace with your actual OpenAI API key
api_key = 'YOUR OPENAI KEY'

# Replace with the actual file path
file_path = 'C:/Users/UserName/Desktop/Filename.csv'

client = openai.Client(api_key=api_key)

def generate_embeddings(text):
    try:
        response = client.embeddings.create(
            model="text-embedding-ada-002",
            input=text,
            encoding_format="float"
        )
        embedding = response.data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

def process_records(file_path):
    df = pd.read_csv(file_path)  

    # Replace Column_Name by name of column in which records are stored
    if 'Column_Name' not in df.columns:
        print("The expected column 'Column_Name' is not found in the CSV file.")
        return

    # Replace Column_Name by name of column in which records are stored
    embeddings = [generate_embeddings(text) for text in df['Column_Name']]
    df['Embeddings'] = embeddings

    output_path = file_path.rsplit('.', 1)[0] + '_with_embeddings.csv'  
    df.to_csv(output_path, index=False)
    print(f"Embeddings for records have been created. Output saved to '{output_path}'.")

process_records(file_path)
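One note for when you get to Step 2: pandas writes the embedding lists to the CSV as strings, so when you read the file back in, you need to parse them back into lists of floats first, e.g.:

import ast
import pandas as pd

df = pd.read_csv('Filename_with_embeddings.csv')
df['Embeddings'] = df['Embeddings'].apply(ast.literal_eval)  # string -> list of floats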

Thank you so much! I really appreciate it 🙂
