Anomaly detection across multiple numeric dimensions?

Am I doing this right?

I create a fine-tune with model: 'davinci', using 500 data points:

{"prompt": "pH: 7.2, CO2: 400 ppm, temperature: 25C ##", "completion": "none"}
{"prompt": "pH: 7.0, CO2: 450 ppm, temperature: 26C ##", "completion": "none"}
{"prompt": "pH: 7.5, CO2: 500 ppm, temperature: 28C ##", "completion": "none"}
{"prompt": "pH: 6.8, CO2: 550 ppm, temperature: 29C ##", "completion": "high CO2"}
{"prompt": "pH: 5.1, CO2: 425 ppm, temperature: 23C ##", "completion": "low pH"}

then I create a completion like this:
model: 'davinci:ft-personal-2023-05-04-20-12-21',
prompt: 'pH: 5.1, CO2: 425 ppm, temperature: 23C ##', ← expect "low pH"
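For reference, a training file in this shape can be generated with a short script. This is a minimal sketch: the file name train.jsonl is mine, and the leading space on each completion follows the old fine-tuning guide's tokenization advice (the examples above omit it).

```python
import json

# A few of the 500 examples: (sensor reading, anomaly label)
examples = [
    ("pH: 7.2, CO2: 400 ppm, temperature: 25C", "none"),
    ("pH: 7.0, CO2: 450 ppm, temperature: 26C", "none"),
    ("pH: 7.5, CO2: 500 ppm, temperature: 28C", "none"),
    ("pH: 6.8, CO2: 550 ppm, temperature: 29C", "high CO2"),
    ("pH: 5.1, CO2: 425 ppm, temperature: 23C", "low pH"),
]

# One JSON object per line; " ##" marks the end of the prompt, and the
# leading space on each completion follows the old fine-tuning guide
lines = [
    json.dumps({"prompt": f"{reading} ##", "completion": f" {label}"})
    for reading, label in examples
]

with open("train.jsonl", "w") as f:
    f.write("\n".join(lines))
```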

The example prompts, one per line, used for model training - correct for a dataset, but not so efficient for the user prompt (see below).

The ## symbols - maybe, I don’t know, but the models are sensitive to delimiters, which they use to separate the prompt text into different contexts such as data, instructions, code comments, etc.

The prompt - should have the condition (or whatever name you like) inside it, instead of in the completion field, such as:

{"prompt": "pH: 6.8, CO2: 550 ppm, temperature: 29C, condition: high CO2"}
{"prompt": "pH: 5.1, CO2: 425 ppm, temperature: 23C, condition: low pH"}

The condition: none - it might be better written as condition: normal, condition: not applicable, or condition: no information, so as not to confuse the model (e.g., none = no action).
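A minimal sketch of that reshaping - folding the completion into the prompt and renaming none to normal (the helper name is mine):

```python
# Original fine-tune records from the first post
records = [
    {"prompt": "pH: 7.2, CO2: 400 ppm, temperature: 25C ##", "completion": "none"},
    {"prompt": "pH: 6.8, CO2: 550 ppm, temperature: 29C ##", "completion": "high CO2"},
    {"prompt": "pH: 5.1, CO2: 425 ppm, temperature: 23C ##", "completion": "low pH"},
]

def reshape(record):
    # fold the completion into the prompt; rename "none" to "normal"
    reading = record["prompt"].replace(" ##", "")
    condition = "normal" if record["completion"] == "none" else record["completion"]
    return {"prompt": f"{reading}, condition: {condition}"}

reshaped = [reshape(r) for r in records]
```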

At least one initial prompt as an instruction may be necessary:

The next 500 prompts list data points containing pH, CO2 levels in ppm, Temperature in Centigrades (C), and Condition. Use these data points to respond to further queries.

The query prompt should be:

Given the examples in previous prompts, provide the condition for the following case:
pH: 5.1;
CO2: 425 ppm;
temperature: 23C;

Please consider putting the 500 data points into a dataset in simple text format (one example per line) - not only because of the token limits applied to prompts, but also because it saves coding time, eases data debugging, and improves readability for the model. Numbered items are better; mind the punctuation.

List of 500 data points containing pH, CO2 levels in ppm, Temperature in Centigrades (C), and Condition.
1. pH: 7.2, CO2: 400 ppm, temperature: 25C, normal;
2. pH: 7.0, CO2: 450 ppm, temperature: 26C, normal;
3. pH: 7.5, CO2: 500 ppm, temperature: 28C, normal;
4. pH: 6.8, CO2: 550 ppm, temperature: 29C, high CO2;
5. pH: 5.1, CO2: 425 ppm, temperature: 23C, low pH;
...
500. pH: ... . #last data point ended with "."
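A sketch of how such a file could be generated - the five rows stand in for the full 500, and the variable names are illustrative:

```python
# Sketch: generate the numbered plain-text dataset described above
# (the five sample rows stand in for the full 500)
points = [
    (7.2, 400, 25, "normal"),
    (7.0, 450, 26, "normal"),
    (7.5, 500, 28, "normal"),
    (6.8, 550, 29, "high CO2"),
    (5.1, 425, 23, "low pH"),
]

header = ("List of 500 data points containing pH, CO2 levels in ppm, "
          "Temperature in Centigrades (C), and Condition.")

dataset_lines = [header]
for i, (ph, co2, temp, cond) in enumerate(points, start=1):
    # each data point ends with ";" except the last, which ends with "."
    end = "." if i == len(points) else ";"
    dataset_lines.append(f"{i}. pH: {ph}, CO2: {co2} ppm, temperature: {temp}C, {cond}{end}")

dataset_text = "\n".join(dataset_lines)
```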

and the query prompt would be:

Given the dataset {url address of dataset}, provide the condition for the following case:
pH: 5.1;
CO2: 425 ppm;
temperature: 23C;

I hope this helps.

Thanks, Alex.
Sounds like I save money because the text file doesn’t have the prompt and completion characters.

I created an anonymous text file with a URL. Fingers crossed, testing now.

Thanks, Peter

Welcome, Peter. Thanks for your reply.
The Discourse Forum doesn’t allow replicating entire messages across threads. Please check the thread below for more details:
Seeking Advice on Handling Large Vehicle Database for AI Chatbot Application

Correct. The dataset should be uploaded to a storage service of your choice - such as Amazon S3, Google Cloud Storage, or Microsoft Azure Storage. Make sure the dataset is accessible to the model. Using your example:

  1. System role or User prompt:
    Please use the dataset located at http://domain/mytextfile.txt
  2. Python code using the openai API:

import openai  # the legacy (< 1.0) library reads OPENAI_API_KEY from the environment

# Set the URL of your dataset
dataset_url = "http://domain/mytextfile.txt"

# Set the prompt to use with the model
prompt = "Consider the following dataset and provide your responses accordingly: " + dataset_url

# Request a completion (legacy openai-python interface)
response = openai.Completion.create(model="davinci", prompt=prompt, max_tokens=200)
print(response["choices"][0]["text"])

I like the free text format better because we can write texts and lists in a more natural way, without worrying about excessive formatting, syntax, etc. The text format also allows greater flexibility to pass instructions, example templates, authors, licenses, and descriptions that do not belong to the dataset itself; this additional or auxiliary information (metadata) can be passed to the model in the dataset header as a separate section.
It can be in free format, JSON format, or whatever you like - the model will understand it, as long as it is clear to the model.

In the case of the free text format, I am curious about the use of the ## symbols. I would like to remind you that the models are sensitive to delimiters, which separate prompt texts and datasets into different contexts such as data, instructions, metadata, etc. - and # is used as a comment marker in Python.

Even if you don’t use Python, the openai API is typically called from Python code, as in most user-made apps, so the model could be confused by ## after the temperature or the condition (if you accept the suggestion).

If you want to mark the end of a single data point, it’s better to use a semicolon ( ; ), as in the example in my previous post. Delimiters, and consequently punctuation, are very important to the models.
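As a small illustration of the point about punctuation: with a semicolon terminating each data point and commas separating the fields, a record is trivially machine-parseable (a sketch; the helper name is mine):

```python
# Sketch: a record terminated by ";" with comma-separated "key: value"
# fields parses cleanly into a dictionary
def parse_point(text):
    fields = {}
    for part in text.strip().rstrip(";.").split(","):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields

point = parse_point("pH: 5.1, CO2: 425 ppm, temperature: 23C;")
```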

I hope this helps. Please let me know the results.
Oh, you just edited your prompt :grinning: :+1: - while I was replying.


Alex,
The query below generated a random input data point, not the expected anomaly.

Query:

Given the dataset {valid url to text file}, provide the condition for the following case:
pH: 5.1;
CO2: 425 ppm;
temperature: 23C;

Code:

    const response = await openai.createCompletion({
      model: 'davinci',
      prompt: query,
      max_tokens: 200
    });

Result:

The following is an input dataset: CO2: 1,500 ppm; pH: 7; Temperature: 0C; pH: 5; CO2: 2,000 ppm; CO2: 950 ppm; pH: 8; CO2: 2,500 ppm; pH 8.

--------------------------------------------------------------

I reworded the query to say: Given the dataset {valid url to text file}, show any non-normal condition for the following case:

Result :
Plan #3: By using log transformation (ln(c)). Normally, y is equal to kln(c)+b pH: 5.510; CO2: 600 ppm; Plan #4: By first normalization (l1, u1) and then use log transformation (ln(c)).

This result is not the expected anomaly. Is the problem with the model, or with how I phrase the query? Or maybe the API was unable to open the dataset URL I provided?
Meanwhile, I’ll look at the ‘large vehicle database’ link you shared.
Thanks, Peter

I did something similar for classification a while ago

this is the fine tuning notebook langame-worker/fine_tune_classification.ipynb at main · langa-me/langame-worker · GitHub

and the inference langame-worker/conversation_starters.py at main · langa-me/langame-worker · GitHub

a bit hard to read maybe; what I learned from this:

Building anomaly detection using text generation is nice to get started but I’d recommend using “embeddings” as you reach a production stage.

Embeddings models are typically faster and cheaper to use.

you can try this OpenAI notebook for classification (which is similar to anomaly detection)
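For what it’s worth, a sketch of the embeddings approach: embed every labelled example once, then classify a new reading by cosine similarity to its nearest neighbour. The embed() below is a stand-in that just rescales the raw numbers - in a real setup you would call the OpenAI embeddings endpoint on the text of each data point and cache the vectors.

```python
import math

# NOTE: embed() is a STAND-IN that rescales the raw numeric features;
# replace it with a call to the OpenAI embeddings endpoint in practice.
def embed(ph, co2, temp):
    return [ph, co2 / 100.0, temp]  # crude manual scaling, illustrative only

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Labelled examples from the dataset discussed above
labelled = [
    ((7.2, 400, 25), "normal"),
    ((7.0, 450, 26), "normal"),
    ((6.8, 550, 29), "high CO2"),
    ((5.1, 425, 23), "low pH"),
]

def classify(ph, co2, temp):
    # return the label of the most similar labelled example
    query = embed(ph, co2, temp)
    best = max(labelled, key=lambda ex: cosine(embed(*ex[0]), query))
    return best[1]
```

With real embeddings the structure stays the same - only embed() changes to an API call over the text of each data point.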


To get answers to this one:

List of 500 data points containing pH, CO2 levels in ppm, Temperature in Centigrades (C), and Condition.
1. pH: 7.2, CO2: 400 ppm, temperature: 25C, condition: normal;
2. pH: 7.0, CO2: 450 ppm, temperature: 26C, condition: normal;
3. pH: 7.5, CO2: 500 ppm, temperature: 28C, condition: normal;
4. pH: 6.8, CO2: 550 ppm, temperature: 29C, condition: high CO2;
5. pH: 5.1, CO2: 425 ppm, temperature: 23C, condition: low pH;
...
500. pH: ... . #last data point ended with "."

The first sentence, "List of 500 data points containing pH, CO2 levels in ppm, Temperature in Centigrades (C), and Condition", is important because it explains the dataset structure. It can be reworded; however (my bad), I forgot to add the "condition:" keyword in the earlier example, which gives the model a better explanation.
And the query could be:

Using the dataset {valid url to text file} answer the following:
Provide the condition for the case below:
pH: 5.1, CO2: 425 ppm, temperature: 23C;

The 2nd query could be:

Using the dataset {valid url to text file} answer the following:
List all data points with `condition` different from `normal` with values approximate to these ones:
pH: 5.1, CO2: 425 ppm, temperature: 23C;

By the way, if you’re providing code, my suggestion is to put the dataset URL reference inside the code instead of typing it in the query - that way you could catch any errors about accessing the dataset.

Embeddings are always advised, as @louis030195 said. But they demand some knowledge and coding - and for a set of data points like this, where only the numbers change, the similarities will be very close to each other. It will improve things, but by how much?
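A rough illustration of that concern, using raw numeric vectors as a stand-in for text embeddings: a "normal" and an anomalous reading still come out almost identical in cosine similarity, because the large CO2 value dominates.

```python
import math

# Raw numeric vectors as a stand-in for text embeddings of the data points
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

normal = [7.2, 400, 25]   # pH, CO2 ppm, temperature C - a "normal" point
low_ph = [5.1, 425, 23]   # the "low pH" anomaly

# The large CO2 value dominates, so the two readings look almost identical
similarity = cosine(normal, low_ph)
```

Here the similarity comes out above 0.999 even though one point is an anomaly; some rescaling or normalization of the features would likely be needed before nearest-neighbour comparisons become discriminative.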