Fine-tuning stats show good results, but the model fails in practice


I am attempting to create a fine-tuned model for detecting phishing. I downloaded the DOM of a website and extracted various features. These features are passed to a prompt with the verdict as the completion.

Example prompt:

{"prompt": "', '[]-$%&\n", "completion":" ph-$%&"}

The first problem I encountered was that it returned more options than “clean” or “ph” as the completion. I added stop values and forced the return of “clean” and “ph” completions with logit_bias.

I use the Ada model and these are my parameters:

            logit_bias={27773: 100, 746: 100},

I have over 1000 phishing examples and more than 600 clean examples. The result.csv file contains a score of 1.0 for every metric. However, in reality, the model reports a probability above 95% that nearly every website is phishing, even for clean examples from the training set.

Am I misinterpreting the results, or are my parameters the cause of this problem?

Side question: With embeddings, is it possible to use the entire DOM for training instead of extracted features? Also, is it possible to pass the entire DOM to get a verdict?

Hi @DuckLover and welcome to the community!

Unless I’m missing a whole lot of other steps you did, this doesn’t seem like “fine-tuning”. How many epochs did your fine-tuned model training reach?

Hello Bill, thank you for your reply.
It may be a simple example, but the idea is that there are features in phishing websites that are not in clean websites. I am trying to fine-tune the model so that it can detect those.
The training completed 4/4 epochs.

I’m not an authority on fine-tuning, but I have gleaned from other experts that you may need to increase the epochs before you will begin to see good results from your model.

Just for clarification: I have over 2000 lines in my JSONL file. This was just one line out of it as an example. I have read a lot of posts here and never saw a recommendation for more epochs with this amount of data. But feel free to correct me if any of my assumptions are wrong.
But after 4 epochs the results are perfect on paper. As I mentioned, I get a score of 1.0 for every metric. This is the confusing part. It does not really represent the results when I use my model.

I think 95% accuracy is pretty good. You could try training Babbage on the same data to gain a few more percentage points. If that doesn’t show improvement, you can increase the size of your training data. If you mean that you get a 95% phishing verdict on all websites, then you need more clean data in training, and better clean examples that aren’t ambiguous (that aren’t simply subsets of phishing examples).

Also, use this with other techniques, like using regex to determine that there are really two websites detected (from the training example you gave above). Combining the classifier and regex could close the gap without a re-train.

With embeddings, you can extract out parts of the DOM that score highly with phishing, and then use this as a pre-filter to your classifier, or outright trust a high embedding correlation to phishing as its own classifier.

Example of embedding as a classifier: Take a random chunk of HTML/Code, embed it, correlate it with previous classified chunks. Are the most similar chunks phishing or not, especially if the correlation is high? Use this as your answer.
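The "embedding as a classifier" idea can be sketched as a nearest-neighbour vote. This is a minimal illustration only: the three-dimensional vectors below are hypothetical stand-ins for real embedding vectors, and `classify_by_neighbors` and `k` are names chosen for this example, not part of any library.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_by_neighbors(query, labeled_vectors, k=3):
    """Return (majority label among the k most similar vectors, best similarity)."""
    scored = sorted(
        ((cosine_similarity(query, vec), label) for vec, label in labeled_vectors),
        reverse=True,
    )
    top = scored[:k]
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0], top[0][0]

# Hypothetical stand-ins for embeddings of previously classified chunks.
labeled_vectors = [
    ([0.9, 0.1, 0.0], "ph"),
    ([0.8, 0.2, 0.1], "ph"),
    ([0.7, 0.3, 0.0], "ph"),
    ([0.1, 0.9, 0.2], "clean"),
    ([0.0, 0.8, 0.3], "clean"),
]

label, best_similarity = classify_by_neighbors([0.85, 0.15, 0.05], labeled_vectors)
print(label, round(best_similarity, 3))
```

Returning the best similarity alongside the label lets you decide how much to trust the vote: a low value means nothing in the labeled set really resembles the new chunk.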

Example of embedding as a pre-filter: Take a random chunk of HTML/Code, embed it, correlate it with previous classified chunks. Look at the most similar chunks. Is this chunk close to phishing? Send it to your fine-tune to find out.

So, in practice, you can do all three (or more).

  1. Embedding as a pre-filter to your fine tune. Use this to scan more of the DOM and feed only highly concerning parts to your fine-tune (since searching is cheaper and quicker than fine-tune calls).
  2. Embedding as a filter. Just straight up correlation on the embedding vectors, and use this as your classifier. (Lots of recursion, but may be as accurate as your fine-tune, and cheaper)
  3. Regex pattern matching or keyword matching (extremely fast, cheap, and accurate on “red flag” phishing patterns).

You can do all of these at the same time, and have some weighting scheme between all the results to form a final verdict.
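One possible shape for such a weighting scheme is a weighted average of the three scores. This is only a sketch: the patterns, weights, and threshold below are hypothetical placeholders that you would tune on your own data, and the fine-tune and embedding scores are assumed to be phishing probabilities in [0, 1] computed elsewhere.

```python
import re

# Hypothetical red-flag patterns; real ones would come from your own phishing data.
RED_FLAG_PATTERNS = [
    re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}"),        # raw IP address as host
    re.compile(r"verify\s+your\s+account", re.IGNORECASE),  # typical lure phrasing
]

def regex_score(text):
    """1.0 if any red-flag pattern matches the text, else 0.0."""
    return 1.0 if any(p.search(text) for p in RED_FLAG_PATTERNS) else 0.0

def combined_verdict(fine_tune_score, embedding_score, text,
                     weights=(0.4, 0.4, 0.2), threshold=0.5):
    """Weighted average of three phishing scores, each in [0, 1]."""
    scores = (fine_tune_score, embedding_score, regex_score(text))
    total = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return ("ph" if total >= threshold else "clean"), total

# The fine-tune is confident, but the other two signals disagree.
print(combined_verdict(0.95, 0.2, "Welcome to our shop"))
# Same scores, but a red-flag pattern tips the balance.
print(combined_verdict(0.95, 0.2, "Please verify your account at http://192.168.0.1/login"))
```

With this layout, a fine-tune that flags almost everything as phishing can be outvoted when the embedding and regex signals both say otherwise.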

Also boost your model (use a larger one than Ada) or increase your training set to improve your fine-tune. Increasing your training epochs could help, but there is a concern that if you go too high, your model will lose its ability to generalize to phishing data it has never seen, so proceed cautiously with this hyper-parameter.

Do the metrics refer to the training or the validation dataset? If they refer to the training dataset, this is very likely a case of overfitting, where the model has basically learned to memorize the main class and dismiss the other one.

What are the features that you’re targeting? Is this example from a cleaned <a> tag?
Your completion is: “completion”:" ph-%&". What does the -%& mean? I’m confused because you then use it as your stop. Is this an example from your training data, or a result? Are you trying to have the model pinpoint where the malicious entry is? If not, why not make it easier for the model and use a binary classifier of 1/0? You can then set your token length to 1.

You should not need to hack your results using logit_bias. After 2,000 pieces of training data it should have been baked quite well unless there is something strange in the data.

How are you benchmarking your results? You have 2,000. Are you feeding all 2,000 in one go? Just to confirm, are you using a validation set?

I still don’t understand this. How are you checking each website? Your prompt is just one single URL. Are you extracting all the links from the DOM and then running each one through, and getting a collective >95% phishing result? Are there other elements that you are taking into consideration?

What are your criteria for a phishing link? In my experience it usually cannot simply be found in the URL unless it’s blatantly obvious. Your prompt URL is strange, but it seems to trigger some script with a secondary link, (never seen : or used before but I stay away from the madness of PHP) but the main URL would need to know what to do with it. I don’t know how it could be possible to know if it’s a phishing scam without some investigation or unsafe assumptions.

Just looking at your prompt example myself, I can’t decide if it’s a phishing link or not. I mean, it could be, but websites do some crazy tricks and domain changes that could easily be confused as malicious.

Based on the output noise and the 95% I would have to guess that the training data needs to be cleaned, and the criteria need to be properly defined. Perhaps for the next version you could design a rubric which helps you create your data. You could even run each piece through GPT-4 for a strong opinion. Slight inconsistencies add up. I think Curt is on the right track as well. Use every tool available and don’t rely just on fine-tuning. I bet a large amount of these “phishing links” could be caught with Regex. Jeez, I bet GPT-4 is pretty reliable.

I’m almost certain that in the future our schools will use students for labelling :rofl:

Yes, sadly I mean that it classifies nearly all websites as phishing, even though my fine-tuning data has a ratio of 1/3 clean to 2/3 phishing.

Thank you very much for your further advice. I will look into embeddings and try to evaluate this approach.

I am not quite sure what you mean. The preparation tool splits my JSONL file into training and validation sets on its own. In most cases it uses 2/3 of the data (clean and phishing) for training and 1/3 for validation. Based on that I get a report. With my current amount of data I get a score of 1.0 on every metric. My ML knowledge is limited, but in case of overfitting, shouldn’t at least recall or precision be worse? It really surprised me that, with these results, a prompt that I used for training is classified incorrectly.

I was talking about this section of the guide. Just trying to ensure that you have included a validation file and enabled the flag compute_classification_metrics. But if you’re talking about precision and recall, then the answer is yes :slight_smile:

It doesn’t make sense, really.

If your results are >95% phishing for all prompts, including clean, then your validation accuracy must be very low.
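To make that concrete, here is a small sketch of the metrics a classifier would score if it labelled every example as phishing. The counts (1000 phishing, 600 clean) are rounded from the figures given earlier in the thread, and `classification_metrics` is a helper written for this example, not the function the fine-tuning report uses.

```python
def classification_metrics(true_labels, predicted_labels, positive="ph"):
    """Accuracy, precision, and recall with `positive` as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Approximate class counts from the thread.
true_labels = ["ph"] * 1000 + ["clean"] * 600
always_phishing = ["ph"] * len(true_labels)

metrics = classification_metrics(true_labels, always_phishing)
print(metrics)
```

With these counts an always-phishing model reaches recall 1.0 but only 0.625 accuracy and precision, so it could not produce 1.0 on every metric. Running the same helper over your own predictions for the validation file is one way to cross-check the report against what the deployed model actually returns.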

Hi, thank you for the detailed response. This is my unique stop tag, and it is recommended here (OpenAI API). I could not use examples from the documentation like “###” because it could be part of an HTML page; therefore, I use a custom one for prompt and completion. My example is from the training data. As far as I know, it is not possible to send a completion as a request.

I am using a binary classifier, and I have a max_token limit of 1 in my parameters. I used the hack because strange responses came up like “#” or empty strings. With more data, it happened less frequently, but I wanted to eliminate it completely.
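For intuition about what the logit_bias hack does, here is a rough numerical sketch. It assumes the bias is simply added to the model's raw logits before a softmax, which is how the API documentation describes it. The logit values are invented, token ID 2 is a hypothetical "other" token, and the thread does not say which of 27773 and 746 corresponds to which label.

```python
import math

def softmax_with_bias(logits, logit_bias):
    """Add per-token biases to raw logits, then normalise with a softmax."""
    biased = {tok: value + logit_bias.get(tok, 0.0) for tok, value in logits.items()}
    peak = max(biased.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in biased.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Invented logits: the model actually prefers the hypothetical "other" token 2.
logits = {27773: 1.0, 746: 2.0, 2: 4.0}

unbiased = softmax_with_bias(logits, {})
biased = softmax_with_bias(logits, {27773: 100, 746: 100})

print("unbiased:", unbiased)
print("biased:  ", biased)
```

Under this assumption, an equal bias on both label tokens leaves their relative odds unchanged and only suppresses everything else. It would therefore hide cases where the model wanted to emit a different token, but it would not by itself push the verdict toward one label.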

For the results, I use the “fine_tuned.results” function that gets the training and validation data created by the preparation tool.

My prompt is currently formatted as “checked URL | extracted features from URL.” The extracted features can be anything, such as all URLs on the website I check, all HTML-Tags, or the content of selected HTML-Tags like . This is one entry/line in my JSONL. I did this for all my clean and phishing URLs.

I have no criteria for phishing URLs; I get them from phishing feeds. And yes, phishing is complex, but in the end, the domain name is different from embedded URLs, or the phished data is not sent to a URL related to the domain name or embedded URLs. But this is the main question, and I want to know if my model can see more suspicious features and decide based on that.
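The domain-mismatch idea can be written down as an explicit feature. This is a sketch under a loud assumption: `naive_site` just keeps the last two labels of the hostname, which is wrong for suffixes such as co.uk, and the example.com / example.net URLs are placeholders rather than data from my set.

```python
from urllib.parse import urlparse

def naive_site(url):
    """Last two labels of the hostname (naive: wrong for suffixes like co.uk)."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])

def foreign_urls(checked_url, embedded_urls):
    """Embedded URLs whose naive site differs from that of the checked URL."""
    home = naive_site(checked_url)
    return [u for u in embedded_urls if naive_site(u) != home]

# Placeholder URLs for illustration only.
checked = "https://login.example.com/"
embedded = [
    "https://cdn.example.com/logo.png",
    "https://example.net/collect.js",
]

print(foreign_urls(checked, embedded))
```

A count or list of such foreign URLs could be passed to the model as one more extracted feature, or used directly as a cheap rule alongside it.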

Ooohhhh. I see. You are treating it as a phishing link if the actual link differs from what appears?

Such as

Good call on changing the unique separation token. I share the same sentiment. Somewhere in the docs, though, it states that with enough data, the token itself doesn’t matter too much.

Yes, I did that from the beginning. With less data the results were between 87% and 99% for all metrics, but with the data mentioned above every metric is 1.0.

This is my main problem. The results indicate that the model was trained until it was perfect, but it fails even on its own training data (and on many other clean URLs like Google, Amazon, … too). I am not sure whether the results are faulty or whether I am not using this “perfect” system correctly and getting bad responses because of that.

Well, the URL is one feature I am evaluating. There are obviously many more possible features that can indicate phishing, like the usual phrasing. But the results are perfect for every feature, at least in the results.csv.

No, I am not basing it on the representation of a link. It is more about the logic behind phishing. would not contain a form that sends data to But as I mentioned, this is just one possible approach.

If you aren’t already I recommend using:

It is just…fantastic…beautiful almost.
It syncs with your training data. Very little overhead

If you are, could you share your graphs?


In what case would securebank allow for this? Assuming it’s a phishing attempt it wouldn’t be able to hook into securebank and use their URL unless the bank itself was hacked or there’s some serious failure of security.

Second, I believe CORS would prevent this from even happening. A website “such as OpenAI dot com” cannot use your browser to send requests to “securebank dot com”. Perhaps it could redirect to secur3bank dot com, but that would be an easy comparison to catch

I don’t want to drift too far from my problem, but obviously the real would not allow this. But people get phished when they go on that embeds all kinds of references to the real bank to deceive the victim. From a human perspective securbank[.]com [securEbank[.]com/logo.png, securEbank[.]com/header.jpg, securbank[.]com/phish_credentials.js] would cause a phishing verdict by the model.

I will try the website and hope that it will show me some more insight into my stats.