Fine-tuning on my personal blog

:raised_hand: Hi there! I'm a beginner at fine-tuning and have run it 3 times. My goal is to create a model that uses the tone of voice from my blog. So, I did it: I took 600,000 tokens and prepared them for the model. I finally got decent results, BUT the model answers me with only 100-200 characters. That's fine for short questions, but when I ask the model to write an article, for example, it can't.
In my training data I used Telegram posts of 100-1000 characters in length, so I understand why the results are too short.

So, what can I do with this model to increase answer length? Should I train a new model? But how? Are there any hints about this problem? Maybe I need to combine different posts and make one huge example? But then what about the logical coherence of the result?

Global task: create a model from my blog (short messages) that can write long articles without hallucinations. :slight_smile:

Thanks everyone for helping!

Make sure your model can be fine-tuned: OpenAI Platform

Also, you may want to check the troubleshooting guide: OpenAI Platform

Maybe it's because of the model, gpt-3.5-turbo-0613. If gpt-3.5-turbo-16k were available for fine-tuning, it would be great.

The problem here is not the size of the context window but the examples in the training data:

When you’re fine-tuning, you need to provide examples of the content you want to generate. For example, if you fine-tune from Telegram posts, you’ll get a Telegram bot. If you want to fine-tune GPT-3.5 to write blog posts in your style, then you’ll need to provide approximately 50 good examples of blog posts you’ve written yourself, along with an appropriate prompt.
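
As a rough sketch, one training line could look like this (the system prompt and topic here are placeholders; the assistant content would be the full text of an article you actually wrote):

{"messages": [{"role": "system", "content": "You are a blogger who writes long-form articles."}, {"role": "user", "content": "Write a blog article about home automation."}, {"role": "assistant", "content": "(full text of one of your own long articles)"}]}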

By using gpt-3.5-turbo as your fine-tune base, you also bring along its pretraining baggage: it tends to produce only particular styles and lengths of output, and that must be overcome.

fine-tune ideas (sketched in JSONL below):

  1. ability examples: “Can you write a whole book?”: “Certainly, I have no output length limitations.”

  2. “example” examples: “Write a six-page 100 paragraph blog entry for me on Arduino”: “Sure, …” (train on 4000 tokens of actual writing).
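
A rough sketch of what those two kinds of training lines might look like (all content here is illustrative):

{"messages": [{"role": "user", "content": "Can you write a whole book?"}, {"role": "assistant", "content": "Certainly, I have no output length limitations."}]}
{"messages": [{"role": "user", "content": "Write a six-page 100 paragraph blog entry for me on Arduino"}, {"role": "assistant", "content": "Sure! (followed by ~4000 tokens of actual writing)"}]}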

@_j does this look good?

Yes, congrats: your training file was refused for having too much content, just as the message and documentation explain.

You can run the combination of inputs and also the response through a tokenizer:

https://tiktokenizer.vercel.app/

Or write your own file validator using tiktoken, allowing for some message overhead: 7 tokens, plus 4 per message.

@_j that tiktokenizer would be better if it had a feature to import from a JSON/JSONL file and to export (if really needed), so you don't have to put messages in manually one by one.

tiktoken, which I mentioned above, is a Python token-counting library that you can install simply with pip.

pip install tiktoken
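
For a quick sanity check before writing a full validator, basic usage is just a few lines (the string here is a placeholder):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo
tokens = enc.encode("Your training text goes here.")  # placeholder text
print(len(tokens))  # number of tokens in the string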

Because I’m a fun guy, I just sat down and coded a per-line token counter for jsonl training files.

import json
import tiktoken

class Tokenizer:
    def __init__(self, encoding_name):
        self.tokenizer = tiktoken.get_encoding(encoding_name)

    def count(self, text):
        # Number of tokens in a plain string
        return len(self.tokenizer.encode(text))

    def message(self, messages):
        # Annotate each message dict with its token cost:
        # 3 tokens of per-message overhead, plus role and content
        for msg in messages:
            role_string = msg['role']
            if 'name' in msg:
                role_string += ':' + msg['name']
            msg['tokens'] = 3 + self.count(role_string) + self.count(msg['content'])
        return messages  # always a list, so callers can iterate safely


cl100k = Tokenizer("cl100k_base")
# Example string
trainfile = """
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
""".strip()

# Uncomment the following lines to read from a file instead
# with open('filename.jsonl', 'r') as f:
#     trainfile = f.read()
max_line = 52  # report lines over this many tokens (use ~4080 for real training files)
over_max_line = []
total_tokens_list = []
for i, line in enumerate(trainfile.split('\n'), start=1):
    if not line.strip():
        continue  # skip blank lines rather than flagging them as bad JSON
    try:
        jsonl_line = json.loads(line)
        token_msg_list = cl100k.message(jsonl_line['messages'])
        total_tokens = 3  # per-example overhead (reply priming)
        for msg in token_msg_list:
            total_tokens += msg.get('tokens', 0)
        # Check once per line, after summing all messages
        if total_tokens > max_line:
            print(f"line {i}:Total tokens are over {max_line}")
            over_max_line.append(i)
        total_tokens_list.append(total_tokens)
    except json.JSONDecodeError:
        print(f"Error decoding JSON on line {i}: {line[:320]}")

print(total_tokens_list)

It uses a stripped-down class I already wrote to add token metadata to lists of messages. (If you have functions, you’ll need a preprocessor to add the AI-received language to the system message for token counting)

The other things you can edit within the code should be apparent once you understand what it does.

Execution on the “marv” example and a low threshold:

line 3:Total tokens are over 52
[46, 51, 54]

It should also barf errors or crash appropriately on bad files.

So, it is not working. What if I have only short Telegram posts and want to create a model that could write long articles? Is it impossible?

You are getting a pretrained refusal. You can remove one of the two criteria from your prompt, and pick numbers like 85 paragraphs that will generally not get denied.

But you can see that your fine-tune doesn't have enough inference ability to overcome what's already in the model. You'll need many “write 100 paragraphs” examples where the model actually performs the task at that length, to re-train it that the command is acceptable.
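
As a concrete sketch of the first suggestion, here is a request that keeps a single length criterion. The model id is a placeholder for your own fine-tune, and this assumes the openai Python library v1+:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0613:your-org::abc123",  # placeholder fine-tune id
    messages=[
        # one length criterion ("85 paragraphs") instead of pages AND paragraphs
        {"role": "user", "content": "Write an 85 paragraph blog entry on Arduino."},
    ],
)
print(response.choices[0].message.content)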

Putting a whole bunch of useless refusals into the AI model to save money is a lot easier for OpenAI than actually training on quality writing.

I remember a similar case when developing my chess engine using GPT-4. It kept refusing to generate a lot of code for me, and I couldn't convince it. In the end, the constant refusals helped me realize that my approach wasn't the best and that I should take a step back and reconsider other possibilities. So maybe it's a good thing when the AI declines your request - it could be a hint that you should look for other options.

I agree with you, but I'm still trying to find a way to change the model. What do you think about this fine-tuning plan (a sketch of step 2 follows below):

  1. Take the 1st model.

  2. Create a new one. The second model is the same, but in "message" I would insert 3-5 posts instead of one.

  3. Combine them.
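
A minimal sketch of step 2, assuming your posts live in a list of strings (the system and user prompts are placeholders you would adapt):

import json

# Hypothetical input: your Telegram posts as strings, in posting order
posts = ["first post text ...", "second post text ...", "third post text ...",
         "fourth post text ...", "fifth post text ..."]

group_size = 4  # merge 3-5 consecutive posts per training example
with open("train_merged.jsonl", "w", encoding="utf-8") as f:
    for i in range(0, len(posts), group_size):
        merged = "\n\n".join(posts[i:i + group_size])
        example = {"messages": [
            {"role": "system", "content": "You write long articles in the author's voice."},  # placeholder
            {"role": "user", "content": "Write a long article in my blog's style."},          # placeholder
            {"role": "assistant", "content": merged},
        ]}
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

Note that posts merged blindly will teach length but not structure; grouping posts by topic before merging should help with the logical-coherence concern.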