Is it possible to fine-tune with unlabeled data and then labeled data?

Hi,
I have a large dataset where each data sample is a JSON object and might have more than 16k tokens. But for now I am going with only ~16k examples. My goal is to have a model that generates the JSON from a user prompt.

Since my dataset is huge (~200k samples), I cannot label all of them. But I can label around 500.
So I am wondering how to use the unlabeled data to fine-tune or train gpt-3.5-turbo. I need this to teach the model the structure of the JSON and the relations between its key-value pairs. Then I can train with the labeled data and instructions.

My questions:

  1. Is it possible to fine-tune with unlabeled data?
  2. If yes, how should I prepare my messages for training? What should be the content of each role (system, user, assistant)?

Any advice and suggestions are highly appreciated

Heya! Welcome to the forum.

Not exactly sure what you mean here by “label”…

Setting up a fine-tuning dataset is relatively straightforward. You can find more here
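For reference, each training example for chat-model fine-tuning is a single JSON line in a `.jsonl` file. A minimal sketch in Python (the prompt and response strings here are hypothetical placeholders, not a recommended recipe):

```python
import json

# One chat-format fine-tuning example. The content strings are
# placeholders for illustration; your real data goes here.
example = {
    "messages": [
        {"role": "system", "content": "You generate UI-design JSON from a description."},
        {"role": "user", "content": "Create a blue button with OK text"},
        {"role": "assistant", "content": json.dumps(
            {"location": {"x": 45, "y": 60}, "color": "blue", "text": "OK"}
        )},
    ]
}

# Each example becomes one line of the .jsonl training file.
line = json.dumps(example)
print(line)
```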

We should probably backtrack, though, and see why you think fine-tuning might help here?

You know it can produce JSON without fine-tuning, correct?

Can you lay out more what you’re trying to accomplish?


Hi @PaulBellow , thanks for your reply.
Every sample in my dataset is a JSON representation of a UI design.
EX:

{"location": {"x": 45, "y": 60}, "color": "blue", "text": "OK"}

and my ‘label’ here is the ground truth, i.e. the user prompt: “Create a blue button with OK text”.

I am hoping that with proper user prompts the model can learn (understand) the relations between the {key: value} pairs.

But I cannot write a user prompt (content for the user role) for my entire dataset, so I want to feed a large portion of the dataset into the model and let it learn on its own, hoping it will learn how to predict the next token for my examples.

After that I will further train with the small labeled dataset, which I will set up like a chat:

{"messages": [{system...}, {"role": "user", "content": "create a blue button with OK text"}, {"role": "assistant", "content": "json_object"}]}
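To show what I mean for the unlabeled portion: since the fine-tuning API expects the chat format, one workaround I am considering is wrapping each unlabeled JSON sample in a generic completion-style example. A rough sketch (the generic prompt text and the halfway split point are my own assumptions, not a vetted recipe):

```python
import json

# Sketch: turn an unlabeled JSON sample into a chat-format example by
# showing the model a prefix and asking it to complete the rest.
# GENERIC_PROMPT is a placeholder assumption.
GENERIC_PROMPT = "Continue this UI-design JSON, keeping all key-value relations consistent."

def to_chat_example(sample: dict) -> str:
    text = json.dumps(sample)
    cut = len(text) // 2  # arbitrary split point for illustration
    return json.dumps({
        "messages": [
            {"role": "user", "content": GENERIC_PROMPT + "\n" + text[:cut]},
            {"role": "assistant", "content": text[cut:]},
        ]
    })

print(to_chat_example({"color": "blue", "text": "OK"}))
```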

Hi!

Before we get deeper into the process of properly fine-tuning a model: looking at your example, I believe this could be done with GPT-3.5 and a script.

Try to split the task into several steps:
What should be created?
What color should the element have?
Where to position the element?

Then run a script to properly format the output.

This can save time and money, while you send a few examples in the system prompt showing what the correct answer would look like in different cases.
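A rough sketch of that step-by-step approach in Python. Here `ask_model` is a stub with canned answers standing in for a real GPT-3.5 call; the question strings and output schema are assumptions for illustration:

```python
import json

def ask_model(question: str, description: str) -> str:
    # Stub: in practice this would call GPT-3.5 with the question and
    # the user's description. Canned answers are for illustration only.
    canned = {
        "What should be created?": "button",
        "What color should the element have?": "blue",
        "Where to position the element?": "45,60",
    }
    return canned[question]

def build_element(description: str) -> dict:
    kind = ask_model("What should be created?", description)
    color = ask_model("What color should the element have?", description)
    x, y = ask_model("Where to position the element?", description).split(",")
    # Deterministic formatting happens in the script, not in the model.
    return {"type": kind, "color": color, "location": {"x": int(x), "y": int(y)}}

print(json.dumps(build_element("Create a blue button at 45,60")))
```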

Hi @vb ,
Thank you for your reply!

My dataset is significantly more complex than the above example, often containing objects nested to depths of 10 layers or more. A crucial aspect of these objects is their interconnected nature; alterations in one {key: value} pair necessitate corresponding changes elsewhere.

That is why I said above: “I am hoping that with proper user prompts the model can learn (understand) the relations between {key: values}.”
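To illustrate the kind of cross-key dependency I mean, here is a hypothetical validator one could run on generated output. The rule itself (children must fit inside their parent's bounds) is invented purely as an example of "interconnected" keys, not taken from my actual schema:

```python
# Hypothetical sketch: in nested UI JSON, changing one key-value pair
# (e.g. a parent's size) forces changes elsewhere (children must still
# fit inside it). The constraint is made up for illustration.
def children_fit(node: dict) -> bool:
    w = node.get("width", float("inf"))
    h = node.get("height", float("inf"))
    for child in node.get("children", []):
        loc = child.get("location", {"x": 0, "y": 0})
        if loc["x"] + child.get("width", 0) > w or loc["y"] + child.get("height", 0) > h:
            return False
        if not children_fit(child):  # recurse through nested layers
            return False
    return True

ok = children_fit({
    "width": 100, "height": 50,
    "children": [{"location": {"x": 10, "y": 10}, "width": 30, "height": 20}],
})
print(ok)  # prints True: the child stays within the parent's bounds
```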

Yes. DSPy.

But please be warned: it will be costly to generate (and use) this prompt.