Hi,
I have a large dataset where each sample is a JSON object that can exceed 16k tokens, but for now I am going with only ~16k examples. My goal is to have a model that generates the JSON from a user prompt.
Since my full dataset is huge (~200k samples), I cannot label all of it, but I can label around 500.
So I am wondering how to use the unlabeled data to fine-tune or train gpt-3.5-turbo. I need this to teach the model the structure of the JSON and the relations between its key/value pairs. Then I can train on the labeled data with instructions.
My questions:
Is it possible to fine-tune with unlabeled data?
If yes, how should I prepare my messages for training? What should the content of each role (system, user, assistant) be? My ‘label’ here means the ground truth / user prompt, e.g. “Create a blue button with OK text”.
I am hoping that with proper user prompts the model can learn (understand) the relations between {key: value} pairs.
BUT I cannot write a user prompt (content for the user role) for my whole dataset, so I want to feed a large portion of the dataset into the model and let it learn on its own, hoping it will learn to predict the next token for my examples.
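For that unlabeled pass, this is roughly what I have in mind, one JSONL line per sample (the generic system/user texts here are just placeholders I made up, and I don't know if a generic prompt like this is the right approach):
{"messages": [{"role": "system", "content": "You generate component JSON."}, {"role": "user", "content": "Generate a valid component JSON."}, {"role": "assistant", "content": "<one raw JSON sample, serialized as a string>"}]}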
After that I will further train on the small labeled dataset, which I will set up like a chat:
{"messages": [{system...},{"role": user, "content": "create a blue button with OK text"}, {"role": assisstant, "content": "json_object" }]}
My dataset is significantly more complex than the above example, often containing objects nested 10 layers deep or more. A crucial aspect of these objects is their interconnected nature: altering one {key: value} pair necessitates corresponding changes elsewhere.
That is why I said above, “I am hoping that with proper user prompts the model can learn (understand) the relations between {key: value} pairs.”
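To give a flavor of what I mean by interconnected (this is a made-up toy example; my real objects go much deeper):
{"component": {"type": "button", "text": "OK", "style": {"background": "blue", "textColor": "white"}, "accessibility": {"label": "OK button", "contrast": "high"}}}
Here, changing "background" to a light color would also require changing "textColor" and "contrast" for the object to stay consistent, and that is the kind of relation I want the model to pick up.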