Hi,
I have a large dataset where each sample is a JSON object that can exceed 16k tokens, but for now I am going with only ~16k examples. My goal is to have a model that generates the JSON from a user prompt.
Since my full dataset is huge (~200k samples), I cannot label all of it, but I can label around 500.
So I am wondering how to use the unlabeled data to fine-tune or train gpt-3.5-turbo. I need this to teach the model the structure of the JSON and the relations between its key/value pairs. Then I can train on the labeled data with instructions.
My questions:
Is it possible to fine-tune with unlabeled data?
If yes, how should I prepare my messages for training? What should the content of each role (system, user, assistant) be? My ‘label’ here means the ground truth / user prompt, e.g. “Create a blue button with OK text”.
I am hoping that with proper user prompts the model can learn (understand) the relations between {key: value} pairs.
BUT I cannot write a user prompt (content for the user role) for my whole dataset, so I want to feed a large portion of the dataset into the model and let it learn on its own, hoping it will learn to predict the next token for my examples.
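For that unlabeled pass, this is roughly what I have in mind, one JSONL line per sample (the generic system/user texts here are just placeholders I made up, and I don't know if a generic prompt like this is the right approach):
{"messages": [{"role": "system", "content": "You generate component JSON."}, {"role": "user", "content": "Generate a valid component JSON."}, {"role": "assistant", "content": "<one raw JSON sample, serialized as a string>"}]}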
After that I will further train on the small labeled dataset, which I will set up like a chat:
{"messages": [{system...},{"role": user, "content": "create a blue button with OK text"}, {"role": assisstant, "content": "json_object" }]}
My dataset is significantly more complex than the above example, often containing objects nested 10 layers deep or more. A crucial aspect of these objects is their interconnected nature: altering one {key: value} pair necessitates corresponding changes elsewhere.
That is why I said above, “I am hoping that with proper user prompts the model can learn (understand) the relations between {key: value} pairs.”
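To give a flavor of what I mean by interconnected (this is a made-up toy example; my real objects go much deeper):
{"component": {"type": "button", "text": "OK", "style": {"background": "blue", "textColor": "white"}, "accessibility": {"label": "OK button", "contrast": "high"}}}
Here, changing "background" to a light color would also require changing "textColor" and "contrast" for the object to stay consistent, and that is the kind of relation I want the model to pick up.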