When composing a fine-tuning dataset, would it be effective to use only keywords?

Hi, guys. I have a question.

When composing a dataset for fine-tuning, can it recognize the user’s content using only keywords without expressing the entire question?

Let me give an example.

When wanting to train on user prompt data like ‘What is the singularity of artificial intelligence?’, using just the keyword ‘Artificial intelligence singularity’ in the user’s content.


Full

{“messages”:[{“role”:“user”,“content”:“What is the singularity of artificial intelligence?”}
,{“role”:“assistant”,“content”:"The singularity of artificial intelligence is… "}]}

Simplified

{“messages”:[{“role”:“user”,“content”:“Artificial intelligence singularity”},
{“role”:“assistant”,“content”:"The singularity of artificial intelligence is… "}]}


Can using such simplified data with only keywords for training still enable the model to respond as accurately as if it were trained with the full data?"

1 Like

Unless you or your users are only ever going to interact with the model using just keywords, this is likely a terrible idea.

You would essentially be trying to train out of the model its ability to ignore irrelevant and extraneous information.

2 Likes

I thought it could be handled since I knew the first chunk was processed separately, but it seems that’s not the case. However, I will give it a try. Thank you!

I thought it could be handled since I knew the first chunk was processed separately, but it seems that’s not the case. However, I will give it a try. Thank you!