Can anyone please let me know what the format of my training data should be when fine-tuning a GPT-2 model to remove comments from source code?
Which model from the GPT-2 family is recommended for this use case?
Thanks in advance!
See the fine-tuning guide for details on file format.
Is there a reason you’re looking to use fine-tuning for this? It seems a bit overkill when a basic regex can likely handle it.
I would also suggest using a standard algorithm for that.
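To illustrate the regex suggestion above, here is a minimal sketch of a comment stripper for Python source. It is deliberately naive: a pattern like this will also strip `#` characters inside string literals, so a robust version should use Python’s `tokenize` module instead.

```python
import re

def strip_python_comments(source: str) -> str:
    """Remove full-line and trailing '#' comments from Python source.

    Naive sketch: this regex will also eat '#' inside string literals,
    so treat it as a starting point, not a complete solution.
    """
    lines = []
    for line in source.splitlines():
        stripped = re.sub(r"#.*$", "", line).rstrip()
        if stripped:  # drop lines that were pure comments
            lines.append(stripped)
    return "\n".join(lines)

print(strip_python_comments("print('hello world')\n# a comment\nvar = 2"))
# → print('hello world')
#   var = 2
```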
The following conversation might be helpful. I asked GPT-4 to write a program that removes comments from source code.
I haven’t tested the code, and it will probably require some further corrections (you might ask GPT-4 for those).
Yes, I am aware that it can be done with a basic regex, but we want to learn the capabilities of the GPT-2 model and find out whether this is possible at all.
Yes, I’ll go through the code and check for any required corrections. Unfortunately I still have no access to the GPT-4 model; I have applied for the waitlist.
Possible and “best use of the technology” are very different things. Unless you know you need to fine-tune an LLM, you probably don’t need to fine-tune an LLM. Fine-tuning should come only after you’ve exhausted embeddings and/or few-shot prompts.
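As an illustration of the few-shot route mentioned above, here is a hedged sketch of a prompt for comment removal. The examples and the `Code:`/`Clean:` delimiters are purely illustrative, not a required format.

```python
# Illustrative few-shot prompt for comment removal; a completion-style
# model would be expected to continue the text after the final "Clean:".
few_shot_prompt = """Remove the comments from each code snippet.

Code: x = 1  # set x
Clean: x = 1

Code: # greet the user
print('hi')
Clean: print('hi')

Code: var = 2  # the answer
Clean:"""

print(few_shot_prompt.endswith("Clean:"))  # → True
```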
Hi @novaphil @damc4 ,
I tried a few steps of fine-tuning, along with data pre-processing, for the same use case.
I ended up with a total of 637 pairs of input and label tensors after adding EOS and (starting point) tokens, along with a few required pre-processing steps.
Is it normal for the model to take a good amount of time to train on the input data just because I introduced the EOS tokens, etc.? The cell has been running for 15 minutes.
Visual representation of the data:
Input data: code_with_comments = ["print('hello world')", "#This is a print statement", "var=2"]
Label data: code_without_comments = ["print('hello world')", "var=2"]
encoded_inputs = tokenizer(['' + code for code in code_with_comments], padding='max_length', max_length=1024, truncation=True)
encoded_labels = tokenizer(['' + code for code in code_without_comments], padding='max_length', max_length=1024, truncation=True)
I think the data format mentioned above might give us answers at a granular level.
Does all of this seem valid?
Thanks in advance!