Fine-Tuning GPT-2 model to remove comments from Python code

Hello Everyone,

Can anyone please let me know what format my training data should be in if I’m planning to fine-tune a GPT-2 model to remove comments from source code?

Which model from the GPT-2 family is recommended for this use case?

Thanks in advance!

See the fine-tuning guide for details on file format.

Is there a reason you’re looking to use fine-tuning for this? Seems a bit overkill when basic regex can likely handle it.

I would also suggest using a standard algorithm for that.
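As an illustration of the "standard algorithm" route, here is a minimal sketch that uses Python's built-in tokenize module rather than a regex, so that a "#" inside a string literal is not mistaken for a comment. The function name strip_comments is made up for this example, and the sketch assumes the input is valid Python source. Comment-only lines are left behind as blank lines.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Remove '#' comments from Python source, leaving string literals untouched."""
    # Split the source the same way the tokenizer reads it, so row numbers match.
    lines = io.StringIO(source).readlines()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            line = lines[row - 1]
            # Preserve the original line ending, if the line has one.
            newline = line[len(line.rstrip("\r\n")):]
            # A comment always runs to the end of its line, so cut from col onward
            # and drop the whitespace that separated the code from the comment.
            lines[row - 1] = line[:col].rstrip() + newline
    return "".join(lines)


sample = "x = 1  # set x\ns = '# not a comment'\n# full line\nprint(s)\n"
cleaned = strip_comments(sample)
```

Besides being a direct solution, something like this could also be used to generate the comment-free labels for a fine-tuning dataset.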

The following conversation might be helpful. I asked GPT-4 to write a program that removes comments from source code:

I haven’t tested the code and it will probably require some further corrections (you might ask GPT-4 for that).

Yes, I am aware that it can be done using basic regex, but we just want to learn the capabilities of the GPT-2 model and whether this is possible or not.


Yes, I’ll go through the code and check for any required corrections. Unfortunately, I still have no access to the GPT-4 model. I have applied for the waitlist.


“Possible” and “best use of the technology” are very different things. Unless you know you need to fine-tune an LLM, you probably don’t need to fine-tune an LLM. Fine-tuning should be done only once you’ve exhausted what embeddings and/or few-shot prompts can do.

Hi @novaphil @damc4 ,

I tried a few steps of fine-tuning, along with data pre-processing, for the same use case.

So I have a total of 637 pairs of input and label tensors after adding EOS and (starting point), along with the few pre-processing steps required.

Is it normal for the model to take a good amount of time to train on the input data just because I introduced EOS, etc.? The cell has been running for 15 minutes.

Visual representation of Data :-

Input data: code_with_comments = ["print('hello world')", "#This is a print statement", "var=2"]
Label data: code_without_comments = ["print('hello world')", "#This is a print statement", "var=2"]

Operations :-

encoded_inputs = tokenizer(['' + code for code in code_with_comments], padding='max_length', max_length=1024, truncation=True)

encoded_labels = tokenizer(['' + code for code in code_without_comments], padding='max_length', max_length=1024, truncation=True)

I think the above-mentioned data format might give us answers at a granular level.

Does all of this seem valid?
Thanks in advance!
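
On the data format question: GPT-2 is a decoder-only model, so one common layout is to put the commented code and the cleaned code into a single training sequence, rather than tokenizing inputs and labels as two separate padded batches. Below is a minimal sketch of that layout in plain Python. The separator string SEP and the helper names build_example and build_prompt are hypothetical and chosen only for illustration; "<|endoftext|>" is GPT-2's end-of-text token.

```python
EOS = "<|endoftext|>"   # GPT-2's end-of-text token
SEP = "<|cleaned|>"     # hypothetical marker between input and target (an assumption)


def build_example(code_with_comments: str, code_without_comments: str) -> str:
    """Join one input/target pair into a single training string."""
    return code_with_comments + SEP + code_without_comments + EOS


def build_prompt(code_with_comments: str) -> str:
    """At inference time the model is given everything up to the separator."""
    return code_with_comments + SEP


# Toy pair for illustration only.
pairs = [("x = 1  # set x\n", "x = 1\n")]
texts = [build_example(src, tgt) for src, tgt in pairs]
prompt = build_prompt("y = 2  # set y\n")
```

As for the training time, padding each of the 637 pairs to max_length=1024 means 637 × 1024 = 652,288 token positions per epoch, most of them padding for short snippets. That is probably a bigger factor than the EOS token itself, so padding to the longest example in each batch may be worth trying.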