Fine-tuning a GPT-2 model to remove comments from Python code

Hello Everyone,

Can anyone please let me know what the format of my training data should be? I'm planning to fine-tune a GPT-2 model to remove comments from source code.

Which model from the GPT-2 family is recommended for this use case?

Thanks in advance!

See the fine-tuning guide for details on file format.
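
For reference, at the time the guide described JSONL training files with one prompt/completion pair per line. An illustrative line for this task could look like the example below; the "\n\n###\n\n" separator and the " END" stop marker are placeholder conventions, not required values:

{"prompt": "x = 1  # set x\nprint(x)  # show it\n\n###\n\n", "completion": " x = 1\nprint(x) END"}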

Is there a reason you’re looking to use fine-tuning for this? Seems a bit overkill when basic regex can likely handle it.

I would also suggest using a standard algorithm for that.
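
For what it's worth, here is a minimal sketch of that standard (non-ML) route using Python's built-in tokenize module; the function name and the example string are mine, not something from this thread:

import io
import tokenize

def strip_comments(source: str) -> str:
    # Drop COMMENT tokens and rebuild the source from whatever remains.
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    stripped = tokenize.untokenize(kept)
    # untokenize leaves padding spaces where the comments used to be; trim them per line.
    return "\n".join(line.rstrip() for line in stripped.splitlines())

print(strip_comments("x = 1  # set x\n# a full-line comment\nprint(x)\n"))

Unlike a regex, the tokenizer will not strip "#" characters that appear inside string literals.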

The following conversation might be helpful. I asked GPT-4 to write a program that removes comments from source code:

I haven’t tested the code and it will probably require some further corrections (you might ask GPT-4 for that).


Yes, I am aware that it can be done using basic regex, but we need to learn the capabilities of the GPT-2 model and see whether this is possible or not.

Thanks!

Yes, I'll go through the code and check for any required corrections. Unfortunately, I still don't have access to the GPT-4 model; I have applied for the waitlist.

Thanks!


"Possible" and "best use of the technology" are very different things. Unless you know you need to fine-tune an LLM, you probably don't need to fine-tune an LLM. Fine-tuning should be done only after you've exhaustively tried embeddings and/or few-shot prompts.
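
As a concrete example of the few-shot route, a prompt with a couple of before/after pairs is usually enough for this transformation. A rough sketch, assuming the pre-1.0 openai-python interface and gpt-3.5-turbo (both my assumptions, not something settled in this thread):

import openai

few_shot_prompt = (
    "Remove the comments from the Python code and return only the code.\n\n"
    "Code:\nx = 1  # set x\nCleaned:\nx = 1\n\n"
    "Code:\n# greeting\nprint('hi')\nCleaned:\nprint('hi')\n\n"
    "Code:\nvar = 2  # a variable\nCleaned:\n"
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # assumed model choice
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])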

Hi @novaphil @damc4 ,

I tried a few steps of fine-tuning along with data pre-processing for the same use case.

So I have a total of 637 pairs of input and label tensors after adding EOS and (starting point) tokens, along with a few required pre-processing steps.

Is it normal for the model to take a good amount of time to train on the input data just because I introduced EOS tokens, etc.? It has been 15 minutes and the cell is still running.

Visual representation of the data:

Input data: code = ["print('hello world')", "#This is a print statement", 'var=2']
Label data: code_wo_comment = ["print('hello world')", "#This is a print statement", 'var=2']

Operations:

encoded_inputs = tokenizer(['' + code for code in code_with_comments], padding='max_length', max_length=1024, truncation=True)

encoded_labels = tokenizer(['' + code for code in code_without_comments], padding='max_length', max_length=1024, truncation=True)

I think the above-mentioned data format might give us answers at a granular level.
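
For comparison, since GPT-2 is a decoder-only model, one common way to frame this task is to train on a single concatenated sequence (commented code, a separator, the clean code, then EOS) and mask the prompt portion out of the loss. A minimal sketch with the Hugging Face transformers tokenizer; the separator string and the example pair are placeholders of mine, not the real training data:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

commented = "print('hello world')  # greet the user"
clean = "print('hello world')"

# One training sequence: commented code + separator + clean code + EOS
separator = "\n### cleaned ###\n"  # arbitrary marker chosen for illustration
text = commented + separator + clean + tokenizer.eos_token

enc = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
labels = enc["input_ids"].clone()

# Mask the prompt and separator so only the clean-code tokens contribute to the loss
prompt_len = len(tokenizer(commented + separator)["input_ids"])
labels[0, :prompt_len] = -100

# enc["input_ids"], enc["attention_mask"], and labels can then be passed to
# GPT2LMHeadModel(...)(input_ids=..., attention_mask=..., labels=labels)

On the training time: padding every sample to max_length=1024 means each step processes 1024 tokens per example even for one-line snippets, so padding only to the longest example in each batch (padding='longest') tends to be noticeably faster on small datasets.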

Does all of this seem valid?
Thanks in advance!