Can anyone please let me know what the format of my training data should be when fine-tuning a GPT-2 model to remove comments from source code?
Which model from the GPT-2 family is recommended for this use case?
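For instance, should each training example be one concatenated string, or a separate source/target pair? Both sketches below are just guesses at a format (the separator and field names are placeholders I made up):

```python
# (a) One concatenated string per example, as a causal LM like GPT-2 expects.
example_a = "x = 1  # set x\n### OUTPUT ###\nx = 1"

# (b) Separate source/target fields, e.g. one JSON-style record per example.
example_b = {"input": "x = 1  # set x", "target": "x = 1"}
```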
Possible and “best use of the technology” are very different things. Unless you know you need to fine-tune an LLM, you probably don’t need to fine-tune an LLM. Fine-tune only after you’ve exhausted embeddings and/or few-shot prompting.
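For this particular task, a few-shot prompt is often enough. Here is a minimal sketch; the examples and wording are made up, not a tested recipe:

```python
# A few-shot prompt sketch for comment removal; send the formatted
# string to whatever completion endpoint/model you are using.
few_shot_prompt = """Remove all comments from the code. Change nothing else.

Code:
x = 1  # set x
# initialize y
y = 2
Cleaned:
x = 1
y = 2

Code:
print('hello world')  # greet the user
Cleaned:
print('hello world')

Code:
{code}
Cleaned:
"""

prompt = few_shot_prompt.format(code="var = 2  # a variable")
```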
I tried a few steps of fine-tuning, along with data pre-processing, for the same use case.
After adding the EOS and start-of-sequence tokens, plus a few required pre-processing steps, I ended up with a total of 637 pairs of input and label tensors.
Is it normal for training to take this long just because I introduced the EOS token, etc.? The cell has been running for 15 minutes now.
Example of the data:
Input data: code_with_comments = ["print('hello world')", "# This is a print statement", "var = 2"]
Label data (comments stripped): code_without_comments = ["print('hello world')", "", "var = 2"]
Operations:
encoded_inputs = tokenizer([tokenizer.bos_token + code for code in code_with_comments], padding='max_length', max_length=1024, truncation=True)
encoded_labels = tokenizer([tokenizer.bos_token + code for code in code_without_comments], padding='max_length', max_length=1024, truncation=True)
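Spelled out end to end, the pre-processing I ran looks roughly like this (the sample data is made up; I set the pad token manually since GPT-2 doesn’t ship with one, and mask padding out of the loss):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

code_with_comments = ["print('hello world')  # greet", "var = 2  # a variable"]
code_without_comments = ["print('hello world')", "var = 2"]

encoded_inputs = tokenizer(
    [tokenizer.bos_token + code + tokenizer.eos_token for code in code_with_comments],
    padding='max_length', max_length=1024, truncation=True, return_tensors='pt',
)
encoded_labels = tokenizer(
    [tokenizer.bos_token + code + tokenizer.eos_token for code in code_without_comments],
    padding='max_length', max_length=1024, truncation=True, return_tensors='pt',
)

# Mask padding positions so they are ignored by the loss.
labels = encoded_labels['input_ids'].clone()
labels[encoded_labels['attention_mask'] == 0] = -100
```

One thing I’m unsure about: since GPT-2 is decoder-only, should each source/target pair instead be concatenated into a single training sequence rather than kept as separate input and label tensors?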
I think the data format above should give us answers at a granular level.