Codex and I created a Python program to convert a text file to JSON format for fine-tuning the GPT-3 model

I’m not entirely sure whether this program is helpful. Its purpose is described below. There were a few minor bugs and errors in the program that I had to fix. Codex suggested fixes for some of them, but its fixes didn’t work, and it got stuck producing further commentary marked with #%%. After a little research on Stack Overflow, I was able to fix the rest. With the required packages installed, the code works as is in my PyCharm IDE and easily converts 12 MB .txt files to .json files. I still have to figure out how to import the .json files into OpenAI, as I’m getting errors while installing the OpenAI packages in PyCharm.

"""
Create a python program which converts a single .txt file to a properly formatted .json file for fine-tuning the GPT-3 model.
"""

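import argparse
import json
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK data the script uses. Newer NLTK releases may
# ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')  # needed by stopwords.words('english') below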

def main():
    parser = argparse.ArgumentParser(description='Convert a .txt file to a properly formatted .json file for fine-tuning the GPT-3 model.')
    parser.add_argument('--input_file', type=str, default='inputfile.txt', help='The input .txt file to convert to .json format.')
    parser.add_argument('--output_file', type=str, default='output.json', help='The output .json file to write to.')
    parser.add_argument('--model_name', type=str, default='117M', help='The model name to use. See the README.md for valid model names.')
    parser.add_argument('--combine', type=int, default=50000, help='The maximum number of sentences to combine in a single output file.')
    parser.add_argument('--verbose', action='store_true', help='Print verbose output.')
    args = parser.parse_args()
    # Note: --output_file and --model_name are accepted but not used below;
    # results are written to gpt3_gentext_{start}-{end}.json files.

    if args.verbose:
        print('Loading input file...')
    with open(args.input_file, 'r', encoding='ISO-8859-1') as input_file:
        input_text = input_file.read()

    if args.verbose:
        print('Tokenizing input text...')
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = tokenizer.tokenize(input_text)

    if args.verbose:
        print('Preprocessing input text...')
    # Replace double quotes with spaces and make sure every sentence ends with
    # a period, question mark, or exclamation mark.
    for i in range(len(sentences)):
        sentences[i] = re.sub(r'"', ' ', sentences[i])
        if not sentences[i].endswith(('.', '?', '!')):
            print('Warning: no ending punctuation in line "' + sentences[i] + '" - adding period at end to prevent weirdness when converting to .json format later.')
            sentences[i] = sentences[i].strip() + '.\n'

    # Collapse runs of spaces and trim leading/trailing whitespace.
    for i in range(len(sentences)):
        sentences[i] = re.sub(r' +', ' ', sentences[i]).strip()

    if args.verbose:
        print('Tokenizing input text into list of separate sentences...')
    # Re-split the raw text with a regex: break on spaces that follow '.' or '?'
    # and precede an uppercase letter. This replaces the Punkt result above, and
    # it now runs whether or not --verbose is set.
    sentences = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', input_text)

    if args.verbose:
        print('Removing unnecessary punctuation...')
    # str.maketrans('', '', chars) builds a table that deletes every character
    # in chars; translate() applies it, then lower() lowercases the result.
    remove_punctuation = str.maketrans('', '', '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~')
    for i in range(len(sentences)):
        sentences[i] = sentences[i].translate(remove_punctuation).lower()

    if args.verbose:
        print('Tokenizing input text into list of separate words...')
    for i in range(len(sentences)):
        sentences[i] = word_tokenize(sentences[i])  # list of word tokens per sentence

    if args.verbose:
        print('Combining all tokens back into one input text...')
    # Flatten the list of token lists into one list, then join the tokens with
    # single spaces into one long string.
    sentences = [item for sublist in sentences for item in sublist]
    sentences = ' '.join(sentences)

    if args.verbose:
        print('Splitting up input text by periods...')
    # The text is now lowercase with '.', '?' and '!' removed, so this pattern
    # cannot match and the whole text stays as a single item.
    sentences_split = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', sentences)
    for i in range(len(sentences_split)):
        if not sentences_split[i].endswith(('.', '?', '!')):
            print('Warning: no ending punctuation in line "' + sentences_split[i] + '" - adding period at end to prevent weirdness when converting to .json format later.')
            sentences_split[i] = sentences_split[i].strip() + '.\n'

    if args.verbose:
        print('Removing unnecessary whitespace...')
    for i in range(len(sentences_split)):
        sentences_split[i] = re.sub(r' +', ' ', sentences_split[i]).strip()

    if args.verbose:
        print('Tokenizing input text into list of separate words...')
    for i in range(len(sentences_split)):
        sentences_split[i] = word_tokenize(sentences_split[i])

    if args.verbose:
        print('Processing input text...')
    english_stopwords = set(stopwords.words('english'))  # load once, not per token
    for i in range(len(sentences_split)):
        if args.verbose:
            print('Processing line ' + str(i + 1) + '/' + str(len(sentences_split)) + '...')

        tokens = sentences_split[i]  # word tokens of this sentence

        if args.verbose:
            print('POS tagging input text...')
        pos = nltk.pos_tag(tokens)  # list of (word, tag) pairs, e.g. ('name', 'NN')

        if args.verbose:
            print('Converting POS tags to GPT-3 format...')
        for j in range(len(pos)):
            # Map each NLTK POS tag to a coarse B/I/O label. This is the script's
            # own labelling scheme, not a format that GPT-3 requires. For the list
            # of NLTK tags, see
            # https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk .
            gpt_tag = pos[j][1]  # current word's tag (ex: NN, VBZ, PRP$, etc.)

            if gpt_tag in ['NN', 'NNS', 'NNP', 'NNPS']:  # nouns
                gpt_tag = 'I'
                # A sentence-initial noun followed by a verb is labelled 'B'.
                if j == 0 and len(tokens) > 1:
                    if pos[j + 1][1] in ['VBZ', 'VBP']:
                        gpt_tag = 'B'
            elif gpt_tag in ['VBZ', 'VBP']:  # present-tense verbs
                gpt_tag = 'O'
            else:  # any other kind of word
                # A sentence-initial non-stopword followed by a non-stopword noun
                # is labelled 'B'; everything else is 'O'.
                if (j == 0 and len(tokens) > 1
                        and pos[j + 1][1] in ['NN', 'NNS', 'NNP', 'NNPS']
                        and tokens[j + 1].lower() not in english_stopwords
                        and tokens[j].lower() not in english_stopwords):
                    gpt_tag = 'B'
                else:
                    gpt_tag = 'O'

            tokens[j] = tokens[j] + '||' + gpt_tag  # ex: "hello||O"

        if args.verbose:
            print('Combining all tokens back into one input text...')
        sentences_split[i] = ' '.join(tokens)  # ex: "my||O name||I is||O brian||I"

    if args.verbose:
        print('Writing output file...')

    def write_chunk(records, last_index):
        # Files are named gpt3_gentext_{start}-{end}.json, using zero-based
        # sentence indices.
        name = 'gpt3_gentext_' + str(last_index - len(records) + 1) + '-' + str(last_index) + '.json'
        if args.verbose:
            print('Writing ' + str(len(records)) + ' sentences to file...')
        with open(name, 'w', encoding='utf-8') as json_file:
            json.dump({'sentences': records}, json_file, separators=(', ', ':'), indent=4, ensure_ascii=False)

    output = []  # records waiting to be written
    for i in range(len(sentences_split)):
        sentence = sentences_split[i].strip()
        if sentence:  # skip empty or whitespace-only sentences
            output.append({"input": sentence, "output": ""})
        if len(output) >= args.combine:  # reached the per-file limit
            write_chunk(output, i)
            output = []

    if output:  # write whatever is left over
        write_chunk(output, len(sentences_split) - 1)


if __name__ == "__main__":
    main()
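The two steps at the heart of the script, splitting text into sentences and writing them out in batches of at most `--combine` records, can be sketched without NLTK. This is only an illustration under assumptions: `split_sentences` and `chunk_records` are hypothetical helper names that are not part of the program above, the regex is the one the script uses, and the record layout mirrors its `{"input": ..., "output": ""}` dictionaries.

```python
import json
import re

# Same pattern as the script: split on spaces that follow '.' or '?'
# and precede an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r'(?<=[^A-Z].[.?]) +(?=[A-Z])')


def split_sentences(text):
    """Split raw text into sentences, dropping empty pieces (hypothetical helper)."""
    return [piece.strip() for piece in SENTENCE_BOUNDARY.split(text) if piece.strip()]


def chunk_records(sentences, max_per_file):
    """Yield lists of at most max_per_file records (hypothetical helper)."""
    records = [{"input": sentence, "output": ""} for sentence in sentences]
    for start in range(0, len(records), max_per_file):
        yield records[start:start + max_per_file]


if __name__ == "__main__":
    text = "Hello there. How are you? Fine thanks."
    for number, chunk in enumerate(chunk_records(split_sentences(text), 2)):
        # Each chunk would become one output file in the real script.
        print(number, json.dumps({"sentences": chunk}))
```

Because this version splits before any lowercasing or punctuation removal, the uppercase-letter lookahead in the regex still has something to match.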


This is part of a larger project that lets anyone supply text files on domains and subjects of their own choosing and then use them to fine-tune the GPT-3 model. Whether I’m on the right track or way off, feedback would be helpful.
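On the fine-tuning side, the GPT-3 fine-tuning guide described training data as a JSON Lines file, one `{"prompt": ..., "completion": ...}` object per line, rather than a single nested .json document. A minimal sketch of producing that layout from a list of sentences follows. The function names are hypothetical, and the choice of an empty prompt with the sentence as the completion (with a leading space) is an assumption for illustration, so it is worth checking against the current OpenAI documentation before uploading anything.

```python
import json


def to_jsonl_lines(sentences):
    """Return one JSON string per non-empty sentence (hypothetical helper)."""
    lines = []
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue  # skip blank entries
        # Assumption: empty prompt, completion is the sentence with a leading space.
        record = {"prompt": "", "completion": " " + text}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


def write_jsonl(sentences, path):
    """Write the records to path, one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in to_jsonl_lines(sentences):
            handle.write(line + "\n")


if __name__ == "__main__":
    for line in to_jsonl_lines(["Hello there.", "How are you?"]):
        print(line)
```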


Thanks, I’ve been racking my brain over this all day :scream:
