Regarding Text Preprocessing for Fine Tuning

Hey,

Does anyone have any information on how to correctly preprocess text before submitting it to GPT-3? We’re training a YouTube description generator where the prompt is the video title and the completion is an appropriate description. The problem is that YouTube allows all sorts of characters, so should we modify the text before training? Some examples of what we might do:

  1. Replacing double spaces
  2. Removing emojis
  3. Making the text lowercase

I was wondering if there’s a standardized algorithm for normalizing user text into something GPT-3 understands better, or whether better results come from not doing these things at all.
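
Just to make those three options concrete, here’s roughly what I mean in Python (the example title is invented, and I’m not sure whether any of these steps actually help):

    import re

    def normalize_title(text: str) -> str:
        # (2) Remove emojis and any other non-ASCII symbols -- a very blunt approach
        text = text.encode("ascii", errors="ignore").decode("ascii")
        # (1) Collapse double (or longer) runs of whitespace into a single space;
        #     doing this after the emoji removal also catches the gaps it leaves behind
        text = re.sub(r"\s+", " ", text).strip()
        # (3) Lowercase everything
        text = text.lower()
        return text

    print(normalize_title("My  First Vlog 🎉  IN PARIS"))  # -> "my first vlog in paris"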

Any advice is appreciated!

1 Like

It’s entirely up to you. I would just read up on YouTube SEO and conform to whatever folks say works best for YouTube video descriptions.

Maybe I should update my video descriptions… LOL

Hi, this post has nothing to do with SEO or YouTube descriptions; it’s about how to appropriately preprocess user input text before training in general, not limited to a specific text category. Although it sounds nice, I feel like leaving it entirely up to me is a big mistake and won’t produce the best results I’m after.

1 Like

I think this might be something to determine by testing your prompts with and without specific characters etc. Perhaps you can identify features that make the prompt worse and remove those (like double spaces).

For authenticity I suggest you keep the text as close to the original as possible. You certainly don’t need to lowercase everything. Are you familiar with Notepad++? It makes text editing easier. I suggest you check your text for non-ASCII characters, because those may cause errors with GPT-3. You can search for them by putting this into the search bar in Notepad++:
[^\x00-\x7F]+
Make sure “regular expression” is checked as an option. Once you find them, you’ll be able to pick a suitable replacement or just delete them. Removing double spaces is smart for saving tokens: search and replace two spaces with one, and repeat until no double spaces are left (this also gets rid of runs of three or more spaces). I hope that’s helpful as a start.
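
If you’d rather do the same cleanup in code instead of Notepad++, here’s a rough Python equivalent (the regex is the same one as above; deleting the matches rather than replacing them is just one choice, and the sample string is made up):

    import re

    def clean_for_finetuning(text: str) -> str:
        # Drop non-ASCII characters (same pattern as the Notepad++ search above)
        text = re.sub(r"[^\x00-\x7F]+", "", text)
        # Collapse two or more spaces into one, like repeating search-and-replace
        text = re.sub(r" {2,}", " ", text)
        return text.strip()

    print(clean_for_finetuning("Best  café   vlog ☕  ever"))  # -> "Best caf vlog ever"

Note that deleting non-ASCII matches also strips accented letters (“café” becomes “caf”), which is the trade-off of removing rather than picking a replacement.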

1 Like

I have completed some AI projects that involved scraping reviews, pre-processing the text content, and then processing it with GPT-3 to categorize customer sentiment related to business features, service, and staff.

We used the Python NLTK library (via an API) to clean and condense all text prior to AI digestion.
The added bonus is that you can fit more content into the 4,000-token limit.

Davinci 2 understands the cleaned/pre-processed text very well, even after all stop words have been removed!

https://www.nltk.org/
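
For anyone curious, here’s a minimal sketch of that kind of NLTK cleanup (stop-word removal plus whitespace squeezing). This is my own approximation of the idea, not the exact pipeline we ran behind the API:

    import re

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # one-time download of the stop-word list
    STOP_WORDS = set(stopwords.words("english"))

    def condense(text: str) -> str:
        """Remove English stop words and squeeze whitespace to save tokens."""
        words = re.findall(r"\S+", text)
        kept = [w for w in words if w.lower() not in STOP_WORDS]
        return " ".join(kept)

    print(condense("The staff were very friendly and the food was great"))
    # -> "staff friendly food great"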

1 Like

Hi, I am by no means an expert on this matter, so if somebody knows more please feel free to correct me. That said, whether you should handle all three of your examples ultimately depends on the tokenizer the model uses: some tokenizers ignore casing while others don’t, some ignore spacing, and so on.

My advice would be to use what other model creators have used for preprocessing text for fine-tuning, such as EleutherAI’s GPT-J. When making fine-tuning data from text, I noticed three major steps:

  1. Text encoding. The GPT-J data-preprocessing script requires text to be encoded in CP932, which is much more restrictive than the more standard UTF-8 encoding. My text also contained invalid characters such as emojis, so I loaded the text from a .csv file with UTF-8 encoding and then used the ucp9 Python package to transcode the Unicode characters to CP932. That package is nice because it lets you either remove invalid characters or replace them with question marks; in my case, I removed them. (There’s a rough sketch of this step and the next one after this list.)

  2. The GPT-J preprocessing script then included two preprocessing options, and I used both. The first is to normalize the text data with ftfy, which applies this line of code to the input data:

    if normalize_with_ftfy:  # fix text with ftfy if specified
        doc = ftfy.fix_text(doc, normalization='NFKC')
    

     You can look at what changes the fix_text() function makes, and it has several parameters that should allow you to preprocess the text to your liking.

  3. Their final preprocessing option, called wikitext_detokenizer, makes some smaller changes to spacing that could probably be applied to any input data using a function they wrote.
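
Here is the rough sketch of steps 1 and 2 that I mentioned. I’m using Python’s built-in cp932 codec with errors="ignore" instead of a separate transcoding package, and the .csv layout is assumed, so treat this as an illustration of the idea rather than the exact GPT-J pipeline:

    import csv

    import ftfy  # pip install ftfy

    def load_and_normalize(csv_path: str) -> list:
        docs = []
        with open(csv_path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f):
                doc = row[0]  # assumes the text sits in the first column
                # Step 1: drop anything that cannot be represented in CP932
                # (round-tripping through the codec removes emojis and other invalid characters)
                doc = doc.encode("cp932", errors="ignore").decode("cp932")
                # Step 2: normalize with ftfy, same call as in the GPT-J script
                doc = ftfy.fix_text(doc, normalization="NFKC")
                docs.append(doc)
        return docs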

Ultimately, you will just have to test different normalization methods on your most problematic input data and see which ones let the tokenizer produce output that is most “accurate” for your purposes. A simple way to do this is to use the GPT tokenizers from Huggingface’s Transformers and visually check whether your preprocessing lets the tokenizer split the input string into interpretable tokens.
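
Something like this works as a quick visual check. I’m using the GPT-2 tokenizer here, which (as far as I know) shares its BPE vocabulary with the base GPT-3 models, and the example strings are made up:

    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    raw = "My  First Vlog 🎉 IN PARIS!!!"
    cleaned = "My First Vlog IN PARIS!!!"

    for label, text in [("raw", raw), ("cleaned", cleaned)]:
        tokens = tokenizer.tokenize(text)
        print(f"{label}: {len(tokens)} tokens -> {tokens}")

Printing the token lists side by side makes it obvious what your preprocessing is actually changing and how many tokens it saves.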

I’d also brush up on what the tokenizer is actually doing for your specific model; Huggingface has a great blog post that should give you insight into what preprocessing methods are generally required for text.

Hopefully this helps!

1 Like