Regarding Text Preprocessing for Fine-Tuning

Hey,

Does anyone have any information on how to correctly preprocess text before submitting it to GPT-3? We’re training a YouTube description generator, with the title as the prompt and an appropriate description as the completion. The problem is that YouTube allows all sorts of characters. Should we modify the text before training? Just some examples:

  1. Replacing double spaces
  2. Removing emojis
  3. Making the text lowercase

I was wondering if there’s a standardized algorithm to run against user text to normalize it into something GPT-3 understands better, or whether better results come from not doing these things at all.
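For concreteness, here’s a rough sketch of the kind of normalization I mean (the function name and the emoji ranges are just placeholders, not something we’ve settled on):

import re

def normalize(text):
    # remove emoji first so deleting them can't leave new double spaces behind
    # (illustrative ranges only; emoji span several Unicode blocks)
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    # collapse runs of two or more spaces into one and trim the ends
    text = re.sub(r" {2,}", " ", text).strip()
    # lowercase everything
    return text.lower()

print(normalize("Top 10  Python Tips 🔥🔥"))  # -> "top 10 python tips"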

Any advice is appreciated!

It’s entirely up to you. I would just read up on YouTube SEO and conform to whatever folks say works best for YouTube video descriptions.

Maybe I should update my video descriptions… LOL

Hi, this post has nothing to do with SEO or YouTube descriptions; it’s about how to appropriately preprocess user input text before training in general, not limited to a specific text category. Although “it’s entirely up to you” sounds nice, I feel that leaving it entirely up to me is a big mistake and won’t produce the best results, which is what I’m after.

I think this might be something to determine by testing your prompts with and without specific characters, etc. Perhaps you can identify features that make the prompt worse and remove those (like double spaces).

For authenticity I suggest you keep the text as close to the original as possible. You certainly don’t need to lowercase everything. Are you familiar with Notepad++? It makes text editing easier. I suggest you check your text for non-ASCII characters, because those may cause errors with GPT-3. You can search for them by putting this into the search bar in Notepad++:
[^\x00-\x7F]+
Make sure “regular expression” is checked as an option. Once you find them, you can pick a suitable replacement or just delete them. Removing double spaces is smart for saving tokens: just search and replace two spaces with one, and repeat until there are no more double spaces left (this also gets rid of runs of three or more spaces). I hope that’s helpful as a start.
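If you’d rather script it than do the replacements by hand, a minimal Python equivalent of those two Notepad++ steps might look like this (same patterns, same effect):

import re

def clean(text):
    # same pattern as the Notepad++ search: any run of non-ASCII characters
    text = re.sub(r"[^\x00-\x7F]+", "", text)  # or substitute a replacement of your choice
    # two or more spaces become one; no need to repeat, the quantifier handles longer runs
    return re.sub(r" {2,}", " ", text)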

I have completed some AI projects that involved scraping reviews, pre-processing the text content, and then processing it with GPT-3 to categorize customer sentiment related to business features, service, and staff.

We used the Python NLTK library via an API to clean and condense all text prior to AI digestion.
The added bonus is that you can fit more content into the 4,000-token limit.

Davinci 2 understands the cleaned/pre-processed text very well, even after all stop words have been removed!

https://www.nltk.org/
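For anyone curious, here’s a minimal sketch of that kind of NLTK clean-up step (stop-word removal only; our real pipeline did more, and the tokenizer and stop-word list here are just the standard NLTK defaults):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stop-word lists

def condense(text):
    # drop English stop words and rejoin: fewer tokens, same gist
    stops = set(stopwords.words("english"))
    return " ".join(w for w in word_tokenize(text) if w.lower() not in stops)

print(condense("The staff were friendly and the food was great"))
# -> "staff friendly food great"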
