Code for title selection using OpenAI

Load your dataset

import pandas as pd
import re
import gensim
from gensim import corpora

df = pd.read_excel("/content/sample_data/report.xlsx")
df = df.dropna(subset=["desc"])
df = df.drop_duplicates(subset=["desc"])
df = df.reset_index(drop=True)

Clean text data

def cleantwt(twt):
    if isinstance(twt, str):
        twt = re.sub("RT", " ", twt)             # remove retweet marker
        twt = re.sub("#[A-Za-z0-9]+", "", twt)   # remove hashtags
        twt = re.sub(r"[۰-۹]+", "", twt)         # remove Extended Arabic-Indic digits
        twt = re.sub("[۰۱۲۳۴۵۶۷۸۹]+", "", twt)   # same digits, listed explicitly
        twt = re.sub(r"[A-Za-z]", "", twt)       # remove Latin letters
        twt = re.sub("‘’", "", twt)              # remove curly quote pairs
        twt = re.sub("\n", "", twt)              # remove newlines
        twt = re.sub(r"https?://\S+", "", twt)   # remove URLs
        twt = re.sub(r"@\S+", "", twt)           # remove usernames
        twt = re.sub(r"\s+", " ", twt)           # collapse whitespace
        # strip punctuation and symbols (Arabic and Latin)
        twt = re.sub(r"[+&;]|[><!،:,().٬]|[-]|[…]|[\[\]»«/]|[\\|]|[#]|[_]|[٪]|[%]|[؟]|[?]|[“”]|@", " ", twt)
        return twt
    else:
        return twt

df["clean_text"] = df["desc"].apply(cleantwt)

Load stopwords

with open("/content/sample_data/stopwords.txt", "r", encoding="utf-8") as file:
    stopwords_ur = file.read().splitlines()

Remove stopwords

def remove_stopwords(text):
    return [word for word in text if word not in stopwords_ur]

Tokenize and remove stopwords

def preprocess(text):
    tokens = gensim.utils.simple_preprocess(text)
    tokens_no_stopwords = remove_stopwords(tokens)
    return tokens_no_stopwords

Apply preprocessing

df["clean_text_no_stopwords"] = df["clean_text"].apply(preprocess)

Create N-grams

def make_n_grams(texts):
    bigram = gensim.models.Phrases(texts, min_count=5, threshold=100)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram = gensim.models.Phrases(bigram[texts], threshold=100)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
    bigrams_text = [bigram_mod[doc] for doc in texts]
    trigrams_text = [trigram_mod[bigram_mod[doc]] for doc in bigrams_text]
    return trigrams_text

Apply N-grams

data_ngrams = make_n_grams(df["clean_text_no_stopwords"])
data_strings = [" ".join(doc) for doc in data_ngrams]

id2word = corpora.Dictionary(data_ngrams)
texts = data_ngrams
corpus = [id2word.doc2bow(text) for text in texts]
Code continues after this.



I asked an AI Python specialist to explain what the heck is going on here, since there is nothing "OpenAI" in the code and no purpose or audience is described:


The code snippet you provided appears to be part of a data processing pipeline using Pandas and several natural language processing utilities in Python. Here's a breakdown of what each segment does, along with a corrected version:

Corrected and Explained Code

Initial Data Loading and Cleaning

import pandas as pd
import re

# Load your dataset
df = pd.read_excel("/content/sample_data/report.xlsx")
df = df.dropna(subset=["desc"])  # Drop rows where 'desc' column is NaN
df = df.drop_duplicates(subset=["desc"])  # Drop duplicate rows based on 'desc' column
df = df.reset_index(drop=True)  # Reset index for the dataframe

Text Data Cleaning Function

This function is designed to clean tweet-like text data. It removes the "RT" retweet marker, hashtags, URLs, and usernames, then strips punctuation and collapses extra whitespace.

def cleantwt(twt: str) -> str:
    if isinstance(twt, str):
        twt = re.sub(r"RT", "", twt)  # Remove "RT" retweet text
        twt = re.sub(r"#[A-Za-z0-9]+", "", twt)  # Remove hashtags
        twt = re.sub(r'https?://\S+', '', twt)  # Remove URLs
        twt = re.sub(r'@[A-Za-z0-9]+', '', twt)  # Remove usernames
        twt = re.sub(r'[^\w\s]', '', twt)  # Remove anything that isn't a word character or whitespace
        twt = re.sub(r'\s+', ' ', twt)  # Replace multiple spaces with a single space
        return twt.strip()  # Strip whitespace from the beginning and end of the string
    else:
        return twt

df["clean_text"] = df["desc"].apply(cleantwt)

Stopwords Removal Setup

This segment loads a list of stopwords, typically used to remove common but unimportant words from text data during processing.

# Load stopwords
with open('/content/sample_data/stopwords.txt', 'r', encoding='utf-8') as file:
    stopwords_ur = file.read().splitlines()

# Function to remove stopwords
def remove_stopwords(text: str) -> list:
    return [word for word in text.split() if word not in stopwords_ur]

# Apply Stopwords Removal
df["clean_text_no_stopwords"] = df["clean_text"].apply(remove_stopwords)

N-Gram Creation

This function constructs bi-grams and tri-grams from the text data. These are sequences of two or three words used together, which can be useful for text analysis.

import gensim

def make_n_grams(texts: list) -> list:
    bigram = gensim.models.Phrases(texts, min_count=5, threshold=100)  # Create bi-grams
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram = gensim.models.Phrases(bigram[texts], threshold=100)  # Create tri-grams
    trigram_mod = gensim.models.phrases.Phraser(trigram)
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

# Apply N-Grams function
data_ngrams = make_n_grams(df["clean_text_no_stopwords"].tolist())
df["ngrams"] = data_ngrams

# Convert n-grams into a format suitable for further analysis
data_strings = [" ".join(doc) for doc in data_ngrams]
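
Phrases learns which token pairs co-occur often enough to count as a collocation (controlled by min_count and threshold), and Phraser joins them with an underscore at transform time. A hypothetical illustration:

# If "new york" cleared min_count/threshold during training:
print(bigram_mod[["new", "york", "city"]])  # e.g. ['new_york', 'city']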

Overall, this is a typical text-processing pipeline: it strips clutter from raw text and prepares it for downstream natural language processing tasks such as topic modeling or sentiment analysis.
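
Since the original post ends by building an id2word dictionary and a doc2bow corpus, the standard gensim prelude to topic modeling, the elided "code after this" is presumably an LDA model. A minimal sketch under that assumption (num_topics and passes are placeholder values, not taken from the original):

from gensim import corpora
from gensim.models import LdaModel

id2word = corpora.Dictionary(data_ngrams)               # token -> integer id
corpus = [id2word.doc2bow(doc) for doc in data_ngrams]  # bag-of-words per document

# Placeholder hyperparameters -- the original post does not show them
lda_model = LdaModel(corpus=corpus, id2word=id2word,
                     num_topics=10, random_state=42, passes=10)

for topic_id, words in lda_model.print_topics():
    print(topic_id, words)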


Note that the AI of a ChatGPT GPT, powered by their version of gpt-4-turbo, didn't include some of the Arabic-script stripping lines when rewriting.
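
For completeness, here is a merged sketch that keeps the corrected structure but restores the dropped stripping of Extended Arabic-Indic digits and Latin letters; the punctuation class from the garbled original is folded into the [^\w\s] catch-all as a best-effort guess:

import re

def cleantwt(twt):
    if not isinstance(twt, str):
        return twt
    twt = re.sub(r"RT", "", twt)             # remove retweet marker
    twt = re.sub(r"#\w+", "", twt)           # remove hashtags
    twt = re.sub(r"https?://\S+", "", twt)   # remove URLs
    twt = re.sub(r"@\S+", "", twt)           # remove usernames
    twt = re.sub(r"[۰-۹]+", "", twt)         # restored: Extended Arabic-Indic digits
    twt = re.sub(r"[A-Za-z]+", "", twt)      # restored: Latin letters
    twt = re.sub(r"[^\w\s]", "", twt)        # strip remaining punctuation
    return re.sub(r"\s+", " ", twt).strip()  # collapse whitespace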