Codex: kmeans clustering of documents

This, my fourth tweak on the prompts, works out of the box, without modification. Pretty impressive!

Question for OpenAI: what can we do that’s most helpful at this point?


"""

Python version 3.8 # this was because I know that there were dependency problems with scikit and numpy for >3.9

Write a program that does k-means clustering for documents.

Each document is a line in an input file named ../data/input.txt.

Create 10 clusters and provide labels.

"""

import numpy as np

import pandas as pd

from sklearn.cluster import KMeans

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import silhouette_score

from sklearn.preprocessing import normalize

# Read the input file

df = pd.read_csv('../data/input.txt', sep='\t', header=None)

df.columns = ['documents']

# Create TF-IDF matrix

vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(df['documents'])

# Normalize the matrix

tfidf_matrix = normalize(tfidf_matrix)

# Find the optimal number of clusters

silhouette_scores = []

for k in range(2, 11):

    kmeans = KMeans(n_clusters=k, random_state=1).fit(tfidf_matrix)

    labels = kmeans.labels_

    silhouette_scores.append(silhouette_score(tfidf_matrix, labels, metric='euclidean'))

# Plot the silhouette scores

import matplotlib.pyplot as plt

plt.plot(range(2, 11), silhouette_scores)

plt.title('Silhouette scores vs number of clusters')

plt.show()

# Perform K-means clustering

kmeans = KMeans(n_clusters=10, random_state=1).fit(tfidf_matrix)

labels = kmeans.labels_

# Create a dataframe with the labels and document names

df_labels = pd.DataFrame(labels, columns=['cluster'])

df_labels['documents'] = df['documents']

df_labels.sort_values(by=['cluster'], inplace=True)

# Write the output to a file

df_labels.to_csv('../data/output.txt', sep='\t', index=False)
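
One caveat for anyone reusing the generated code: `pd.read_csv` with `sep='\t'` parses each line as tab-separated fields and honors quote characters, so a document that contains a tab or begins with a double quote may not load as a single row. A hedged alternative, assuming the file really is one plain-text document per line, is to read the lines directly. The helper name `read_documents` and the sample text are made up for this sketch.

```python
import tempfile
from pathlib import Path


def read_documents(path):
    """Return one document per non-empty line, with no CSV parsing."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Hypothetical sample input: a tab, a blank line, and a stray quote.
sample = 'first doc\nsecond\tdoc with a tab\n\nthird "quoted doc\n'

with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp) / "input.txt"
    input_path.write_text(sample, encoding="utf-8")
    documents = read_documents(input_path)

print(documents)
```

The resulting list can be wrapped as `pd.DataFrame({'documents': documents})`, so the rest of the program stays unchanged.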

Codex is really powerful. Would you share the playground link for this?


I tried three different prompts … I only saved two.

"""
Python version 3.8
Write a program that does k-means clustering with labels for text.
Treat each row as a single document.
Use Pandas.
"""

v3)

"""
Python version 3.8
Write a program that does k-means clustering for documents.
Each document is a line in the file ../data/input.txt
Create labels.
Cluster the documents.
Use Pandas.
"""

The first version did words rather than documents. I had to learn to emphasize that term.
The next version had the same problem.
The third version created a very short output program, so in version four I used more compound sentences.
I already knew I wanted to use Pandas if possible since I already have a lot of dataframes in my pipeline, but I didn’t tell Codex about sklearn. No import issues since I already had it.


Another thing that was pretty impressive to me is that it automatically added plotting and silhouette analysis, even though I didn’t ask for the first and didn’t even know about the second.
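
For readers who, like the poster, had not met silhouette analysis before: for each point, the silhouette coefficient compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b), and the score is the average over all points. Below is a minimal pure-Python sketch of that idea. The function names and toy points are illustrative assumptions, and it is not meant to replace `sklearn.metrics.silhouette_score`.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (illustrative sketch)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up): two tight groups far apart on the x axis.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

A score near 1 suggests compact, well-separated clusters, while a negative score suggests points sit closer to another cluster than to their own. The generated program computes this score for each k from 2 to 10 and plots it.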

Thanks. If I remove the generated part and use just the comments as the prompt, it gives this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.metrics import pairwise_distances
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.metrics.pairwise import pairwise_kernels
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.metrics.pairwise import laplacian_kernel
from sklearn.metrics.pairwise import sigmoid_kernel
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.metrics.pairwise import additive_chi2_kernel
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.metrics.pairwise import pairwise_distances_argmin_min
from sklearn.metrics.pairwise import pairwise_kernels
from sklearn.metrics.pairwise import pairwise_distances_chunked
from sklearn.metrics.pairwise import PAIRWISE_DISTANCE_FUNCTIONS
from sklearn.metrics.pairwise import PAIRWISE_KERNEL_FUNCTIONS
from sklearn.metrics.pairwise import PAIRWISE_BOOLEAN_FUNCTIONS
from sklearn.metrics.pairwise import PAIRED_DISTANCES
from sklearn.metrics.pairwise import check_pairwise_arrays
from sklearn.metrics.pairwise import check_paired_arrays
from sklearn.metrics.pairwise import paired_distances
from sklearn.metrics.pairwise import paired_euclidean_distances
from sklearn.metrics.pairwise import paired_manhattan_distances
from sklearn.preprocessing import normalize
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import minmax_norm
from sklearn.preprocessing import maxabs_scale
from sklearn.preprocessing import maxabs_norm
from sklearn.preprocessing import robust_scale
from sklearn.preprocessing import add_dummy_feature
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import KernelCenterer
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import normalize
from sklearn.preprocessing import OneHotEncoder

So my question is how to run Codex. I am a noob with Codex, so my questions are a little bit silly.
Another question: does anyone know where the UI/tool/site is that the OpenAI team used in their demonstration?


I don’t know. There’s no evidence that it respected 3.8. You could test this by specifying a version number that does have a dependency issue, like 3.10. See “Python 3.10 Readiness”, a Python 3.10 support table for the most popular Python packages.


What do you mean by the “generation part”? I had temp, fp, and pp all set to zero because I saw no value in “creativity”.
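
For anyone trying to reproduce this: assuming "temp", "fp", and "pp" abbreviate temperature, frequency penalty, and presence penalty, they correspond to the `temperature`, `frequency_penalty`, and `presence_penalty` request parameters. Here is a sketch of such a request body, built as a plain dict and not sent anywhere. The helper name `build_request` and the `max_tokens` value are placeholders, not settings taken from this thread.

```python
import json

# The version-four prompt from the first post, ending right after the
# closing quote marks.
PROMPT = '''"""
Python version 3.8
Write a program that does k-means clustering for documents.
Each document is a line in an input file named ../data/input.txt.
Create 10 clusters and provide labels.
"""'''


def build_request(prompt, max_tokens=512):
    """Return a completion request body with all three settings at zero.

    max_tokens=512 is an arbitrary placeholder, not a value from the thread.
    """
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,        # no sampling randomness requested
        "frequency_penalty": 0,  # no penalty for frequently repeated tokens
        "presence_penalty": 0,   # no penalty for tokens already present
    }


request_body = build_request(PROMPT)
print(json.dumps(request_body, indent=2))
```

Keeping the request in one place like this makes it easier to vary a single setting, such as `max_tokens`, while holding the prompt fixed.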


I get my result from the same prompt when I place the cursor immediately after the third quote mark in the second line of quote marks.


That’s a puzzler. The number of max tokens in the request may have something to do with it.


That’s weird, but I have also noticed that sometimes ‘max tokens’ influences the result. OR it could be the good old stochastic GPUs :sweat_smile:
