This example uses the US Arrests dataset: 50 samples, one per US state, containing 1973 statistics for arrests per 100,000 residents for assault, murder, and rape, along with the percentage of the population living in urban areas.
The example shows how to use the KMeans clustering function.
- Import the required modules.
from teradataml import KMeans, KMeansPredict
from teradataml.dataframe.dataframe import DataFrame
from teradataml.data.load_example_data import load_example_data
import matplotlib.pyplot as plt
- If the input table "kmeans_us_arrests_data" does not already exist, create the table and load the dataset into it.
load_example_data("kmeans", "kmeans_us_arrests_data")
- Create a teradataml DataFrame from "kmeans_us_arrests_data".
# Create a teradataml DataFrame.
df_train = DataFrame('kmeans_us_arrests_data')

# Print the training data to see what the sample data looks like.
print("\nHead(10) of Train data:")
print(df_train.head(10))
- In the training dataset, the features are 'sno', 'state', 'murder', 'assault', 'urban_pop' and 'rape'. Because 'sno' and 'state' map one-to-one, 'state' is redundant; drop it before training.
# A dictionary holding the 'sno' to 'state' mapping. Required for plotting.
df1 = df_train.select(['sno', 'state'])
sno_to_state = dict(df1.to_pandas()['state'])
print(sno_to_state)

# The 'state' column is not needed; the 'sno' column identifies each state.
df_train = df_train.drop(['state'], axis=1)
colnames = ["sno", "murder", "assault", "urban_pop", "rape"]
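The one-to-one relationship between 'sno' and 'state' can be checked before the column is dropped. The following is a minimal sketch in plain Python, assuming the pairs have already been fetched to the client; the `rows` values and the `is_one_to_one` helper are hypothetical and serve only to illustrate the check.

```python
# Hypothetical (sno, state) pairs standing in for the two selected columns.
rows = [(1, "Alabama"), (2, "Alaska"), (3, "Arizona")]


def is_one_to_one(pairs):
    """Return True if every key maps to exactly one value and vice versa."""
    forward, backward = {}, {}
    for key, value in pairs:
        # setdefault returns the stored entry, so a mismatch means a conflict.
        if forward.setdefault(key, value) != value:
            return False
        if backward.setdefault(value, key) != key:
            return False
    return True


# Build the lookup used later to label points in the plots.
sno_to_state = dict(rows)
print(is_one_to_one(rows))
print(sno_to_state[2])
```

If the check fails, dropping 'state' would lose information, and the lookup dictionary would not label every point unambiguously.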
- Apply the KMeans algorithm to generate two clusters and inspect the outputs.
- Apply the KMeans algorithm.
# Train the KMeans model with 2 clusters.
kMeans_model = KMeans(id_column="sno",
                      target_columns=['murder', 'assault', 'urban_pop', 'rape'],
                      data=df_train,
                      num_init=10,
                      num_clusters=2)

# Print the KMeans model training results.
print(kMeans_model.result)
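For readers who want intuition about what a k-means run computes, the following is a minimal sketch of the textbook Lloyd's algorithm in plain Python. It is not teradataml's implementation, which runs in the database; the `kmeans` function, its parameters, and the sample `points` are all hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
    return centroids, labels


# Hypothetical two-feature points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the result depends on the starting centroids, implementations commonly repeat the run from several initializations and keep the best one; the num_init=10 argument above presumably plays that role.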
- The 'model_data' DataFrame presents an overall summary of the trained KMeans model.
print(kMeans_model.model_data)
- Use the KMeansPredict function to perform predictions with the trained KMeans model.
# KMeansPredict assigns each row of the dataset to a cluster using the trained KMeans model.
kMeans_output = KMeansPredict(object=kMeans_model.result, data=df_train)
- The 'result' DataFrame lists, for every sample, the 'sno' identifying the state and the corresponding cluster ID.
kMeans_output.result.head(30)
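Conceptually, predicting with a trained k-means model means assigning each row to its nearest centroid. The sketch below illustrates this in plain Python, assuming Euclidean distance; the `assign_cluster` helper, the centroids, and the observations are hypothetical and are not values produced by the model above.

```python
import math


def assign_cluster(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


# Hypothetical centroids and observations keyed by a row id.
centroids = [(1.0, 1.5), (8.5, 8.5)]
observations = {101: (0.5, 2.0), 102: (9.0, 7.5)}

# Map each row id to a cluster id, mirroring an (id, cluster id) output table.
cluster_ids = {
    row_id: assign_cluster(point, centroids)
    for row_id, point in observations.items()
}
print(cluster_ids)  # {101: 0, 102: 1}
```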
- Perform a quick analysis of the clustering output by plotting the clusters against pairs of features.
# Inner join of the prediction output with the actual dataset df_train.
# The data from df1 is used for plotting.
df1 = df_train.join(kMeans_output.result, how='inner', on=['sno'], lsuffix='t1', rsuffix='t2')

print("\nInner join of the prediction output to actual dataset df_train:")
print(df1)
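The 't1_sno' column used in the plotting steps suggests that a column present in both inputs is renamed with the corresponding suffix. The following plain-Python sketch mimics that behavior for an inner join on a single key. The `inner_join` helper and the sample rows are hypothetical, the prefix-style naming is an assumption inferred from 't1_sno', and the sketch assumes keys are unique in the right-hand input.

```python
def inner_join(left, right, key, lsuffix, rsuffix):
    """Inner-join two lists of dicts on `key`, renaming clashing columns."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for lrow in left:
        rrow = right_by_key.get(lrow[key])
        if rrow is None:
            continue  # inner join: drop rows without a match
        out = {}
        # Columns that exist on both sides get the side's suffix as a prefix.
        for col, val in lrow.items():
            out[f"{lsuffix}_{col}" if col in rrow else col] = val
        for col, val in rrow.items():
            out[f"{rsuffix}_{col}" if col in lrow else col] = val
        joined.append(out)
    return joined


# Hypothetical feature rows and cluster assignments.
features = [{"sno": 1, "murder": 5.0}, {"sno": 2, "murder": 7.5}, {"sno": 3, "murder": 2.0}]
clusters = [{"sno": 1, "td_clusterid_kmeans": 0}, {"sno": 2, "td_clusterid_kmeans": 1}]

result = inner_join(features, clusters, "sno", "t1", "t2")
for row in result:
    print(row)
```

The row with sno 3 has no cluster assignment in this sample, so the inner join drops it.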
- Plot clusters based on the two features 'urban_pop' and 'murder'.
# Select only the features needed for the plot.
df3 = df1.select(['t1_sno', 'urban_pop', 'murder', 'td_clusterid_kmeans'])

# A teradataml DataFrame cannot be plotted directly, so convert it to a
# pandas DataFrame and then to the NumPy array 'numpy_df' for use with matplotlib.
pandas_df = df3.to_pandas()
numpy_df = pandas_df.values

# Set the figure display size.
plt.rcParams['figure.figsize'] = [15, 10]

# Color the points based on 'td_clusterid_kmeans'.
plt.scatter(numpy_df[:,1], numpy_df[:,2], c=numpy_df[:,3], cmap='winter_r', alpha=0.4)
for ind, value in enumerate(numpy_df[:, 0]):
    # sno_to_state is used here to get state names.
    plt.text(numpy_df[ind,1], numpy_df[ind,2], sno_to_state[int(value)], fontsize=14)

plt.xlabel('urban_pop')
plt.ylabel('murder')
plt.show()
After running these commands, the following plot is displayed:
- Plot the clusters based on the two features 'rape' and 'murder'.
# Select only the features needed for the plot.
df3 = df1.select(['t1_sno', 'rape', 'murder', 'td_clusterid_kmeans'])

# A teradataml DataFrame cannot be plotted directly, so convert it to a
# pandas DataFrame and then to the NumPy array 'numpy_df' for use with matplotlib.
pandas_df = df3.to_pandas()
numpy_df = pandas_df.values

# Color the points based on 'td_clusterid_kmeans'.
plt.scatter(numpy_df[:,1], numpy_df[:,2], c=numpy_df[:,3], cmap='winter_r', marker='^', alpha=0.4)
for ind, value in enumerate(numpy_df[:, 0]):
    # sno_to_state is used here to get state names.
    plt.text(numpy_df[ind,1], numpy_df[ind,2], sno_to_state[int(value)], fontsize=14)

plt.xlabel('rape')
plt.ylabel('murder')
plt.show()
After running these commands, the following plot is displayed: