Clustering Using KMeans with Teradata Python Package

Teradata® Python Package User Guide

Product: Teradata Python Package
Release Number: 16.20
Published: February 2020
Language: English (United States)
Last Update: 2020-02-29
Product Category: Teradata Vantage
This example uses the US Arrests data set, which has 50 samples, one per US state. Each sample gives the number of arrests per 100,000 residents for assault, murder, and rape in 1973, along with the percentage of the state's population living in urban areas.

This example shows how to use the KMeans clustering function.

  1. Import the required modules.
    from teradataml.analytics.mle.KMeans import KMeans
    from teradataml.dataframe.dataframe import DataFrame
    from teradataml.data.load_example_data import load_example_data
    import matplotlib.pyplot as plt
  2. If the input table "kmeans_us_arrests_data" does not already exist, create it and load the example data into it.
    load_example_data("kmeans", "kmeans_us_arrests_data")
  3. Create a teradataml DataFrame from "kmeans_us_arrests_data".
    # Create a teradataml DataFrame from the table.
    df_train = DataFrame('kmeans_us_arrests_data')
    
    # Print the first 10 rows to see what the training data looks like.
    print("\nHead(10) of Train data:")
    print(df_train.head(10))
  4. In the training data set, the columns are 'sno', 'state', 'murder', 'assault', 'urban_pop', and 'rape'. Because 'sno' and 'state' map one-to-one, 'state' is redundant for training, so drop it.
    # A dictionary to get 'sno' to 'state' mapping. Required for plotting.
    df1 = df_train.select(['sno', 'state'])
    sno_to_state = dict(df1.to_pandas()['state'])
    print(sno_to_state)
    
    # The 'state' column is not needed for training; 'sno' already identifies each state.
    df_train = df_train.drop(['state'], axis=1)
    colnames = ["sno", "murder", "assault", "urban_pop", "rape"]
  5. Apply the KMeans algorithm to generate two clusters and inspect the outputs.
    1. Apply the KMeans algorithm.
      # Run the KMeans algorithm on the data, with 2 centroids.
      kMeans_output = KMeans(data=df_train, centers=2, data_sequence_column=['sno'])
      
      # Print the KMeans results
      print(kMeans_output)
    2. The 'clusters_centroids' output DataFrame presents the centroid values of the two clusters, the within-cluster sum of squares ('withinss'), and the number of samples in each cluster.
      # Features (4-dimensional) used to train KMeans algorithm.
      features = 'murder assault urban_pop rape'.split()
      
      # Feature Centroids for cluster 1
      centroid_values = kMeans_output.clusters_centroids.to_pandas()['murder assault urban_pop rape'][0].split()
      print("4-dimensional Centroid of Cluster 1:")
      print(dict(zip(features, centroid_values)))
      
      # Feature Centroids for cluster 2
      centroid_values = kMeans_output.clusters_centroids.to_pandas()['murder assault urban_pop rape'][1].split()
      print("4-dimensional Centroid of Cluster 2:")
      print(dict(zip(features, centroid_values)))
      Check the 'withinss' values, which give the within-cluster sum of squares for each cluster (from the teradataml DataFrame output 'clusters_centroids').
      # Withinness for cluster 1
      print("Withinness for cluster 1: " + str(kMeans_output.clusters_centroids.to_pandas()['withinss'][0]))
      
      # Withinness for cluster 2
      print("Withinness for cluster 2: " + str(kMeans_output.clusters_centroids.to_pandas()['withinss'][1]))
    3. The 'clustered_output' DataFrame lists, for every sample, the 'sno' value that identifies the state and the ID of the cluster the sample was assigned to.
      kMeans_output.clustered_output.head(30)
    4. The 'output' DataFrame presents an overall summary of the KMeans output.
      print(kMeans_output.output)
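To make the three outputs in step 5 more concrete, the following is a minimal pure-Python sketch of Lloyd's algorithm, the standard k-means iteration. It is not the Teradata implementation, which runs in the database and may differ in how it seeds centroids and decides when to stop. The function name `kmeans_sketch`, the toy points, and the choice of the first k points as initial centroids are all assumptions made for illustration. The sketch returns the centroids, the cluster index of each point, and the within-cluster sum of squares, which correspond loosely to 'clusters_centroids', 'clustered_output', and 'withinss'.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sketch(points, k, max_iter=100):
    """Hypothetical helper: plain Lloyd's algorithm on a list of tuples.

    Assumption: the first k points seed the centroids. A real
    implementation would normally choose seeds more carefully.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    # Within-cluster sum of squares, one value per cluster.
    withinss = [0.0] * k
    for p, a in zip(points, assignments):
        withinss[a] += squared_distance(p, centroids[a])
    return centroids, assignments, withinss


# Toy 2-dimensional points (made up, not taken from the arrests data).
toy_points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
              (2.0, 1.0), (9.0, 10.0), (10.0, 9.0)]
centroids, assignments, withinss = kmeans_sketch(toy_points, k=2)
print("Centroids:", centroids)
print("Assignments:", assignments)
print("Withinss:", withinss)
```

A smaller 'withinss' value means the members of that cluster sit closer to their centroid, which is why the values are worth checking per cluster.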
  6. Plot the clusters against pairs of features to analyze the clustering output.
    # Inner join of clustered_output with the original data set df_train. The joined data in df1 is used for plotting.
    df1 = df_train.join(kMeans_output.clustered_output, how='inner', on=['sno'], lsuffix='t1',
                        rsuffix='t2')
    
    print("\nInner join of clustered_output to actual dataset df_train:")
    print(df1)
    1. Plot clusters based on the two features 'urban_pop' and 'murder'.
      # Selecting only the necessary features for plot.
      df3 = df1.select(['t1_sno', 'urban_pop', 'murder', 'clusterid'])
      # A teradataml DataFrame cannot be plotted directly, so convert it to a pandas
      # DataFrame and then to the NumPy array 'numpy_df' for use with matplotlib.
      pandas_df = df3.to_pandas()
      numpy_df = pandas_df.values
      
      # Setting figure display size.
      plt.rcParams['figure.figsize'] = [15, 10]
      
      # Color each point by its 'clusterid' value.
      plt.scatter(numpy_df[:,1], numpy_df[:,2], c=numpy_df[:,3])
      for ind, value in enumerate(numpy_df[:, 0]):
          # sno_to_state is used here to get state names.
          plt.text(numpy_df[ind,1], numpy_df[ind,2], sno_to_state[int(value)], fontsize=14)
      plt.xlabel('urban_pop')
      plt.ylabel('murder')
      plt.show()
      Running the above commands displays the following plot:

      [Figure: clusters plotted by 'urban_pop' (x-axis) and 'murder' (y-axis), example 6a]


    2. Plot the clusters based on the two features 'rape' and 'murder'.
      # Selecting only the necessary features for plot.
      df3 = df1.select(['t1_sno', 'rape', 'murder', 'clusterid'])
      
      # A teradataml DataFrame cannot be plotted directly, so convert it to a pandas
      # DataFrame and then to the NumPy array 'numpy_df' for use with matplotlib.
      pandas_df = df3.to_pandas()
      numpy_df = pandas_df.values
      
      # Color each point by its 'clusterid' value.
      plt.scatter(numpy_df[:,1], numpy_df[:,2], c=numpy_df[:,3])
      for ind, value in enumerate(numpy_df[:, 0]):
          # sno_to_state is used here to get state names.
          plt.text(numpy_df[ind,1], numpy_df[ind,2], sno_to_state[int(value)], fontsize=14)
      plt.xlabel('rape')
      plt.ylabel('murder')
      plt.show()
      Running the above commands displays the following plot:

      [Figure: clusters plotted by 'rape' (x-axis) and 'murder' (y-axis), example 6b]
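
As a text complement to the plots in step 6, the cluster assignments can also be listed by state name. The following sketch assumes the joined rows have already been pulled to the client as (sno, clusterid) pairs. The helper name `states_by_cluster` and the sample inputs are hypothetical, and the cluster IDs shown are placeholders rather than output of the KMeans function.

```python
from collections import defaultdict


def states_by_cluster(rows, sno_to_state):
    """Hypothetical helper: group state names by cluster ID.

    rows is an iterable of (sno, clusterid) pairs, and sno_to_state
    maps each sno to its state name, as built in step 4.
    """
    groups = defaultdict(list)
    for sno, clusterid in rows:
        groups[int(clusterid)].append(sno_to_state[int(sno)])
    # Sort IDs and names so the listing is stable across runs.
    return {cid: sorted(names) for cid, names in sorted(groups.items())}


# Placeholder inputs for illustration only; not taken from the example table.
sample_sno_to_state = {1: "Alabama", 2: "Alaska", 3: "Arizona", 4: "Arkansas"}
sample_rows = [(1, 0), (2, 1), (3, 1), (4, 0)]

for cid, names in states_by_cluster(sample_rows, sample_sno_to_state).items():
    print(cid, names)
```

With the real data, the pairs could be taken from the 't1_sno' and 'clusterid' columns of the joined DataFrame after converting it to pandas, assuming both are present as regular columns.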