Clustering Using KMeans with teradataml Package - Teradata Vantage

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-02-17

This example uses the US Arrests dataset, which has 50 samples, one per US state. Each sample gives the number of arrests per 100,000 residents for assault, murder, and rape in 1973, along with the percentage of the population living in urban areas.

This example shows how to use the KMeans clustering function.

  1. Import the required modules.
    from teradataml import KMeans, KMeansPredict
    from teradataml.dataframe.dataframe import DataFrame
    from teradataml.data.load_example_data import load_example_data
    import matplotlib.pyplot as plt
  2. If the input table "kmeans_us_arrests_data" does not already exist, create the table and load the dataset into it.
    load_example_data("kmeans", "kmeans_us_arrests_data")
  3. Create a teradataml DataFrame from "kmeans_us_arrests_data".
    # Create a teradataml DataFrame from the table.
    df_train = DataFrame('kmeans_us_arrests_data')
    
    # Print the first rows to see what the training data looks like.
    print("\nHead(10) of Train data:")
    print(df_train.head(10))
  4. The training dataset has the columns 'sno', 'state', 'murder', 'assault', 'urban_pop', and 'rape'. Because 'sno' maps one-to-one to 'state', the 'state' column is redundant for training. Drop it, keeping a mapping from 'sno' to 'state' for labeling the plots.
    # Build a dictionary mapping 'sno' to 'state'. Required for labeling the plots.
    df1 = df_train.select(['sno', 'state'])
    # Index the pandas DataFrame by 'sno' explicitly so the dictionary keys are 'sno' values.
    sno_to_state = dict(df1.to_pandas(index_column='sno')['state'])
    print(sno_to_state)
    # Drop the 'state' column; the 'sno' column identifies each state.
    df_train = df_train.drop(['state'], axis=1)
    colnames = ["sno", "murder", "assault", "urban_pop", "rape"]
  5. Apply the KMeans algorithm to generate two clusters and inspect the outputs.
    1. Apply the KMeans algorithm.
      # Train the KMeans model with 2 clusters.
      kMeans_model = KMeans(id_column="sno",
                            target_columns=['murder', 'assault', 'urban_pop', 'rape'],
                            data=df_train,
                            num_init=10,
                            num_clusters=2)
      # Print the KMeans model training results
      print(kMeans_model.result)
    2. The 'model_data' DataFrame presents an overall summary of the trained KMeans model.
      print(kMeans_model.model_data)
    3. Use the KMeansPredict function to perform predictions using the trained KMeans model.
      # Assign each sample in the dataset to a cluster using the trained KMeans model.
      kMeans_output = KMeansPredict(object=kMeans_model.result,
                                    data=df_train)
    4. The 'result' DataFrame presents the 'sno' value representing each state and the corresponding cluster ID for all the samples.
      kMeans_output.result.head(30)
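To make the roles of num_clusters and num_init more concrete, the following is a minimal pure-Python sketch of k-means with random restarts. It is not the Vantage implementation: the function names (squared_distance, lloyd, kmeans), the toy points, and the seed are all assumptions made for illustration. The sketch treats num_init as the number of restarts from different random initial centroids and keeps the run with the smallest total within-cluster sum of squares.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, num_clusters, rng, max_iter=100):
    """One k-means run (Lloyd's algorithm) from random initial centroids."""
    # Hypothetical initialization: pick distinct input points as centroids.
    centroids = rng.sample(points, num_clusters)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(num_clusters),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(num_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    # Total within-cluster sum of squares, used to compare runs.
    wcss = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return labels, centroids, wcss


def kmeans(points, num_clusters, num_init=10, seed=0):
    """Repeat Lloyd's algorithm num_init times and keep the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_init):
        run = lloyd(points, num_clusters, rng)
        if best is None or run[2] < best[2]:
            best = run
    return best


# Made-up toy points (not the US Arrests data): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids, wcss = kmeans(points, num_clusters=2, num_init=10)
print(labels)
```

Cluster IDs from k-means are arbitrary labels, so only the grouping is meaningful, not which group receives ID 0 or 1.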
  6. Analyze the clustering output by plotting the clusters against pairs of features.
    # Inner join the prediction output with the training dataset df_train on 'sno'.
    # The joined data in df1 is used for plotting.
    df1 = df_train.join(kMeans_output.result, how='inner', on=['sno'], lsuffix='t1', rsuffix='t2')
    print("\nInner join of the prediction output with the training dataset df_train:")
    print(df1)
    1. Plot clusters based on the two features 'urban_pop' and 'murder'.
      # Selecting only the necessary features for plot.
      df3 = df1.select(['t1_sno', 'urban_pop', 'murder', 'td_clusterid_kmeans'])
      # To plot with the matplotlib library, convert the teradataml DataFrame to a
      # pandas DataFrame and then to a NumPy array 'numpy_df'.
      pandas_df = df3.to_pandas()
      numpy_df = pandas_df.values
      # Setting figure display size.
      plt.rcParams['figure.figsize'] = [15, 10]
      # Coloring based on 'td_clusterid_kmeans'.
      plt.scatter(numpy_df[:,1], numpy_df[:,2], c=numpy_df[:,3], cmap='winter_r', alpha=0.4)
      for ind, value in enumerate(numpy_df[:, 0]):
          # sno_to_state is used here to get state names.
          plt.text(numpy_df[ind,1], numpy_df[ind,2], sno_to_state[int(value)], fontsize=14)
      plt.xlabel('urban_pop')
      plt.ylabel('murder')
      plt.show()
      Running these commands displays a scatter plot of 'murder' against 'urban_pop', with points colored by cluster and labeled with state names.
    2. Plot the clusters based on the two features 'rape' and 'murder'.
      # Selecting only the necessary features for plot.
      df3 = df1.select(['t1_sno', 'rape', 'murder', 'td_clusterid_kmeans'])
      
      # To plot with the matplotlib library, convert the teradataml DataFrame to a
      # pandas DataFrame and then to a NumPy array 'numpy_df'.
      pandas_df = df3.to_pandas()
      numpy_df = pandas_df.values
      # Coloring based on 'td_clusterid_kmeans'.
      plt.scatter(numpy_df[:,1], numpy_df[:,2], c=numpy_df[:,3], cmap='winter_r', marker='^', alpha=0.4)
      for ind, value in enumerate(numpy_df[:, 0]):
          # sno_to_state is used here to get state names.
          plt.text(numpy_df[ind,1], numpy_df[ind,2], sno_to_state[int(value)], fontsize=14)
      plt.xlabel('rape')
      plt.ylabel('murder')
      plt.show()
      Running these commands displays a scatter plot of 'murder' against 'rape', with points colored by cluster and labeled with state names.
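Alongside the plots, a per-cluster average of each feature can help describe what separates the clusters. The sketch below is one possible way to compute it in plain Python. The helper cluster_means and the sample rows are hypothetical: the rows are made-up values shaped like the joined output (a cluster ID column named 'td_clusterid_kmeans' next to the feature columns), not actual US Arrests figures or actual clustering results.

```python
from collections import defaultdict


def cluster_means(rows, cluster_key, feature_keys):
    """Return {cluster_id: {feature: mean}} for a list of row dictionaries."""
    sums = defaultdict(lambda: [0.0] * len(feature_keys))
    counts = defaultdict(int)
    for row in rows:
        cid = row[cluster_key]
        counts[cid] += 1
        for i, key in enumerate(feature_keys):
            sums[cid][i] += row[key]
    return {
        cid: {key: sums[cid][i] / counts[cid]
              for i, key in enumerate(feature_keys)}
        for cid in counts
    }


# Made-up rows for illustration only.
rows = [
    {"t1_sno": 1, "murder": 12.0, "assault": 250.0, "td_clusterid_kmeans": 0},
    {"t1_sno": 2, "murder": 10.0, "assault": 270.0, "td_clusterid_kmeans": 0},
    {"t1_sno": 3, "murder": 3.0, "assault": 100.0, "td_clusterid_kmeans": 1},
    {"t1_sno": 4, "murder": 2.0, "assault": 60.0, "td_clusterid_kmeans": 1},
]

means = cluster_means(rows, "td_clusterid_kmeans", ["murder", "assault"])
for cid, feature_means in sorted(means.items()):
    print(cid, feature_means)
```

With real output, rows in this shape could be obtained from the pandas DataFrame with df1.to_pandas().to_dict('records'), assuming the joined column names match those used in the plotting steps.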