Teradata Package for Python Function Reference | 17.10 - KMeans

Teradata® Package for Python Function Reference

Product

Teradata Package for Python

Release Number

17.10

Published

April 2022

Language

English (United States)

Last Update

2022-08-19

Product Category

Teradata Vantage

Kmeans

Functions
KMeans(data, columns, centers, centroids_data=None, exclude_columns=None, max_iter=50, operator_database=None, threshold=0.001)
    DESCRIPTION:
        The function performs fast K-Means clustering and returns, for each
        cluster, the column means with the row count and the column variances.
        Specifically, the rows associated with positive cluster IDs in the
        output contain the average value of each clustered column along with
        the count for each cluster ID. The rows associated with negative
        cluster IDs contain the variance of each clustered column for each
        cluster ID.
        Note:
            This function is applicable only to columns containing numeric data.

    PARAMETERS:
        data:
            Required Argument.
            Specifies the input data on which K-Means clustering is to be
            performed.
            Types: teradataml DataFrame

        columns:
            Required Argument.
            Specifies the name(s) of the column(s) to be used in clustering.
            It also accepts pre-defined strings that select all columns or
            all numeric columns.
            Permitted Values:
                * Name(s) of the column(s) in "data".
                * Pre-defined strings:
                    * 'all' - all columns
                    * 'allnumeric' - all numeric columns
            Types: str OR list of Strings (str)

        centers:
            Required Argument.
            Specifies the number of clusters to be contained in the cluster
            model.
            Types: int

        centroids_data:
            Optional Argument.
            Specifies the teradataml DataFrame containing clustering output,
            which is used as the initial value for the clustering algorithm
            instead of random values. If this argument is not specified or
            is None, the function starts with random values.
            Types: teradataml DataFrame

        exclude_columns:
            Optional Argument.
            Specifies the name(s) of the column(s) to exclude from the
            clustering analysis. If 'all' or 'allnumeric' is used in the
            "columns" argument, this argument can be used to exclude specific
            columns from the analysis.
            Types: str OR list of Strings (str)

        max_iter:
            Optional Argument.
            Specifies the maximum number of iterations to perform during
            clustering.
            Default Value: 50
            Types: int

        operator_database:
            Optional Argument.
            Specifies the database where the table operators called by
            Vantage Analytic Library reside. If not specified, the library
            searches the standard search path for table operators, including
            the current database.
            Types: str

        threshold:
            Optional Argument.
            Specifies the value that determines whether the algorithm has
            converged, based on how much the cluster centroids change from
            one iteration to the next.
            Default Value: 0.001
            Types: float

    RETURNS:
        An instance of KMeans.
        Output teradataml DataFrames can be accessed using attribute
        references, such as KMeansObj.<attribute_name>.
        Output teradataml DataFrame attribute name is: result.
        Note:
            If the argument "centroids_data" is specified, then the
            "centroids_data" DataFrame is overwritten by the result DataFrame
            of KMeansObj.

    RAISES:
        TeradataMlException, TypeError, ValueError

    EXAMPLES:
        # Notes:
        #   1. To execute Vantage Analytic Library functions,
        #       a. import "valib" object from teradataml.
        #       b. set 'configure.val_install_location' to the database name
        #          where Vantage Analytic Library functions are installed.
        #   2. Datasets used in these examples can be loaded using Vantage
        #      Analytic Library installer.

        # Import valib object from teradataml to execute this function.
        from teradataml import valib

        # Set the 'configure.val_install_location' variable.
        from teradataml import configure
        configure.val_install_location = "SYSLIB"

        # Create the required teradataml DataFrame.
        from teradataml import DataFrame
        df = DataFrame("customer_analysis")
        print(df)

        # Example 1: Run KMeans clustering on the DataFrame 'customer_analysis'
        #            with initial random values. The function uses 'all' for
        #            the "columns" argument and excludes all non-numeric columns.
        obj = valib.KMeans(data=df,
                           columns='all',
                           exclude_columns=["cust_id", "gender", "marital_status",
                                            "city_name", "state_code"],
                           centers=3)

        # Print the results.
        print(obj.result)

        # Example 2: Run KMeans clustering on the DataFrame 'customer_analysis'
        #            with a pre-existing result DataFrame in the
        #            "centroids_data" argument.
        # First run KMeans() with initial random values.
        obj = valib.KMeans(data=df,
                           columns=["avg_cc_bal", "avg_ck_bal", "avg_sv_bal"],
                           centers=3,
                           max_iter=5,
                           threshold=0.1)

        # Use the KMeans result teradataml DataFrame (from the step above) in
        # the "centroids_data" argument.
        ob1 = valib.KMeans(data=df,
                           columns=["avg_cc_bal", "avg_ck_bal", "avg_sv_bal"],
                           centers=3,
                           max_iter=10,
                           threshold=0.1,
                           centroids_data=obj.result)

        # Print the results.
        print(ob1.result)
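
    ILLUSTRATION:
        The sketch below is not the Vantage Analytic Library implementation.
        It is a minimal, plain-Python k-means that may help in reading the
        arguments and the output layout described above. It assumes that
        "threshold" is compared against the largest change in any centroid
        coordinate and that variances are population variances; neither
        detail is confirmed by this reference. The names kmeans_sketch,
        summarize, init, seed, cluster_id, values and count are hypothetical
        and exist only for this illustration. The "init" parameter plays the
        role that "centroids_data" plays in KMeans().

```python
import random


def kmeans_sketch(rows, centers, max_iter=50, threshold=0.001, init=None, seed=0):
    """Plain Lloyd-style k-means; an illustration, not the Vantage algorithm."""
    rng = random.Random(seed)
    # Start from supplied centroids (analogous to "centroids_data") or random rows.
    centroids = [list(c) for c in init] if init else [list(r) for r in rng.sample(rows, centers)]
    groups = []
    for _ in range(max_iter):
        # Assign every row to its nearest centroid (squared Euclidean distance).
        groups = [[] for _ in centroids]
        for row in rows:
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            groups[dists.index(min(dists))].append(row)
        # Recompute each centroid as the column-wise mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        # Assumed convergence test: largest coordinate change <= threshold.
        shift = max(abs(a - b) for u, c in zip(updated, centroids) for a, b in zip(u, c))
        centroids = updated
        if shift <= threshold:
            break
    return centroids, groups


def summarize(groups):
    """Positive IDs carry means and count, negative IDs carry variances."""
    out = []
    for cid, g in enumerate(groups, start=1):
        if not g:
            continue  # nothing to report for an empty cluster
        cols = list(zip(*g))
        means = [sum(col) / len(g) for col in cols]
        # Population variance is an assumption made for this sketch.
        variances = [sum((v - m) ** 2 for v in col) / len(g) for col, m in zip(cols, means)]
        out.append({"cluster_id": cid, "values": means, "count": len(g)})
        out.append({"cluster_id": -cid, "values": variances})
    return out


if __name__ == "__main__":
    data = [[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [9.0, 11.0]]
    # Supplying "init" mirrors passing a previous result as "centroids_data".
    _, final_groups = kmeans_sketch(data, centers=2, init=[[0.0, 0.0], [10.0, 10.0]])
    for line in summarize(final_groups):
        print(line)
```

        Each cluster produces two entries in the summary: one with a
        positive ID holding the means and the member count, and one with the
        matching negative ID holding the variances. This mirrors how the
        DESCRIPTION says the rows of the result DataFrame are to be read.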