- KMeans(data, columns, centers, centroids_data=None, exclude_columns=None, max_iter=50, operator_database=None, threshold=0.001)
- DESCRIPTION:
The function performs fast K-Means clustering and returns, for each cluster, the
means, row count, and variances of the clustered columns. Specifically, the output
rows with positive cluster IDs contain the average value of each clustered column
along with the row count for that cluster ID. The rows with negative cluster IDs
contain the variance of each clustered column for that cluster ID.
Note:
This function is applicable only to columns containing numeric data.
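The positive/negative cluster ID convention described above can be handled by splitting the output rows on the sign of the ID. The following is a minimal sketch in plain Python, not the library's own code. The column names ("clusterid", "avg_cc_bal", "cnt"), the row values, and the pairing of ID -k with cluster k are all assumptions made for illustration; check the actual "result" DataFrame for the real schema.

```python
# Made-up rows standing in for a KMeans result. Column names and values are
# hypothetical; they are not the documented output schema.
rows = [
    {"clusterid": 1, "avg_cc_bal": 950.0, "cnt": 120},
    {"clusterid": 2, "avg_cc_bal": 3100.0, "cnt": 45},
    {"clusterid": -1, "avg_cc_bal": 210.5},
    {"clusterid": -2, "avg_cc_bal": 880.25},
]


def split_by_sign(rows, id_col="clusterid"):
    """Separate mean/count rows (positive IDs) from variance rows (negative IDs)."""
    means = {r[id_col]: r for r in rows if r[id_col] > 0}
    # Assumption: the variance row with ID -k belongs to cluster k, so key it
    # by the absolute value to line it up with the matching mean row.
    variances = {-r[id_col]: r for r in rows if r[id_col] < 0}
    return means, variances


means, variances = split_by_sign(rows)
for cid in sorted(means):
    print(cid, means[cid]["avg_cc_bal"], means[cid]["cnt"], variances[cid]["avg_cc_bal"])
```

Keying both dictionaries by the positive cluster number keeps the mean, count, and variance of one cluster reachable through a single lookup.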
PARAMETERS:
data:
Required Argument.
Specifies the input data on which K-Means clustering is to be performed.
Types: teradataml DataFrame
columns:
Required Argument.
Specifies the name(s) of the column(s) to be used in clustering. Alternatively,
a pre-defined string can be passed to select all columns or all numeric columns.
Permitted Values:
* Name(s) of the column(s) in "data".
* Pre-defined strings:
* 'all' - all columns
* 'allnumeric' - all numeric columns
Types: str OR list of Strings (str)
centers:
Required Argument.
Specifies the number of clusters to be contained in the cluster model.
Types: int
centroids_data:
Optional Argument.
Specifies the teradataml DataFrame containing the output of a previous clustering
run, which is used to initialize the clustering algorithm instead of random values.
If this argument is not specified or is None, the function starts with random values.
Types: teradataml DataFrame
exclude_columns:
Optional Argument.
Specifies the name(s) of the column(s) to exclude from the clustering analysis.
If 'all' or 'allnumeric' is used in the "columns" argument, this argument can
be used to exclude specific columns from the analysis.
Types: str OR list of Strings (str)
max_iter:
Optional Argument.
Specifies the maximum number of iterations to perform during clustering.
Default Value: 50
Types: int
operator_database:
Optional Argument.
Specifies the database where the table operators called by Vantage Analytic
Library reside. If not specified, the library searches the standard search path
for table operators, including the current database.
Types: str
threshold:
Optional Argument.
Specifies the value which determines if the algorithm has converged based on
how much the cluster centroids change from one iteration to the next.
Default Value: 0.001
Types: float
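To make the interplay of "max_iter", "threshold", and "centroids_data" concrete, here is a minimal sketch of a plain Lloyd-style K-Means loop in pure Python. It is not the Vantage Analytic Library implementation. The convergence rule (stop once the largest centroid shift is at most the threshold), the function name `kmeans`, and the `initial` and `seed` parameters are assumptions made for illustration; `initial` plays the role that "centroids_data" plays above.

```python
import math
import random


def kmeans(points, centers, max_iter=50, threshold=0.001, initial=None, seed=0):
    """Illustrative K-Means on tuples of floats.

    Stops when the largest centroid shift is <= threshold (an assumed rule)
    or after max_iter iterations, whichever comes first.
    Returns (centroids, iterations_used).
    """
    rng = random.Random(seed)
    # Start from supplied centroids if given, otherwise from random points.
    centroids = list(initial) if initial is not None else rng.sample(points, centers)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        # An empty group keeps its previous centroid.
        updated = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= threshold:
            break
    return centroids, iterations


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# First run: random starting centroids, few iterations, loose threshold.
first, n_first = kmeans(points, centers=2, max_iter=5, threshold=0.1)
# Second run: warm start from the first run's centroids.
second, n_second = kmeans(points, centers=2, max_iter=10, threshold=0.1, initial=first)
print(n_first, n_second)
```

A larger threshold or a smaller "max_iter" ends the loop sooner at the cost of less settled centroids. Starting from centroids that are already close to stable typically needs fewer iterations than a random start.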
RETURNS:
An instance of KMeans.
Output teradataml DataFrames can be accessed using attribute references, such as
KMeansObj.<attribute_name>.
Output teradataml DataFrame attribute name is: result.
Note:
If the argument "centroids_data" is specified, then "centroids_data" DataFrame
is overwritten by the result DataFrame of KMeansObj.
RAISES:
TeradataMlException, TypeError, ValueError
EXAMPLES:
# Notes:
# 1. To execute Vantage Analytic Library functions,
# a. import "valib" object from teradataml.
# b. set 'configure.val_install_location' to the database name where Vantage
# Analytic Library functions are installed.
# 2. Datasets used in these examples can be loaded using Vantage Analytic Library
# installer.
# Import valib object from teradataml to execute this function.
from teradataml import valib
# Set the 'configure.val_install_location' variable.
from teradataml import configure
configure.val_install_location = "SYSLIB"
# Create the required teradataml DataFrame.
from teradataml import DataFrame
df = DataFrame("customer_analysis")
print(df)
# Example 1: Run KMeans clustering on the DataFrame 'customer_analysis' with random
# initial values. The function uses 'all' for the "columns" argument and
# excludes all non-numeric columns through "exclude_columns".
obj = valib.KMeans(data=df,
columns='all',
exclude_columns=["cust_id", "gender", "marital_status",
"city_name", "state_code"],
centers=3)
# Print the results.
print(obj.result)
# Example 2: Run KMeans clustering on the DataFrame 'customer_analysis' with
# pre-existing result DataFrame in "centroids_data" argument.
# First run KMeans() with initial random values.
obj = valib.KMeans(data=df,
columns=["avg_cc_bal", "avg_ck_bal", "avg_sv_bal"],
centers=3,
max_iter=5,
threshold=0.1)
# Use KMeans result teradataml DataFrame (from above step) in "centroids_data" argument.
ob1 = valib.KMeans(data=df,
columns=["avg_cc_bal", "avg_ck_bal", "avg_sv_bal"],
centers=3,
max_iter=10,
threshold=0.1,
centroids_data=obj.result)
# Print the results.
print(ob1.result)
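The way "columns" and "exclude_columns" combine in Example 1 can be pictured with a small pure-Python sketch. This is not how teradataml or the Vantage Analytic Library resolves columns internally. The helper `resolve_columns`, the trimmed schema, and its type labels are all hypothetical and serve only to illustrate the documented behavior of 'all', 'allnumeric', and exclusions.

```python
# Assumed type labels that count as numeric in this sketch.
NUMERIC_TYPES = {"int", "float", "decimal"}

# Trimmed, hypothetical schema for 'customer_analysis': name -> type label.
schema = {
    "cust_id": "int",
    "gender": "str",
    "marital_status": "str",
    "city_name": "str",
    "state_code": "str",
    "avg_cc_bal": "float",
    "avg_ck_bal": "float",
    "avg_sv_bal": "float",
}


def resolve_columns(schema, columns, exclude_columns=None):
    """Return the columns selected by "columns" minus those in "exclude_columns"."""
    if columns == "all":
        selected = list(schema)
    elif columns == "allnumeric":
        selected = [name for name, kind in schema.items() if kind in NUMERIC_TYPES]
    elif isinstance(columns, str):
        selected = [columns]
    else:
        selected = list(columns)
    # Accept a single name or a list of names, as the argument types allow.
    if isinstance(exclude_columns, str):
        exclude_columns = [exclude_columns]
    excluded = set(exclude_columns or [])
    return [name for name in selected if name not in excluded]


from_all = resolve_columns(
    schema,
    "all",
    exclude_columns=["cust_id", "gender", "marital_status", "city_name", "state_code"],
)
from_numeric = resolve_columns(schema, "allnumeric", exclude_columns="cust_id")
print(from_all)
print(from_numeric)
```

Under these assumed types, 'allnumeric' with only the identifier column excluded selects the same three balance columns that Example 1 reaches by excluding five columns from 'all'.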