KMeans
Description
The td_kmeans_sqle() function groups a set of observations into k clusters
in which each observation belongs to the cluster with the nearest mean
(cluster center or cluster centroid). The algorithm minimizes the
objective function, that is, the total Euclidean distance of all data points
from the center of their cluster, as follows:
1. Specify or randomly select k initial cluster centroids.
2. Assign each data point to the cluster that has the closest centroid.
3. Recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
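The four steps above can be sketched in a few lines of base R. This is only an illustration of the general technique on a small in-memory matrix, not the in-database implementation behind td_kmeans_sqle(); the helper name simple_kmeans and the sample points are hypothetical.

```r
# Illustrative Lloyd-style k-means on a numeric matrix (hypothetical helper,
# not part of tdplyr). Assumes x has at least two rows and k <= nrow(x).
simple_kmeans <- function(x, k, iter.max = 10, seed = NULL) {
  x <- as.matrix(x)
  if (!is.null(seed)) set.seed(seed)
  # Step 1: randomly select k distinct rows as the initial centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(iter.max)) {
    # Step 2: squared Euclidean distance from every point to every centroid,
    # then assign each point to the closest centroid.
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    cluster <- max.col(-d, ties.method = "first")
    # Step 3: move each centroid to the mean of its assigned points.
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Step 4: stop once the centroids no longer move.
    if (all(new_centroids == centroids)) break
    centroids <- new_centroids
  }
  # Within-cluster sum of squares for the final configuration.
  wss <- sum((x - centroids[cluster, , drop = FALSE])^2)
  list(cluster = cluster, centroids = centroids, tot.withinss = wss)
}

# Two well-separated pairs of points.
pts <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
fit <- simple_kmeans(pts, k = 2, seed = 1)
print(fit$cluster)
print(fit$centroids)
```

The sketch stops only when the centroids are exactly unchanged or iter.max is reached; the "threshold" argument of td_kmeans_sqle() suggests that the real function uses a tolerance instead.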
The algorithm does not necessarily find the optimal configuration, because the
result depends significantly on the initial randomly selected cluster centers.
Run the function multiple times to reduce the effect of this limitation.
The function also returns the within-cluster sum of squares, which can be used to
determine an optimal number of clusters with the Elbow method.
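As a rough illustration of the Elbow method, the sketch below uses base R's stats::kmeans() on synthetic data rather than td_kmeans_sqle(), so it runs without a database connection. The data, the range of k, and the variable names are assumptions made for the example. The nstart argument plays the role of running the algorithm several times from different random starts and keeping the best result.

```r
# Synthetic data with three well-separated groups (illustrative only).
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
# nstart = 10 repeats each fit from 10 random starts and keeps the best one.
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# Inspect (or plot) wss against k; the "elbow" is the k after which
# adding more clusters no longer reduces wss by much.
print(data.frame(k = ks, wss = wss))
```

With td_kmeans_sqle(), the same idea would apply by calling the function once per candidate value of "num.clusters" and comparing the within-cluster sums of squares that it reports.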
Notes:
This function ignores rows in "data" and "centroids.data" that have a NULL entry in the specified "target.columns".
The function can produce deterministic output across different machine configurations if the user provides "centroids.data".
If "centroids.data" is not provided, the function randomly samples the initial centroids from "data". In this case, the "seed" argument makes the function output deterministic on a machine with a given configuration. However, using the "seed" argument does not guarantee deterministic output across machines with different configurations.
This function requires the UTF8 client character set for UNICODE data.
This function does not support Pass Through Characters (PTCs).
For information about PTCs, see Teradata Vantage™ - Analytics Database International Character Set Support.
This function does not support KanjiSJIS or Graphic data types.
Usage
td_kmeans_sqle(
  data = NULL,
  centroids.data = NULL,
  id.column = NULL,
  target.columns = NULL,
  num.clusters = NULL,
  seed = NULL,
  threshold = 0.0395,
  iter.max = 10,
  num.init = 1,
  output.cluster.assignment = FALSE,
  ...
)
Arguments
data
    Required Argument.
centroids.data
    Optional Argument.
id.column
    Required Argument.
target.columns
    Required Argument.
num.clusters
    Optional Argument.
seed
    Optional Argument.
    Types: integer
threshold
    Optional Argument.
iter.max
    Optional Argument.
num.init
    Optional Argument.
output.cluster.assignment
    Optional Argument.
...
    Specifies the generic keyword arguments that SQLE functions accept, such as volatile. The function also allows the user to partition, hash, order, or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input.
Value
Function returns an object of class "td_kmeans_sqle",
which is a named list containing objects of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using the names:
result
model.data
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load the example data.
loadExampleData("kmeans_example", "computers_train1")
# Create tbl_teradata object.
computers_train1 <- tbl(con, "computers_train1")
# Check the list of available analytic functions.
display_analytic_functions()
# Example 1 : Grouping a set of observations into 2 clusters in which
# each observation belongs to the cluster with the nearest mean.
KMeans_out <- td_kmeans_sqle(id.column="id",
                             target.columns=c('price', 'speed'),
                             data=computers_train1,
                             num.clusters=2)
# Print the result tbl_teradata objects.
print(KMeans_out$result)
print(KMeans_out$model.data)
# Example 2 : Grouping a set of observations by specifying initial
# centroid data.
# Get the set of initial centroids by accessing the group of rows
# from input data.
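# The id values here are illustrative; choose the rows that should act as
# the initial centroids.
kmeans_initial_centroids_table <- computers_train1 %>% filter(id %in% c(19, 97, 110, 229))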
KMeans_out_1 <- td_kmeans_sqle(id.column="id",
                               target.columns=c('price', 'speed'),
                               data=computers_train1,
                               centroids.data=kmeans_initial_centroids_table)
# Print the result tbl_teradata objects.
print(KMeans_out_1$result)
print(KMeans_out_1$model.data)
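The Notes state that "seed" makes the output repeatable on a given machine configuration when "centroids.data" is not supplied. A possible way to package that is sketched below. The wrapper name run_seeded_kmeans and the values chosen for seed, num.init, and iter.max are assumptions for illustration; only the argument names come from the Usage section. The call assumes tdplyr is attached and a context has already been created, as in the examples above.

```r
# Hypothetical convenience wrapper (not part of tdplyr): runs td_kmeans_sqle()
# with a fixed seed so that repeated runs on the same machine configuration
# start from the same randomly sampled centroids.
run_seeded_kmeans <- function(data_tbl, k = 2, seed = 42) {
  td_kmeans_sqle(data = data_tbl,
                 id.column = "id",
                 target.columns = c("price", "speed"),
                 num.clusters = k,
                 seed = seed,      # illustrative value
                 num.init = 5,     # illustrative value
                 iter.max = 50)    # illustrative value
}

# Usage, given the tbl_teradata object created earlier:
# KMeans_out_2 <- run_seeded_kmeans(computers_train1)
# print(KMeans_out_2$result)
```

Per the Notes, a fixed seed is not expected to give identical output on machines with different configurations; supplying "centroids.data" is the documented way to get that.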