Silhouette
Description
The td_silhouette_sqle() function implements the silhouette method for interpreting and
validating consistency within clusters of data.
The function determines how well each data point fits the cluster it is assigned to.
The silhouette value determines the similarity of an object to its cluster (cohesion) compared to
other clusters (separation). The silhouette plot displays a measure of how close each point in one
cluster is to the points in the neighbouring clusters and thus provides a way to assess parameters
like the optimal number of clusters.
The silhouette scores and their definitions are as follows:
1: Data is appropriately clustered
-1: Data is not appropriately clustered
0: Datum is on the border of two natural clusters
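As a rough illustration of how such scores arise, the sketch below computes the textbook silhouette value s(i) = (b - a) / max(a, b) in base R, where a is the mean distance from a row to its own cluster and b is the smallest mean distance to any other cluster. It runs locally on made-up data and does not call Vantage. The helper name silhouette_scores, the toy vector x, and the use of Euclidean distance are assumptions made for this example; the distance metric used by the SQL Engine function is not stated here.

```r
# Textbook silhouette: s(i) = (b - a) / max(a, b).
# Assumes at least two clusters; rows in singleton clusters score 0.
silhouette_scores <- function(d, cluster) {
  vapply(seq_along(cluster), function(i) {
    same <- cluster == cluster[i]
    same[i] <- FALSE                      # exclude the row itself
    if (!any(same)) return(0)
    a <- mean(d[i, same])                 # cohesion: own cluster
    others <- setdiff(unique(cluster), cluster[i])
    # separation: nearest other cluster by mean distance
    b <- min(vapply(others, function(k) mean(d[i, cluster == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

# Made-up one-dimensional data with three visibly separate groups.
x <- c(0, 0.5, 1, 10, 10.5, 11, 20, 20.5, 21)
d <- as.matrix(dist(x))                   # N x N Euclidean distances

# Per-row scores for the labelling that matches the three groups.
labels <- rep(c("a", "b", "c"), each = 3)
s <- silhouette_scores(d, labels)

# Mean score for k = 2, 3, 4 clusters found by base R kmeans().
set.seed(1)
mean_by_k <- vapply(2:4, function(k) {
  mean(silhouette_scores(d, kmeans(x, centers = k, nstart = 10)$cluster))
}, numeric(1))
names(mean_by_k) <- 2:4
best_k <- as.integer(names(which.max(mean_by_k)))

print(round(s, 3))
print(round(mean_by_k, 3))
```

For this toy data every per-row score is close to 1, and the mean score peaks at three clusters, which is the sense in which silhouette scores help assess the number of clusters.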
Notes:
The algorithm used in this function is of the order of N*N (where N is the number of rows), so expect the query to run significantly longer as the number of rows in the input data increases.
This function requires the UTF8 client character set for UNICODE data.
This function does not support Pass Through Characters (PTCs). For information about PTCs, see Teradata Vantage™ - Analytics Database International Character Set Support.
This function does not support KanjiSJIS or Graphic data types.
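To give a feel for the N*N growth mentioned in the first note, the following sketch counts the distinct row pairs an all-pairs distance computation would touch. The helper pair_count is hypothetical, and treating the pair count as a proxy for run time is an assumption; actual query times depend on the system and the data.

```r
# Number of distinct row pairs among n rows: n * (n - 1) / 2.
pair_count <- function(n) n * (n - 1) / 2

# Made-up input sizes, each ten times larger than the previous one.
n_rows <- c(1e3, 1e4, 1e5)
growth <- data.frame(rows = n_rows, pairs = pair_count(n_rows))
print(growth)
```

Each tenfold increase in rows raises the pair count roughly a hundredfold, so sampling or filtering the input before scoring may be worth considering for large tables.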
Usage
td_silhouette_sqle(
  data = NULL,
  accumulate = NULL,
  id.column = NULL,
  cluster.id.column = NULL,
  target.columns = NULL,
  output.type = "SCORE",
  ...
)
Arguments
data: Required Argument.
accumulate: Optional Argument.
id.column: Required Argument.
cluster.id.column: Required Argument.
target.columns: Required Argument.
output.type: Optional Argument. Default Value: "SCORE"
...: Specifies the generic keyword arguments that SQLE functions accept, including persist and volatile. The function also allows the user to partition, hash, order, or local order the input data. These generic arguments are available for each argument that accepts tbl_teradata as input.
Value
Function returns an object of class "td_silhouette_sqle",
which is a named list containing an object of class "tbl_teradata".
Named list members can be referenced directly with the "$" operator
using the name: result
Examples
# Get the current context/connection.
con <- td_get_context()$connection
# Load the example data.
loadExampleData("tdplyr_example", "mobile_data")
# Create tbl_teradata object.
mobile_data <- tbl(con, "mobile_data")
# Check the list of available analytic functions.
display_analytic_functions()
# Example 1: Find the silhouette score for each input sample.
Silhouette_result1 <- td_silhouette_sqle(
  accumulate = c("feature"),
  id.column = "row_id",
  cluster.id.column = "userid",
  target.columns = "value1",
  output.type = "SAMPLE_SCORES",
  data = mobile_data)
# Print the result.
print(Silhouette_result1$result)
# Example 2: Find average silhouette score of all input samples.
Silhouette_result2 <- td_silhouette_sqle(
  id.column = "row_id",
  cluster.id.column = "userid",
  target.columns = c("value1"),
  data = mobile_data,
  output.type = "SCORE")
# Print the result.
print(Silhouette_result2$result)
# Example 3: Find average silhouette scores of input samples for each cluster.
Silhouette_result3 <- td_silhouette_sqle(
  id.column = "row_id",
  cluster.id.column = "userid",
  target.columns = c("value1"),
  data = mobile_data,
  output.type = "CLUSTER_SCORES")
# Print the result.
print(Silhouette_result3$result)