TD_Silhouette Function

Database Analytic Functions

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Analytics Database
Release Number: 17.20
Published: June 2022
Language: English (United States)
Last Update: 2024-04-06
Product Category: Teradata Vantage™

The TD_Silhouette function interprets and validates the consistency within clusters of data. It measures how well each data point fits the cluster it is assigned to.

The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette plot displays how close each point in one cluster is to the points in the neighboring clusters, and so provides a way to assess parameters such as the optimal number of clusters.

The silhouette scores are as follows:
  • 1: Data is appropriately clustered
  • -1: Data is not appropriately clustered
  • 0: Datum is on the border of two natural clusters

The algorithm used in this function is of the order of N² (where N is the number of rows in the input table), so queries run significantly longer as the number of rows increases.

The silhouette coefficient can be used to evaluate the performance of clustering algorithms. It is a measure of how well each data point fits its assigned cluster and how different it is from the points in the neighboring clusters.

The silhouette coefficient ranges from -1 to 1, where values closer to 1 indicate that the data point is well-clustered and values closer to -1 indicate that the data point may be assigned to the wrong cluster. A value of 0 indicates that the data point is close to the decision boundary between two clusters.

The silhouette coefficient can be used to compare the performance of different clustering algorithms or to optimize the hyperparameters of a specific algorithm. The algorithm that produces the highest silhouette coefficient is usually considered to be the best choice for clustering the dataset.

Different clustering algorithms may perform differently depending on the characteristics of the dataset, and the silhouette coefficient can help identify which algorithm is best suited for a specific dataset. For example, density-based clustering algorithms may perform better on datasets with complex shapes and varying densities, while hierarchical clustering algorithms may perform better on datasets with clear hierarchical structures.

The most commonly used clustering algorithms follow:
  • K-means: Used to partition n observations into k clusters. The algorithm starts with randomly chosen centroids and assigns each point to the nearest centroid. It then recalculates the centroids and repeats the process until the centroids stabilize. K-means is simple, efficient, and widely used for clustering data in different fields, such as finance, biology, social sciences, and image processing.
  • Hierarchical Clustering: Used to group similar data points into nested clusters by using either agglomerative or divisive methods. In agglomerative clustering, each point starts in a separate cluster and pairs of clusters are merged based on their similarity. Divisive clustering starts with all the points in one cluster and recursively splits them into smaller clusters. Hierarchical clustering is often used in biology, text analysis, and social networks.
  • DBSCAN: Density-Based Spatial Clustering of Applications with Noise identifies clusters based on density. It starts from an unvisited data point and collects its neighbors within a given radius. If that neighborhood contains at least a minimum number of points, the point starts a cluster, which then expands through the neighbors of each sufficiently dense point until no more points can be reached. Points that do not belong to any dense region are labeled as noise. DBSCAN is used when the clusters are irregularly shaped or of different sizes, and when noise or outliers need to be detected.
  • Mean Shift: This algorithm attempts to find the local maximum of a density function by shifting the centroid of a cluster over a feature space. The algorithm starts by defining a kernel function that weighs the distance of data points from the centroid. The centroid is then shifted to the area with the highest density of data points until convergence. Mean shift is used in computer vision, image processing, and object tracking.
  • Spectral Clustering: Spectral clustering is based on graph theory. The data is viewed as a graph in which the data points are nodes and the pairwise similarities are edge weights. The algorithm computes the Laplacian matrix of the graph, finds its eigenvectors, and then clusters the data points in the space spanned by a small number of those eigenvectors to obtain k clusters. Spectral clustering is useful when the data is not linearly separable and when there is a clear graph representation of the data. It is used in social network analysis, image segmentation, and recommendation systems.
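
As an illustration of the K-means loop described in the list above, the following is a minimal Python sketch. It is not part of TD_Silhouette or any Teradata function. The function name `kmeans`, its parameters (`seed`, `max_iter`), and the sample points are assumptions made for this example, and the sketch assumes Euclidean distance and Python 3.8 or later (for `math.dist`).

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Start from k input points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical sample data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster labels that this sketch returns are arbitrary integers, so which group receives label 0 depends on the random starting centroids.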

After clustering, the silhouette coefficient is calculated to assess the overall quality of the clustering and to compare the solutions that different algorithms produce on the same dataset. It provides a quantitative measure of how well each data point fits its assigned cluster, based on its distance from other points in the same cluster and from points in neighboring clusters.

The silhouette coefficient is calculated for each data point i as follows:
  • Compute the average distance (a_i) between data point i and all other points in the same cluster.
  • Compute the average distance (b_i) between data point i and all points in the nearest neighboring cluster, that is, the other cluster with the smallest average distance to data point i.
  • Compute the silhouette coefficient for data point i as (b_i - a_i) / max(a_i, b_i).
  • Compute the overall silhouette coefficient as the average of the silhouette coefficients of all data points.

In the formula, the denominator represents the maximum of a_i and b_i, which ensures that the silhouette coefficient falls within the range of -1 to 1. If a_i is much smaller than b_i, the data point i is well-clustered, and the silhouette coefficient approaches 1. If a_i is much larger than b_i, the data point i may be assigned to the wrong cluster, and the silhouette coefficient approaches -1. If a_i and b_i are similar, the data point i is near the decision boundary between two clusters, and the silhouette coefficient approaches 0.
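
The calculation can also be written with explicit symbols. The notation here is an assumption introduced for illustration and does not come from the function reference: C(i) is the cluster that contains data point i, d(i, j) is the distance between data points i and j, and N is the number of data points.

```latex
a_i = \frac{1}{|C(i)| - 1} \sum_{j \in C(i),\, j \neq i} d(i, j)
\qquad
b_i = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j)

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}
\qquad
S = \frac{1}{N} \sum_{i=1}^{N} s_i
```

This form assumes that every cluster contains at least two data points. A common convention, also an assumption here, is to set s_i to 0 for a data point that is alone in its cluster.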

Suppose you have a dataset with five data points, and you apply K-means clustering with K=2. You compute the Euclidean distance between each pair of data points and obtain the following distance matrix:

       1  2  3  4  5
    1  0  2  6  4  3
    2  2  0  5  3  2
    3  6  5  0  6  7
    4  4  3  6  0  5
    5  3  2  7  5  0
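
You assign data points 1, 2, and 3 to cluster A and data points 4 and 5 to cluster B. You then compute the silhouette coefficient for each data point, where a_i averages the distances from data point i to the other points in its own cluster and b_i averages the distances to the points in the other cluster:
  • Data point 1:

    a_1 = (2+6)/2 = 4, and b_1 = (4+3)/2 = 3.5

    The silhouette coefficient for data point 1 is (3.5-4)/4 = -0.125.

  • Data point 2:

    a_2 = (2+5)/2 = 3.5, and b_2 = (3+2)/2 = 2.5

    The silhouette coefficient for data point 2 is (2.5-3.5)/3.5 = -0.29.

  • Data point 3:

    a_3 = (6+5)/2 = 5.5, and b_3 = (6+7)/2 = 6.5

    The silhouette coefficient for data point 3 is (6.5-5.5)/6.5 = 0.15.

  • Data point 4:

    a_4 = 5, and b_4 = (4+3+6)/3 = 4.33

    The silhouette coefficient for data point 4 is (4.33-5)/5 = -0.13.

  • Data point 5:

    a_5 = 5, and b_5 = (3+2+7)/3 = 4

    The silhouette coefficient for data point 5 is (4-5)/5 = -0.20.

The average of the five coefficients is about -0.12. Four of the five coefficients are negative, which suggests that this assignment does not reflect the structure in the distance matrix.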
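
The per-point calculation can be recomputed directly from the distance matrix and the cluster assignment in this example. The following Python sketch is illustrative only and is not the TD_Silhouette implementation. The function name `silhouette_from_distances` is hypothetical. The sketch assumes that a_i averages over the other members of the point's own cluster (excluding the point itself), that b_i is the smallest average distance to any other cluster, that there are at least two clusters, and that a data point alone in its cluster scores 0.

```python
def silhouette_from_distances(dist, labels):
    """Return per-point silhouette coefficients from a full distance matrix."""
    # Group point indexes by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i in range(len(dist)):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for single-point clusters
            continue
        # a_i: average distance to the other points in the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b_i: smallest average distance to the points of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores


# Distance matrix and cluster assignment from the example.
dist = [
    [0, 2, 6, 4, 3],
    [2, 0, 5, 3, 2],
    [6, 5, 0, 6, 7],
    [4, 3, 6, 0, 5],
    [3, 2, 7, 5, 0],
]
labels = ["A", "A", "A", "B", "B"]

scores = silhouette_from_distances(dist, labels)
for point, s in enumerate(scores, start=1):
    print(f"data point {point}: {s:.3f}")
print(f"average: {sum(scores) / len(scores):.3f}")
```

Because the function visits every pair of points, its cost grows with the square of the number of points.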

The average silhouette coefficient for the entire dataset is a measure of the overall quality of the clustering algorithm. Higher average silhouette coefficients indicate better clustering, while lower average silhouette coefficients suggest that the clusters may be poorly defined or overlapping.

The silhouette coefficient considers both intra-cluster and inter-cluster distances to determine how well a data point fits into its assigned cluster compared to how well it could fit into its nearest neighboring cluster.

Do not use the silhouette coefficient in isolation to determine the quality of clustering. Consider other measures, such as domain-specific knowledge and visualization techniques, when evaluating the results of a clustering algorithm.