1.1 - 8.10 - Canopy (ML Engine) - Teradata Vantage

Teradata Vantage™ - Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.1, 8.10
Release Date: October 2019
Content Type: Programming Reference
Publication ID: B700-4003-079K
Language: English (United States)

Canopy clustering is a simple, fast, accurate method for grouping objects into preliminary clusters. It is often performed in preparation for more rigorous clustering techniques, such as k-means clustering.

The canopy clustering algorithm uses a fast, approximate distance metric and two distance thresholds, T1 (loose distance) and T2 (tight distance), where T1 > T2. A point whose distance to a canopy center is less than T2 is assigned to that canopy. A point whose distance to a canopy center is less than T1 can be assigned to that canopy. Because canopies can overlap, a point can be assigned to more than one canopy.
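The following Python sketch illustrates the general, textbook form of the two-threshold procedure. It is not the Canopy function's actual implementation. The names canopy, euclidean, t1, t2, and rng are hypothetical, and two details are assumptions: each center is picked at random from the points not yet within T2 of an earlier center, and canopy members are drawn only from those remaining points.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, rng=None):
    """Group points into canopies (illustrative sketch only).

    t1 is the loose distance and t2 the tight distance, with 0 < t2 < t1.
    Returns a list of (center_index, member_indices) pairs.
    """
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < T2 < T1")
    rng = rng or random.Random()
    candidates = list(range(len(points)))  # points that may still become centers
    canopies = []
    while candidates:
        # Assumption: the next center is a random remaining candidate.
        center = candidates[rng.randrange(len(candidates))]
        # Every remaining point closer than T1 joins this canopy.
        members = [i for i in candidates
                   if euclidean(points[i], points[center]) < t1]
        canopies.append((center, members))
        # Points closer than T2 are settled and cannot become centers.
        # Points between T2 and T1 stay available, so they may also
        # join later canopies.
        candidates = [i for i in candidates
                      if euclidean(points[i], points[center]) >= t2]
    return canopies


sample = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0)]
for center, members in canopy(sample, t1=2.0, t2=1.0, rng=random.Random(0)):
    print("center", sample[center], "members", [sample[i] for i in members])
```

The seeded generator in the usage lines only makes the sketch repeatable. In this sketch, a point that lies between T2 and T1 of one center remains a candidate, which is how a point can end up in more than one canopy.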

For distance measurement, the Canopy function uses Euclidean distance.
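For reference, the standard Euclidean distance between two points p = (p_1, ..., p_n) and q = (q_1, ..., q_n) with n numeric dimensions is shown below. The symbols p, q, and n are notation introduced here, not names from the function's syntax.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```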

If the function produces more than 10,000 canopy centers, it fails. In that case, run the function again with a larger value of T2 (specified by the TightDistance syntax element), which reduces the number of canopy centers.
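As a rough illustration of why a larger T2 yields fewer centers, the Python sketch below counts the centers that a simple greedy pass would choose and widens T2 until the count fits under a limit. It is a sketch under stated assumptions, not the Canopy function's logic: count_centers, widen_until_under, max_centers, and factor are hypothetical names, and points are visited in input order rather than at random.

```python
import math


def count_centers(points, t2):
    """Count centers chosen by a greedy pass over points in input order.

    A point becomes a center only if it is at least t2 away from every
    center chosen so far, so the count depends on T2 and not on T1.
    """
    centers = []
    for p in points:
        if all(math.dist(p, c) >= t2 for c in centers):
            centers.append(p)
    return len(centers)


def widen_until_under(points, t2, max_centers, factor=2.0):
    """Multiply t2 by factor until the greedy center count fits the limit."""
    while count_centers(points, t2) > max_centers:
        t2 *= factor
    return t2


# A 10 x 10 grid of points with unit spacing.
grid = [(float(x), float(y)) for x in range(10) for y in range(10)]

new_t2 = widen_until_under(grid, t2=0.5, max_centers=10)
print("T2 after widening:", new_t2)
print("centers at that T2:", count_centers(grid, new_t2))
```

If T2 is raised this way, T1 may also have to be raised so that T1 stays greater than T2.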

The canopy clustering algorithm is nondeterministic, and the randomness of the canopy assignments cannot be controlled by a seed syntax element (for more information, see Nondeterministic Results and UniqueID Syntax Element).