1.0 - 8.00 - Canopy - Teradata Vantage

Teradata® Vantage Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 1.0, 8.00
Release Date: May 2019
Content Type: Programming Reference
Publication ID: B700-4003-098K
Language: English (United States)

Canopy clustering is a simple, fast, accurate method for grouping objects into preliminary clusters. It is often performed in preparation for more rigorous clustering techniques, such as k-means clustering.

The canopy clustering algorithm uses a fast approximate distance metric and two distance thresholds, T1 (loose distance) and T2 (tight distance), where T1 is greater than T2. A point is assigned to a canopy if its distance to the canopy center is less than T2, and can be assigned to a canopy if that distance is less than T1. A point can therefore be assigned to more than one canopy.

  • For distance measurement, the Canopy function uses Euclidean distance.
  • If the function produces more than 10,000 canopy centers, it fails. In that case, run the function again with a larger value of T2 (specified by the TightDistance argument), keeping T1 greater than T2.
  • The canopy clustering algorithm is nondeterministic, and the randomness of the canopy assignments cannot be controlled by a seed argument (for more information, see Nondeterministic Results).
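
The two-threshold scheme described above can be illustrated with a minimal Python sketch of the general, textbook form of canopy clustering. It is not the Canopy function's actual implementation: the names `euclidean` and `canopy_clusters` are hypothetical, and the sketch assumes that every point within T1 of a randomly chosen center joins that canopy, while every point within T2 of the center is removed from the pool of future centers. The optional `rng` parameter exists only to make the sketch repeatable; as noted above, the Canopy function itself does not accept a seed.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clusters(points, t1, t2, rng=None):
    """Return a list of (center_index, member_indices) pairs.

    t1 is the loose distance, t2 the tight distance, with t1 > t2 > 0.
    Illustrative sketch only; not Teradata's implementation.
    """
    if not (t1 > t2 > 0):
        raise ValueError("require t1 > t2 > 0")
    rng = rng or random.Random()

    candidates = list(range(len(points)))  # points that may still become centers
    canopies = []
    while candidates:
        # Pick a random remaining point as the next canopy center.
        center = candidates[rng.randrange(len(candidates))]

        # Loose threshold: every point within t1 of the center joins the canopy,
        # so a point can end up in several canopies.
        members = [i for i in range(len(points))
                   if euclidean(points[i], points[center]) < t1]
        canopies.append((center, members))

        # Tight threshold: points within t2 of the center (including the center
        # itself) can no longer become centers of new canopies.
        candidates = [i for i in candidates
                      if euclidean(points[i], points[center]) >= t2]
    return canopies


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
    for center, members in canopy_clusters(sample, t1=3.0, t2=1.0):
        print("center", sample[center], "members", [sample[i] for i in members])
```

In this sketch, a larger `t2` removes more candidate points per iteration and so yields fewer canopy centers. The order in which centers are chosen depends on the random draws, so the canopy assignments can differ from run to run.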