Canopy (ML Engine) - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.10

1.1

Published

October 2019

Language

English (United States)

Last Update

2019-12-31

dita:id

B700-4003

Product Category

Teradata Vantage™

Canopy clustering is a simple, fast, accurate method for grouping objects into preliminary clusters. It is often performed in preparation for more rigorous clustering techniques, such as k-means clustering.

The canopy clustering algorithm uses a fast approximate distance metric and two distance thresholds, T1 (loose distance) and T2 (tight distance), where T1 is greater than T2. A point is assigned to a canopy if its distance to the canopy center is less than T2, and can be assigned to a canopy if that distance is less than T1. Because canopies can overlap, a point can be assigned to more than one canopy.
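The two-threshold idea can be illustrated with a minimal Python sketch of the general canopy clustering technique. It is not the Canopy function's actual implementation: the function name canopy_clusters, the sample points, and the seedable rng parameter are assumptions made for illustration, and the sketch assumes that canopy members are drawn from all input points.

```python
import math
import random


def canopy_clusters(points, t1, t2, rng=None):
    """Group points into possibly overlapping canopies (illustrative only).

    t1 is the loose distance and t2 the tight distance, with t1 > t2.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    rng = rng or random.Random()
    candidates = list(points)
    canopies = []
    while candidates:
        # Pick a remaining candidate at random as the next canopy center.
        center = candidates.pop(rng.randrange(len(candidates)))
        # Every input point within the loose distance joins this canopy.
        members = [p for p in points if math.dist(p, center) < t1]
        canopies.append((center, members))
        # Points within the tight distance can no longer seed a new canopy.
        candidates = [p for p in candidates if math.dist(p, center) >= t2]
    return canopies


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (5.0, 5.0), (5.2, 5.1)]
for center, members in canopy_clusters(points, t1=2.0, t2=1.0):
    print(center, members)
```

Because centers are picked at random in this sketch, repeated runs can choose different centers. A point that lies within T1 of two centers appears in both canopies.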

For distance measurement, the Canopy function uses Euclidean distance.
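For reference, the standard definition of Euclidean distance between two n-dimensional points x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is shown below. The symbols x, y, and n are notation introduced here for illustration.

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```

For example, the distance between the points (0, 0) and (3, 4) is \sqrt{3^2 + 4^2} = 5.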

If the function produces more than 10,000 canopy centers, it fails. In that case, run the function again with a larger value of T2 (specified by the TightDistance syntax element).
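The effect of T2 on the number of centers can be illustrated with a small Python sketch. It follows the general canopy technique, not the Canopy function's internals: the function name count_canopy_centers and the grid data are hypothetical, and the sketch assumes centers are chosen in input order so that the result is repeatable.

```python
import math


def count_canopy_centers(points, t2):
    """Count the centers chosen when points are scanned in input order."""
    centers = []
    for p in points:
        # A point becomes a center only if no existing center is within t2.
        if all(math.dist(p, c) >= t2 for c in centers):
            centers.append(p)
    return len(centers)


# Hypothetical data: 100 points on a 10 x 10 unit grid.
grid = [(float(x), float(y)) for x in range(10) for y in range(10)]
for t2 in (0.5, 1.5, 3.0):
    print(t2, count_canopy_centers(grid, t2))
```

In this sketch, a larger tight distance excludes more points around each center from becoming centers themselves, so the center count falls as T2 grows.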

The canopy clustering algorithm is nondeterministic, and the randomness of the canopy assignments cannot be controlled by a seed syntax element (for more information, see Nondeterministic Results and UniqueID Syntax Element).