Canopy clustering is a simple, fast, accurate method for grouping objects into preliminary clusters. It is often performed in preparation for more rigorous clustering techniques, such as k-means clustering.
The canopy clustering algorithm uses a fast approximate distance metric and two distance thresholds, T1 (loose distance) and T2 (tight distance), where T1 is greater than T2. A point is assigned to a canopy if its distance from the canopy center is less than T2, and can be assigned to a canopy if that distance is less than T1. Because the T1 test is looser than the T2 test, a point can be assigned to more than one canopy.
- For distance measurement, the Canopy function uses Euclidean distance.
- If there are more than 10,000 canopy centers, the function fails. In that case, run the function again with a larger value of T2 (specified by the TightDistance argument); a larger T2 lets each canopy absorb more points, so fewer canopy centers are created.
- The canopy clustering algorithm is nondeterministic, and the randomness of the canopy assignments cannot be controlled by a seed argument (for more information, see Nondeterministic Results).
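
The two-threshold rule described above can be illustrated with a short Python sketch. This is not the Canopy function's implementation; it follows the commonly described form of the algorithm, and the function name `canopy_clusters`, its parameters, and the optional `rng` argument are hypothetical names chosen for this example. It uses Euclidean distance, as the function does, and picks each canopy center at random from the points still eligible, which is one way to see why canopy assignments can differ from run to run.

```python
import math
import random


def canopy_clusters(points, t1, t2, rng=None):
    """Group points into canopies that may overlap.

    points: list of equal-length numeric tuples.
    t1: loose distance; t2: tight distance (t1 must exceed t2).
    rng: optional random.Random, a convenience of this sketch only.
    Returns a list of (center_index, member_indices) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    rng = rng or random.Random()

    # Indices of points that can still join a canopy or become a center.
    remaining = list(range(len(points)))
    canopies = []

    while remaining:
        # Pick an arbitrary remaining point as the next canopy center.
        center = rng.choice(remaining)

        members = []
        still_remaining = []
        for i in remaining:
            d = math.dist(points[center], points[i])  # Euclidean distance
            if d < t1:
                members.append(i)  # within the loose distance: joins canopy
            if d >= t2:
                # Outside the tight distance: stays eligible, so it may
                # join later canopies or become a center itself.
                still_remaining.append(i)

        canopies.append((center, members))
        remaining = still_remaining  # the center (d == 0) is always removed

    return canopies


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.0), (1.6, 0.0), (10.0, 10.0), (10.5, 10.0)]
    for center, members in canopy_clusters(sample, t1=2.0, t2=1.0):
        print("center", sample[center], "members", [sample[i] for i in members])
```

In this sketch a point that lies between T2 and T1 of a center stays in the candidate pool, which is how it can end up in more than one canopy. Raising `t2` removes more points per iteration and therefore produces fewer canopy centers.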