Canopy clustering is a simple, fast, accurate method for grouping objects into preliminary clusters. Each object is represented as a point in a multidimensional feature space.
The canopy clustering algorithm uses a fast approximate distance metric and two distance thresholds, T1 and T2 (T1 > T2), for processing. A point can belong to a canopy if the distance from the point to the canopy center is less than T1.
Judicious selection of canopy centers (with no canopy center less than T2 from the next) and points in a canopy enables more efficient execution of clustering algorithms, which are often called within canopies.
- For distance measurement, the Canopy function uses Euclidian distance.
- If there are more than 10,000 canopy centers, the function fails. Run the function again, increasing the value of T2 (specified by the TightDistance argument).
- More expensive distance measurements can be restricted to points inside the canopies, which can significantly reduce their number.
- The more rigorous clustering technique need perform only intra-canopy clustering, which can be parallelized.
Points that belong to different canopies do not have to be considered at the same time in this clustering process.
- Each mapper performs canopy clustering on the points in its input set and outputs its canopy centers (which are local to the mapper).
- The reducer takes all the points in each (local) canopy and calculates centroids to produce the final canopy centers.
- Final canopy centers that are too close to each other are deleted (to eliminate the effects of earlier localization).
A driver extracts information from the initial canopy-generation step and uses it in a SQL-MapReduce call that finishes the clustering process.