If requested, a Histogram analysis is performed on every numeric and date column for which a Frequency analysis was not performed. The analysis is performed either with a user specified number of equal-width bins (by default 10) between the minimum and maximum values encountered in the data, or with the same number of quantile bins, depending on option settings. There are many similarities in Histogram and Frequency processing. Histogram processing may even be thought of as Frequency processing where the data is first ‘bin-coded’. That is, numeric values are first replaced with the histogram bin or bar that they fall into.
As with a Frequency analysis, a strategy for efficiently calculating histograms or bins must be determined. One strategy is simply to calculate each histogram individually (i.e., one column at a time). The other strategy is to combine columns into an intermediate table of counts and then select individual column histogram data from the intermediate table. This can enhance performance dramatically in cases where there are not too many combinations of bins and where there are enough rows to make the effort worth while. Too many combined bin values can however lead to greatly degraded performance.
- The minimum number of rows to use the combining strategy with, by default 25000.
- The maximum number of possible combined values (in this case, bins) in combined columns, by default 10,000.
Columns of type DATE are handled by subtracting the date ‘1900-01-01’ from the date and from the minimum and maximum values, so that the calculations are based on the number of days since 1900.