TD_BinCodeFit Function

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Language: English (United States)
Last Update: 2024-04-03

TD_BinCodeFit outputs a table of bin information that you can use as input to TD_BinCodeTransform, which bin-codes the specified input table columns.

Bin Coding, also known as binning or bucketing, is a data preprocessing technique used in statistics and machine learning to transform continuous data into categorical or discrete data. In binning, a range of continuous values is divided into smaller intervals or bins, and the data values are assigned to the appropriate bin. This allows the data to be analyzed more easily by grouping similar values into categories.

Binning is used to:
  • Reduce noise in the data by aggregating values into groups.
  • Identify trends and patterns in the data by grouping similar values together.
  • Simplify complex data by reducing the number of unique values.
Common binning methods include:
  • Equal-width binning: The range of values is divided into a fixed number of intervals of equal width.
  • Equal-frequency binning: The range of values is divided into a fixed number of intervals that each contain an equal number of data points. This is useful when the distribution of the data is uneven. For example, if your data ranges from 0 to 100 and you want 5 bins, you choose the bin boundaries so that each bin contains the same number of data points, even if the bins then span different widths.
  • Manual binning: You define the intervals or bins yourself, based on domain knowledge or specific requirements. This method gives you more control over the grouping of values, but can be more subjective.
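Equal-frequency binning can be sketched in a few lines of Python. This is an illustration of the general technique only, not of how TD_BinCodeFit works; the function name equal_frequency_bins and the sample data are assumptions made for this example, and tied values are not treated specially.

```python
def equal_frequency_bins(values, n_bins):
    """Split values into n_bins groups holding (nearly) equal counts.

    Hypothetical helper for illustration; ties may straddle two bins.
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be a positive integer")
    ordered = sorted(values)
    n = len(ordered)
    groups = []
    for i in range(n_bins):
        start = i * n // n_bins        # first position of bin i
        stop = (i + 1) * n // n_bins   # one past the last position of bin i
        groups.append(ordered[start:stop])
    return groups


# Made-up, unevenly distributed data between 0 and 100.
data = [0, 1, 2, 3, 5, 8, 40, 75, 90, 100]
groups = equal_frequency_bins(data, 5)
for number, group in enumerate(groups, start=1):
    print(f"Bin {number}: {group}")
```

Each bin holds two values even though the value ranges covered by the bins differ widely, which is the point of the method for skewed data.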

Equal-width Bin Coding

The width of each bin is given by:

bin width = (max value - min value) / number of bins

where max value and min value are the maximum and minimum values of the data, respectively, and the number of bins is the desired number of intervals.
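The formula translates directly into code. The following Python sketch is illustrative only; the helper names bin_width and equal_width_edges and the sample range of 0 to 100 are assumptions, not part of TD_BinCodeFit.

```python
def bin_width(min_value, max_value, n_bins):
    """Width of each bin: (max value - min value) / number of bins."""
    if n_bins <= 0:
        raise ValueError("n_bins must be a positive integer")
    return (max_value - min_value) / n_bins


def equal_width_edges(min_value, max_value, n_bins):
    """Return the n_bins + 1 boundaries of the equal-width bins."""
    width = bin_width(min_value, max_value, n_bins)
    return [min_value + i * width for i in range(n_bins + 1)]


# Illustrative values only: data spanning 0 to 100, split into 5 bins.
width = bin_width(0, 100, 5)
edges = equal_width_edges(0, 100, 5)
print(width)
print(edges)
```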

Example:

Suppose you have the following dataset of 20 values:

5, 7, 10, 13, 18, 19, 23, 25, 26, 28, 32, 34, 35, 38, 40, 41, 45, 47, 49, 52

Divide this data into five bins of equal width. First, find the minimum value = 5 and maximum value = 52.

Calculate the bin width using the formula:

bin width = (max value - min value) / number of bins

bin width = (52 - 5) / 5 = 9.4

Because the data values are integers, you can round the width up to 10 so that the bin boundaries fall on whole numbers. With a bin width of 10, the 5 bins are:
  • Bin 1: 5-14
  • Bin 2: 15-24
  • Bin 3: 25-34
  • Bin 4: 35-44
  • Bin 5: 45-54

Assign each data point to the appropriate bin. For example, the value 18 falls into the second bin (15-24), while the value 41 falls into the fourth bin (35-44). The result is five discrete categories that you can use for further analysis or visualization.
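The worked example can be reproduced with a short Python sketch. It assumes integer data and inclusive bin ranges that start at the minimum value, as in the list of bins; the helper name bin_number is hypothetical and the code is not a description of TD_BinCodeFit internals.

```python
import math

data = [5, 7, 10, 13, 18, 19, 23, 25, 26, 28,
        32, 34, 35, 38, 40, 41, 45, 47, 49, 52]
n_bins = 5

low = min(data)                            # 5
raw_width = (max(data) - low) / n_bins     # (52 - 5) / 5 = 9.4
width = math.ceil(raw_width)               # rounded up to 10

# Inclusive integer ranges: 5-14, 15-24, 25-34, 35-44, 45-54.
ranges = [(low + i * width, low + (i + 1) * width - 1) for i in range(n_bins)]


def bin_number(value):
    """Return the 1-based bin number for an integer value."""
    return (value - low) // width + 1


# Count how many data points land in each bin.
counts = [0] * n_bins
for value in data:
    counts[bin_number(value) - 1] += 1

print(ranges)
print(bin_number(18), bin_number(41))
print(counts)
```

With these assumptions, 18 is assigned to bin 2 and 41 to bin 4, and every one of the 20 values lands in exactly one bin.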

Manual-width Bin Coding

Manual bin coding involves defining the intervals or bins yourself, based on domain knowledge or specific requirements. No formula determines the boundaries: you specify the lower and upper limit of each bin, and then assign each data value to the bin whose limits contain it.

For example, suppose you have the following dataset of 20 values:

5, 7, 10, 13, 18, 19, 23, 25, 26, 28, 32, 34, 35, 38, 40, 41, 45, 47, 49, 52

Divide this data into four bins based on domain knowledge, and bin the data as follows:
  • Bin 1: 0-10
  • Bin 2: 11-25
  • Bin 3: 26-40
  • Bin 4: 41-52

Assign each data point to the appropriate bin. For example, the value 18 falls into the second bin (11-25), while the value 41 falls into the fourth bin (41-52). The result is four discrete categories that reflect your domain knowledge.
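A minimal Python sketch of this manual assignment, using the standard-library bisect module, might look as follows. It assumes integer values and treats each bin's upper limit as inclusive; the names upper_limits and manual_bin are hypothetical, and the code does not represent how TD_BinCodeFit stores or applies bin definitions.

```python
from bisect import bisect_left

data = [5, 7, 10, 13, 18, 19, 23, 25, 26, 28,
        32, 34, 35, 38, 40, 41, 45, 47, 49, 52]

# Inclusive upper limits of bins 1-4 (0-10, 11-25, 26-40, 41-52).
lowest = 0
upper_limits = [10, 25, 40, 52]


def manual_bin(value):
    """Return the 1-based bin number for an integer value."""
    if value < lowest or value > upper_limits[-1]:
        raise ValueError(f"{value} is outside the defined bins")
    # bisect_left finds the first upper limit that is >= value.
    return bisect_left(upper_limits, value) + 1


# Count how many data points land in each bin.
counts = [0] * len(upper_limits)
for value in data:
    counts[manual_bin(value) - 1] += 1

print(manual_bin(18), manual_bin(41))
print(counts)
```

Raising an error for out-of-range values is one design choice; an alternative would be to add catch-all bins below and above the defined limits.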

When binning data, choose the number of bins and the width of each bin with care. If the bins are too narrow, the binned data can still look noisy and be difficult to interpret. If the bins are too wide, important detail can be lost. Used carefully, Bin Coding reduces noise, simplifies complex data, and helps reveal patterns while keeping the resulting data meaningful and informative.