TD_UnivariateStatistics displays descriptive statistics for each specified numeric input table column.
Univariate is a term used in statistics to describe a type of data that consists of observations on only a single characteristic or attribute, such as the salaries of workers in an industry. This type of data can be either categorical or numeric.
There are different ways to describe patterns found in univariate data, including measures of central tendency, measures of variability, and graphical representations.
Measure of Central Tendency
Measures of central tendency estimate the central location of univariate data by calculating values such as the mean, median, and mode.
- Mean: The sum of the values divided by the total number of values. The mean is sensitive to outliers, which can skew the result.
- Median: The middle value in a set of sorted data. The median is less sensitive to outliers than the mean.
- Mode: The most frequently occurring value (or values, if several are tied) in a set of data.
- Geometric Mean: Measure of central tendency that is calculated by taking the nth root of the product of n values. It is used to calculate average growth rates, ratios, and other values that involve multiplication. The formula for geometric mean is:
$$GM = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$$
where:
- ∏ is the product of the values $x_1$ through $x_n$
- $x_i$ is the i-th value in the dataset
- n is the total number of values
- Harmonic Mean: Measure of central tendency that is calculated as the reciprocal of the arithmetic mean of the reciprocals of n values. It is used to calculate average rates, such as average speed or average distance per unit of time. The general formula for calculating a harmonic mean is:
$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
where:
- n is the number of values in the dataset
- $x_i$ is the i-th value in the dataset
The weighted harmonic mean can be calculated using the following formula:
$$WHM = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{x_i}}$$
where:
- $w_i$ is the weight of the i-th value
- $x_i$ is the i-th value in the dataset
- Trimmed Mean: Method of averaging that removes a small, designated percentage of the largest and smallest values before calculating the mean. After removing the specified outlier observations, the trimmed mean is found using a standard arithmetic averaging formula.
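The measures above can be checked by hand on a small sample. The following is a minimal Python sketch using only the standard library; it is not the TD_UnivariateStatistics implementation. The `salaries` values are made up for illustration, and the helpers `trimmed_mean` and `weighted_harmonic_mean` are hypothetical names that assume the trim proportion is cut from each end of the sorted data.

```python
import math
import statistics

def trimmed_mean(values, proportion):
    """Arithmetic mean after dropping `proportion` of the values from EACH end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

def weighted_harmonic_mean(values, weights):
    """Sum of the weights divided by the sum of weight/value ratios."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

# Made-up sample with one large outlier (200).
salaries = [30, 32, 35, 35, 40, 42, 45, 50, 55, 200]

summary = {
    "mean": statistics.mean(salaries),
    "median": statistics.median(salaries),
    "mode": statistics.multimode(salaries),  # list of all tied modes
    "geometric_mean": statistics.geometric_mean(salaries),
    "harmonic_mean": statistics.harmonic_mean(salaries),
    "trimmed_mean_10pct": trimmed_mean(salaries, 0.10),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the outlier pulls the mean well above the median, while the 10% trimmed mean, which drops the smallest and largest value, stays close to the median.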
Measure of Variability
A measure of variability, or dispersion, describes how widely the values of a univariate data set are spread, and together with a measure of central tendency it gives a fuller picture of the shape of the distribution. The most frequently used measures of variability are range, variance, and standard deviation.
- Range: The difference between the maximum and minimum.
- Variance: The amount of variation, measured as the average squared deviation of the values from their mean.
- Standard deviation: The amount of variation, measured as the square root of the variance, so it is expressed in the same units as the data.
- Interquartile Range (IQR): A measure of statistical dispersion, that is, the spread of the data. The IQR is also called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data.
Interquartile range = Upper Quartile – Lower Quartile = Q3 – Q1
- Uncorrected Sum of Squares (USS): The sum of the squared data values, with no adjustment for the mean, unlike the corrected sum of squares. The formula for calculating the uncorrected sum of squares is:
$$USS = \sum_{i=1}^{n} X_i^2$$
where:
- $X_i$ represents the value of the i-th observation in the sample
- n is the number of observations in the sample
- Corrected Sum of Squares (CSS): The sum of the squared distances of the data values from the mean. The corrected sum of squares is calculated as:
$$CSS = \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2$$
where:
- $X_i$ is the value of the i-th observation in the sample
- $\bar{X}$ is the mean of the observations
- Σ is the summation operator, which means to sum the squared differences across all observations.
- Standard Error (SE): The approximate standard deviation of the sampling distribution of a statistic. The standard error of the mean describes how far the mean calculated from a sample is likely to be from the true population mean.
The standard error of the mean can be calculated as the standard deviation divided by the square root of the sample size:
$$SE = \frac{\sigma}{\sqrt{n}}$$
where:
- σ is the population standard deviation
- √n is the square root of the sample size
- Coefficient of Variation (CV): The relative dispersion of data points in a data series around the mean. It represents the ratio of the standard deviation to the mean. The formula for calculating the CV is:
$$CV = \frac{\sigma}{\mu}$$
where:
- σ is the standard deviation
- µ is the mean
- Unique Entity Count (UEC): The number of distinct entities or unique items within a given dataset. It is calculated by identifying the number of distinct entities within a dataset, regardless of how many times they appear. For example, if a dataset contains 10 transactions, but only 8 of them are unique, then the UEC for that dataset is 8.
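As an illustration of how the variability measures above relate to one another, here is a minimal Python sketch on a made-up sample. It assumes the sample (n − 1) denominator for variance and the default quartile method of Python's `statistics.quantiles`; TD_UnivariateStatistics may use different conventions, so treat the numbers as illustrative only.

```python
import math
import statistics

# Made-up sample; the mean is exactly 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = statistics.mean(values)

data_range = max(values) - min(values)          # maximum minus minimum
uss = sum(x ** 2 for x in values)               # uncorrected sum of squares
css = sum((x - mean) ** 2 for x in values)      # corrected sum of squares
variance = css / (n - 1)                        # sample variance (assumed n - 1)
std_dev = math.sqrt(variance)                   # standard deviation
std_error = std_dev / math.sqrt(n)              # standard error of the mean
cv = std_dev / mean                             # coefficient of variation

# Quartiles; other quartile conventions can give slightly different values.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1                                   # interquartile range

unique_count = len(set(values))                 # unique entity count

print(f"range={data_range} USS={uss} CSS={css} unique={unique_count}")
print(f"variance={variance:.3f} std_dev={std_dev:.3f} SE={std_error:.3f} CV={cv:.3f}")
print(f"Q1={q1} Q3={q3} IQR={iqr}")
```

The corrected and uncorrected sums of squares differ only by the mean adjustment: CSS equals USS minus n times the squared mean, which for this sample is 232 − 8 × 25 = 32.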
Graphical Representations
Graphical representations provide a visual summary of univariate data. Common graphical representations include histograms, box plots, and frequency polygons.
- Histograms: Display the distribution of a variable. They consist of a series of bars that represent the frequency of each value or range of values. Histograms can help identify the shape of the distribution, such as whether it is symmetrical or skewed.
- Box plots: Display the distribution of a variable, as well as the range, median, and quartiles. They consist of a box that represents the middle 50% of the data, with lines extending to the minimum and maximum values.
- Frequency polygons: Similar to histograms, but they display the data as a line graph rather than bars. They can help identify the shape of the distribution and provide a more continuous view of the data.
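The numbers behind these three plots can be computed without a plotting library. The sketch below is a rough illustration on made-up data: the function names `histogram_counts`, `frequency_polygon_points`, and `five_number_summary` are hypothetical, the bins are assumed to be equal-width and to start at the minimum value, and the quartiles use the default method of Python's `statistics.quantiles`.

```python
import statistics

def histogram_counts(data, bin_width):
    """Return (bin_start, count) pairs for equal-width bins starting at min(data)."""
    lowest = min(data)
    n_bins = int((max(data) - lowest) // bin_width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - lowest) // bin_width)] += 1
    return [(lowest + k * bin_width, c) for k, c in enumerate(counts)]

def frequency_polygon_points(bins, bin_width):
    """Return (bin_midpoint, count) pairs: the vertices of a frequency polygon."""
    return [(start + bin_width / 2, count) for start, count in bins]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max): the values a box plot draws."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)

# Made-up sample.
values = [2, 4, 4, 4, 5, 5, 7, 9, 12, 15]

bins = histogram_counts(values, bin_width=5)
for start, count in bins:
    # One '#' per observation gives a rough text histogram.
    print(f"[{start}, {start + 5}): {'#' * count}")

print(frequency_polygon_points(bins, bin_width=5))
print(five_number_summary(values))
```

The text histogram shows most of the values in the lowest bin with a tail to the right, which is the kind of skew a histogram or frequency polygon makes visible at a glance.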
Univariate statistics provides a fundamental understanding of the characteristics of a single variable. It is used to identify patterns, trends, and outliers in the data, and it provides a basis for further analysis.