Support
Support is a measure of the generality of an entire association rule, its antecedent or consequent, or a single item that it references.
For an entire rule, antecedent, or consequent, Support is the percentage of groups that contain all items referenced by the rule, antecedent, or consequent.
For a single item, Support is the percentage of groups that contain it.
- N is the total number of customers.
- L is the number of customers who own the set of products in the antecedent.
- R is the number of customers who own the set of products in the consequent.
- LR is the number of customers who own all products in the association rule (this notation does not mean L*R).
- Support (L) = L/N
- Support (R) = R/N
- Support (L → R) = LR/N
For example, assume there are 10 customers (N=10). Six have a checking account (L=6), five have a savings account (R=5), and four have both (LR=4). Support (L) = 6/10 = 0.6, Support (R) = 5/10 = 0.5, and Support (L → R) = 4/10 = 0.4.
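The Support calculation can be sketched in Python as follows. This is a minimal illustration, not a product API: the `groups` list is hypothetical data constructed only to match the counts in the example, "other" is a placeholder product, and the `support` helper is an assumed name.

```python
# Hypothetical item groups, one set of products per customer, constructed to
# match the counts in the example (N=10, L=6, R=5, LR=4).
groups = (
    [{"checking", "savings"}] * 4   # own both products
    + [{"checking"}] * 2            # own checking only
    + [{"savings"}] * 1             # own savings only
    + [{"other"}] * 3               # own neither (placeholder product)
)


def support(groups, items):
    """Fraction of groups that contain every item in `items`."""
    items = set(items)
    return sum(1 for g in groups if items <= g) / len(groups)


support_l = support(groups, {"checking"})                # antecedent
support_r = support(groups, {"savings"})                 # consequent
support_rule = support(groups, {"checking", "savings"})  # entire rule

print(support_l, support_r, support_rule)  # 0.6 0.5 0.4
```

The subset test `items <= g` is what makes rule Support count only the groups that contain all referenced items.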
Confidence
The Confidence of an association rule is the probability of R occurring in an item group given that L is in the item group:
Confidence (L → R) = Support (L → R) / Support (L)
Equivalently, Confidence is the percentage of groups containing L that also contain R:
Confidence (L → R) = LR/L
For example, the Confidence that checking account ownership implies savings account ownership is 4/6.
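As a quick check that the two Confidence formulas agree, the sketch below evaluates both with the example counts. The function names are illustrative assumptions, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

# Counts from the checking/savings example.
N, L, LR = 10, 6, 4


def confidence_from_counts(lr, l):
    """Confidence (L -> R) = LR / L."""
    return Fraction(lr, l)


def confidence_from_support(support_rule, support_l):
    """Confidence (L -> R) = Support (L -> R) / Support (L)."""
    return support_rule / support_l


by_counts = confidence_from_counts(LR, L)
by_support = confidence_from_support(Fraction(LR, N), Fraction(L, N))

print(by_counts, by_support, round(float(by_counts), 3))  # 2/3 2/3 0.667
```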
Expected Value
The Expected Value of an association rule is the number of customers expected to have both L and R if there is no relationship between L and R. No relationship between L and R means customers who have L are neither more nor less likely to have R than are customers who do not have L.
- E_Value (L → R) = (L*R)/N
- E_Value (L → R) = (Support (L)) * (Support (R)) * N
For example, the Expected Value for customers with both checking and savings accounts is (6*5)/10 = 3.
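The two Expected Value formulas can be compared with a short sketch. The helper names are assumptions made for illustration, and the inputs are the counts from the checking and savings example.

```python
from fractions import Fraction

# Counts from the checking/savings example.
N, L, R = 10, 6, 5


def expected_value_from_counts(l, r, n):
    """E_Value = (L * R) / N."""
    return Fraction(l * r, n)


def expected_value_from_support(support_l, support_r, n):
    """E_Value = Support (L) * Support (R) * N."""
    return support_l * support_r * n


ev_counts = expected_value_from_counts(L, R, N)
ev_support = expected_value_from_support(Fraction(L, N), Fraction(R, N), N)

print(ev_counts, ev_support)  # 3 3
```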
Expected Confidence
The Expected Confidence of an association rule is the Confidence that results if there is no relationship between L and R:
E_Confidence (L → R) = R/N
Because owning L has no effect on owning R, the Expected Confidence of the rule is also the percentage of customers who own R:
E_Confidence (L → R) = Support (R)
The Expected Confidence of the rule that having a checking account implies having a savings account is 5/10.
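One way to see why Expected Confidence reduces to Support (R) is to apply the Confidence formula to the expected count rather than the actual count. The sketch below does this for the example; the variable names are assumptions chosen for illustration.

```python
from fractions import Fraction

# Counts from the checking/savings example.
N, L, R = 10, 6, 5

# Expected number of customers with both L and R if L and R are unrelated.
expected_lr = Fraction(L * R, N)

# Confidence computed on the expected count instead of the actual count.
e_confidence = expected_lr / L

# The L terms cancel, leaving R/N, which is Support (R).
support_r = Fraction(R, N)

print(e_confidence, support_r)  # 1/2 1/2
```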
Lift
- Lift (L → R) = 1 means there are exactly as many occurrences of R as expected. The presence of L neither increases nor decreases the probability of R.
- Lift (L → R) = 5 means there are 5 times as many occurrences of R as expected. The presence of L increases the probability of R by a factor of 5.
- Lift (L → R) = 0.5 means there are half as many occurrences of R as expected. The presence of L decreases the probability of R by half.
These are equivalent formulas for Lift:
- Lift (L → R) = LR / E_Value (L → R)
- Lift (L → R) = Confidence (L → R) / E_Confidence (L → R)
- Lift (L → R) = Confidence (L → R) / Support (R)
The Lift of the rule that having a checking account implies having a savings account is 4/3.
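The 4/3 result can be reproduced with a short sketch that computes Lift two ways: as the actual count divided by the Expected Value, and as Confidence divided by Expected Confidence (which equals Support (R)). Variable names are illustrative assumptions.

```python
from fractions import Fraction

# Counts from the checking/savings example.
N, L, R, LR = 10, 6, 5, 4

confidence = Fraction(LR, L)   # LR / L
support_r = Fraction(R, N)     # also the Expected Confidence
e_value = Fraction(L * R, N)   # expected count if L and R are unrelated

lift_from_counts = LR / e_value
lift_from_confidence = confidence / support_r

print(lift_from_counts, lift_from_confidence)  # 4/3 4/3
```

A value above 1, as here, indicates that customers with a checking account are more likely than expected to have a savings account.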
Z-Score
- Z-score (L → R) = 0 means the actual and expected results are the same. The presence of L neither increases nor decreases the likelihood of owning R.
- Z-score (L → R) = 1 means the actual result is 1 standard deviation more than the expected result. The presence of L increases the likelihood of owning R.
- Z-score (L → R) = -3 means the actual result is 3 standard deviations less than the expected result. The presence of L decreases the likelihood of owning R.
A Z-score greater than 3 or less than -3 is statistically significant, which means the difference between the actual and expected result is very unlikely to be due to chance.
A Z-score helps answer the question of how confident you can be about the observed relationship between L and R, but does not directly indicate the magnitude of the relationship.
- Z-score (L → R) = (LR - E_Value (L → R)) / SQRT (E_Value (L → R) * (1 - (E_Value (L → R) / N)))
- Z-score (L → R) = (N * Support (L → R) - N * Support (L) * Support (R)) / SQRT (N * Support (L) * Support (R) * (1 - Support (L) * Support (R)))
For the checking and savings example, the mean value is E_Value (L → R) = (6*5)/10 = 3.
The actual value is LR, which is 4.
The standard deviation is SQRT (E_Value (L → R) * (1 - (E_Value (L → R) / N))) = SQRT (3 * (1 - 3/10)) = 1.449.
Therefore, the Z-score is (4 - 3) / 1.449 = 0.690.
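The worked Z-score can be checked with the sketch below, which evaluates both the count-based and the Support-based form. The function names are assumptions for illustration, and the code assumes the standard deviation is nonzero (that is, the Expected Value is neither 0 nor N).

```python
import math

# Counts from the checking/savings example.
N, L, R, LR = 10, 6, 5, 4


def z_score(n, l, r, lr):
    """(LR - E_Value) / SQRT(E_Value * (1 - E_Value / N))."""
    e_value = l * r / n
    std_dev = math.sqrt(e_value * (1 - e_value / n))
    return (lr - e_value) / std_dev


def z_score_from_support(n, s_l, s_r, s_rule):
    """Same statistic written in terms of Support values."""
    expected = s_l * s_r  # expected Support of the rule if L and R are unrelated
    return (n * s_rule - n * expected) / math.sqrt(n * expected * (1 - expected))


z = z_score(N, L, R, LR)
z_support = z_score_from_support(N, L / N, R / N, LR / N)
significant = abs(z) > 3  # threshold described in the text

print(round(z, 3), round(z_support, 3), significant)  # 0.69 0.69 False
```

With only 10 customers, the Z-score of about 0.69 is well inside the range of -3 to 3, so the observed relationship could plausibly be due to chance even though the Lift is greater than 1.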