Data Mining and Lift and Chi Squared Analysis

Data Mining

Data mining is an analytical process for exploring large data sets to detect consistent patterns or relationships between variables, and then validating those patterns by applying them to new subsets of data. The statistical measures Lift and Chi squared can be used to gauge how interesting a pattern in the data is. This is one way to engage in data mining.

Lift measures the dependency (correlation) between two items or sets of items. For example, the Lift between A and B is:

Lift(A, B) = Sup(A u B) / (Sup(A) * Sup(B))

where Sup is the support (likeliness) function: the fraction of transactions in the data set that contain the given items, which is similar to the probability of something happening for that data set. Sup(A u B) is the support of A and B occurring together.

Lift(A, B) = 1 => A and B are independent

Lift(A, B) > 1 => A and B are positively correlated

Lift(A, B) < 1 => A and B are negatively correlated
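The definition and the three cases above can be sketched in a few lines of Python. This is only an illustration: the function names `lift` and `interpret`, the tolerance `tol`, and the example supports are assumptions made for this sketch, not part of any library.

```python
def lift(sup_ab: float, sup_a: float, sup_b: float) -> float:
    """Lift(A, B) = Sup(A u B) / (Sup(A) * Sup(B))."""
    return sup_ab / (sup_a * sup_b)


def interpret(value: float, tol: float = 1e-9) -> str:
    """Map a lift value onto the three cases: = 1, > 1, < 1."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"


# Hypothetical supports: A in 50% of transactions, B in 40%, both in 30%.
value = lift(sup_ab=0.30, sup_a=0.50, sup_b=0.40)
print(round(value, 3), interpret(value))
```

The small tolerance is there because floating-point division rarely lands on exactly 1.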

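An additional measure for testing whether events are correlated is X^2, or Chi Squared:

X^2 = Σ (Observed - Expected)^2 / Expected

  • General rules

X^2 = 0 => independent

X^2 > 0 => correlated, either positively or negatively. The statistic alone does not give the direction, so an additional check is needed, such as comparing Observed with Expected or using the Kulczynski measure.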

Worked examples of Lift and Chi squared calculations follow.

Lift Analysis

                Chips    ^Chips    Total Row
Burgers           600       400         1000
^Burgers          200       200          400
Total Column      800       600         1400

Sup = Support; ^ = not (the item is absent from the transaction).

Burger = B, Chips = C.

Lift(B, C) =

Sup(B u C) / (Sup(B) * Sup(C)) =

(600/1400) / ((1000/1400) * (800/1400)) = 1.05. This indicates a positive correlation between Burger and Chips.

Lift(B, ^C) =

Sup(B u ^C) / (Sup(B) * Sup(^C)) =

(400/1400) / ((1000/1400) * (600/1400)) = 0.933 (rounded). This indicates a negative correlation between Burger and ^Chips.

Lift(^B, C) =

Sup(^B u C) / (Sup(^B) * Sup(C)) =

(200/1400) / ((400/1400) * (800/1400)) = 0.875. This indicates a negative correlation between ^Burger and Chips.

Lift(^B, ^C) =

Sup(^B u ^C) / (Sup(^B) * Sup(^C)) =

(200/1400) / ((400/1400) * (600/1400)) = 1.167 (rounded). This indicates a positive correlation between ^Burger and ^Chips.

 

                Shampoo    ^Shampoo    Total Row
Ketchup             100         200          300
^Ketchup            200         400          600
Total Column        300         600          900

K = Ketchup, S = Shampoo.

Lift(K, S) =

Sup(K u S) / (Sup(K) * Sup(S)) =

(100/900) / ((300/900) * (300/900)) = 1.0. No correlation between K and S.

Lift(K, ^S) =

Sup(K u ^S) / (Sup(K) * Sup(^S)) =

(200/900) / ((300/900) * (600/900)) = 1.0. No correlation between K and ^S.

Lift(^K, S) =

Sup(^K u S) / (Sup(^K) * Sup(S)) =

(200/900) / ((600/900) * (300/900)) = 1.0. No correlation between ^K and S.

Lift(^K, ^S) =

Sup(^K u ^S) / (Sup(^K) * Sup(^S)) =

(400/900) / ((600/900) * (600/900)) = 1.0. No correlation between ^K and ^S.
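The four Lift values for each table can also be computed straight from the cell counts. The sketch below is one possible way to do it, assuming a hypothetical helper named `lifts_2x2` and the generic labels A and B for the row and column items. It uses exact fractions so that the independent case comes out as exactly 1.

```python
from fractions import Fraction


def lifts_2x2(both, a_only, b_only, neither):
    """Return the lift of each of the four cells of a 2x2 table.

    both    = transactions with A and B
    a_only  = transactions with A but not B
    b_only  = transactions with B but not A
    neither = transactions with neither item
    """
    n = both + a_only + b_only + neither
    row = {"A": both + a_only, "^A": b_only + neither}
    col = {"B": both + b_only, "^B": a_only + neither}
    cells = {("A", "B"): both, ("A", "^B"): a_only,
             ("^A", "B"): b_only, ("^A", "^B"): neither}
    # Sup(x u y) / (Sup(x) * Sup(y)) simplifies to count * n / (row * col).
    return {k: Fraction(c * n, row[k[0]] * col[k[1]]) for k, c in cells.items()}


burgers_chips = lifts_2x2(600, 400, 200, 200)
ketchup_shampoo = lifts_2x2(100, 200, 200, 400)

for name, result in [("Burgers/Chips", burgers_chips),
                     ("Ketchup/Shampoo", ketchup_shampoo)]:
    for (a, b), value in result.items():
        print(f"{name}: Lift({a}, {b}) = {float(value):.3f}")
```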

 

Chi Squared Analysis

                Chips        ^Chips       Total Row
Burgers         900 (800)    100 (200)    1000
^Burgers        300 (400)    200 (100)     500
Total Column    1200         300          1500

X^2 = Chi Squared.

X^2 = Σ (Observed – Expected)^2/Expected

^2 = Power of 2.

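O = Observed; E = Expected. The Expected values are shown in brackets in the table and are computed as (row total * column total) / grand total. For example, E(B, C) = 1000 * 1200 / 1500 = 800.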

B = Burger; C = Chips.

X^2(B,C) = (900 - 800)^2/800 = 12.5. As Observed > Expected, we have a positive correlation between B and C.

X^2(B,^C) = (100 - 200)^2/200 = 50.0. As Observed < Expected, we have a negative correlation between B and ^C.

X^2(^B,C) = (300 - 400)^2/400 = 25.0. As Observed < Expected, we have a negative correlation between ^B and C.

X^2(^B,^C) = (200 - 100)^2/100 = 100.0. As Observed > Expected, we have a positive correlation between ^B and ^C.

The Chi Squared result is the sum of the above 4 values: 12.5 + 50 + 25 + 100 = 187.5. As 187.5 is well above 0 and Observed > Expected for B and C together, we have a positive correlation between B and C.

 

                Sausages     ^Sausages    Total Row
Burgers         800 (800)    200 (200)    1000
^Burgers        400 (400)    100 (100)     500
Total Column    1200         300          1500

B = Burger; S = Sausages.

X^2(B,S) = (800 - 800)^2/800 = 0. No correlation between B and S; they are independent of each other.

X^2(B,^S) = (200 - 200)^2/200 = 0. No correlation between B and ^S; they are independent of each other.

X^2(^B,S) = (400 - 400)^2/400 = 0. No correlation between ^B and S; they are independent of each other.

X^2(^B,^S) = (100 - 100)^2/100 = 0. No correlation between ^B and ^S; they are independent of each other.

The Chi Squared result is the sum of the above 4 values: 0 + 0 + 0 + 0 = 0. As the result is 0, we have independence between B and S.
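Both Chi Squared examples can be checked with a short Python sketch. The helper name `chi_squared_2x2` is an assumption made for this illustration. It derives the Expected counts from the row and column totals, which should match the bracketed values in the tables, and then sums the per-cell terms.

```python
def chi_squared_2x2(observed):
    """Chi squared for a 2x2 table given as [[a, b], [c, d]].

    Returns (statistic, expected counts, per-cell terms).
    """
    row_totals = [sum(r) for r in observed]
    col_totals = [sum(c) for c in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    terms = [[(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
             for o_row, e_row in zip(observed, expected)]
    return sum(map(sum, terms)), expected, terms


# Burgers/Chips table, then Burgers/Sausages table.
chips_stat, chips_expected, chips_terms = chi_squared_2x2([[900, 100], [300, 200]])
sausage_stat, sausage_expected, _ = chi_squared_2x2([[800, 200], [400, 100]])

print(chips_expected, chips_terms, chips_stat)
print(sausage_expected, sausage_stat)
```

Comparing each observed cell with its expected count then gives the direction of the correlation, as in the worked examples.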

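Lift and X^2 are inadequate measures when the data set contains a sizeable number of null transactions (transactions that contain neither item), because both values change with the number of such transactions.

The Kulczynski measure addresses this, since it is not affected by null transactions.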
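As a rough illustration of the point about null transactions, the sketch below uses the commonly cited definition Kulc(A, B) = 0.5 * (P(A|B) + P(B|A)). That definition is not given in this post, so treat it as an assumption here. The function names and the 100,000 extra null transactions are hypothetical, and the other counts come from the Burgers/Chips lift example.

```python
def lift(both, a_only, b_only, neither):
    """Lift from 2x2 cell counts: count * n / (row total * column total)."""
    n = both + a_only + b_only + neither
    return both * n / ((both + a_only) * (both + b_only))


def kulczynski(both, a_only, b_only, neither=0):
    """Kulc(A, B) = 0.5 * (P(A|B) + P(B|A)).

    `neither` is accepted for symmetry with lift() but never used,
    which is why null transactions cannot change the result.
    """
    return 0.5 * (both / (both + b_only) + both / (both + a_only))


# Original table (200 null transactions), then the same table with
# 100,000 hypothetical null transactions added.
for neither in (200, 100_200):
    print(neither,
          round(lift(600, 400, 200, neither), 3),
          round(kulczynski(600, 400, 200, neither), 3))
```

Lift moves a long way when only the count of null transactions changes, while the Kulczynski value stays the same.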

 
