**Data Mining**

Data mining is an analytical process for exploring large data sets to detect consistent patterns or relationships between variables, and then validating those findings by applying the detected patterns to new subsets of data. Statistical measures such as Lift and Chi squared can be used to gauge the level of interestingness of a pattern in big data. This is one way to engage in data mining.

Lift measures the dependency (correlation) between two sets of items. For example, the Lift between A and B is:

Lift(A, B) = Sup(A ∪ B) / (Sup(A) * Sup(B))

where Sup is the support function: the fraction of records in a given data set that contain the items in question, which is similar to the probability of something happening. Sup(A ∪ B) is the support of records that contain both A and B.

If Lift(A, B) = 1 => A and B are independent

If Lift(A, B) > 1 => A and B are positively correlated

If Lift(A, B) < 1 => A and B are negatively correlated
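As a minimal sketch of these definitions, the following Python computes support and Lift directly from a list of transactions. The toy transactions and the function names `support`, `lift`, and `interpret` are hypothetical and chosen only for illustration. The sketch assumes each item of interest occurs in at least one transaction, otherwise the division fails.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def lift(transactions, a, b):
    """Lift(A, B) = Sup(A u B) / (Sup(A) * Sup(B))."""
    joint = support(transactions, set(a) | set(b))
    return joint / (support(transactions, a) * support(transactions, b))


def interpret(value, tol=1e-9):
    """Map a Lift value onto the three rules: = 1, > 1, < 1."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"


# Hypothetical toy data: each set is one transaction.
transactions = [
    {"burger", "chips"},
    {"burger", "chips"},
    {"burger"},
    {"chips"},
    {"salad"},
]

value = lift(transactions, {"burger"}, {"chips"})
print(value, interpret(value))
```

The small tolerance in `interpret` is a design choice to avoid treating floating-point rounding noise as a correlation.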

An additional measure for testing whether events are correlated is X^2, or Chi Squared:

X^2 = Σ (Observed – Expected)^2 / Expected

General rules:

X^2 = 0 => independent

X^2 > 0 => correlated, either positively or negatively, so an additional test such as Kulczynski is needed.

Worked examples of Lift and Chi squared calculations follow.

**Lift Analysis**

| | Chips | ^Chips | Total Row |
|---|---|---|---|
| Burgers | 600 | 400 | 1000 |
| ^Burgers | 200 | 200 | 400 |
| Total Column | 800 | 600 | 1400 |

Sup = Support. The prefix ^ denotes absence (for example, ^Chips means no Chips).

Burger = B, Chips = C.

Lift(B, C) =

Sup(B ∪ C)/(Sup(B)*Sup(C)) =

(600/1400)/((1000/1400)*(800/1400)) = 1.05. This indicates a positive correlation between Burger and Chips.

Lift(B, ^C) =

Sup(B ∪ ^C)/(Sup(B)*Sup(^C)) =

(400/1400)/((1000/1400)*(600/1400)) ≈ 0.933. This indicates a negative correlation between Burger and ^Chips.

Lift(^B, C) =

Sup(^B ∪ C)/(Sup(^B)*Sup(C)) =

(200/1400)/((400/1400)*(800/1400)) = 0.875. This indicates a negative correlation between ^Burger and Chips.

Lift(^B, ^C) =

Sup(^B ∪ ^C)/(Sup(^B)*Sup(^C)) =

(200/1400)/((400/1400)*(600/1400)) ≈ 1.167. This indicates a positive correlation between ^Burger and ^Chips.
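The four Burger and Chips calculations can be reproduced from the contingency table alone. The sketch below is one possible way to do it in Python. The function name `lift_2x2` and the nested-dictionary layout are assumptions made for illustration, not part of the original analysis.

```python
def lift_2x2(table):
    """Return Lift for each cell of a contingency table of observed counts.

    `table[r][c]` is the count for row outcome r and column outcome c.
    """
    total = sum(sum(cells.values()) for cells in table.values())
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for c, n in cells.items():
            col_totals[c] = col_totals.get(c, 0) + n

    lifts = {}
    for r, cells in table.items():
        for c, n in cells.items():
            # Sup(r u c) / (Sup(r) * Sup(c)), with each support = count / total
            lifts[(r, c)] = (n / total) / (
                (row_totals[r] / total) * (col_totals[c] / total)
            )
    return lifts


# Counts taken from the Burgers/Chips table.
burgers_chips = {
    "B": {"C": 600, "^C": 400},
    "^B": {"C": 200, "^C": 200},
}

for cell, value in lift_2x2(burgers_chips).items():
    print(cell, round(value, 3))
```

Deriving the row and column totals inside the function means only the four observed cell counts have to be typed in, which removes one source of transcription error.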

| | Shampoo | ^Shampoo | Total Row |
|---|---|---|---|
| Ketchup | 100 | 200 | 300 |
| ^Ketchup | 200 | 400 | 600 |
| Total Column | 300 | 600 | 900 |

K = Ketchup, S = Shampoo.

Lift(K, S) =

Sup(K ∪ S)/(Sup(K)*Sup(S)) =

(100/900)/((300/900)*(300/900)) = 1.0. No correlation between K and S.

Lift(K, ^S) =

Sup(K ∪ ^S)/(Sup(K)*Sup(^S)) =

(200/900)/((300/900)*(600/900)) = 1.0. No correlation between K and ^S.

Lift(^K, S) =

Sup(^K ∪ S)/(Sup(^K)*Sup(S)) =

(200/900)/((600/900)*(300/900)) = 1.0. No correlation between ^K and S.

Lift(^K, ^S) =

Sup(^K ∪ ^S)/(Sup(^K)*Sup(^S)) =

(400/900)/((600/900)*(600/900)) = 1.0. No correlation between ^K and ^S.

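**Chi Squared Analysis**

The table shows observed counts, with expected counts in parentheses. Each expected count is (row total * column total) / grand total; for example, the expected count for Burgers and Chips is (1000 * 1200) / 1500 = 800.

| | Chips | ^Chips | Total Row |
|---|---|---|---|
| Burgers | 900 (800) | 100 (200) | 1000 |
| ^Burgers | 300 (400) | 200 (100) | 500 |
| Total Column | 1200 | 300 | 1500 |

X^2 = Chi Squared.

X^2 = Σ (Observed – Expected)^2/Expected

^2 = Power of 2.

O = Observed; E = Expected.

B = Burger; C = Chips.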

X^2(B,C) = (900 – 800)^2/800 = 12.5. As Observed > Expected, we have a positive correlation between B and C.

X^2(B,^C) = (100 – 200)^2/200 = 50.0. As Observed < Expected, we have a negative correlation between B and ^C.

X^2(^B,C) = (300 – 400)^2/400 = 25.0. As Observed < Expected, we have a negative correlation between ^B and C.

X^2(^B,^C) = (200 – 100)^2/100 = 100.0. As Observed > Expected, we have a positive correlation between ^B and ^C.

The Chi Squared result is the sum of the above 4 values: 12.5 + 50 + 25 + 100 = 187.5. As 187.5 is greater than 0, B and C are not independent, and as Observed > Expected for the (B, C) cell, the correlation between B and C is positive.
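The per-cell calculation can also be scripted. The following Python is a minimal sketch that derives each expected count from the row and column totals (row total * column total / grand total, which matches the parenthesised values in the table) and then sums the contributions. The function name `chi_squared` and the list-of-rows layout are illustrative assumptions.

```python
def chi_squared(observed):
    """Return (per-cell contributions, total X^2) for a table of observed counts.

    `observed` is a list of rows, each a list of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    contributions = []
    for i, row in enumerate(observed):
        out_row = []
        for j, obs in enumerate(row):
            # Expected count if the row and column outcomes were independent.
            exp = row_totals[i] * col_totals[j] / grand_total
            out_row.append((obs - exp) ** 2 / exp)
        contributions.append(out_row)

    total = sum(sum(r) for r in contributions)
    return contributions, total


# Observed counts from the Burgers/Chips table: rows B, ^B; columns C, ^C.
observed = [[900, 100], [300, 200]]
cells, x2 = chi_squared(observed)
print(cells)
print(x2)
```

The statistic alone does not give the direction of the correlation; as in the worked example, that comes from comparing the observed and expected counts in each cell.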

| | Sausages | ^Sausages | Total Row |
|---|---|---|---|
| Burgers | 800 (800) | 200 (200) | 1000 |
| ^Burgers | 400 (400) | 100 (100) | 500 |
| Total Column | 1200 | 300 | 1500 |

B = Burger; S = Sausages.

X^2(B,S) = (800 – 800)^2/800 = 0. No correlation between B and S; they are independent of each other.

X^2(B,^S) = (200 – 200)^2/200 = 0. No correlation between B and ^S; they are independent of each other.

X^2(^B,S) = (400 – 400)^2/400 = 0. No correlation between ^B and S; they are independent of each other.

X^2(^B,^S) = (100 – 100)^2/100 = 0. No correlation between ^B and ^S; they are independent of each other.

The Chi Squared result is the sum of the above 4 values: 0 + 0 + 0 + 0 = 0. As the result is 0, B and S are independent.

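Lift and X^2 are inadequate measures when the data set contains a sizeable number of null transactions, that is, transactions containing neither of the items being tested, because both values change with the number of such transactions.

The Kulczynski measure, which is not affected by null transactions, rectifies this.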
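The text above does not define the Kulczynski measure. The sketch below assumes the definition commonly given in data mining textbooks, Kulc(A, B) = (P(A|B) + P(B|A)) / 2, and uses hypothetical function names. It illustrates why null transactions matter: padding the Burgers/Chips data from the Lift example with transactions that contain neither item changes Lift, but leaves the Kulczynski value untouched.

```python
def lift(n_ab, n_a, n_b, n_total):
    """Lift from counts: Sup(A u B) / (Sup(A) * Sup(B))."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))


def kulczynski(n_ab, n_a, n_b):
    """Assumed definition: the average of P(A|B) and P(B|A).

    The total number of transactions does not appear, so null
    transactions (containing neither A nor B) cannot affect the result.
    """
    return 0.5 * (n_ab / n_b + n_ab / n_a)


# Burgers/Chips counts from the Lift example: 600 with both,
# 1000 with Burgers, 800 with Chips, 1400 transactions in total.
n_ab, n_a, n_b, n_total = 600, 1000, 800, 1400

# Hypothetical: add 12600 null transactions (neither Burgers nor Chips).
padded_total = n_total + 12600

print(lift(n_ab, n_a, n_b, n_total), lift(n_ab, n_a, n_b, padded_total))
print(kulczynski(n_ab, n_a, n_b))
```

Under this assumed definition the value lies between 0 and 1, with higher values indicating a closer association between the two items.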