Association Rules
Data Warehousing and Data Mining
Instructor: Μαρία Χαλκίδη (Maria Halkidi)
Based on slides from J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd edition
What is association mining?
- Association rule mining: finding frequent patterns, associations, or correlations among sets of items in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples
  - Rule form: Body → Head [support, confidence]
  - buys(x, "pizza") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
Association rules: basic concepts
- Given: (1) a database of transactions; (2) each transaction is a list of items (the purchases of a customer during one visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
- Applications
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
  - * → Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
  - Home Electronics → * (what other products should the store stock up on?)
  - Attached mailing in direct marketing
  - Detecting "ping-ponging" of patients: transaction = patient; item = doctor/clinic visited by the patient; support of a rule = number of common patients
Basic concepts: frequent patterns and association rules

Transaction-id | Items bought
10             | A, B, D
20             | A, C, D
30             | A, D, E
40             | B, E, F
50             | B, C, D, E, F

- Itemset X = {x1, ..., xk}
- Find all rules X → Y with minimum support and confidence
  - support, s: the probability that a transaction contains X ∪ Y
  - confidence, c: the conditional probability that a transaction containing X also contains Y
- (Venn-diagram illustration: customers who buy beer, customers who buy pizza, and customers who buy both)
- Let sup_min = 50%, conf_min = 50%
- Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
- Association rules: A → D (60%, 100%), D → A (60%, 75%)
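To make the definitions concrete, here is a minimal Python sketch (not part of the original slides; the helper names are our own) that computes the support and confidence of A → D over the five transactions above:

```python
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    lhs, both = set(lhs), set(lhs) | set(rhs)
    covered = [t for t in transactions if lhs <= t]
    return sum(both <= t for t in covered) / len(covered)

print(support({"A", "D"}, transactions))       # 0.6  -> A → D has 60% support
print(confidence({"A"}, {"D"}, transactions))  # 1.0  -> A → D has 100% confidence
print(confidence({"D"}, {"A"}, transactions))  # 0.75 -> D → A has 75% confidence
```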
Problem definition
The association rule mining problem
1. Find all itemsets that have at least the minimum support (the frequent itemsets)
2. Use the frequent itemsets to generate the desired rules
Association rules: example

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Min. support 50%, min. confidence 50%

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For the rule A → C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent
Example
Mining frequent itemsets: the key step
- Frequent itemsets: sets of items that have at least the minimum support
- Any subset of a frequent itemset must also be a frequent itemset
  - E.g., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find the frequent itemsets of cardinality 1 up to k (k-itemsets)
- Use the frequent itemsets to generate association rules
The Apriori Algorithm
- Join step: C_k is generated by joining L_(k-1) with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:
    C_k: candidate itemsets of size k
    L_k: frequent itemsets of size k
    L_1 = {frequent items};
    for (k = 1; L_k ≠ ∅; k++) do begin
        C_(k+1) = candidates generated from L_k;
        for each transaction t in the database do
            increment the count of all candidates in C_(k+1) that are contained in t;
        L_(k+1) = candidates in C_(k+1) with min_support;
    end
    return ⋃_k L_k;
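The pseudo-code translates fairly directly into Python. The following is an illustrative sketch rather than the textbook's implementation; the function name `apriori` and the use of absolute support counts are our own choices:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count.

    transactions: iterable of item collections; min_support: absolute count.
    """
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep those meeting min_support
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-candidates
        candidates = set()
        items = list(current)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                union = items[i] | items[j]
                # Prune step: every k-subset of a candidate must be frequent
                if len(union) == k + 1 and all(
                        frozenset(sub) in current
                        for sub in combinations(union, k)):
                    candidates.add(union)
        # One scan of the database counts all surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# The 4-transaction database used in the worked example below:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, min_support=2))  # includes frozenset({2, 3, 5}): 2
```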
Example: generating candidate frequent itemsets
- L_3 = {abc, abd, acd, ace, bcd}
- Self-joining: L_3 * L_3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L_3
- C_4 = {abcd}
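A minimal sketch of just this join-and-prune step (our own illustrative code, assuming each itemset is kept as a sorted tuple):

```python
from itertools import combinations

def generate_candidates(L_k):
    """Apriori-gen: self-join the frequent k-itemsets, then prune.

    L_k: set of sorted k-tuples. Returns the candidate (k+1)-itemsets.
    """
    k = len(next(iter(L_k)))
    candidates = set()
    for a in L_k:
        for b in L_k:
            # Join itemsets that agree on their first k-1 items
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune: every k-subset of the candidate must be frequent
                if all(sub in L_k for sub in combinations(c, k)):
                    candidates.add(c)
    return candidates

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(generate_candidates(L3))  # {('a', 'b', 'c', 'd')}: acde is pruned
```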
The Apriori algorithm: an example (minimum support: 2 transactions)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C_1 (candidate 1-itemsets with counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L_1 (discard {4}, support < 2):
{1}: 2, {2}: 3, {3}: 3, {5}: 3

C_2 (from L_1 join L_1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C_2 with counts:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L_2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C_3: {2 3 5}

Scan D → L_3: {2 3 5}: 2
Generating association rules from frequent itemsets
- For each frequent itemset l, generate all non-empty subsets of l
- For each non-empty subset s of l, output the rule s → (l − s) if
  support_count(l) / support_count(s) ≥ min_conf
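A minimal illustrative sketch of this loop (our own code; `frequent` is assumed to map each frequent itemset to its support count, e.g. the output of the Apriori sketch earlier):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Yield (lhs, rhs, confidence) for every rule meeting min_conf.

    frequent: dict mapping frozenset itemsets to support counts.
    """
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset s yields a candidate rule s -> l - s
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[lhs]
                if conf >= min_conf:
                    yield lhs, itemset - lhs, conf

frequent = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
for lhs, rhs, conf in generate_rules(frequent, min_conf=0.5):
    print(set(lhs), "->", set(rhs), f"{conf:.0%}")
# {'A'} -> {'D'} 100%   and   {'D'} -> {'A'} 75%
```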
Mining multiple-level association rules
- Items often form hierarchies
- It is difficult to find strong associations among data items at a low level of abstraction
- Strong associations discovered at high concept levels may represent common-sense knowledge
- Flexible support settings: items at lower levels are expected to have lower support
- Exploration of shared multi-level mining
- Uniform support:
    Level 1 (min_sup = 5%): computer [support = 10%]
    Level 2 (min_sup = 5%): laptop [support = 6%], desktop [support = 4%]
- Reduced support:
    Level 1 (min_sup = 5%)
    Level 2 (min_sup = 3%)
Multiple-level association rules: search strategies (1)
- Level-by-level: full-breadth search; no background knowledge of frequent itemsets is used for pruning
- Level-cross filtering by single item: an item at the i-th level is examined if and only if its parent node at the (i-1)-th level is frequent
- Example:
    Level 1 (min_sup = 12%): computer [support = 10%]  (not frequent)
    Level 2 (min_sup = 3%): laptop [support = 6%] (not examined), desktop [support = 4%] (not examined)
Multiple-level association rules: search strategies (2)
- Level-cross filtering by k-itemset: a k-itemset at the i-th level is examined only if the corresponding parent k-itemset at the (i-1)-th level is frequent
- Example:
    min_sup = 5%: {computer, printer} [support = 7%]  (frequent)
    min_sup = 2%: {laptop computer, b/w printer} [support = 1%], {laptop computer, color printer} [support = 2%], {desktop computer, b/w printer} [support = 1%], {desktop computer, color printer} [support = 3%]
Multi-level association: redundancy filtering
- Some rules may be redundant due to ancestor relationships between items.
- Example:
    computer → printer [support = 8%, confidence = 70%]
    desktop → printer [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the expected value based on the rule's ancestor. E.g., if about one quarter of computer sales are desktops, the expected support of the second rule is 8% × 1/4 = 2%; since this matches its actual support (and the confidences are similar), the second rule carries no extra information.
Mining multi-dimensional associations
- Single-dimensional rules: buys(x, "milk") → buys(x, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(x, "19-25") ∧ occupation(x, "student") → buys(x, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(x, "19-25") ∧ buys(x, "popcorn") → buys(x, "coke")
- Categorical attributes: finite number of possible values, no ordering among values → data cube approach
- Quantitative attributes: numeric, implicit ordering among values → discretization, clustering
Mining quantitative associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on the data distribution
3. Clustering: distance-based association (one-dimensional clustering, then association)
4. Deviation analysis: e.g., sex = female → wage: mean = $7/hr (overall mean = $9/hr)
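As an illustration of approach 1 (static discretization), the sketch below maps numeric attributes to interval items so that a standard frequent-itemset miner can then be applied; the bin boundaries and attribute names are invented for the example:

```python
def discretize(record, bins):
    """Turn a record with numeric fields into interval items like 'age=30-39'."""
    items = set()
    for attr, value in record.items():
        if attr in bins:
            for low, high in bins[attr]:
                if low <= value <= high:
                    items.add(f"{attr}={low}-{high}")
                    break
        else:
            items.add(f"{attr}={value}")  # categorical attribute: keep as-is
    return items

bins = {"age": [(20, 29), (30, 39), (40, 49)],
        "income": [(0, 30000), (30001, 40000), (40001, 60000)]}
record = {"age": 34, "income": 35000, "buys": "high resolution TV"}
print(discretize(record, bins))
# {'age=30-39', 'income=30001-40000', 'buys=high resolution TV'}
# The resulting item sets can be fed to any frequent-itemset miner, e.g. Apriori.
```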
Quantitative association rules
- Numeric attributes are dynamically discretized so that the confidence or compactness of the mined rules is maximized
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 → A_cat
    age(x, "34") ∧ income(x, "31K-40K") → buys(x, "high resolution TV")
- Cluster adjacent association rules to form more general rules using a 2-D grid
- Example: age(x, "34-35") ∧ income(x, "30-50K") → buys(x, "high resolution TV")
Mining associations and correlations
From association mining to correlation analysis
Criticism of confidence and support (I)
- Let a rule R: A + B → G, with confidence = 85% and support = 90%.
- Support(R) is high → R looks like a significant rule.
- However, if the RHS (G) by itself covers 90% of the studied data, a high proportion of the data contains G, so there is a high probability that the RHS (G) is satisfied regardless of the LHS.
- R is satisfied by a high percentage of the data under consideration simply because its RHS is highly supported.
- Hence R may not be meaningful for making decisions or for extracting a general rule about the behaviour of the data.
Criticism of support and confidence: example 1 (Aggarwal & Yu, PODS '98)
- Among 5000 students:
    3000 play basketball
    3750 eat cereal
    2000 both play basketball and eat cereal
- play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence

           | basketball | not basketball | sum (row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum (col.) | 3000       | 2000           | 5000
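A quick check of these numbers (a small sketch; the counts come straight from the table above):

```python
n = 5000
basketball, cereal, both = 3000, 3750, 2000

support = both / n              # 0.4:   40% of students do both
confidence = both / basketball  # 0.667: P(cereal | basketball)
base_rate = cereal / n          # 0.75:  P(cereal) over all students

# The rule "basketball -> cereal" has lower confidence than the base rate,
# so playing basketball actually makes eating cereal slightly *less* likely.
print(support, confidence, base_rate)
```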
Criticism of support and confidence (cont.)
- Example 2: X and Y are positively correlated, X and Z are negatively correlated, yet the support and confidence of X → Z dominate
- We need a measure of dependent or correlated events
- Interest (correlation, lift) takes both P(A) and P(B) into consideration:
    interest(A, B) = P(A ∪ B) / (P(A) · P(B))
- P(A ∪ B) = P(A) · P(B) if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

Itemset | Support | Interest
X, Y    | 25%     | 2
X, Z    | 37.50%  | 0.9
Y, Z    | 12.50%  | 0.57

X | 1 1 1 1 0 0 0 0
Y | 1 1 0 0 0 0 0 0
Z | 0 1 1 1 1 1 1 1
Interestingness measure: correlations (lift)
- Measure of dependent/correlated events:
    lift(A, B) = P(A ∪ B) / (P(A) · P(B))

           | basketball | not basketball | sum (row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum (col.) | 3000       | 2000           | 5000

lift(B, C)  = (2000/5000) / ((3000/5000) · (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) · (1250/5000)) = 1.33
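The same calculation as a small Python sketch (the helper name `lift` is ours):

```python
def lift(p_ab, p_a, p_b):
    """lift(A, B) = P(A ∪ B) / (P(A) * P(B)); a value of 1 means independence."""
    return p_ab / (p_a * p_b)

n = 5000
print(lift(2000 / n, 3000 / n, 3750 / n))  # ≈ 0.89: basketball and cereal
                                           # are negatively correlated
print(lift(1000 / n, 3000 / n, 1250 / n))  # ≈ 1.33: basketball and "not cereal"
                                           # are positively correlated
```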
Lift (I)
- The lift of an association rule is the confidence divided by the proportion of all cases covered by the RHS:
    lift = confidence / P(RHS)
- It is a measure of the importance of the association. For the values of lift, several cases should be considered:
  - If lift ≈ 1, then RHS and LHS are independent, which indicates that the rule is not important.
  - If lift → +∞, we have the following sub-cases:
      If RHS ⊆ LHS or LHS ⊆ RHS, then the rule is not important.
      If P(RHS) → 0, then the rule is not important.
      If P(RHS | LHS) → 1, then the rule is interesting.
  - If lift = 0, then P(RHS and LHS) = 0 and hence P(RHS | LHS) = 0, which indicates that the rule is not important.
Lift (II)
- Lift gives an indication of a rule's significance, i.e. how interesting the rule is.
- It represents the predictive advantage a rule offers over simply guessing based on the frequency of the rule's consequent (RHS).
- It indicates whether a rule could be considered representative of the data, so that it can be used in the decision-making process.
Leverage
- The leverage of an association rule is the proportion of additional cases covered by both the LHS and the RHS beyond those expected if the LHS and RHS were independent:
    leverage = P(RHS and LHS) − P(LHS) · P(RHS)
- Leverage takes values in [-1, 1]:
  - if leverage ≤ 0, then there is strong independence between LHS and RHS
  - values approaching 1 indicate an important association rule
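A one-line helper in the same spirit as the lift sketch above (naming is ours):

```python
def leverage(p_ab, p_a, p_b):
    """leverage(A, B) = P(A and B) - P(A) * P(B); 0 means independence."""
    return p_ab - p_a * p_b

# Positive leverage: the two sides co-occur more often than chance predicts.
print(round(leverage(0.05, 0.2, 0.1), 2))  # 0.03 (the milk example below)
```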
Example (III): for the rule milk → morning
- Lift: P(RHS) = 100/1000 = 0.1, confidence = 0.25, so lift = 0.25/0.1 = 2.5
- Leverage: P(LHS and RHS) = 50/1000 = 0.05
- The proportion of cases expected to be covered by both LHS and RHS if they were independent is P(LHS) · P(RHS) = (200/1000) · (100/1000) = 0.02
- Leverage = 0.05 − 0.02 = 0.03

           | morning | evening | sum (row)
milk       | 50      | 150     | 200
other      | 50      | 750     | 800
sum (col.) | 100     | 900     | 1000
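A quick numeric check of the example (counts taken from the table; the rule being evaluated is milk → morning):

```python
n = 1000
milk, morning, both = 200, 100, 50  # marginal and joint counts from the table

confidence = both / milk                          # 0.25: P(morning | milk)
lift = confidence / (morning / n)                 # 0.25 / 0.1 = 2.5
leverage = both / n - (milk / n) * (morning / n)  # 0.05 - 0.02 = 0.03

print(confidence, lift, round(leverage, 2))       # 0.25 2.5 0.03
```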