Association rules

Transcript:

Association rules
Data Warehouses and Data Mining
Instructor: Maria Halkidi, based on slides from J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd edition

What is association mining?
Association rule mining: finding frequent patterns or associations among sets of items in transaction databases, relational databases, and other information repositories.
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples
Rule form: Body => Head [support, confidence]
buys(x, "pizza") => buys(x, "beers") [0.5%, 60%]
major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]

Association rules: basic concepts
Given: (1) a database of transactions, (2) each transaction is a list of items (the purchases of one customer in one visit).
Find: all rules that correlate the presence of one set of items with the presence of another set of items.
E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
Applications
* => Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
Home Electronics => * (what other products should the store stock up on?)
Attached mailing in direct marketing
Detecting "ping-pong"ing of patients. Transaction: patient. Item: doctor/clinic visited by a patient. Support of a rule: number of common patients.

Basic concepts: frequent patterns and association rules
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F
Itemset X = {x_1, ..., x_k}
Find all rules X => Y with minimum support and confidence:
support, s: the probability that a transaction contains X ∪ Y
confidence, c: the conditional probability that a transaction containing X also contains Y
[Figure: Venn diagram of "Customer buys beer", "Customer buys pizza", and their overlap "Customer buys both"]
Let sup_min = 50%, conf_min = 50%.
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A => D (60%, 100%)
D => A (60%, 75%)
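The two definitions above can be checked directly on the five-transaction table. The following is a minimal sketch, not part of the original slides; the helper names `support` and `confidence` are illustrative, and items are assumed to be single-character labels so that a string such as "AD" can stand for the itemset {A, D}.

```python
from fractions import Fraction

# Transaction database from the slide: transaction id -> items bought.
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)  # "AD" becomes {"A", "D"} (single-character items assumed)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print("support(A, D)    =", float(support("AD")))
print("confidence(A=>D) =", float(confidence("A", "D")))
print("confidence(D=>A) =", float(confidence("D", "A")))
```

Exact fractions are used so that the results can be compared with the percentages on the slide without rounding noise.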

Problem definition

The problem of finding association rules
Find all itemsets that have minimum support (the frequent itemsets).
Use the frequent itemsets to generate the desired rules.

Association rules: example
Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
Min. support 50%, min. confidence 50%
Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%
For rule A => C:
support = support({A ∪ C}) = 50%
confidence = support({A ∪ C}) / support({A}) = 66.7%
The Apriori principle: any subset of a frequent itemset must be frequent.

Example

Mining frequent itemsets: the key step
Frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset: if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find the frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.

The Apriori Algorithm
Join step: C_k is generated by joining L_{k-1} with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code:
    C_k: candidate itemsets of size k
    L_k: frequent itemsets of size k
    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;

Example of generating candidate frequent itemsets
L_3 = {abc, abd, acd, ace, bcd}
Self-joining: L_3 * L_3
abcd from abc and abd
acde from acd and ace
Pruning: acde is removed because ade is not in L_3
C_4 = {abcd}
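The join and prune steps of this example can be sketched in a few lines of Python. This is an illustrative sketch rather than the slides' own code: the function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made here.

```python
from itertools import combinations

def apriori_gen(frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets in `frequent`."""
    prev = sorted(tuple(sorted(s)) for s in frequent)
    prev_set = set(prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        # Join step: merge two itemsets that agree on their first k-2 items.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune step: every (k-1)-subset of the candidate must be frequent.
            if all(sub in prev_set for sub in combinations(cand, k - 1)):
                candidates.add(cand)
    return candidates

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = apriori_gen(L3, 4)
print(C4)
```

With the L_3 above, the join produces abcd and acde, and the prune step discards acde because its subset ade is missing from L_3.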

Apriori algorithm: example
Minimum support: 2 transactions
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
Scan D => C_1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3
L_1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3
C_2 (generated from L_1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D => C_2 with counts
itemset   sup.
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2
L_2
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2
C_3 (generated from L_2): {2 3 5}
Scan D => L_3
itemset   sup.
{2 3 5}   2
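The level-wise trace above can be reproduced with a short program. The following is a minimal, unoptimized sketch of the Apriori loop under the assumption that itemsets are stored as sorted tuples and support is an absolute transaction count; the function name `apriori` and the parameter `min_count` are chosen here for illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    # First scan: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = sorted(level)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:                      # join step
                cand = a + (b[-1],)
                if all(sub in level for sub in combinations(cand, k - 1)):
                    candidates.add(cand)              # survived the prune step
        # Scan the database once to count every surviving candidate.
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        level = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

The loop stops when a level produces no frequent itemsets, which for database D happens after {2 3 5} is found.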

Generating association rules from frequent itemsets
For each frequent itemset l, generate all non-empty proper subsets of l.
For each such subset s, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf.
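As an illustration of this step, the sketch below generates rules from the frequent itemset {2 3 5} of the worked Apriori example, reusing the support counts found there. The function name `rules_from` and the threshold of 0.7 are assumptions made for this example only.

```python
from itertools import combinations

# Support counts taken from the worked Apriori example (database D).
support_count = {
    (2,): 3, (3,): 3, (5,): 3,
    (2, 3): 2, (2, 5): 3, (3, 5): 2,
    (2, 3, 5): 2,
}

def rules_from(itemset, min_conf):
    """Yield (s, l - s, confidence) for every rule s => (l - s) that meets min_conf."""
    l = tuple(sorted(itemset))
    for size in range(1, len(l)):            # proper, non-empty subsets only
        for s in combinations(l, size):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rest = tuple(x for x in l if x not in s)
                yield s, rest, conf

rules = list(rules_from((2, 3, 5), min_conf=0.7))
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, f"confidence = {conf:.0%}")
```

Only {2 3} => {5} and {3 5} => {2} pass the threshold, since every other subset of {2 3 5} gives a confidence of 2/3.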

Mining Multiple-Level Association Rules
Items often form hierarchies.
It is difficult to find strong associations among data items at a low level of abstraction.
Strong associations discovered at high concept levels may represent common-sense knowledge.
Flexible support settings: items at the lower level are expected to have lower support.
Exploration of shared multi-level mining.
Example hierarchy: Level 1: computer [support = 10%]; Level 2: laptop [support = 6%], desktop [support = 4%]
Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
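The difference between the two support settings can be seen by applying both to the computer/laptop/desktop hierarchy. This is a small sketch for illustration; the dictionary layout and the function name `frequent_items` are assumptions, not part of the slides.

```python
# Supports and hierarchy levels from the slide's example.
support = {"computer": 0.10, "laptop": 0.06, "desktop": 0.04}
level_of = {"computer": 1, "laptop": 2, "desktop": 2}

def frequent_items(min_sup_by_level):
    """Items whose support meets the threshold assigned to their level."""
    return sorted(item for item, s in support.items()
                  if s >= min_sup_by_level[level_of[item]])

uniform = frequent_items({1: 0.05, 2: 0.05})   # same threshold at every level
reduced = frequent_items({1: 0.05, 2: 0.03})   # lower threshold at level 2

print("uniform support:", uniform)
print("reduced support:", reduced)
```

Under uniform support desktop (4%) is lost at level 2, while the reduced threshold of 3% keeps it.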

Multiple-Level Association Rules: Search strategies (1)
Level-by-level: full-breadth search; no background knowledge of frequent itemsets is used for pruning.
Level-cross filtering by single item: an item at the i-th level is examined if and only if its parent node at the (i-1)-th level is frequent.
Example: Level 1, min_sup = 12%: computer [support = 10%]. Level 2, min_sup = 3%: laptop [support = 6%] (not examined), desktop [support = 4%] (not examined). Computer is not frequent at level 1, so its children are not examined.

Multiple-Level Association Rules: Search strategies (2)
Level-cross filtering by k-itemset: a k-itemset at the i-th level is examined only if the corresponding parent k-itemset at the (i-1)-th level is frequent.
Example: Level 1, min_sup = 5%: computer and printer [support = 7%].
Level 2, min_sup = 2%:
laptop computer and b/w printer [support = 1%]
laptop computer and color printer [support = 2%]
desktop computer and b/w printer [support = 1%]
desktop computer and color printer [support = 3%]

Multi-level Association: Redundancy Filtering
Some rules may be redundant due to ancestor relationships between items.
Example:
computer => printer [support = 8%, confidence = 70%]
desktop => printer [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the expected value, based on the rule's ancestor.

Mining Multi-Dimensional Association
Single-dimensional rules: buys(X, "milk") => buys(X, "bread")
Multi-dimensional rules: >= 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates): age(X, "19-25") ^ occupation(X, "student") => buys(X, "coke")
Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ^ buys(X, "popcorn") => buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values; data cube approach.
Quantitative attributes: numeric, implicit ordering among values; discretization, clustering.

Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution
3. Clustering: distance-based association (one-dimensional clustering, then association)
4. Deviation: female => Wage: mean = $7/hr (overall mean = $9)

Quantitative Association Rules
Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: A_quan1 ^ A_quan2 => A_cat
age(X, "34") ^ income(X, "31K-40K") => buys(X, "high resolution TV")
Cluster adjacent association rules to form general rules using a 2-D grid.
Example: age(X, "34-35") ^ income(X, "30-50K") => buys(X, "high resolution TV")

Mining Associations and Correlations
From association mining to correlation analysis

Criticism of Confidence and Support (I)
Let R: A + B => G be a rule with confidence = 85% and support = 90%.
Support(R) is high, so R appears to be a significant rule.
However, the RHS (G) represents 90% of the studied data: a high proportion of the data contains G, so there is a high probability that G is satisfied by our data.
R is satisfied by a high percentage of the data under consideration because its RHS is highly supported.
R may therefore not be useful for making decisions or for extracting a general rule about the behaviour of the data.

Criticism of Support and Confidence
Example 1 (Aggarwal & Yu, PODS98)
Among 5000 students:
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence.
             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000

Criticism of Support and Confidence (Cont.)
Example 2: X and Y are positively correlated; X and Z are negatively related; yet the support and confidence of X => Z dominate.
X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1
We need a measure of dependent or correlated events.
Interest (correlation, lift) takes both P(A) and P(B) into consideration:
$\mathrm{interest}(A, B) = \frac{P(A \wedge B)}{P(A)\,P(B)}$
$P(A \wedge B) = P(A)\,P(B)$ if A and B are independent events.
A and B are negatively correlated if the value is less than 1, and positively correlated if it is greater than 1.
Itemset   Support   Interest
X, Y      25%       2
X, Z      37.50%    0.9
Y, Z      12.50%    0.57
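The support and interest values of Example 2 can be recomputed from the three binary columns. This is a minimal sketch; the helper names `prob` and `interest` are illustrative and not taken from the slides.

```python
# Binary occurrence vectors from the slide, one position per transaction.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def prob(*cols):
    """Fraction of transactions in which all of the given columns are 1."""
    rows = list(zip(*cols))
    return sum(all(r) for r in rows) / len(rows)

def interest(a, b):
    """P(A and B) / (P(A) * P(B)); equals 1 when A and B are independent."""
    return prob(a, b) / (prob(a) * prob(b))

for name, (a, b) in {"X,Y": (X, Y), "X,Z": (X, Z), "Y,Z": (Y, Z)}.items():
    print(f"{name}: support = {prob(a, b):.2%}, interest = {interest(a, b):.2f}")
```

The value for X, Z comes out as about 0.86, which the slide rounds to 0.9; it is below 1 even though X => Z has the highest support of the three pairs.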

Interestingness Measure: Correlations (Lift)
Measure of dependent/correlated events: lift
$\mathrm{lift}(A, B) = \frac{P(A \wedge B)}{P(A)\,P(B)}$
             Basketball   Not basketball   Sum(row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum(col.)    3000         2000             5000
$\mathrm{lift}(B, C) = \frac{2000/5000}{(3000/5000)(3750/5000)} = 0.89$
$\mathrm{lift}(B, \neg C) = \frac{1000/5000}{(3000/5000)(1250/5000)} = 1.33$
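The two lift values follow directly from the counts in the contingency table. The sketch below recomputes them; the function name `lift` and its count-based signature are assumptions made for this illustration.

```python
# Counts from the basketball/cereal contingency table.
n = 5000
n_basketball = 3000
n_cereal = 3750
n_not_cereal = 1250
n_basketball_cereal = 2000
n_basketball_not_cereal = 1000

def lift(n_ab, n_a, n_b, n):
    """P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

lift_b_c = lift(n_basketball_cereal, n_basketball, n_cereal, n)
lift_b_not_c = lift(n_basketball_not_cereal, n_basketball, n_not_cereal, n)

print(f"lift(B, C)     = {lift_b_c:.2f}")
print(f"lift(B, not C) = {lift_b_not_c:.2f}")
```

A lift below 1 for basketball and cereal confirms the negative correlation that the 66.7% confidence hides.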

Lift (I)
The lift of an association rule is the confidence divided by the proportion of all cases that are covered by the RHS:
Lift = Confidence / P(RHS)
It is a measure of the importance of the association.
As for the values of lift, there are some conditions to be considered:
If lift → 1, then RHS and LHS are independent, which indicates that the rule is not important.
If lift → +∞, we have the following sub-cases:
If RHS ⊆ LHS or LHS ⊆ RHS, then the rule is not important.
If P(RHS) → 0, then the rule is not important.
If P(RHS | LHS) → 1, then the rule is interesting.
If lift = 0, then P(RHS | LHS) = 0, i.e. P(RHS ∩ LHS) = 0, which indicates that the rule is not important.

Lift (II)
Lift gives an indication of rule significance, or how interesting the rule is.
It represents the predictive advantage a rule offers over simply guessing based on the frequency of the rule consequent (RHS).
It indicates whether a rule can be considered representative of the data, so that it can be used in the decision-making process.

Leverage
The leverage of an association rule is the proportion of additional cases covered by both the LHS and RHS above those expected if the LHS and RHS were independent:
Leverage = P(RHS and LHS) - P(LHS) P(RHS)
Leverage takes values in [-1, 1].
If leverage <= 0, then LHS and RHS are independent (leverage = 0) or negatively associated (leverage < 0).
Larger positive values of leverage indicate a more important association rule.

Example (III)
            morning   evening   sum(row)
milk        50        150       200
other       50        750       800
sum(col.)   100       900       1000
Here LHS = milk and RHS = morning.
Lift: P(RHS) = 100/1000 = 0.1, confidence = 50/200 = 0.25, so lift = 0.25/0.1 = 2.5.
Leverage: P(LHS and RHS) = 50/1000 = 0.05. The proportion of cases that would be expected to be covered by both LHS and RHS if they were independent is P(LHS) P(RHS) = (200/1000)(100/1000) = 0.02. The leverage is 0.05 - 0.02 = 0.03.
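The arithmetic of this example can be verified with a few lines of Python. This is a sketch for checking the numbers only; it assumes, as the totals 200/1000 and 100/1000 suggest, that the LHS is "milk" and the RHS is "morning", and the variable names are chosen here.

```python
# Counts from the milk/morning table.
n = 1000
n_lhs = 200    # transactions containing milk (assumed LHS)
n_rhs = 100    # transactions made in the morning (assumed RHS)
n_both = 50    # milk bought in the morning

p_lhs = n_lhs / n
p_rhs = n_rhs / n
p_both = n_both / n

confidence = p_both / p_lhs          # P(RHS | LHS)
lift = confidence / p_rhs            # Confidence / P(RHS)
leverage = p_both - p_lhs * p_rhs    # observed minus expected under independence

print(f"confidence = {confidence:.2f}")
print(f"lift       = {lift:.2f}")
print(f"leverage   = {leverage:.2f}")
```

Values are formatted to two decimals because floating-point subtraction may not give exactly 0.03.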