Distance Functions on Hierarchies Eftychia Baikousi

Outline
- Definition of metric & similarity
- Various distance functions
  - Minkowski
  - Set based
  - Edit distance
- Basic concepts of OLAP
- Lattice
- Distance in the same level of hierarchy
- Distance in different levels of hierarchy

Definition of metric
A distance function on a given set $M$ is a function $d: M \times M \to \mathbb{R}$ that satisfies the following conditions:
- $d(x,y) \ge 0$, and $d(x,y) = 0$ iff $x = y$. Distance is positive between two different points and is zero precisely from a point to itself.
- It is symmetric: $d(x,y) = d(y,x)$. The distance between $x$ and $y$ is the same in either direction.
- It satisfies the triangle inequality: $d(x,z) \le d(x,y) + d(y,z)$. The distance between two points is the shortest distance along any path.
A function $d$ that satisfies all three conditions is a metric.
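
The three conditions can be tested by brute force on a finite sample of points. The sketch below is only illustrative: the name `is_metric` and the tolerance `tol` are assumptions, and passing the check on a sample does not prove that `d` is a metric on the whole set.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check the metric conditions for d on a finite sample of points."""
    for x, y in product(points, repeat=2):
        # Non-negativity, and d(x, y) = 0 exactly when x = y.
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):
            return False
        # Symmetry.
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    for x, y, z in product(points, repeat=3):
        # Triangle inequality.
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True

print(is_metric([0, 1, 3, 7], lambda a, b: abs(a - b)))   # absolute difference
print(is_metric([0, 1, 2], lambda a, b: (a - b) ** 2))    # squared difference
```

The squared difference fails the check because $d(0,2) = 4$ exceeds $d(0,1) + d(1,2) = 2$.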

Definition of similarity metric
Let $s(x,y)$ be the similarity between two points $x$ and $y$, with $0 \le s \le 1$. Then the following properties hold:
- $s(x,y) = 1$ only if $x = y$
- $s(x,y) = s(y,x)$ for all $x$ and $y$ (symmetry)
The triangle inequality need not hold.
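
Minkowski Family
- 1-norm (City-Block, Manhattan): $L_1(x,y) = \sum_i |x_i - y_i|$
- 2-norm (Euclidean): $L_2(x,y) = \left(\sum_i |x_i - y_i|^2\right)^{1/2}$
- p-norm (Minkowski): $L_p(x,y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$
- infinity norm: $L_\infty(x,y) = \lim_{p \to \infty} \left(\sum_i |x_i - y_i|^p\right)^{1/p} = \max_i |x_i - y_i|$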
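
The whole family can be written as a single function of $p$. This is a minimal sketch; the function name `minkowski` and the use of `math.inf` to select the infinity norm are choices made here for illustration.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)              # limit case: largest coordinate gap
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))          # Manhattan: 3 + 4
print(minkowski(x, y, 2))          # Euclidean: sqrt(9 + 16)
print(minkowski(x, y, math.inf))   # infinity norm: max(3, 4)
```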

Set Based
- Simple matching coefficient: $SMC = \frac{\#\,\text{matching attribute values}}{\#\,\text{attributes}}$
- Jaccard coefficient: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
- Extended Jaccard, Tanimoto (vector based): $T(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}$
- Cosine (vector based): $\cos(x,y) = \frac{x \cdot y}{\|x\|\,\|y\|}$
- Dice's coefficient: $s = \frac{2\,|X \cap Y|}{|X| + |Y|}$
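
A minimal sketch of the five coefficients, assuming the usual textbook forms of the formulas. All function names are illustrative, and the functions do not guard against empty sets or zero vectors, which would divide by zero.

```python
import math

def smc(a, b):
    """Simple matching coefficient of two equal-length attribute vectors."""
    return sum(1 for u, v in zip(a, b) if u == v) / len(a)

def jaccard(a, b):
    """|A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dice(a, b):
    """2 |X intersect Y| / (|X| + |Y|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

def dot(x, y):
    return sum(u * v for u, v in zip(x, y))

def tanimoto(x, y):
    """Extended Jaccard for numeric vectors."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

print(jaccard({1, 2, 3}, {2, 3, 4}))   # 2 shared values out of 4 distinct
print(dice({1, 2, 3}, {2, 3, 4}))      # 2*2 / (3 + 3)
print(smc([1, 0, 1, 1], [1, 1, 1, 0])) # 2 of 4 positions agree
```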

Edit Distance: Levenshtein distance
The edit distance between two strings $x = x_1 \ldots x_n$ and $y = y_1 \ldots y_m$ is defined as the minimum number of atomic edit operations needed to transform $x$ into $y$.
- Insert: $ins(x,i,c) = x_1 x_2 \ldots x_i\, c\, x_{i+1} \ldots x_n$
- Delete: $del(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$
- Replace: $rep(x,i,c) = x_1 x_2 \ldots x_{i-1}\, c\, x_{i+1} \ldots x_n$
Every edit operation $o$ is assigned the cost $c(o) = 1$.
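
With unit costs, the minimum number of operations is commonly computed by dynamic programming over prefixes of the two strings. The sketch below uses that standard approach with two rows of the table; the slides themselves do not prescribe an algorithm.

```python
def levenshtein(x, y):
    """Minimum number of unit-cost inserts, deletes and replaces turning x into y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string: i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # replace cx by cy (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("MARTHA", "MARHTA"))   # 2
```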

Edit distances: Needleman-Wunsch distance or Sellers algorithm
- Insert
  - a character: $ins(x,i,c) = x_1 x_2 \ldots x_i\, c\, x_{i+1} \ldots x_n$ with $cost(o) = 1$
  - a gap: $ins_g(x,i,g) = x_1 x_2 \ldots x_i\, g\, x_{i+1} \ldots x_n$ with $cost(o) = g$
- Delete
  - a character: $del(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$ with $cost(o) = 1$
  - a gap: $del_g(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$ with $cost(o) = g$
- Replace
  - a character: $rep(x,i,c) = x_1 x_2 \ldots x_{i-1}\, c\, x_{i+1} \ldots x_n$ with $cost(o) = 1$

Edit distances: Jaro distance
Let $s$ and $t$ be two strings, and let
- $s'$ = the characters in $s$ that are common with $t$
- $t'$ = the characters in $t$ that are common with $s$
- $T_{s',t'}$ = the number of transpositions of characters in $s'$ relative to $t'$
$Jaro(s,t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{2\,|s'|}\right)$
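
Edit distances: Jaro distance, example
Let $s$ = MARTHA and $t$ = MARHTA. Then $|s'| = 6$, $|t'| = 6$ and $T_{s',t'} = 2/2 = 1$, since the mismatched characters are T/H and H/T.
$Jaro(s,t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{2\,|s'|}\right) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{12}\right) \approx 0.8055$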

Edit distances: Jaro-Winkler
JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t)))
Where:
- prefixLength: the length of the common prefix at the start of the strings
- PREFIXSCALE: a constant scaling factor that gives more favourable ratings to strings that match from the beginning, for a set prefix length

Edit distances: Jaro-Winkler, example
Let $s$ = MARTHA, $t$ = MARHTA and PREFIXSCALE = 0.1. Then Jaro(s,t) = 0.8055 and prefixLength = 3.
JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t))) = 0.8055 + (3 * 0.1 * (1 - 0.8055)) = 0.86385
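
The two worked examples can be reproduced with a few lines of arithmetic. This sketch is deliberately limited. `jaro_from_counts` evaluates the Jaro formula in the form used in these slides (third term divided by twice the number of common characters; other presentations of Jaro normalise that term differently), and it takes the counts of common characters and transpositions as inputs rather than deriving them from the strings. All function and parameter names are illustrative.

```python
def jaro_from_counts(len_s, len_t, common_s, common_t, transpositions):
    """Jaro score in the slides' form, from precomputed counts."""
    return (common_s / len_s
            + common_t / len_t
            + (common_s - transpositions) / (2 * common_s)) / 3

def common_prefix_length(s, t):
    """Number of leading characters on which s and t agree."""
    n = 0
    for a, b in zip(s, t):
        if a != b:
            break
        n += 1
    return n

def jaro_winkler(jaro, s, t, prefix_scale=0.1):
    """JWS = Jaro + prefixLength * PREFIXSCALE * (1 - Jaro)."""
    return jaro + common_prefix_length(s, t) * prefix_scale * (1.0 - jaro)

# MARTHA vs MARHTA: 6 common characters on each side, 1 transposition.
print(round(jaro_from_counts(6, 6, 6, 6, 1), 4))
# Using the rounded Jaro value 0.8055 from the example above.
print(round(jaro_winkler(0.8055, "MARTHA", "MARHTA"), 5))
```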

Basic Concepts of OLAP
- OLAP concerns the analysis of measurable quantities (measures): sales, stock, profit, ...
- Dimensions: parameters that determine the context of the measures: date, product, location, salesperson, ...
- Cubes: combinations of dimensions that determine some measures
- A cube defines a multidimensional space of dimensions, with the measures being points of this space

Cubes for OLAP
[Figure: a data cube with axes REGION (N, S, W), PRODUCT (Juice, Cola, Soap) and MONTH (Jan, ...); sample cell values 10 and 13.]

Basic Concepts of OLAP
- Data are considered to be stored in a multi-dimensional array, also called a cube or hypercube.
- The cube is a group of data cells. Each cell is uniquely identified by the corresponding values of the dimensions of the cube.
- The contents of a cell are called measures and represent the measured values of the real world.

Level Hierarchies for OLAP
- A dimension models all the ways in which data can be aggregated with respect to a specific parameter of their content: Date, Product, Location, Salesperson, ...
- Each dimension has an associated hierarchy of levels of data aggregation. This means that the dimension can be viewed at several levels of granularity. Date: day, week, month, year, ...

Level Hierarchies
- Level hierarchies: each dimension is organized into different levels of granularity.
- The user can navigate from one level to another, creating new cubes each time.
- Granularity (coarseness) is the opposite of detail.
[Figure: level hierarchy of the Date dimension with the levels Day, Week, Month and Year.]

Cubes & dimension hierarchies for OLAP
Measure: Sales volume
Dimensions: Product, Region, Date
Dimension hierarchies (most general level first):
  Product    Region     Date
  Industry   Country    Year
  Category   Region     Quarter
  Product    City       Month, Week
             Store      Day

Lattice
- A lattice is a partially ordered set (poset) in which every pair of elements has a unique supremum and a unique infimum.
- The hierarchy of levels is formally defined as a lattice $(L, <)$ such that $L = (L_1, \ldots, L_n, ALL)$ is a finite set of levels and $<$ is a partial order defined among the levels of $L$ such that $L_1 < L_i < ALL$ for all $1 \le i \le n$.
- The upper bound is always the level $ALL$, so that we can group all values into the single value "all".
- The lower bound of the lattice is the most detailed level of the dimension.

Distances in the same level of Hierarchy
Let $D$ be a dimension with levels of hierarchy $L_1 < L_i < ALL$, and let $x$ and $y$ be two specific values such that $x, y \in L_i$.
[Figure: the levels All, $L_2$ and $L_1$ of the hierarchy.]

Distances in the same level of Hierarchy
- Explicit
  - Based on instance
  - Based on property
- Minkowski
- Path
  - With respect to descendant
  - With respect to ancestor
- Highway

Explicit assignment (Instance)
- Assign all $n^2$ distances for the $n$ values $x_1, x_2, \ldots, x_n$ of $dom(L_i)$.
- Identity: $dist(x,y) = 0$ if $x = y$, and $dist(x,y) = 1$ if $x \ne y$.
- Example: $dist(\text{milk}, \text{milk}) = 0$ and $dist(\text{milk}, \text{yogurt}) = 1$.
[Figure: hierarchy All > Products (milk, yogurt, cola).]

Explicit assignment (Property)
- Attribute based. Level attributes: $\forall v = [v_1, \ldots, v_n] \in dom(L)$.
- The distance is defined with respect to the attributes $a_1, a_2, \ldots, a_n$.
- Example: let the level Product have the attributes [color, weight]. Then $dist(x,y) = dist(\text{milk}, \text{cola}) = dist(color(\text{milk}), color(\text{cola})) = dist(\text{white}, \text{black})$.
[Figure: hierarchy All > Products (milk, yogurt, cola).]
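
Explicit assignment (Property)
- Applying the ancestor function: let $x, y \in L_i$; then $dist(x,y) = 0$ if $anc_{L_i}^{L_j}(x) = anc_{L_i}^{L_j}(y)$, and $dist(x,y) = 1$ if $anc_{L_i}^{L_j}(x) \ne anc_{L_i}^{L_j}(y)$.
- Examples: $dist(\text{milk}, \text{yogurt}) = 0$ and $dist(\text{milk}, \text{cola}) = 1$.
- Applying the descendant relation.
[Figure: hierarchy All > Product Type (dairy, soft drink) > Product (milk, yogurt under dairy; cola, orange under soft drink).]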
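
The ancestor-based assignment can be sketched with a child-to-parent map. The map `PARENT`, the helper `anc` and the `hops` parameter (how many levels to climb) are hypothetical; the values mirror the product example.

```python
# Hypothetical child -> parent map for the Product dimension.
PARENT = {
    "milk": "dairy", "yogurt": "dairy",
    "cola": "soft drink", "orange": "soft drink",
    "dairy": "All", "soft drink": "All",
}

def anc(value, hops=1):
    """Ancestor of value, `hops` levels above it."""
    for _ in range(hops):
        value = PARENT[value]
    return value

def ancestor_dist(x, y, hops=1):
    """0 if x and y share the ancestor `hops` levels up, else 1."""
    return 0 if anc(x, hops) == anc(y, hops) else 1

print(ancestor_dist("milk", "yogurt"))  # both map to dairy
print(ancestor_dist("milk", "cola"))    # dairy vs soft drink
```

Climbing two levels makes every pair equal, since all products share the ancestor All, so the choice of ancestor level controls how coarse the distance is.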

Minkowski family
- For two values of the same level, the family reduces to the Manhattan distance $|x - y|$.
- Example: $dist(x,y) = dist(1995, 2005) = |1995 - 2005| = 10$.
[Figure: hierarchy All > Year (1990, 1995, 2000, 2005).]

Path distances
- With respect to descendants (lower level).
- Let $x, y \in L_i$ and let $L_j$ be a lower level, i.e., $L_j < L_i$. Then $dist(x,y) = dist\left(f(desc_{L_i}^{L_j}(x)),\ f(desc_{L_i}^{L_j}(y))\right)$.
- $f$: a function that picks one of the descendants.
- Special case: $L_j$ is the detailed level $L_1$.
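
Path distances with respect to the detailed level
- $dist(x,y) = dist\left(f(desc_{L_i}^{L_j}(x)),\ f(desc_{L_i}^{L_j}(y))\right)$
- Example: $dist(\text{Greece}, \text{France}) = dist(\text{Athens}, \text{Paris})$, where $f(desc_{Country}^{City}(\text{Greece})) = \text{Athens}$ and $f(desc_{Country}^{City}(\text{France})) = \text{Paris}$.
[Figure: hierarchy All > Country (Greece, France) > City (Ioannina, Athens under Greece; Paris, Nantes under France).]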

Path distances
- With respect to the least common ancestor.
- Let $x, y \in L_i$ and let $L_j$ be the upper level, i.e., $L_i < L_j$, where $x$ and $y$ have a common ancestor. Then $dist(x,y) = dist(x, anc_{L_i}^{L_j}(x)) + dist(anc_{L_i}^{L_j}(y), y)$.
- Example: $dist(\text{Athens}, \text{Paris}) = dist(\text{Athens}, \text{Europe}) + dist(\text{Europe}, \text{Paris}) = 2 + 2 = 4$.
[Figure: hierarchy All > Continent (Europe) > Country (Greece, France) > City (Ioannina, Athens under Greece; Paris, Nantes under France).]
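
If each of the two terms is measured as the number of hops up to the least common ancestor, which is what the 2 + 2 in the example suggests, the distance can be sketched as follows. The `PARENT` map and the function names are hypothetical.

```python
# Hypothetical child -> parent map for the geography example.
PARENT = {
    "Ioannina": "Greece", "Athens": "Greece",
    "Paris": "France", "Nantes": "France",
    "Greece": "Europe", "France": "Europe",
    "Europe": "All",
}

def path_to_root(value):
    """The value followed by all of its ancestors, nearest first."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lca_distance(x, y):
    """Hops from x up to the least common ancestor plus hops from y up to it."""
    hops_from_y = {v: hops for hops, v in enumerate(path_to_root(y))}
    for hops_from_x, v in enumerate(path_to_root(x)):
        if v in hops_from_y:
            return hops_from_x + hops_from_y[v]
    raise ValueError("x and y have no common ancestor")

print(lca_distance("Athens", "Paris"))     # 2 + 2 = 4, via Europe
print(lca_distance("Athens", "Ioannina"))  # 1 + 1 = 2, via Greece
```

Nothing in the sketch requires the two values to be at the same level, so the same hop count also applies to values from different levels.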

Highway distance
- Let the values of $L_i$ form a set of $k$ clusters, where each cluster has a representative $r_k$.
- $dist(x,y) = dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)$
- Specify $k^2$ distances $dist(r_x, r_y)$ and $k$ distances $dist(x, r_x)$.
- Example: $dist(\text{Ioannina}, \text{Nantes}) = dist(\text{Ioannina}, \text{Athens}) + dist(\text{Athens}, \text{Paris}) + dist(\text{Paris}, \text{Nantes})$.
[Figure: hierarchy All > Country (Greece, France) > City (Ioannina, Athens under Greece; Paris, Nantes under France).]
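
A minimal sketch of the highway distance, with cities clustered by country and the representatives taken from the example. The numeric distances in `TO_REP` and `BETWEEN_REPS` are invented purely for illustration, and all names are hypothetical.

```python
# Cluster representative of each city (clusters follow the countries).
REPRESENTATIVE = {"Ioannina": "Athens", "Athens": "Athens",
                  "Paris": "Paris", "Nantes": "Paris"}

# Invented distances: value -> its representative, and between representatives.
TO_REP = {"Ioannina": 4.0, "Athens": 0.0, "Paris": 0.0, "Nantes": 3.0}
BETWEEN_REPS = {frozenset(("Athens", "Paris")): 20.0}

def rep_dist(rx, ry):
    """Distance between two representatives (0 for the same representative)."""
    if rx == ry:
        return 0.0
    return BETWEEN_REPS[frozenset((rx, ry))]

def highway_dist(x, y):
    """dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)."""
    rx, ry = REPRESENTATIVE[x], REPRESENTATIVE[y]
    return TO_REP[x] + rep_dist(rx, ry) + TO_REP[y]

print(highway_dist("Ioannina", "Nantes"))  # 4 + 20 + 3
```

Applied literally, the formula gives a non-representative value a non-zero distance to itself (twice its distance to the representative), so an implementation may want to treat `x == y` separately.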

Distances in different levels of Hierarchy
- Explicit
- $dist_1 + dist_2$
- $dist_3 + dist_4$
- With respect to the detailed level
- With respect to their least common ancestor
- Highway
- Attribute based
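
Distances in different levels of Hierarchy
Let $D$ be a dimension with levels of hierarchy $L_1 < L_i < ALL$, and let $x$ and $y$ be two specific values such that $x \in L_x$, $y \in L_y$ and $L_x < L_y$.
- $y_x$: the ancestor of $x$ in level $L_y$, $y_x = anc_{L_x}^{L_y}(x)$
- $x_y$: a descendant of $y$ in level $L_x$, $x_y = desc_{L_y}^{L_x}(y)$
[Figure: $dist_1$ runs from $x$ to $y_x$, $dist_2$ from $y_x$ to $y$, $dist_3$ from $y$ to $x_y$, and $dist_4$ from $x_y$ to $x$.]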

Distances in different levels of Hierarchy
- Explicit assignment: define $dist_{L_x,L_y}(x,y)$ for all $x \in L_x$, $y \in L_y$.
- $dist_1 + dist_2$: $dist_1 + dist_2 = dist(x, anc_{L_x}^{L_y}(x)) + dist(anc_{L_x}^{L_y}(x), y)$, where $dist(anc_{L_x}^{L_y}(x), y)$ is a distance between two values from the same level of hierarchy.
- Special case: if $y$ is an ancestor of $x$, then $dist_2 = 0$.
[Figure: $dist_1$ runs from $x$ to $y_x$, $dist_2$ from $y_x$ to $y$, $dist_3$ from $y$ to $x_y$, and $dist_4$ from $x_y$ to $x$.]
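
The $dist_1 + dist_2$ scheme can be sketched under two assumptions that the slides leave open: $dist_1$ is counted as the number of hops from $x$ up to its ancestor in the level of $y$, and the same-level distance used for $dist_2$ is the identity distance. The maps `PARENT` and `LEVEL` and all function names are hypothetical.

```python
# Hypothetical hierarchy: child -> parent, and the level index of each value.
PARENT = {
    "Ioannina": "Greece", "Athens": "Greece",
    "Paris": "France", "Nantes": "France",
    "Greece": "Europe", "France": "Europe",
}
LEVEL = {"Ioannina": 0, "Athens": 0, "Paris": 0, "Nantes": 0,
         "Greece": 1, "France": 1, "Europe": 2}

def anc_to_level(x, level):
    """Ancestor of x at the given level, with the number of hops climbed."""
    hops = 0
    while LEVEL[x] < level:
        x = PARENT[x]
        hops += 1
    return x, hops

def identity_dist(a, b):
    return 0 if a == b else 1

def cross_level_dist(x, y, same_level_dist=identity_dist):
    """dist1 + dist2, assuming the level of x is not above the level of y."""
    y_x, dist1 = anc_to_level(x, LEVEL[y])   # dist1: hops from x to y_x
    dist2 = same_level_dist(y_x, y)          # dist2: same-level distance
    return dist1 + dist2

print(cross_level_dist("Athens", "France"))  # 1 hop + dist(Greece, France)
print(cross_level_dist("Athens", "Greece"))  # y is an ancestor of x: dist2 = 0
```

Any of the same-level distances from the previous slides can be passed as `same_level_dist` in place of the identity distance.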
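
Distances in different levels of Hierarchy
- $dist_3 + dist_4$: $dist_3 + dist_4 = dist\left(y, f(desc_{L_y}^{L_x}(y))\right) + dist\left(f(desc_{L_y}^{L_x}(y)), x\right)$
- $f$: a function that picks a descendant.
- $dist\left(f(desc_{L_y}^{L_x}(y)), x\right)$ is a distance between two values from the same level of hierarchy.
- Special case: if $y$ is an ancestor of $x$ (and $f$ picks $x$), then $dist_4 = 0$.
[Figure: $dist_1$ runs from $x$ to $y_x$, $dist_2$ from $y_x$ to $y$, $dist_3$ from $y$ to $x_y$, and $dist_4$ from $x_y$ to $x$.]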

Distances in different levels of Hierarchy
- With respect to the detailed level.
- Let $x \in L_x$ and $y \in L_y$, with $x_1 = f(desc_{L_x}^{L_1}(x))$ and $y_1 = f(desc_{L_y}^{L_1}(y))$.
- $dist(x,y) = dist(x, x_1) + dist(x_1, y_1) + dist(y, y_1)$, where $dist(x_1, y_1)$ is a distance between two values from the same level of hierarchy.

Distances in different levels of Hierarchy
- With respect to their common ancestor.
- Let $L_z$ be the level of hierarchy where $x$ and $y$ have their first common ancestor.
- $dist(x,y) = dist(x, anc_{L_x}^{L_z}(x)) + dist(anc_{L_y}^{L_z}(y), y)$
- Options: the number of hops needed to reach the first common ancestor; normalizing according to the height of the level $L_z$.

Distances in different levels of Hierarchy
- Highway distance: let every $L_i$ be clustered into $k_i$ clusters, where every cluster has its own representative $r_{k_i}$. Then $dist(x,y) = dist(x, r_x) + dist(r_x, r_y) + dist(r_y, y)$.
- Attribute based: level attributes $\forall v = [v_1, \ldots, v_n] \in dom(L)$. The distance can be defined with respect to the attributes.

Types of Levels
- Nominal ($=$, $\ne$): values hold the distinctness property; values can be explicitly distinguished.
- Ordinal ($<$, $>$): values hold the distinctness property and the order property; values abide by an order.
- Interval ($+$, $-$): values hold the distinctness, order and addition properties; a unit of measurement exists, and the difference between two values is meaningful.