Distance Functions on Hierarchies. Eftychia Baikousi




Outline
Definition of metric & similarity
Various distance functions: Minkowski, set based, edit distance
Basic concepts of OLAP
Lattice
Distance in the same level of hierarchy
Distance in different levels of hierarchy

Definition of metric
A distance function on a given set $M$ is a function $d: M \times M \to \mathbb{R}$ that satisfies the following conditions:
$d(x,y) \ge 0$, and $d(x,y) = 0$ iff $x = y$. The distance is positive between two different points and is zero precisely from a point to itself.
It is symmetric: $d(x,y) = d(y,x)$. The distance between $x$ and $y$ is the same in either direction.
It satisfies the triangle inequality: $d(x,z) \le d(x,y) + d(y,z)$. The distance between two points is the shortest distance along any path.
A function $d$ that satisfies all three conditions is a metric.
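The three conditions can be checked mechanically on a finite sample of points. The sketch below is only an illustration: the helper name `is_metric_on` and the tolerance parameter are assumptions, not part of the presentation, and passing the check on a sample does not prove that a function is a metric on the whole set.

```python
from itertools import product

def is_metric_on(points, d, tol=1e-12):
    """Check the metric conditions for d on a finite sample of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                        # non-negativity
            return False
        if (abs(d(x, y)) <= tol) != (x == y):     # zero exactly when x == y
            return False
        if abs(d(x, y) - d(y, x)) > tol:          # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:     # triangle inequality
            return False
    return True

sample = [0, 1, 2, 5]
print(is_metric_on(sample, lambda a, b: abs(a - b)))    # absolute difference
print(is_metric_on(sample, lambda a, b: (a - b) ** 2))  # squared difference
```

The squared difference fails because $d(0,2) = 4$ exceeds $d(0,1) + d(1,2) = 2$, which violates the triangle inequality.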

Definition of similarity metric
Let $s(x,y)$ be the similarity between two points $x$ and $y$. Then the following properties hold:
$s(x,y) = 1$ only if $x = y$ ($0 \le s \le 1$)
$s(x,y) = s(y,x)$ for all $x$ and $y$ (symmetry)
The triangle inequality does not hold in general


Minkowski Family
norm-1, City-Block, Manhattan: $L_1(x,y) = \sum_i |x_i - y_i|$
norm-2, Euclidean: $L_2(x,y) = \left(\sum_i |x_i - y_i|^2\right)^{1/2}$
norm-p, Minkowski: $L_p(x,y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$
infinity norm: $L_\infty(x,y) = \lim_{p \to \infty} \left(\sum_i |x_i - y_i|^p\right)^{1/p} = \max_i |x_i - y_i|$
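A minimal sketch of the Minkowski family follows. The function name `minkowski` and the use of `math.inf` to select the infinity norm are choices made for this illustration only.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of coordinates")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # Limit case p -> infinity: the largest coordinate difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))         # Manhattan
print(minkowski(x, y, 2))         # Euclidean
print(minkowski(x, y, math.inf))  # infinity norm
```

For the points (0, 0) and (3, 4) the three calls give 7, 5 and 4, which shows how the distance shrinks toward the largest single coordinate difference as $p$ grows.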

Set Based
Simple matching coefficient: $SMC = \dfrac{\#\,\text{matching attribute values}}{\#\,\text{attributes}}$
Jaccard coefficient: $J(A,B) = \dfrac{|A \cap B|}{|A \cup B|}$
Extended Jaccard, Tanimoto (vector based): $T(A,B) = \dfrac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}$
Cosine (vector based): $\cos(x,y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$
Dice's coefficient: $s = \dfrac{2\,|X \cap Y|}{|X| + |Y|}$
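The set-based and vector-based coefficients can be sketched in a few lines. All function names below are illustrative, and the convention of returning 1.0 for two empty sets is an assumption made here, since the slide does not cover that case.

```python
import math

def smc(u, v):
    """Simple matching coefficient of two equal-length attribute tuples."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # assumption: empty vs empty = 1

def dice(a, b):
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0   # assumption: empty vs empty = 1

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def tanimoto(x, y):
    """Extended Jaccard coefficient on numeric vectors."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

print(jaccard({1, 2, 3}, {2, 3, 4}))
print(dice({1, 2, 3}, {2, 3, 4}))
print(tanimoto((1, 1, 1, 0), (0, 1, 1, 1)))
```

On binary vectors the Tanimoto coefficient coincides with the Jaccard coefficient of the corresponding sets, which is why the first and third calls return the same value.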

Edit Distance: Levenshtein distance
The edit distance between two strings $x = x_1 \ldots x_n$ and $y = y_1 \ldots y_m$ is defined as the minimum number of atomic edit operations needed to transform $x$ into $y$.
Insert: $ins(x,i,c) = x_1 x_2 \ldots x_i \, c \, x_{i+1} \ldots x_n$
Delete: $del(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$
Replace: $rep(x,i,c) = x_1 x_2 \ldots x_{i-1} \, c \, x_{i+1} \ldots x_n$
Assign a cost to every edit operation $o$: $c(o) = 1$
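The minimum number of unit-cost operations is usually computed with dynamic programming. The following is a sketch of that standard technique, not code taken from the presentation; the function name `levenshtein` is illustrative.

```python
def levenshtein(x, y):
    """Minimum number of unit-cost inserts, deletes and replaces turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string: i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # replace cx by cy (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("MARTHA", "MARHTA"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of `y` rather than with the product of both lengths.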

Edit distances: Needleman-Wunsch distance or Sellers algorithm
Insert
a character: $ins(x,i,c) = x_1 x_2 \ldots x_i \, c \, x_{i+1} \ldots x_n$ with $cost(o) = 1$
a gap: $ins_g(x,i,g) = x_1 x_2 \ldots x_i \, g \, x_{i+1} \ldots x_n$ with $cost(o) = g$
Delete
a character: $del(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$ with $cost(o) = 1$
a gap: $del_g(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$ with $cost(o) = g$
Replace
a character: $rep(x,i,c) = x_1 x_2 \ldots x_{i-1} \, c \, x_{i+1} \ldots x_n$ with $cost(o) = 1$

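Edit distances: Jaro distance
Let two strings $s$ and $t$, and
$s'$ = characters in $s$ that are common with $t$
$t'$ = characters in $t$ that are common with $s$
$T_{s',t'}$ = number of transpositions of characters in $s'$ relative to $t'$
$Jaro(s,t) = \dfrac{1}{3}\left(\dfrac{|s'|}{|s|} + \dfrac{|t'|}{|t|} + \dfrac{|s'| - T_{s',t'}}{2\,|s'|}\right)$
The commonly cited form of the Jaro similarity divides the last term by $|s'|$ rather than $2|s'|$; the form above is the one this presentation uses.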

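Edit distances: Jaro distance, example
Let $s$ = MARTHA and $t$ = MARHTA
$|s'| = 6$
$|t'| = 6$
$T_{s',t'} = 2/2 = 1$, since the mismatched characters are T/H and H/T
$Jaro(s,t) = \dfrac{1}{3}\left(\dfrac{|s'|}{|s|} + \dfrac{|t'|}{|t|} + \dfrac{|s'| - T_{s',t'}}{2\,|s'|}\right) = \dfrac{1}{3}\left(\dfrac{6}{6} + \dfrac{6}{6} + \dfrac{6 - 1}{12}\right) = 0.8055$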

Edit distances: Jaro-Winkler
JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t)))
Where:
prefixLength: the length of the common prefix at the start of the strings
PREFIXSCALE: a constant scaling factor that gives more favourable ratings to strings that match from the beginning, for a set prefix length

Edit distances: Jaro-Winkler, example
Let $s$ = MARTHA and $t$ = MARHTA and PREFIXSCALE = 0.1
Jaro(s,t) = 0.8055
prefixLength = 3
JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t))) = 0.8055 + (3 * 0.1 * (1 - 0.8055)) = 0.86385
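The two worked examples can be reproduced with a short sketch. Several details are assumptions that the slides do not state: characters count as common only within a window of max(|s|, |t|) / 2 - 1 positions (the usual Jaro convention), the counted prefix is capped at 4 characters, and all function names are illustrative. The third term follows this presentation's form, with $2|s'|$ in the denominator.

```python
def common_and_transpositions(s, t):
    """Return (number of common characters, number of transpositions)."""
    if not s or not t:
        return 0, 0
    window = max(max(len(s), len(t)) // 2 - 1, 0)  # assumed matching window
    t_used = [False] * len(t)
    s_common = []
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_used[j] and t[j] == ch:
                t_used[j] = True
                s_common.append(ch)
                break
    t_common = [t[j] for j in range(len(t)) if t_used[j]]
    mismatched = sum(a != b for a, b in zip(s_common, t_common))
    return len(s_common), mismatched // 2  # e.g. T/H and H/T give 2/2 = 1

def jaro_as_on_slides(s, t):
    m, transpositions = common_and_transpositions(s, t)
    if m == 0:
        return 0.0
    # Third term uses 2*m, following the formula in this presentation.
    return (m / len(s) + m / len(t) + (m - transpositions) / (2 * m)) / 3

def jaro_winkler(s, t, prefix_scale=0.1, max_prefix=4):
    jaro = jaro_as_on_slides(s, t)
    prefix_length = 0
    for a, b in zip(s, t):
        if a != b or prefix_length == max_prefix:  # cap is an assumption
            break
        prefix_length += 1
    return jaro + prefix_length * prefix_scale * (1.0 - jaro)

print(round(jaro_as_on_slides("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Without intermediate rounding the results are about 0.8056 and 0.8639, which agree with the slide values 0.8055 and 0.86385 up to truncation.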


Basic OLAP Concepts
OLAP concerns the analysis of measurable quantities (measures): sales, inventory, profit, ...
Dimensions: parameters that define the context of the measures: date, product, location, salesperson, ...
Cubes: combinations of dimensions that determine some measures.
A cube defines a multidimensional space of dimensions, and the measures are points in this space.

Cubes for OLAP
[Figure: a data cube with the dimensions REGION (N, S, W), PRODUCT (Juice, Cola, Soap) and MONTH (Jan, ...); cells hold measure values such as 10 and 13.]

Basic OLAP Concepts
The data are considered to be stored in a multi-dimensional array, which is also called a cube or hypercube.
A cube is a group of data cells. Each cell is uniquely identified by the corresponding values of the cube's dimensions.
The contents of a cell are called measures and represent the measured values of the real world.

Hierarchies of levels for OLAP
A dimension models all the ways in which data can be aggregated with respect to a specific parameter of their context.
Date, Product, Location, Salesperson, ...
Each dimension has an associated hierarchy of levels of aggregation of the data. This means that the dimension can be viewed at several levels of granularity.
Date: day, week, month, year, ...

Hierarchies of Levels
Hierarchies of levels: each dimension is organized into different levels of granularity.
The user can navigate from one level to another, creating new cubes each time.
Granularity (Greek: αδρομέρεια, "coarseness"): the opposite of detail; αδρομέρεια is the correct Greek term.
[Figure: the Date hierarchy with the levels Year, Month, Week and Day.]

Cubes & dimension hierarchies for OLAP
Measure: Sales volume
Dimensions: Product, Region, Date
Dimension hierarchies:
Product: Industry, Category, Product
Region: Country, Region, City, Store
Date: Year, Quarter, Month / Week, Day
[Figure: a sales-volume cube with the axes Product, Region and Month.]


Lattice
A lattice is a partially ordered set (poset) in which every pair of elements has a unique supremum and a unique infimum.
The hierarchy of levels is formally defined as a lattice $(L, <)$ such that $L = (L_1, \ldots, L_n, ALL)$ is a finite set of levels and $<$ is a partial order defined among the levels of $L$ such that $L_1 < L_i < ALL$ for all $1 \le i \le n$.
The upper bound is always the level ALL, so that we can group all values into the single value "all".
The lower bound of the lattice is the most detailed level of the dimension.


Distances in the same level of Hierarchy
Let a dimension $D$, its levels of hierarchies $L_1 < L_i < ALL$, and two specific values $x$ and $y$ s.t. $x, y \in L_i$.
[Figure: the levels ALL, $L_2$, $L_1$ of the hierarchy.]

Distances in the same level of Hierarchy
Explicit
  Based on instance
  Based on property
Minkowski
Path
  WRT descendant
  WRT ancestor
Highway

Explicit assignment (Instance)
All $n^2$ distances for the $n$ values $x_1, x_2, \ldots, x_n$ of $dom(L_i)$
Identity: $dist(x,y) = 0$ if $x = y$, and $dist(x,y) = 1$ if $x \ne y$
Example
dist(x, y) = dist(milk, milk) = 0
dist(x, y) = dist(milk, yogurt) = 1
[Figure: hierarchy All > Products (milk, yogurt, cola).]

Explicit assignment (Property)
Attribute based
Level attributes: each value $v \in dom(L)$ is described by a tuple of attribute values $[v_1, \ldots, v_n]$
Distance defined with respect to the attributes $a_1, a_2, \ldots, a_n$
Example
Let the level Product have the attributes [color, weight]
dist(x, y) = dist(milk, cola) = dist(color(milk), color(cola)) = dist(white, black)
[Figure: hierarchy All > Products (milk, yogurt, cola).]

Explicit assignment (Property)
Applying the ancestor function
Let $x, y \in L_i$ and $L_j$ a higher level; then
$dist(x,y) = 0$ if $anc_{L_i}^{L_j}(x) = anc_{L_i}^{L_j}(y)$, and $dist(x,y) = 1$ if $anc_{L_i}^{L_j}(x) \ne anc_{L_i}^{L_j}(y)$
Examples
dist(milk, yogurt) = 0
dist(milk, cola) = 1
Applying the descendant relation
[Figure: hierarchy All > Type (dairy, soft drink) > Product (milk, yogurt, cola, orange).]
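Both explicit assignments can be illustrated on the product hierarchy of the example. The dictionary `PARENT` and the function names below are hypothetical helpers introduced for this sketch.

```python
# Hypothetical encoding of the example hierarchy: Product -> Type -> All.
PARENT = {
    "milk": "dairy", "yogurt": "dairy",
    "cola": "soft drink", "orange": "soft drink",
    "dairy": "All", "soft drink": "All",
}

def anc(value, hops=1):
    """Ancestor of a value, a given number of levels up."""
    for _ in range(hops):
        value = PARENT[value]
    return value

def identity_distance(x, y):
    """Instance-based assignment: 0 for the same value, 1 otherwise."""
    return 0 if x == y else 1

def ancestor_distance(x, y, hops=1):
    """Property-based assignment: 0 if both values share the ancestor, 1 otherwise."""
    return 0 if anc(x, hops) == anc(y, hops) else 1

print(identity_distance("milk", "yogurt"))  # different values
print(ancestor_distance("milk", "yogurt"))  # both are dairy
print(ancestor_distance("milk", "cola"))    # dairy vs soft drink
```

Raising `hops` compares the values at a coarser level, so with `hops=2` every pair of products shares the ancestor All and the distance is always 0.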

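Minkowski family
For a single numeric value per member, every Minkowski distance reduces to the Manhattan distance: $|x - y|$
Example
dist(x, y) = dist(1995, 2005) = |1995 - 2005| = 10
[Figure: hierarchy All > Year (1990, 1995, 2000, 2005).]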

Path distances
With respect to descendants (lower level)
Let $x, y \in L_i$ and $L_j$ a lower level, i.e., $L_j < L_i$
$dist(x,y) = dist\left(f(desc_{L_i}^{L_j}(x)),\ f(desc_{L_i}^{L_j}(y))\right)$
$f$: a function that picks one of the descendants
Special case: $L_j$ is the detailed level $L_1$

Path distances
WRT the detailed level
$dist(x,y) = dist\left(f(desc_{L_i}^{L_j}(x)),\ f(desc_{L_i}^{L_j}(y))\right)$
Example
dist(Greece, France) = dist(Athens, Paris)
where $f(desc_{Country}^{City}(Greece))$ = Athens and $f(desc_{Country}^{City}(France))$ = Paris
[Figure: hierarchy All > Country (Greece, France) > City (Ioannina, Athens, Paris, Nantes).]

Path distances
With respect to the least common ancestor
Let $x, y \in L_i$ and $L_j$ the upper level, i.e., $L_i < L_j$, where $x$ and $y$ have a common ancestor
$dist(x,y) = dist\left(x,\ anc_{L_i}^{L_j}(x)\right) + dist\left(anc_{L_i}^{L_j}(y),\ y\right)$
Example
dist(x, y) = dist(Athens, Paris) = dist(Athens, Europe) + dist(Europe, Paris) = 2 + 2 = 4
[Figure: hierarchy All > Continent (Europe) > Country (Greece, France) > City (Ioannina, Athens, Paris, Nantes).]
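The example counts the hops from each value up to the first common ancestor. A minimal sketch of that computation follows; the dictionary `PARENT` and the function names are hypothetical, and measuring dist(x, anc(x)) in hops is the reading suggested by the 2 + 2 = 4 example.

```python
# Hypothetical encoding of the example hierarchy: City -> Country -> Continent -> All.
PARENT = {
    "Ioannina": "Greece", "Athens": "Greece",
    "Paris": "France", "Nantes": "France",
    "Greece": "Europe", "France": "Europe",
    "Europe": "All",
}

def ancestors(value):
    """The value followed by all of its ancestors, nearest first."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lca_distance(x, y):
    """Hops from x up to the first common ancestor plus hops from y up to it."""
    hops_from_y = {node: hops for hops, node in enumerate(ancestors(y))}
    for hops_from_x, node in enumerate(ancestors(x)):
        if node in hops_from_y:
            return hops_from_x + hops_from_y[node]
    raise ValueError("the two values have no common ancestor")

print(lca_distance("Athens", "Paris"))     # via Europe
print(lca_distance("Athens", "Ioannina"))  # via Greece
```

Cities of the same country are 2 hops apart and cities of different countries are 4 hops apart, so the distance reflects how high in the hierarchy the two values first meet.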

Highway distance
Let the values of $L_i$ form a set of $k$ clusters, where each cluster has a representative $r_k$
$dist(x,y) = dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)$
Specify:
$k^2$ distances: $dist(r_x, r_y)$, and
$k$ distances: $dist(x, r_x)$
Example:
dist(Ioannina, Nantes) = dist(Ioannina, Athens) + dist(Athens, Paris) + dist(Paris, Nantes)
[Figure: hierarchy All > Country (Greece, France) > City (Ioannina, Athens, Paris, Nantes).]
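The highway distance needs only a table of distances between representatives and one distance from each value to its own representative. In the sketch below the clusters follow the example (Athens represents the Greek cities, Paris the French ones), while the numeric distances are made-up illustrative values and all names are hypothetical.

```python
# Cluster representative of every city (from the example).
REPRESENTATIVE = {
    "Ioannina": "Athens", "Athens": "Athens",
    "Paris": "Paris", "Nantes": "Paris",
}
# Distance from each value to its own representative (made-up numbers).
TO_REPRESENTATIVE = {"Ioannina": 300, "Athens": 0, "Paris": 0, "Nantes": 340}
# Distances between representatives (made-up number).
BETWEEN_REPRESENTATIVES = {frozenset(("Athens", "Paris")): 2100}

def highway_distance(x, y):
    """dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)."""
    if x == y:
        return 0  # guard so that a value is at distance 0 from itself
    r_x, r_y = REPRESENTATIVE[x], REPRESENTATIVE[y]
    between = 0 if r_x == r_y else BETWEEN_REPRESENTATIVES[frozenset((r_x, r_y))]
    return TO_REPRESENTATIVE[x] + between + TO_REPRESENTATIVE[y]

print(highway_distance("Ioannina", "Nantes"))  # 300 + 2100 + 340
print(highway_distance("Ioannina", "Athens"))  # same cluster: 300 + 0 + 0
```

The design trades accuracy for storage: every path is routed through the representatives, so far fewer distances have to be specified than the $n^2$ of a fully explicit assignment.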


Distances in different levels of Hierarchy
Explicit
$dist_1 + dist_2$
$dist_3 + dist_4$
WRT the detailed level
WRT their least common ancestor
Highway
Attribute based

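Distances in different levels of Hierarchy
Let a dimension $D$, its levels of hierarchies $L_1 < L_i < ALL$, and two specific values $x$ and $y$ s.t. $x \in L_x$, $y \in L_y$ and $L_x < L_y$
$x_y$: the ancestor of $x$ in level $L_y$, $x_y = anc_{L_x}^{L_y}(x)$
$y_x$: a descendant of $y$ in level $L_x$, $y_x = desc_{L_y}^{L_x}(y)$
[Figure: $dist_1$ connects $x$ to $x_y$, $dist_2$ connects $x_y$ to $y$, $dist_3$ connects $y$ to $y_x$, and $dist_4$ connects $y_x$ to $x$.]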

Distances in different levels of Hierarchy
Explicit assignment
Define $dist_{L_x, L_y}(x, y)$ for all $x \in L_x$, $y \in L_y$
$dist_1 + dist_2$
$dist_1 + dist_2 = dist\left(x,\ anc_{L_x}^{L_y}(x)\right) + dist\left(anc_{L_x}^{L_y}(x),\ y\right)$
where $dist\left(anc_{L_x}^{L_y}(x),\ y\right)$ is a distance of two values from the same level of hierarchy
Special case: if $y$ is an ancestor of $x$, then $dist_2 = 0$
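The $dist_1 + dist_2$ combination can be sketched by first lifting $x$ to the level of $y$ and then applying any same-level distance. The choices below are assumptions made for illustration: $dist_1$ is measured in hops, the same-level distance defaults to the identity distance, and the dictionaries `PARENT` and `LEVEL` are hypothetical.

```python
PARENT = {
    "Ioannina": "Greece", "Athens": "Greece",
    "Paris": "France", "Nantes": "France",
    "Greece": "Europe", "France": "Europe",
}
LEVEL = {
    "Ioannina": "City", "Athens": "City", "Paris": "City", "Nantes": "City",
    "Greece": "Country", "France": "Country",
    "Europe": "Continent",
}

def identity_distance(a, b):
    return 0 if a == b else 1

def climb(x, target_level):
    """Ancestor of x at target_level and the number of hops needed to reach it."""
    hops = 0
    while LEVEL[x] != target_level:
        x = PARENT[x]  # raises KeyError if target_level is not above x
        hops += 1
    return x, hops

def dist1_plus_dist2(x, y, same_level_dist=identity_distance):
    x_y, dist1 = climb(x, LEVEL[y])  # dist1: x to its ancestor in y's level
    dist2 = same_level_dist(x_y, y)  # dist2: both values are now in the same level
    return dist1 + dist2

print(dist1_plus_dist2("Athens", "France"))  # 1 hop to Greece, then Greece vs France
print(dist1_plus_dist2("Athens", "Greece"))  # y is an ancestor of x, so dist2 = 0
```

Passing a different `same_level_dist`, for example a path or highway distance, changes only the second term and leaves the lifting step untouched.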

Distances in different levels of Hierarchy
$dist_3 + dist_4$
$dist_3 + dist_4 = dist\left(y,\ f(desc_{L_y}^{L_x}(y))\right) + dist\left(f(desc_{L_y}^{L_x}(y)),\ x\right)$
$f$: a function that picks a descendant
where $dist\left(f(desc_{L_y}^{L_x}(y)),\ x\right)$ is a distance of two values from the same level of hierarchy
Special case: if $y$ is an ancestor of $x$ and $f$ picks $x$ itself, then $dist_4 = 0$

Distances in different levels of Hierarchy
With respect to the detailed level
Let $x \in L_x$ and $y \in L_y$, with
$x_1 = f(desc_{L_x}^{L_1}(x))$
$y_1 = f(desc_{L_y}^{L_1}(y))$
$dist(x,y) = dist(x, x_1) + dist(x_1, y_1) + dist(y, y_1)$
where $dist(x_1, y_1)$ is a distance of two values from the same level of hierarchy

Distances in different levels of Hierarchy
With respect to their common ancestor
Let $L_z$ be the level of hierarchy where $x$ and $y$ have their first common ancestor
$dist(x,y) = dist\left(x,\ anc_{L_x}^{L_z}(x)\right) + dist\left(anc_{L_y}^{L_z}(y),\ y\right)$
Options for measuring each term:
number of hops needed to reach the first common ancestor
normalizing according to the height of the level $L_z$

Distances in different levels of Hierarchy
Highway distance
Let every $L_i$ be clustered into $k_i$ clusters, where every cluster has its own representative $r_{k_i}$
$dist(x,y) = dist(x, r_x) + dist(r_x, r_y) + dist(r_y, y)$
Attribute based
Level attributes: each value $v \in dom(L)$ is described by a tuple of attribute values $[v_1, \ldots, v_n]$
The distance can be defined with respect to the attributes

Types of Levels
Nominal ($=$, $\ne$)
values hold the distinctness property
values can be explicitly distinguished
Ordinal ($<$, $>$)
values hold the distinctness property & the order property
values abide by an order
Interval ($+$, $-$)
values hold the distinctness, order & addition properties
a unit of measurement exists
the difference between two values is meaningful