Supporting Material for Diploma and MSc Theses. P. Vassiliadis

RADAR: Radial Applications Depiction Around Relations For Data-Centric Ecosystems Panos Vassiliadis http://www.cs.uoi.gr/~pvassil/publications/2011_dali/index.html

Data-centric ecosystems & the need for a map
[Figure: data-centric ecosystem; labels: Act1, Act2, Act3, Act4, Act5, WWW]

Graph model
The ecosystem is a bipartite graph G(V,E)
  Nodes: relations and queries
  Edges: query q uses a relation r in any way
Simplest possible model: we only care about the usage of a relation by a query
For the future
  Query semantics
  Views & constraints
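
The graph model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the query and relation names are hypothetical, and "fan-out" is assumed here to mean the number of relations a query uses, which the slides do not define explicitly.

```python
from collections import defaultdict

# Hypothetical ecosystem: each query is mapped to the relations it uses.
query_uses = {
    "q1": {"EMP"},
    "q2": {"EMP", "DEPT"},
    "q3": {"DEPT", "PROJ", "EMP"},
}

def build_bipartite_graph(query_uses):
    """Return (relations, queries, edges) of the bipartite graph G(V,E)."""
    queries = set(query_uses)
    relations = set().union(*query_uses.values())
    # One edge per (query, relation) pair: "query q uses relation r".
    edges = {(q, r) for q, rels in query_uses.items() for r in rels}
    return relations, queries, edges

def fan_out(query_uses):
    """Number of relations each query uses (assumed meaning of fan-out)."""
    return {q: len(rels) for q, rels in query_uses.items()}

def queries_by_fan_out(query_uses):
    """Group query names by their fan-out value."""
    groups = defaultdict(list)
    for q, n in fan_out(query_uses).items():
        groups[n].append(q)
    return {n: sorted(qs) for n, qs in groups.items()}

relations, queries, edges = build_bipartite_graph(query_uses)
print(sorted(edges))
print(queries_by_fan_out(query_uses))
```

Grouping queries by fan-out is one plausible way to prepare the input for a layout that treats queries with fan-out 1, 2 and 3 differently.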

Fan-out = 1,2
Things are nice and calm in radar city
Observe the conflict resolution

Observe the angle
Fan-out = 1,2,3

Fan-out = 1,2,3
See how the concentric circles work
It's called RADAR, remember?

Data Warehouses, their refreshment and ETL Panos Vassiliadis

Data Warehouse Environment

Extract-Transform-Load (ETL)
[Figure: ETL pipeline; labels: Sources, Extract, DSA, Transform & Clean, Load, DW]

Importance
The ETL market grows steadily at approximately 20.1% per year, amounting to a $667 million market in 2001 (Giga 02)
ETL and data cleaning tools account for
  30% of the effort and expenses in the DW budget (Enterprise Information Portals)
  55% of the total costs of DW runtime (Inmon)
  80% of the development time in a DW project (Demarest)
ETL tools will not be replaced by EAI (Enterprise Application Integration) tools in the near future (Giga 02)
ETL tools will be used in other areas beyond DWs (Gartner 04)

ETL workflows
[Figure: example ETL workflow from the Sources through the DSA to the DW. Labels: data stores S1_PARTSUPP, S2_PARTSUPP, DS.PS_NEW1, DS.PS_NEW2, DS.PS_OLD1, DS.PS_OLD2, DS.PS1, DS.PS2, LOOKUP_PS, TIME, DW.PARTSUPP, V1, V2; activities FTP1, FTP2, DIFF1, DIFF2, Add_SPK1, Add_SPK2, SK1, SK2, $2, A2EDate, NotNULL, AddDate, CheckQTY, U, Aggregate1 (PKEY, DAY, MIN(COST)), Aggregate2 (PKEY, MONTH, AVG(COST)); rejected rows go to Log]

Value Incompatibility (example of surrogate keys)
R1
  ID  Descr
  10  Coke
  20  Pepsi
R2
  ID  Descr
  10  Pepsi
  20  Fanta
DW.R
  ID  Descr
  ??  ??
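
One way to see how surrogate keys resolve the conflict between R1 and R2 is the minimal sketch below. It is not taken from the slides: the function name and the lookup structure are hypothetical, and it assumes that rows with the same description denote the same real-world product.

```python
# Source relations from the slide: production key -> description.
R1 = {10: "Coke", 20: "Pepsi"}
R2 = {10: "Pepsi", 20: "Fanta"}

def assign_surrogate_keys(sources):
    """Build a lookup (source, production key) -> surrogate key.

    Assumption: rows with the same description are the same product.
    """
    skey_of_descr = {}  # description -> surrogate key
    lookup = {}         # (source name, production key) -> surrogate key
    for name, rows in sources.items():
        for pkey, descr in rows.items():
            # Reuse the key of a known description, else issue the next one.
            skey = skey_of_descr.setdefault(descr, len(skey_of_descr) + 1)
            lookup[(name, pkey)] = skey
    # The warehouse relation is keyed by surrogate key only.
    dw_r = {skey: descr for descr, skey in skey_of_descr.items()}
    return lookup, dw_r

lookup, dw_r = assign_surrogate_keys({"R1": R1, "R2": R2})
print(lookup)
print(dw_r)
```

In this sketch the production key 10 maps to different surrogate keys depending on its source, while Pepsi gets a single key in the warehouse regardless of where it came from.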

Data mappings?
Source 1: Personnel (Cobol)
  EMP ID  Name    DoB     Salary  Total Income  DeptID
  110     Kostas  1/1/72  1500    1200          132
Source 2: Accounting (DB2), EMP INCOME
  EMP ID  IL_ID  Amount
  110     10     1500
  110     30     300
Income Lookup
  IL_ID  Descr
  10     Salary
  20     Bonus 1
  30     Tax
  ...    ...
DW.EMP
  EMP ID  Name    Age
  110     Kostas  30
  120     Mitsos  48
  130     Roula   29

MS SSIS: SQL Server Integration Services

Talend Open Studio for Data Integration www.talend.com/download_form.php?cont=gen&src=homepage

Pentaho's Kettle http://kettle.pentaho.com/

OLAP & data cubes Panos Vassiliadis

OLAP
Concerns the analysis of measurable quantities (measures)
  sales, inventory, profit, ...
Dimensions: parameters that define the context of the measures
  date, product, location, salesperson, ...
Cubes: combinations of dimensions that determine some measures
A cube defines a multi-dimensional space of dimensions, with the measures being points in this space

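Cubes for OLAP
[Figure: data cube with the dimensions PRODUCT (Juice, Cola, Soap), MONTH (Jan) and a third dimension with values N, S, W; sample cell values 10 and 13]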

Basic OLAP Concepts
Data are considered to be stored in a multi-dimensional array, also called a cube or hypercube.
A cube is a group of data cells. Each cell is uniquely identified by the corresponding values of the cube's dimensions.
The contents of a cell are called measures and represent the measured values of the real world.
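
A minimal sketch of this idea is a mapping from dimension values to a measure. The dimension names, the cell values and the helper functions below are hypothetical and serve only to show that one value per dimension identifies exactly one cell.

```python
# Hypothetical cube: one coordinate per dimension, one measure per cell.
DIMENSIONS = ("product", "month", "region")

cube = {
    ("Juice", "Jan", "N"): 10,
    ("Cola", "Jan", "N"): 13,
    ("Cola", "Jan", "S"): 7,
}

def cell(cube, **coords):
    """Return the measure of the cell identified by one value per dimension."""
    key = tuple(coords[d] for d in DIMENSIONS)
    return cube.get(key)  # None if the cell is empty

def cells_where(cube, dimension, value):
    """Keep only the cells whose coordinate on `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {k: v for k, v in cube.items() if k[i] == value}

print(cell(cube, product="Cola", month="Jan", region="N"))
print(cells_where(cube, "region", "N"))
```

A sparse dictionary is used instead of a dense array because most combinations of dimension values typically have no data; this is a design choice of the sketch, not something the slides prescribe.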

Hierarchies of levels for OLAP
A dimension models all the ways in which data can be aggregated with respect to a specific parameter of their context.
  Date, Product, Location, Salesperson, ...
Each dimension has an associated hierarchy of levels for aggregating the data. This means that the dimension can be viewed at several levels of granularity.
  Date: day, week, month, year, ...

Level Hierarchies
Level hierarchies: each dimension is organized in different levels of granularity
The user can navigate from one level to another, creating new cubes each time
[Figure: hierarchy of the time dimension with the levels Year, Month, Week, Day]
Note on terminology: the Greek term αδρομέρεια (coarseness) is the opposite of detail and is the correct Greek term for granularity

Cubes & dimension hierarchies for OLAP
Sales volume
Dimensions: Product, Region, Date
Dimension hierarchies:
  Product: Industry, Category, Product
  Region: Country, Region, City, Store
  Date: Year, Quarter, Month / Week, Day

Tasks the user performs
Common operations on cubes
  Aggregations (total sales, percent-to-total)
  Comparisons (budget vs. expense)
  Sorting and ranking (top 10)
  Access to more detailed information
  Visualization in different ways

Roll up
Sales volume; dimension hierarchies: Industry, Category, Product; Country, Region, City, Store; Year, Quarter, Month / Week, Day
Time: level Quarter
  Store   Product      Q1     Q2
  Store1  Electronics  $5.2   $8.9
  Store1  Toys         $1.9   $0.75
  Store1  Clothing     $2.3   $4.6
  Store1  Cosmetics    $1.1   $1.5
  Store2  Electronics  $5.6   $7.2
  Store2  Toys         $1.4   $0.4
  Store2  Clothing     $2.6   $4.6
  Store2  Cosmetics    $1.1   $0.5
Time: level Year, SUM(Sales volumes)
  Store   Product      1996
  Store1  Electronics  $14.1
  Store1  Toys         $2.65
  Store1  Clothing     $6.9
  Store1  Cosmetics    $2.6
  Store2  Electronics  $12.8
  Store2  Toys         $1.8
  Store2  Clothing     $7.2
  Store2  Cosmetics    $1.6
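
The roll-up from Quarter to Year can be reproduced with a short sketch. The cell values are the ones on the slide; the dictionary layout, the `QUARTER_TO_YEAR` mapping and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from decimal import Decimal

# Quarter-level cells from the slide: (store, product, quarter) -> sales volume.
quarterly = {
    ("Store1", "Electronics", "Q1"): Decimal("5.2"),
    ("Store1", "Electronics", "Q2"): Decimal("8.9"),
    ("Store1", "Toys", "Q1"): Decimal("1.9"),
    ("Store1", "Toys", "Q2"): Decimal("0.75"),
    ("Store1", "Clothing", "Q1"): Decimal("2.3"),
    ("Store1", "Clothing", "Q2"): Decimal("4.6"),
    ("Store1", "Cosmetics", "Q1"): Decimal("1.1"),
    ("Store1", "Cosmetics", "Q2"): Decimal("1.5"),
    ("Store2", "Electronics", "Q1"): Decimal("5.6"),
    ("Store2", "Electronics", "Q2"): Decimal("7.2"),
    ("Store2", "Toys", "Q1"): Decimal("1.4"),
    ("Store2", "Toys", "Q2"): Decimal("0.4"),
    ("Store2", "Clothing", "Q1"): Decimal("2.6"),
    ("Store2", "Clothing", "Q2"): Decimal("4.6"),
    ("Store2", "Cosmetics", "Q1"): Decimal("1.1"),
    ("Store2", "Cosmetics", "Q2"): Decimal("0.5"),
}

# One step of the time hierarchy: both quarters belong to 1996 on the slide.
QUARTER_TO_YEAR = {"Q1": 1996, "Q2": 1996}

def roll_up(cells, parent_of):
    """Aggregate with SUM from the Quarter level to its parent level."""
    totals = defaultdict(Decimal)  # Decimal() is 0
    for (store, product, quarter), volume in cells.items():
        totals[(store, product, parent_of[quarter])] += volume
    return dict(totals)

yearly = roll_up(quarterly, QUARTER_TO_YEAR)
for key in sorted(yearly):
    print(key, yearly[key])
```

`Decimal` is used rather than `float` so that the sums match the figures on the slide exactly instead of carrying binary rounding noise.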

Privacy for published data Panos Vassiliadis

Model of data publishing
Deborah, a star DBA & a TRUSTED data publisher
Detailed Data T
Anonymized Data T*
Ben, the benevolent (& intelligent) data miner
Bob (the victim) to be hidden
Alice, the attacker (a.k.a. the adversary)

Identifier(s): attribute(s) that explicitly reveal the identity of a person (name, SSN, ...). These attributes are removed from the public data set
Quasi-identifier: attribute(s) that, if joined with external data, can reveal sensitive information
Sensitive attribute: an attribute containing the values that should be kept private

Fundamental anonymization technique: hide individuals in groups of similar values!
Here: each individual is hidden in a group whose values are generalizations of the specific values of the data set

K-anonymity
Assume a relation R and a public attribute Q
  SELECT Q, count(*) FROM R GROUP BY Q
If every group formed has at least k tuples, then the data set R is k-anonymous wrt. Q
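
The GROUP BY test above can be mirrored in Python as a rough sketch. The sample rows and attribute names are hypothetical, and the check uses "at least k tuples per group", the usual definition of k-anonymity.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """True if every group of equal quasi-identifier values has >= k rows."""
    # Equivalent of: SELECT Q, count(*) FROM R GROUP BY Q
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(size >= k for size in groups.values())

# Hypothetical, already generalized data set R.
R = [
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "130**", "age": "20-29", "disease": "cold"},
    {"zip": "148**", "age": "30-39", "disease": "flu"},
    {"zip": "148**", "age": "30-39", "disease": "asthma"},
    {"zip": "148**", "age": "30-39", "disease": "cold"},
]

print(is_k_anonymous(R, ["zip", "age"], 2))
print(is_k_anonymous(R, ["zip", "age"], 3))
```

With these rows the smallest group has two tuples, so the data set passes for k = 2 and fails for k = 3.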

K-anonymity: what and how
To achieve anonymity we perform two operations:
  Suppress all values that cannot fit in a group of size at least k
  Generalize the common values of the members of the same group to a more abstract value
The ultimate goal is to find the anonymization scheme that minimizes suppression and generalization while guaranteeing k-anonymity!
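
The interplay of the two operations can be illustrated with a small sketch. It assumes a hypothetical zip code hierarchy in which each level masks one more trailing digit; the sample zip codes and function names are likewise made up for the example.

```python
from collections import Counter

def generalize_zip(zip_code, level):
    """Mask the last `level` digits with '*' (assumed hierarchy)."""
    if level == 0:
        return zip_code
    return zip_code[:-level] + "*" * level

def anonymize(zips, level, k):
    """Generalize every zip to `level`, then suppress groups smaller than k.

    Returns (kept generalized values, number of suppressed tuples).
    """
    generalized = [generalize_zip(z, level) for z in zips]
    sizes = Counter(generalized)
    kept = [z for z in generalized if sizes[z] >= k]
    return kept, len(generalized) - len(kept)

zips = ["13053", "13068", "13068", "14850", "14853", "90210"]

for level in range(3):
    kept, suppressed = anonymize(zips, level, k=2)
    print(level, suppressed, sorted(set(kept)))
```

On this sample, higher generalization levels need fewer suppressed tuples, which is the trade-off the anonymization scheme has to balance.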

Generalization hierarchies
[Figures: Zip code hierarchy; Race hierarchy]

Lattice
The combination of hierarchy levels creates a lattice
Here: 3 dimensions, Age, Race, Zip
A node is characterized by the levels for each dimension
Node 412 means level 4 for age, level 1 for race, level 2 for zip
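
A lattice of this kind can be enumerated as sketched below. The maximum level per dimension is not stated on the slide, so `MAX_LEVEL` is a hypothetical choice (taken so that node 412 is the top node), and level 0 is assumed to mean no generalization.

```python
from itertools import product

# Hypothetical highest generalization level per dimension.
MAX_LEVEL = {"age": 4, "race": 1, "zip": 2}

def lattice_nodes(max_level):
    """All combinations of one level per dimension."""
    dims = list(max_level)
    ranges = [range(max_level[d] + 1) for d in dims]
    return [dict(zip(dims, levels)) for levels in product(*ranges)]

def label(node):
    """Label such as '412': level of age, then race, then zip."""
    return "".join(str(node[d]) for d in ("age", "race", "zip"))

def parents(node, max_level):
    """Nodes reached by generalizing exactly one dimension by one level."""
    result = []
    for d in node:
        if node[d] < max_level[d]:
            result.append({**node, d: node[d] + 1})
    return result

nodes = lattice_nodes(MAX_LEVEL)
print(len(nodes))
print([label(p) for p in parents({"age": 0, "race": 0, "zip": 0}, MAX_LEVEL)])
```

The `parents` function gives the edges of the lattice: moving along an edge generalizes one dimension by one level.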

A. Pilalidou's MSc + E. Kontogiannopoulou's Diploma
User input K = MaxSupp = H = [,, ]
We compute a histogram for each of the lattice's nodes
The algorithm checks whether the 3 constraints can be met by a node; else it suggests alternatives