Data Preprocessing


Data Preprocessing. Data Mining and Learning Algorithms, 1st Tutorial. Angeliki Skoura (skoura@ceid.upatras.gr)

2 The Knowledge Discovery Process
- Problem definition
- Data collection
- Data preprocessing
- Application of a knowledge-mining algorithm
- Interpretation of the results

3 Types of Data Sets
- Record data
  - Relational records (a relation is a set of tuples that have the same attributes; in a relational database, all data are stored and accessed via relations)
  - Data matrix: numerical matrix, crosstab (any table showing summary statistics)
  - Document data: text documents represented as term-frequency vectors
  - Transaction data
- Graph and network data
  - World Wide Web
  - Social networks
  - Molecular structures
- Ordered data
  - Video data: sequence of images
  - Temporal data: time series
  - Sequential data: transaction sequences
  - Genetic sequence data
- Spatial, image and multimedia data
  - Spatial data: maps
  - Image data
  - Video data

4 Data Objects
- Data sets are made up of data objects.
- A data object represents an entity. Examples:
  - sales database: customers, store items, sales
  - medical database: patients, treatments
  - university database: students, professors, courses
- Data objects are also called samples, examples, instances, data points, objects, or tuples.
- Data objects are described by attributes.
Storage options:
- Database format: rows correspond to data objects, columns to attributes.
- Flat file in a delimited format (tab, comma); e.g., Weka uses a comma-delimited format.

5 Types of Variables
Data classified by scale of measurement:
- Qualitative (categorical) variables: the values are levels or categories, e.g., sex, education level, region of origin
  - Nominal: categories whose order does not matter, e.g., color, means of transport
  - Ordinal: categories whose order matters, e.g., severity, opinion
- Quantitative variables (measurements): numerical values expressed in a unit of measurement, e.g., age
  - Discrete
  - Continuous

6 Why Preprocess the Data?
No quality data, no quality mining results! High-quality knowledge-mining results require high-quality data; e.g., duplicates or missing data may lead to wrong or misleading conclusions.
More specifically, real-world data are usually dirty:
- Incomplete: attribute values are missing, e.g., occupation = "" (missing data)
- Noisy: they contain noise, errors, or outliers, e.g., Salary = "-10" (an error)
- Inconsistent: they contain discrepancies in codes or names, e.g.,
  - Age = 42, Birthday = 03/07/1997
  - Was rating 1, 2, 3, now rating A, B, C
After preprocessing, data warehouses should contain consistent, integrated, high-quality data.

7 Main Preprocessing Steps
- A. Data cleaning: fill in missing values, remove noise, remove outliers, correct inconsistencies, eliminate redundancy
- B. Data integration: integrate multiple databases, data cubes, or files; eliminate redundancy
- C. Data transformation and data discretization: normalization, conversion of numerical values to nominal ones
- D. Data reduction: dimensionality reduction, numerosity reduction, data compression


9 A. Data Cleaning: Filling In Missing Values
Data cleaning covers four tasks: filling in missing values, identifying outliers and smoothing the data, correcting inconsistencies, and eliminating the redundancy that results from data integration.
Options for handling missing values:
- Ignore the tuple: usually done when the class label is missing (when performing classification); not effective when attributes have a considerable number of missing values.
- Fill in the missing value manually: usually tedious and often infeasible.
- Fill it in automatically with
  - a global constant, e.g., "unknown"
  - the attribute mean
  - the attribute mean over all samples belonging to the same class (smarter)
  - the most probable value, obtained by inference, e.g., with a Bayesian formula or a decision tree
Missing data may have to be inferred from the existing data using inference techniques.
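The two mean-based strategies listed above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the use of `None` as the missing-value marker, and the toy income data are assumptions, not part of the tutorial.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of the observed values of the same class."""
    by_class = {}
    for value, label in zip(values, labels):
        if value is not None:
            by_class.setdefault(label, []).append(value)
    class_means = {label: mean(vs) for label, vs in by_class.items()}
    return [class_means[label] if value is None else value
            for value, label in zip(values, labels)]


# Hypothetical toy data: one attribute with two missing entries, plus class labels.
incomes = [30, None, 50, 100, None, 120]
labels = ["low", "low", "low", "high", "high", "high"]

filled_global = fill_with_mean(incomes)
filled_by_class = fill_with_class_mean(incomes, labels)
```

The class-conditional version fills the gaps with values typical of each class, which is why the slide calls it the smarter option.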

10 A. Data Cleaning: Identifying Outliers and Smoothing the Data
- Binning: first sort the data and partition them into (equal-frequency) bins; then smooth by bin means, by bin medians, by bin boundaries, etc.
- Regression: smooth the data by fitting them to regression functions.
- Clustering: detect and remove outliers.

11 Data Smoothing: Binning Method
Sorted data for temperature (in °C): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means (the means 9, 22.75 and 29.25, rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer of the two bin boundaries):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
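The binning example can be reproduced with a short Python sketch. The helper names are hypothetical, the partitioning assumes that the number of values is a multiple of the number of bins, and ties between the two boundaries are resolved towards the lower one (an assumption; no tie occurs in this data).

```python
def equal_depth_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


temperatures = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(temperatures, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```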

12 Data Smoothing and Outlier Detection
[Figures: clustering; linear regression]

13 A. Data Cleaning: Correcting Inconsistencies
- Data discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
  - Check for field overloading
  - Check the uniqueness rule, the consecutive rule, and the null rule
  - Use commercial tools
    - Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
    - Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
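Some of these checks are easy to automate. The sketch below shows one possible reading of a uniqueness check, a null-marker check, and a metadata range check; the record layout, the field names, and the set of null markers are all assumptions made for illustration.

```python
from collections import Counter


def uniqueness_violations(records, field):
    """Return the values of `field` that occur in more than one record."""
    counts = Counter(r[field] for r in records)
    return sorted(value for value, n in counts.items() if n > 1)


def null_violations(records, field, null_markers=(None, "", "?")):
    """Return the indices of records whose `field` holds a null marker."""
    return [i for i, r in enumerate(records) if r.get(field) in null_markers]


def range_violations(records, field, low, high):
    """Return the indices of records whose `field` lies outside [low, high]."""
    return [i for i, r in enumerate(records)
            if r.get(field) is not None and not (low <= r[field] <= high)]


# Hypothetical records with a duplicated id, an empty postal code and a negative salary.
records = [
    {"id": 1, "zip": "26504", "salary": 1200},
    {"id": 2, "zip": "", "salary": -10},
    {"id": 2, "zip": "10431", "salary": 900},
]

duplicate_ids = uniqueness_violations(records, "id")
missing_zip = null_violations(records, "zip")
bad_salary = range_violations(records, "salary", 0, 1_000_000)
```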

14 A. Data Cleaning: Eliminating Redundancy
Redundancy may be present in the data from the start, or it may arise after data integration. For ways to handle it, see the following slides.


16 B. Data Integration
- Data integration: combines data from multiple sources into a coherent store.
- Schema integration: integrate metadata from different sources, e.g., A.cust ≡ B.cust-#.
- Entity identification problem: identify real-world entities across multiple data sources, e.g., Bill Newton = William Newton.
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources differ.
  - Possible reasons: different representations or different scales, e.g., metric vs. British units.

17 Handling Redundancy in Data Integration
- Redundant data often occur when multiple databases are integrated.
  - Object identification: the same attribute or object may have different names in different databases.
  - Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue.
- Redundant attributes can often be detected by correlation analysis.

18 Correlation Analysis
- Categorical data: χ² test, Fisher's exact test
- Numerical data: covariance, t-test, ANOVA

19 Correlation Analysis (for Categorical Data)
χ² (chi-square) test: the larger the χ² value, the more likely the variables are related.
Example: Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the Pharmacology Honours class over the past ten years there have been 80 females and 40 males. Is this a significant departure from expectation? According to the χ² table, the critical value for p = 0.05 and 1 degree of freedom is 3.84.

20 Chi-Square Test Example

                        Female   Male    Total
Observed numbers (O)    80       40      120
Expected numbers (E)    60       60      120
O - E                   20       -20     0
(O - E)²                400      400
(O - E)² / E            6.67     6.67    13.33 = X²

Procedure:
- Set out a table as shown above, with the "observed" numbers and the "expected" numbers (i.e., our null hypothesis).
- Subtract each "expected" value from the corresponding "observed" value (O - E).
- Square the O - E values and divide each by the relevant "expected" value to give (O - E)² / E.
- Add all the (O - E)² / E values and call the total X².
- Compare X² with the χ² value in a χ² table with n - 1 degrees of freedom, where n is the number of categories (2 in our case: males and females), so we have one degree of freedom.
- If the calculated X² exceeds the critical χ² value, the difference from the expectation is significant.
Result: our calculated X² (13.33) exceeds even the tabulated χ² value (10.83) for p = 0.001, which shows an extreme departure from expectation. It is still possible that we got this result by chance, but only with a probability of less than 1 in 1000, so we can be 99.9% confident that some factor leads to a "bias" towards females entering Pharmacology Honours.
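The calculation can be checked with a few lines of Python. The function name is hypothetical, and the critical values are simply copied from the χ² table for 1 degree of freedom; the statistic comes out at roughly 13.33 (400/60 + 400/60).

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [80, 40]   # females, males
expected = [60, 60]   # null hypothesis: 1:1 ratio among 120 students

x2 = chi_square_statistic(observed, expected)

# Critical values for 1 degree of freedom, taken from the chi-square table.
CRITICAL_P05 = 3.84
CRITICAL_P001 = 10.83

significant_at_p05 = x2 > CRITICAL_P05
significant_at_p001 = x2 > CRITICAL_P001
```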

21 Table of Chi-Square Critical Values

Degrees of    Probability p
freedom       0.99     0.95     0.05     0.01     0.001
1             0.000    0.004    3.84     6.64     10.83
2             0.020    0.103    5.99     9.21     13.82
3             0.115    0.352    7.82     11.35    16.27
4             0.297    0.711    9.49     13.28    18.47
5             0.554    1.145    11.07    15.09    20.52
6             0.872    1.635    12.59    16.81    22.46
7             1.239    2.167    14.07    18.48    24.32
8             1.646    2.733    15.51    20.09    26.13

- A significant difference from your null hypothesis (i.e., a difference from your expectation) is indicated when your calculated X² value is greater than the χ² value shown in the 0.05 column of this table (i.e., there is only a 5% probability that your calculated X² value would occur by chance).
- You can be even more confident if your calculated value exceeds the χ² values in the 0.01 or 0.001 probability columns.
- If your calculated X² value is equal to, or less than, the tabulated χ² value for 0.95, then your results give you no reason to reject the null hypothesis (the expectation).

22 Correlation Analysis (for Numerical Data)
- In probability theory and statistics, the mathematical descriptions of covariance and correlation are very similar. Both provide a measure of the strength of the relationship between two or more sets of random variables.
- Correlation and covariance measure only the linear relationship between variables.
- The covariance of two random variables X and Y, each with N observations, is defined as
  $\operatorname{cov}(X, Y) = \langle (X - \mu_X)(Y - \mu_Y) \rangle = \langle XY \rangle - \mu_X \mu_Y$,
  where $\mu_X = \langle X \rangle$ and $\mu_Y = \langle Y \rangle$ are the respective means.
- Written out explicitly for the N observed pairs $(x_i, y_i)$:
  $\operatorname{cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$
- For uncorrelated variables, $\operatorname{cov}(X, Y) = \langle XY \rangle - \mu_X \mu_Y = \langle X \rangle \langle Y \rangle - \mu_X \mu_Y = 0$.
- If the variables are correlated in some way, their covariance is nonzero. If $\operatorname{cov}(X, Y) > 0$, then Y tends to increase as X increases; if $\operatorname{cov}(X, Y) < 0$, then Y tends to decrease as X increases.
- Note that while statistically independent variables are always uncorrelated, the converse is not necessarily true.
- In the special case Y = X, the covariance reduces to the usual variance.

23 Covariance Example
Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
Answer:
- E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
- E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
- Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14) / 5 - 4 × 9.6 = 42.4 - 38.4 = 4
Thus, A and B rise together, since Cov(A, B) > 0.
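As a cross-check of the stock example, here is a minimal Python sketch of the population covariance (dividing by N, as the example does). The function name is an assumption.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]

cov_ab = covariance(stock_a, stock_b)
# Special case Y = X: the covariance equals the variance of A.
var_a = covariance(stock_a, stock_a)
```

Both forms, the mean of the products of deviations and ⟨AB⟩ - E(A)E(B), give the same value for this data.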


25 C. Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: values are scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes are constructed from the given ones

26 Normalization Techniques
Purpose of normalization: map the data values from the interval [min_A, max_A] to the interval [new_min_A, new_max_A].
- Min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}$, which maps the values to [0, 1]. There are also variants of min-max normalization in which the target interval [new_min_A, new_max_A] is not necessarily [0, 1].
- Decimal scaling: used when the data come from sources whose value ranges differ by a logarithmic factor (orders of magnitude). For example, one source has values in [0, 1] and another has values in [0, 1000]; decimal scaling is the technique used in this case.
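The slide does not spell out how decimal scaling works. The sketch below assumes the usual textbook definition, v' = v / 10^j with j the smallest integer such that max |v'| < 1; the function name and the sample values are hypothetical.

```python
def decimal_scaling(values):
    """Divide all values by 10**j, with j the smallest integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Hypothetical attribute whose values range from -986 to 917.
scaled, exponent = decimal_scaling([-986, 120, 917])
```

For this data the exponent is 3, so -986 becomes -0.986 and 917 becomes 0.917.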

27 Min-Max Normalization Example
Consider data ranging from 30 to 50 that we want to transform so that they range from 0 to 1, using min-max normalization.
- The element s = 30 is mapped to s' = (30 - 30) / (50 - 30) = 0
- The element s = 50 is mapped to s' = (50 - 30) / (50 - 30) = 1
- The intermediate element s = 35 is mapped to s' = (35 - 30) / (50 - 30) = 5/20 = 0.25
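A small Python sketch of min-max normalization reproduces these three values. It also supports a target interval other than [0, 1]; the parameter names `new_min` and `new_max` and the function name are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        raise ValueError("all values are equal; min-max normalization is undefined")
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


data = [30, 35, 50]
unit_range = min_max_normalize(data)                    # target interval [0, 1]
custom_range = min_max_normalize(data, new_min=-1.0, new_max=1.0)
```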

29 Data Discretization
- Some data mining methods can handle nominal values internally.
- Discretization = conversion of numeric values to nominal ones
  - The range of an attribute is divided into intervals.
  - The interval labels then replace the actual data values.
- Typical discretization methods (all of them can be applied recursively):
  - Binning
  - Histogram analysis
  - Clustering analysis
Note: In contrast, other methods (regression, nearest neighbor) require numeric values only. To use nominal fields with such methods, the nominal values must first be converted to numeric ones.
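As an illustration of the binning approach to discretization, the sketch below divides the range of an attribute into equal-width intervals and replaces each value by the label of its interval. Equal-width binning is one possible choice, and the function name, labels, and data are assumptions.

```python
def equal_width_discretize(values, labels):
    """Replace each numeric value by the label of its equal-width interval."""
    low, high = min(values), max(values)
    k = len(labels)
    width = (high - low) / k
    if width == 0:
        return [labels[0]] * len(values)
    result = []
    for v in values:
        # The maximum value would fall just past the last interval; clamp it.
        index = min(int((v - low) / width), k - 1)
        result.append(labels[index])
    return result


temperatures = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
nominal = equal_width_discretize(temperatures, ["low", "medium", "high"])
```

Here the range 4 to 34 is split into the intervals [4, 14), [14, 24) and [24, 34].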


31 D. Data Reduction
- Problem: large data warehouses may hold terabytes of data, so complex data analysis and knowledge mining can take a very long time.
- Solution: data reduction, i.e., keeping a representation of the data that is much smaller in volume yet yields the same or similar analysis results.
- Strategies:
  - Data cube aggregation
  - Dimension reduction
  - Instance selection
  - Value discretization
  - Data compression
  - Numerosity reduction

32 Data Cube Aggregation
- The lowest level of a data cube: the aggregated data for an individual entity of interest.
- Multiple levels of aggregation in data cubes: further reduce the size of the data to be processed.
- Reference appropriate levels: use the smallest amount of information that suffices to solve the problem.

33 Dimensionality Reduction
It can be achieved with two kinds of methods:
- Feature selection: select a minimal number m of features with which we can obtain results equivalent or close to those we would obtain if we had kept all n features for the analysis. Ideally m ≪ n.
- Feature transformation: the best-known example is Principal Component Analysis (PCA). The transformation creates a new set of features with fewer dimensions than the original one, without losing the essential information. It is also often used for data visualization.

34 Instance Selection
Instance selection can be achieved with two types of methods:
- Sampling methods
  - Random sampling: randomly select m instances from the n initial instances.
  - Stratified sampling: randomly select m instances from the n initial instances such that the class distribution is maintained in the selected sample.
- Search-based methods
  - Search for representative instances in the data, based on some criterion, and remove the remaining instances.
  - Use statistical measures (number of instances, mean, or standard deviation) to replace redundant instances with representative pseudo-instances.
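The two sampling methods can be sketched with the standard `random` module. The function names, the fixed seed, and the sampling fraction are assumptions; the stratified variant samples the same fraction from every class so that the class distribution is preserved.

```python
import random
from collections import defaultdict


def random_sample(instances, m, seed=0):
    """Select m instances uniformly at random, without replacement."""
    return random.Random(seed).sample(instances, m)


def stratified_sample(instances, labels, fraction, seed=0):
    """Sample the same fraction from every class; returns (instance, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance, label in zip(instances, labels):
        by_class[label].append(instance)
    sample = []
    for label, members in by_class.items():
        k = max(1, round(len(members) * fraction))  # keep at least one per class
        sample.extend((instance, label) for instance in rng.sample(members, k))
    return sample


# Hypothetical data set: 100 instances, 80 of class "a" and 20 of class "b".
instances = list(range(100))
labels = ["a"] * 80 + ["b"] * 20

simple = random_sample(instances, 10)
stratified = stratified_sample(instances, labels, fraction=0.1)
```

With a 10% fraction the stratified sample contains 8 instances of class "a" and 2 of class "b", matching the original 80:20 ratio, whereas a simple random sample may drift from it.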

35 Data Compression

36 Data Compression: Huffman
- Huffman coding is a lossless compression algorithm.
- A Huffman encoder takes a block of input characters of fixed length and produces a block of output bits of variable length; it is a fixed-to-variable length code.
- The design of the Huffman code is optimal (for a fixed block length), assuming that the source statistics are known a priori.
- A Huffman code is designed by merging the two least probable characters and repeating this process until only one character remains. This generates a code tree, and the Huffman code is obtained from the labeling of the code tree.
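The merging procedure can be sketched with a priority queue. This illustration makes several assumptions: it works on single characters (block length 1), estimates the source statistics from the input text itself, and labels left branches 0 and right branches 1; the function name is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix code by repeatedly merging the two least frequent subtrees."""
    frequencies = Counter(text)
    if not frequencies:
        return {}
    # Heap entries are (weight, tie_breaker, tree); a tree is either a single
    # symbol or a (left, right) pair. The unique tie_breaker keeps Python from
    # ever comparing two trees directly.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    assign(heap[0][2], "")
    return codes


message = "abracadabra"
codes = huffman_code(message)
encoded = "".join(codes[ch] for ch in message)
```

Frequent characters end up near the root of the tree and therefore receive short codewords, which is where the compression comes from.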

37 Data Compression: Huffman
An animated GIF of the procedure is available at http://www.data-compression.com/lossless.shtml

38 Signal Compression: Wavelet Transforms
- A technique that is applied to a vector D and transforms it into a numerically different vector D' of the same length.
- It is mainly used for compressing time series.
- Two examples of wavelet transforms:
  - Daubechies transform
  - Haar transform
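A minimal sketch of the Haar transform follows, using the simple averaging and differencing form (not the orthonormal scaling) and assuming that the input length is a power of two. The helper `keep_largest`, which zeroes all but the k largest coefficients, is a hypothetical illustration of how the transformed vector D' can be compressed.

```python
def haar_transform(data):
    """Full Haar decomposition by repeated pairwise averaging and differencing."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("the length must be a power of two")
    output = [float(v) for v in data]
    length = n
    while length > 1:
        half = length // 2
        averages = [(output[2 * i] + output[2 * i + 1]) / 2 for i in range(half)]
        details = [(output[2 * i] - output[2 * i + 1]) / 2 for i in range(half)]
        # Averages go to the front and are decomposed further in the next pass.
        output[:length] = averages + details
        length = half
    return output


def keep_largest(coefficients, k):
    """Zero out all but the k coefficients of largest magnitude."""
    order = sorted(range(len(coefficients)),
                   key=lambda i: abs(coefficients[i]), reverse=True)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coefficients)]


series = [2, 2, 0, 2, 3, 5, 4, 4]
coefficients = haar_transform(series)
compressed = keep_largest(coefficients, 2)
```

The transformed vector has the same length as the input, but most of the signal is concentrated in a few coefficients, so the remaining ones can be dropped at a small loss of accuracy.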

39 Numerosity Reduction
- Parametric methods
  - A model (or a function) is used to estimate the data, so only the model parameters are stored instead of the data.
  - Log-linear models, which approximate discrete multidimensional probability distributions.
- Non-parametric methods
  - Histograms
  - Clustering
  - Sampling
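As an illustration of the parametric idea, the sketch below fits a straight line by ordinary least squares and keeps only its two parameters instead of the data points. Simple linear regression is chosen here as an example model; the function names and the data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate(x, slope, intercept):
    """Reconstruct an approximate y value from the stored parameters."""
    return slope * x + intercept


# Hypothetical measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)      # only these two numbers are stored
approx_y3 = estimate(3, slope, intercept)
```

Five data points are replaced by two parameters; the reconstruction is approximate, so this is a lossy reduction.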

40 End of Presentation
Questions?