Lecture Notes for Chapter 8

Data Mining: Cluster Analysis. Lecture Notes for Chapter 8

Clustering
- Target: divide the data into a set of groups (clusters) based on similarity.
- Similar samples are grouped together, while dissimilar samples are placed in different clusters.
- Input dataset: unlabeled data. Only the feature values are available: $X = \{x_1, x_2, \dots, x_N\}$.
- Unsupervised learning.

Examples (I)
[Figure: input dataset and clustering result]

Examples (II)

Advantages of Clustering
- Clustering for data understanding: dynamically discover categories in the data.
- Clustering assists:
  - information retrieval
  - efficient finding of nearest neighbors
  - summarization of data
  - compression (vector quantization)

Goal of Cluster Analysis
- Objects that belong to the same cluster are more similar to each other, and at the same time differ from the objects that belong to other clusters.
- The greater the similarity among members of the same cluster (intra-cluster) and the greater the difference between clusters (inter-cluster), the better the clustering.
- There is an objective difficulty.

Cluster Analysis (cont.)
- Which is the optimum number of clusters?
[Figure: the same dataset shown as a solution with 2 clusters, with 4 clusters and with 6 clusters]

Clustering approaches
- Partitioning (model-free) methods: divide the input dataset into non-overlapping subsets (clusters). Any point belongs exclusively to a single cluster.
- Hierarchical methods: build a hierarchy of clusters organized in a tree structure.
- Similarity-based methods: use a similarity matrix of the data and perform a spectral analysis of it (graph-based clustering).
- Model-based methods: every cluster is described by a parametric model. During learning, the model parameters are estimated in order to fit the data. Any point belongs to several clusters with different degrees.

Exclusive vs. Non-exclusive (Overlapping) Clustering
- Exclusive: any point belongs exclusively to a single cluster.
- Overlapping: any point may belong to several clusters with different degrees, e.g. probabilistic clusters (probability of belonging) or fuzzy clusters (membership value).

Complete vs. Partial Clustering
- Complete: clustering is performed on all data.
- Partial: some examples may not participate in the clustering procedure, either because they do not belong to well-shaped clusters, or because they are noisy data or outliers and may negatively affect the clustering.

Types of Clusters
- Well-separated clusters: any point is more similar to all points of its own cluster than to points of other clusters.
- Prototype-based or center-based clusters: the distance of any point from the center of the cluster it belongs to is smaller than its distances from the centers of the other clusters. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of the cluster.

Types of Clusters (cont.)
- Geometric clusters: clusters have geometric properties, i.e. there are geometric rules that identify which data points belong to them, for example hyperplanes or hyperspheres that surround a cluster region.
- Graph-based clusters: graph representation of the data, where data points are vertices that are connected only to other points (vertices) of the same cluster. Clusters as cliques.

Types of Clusters (cont.)
- Density-based clusters: a cluster is a region of high density that surrounds similar points and is separated from other clusters by regions of low density.
- Property or conceptual clusters: a cluster is a set of data sharing a common property (e.g. distance, geometry).

[1]. Partitioning Clustering: finding cluster representatives
- Every cluster $\Omega_j$ is described by a representative ($\mu_j$) that describes it uniquely.
- Summarization of data: the representative summarizes all cluster members, reducing the dataset to a set of cluster representatives.
- Compression of information: vector quantization.
- Useful for text, images, sound and video (text, video or sound summarization by keeping the most characteristic topics, paragraphs or scenes).

[Figure: 3 cluster representatives found on a two-dimensional dataset (iteration 6)]

[Figure: cluster representatives found on a second, larger two-dimensional dataset]

Partitioning method
- Decision mechanism: a pattern $x_i$ belongs exclusively to the cluster $j^*$ whose representative is closest to it, i.e. has the minimum distance (or the highest similarity):
$$j^* = \arg\min_{j=1,\dots,K} d(x_i, \mu_j)$$
- Learning goal: estimate the proper values of the representatives $\{\mu_j\}_{j=1}^{K}$ given an input set of examples.
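The decision mechanism above can be sketched in a few lines of Python. This is only an illustration: the function names `sq_dist` and `closest_cluster`, the use of the squared Euclidean distance for $d$, and the sample points are assumptions, not part of the lecture.

```python
def sq_dist(x, mu):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def closest_cluster(x, representatives, dist=sq_dist):
    """Index j* of the representative with minimum distance to x."""
    return min(range(len(representatives)),
               key=lambda j: dist(x, representatives[j]))


# Hypothetical representatives and patterns, chosen only for illustration.
representatives = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
patterns = [(0.5, 0.2), (4.0, 6.0), (1.0, 4.0)]
labels = [closest_cluster(x, representatives) for x in patterns]
print(labels)  # [0, 1, 2]
```

Passing a different `dist` function (or the negative of a similarity) changes the notion of "closest" without touching the decision rule.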

The K-means algorithm (MacQueen, 1967)
- Input dataset: $X = \{x_1, \dots, x_N\}$.
- Goal: division of the set $X$ into $K$ clusters and discovery of their representatives $\{\mu_j\}_{j=1}^{K}$ ($K$: known).
- Cluster representatives: the means (centers) of the data that belong to the same cluster.
- Rule: every point belongs to the cluster whose center is at minimum distance from it.
- Objective function:
$$E = \sum_{i=1}^{N} \min_{j=1,\dots,K} d(x_i, \mu_j)$$

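The K-means algorithm (MacQueen, 1967)
Initialization ($t = 0$) of the $K$ means $\mu_j^{(0)}$, $j = 1, \dots, K$, with
$$E^{(0)} = \sum_{i=1}^{N} \min_{j=1,\dots,K} d(x_i, \mu_j^{(0)})$$
Iteratively:
1. Assign the patterns to $K$ clusters according to their distance from the current cluster means:
$$q_i = \arg\min_{j=1,\dots,K} d(x_i, \mu_j^{(t)}), \quad i = 1, \dots, N$$
2. Update the $K$ means and the objective:
$$\mu_j^{(t+1)} = \frac{1}{|\{i : q_i = j\}|} \sum_{i : q_i = j} x_i, \qquad E^{(t+1)} = \sum_{i=1}^{N} \min_{j=1,\dots,K} d(x_i, \mu_j^{(t+1)})$$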

The K-means algorithm (MacQueen, 1967)
Termination criterion:
- The cluster centers stop changing between successive steps: STOP if $\mu_j^{(t)} = \mu_j^{(t-1)}$ for all $j$, or
- The objective function stops changing between successive steps: STOP if $\Delta E = |E^{(t)} - E^{(t-1)}| \approx 0$.
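The initialization, the two iterative steps and the termination test can be put together as in the following minimal sketch. It assumes the squared Euclidean distance, takes the initial centers as an explicit argument, and stops when the objective no longer decreases. The names `assign` and `kmeans`, the tolerance `tol` and the toy data are illustrative assumptions.

```python
def sq_dist(x, mu):
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def assign(X, centers):
    """Step 1: index of the closest center for every point, plus objective E."""
    labels = [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
              for x in X]
    E = sum(sq_dist(x, centers[q]) for x, q in zip(X, labels))
    return labels, E


def kmeans(X, init_centers, max_iter=100, tol=1e-12):
    centers = [tuple(c) for c in init_centers]
    prev_E = float("inf")
    for _ in range(max_iter):
        labels, E = assign(X, centers)
        if prev_E - E <= tol:          # objective stopped changing: STOP
            break
        prev_E = E
        for j in range(len(centers)):  # Step 2: update the means
            members = [x for x, q in zip(X, labels) if q == j]
            if members:                # an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    labels, E = assign(X, centers)     # final assignment for the returned centers
    return centers, labels, E


# Toy data: two well-separated groups; both initial centers start in the first group.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels, E = kmeans(X, init_centers=X[:2])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Here an empty cluster simply keeps its previous center; this is a simplification chosen for brevity.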

An example of execution of the k-means algorithm
[Figure: six panels showing the cluster assignments and centers at iterations 1 to 6]

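An alternative interpretation (I)
- Assuming Euclidean spaces, $d(x_i, \mu_j) = \|x_i - \mu_j\|^2$:
$$E = \sum_{i=1}^{N} \min_{j=1,\dots,K} \|x_i - \mu_j\|^2$$
- The target of K-means is to minimize the sample variance of the data that belong to each cluster.
- Construction of minimum-variance clusters or, equivalently, of maximum-coherence clusters.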

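An alternative interpretation (II)
- Objective function:
$$E = \sum_{i=1}^{N} \min_{j=1,\dots,K} d(x_i, \mu_j) = \sum_{i=1}^{N} \min_{j=1,\dots,K} \|x_i - \mu_j\|^2$$
- The objective function is the error of the data with respect to their closest centers.
- During learning we try to minimize the sum of squared errors.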

[1]. Initialization strategies of cluster centers
1. Usually, random selection from the samples.
2. Uniformly from the range of values of the features.
3. Multiple runs of K-means, keeping the solution with the smallest value of the objective function (min{E}).
4. Sequential selection of centers:
   - Initially select one center (k=1) at random, or use the overall center of the data.
   - Iteratively, select as the mean of the (k+1)-th cluster the point of the dataset that is most "distant" from all the means $\{\mu_j\}$ selected up to the current step.
   - This gives more distinct clusters at the initial step.
   - Risk: outliers may be selected as means.
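Strategy 4 can be sketched as follows. The slide does not say how "most distant from all the means" is measured; this sketch assumes the common reading that the next center is the point whose distance to its nearest already-selected center is largest. The function name `farthest_point_init`, the `first` argument and the data are hypothetical.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def farthest_point_init(X, K, first=0):
    """Pick K initial centers one by one, each as far as possible from the others."""
    centers = [X[first]]                      # k = 1: the first center is given
    while len(centers) < K:
        # Assumed criterion: maximize the distance to the nearest chosen center.
        nxt = max(X, key=lambda x: min(sq_dist(x, c) for c in centers))
        centers.append(nxt)
    return centers


X = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
init = farthest_point_init(X, K=3)
print(init)  # [(0, 0), (10, 1), (5, 8)]
```

Because the criterion always prefers extreme points, an isolated outlier would be chosen early, which is exactly the risk noted in the list above.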

Initialization example: selection of the 1st center
[Figure: dataset with the first center $\mu_1$]

Initialization example: selection of the 2nd center
[Figure: dataset with centers $\mu_1$ and $\mu_2$]

Initialization example: selection of the 3rd center
[Figure: dataset with centers $\mu_1$, $\mu_2$ and $\mu_3$]

Example of a "bad" initialization
[Figure: five panels showing the cluster assignments and centers at iterations 1 to 5]

[2]. Problem with empty clusters
- At some iteration of the algorithm a cluster may become empty, i.e. contain no points.
- Solution: replace the center of the empty cluster with the point that is farthest from the centers of the other clusters.

[3]. Problem with outliers
- Outliers can significantly affect the clustering procedure, since they may considerably shift the cluster means.
- Solution: a mechanism that detects the outliers, either before clustering (pre-processing) or after it (post-processing), and removes them.

[4]. Complexity
- The memory complexity is small, since only the K centers are required in addition to the data. Space complexity: $O((N + K) \cdot d)$, where N is the size of the dataset and d is the dimension of the data.
- The time complexity is linear in the number of data points, i.e. $O(N)$, since each iteration requires $K N d$ operations.

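[5]. Extension to non-Euclidean spaces (K-medoids)
Modifications of the basic scheme:
I. Similarity function (instead of distance).
II. Objective function (maximization):
$$E_{sim} = \sum_{i=1}^{N} \max_{j=1,\dots,K} \mathrm{sim}(x_i, \mu_j)$$
III. Medoid as the center of cluster $\Omega_k$:
$$\mu_k = \arg\max_{x \in \Omega_k} \sum_{x_i \in \Omega_k} \mathrm{sim}(x_i, x)$$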
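As an illustration of modification III, the sketch below picks the medoid of a cluster of non-vector objects. The Jaccard similarity between sets and the three example sets are assumptions made only to have a concrete non-Euclidean case; any similarity function could be plugged in.

```python
def jaccard(a, b):
    """Jaccard similarity between two non-empty sets (assumed similarity)."""
    return len(a & b) / len(a | b)


def medoid(cluster, sim):
    """Cluster member with the largest total similarity to all members."""
    return max(cluster, key=lambda c: sum(sim(x, c) for x in cluster))


# Hypothetical cluster of three sets of symbols.
cluster = [frozenset("ab"), frozenset("abc"), frozenset("bcd")]
center = medoid(cluster, jaccard)
print(sorted(center))  # ['a', 'b', 'c']
```

Unlike a mean, the medoid is always an actual member of the cluster, so no averaging of objects is needed.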

[6]. Bisecting k-means (incremental learning)
Iteratively, select a cluster and split it. Initially m=1: one cluster with a single center for all points.
Repeat until m=K:
1. Select a cluster $j \in \{1, \dots, m\}$ having center $\mu_j$.
2. Split the j-th cluster by executing k-means locally with K=2 on the subset of data of the selected cluster (local k-means).
3. Two new clusters are produced, with centers $\mu_j$ (new) and $\mu_{m+1}$.
4. m=m+1
Finally, execute (global) k-means on all data, starting from the K centers.
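A minimal sketch of the split loop is given below. Several details are assumptions, since the slide leaves them open: the cluster to split is the one with the largest sum of squared errors, the local 2-means starts from the first point and the point farthest from it, and the selected cluster is assumed to contain at least two distinct points. The final global k-means refinement is omitted for brevity. All function names are illustrative.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def sse(points):
    m = mean(points)
    return sum(sq_dist(p, m) for p in points)


def two_means(points, max_iter=50):
    """Local k-means with K=2; returns the two groups of points."""
    # Assumed initialization: the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    groups = None
    for _ in range(max_iter):
        new = ([], [])
        for p in points:
            new[int(sq_dist(p, centers[1]) < sq_dist(p, centers[0]))].append(p)
        if groups is not None and (new == groups or not new[0] or not new[1]):
            break                             # converged, or a group became empty
        groups = new
        centers = [mean(g) for g in groups]
    return groups


def bisecting_kmeans(X, K):
    clusters = [list(X)]                      # m = 1: one cluster with all points
    while len(clusters) < K:                  # repeat until m = K
        # Assumed selection criterion: the cluster with the largest SSE.
        j = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        left, right = two_means(clusters.pop(j))
        clusters += [left, right]             # m = m + 1
    return clusters


X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
clusters = bisecting_kmeans(X, K=3)
print(sorted(sorted(c) for c in clusters))
```

The means of the returned clusters could then be used as the starting centers of one ordinary k-means run over all data.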

[7]. Limitations of K-means
- The algorithm has problems when the clusters in the data are non-spherical, or when they have different sizes or variances.
- The drawback of k-means is that the clusters it looks for have the same size, the same density and a spherical shape.
- Remedy: run the k-means algorithm with a larger number of clusters, so that the true clusters end up split into several parts at the end of the algorithm.

[7]. Limitations of K-means (cont.)
Clusters of different shape
[Figure: initial dataset and K-means solution (3 clusters)]

[7]. Limitations of K-means (cont.)
Clusters of different variance
[Figure: initial dataset and K-means solution (3 clusters)]

[7]. Limitations of K-means (cont.)
Non-spherical shaped clusters
[Figure: initial dataset and K-means solution (2 clusters)]

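[8]. Geometry of k-means
Assume clustering into $K = 2$ clusters. Decision mechanism of k-means: $x$ is assigned to cluster 1 if
$$\|x - \mu_1\|^2 < \|x - \mu_2\|^2$$
Then we have:
$$(\mu_1 - \mu_2)^T x - \tfrac{1}{2}\left(\mu_1^T \mu_1 - \mu_2^T \mu_2\right) > 0$$
which is of the form:
$$w^T x + b > 0, \qquad w = \mu_1 - \mu_2, \quad b = -\tfrac{1}{2}\left(\|\mu_1\|^2 - \|\mu_2\|^2\right)$$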

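[8]. Geometry of k-means (cont.)
Thus, k-means constructs specific linear discriminant hyperplanes among clusters: the perpendicular bisector of the line segment that joins each pair of cluster centers $(\mu_j, \mu_k)$.
[Figure: two cluster centers and the hyperplane that bisects the segment between them]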
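The equivalence between the nearest-center rule and a linear decision function can be checked numerically. The sketch below assumes two hypothetical centers `mu1` and `mu2`, defines $w = \mu_1 - \mu_2$ and $b = -\tfrac{1}{2}(\|\mu_1\|^2 - \|\mu_2\|^2)$, and verifies on a small grid that $w^T x + b > 0$ exactly when $x$ is closer to `mu1`.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


mu1, mu2 = (0.0, 0.0), (4.0, 2.0)             # hypothetical cluster centers
w = tuple(a - c for a, c in zip(mu1, mu2))    # normal vector of the hyperplane
b = -0.5 * (dot(mu1, mu1) - dot(mu2, mu2))    # offset


def side(x):
    """Value of the linear decision function w^T x + b."""
    return dot(w, x) + b


grid = [(float(i), float(j)) for i in range(-3, 8) for j in range(-3, 8)]
agree = all((side(x) > 0) == (sq_dist(x, mu1) < sq_dist(x, mu2)) for x in grid)
midpoint = tuple((a + c) / 2 for a, c in zip(mu1, mu2))
print(agree, side(midpoint))  # True 0.0
```

The midpoint of the two centers lies on the hyperplane, and `w` points along the segment joining them, which is why the boundary is the perpendicular bisector.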

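[9]. Alternative objective function (III)
$$E = \sum_{i=1}^{N} \min_{j=1,\dots,K} \mathrm{dist}(x_i, \mu_j)$$
can be written equivalently as
$$E = \sum_{i=1}^{N} \sum_{j=1}^{K} w_{ij}\, \mathrm{dist}(x_i, \mu_j)$$
where the binary weights $w_{ij} \in \{0, 1\}$ express the information of which cluster each point belongs to. The quantity $n_j = \sum_{i=1}^{N} w_{ij}$ is the number of data points that belong to the j-th cluster.
Cluster centers (for the squared Euclidean distance):
$$\frac{\partial E}{\partial \mu_j} = 0 \;\Rightarrow\; \mu_j = \frac{1}{n_j} \sum_{i=1}^{N} w_{ij}\, x_i$$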

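[10]. K-means as an optimization problem
- Objective function in Euclidean spaces, with $\Omega_j$ the set of points assigned to cluster $j$:
$$E = \sum_{i=1}^{N} \min_{j=1,\dots,K} \|x_i - \mu_j\|^2 = \sum_{j=1}^{K} \sum_{x_i \in \Omega_j} (x_i - \mu_j)^T (x_i - \mu_j)$$
- Minimization problem with respect to the centers $\{\mu_j\}$. For fixed cluster memberships, setting the derivative with respect to $\mu_j$ equal to zero:
$$\frac{\partial E}{\partial \mu_j} = -2 \sum_{x_i \in \Omega_j} (x_i - \hat{\mu}_j) = 0 \;\Rightarrow\; \hat{\mu}_j = \frac{1}{|\Omega_j|} \sum_{x_i \in \Omega_j} x_i$$
- The sample mean is the optimum center parameter that minimizes the objective function.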

[11]. Computer Vision application
- Image segmentation and image compression.
- Image segmentation: division of the image into regions (segments) of the same intensity.
- Consider a gray-scale image of size $512 \times 512$ pixels. Using 8 bits/pixel (256 intensity levels), a memory space of 256 KB is required.
- Apply the K-means clustering approach to the $2^{18}$ pixel intensities to find K clusters.

[11]. Computer Vision application (cont.)
- After clustering finishes, all pixels of the same cluster are assigned the intensity of the center $\mu_j$ of the cluster they belong to.
- Then, the new space required for the image is $(32 \log_2 K)$ KB, e.g. K=2: 32 KB, K=4: 64 KB, K=8: 96 KB.
- Image compression with an error equal to the k-means objective function (after convergence):
$$E = \sum_{i=1}^{N} \min_{j=1,\dots,K} d(x_i, \mu_j)$$
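The storage figures above follow from simple arithmetic, reproduced in this short sketch. It assumes that each pixel stores only the index of its cluster, using $\lceil \log_2 K \rceil$ bits with $K \ge 2$, and it ignores the small table of K center intensities. The function names are illustrative.

```python
def image_size_kb(width, height, bits_per_pixel):
    """Storage in kilobytes (1 KB = 1024 bytes) for a raster image."""
    return width * height * bits_per_pixel / 8 / 1024


def bits_for_clusters(K):
    """Bits needed to index one of K >= 2 cluster centers: ceil(log2 K)."""
    return (K - 1).bit_length()


original_kb = image_size_kb(512, 512, 8)
compressed_kb = {K: image_size_kb(512, 512, bits_for_clusters(K))
                 for K in (2, 4, 8)}
print(original_kb)    # 256.0
print(compressed_kb)  # {2: 32.0, 4: 64.0, 8: 96.0}
```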

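[12]. Fuzzy c-means
- Extension of k-means using fuzzy set theory.
- The objective function is written as
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{K} u_{ij}^{m}\, \|x_i - \mu_j\|^2$$
where the exponent $m > 1$ controls the fuzziness.
- $u_{ij}$ is the degree of membership of input $x_i$ to cluster $j$, calculated (iteratively) as:
$$u_{ij} = \frac{1}{\sum_{k=1}^{K} \left( \dfrac{\|x_i - \mu_j\|}{\|x_i - \mu_k\|} \right)^{\frac{2}{m-1}}}$$
- Cluster centers calculation:
$$\mu_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$$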
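One iteration of the two fuzzy c-means updates can be sketched as follows, using the standard formulation with squared Euclidean distances. The function name `fcm_step`, the default `m=2.0`, the small constant `eps` that guards against division by zero when a point coincides with a center, and the data are assumptions for illustration.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def fcm_step(X, centers, m=2.0, eps=1e-12):
    """One fuzzy c-means iteration: memberships U, then updated centers."""
    K, dim = len(centers), len(X[0])
    U = []
    for x in X:
        # Squared distances, so the exponent 2/(m-1) on norms becomes 1/(m-1).
        d = [max(sq_dist(x, c), eps) for c in centers]
        U.append([1.0 / sum((d[j] / d[k]) ** (1.0 / (m - 1.0)) for k in range(K))
                  for j in range(K)])
    new_centers = []
    for j in range(K):
        weights = [u[j] ** m for u in U]
        total = sum(weights)
        new_centers.append(tuple(sum(wt * x[t] for wt, x in zip(weights, X)) / total
                                 for t in range(dim)))
    return U, new_centers


X = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
U, centers = fcm_step(X, centers=[(0.0, 0.0), (10.0, 0.0)])
print([round(u, 3) for u in U[1]])  # memberships of the point (1, 0)
```

Each row of `U` sums to one, so a point shared between clusters contributes to every center in proportion to its memberships raised to the power `m`.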

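[13]. Kernel k-means
- Extension of k-means using a kernel function $k(x_i, x_l) = \phi(x_i)^T \phi(x_l)$.
- The objective function is written as
$$E = \sum_{i=1}^{N} \min_{j=1,\dots,K} \|\phi(x_i) - \mu_j\|^2$$
or
$$E = \sum_{i=1}^{N} \sum_{j=1}^{K} w_{ij}\, \|\phi(x_i) - \mu_j\|^2$$
- Cluster centers:
$$\frac{\partial E}{\partial \mu_j} = 0 \;\Rightarrow\; \mu_j = \frac{1}{n_j} \sum_{i=1}^{N} w_{ij}\, \phi(x_i), \qquad n_j = \sum_{i=1}^{N} w_{ij}$$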

Idea of kernel methods
1. Represent a point $x$ by its image $\phi(x)$ in a feature space.
2. The domains can be completely different!
3. Kernel trick: in many applications we do not need to know $\phi$ explicitly (e.g. $\phi$ can be infinite dimensional); we only need to operate with inner products $k(x, x') = \phi(x)^T \phi(x')$, provided the kernel can be computed efficiently.

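[13]. Kernel k-means (cont.)
- K-means extended to kernel k-means:
$$E = \sum_{i=1}^{N} \sum_{j=1}^{K} w_{ij}\, \|\phi(x_i) - \mu_j\|^2, \qquad \mu_j = \frac{1}{n_j} \sum_{l=1}^{N} w_{lj}\, \phi(x_l), \qquad n_j = \sum_{l=1}^{N} w_{lj}$$
- Each term is written (kernel distance):
$$\|\phi(x_i) - \mu_j\|^2 = \phi(x_i)^T \phi(x_i) - \frac{2}{n_j} \sum_{l} w_{lj}\, \phi(x_i)^T \phi(x_l) + \frac{1}{n_j^2} \sum_{l} \sum_{m} w_{lj} w_{mj}\, \phi(x_l)^T \phi(x_m)$$
$$= k(x_i, x_i) - \frac{2}{n_j} \sum_{l} w_{lj}\, k(x_i, x_l) + \frac{1}{n_j^2} \sum_{l} \sum_{m} w_{lj} w_{mj}\, k(x_l, x_m)$$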

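[13]. Kernel k-means (cont.)
- Kernel k-means objective function:
$$E = \sum_{i=1}^{N} \sum_{j=1}^{K} w_{ij} \left[ k(x_i, x_i) - \frac{2}{n_j} \sum_{l} w_{lj}\, k(x_i, x_l) + \frac{1}{n_j^2} \sum_{l} \sum_{m} w_{lj} w_{mj}\, k(x_l, x_m) \right], \qquad n_j = \sum_{l} w_{lj}$$
- Requires the calculation of the kernel (Gram) matrix $[k(x_i, x_l)]$.
- The binary (or weighted) $w_{ij}$ are obtained by calculating the kernel distance.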
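The kernel distance can be computed from the Gram matrix alone, as the sketch below illustrates. The function names `kernel_distances` and `reassign` and the data are assumptions. A linear kernel is used in the example only because the result can then be checked against the ordinary squared distance to the cluster mean.

```python
def kernel_distances(G, labels, K):
    """D[i][j] = ||phi(x_i) - mu_j||^2, computed from the Gram matrix G only."""
    N = len(G)
    D = [[float("inf")] * K for _ in range(N)]
    for j in range(K):
        members = [l for l in range(N) if labels[l] == j]   # points with w_lj = 1
        n_j = len(members)
        if n_j == 0:
            continue                                        # empty cluster: keep inf
        within = sum(G[l][m] for l in members for m in members) / n_j ** 2
        for i in range(N):
            cross = sum(G[i][l] for l in members) / n_j
            D[i][j] = G[i][i] - 2.0 * cross + within
    return D


def reassign(D):
    """New binary assignment: each point goes to its nearest center in feature space."""
    return [min(range(len(row)), key=row.__getitem__) for row in D]


X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
G = [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]  # linear kernel
D = kernel_distances(G, labels=[0, 0, 1], K=2)
print(D)            # [[1.0, 100.0], [1.0, 64.0], [81.0, 0.0]]
print(reassign(D))  # [0, 0, 1]
```

With the linear kernel the first cluster has mean (1, 0), and the printed values are exactly the squared Euclidean distances to the two cluster means; replacing `G` with any other valid Gram matrix gives the corresponding distances in feature space.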

[13]. Kernel k-means (cont.)
K-means vs. kernel K-means

Learning Vector Quantization - LVQ
- The goal is to find representatives $\{\mu_j\}$ of a dataset.
- The representatives act as information quantizers and compress the data.
- Competitive learning: the quantizers compete with each other over which one will "win" a newly arriving pattern. The vector of the winner is then adapted.
- 2 versions: with or without supervision.

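LVQ for clustering
Initialization of centers $\mu_j^{(0)}$, $j = 1, \dots, K$
Repeat
1. Random selection of an input point $x_i \in X$.
2. Find the winner cluster:
$$q = \arg\min_{j=1,\dots,K} d(x_i, \mu_j^{old})$$
3. Update the center of the winner cluster:
$$\mu_q^{new} = \mu_q^{old} + \eta\, (x_i - \mu_q^{old}), \qquad \eta < 1 \text{ (learning rate)}$$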
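The three steps of the loop can be sketched as an online procedure. The fixed number of steps `n_steps`, the constant learning rate `eta=0.1`, the seeded random generator and the toy data are assumptions; the slide does not state a stopping rule or a learning-rate schedule.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def lvq_cluster(X, init_centers, eta=0.1, n_steps=1000, seed=0):
    """Unsupervised competitive learning: only the winning center moves."""
    rng = random.Random(seed)
    centers = [list(c) for c in init_centers]
    for _ in range(n_steps):
        x = rng.choice(X)                                   # 1. random input point
        q = min(range(len(centers)),                        # 2. winner cluster
                key=lambda j: sq_dist(x, centers[j]))
        centers[q] = [c + eta * (a - c)                     # 3. move winner toward x
                      for c, a in zip(centers[q], x)]
    return centers


X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = lvq_cluster(X, init_centers=[(0.0, 0.0), (10.0, 10.0)])
print([[round(c, 2) for c in mu] for mu in centers])
```

Each update is a convex combination of the old center and the selected point, so with a small `eta` the centers drift toward a running average of the points they win.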

[2]. Hierarchical Clustering
- Tree representation of the data.
- Advantages:
  - No dependence on initialization.
  - Multiple solutions at once, for different numbers of clusters (height of the tree).
  - The method is general, since it can easily be extended to non-Euclidean spaces.
- There are 2 ways of constructing the tree.

Hierarchical Clustering
[Figure: nested clusters and the corresponding dendrogram]

Divisive method
- Top-down tree construction.
- Initially one cluster, the root of the tree (k=1).
Repeat
1. Select a cluster (leaf node of the current tree) according to an appropriate criterion.
2. Split this cluster (parent) into two non-overlapping children using a proper mechanism.
3. k=k+1
Until k=m

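Divisive method (cont.)
- Selection method: usually based on the maximum variance criterion or on the cluster's sparseness.
- Cluster splitting: find two members $\{x_m, x_n\}$ of cluster $C$ such that:
$$\{x_m, x_n\} = \arg\min_{x_m, x_n \in C} \sum_{x_i \in C} \min\{\mathrm{dist}(x_i, x_m), \mathrm{dist}(x_i, x_n)\}$$
- Then assign the members of cluster $C$ to the two children according to their distances from the two sub-cluster representatives $\{x_m, x_n\}$.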

Divisive method (cont.)
Splitting procedure
[Figure: cluster $C$ split into two children $C^{(1)}$ and $C^{(2)}$]

Agglomerative method
- Bottom-up tree construction.
- Initially a tree with N leaf nodes (every point forms a separate cluster) (k=N).
Repeat
- Find the two most similar clusters (parents) $C_i$, $C_j$ in the current tree and merge them into a larger cluster $C = C_i \cup C_j$.
- k=k-1
Until k=m

Agglomerative method (cont.)
- Disadvantage: high complexity, $O(N^2)$.
- Criteria for merging:
  - Minimum distance of the cluster centers.
  - Total mean distance among the members of both clusters (group average).
  - Maximum distance among the members of both clusters.
  - Increase of the cluster's variance after merging two clusters (Ward's method).
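The bottom-up loop with three of the merging criteria listed above can be sketched as follows. This is a naive illustration that recomputes all pairwise cluster distances at every step and is therefore slower than an optimized implementation. Ward's method is left out, and the criterion names `"centroid"`, `"average"` and `"complete"` are labels chosen here, not terms from the slides. `math.dist` requires Python 3.8 or later.

```python
import math


def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def cluster_distance(A, B, criterion):
    if criterion == "centroid":        # distance of the cluster centers
        return math.dist(mean(A), mean(B))
    pairwise = [math.dist(a, b) for a in A for b in B]
    if criterion == "average":         # group average
        return sum(pairwise) / len(pairwise)
    if criterion == "complete":        # maximum distance among members
        return max(pairwise)
    raise ValueError(f"unknown criterion: {criterion}")


def agglomerative(X, m, criterion="average"):
    clusters = [[x] for x in X]        # k = N: every point is its own cluster
    while len(clusters) > m:           # until k = m
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: cluster_distance(
            clusters[p[0]], clusters[p[1]], criterion))
        clusters[i] += clusters[j]     # merge the two closest clusters
        del clusters[j]                # k = k - 1
    return clusters


X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(X, m=3))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```

Recording the merged pair and its distance at every step, instead of only the final partition, would give the information needed to draw a dendrogram.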

Agglomerative method (cont.)
Agglomerative clustering example
[Figure: nested clusters and the corresponding dendrogram]