HELLENIC REPUBLIC, UNIVERSITY OF CRETE. Machine Learning, Unit 10: Support Vector Machines. Ioannis Tsamardinos, Department of Computer Science
Support Vector Machines. The decision surface is a hyperplane (a line in 2D) in feature space (similar to the Perceptron). Arguably the most important recent discovery in machine learning. In a nutshell: map the data to a predetermined, very high-dimensional space via a kernel function; find the hyperplane that maximizes the margin between the two classes; if the data are not separable, find the hyperplane that maximizes the margin and minimizes (a weighted average of) the misclassifications.
Support Vector Machines. Three main ideas: 1. Define what an optimal hyperplane is (in a way that can be identified computationally efficiently): maximize the margin. 2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications. 3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.
Which Separating Hyperplane to Use? (Figure: two classes plotted on axes Var 1 and Var 2, with several candidate separating lines.)
Maximizing the Margin. IDEA 1: Select the separating hyperplane that maximizes the margin! (Figure: the margin width around the separating line, axes Var 1 and Var 2.)
Why Maximize the Margin? Intuitively this feels safest: it seems to be the most robust to errors in the estimation of the decision boundary. LOOCV is easy, since the model is immune to the removal of any non-support-vector datapoint. Theory (using the VC dimension) suggests something related to (but not the same as) the proposition that this is a good thing. It works very well empirically.
Why Maximize the Margin? Perceptron convergence theorem (Novikoff 1962): Let s be the smallest radius of a (hyper)sphere enclosing the data. Suppose there is a w that separates the data, i.e., w·x > 0 for all x with class 1 and w·x < 0 for all x with class -1. Let m be the separation margin of the data, and let the learning rate be 0.5 for the learning rule w ← w + η(t_d − o_d)x_d. Then the number of updates made by the perceptron learning algorithm on the data is at most (s/m)².
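The learning rule in the theorem can be sketched in a few lines. A minimal Python illustration with made-up toy data (the bias is folded in as a constant feature; the mistake-driven update below is the slide's rule with the 0.5 learning rate and the (t_d − o_d) factor absorbed into a constant scaling of w):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(data, max_epochs=100):
    """Perceptron learning sketch. data: list of (x, y) with y in {+1, -1};
    each x ends with a constant 1.0 so the bias is learned as the last weight."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * dot(w, x) <= 0:                         # misclassified (or on boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]  # move w toward the example
                mistakes += 1
        if mistakes == 0:                                  # converged on separable data
            break
    return w

# Toy linearly separable data: label = sign of the first coordinate.
data = [([2.0, 0.0, 1.0], 1), ([1.0, 1.0, 1.0], 1),
        ([-2.0, 0.0, 1.0], -1), ([-1.0, 1.0, 1.0], -1)]
w = perceptron(data)
```

On separable data the loop stops once an epoch passes with no mistakes; the theorem bounds how many updates that can take.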
Support Vectors. (Figure: axes Var 1 and Var 2; the datapoints lying on the margin hyperplanes are the support vectors, with the margin width marked between them.)
Setting Up the Optimization Problem. The separating hyperplane is w·x + b = 0, with margin hyperplanes w·x + b = k and w·x + b = -k. The width of the margin is 2k / ||w||. So the problem is: max 2k / ||w|| s.t. w·x_i + b ≥ k for x_i of class 1, and w·x_i + b ≤ -k for x_i of class 2. (Figure: the three parallel hyperplanes on axes Var 1 and Var 2.)
Setting Up the Optimization Problem. There is a scale and unit for the data so that k = 1. The problem then becomes: max 2 / ||w|| s.t. w·x_i + b ≥ 1 for x_i of class 1, and w·x_i + b ≤ -1 for x_i of class 2. (Figure: margin hyperplanes w·x + b = 1 and w·x + b = -1 around w·x + b = 0.)
Setting Up the Optimization Problem. If class 1 corresponds to y_i = 1 and class 2 to y_i = -1, we can rewrite the constraints w·x_i + b ≥ 1 (for x_i with y_i = 1) and w·x_i + b ≤ -1 (for x_i with y_i = -1) as the single condition y_i(w·x_i + b) ≥ 1, ∀x_i. So the problem becomes: max 2 / ||w|| s.t. y_i(w·x_i + b) ≥ 1, ∀x_i — or, equivalently: min ½||w||² s.t. y_i(w·x_i + b) ≥ 1, ∀x_i.
Linear, Hard-Margin SVM Formulation. Find w, b that solve: min ½||w||² s.t. y_i(w·x_i + b) ≥ 1, ∀x_i. The problem is convex, so there is a unique global minimum value (when feasible). There is also a unique minimizer, i.e., the weight vector and b value that achieve the minimum. The problem is not solvable if the data are not linearly separable.
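The quantities in this formulation are easy to compute directly. A small Python sketch (with made-up example values for w, b, and the data) of the margin width 2 / ||w|| and the hard-margin feasibility check:

```python
import math

def margin_width(w):
    # Under the constraints y_i (w·x_i + b) >= 1, the distance between the
    # supporting hyperplanes w·x + b = +1 and w·x + b = -1 is 2 / ||w||.
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def hard_margin_feasible(w, b, data):
    # Every training point must lie on or outside its margin hyperplane.
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
               for x, y in data)

# Shrinking ||w|| widens the margin: w = (0.5, 0) gives width 2 / 0.5 = 4.
w, b = [0.5, 0.0], 0.0
data = [([3.0, 0.0], 1), ([-3.0, 1.0], -1)]
```

This also shows why minimizing ½||w||² maximizes the margin: the width depends on w only through 1 / ||w||.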
Solving the Linear, Hard-Margin SVM: Quadratic Programming. QP is a well-studied class of optimization problems: maximize (or minimize) a quadratic function of real-valued variables subject to linear constraints. Very efficient computationally with modern constraint-optimization engines (handles thousands of constraints and training instances).
Support Vector Machines — three main ideas (recap): 1. Optimal hyperplane: maximize the margin. 2. Non-linearly separable problems: penalty term for misclassifications. 3. Implicit mapping to a high-dimensional space where linear decision surfaces suffice.
Non-Linearly Separable Data. Find a hyperplane that minimizes both ||w|| and the number of misclassifications: ||w|| + C·(#errors). Problem: this is NP-complete. Plus, all errors are treated the same. (Figure: margin hyperplanes w·x + b = ±1 around w·x + b = 0, with some points on the wrong side.)
Non-Linearly Separable Data. Instead, minimize ||w|| + C·{distance of the error points from their desired place}. Allow some instances to fall within the margin, but penalize them. (Figure: same setup, with the distances of the violating points to their margin hyperplane marked.)
Non-Linearly Separable Data. Introduce slack variables ξ_i: allow some instances to fall within the margin, but penalize them in proportion to their slack. (Figure: each violating point is at distance ξ_i from its margin hyperplane.)
Formulating the Optimization Problem. The constraints become: y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀x_i. The objective function penalizes misclassified instances and those within the margin: min ½||w||² + C·Σ_i ξ_i. C trades off margin width against misclassifications.
Linear, Soft-Margin SVMs. min ½||w||² + C·Σ_i ξ_i s.t. y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀x_i. The algorithm tries to keep the ξ_i at zero while maximizing the margin. Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes. Other formulations use Σ_i ξ_i² instead. As C → ∞, we approach the hard-margin solution. Hard margin: m+1 decision variables, n constraints. Soft margin: m+1+n decision variables, 2n constraints.
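The course solves this objective as a QP; purely as an illustration, the same soft-margin objective can also be minimized approximately by subgradient descent on its equivalent hinge-loss form, since for fixed (w, b) the optimal slack is ξ_i = max(0, 1 − y_i(w·x_i + b)). A sketch with made-up toy data (not the QP method the slides describe):

```python
def train_soft_margin(data, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i(w·x_i + b)).
    data: list of (x, y) with y in {+1, -1}. Returns (w, b)."""
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0                  # gradient of the 0.5*||w||^2 term is w
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                     # inside the margin or misclassified
                gw = [gwi - C * y * xi for gwi, xi in zip(gw, x)]
                gb -= C * y
        w = [wi - lr * gwi for wi, gwi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy data with one class on each side of the vertical axis.
data = [([2.0, 0.0], 1), ([1.0, 1.0], 1), ([-2.0, 0.0], -1), ([-1.0, -1.0], -1)]
w, b = train_soft_margin(data)
```

Because the objective is convex, this simple scheme converges toward the same unique minimum the QP solver finds; C plays exactly the trade-off role described above.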
Robustness of Soft vs. Hard Margin SVMs. (Figure: two panels on axes Var 1 and Var 2, each showing the hyperplane w·x + b = 0; with an outlier present, the soft-margin SVM keeps a wide separator while the hard-margin SVM is dragged by the outlier.)
Soft vs. Hard Margin SVM. The soft-margin SVM always has a solution. The soft-margin SVM is more robust to outliers and gives smoother surfaces (in the non-linear case). The hard-margin SVM does not require guessing the cost parameter (it requires no parameters at all).
Support Vector Machines — three main ideas (recap): 1. Optimal hyperplane: maximize the margin. 2. Non-linearly separable problems: penalty term for misclassifications. 3. Implicit mapping to a high-dimensional space where linear decision surfaces suffice.
Disadvantages of Linear Decision Surfaces. (Figure: two classes on axes Var 1 and Var 2 that no straight line can separate.)
Advantages of Non-Linear Surfaces. (Figure: the same two classes separated cleanly by a curved decision boundary.)
Linear Classifiers in High-Dimensional Spaces. Find a function Φ(x) that maps the data to a different space. (Figure: data that are non-separable on axes Var 1 and Var 2 become linearly separable on axes Constructed Feature 1 and Constructed Feature 2.)
Mapping Data to a High-Dimensional Space. Find a function Φ(x) to map to a different space; the SVM formulation then becomes: min ½||w||² + C·Σ_i ξ_i s.t. y_i(w·Φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀x_i. The data appear only as Φ(x), and the weights w are now weights in the new space. Explicit mapping is expensive if Φ(x) is very high-dimensional, so solving the problem without explicitly mapping the data is desirable.
The Dual of the SVM Formulation. Original SVM formulation: n inequality constraints, n positivity constraints, m+1+n variables: min over w, b, ξ of ½||w||² + C·Σ_i ξ_i s.t. y_i(w·Φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀x_i. The (Wolfe) dual of this problem: one equality constraint, n positivity constraints, n variables (the Lagrange multipliers α_i), and a more complicated objective function: min over α of ½·Σ_i Σ_j α_i α_j y_i y_j (Φ(x_i)·Φ(x_j)) − Σ_i α_i s.t. 0 ≤ α_i ≤ C, ∀i, and Σ_i α_i y_i = 0. NOTICE: the data only appear as Φ(x_i)·Φ(x_j).
The Kernel Trick. Φ(x_i)·Φ(x_j) means: map the data into the new space, then take the inner product of the new vectors. We can find a function K such that K(x_i, x_j) = Φ(x_i)·Φ(x_j) and K is easily computable. Then we do not need to map the data explicitly into the high-dimensional space to solve the optimization problem (for training). How do we classify without explicitly mapping the new instances? It turns out that sign(w·Φ(x) + b) = sign(Σ_i α_i y_i K(x_i, x) + b), where b solves y_j(Σ_i α_i y_i K(x_i, x_j) + b) = 1 for any j with 0 < α_j < C.
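The classification rule above needs only the kernel values and the support vectors (the points with α_i > 0). A minimal Python sketch of its shape, with hand-picked hypothetical multipliers and a linear kernel for concreteness:

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_predict(x, support, K, b):
    """support: list of (alpha_i, y_i, x_i) for the training points with alpha_i > 0.
    Implements sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(alpha * y * K(xi, x) for alpha, y, xi in support) + b
    return 1 if s >= 0 else -1

# Two hypothetical support vectors, one per class, each with alpha = 1.
support = [(1.0, 1, [1.0, 0.0]), (1.0, -1, [-1.0, 0.0])]
```

Swapping `linear_kernel` for any other kernel changes the decision surface without touching the prediction code, which is the point of the trick.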
Examples of Kernels. Assume we measure two quantities, e.g., the expression levels of the genes TrkC and SonicHedgehog (SH), so x = (x_TrkC, x_SH), and we use the mapping Φ: (x_TrkC, x_SH) ↦ (x²_TrkC, x²_SH, √2·x_TrkC·x_SH, √2·x_TrkC, √2·x_SH, 1). Consider the function K(x, z) = (x·z + 1)². We can verify that Φ(x)·Φ(z) = x²_TrkC z²_TrkC + x²_SH z²_SH + 2 x_TrkC z_TrkC x_SH z_SH + 2 x_TrkC z_TrkC + 2 x_SH z_SH + 1 = (x·z + 1)² = K(x, z).
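This identity is easy to check numerically. A small Python sketch with the √2 scaling in the feature map and arbitrary example values for the two measurements:

```python
import math

def phi(x):
    # Explicit degree-2 feature map for 2-D input x = (x_TrkC, x_SH).
    a, b = x
    r2 = math.sqrt(2.0)
    return [a * a, b * b, r2 * a * b, r2 * a, r2 * b, 1.0]

def K(x, z):
    # Polynomial kernel of degree 2: (x·z + 1)^2.
    return (x[0] * z[0] + x[1] * z[1] + 1.0) ** 2

x, z = [0.7, -1.2], [2.0, 0.5]
explicit = sum(p * q for p, q in zip(phi(x), phi(z)))  # inner product in feature space
# K(x, z) and the explicit inner product agree up to floating-point rounding.
```

The kernel side costs one 2-term inner product and a square; the explicit side builds two 6-dimensional vectors first — the gap that becomes dramatic in the next slide's 7,000-gene example.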
Polynomial and Gaussian Kernels. K(x, z) = (x·z + 1)^p is called the polynomial kernel of degree p. For p = 2, if we measure 7,000 genes, using the kernel once means calculating a summation (inner product) with 7,000 terms and then squaring the result. Mapping explicitly to the high-dimensional space means calculating approximately 50,000,000 new features for both training instances, then taking the inner product of those (another 50,000,000 terms to sum). In general, using the kernel trick provides huge computational savings over explicit mapping! Another commonly used kernel is the Gaussian (on a training set it maps to a space with number of dimensions equal to the number of training cases): K(x, z) = exp(−||x − z||² / 2σ²).
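For reference, the Gaussian kernel is a one-liner; a Python sketch with σ left as a free parameter:

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2*sigma^2)). Equals 1 when x == z
    and decays toward 0 as the points move apart; sigma sets the length scale."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Unlike the polynomial kernel, no finite explicit map exists here, so the kernel trick is not merely a saving but the only way to work in this space.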
The Mercer Condition. Is there a mapping Φ(x) for any symmetric function K(x, z)? No. The SVM dual formulation requires calculating K(x_i, x_j) for each pair of training instances. The matrix G_ij = K(x_i, x_j) is called the Gram matrix. A feature space Φ(x) exists when the kernel is such that G is always positive semi-definite (the Mercer condition).
Other Types of Kernel Methods. SVMs that perform regression. SVMs that perform clustering. ν-Support Vector Machines: maximize the margin while bounding the number of margin errors. Leave-One-Out Machines: minimize the bound on the leave-one-out error. SVM formulations that take into consideration differences in the cost of misclassification between the classes. Kernels suitable for sequences of strings, or other specialized kernels.
Variable Selection with SVMs: Recursive Feature Elimination. Train a linear SVM. Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables. Retrain the SVM with the remaining variables and repeat until classification performance is reduced. Very successful. Other formulations exist where minimizing the number of variables is folded into the optimization problem. Similar algorithms exist for non-linear SVMs. Among the best and most efficient variable-selection methods.
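The elimination loop itself is simple. A Python sketch with the SVM trainer abstracted as a `train_fn` that returns one weight per remaining variable; the `toy_train` scorer below is a hypothetical stand-in for the linear SVM, used only so the sketch runs end to end:

```python
def rfe_order(X, y, train_fn, drop_frac=0.5):
    """Recursive Feature Elimination sketch. Returns feature indices in
    elimination order: least important first, the surviving feature last."""
    remaining = list(range(len(X[0])))
    order = []
    while len(remaining) > 1:
        Xs = [[row[j] for j in remaining] for row in X]
        w = train_fn(Xs, y)                               # one weight per column of Xs
        ranked = sorted(range(len(remaining)), key=lambda j: abs(w[j]))
        n_drop = max(1, int(len(remaining) * drop_frac))  # e.g. drop the lowest 50%
        for pos in sorted(ranked[:n_drop], reverse=True): # pop from the back first
            order.append(remaining.pop(pos))
    order.extend(remaining)
    return order

# Stand-in "trainer": weight_j = mean of y_i * x_ij (a crude relevance score).
def toy_train(X, y):
    n = len(X)
    return [sum(yi * row[j] for yi, row in zip(y, X)) / n for j in range(len(X[0]))]

# Column 0 tracks the label exactly, column 2 weakly, column 1 is noise.
X = [[1.0, 0.1, 0.5], [1.0, -0.1, 0.5], [-1.0, 0.1, -0.5], [-1.0, -0.1, -0.5]]
y = [1, 1, -1, -1]
```

In the real method, each round would also evaluate classification performance and stop once it degrades.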
Comparison with Neural Networks. Neural Networks: hidden layers map to lower-dimensional spaces; the search space has multiple local minima; training is expensive; classification is extremely efficient; require choosing the number of hidden units and layers; very good accuracy in typical domains. SVMs: the kernel maps to a very high-dimensional space; the search space has a unique minimum; training is extremely efficient; classification is extremely efficient; the kernel and the cost C are the two parameters to select; very good accuracy in typical domains; extremely robust.
Why Do SVMs Generalize? Even though they map to a very high-dimensional space, they have a very strong bias in that space: the solution has to be a linear combination of the training instances. There is a large theory (Structural Risk Minimization) providing bounds on the error of an SVM. Typically, though, the error bounds are too loose to be of practical use.
Multi-Class SVMs. One-versus-all: train n binary classifiers, one for each class against all other classes; the predicted class is the class of the most confident classifier. One-versus-one: train n(n−1)/2 classifiers, each discriminating between a pair of classes; several strategies exist for selecting the final classification based on the outputs of the binary SVMs. Truly multi-class SVMs: generalize the SVM formulation to multiple categories. More on that in the paper nominated for the student paper award: "Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development", Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos.
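The one-versus-all scheme is independent of the underlying binary learner. A Python sketch with the binary SVM abstracted as a `train_fn` returning a confidence scorer; the nearest-centroid `centroid_train` below is a hypothetical stand-in so the sketch is runnable, not the SVM itself:

```python
def one_vs_all_train(X, y, train_fn):
    """Train one binary scorer per class: that class (+1) vs. the rest (-1)."""
    return {c: train_fn(X, [1 if yi == c else -1 for yi in y])
            for c in sorted(set(y))}

def one_vs_all_predict(models, x):
    # Predicted class is the class of the most confident binary scorer.
    return max(models, key=lambda c: models[c](x))

# Stand-in binary "trainer": confidence = negative squared distance
# to the centroid of the positive class.
def centroid_train(X, y01):
    pos = [x for x, t in zip(X, y01) if t == 1]
    centroid = [sum(col) / len(pos) for col in zip(*pos)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centroid))

# Three well-separated classes in the plane.
X = [[0.0, 0.0], [0.2, 0.1], [5.0, 0.0], [5.1, 0.2], [0.0, 5.0], [0.1, 5.1]]
y = ["a", "a", "b", "b", "c", "c"]
models = one_vs_all_train(X, y, centroid_train)
```

With real binary SVMs, the scorer would be the signed margin w·Φ(x) + b, whose magnitude serves as the confidence being compared.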
Conclusions. SVMs express learning as a mathematical program, taking advantage of the rich theory in optimization. SVMs use the kernel trick to map indirectly to extremely high-dimensional spaces. SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well.
Suggested Further Reading. http://www.kernel-machines.org/tutorial.html. C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. P.-H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on ν-support vector machines. 2003. N. Cristianini. ICML'01 tutorial, 2001. K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, May 2001. B. Schölkopf. SVM and kernel methods, 2001. Tutorial given at the NIPS Conference. Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2001.
End of Unit
Funding. This educational material has been developed as part of the instructor's teaching duties. The project "Open Academic Courses at the University of Crete" has funded only the reformatting of the educational material. The project is implemented under the Operational Program "Education and Lifelong Learning" and is co-financed by the European Union (European Social Fund) and by national resources.
Notices
Licensing notice (1). This material is provided under the terms of the Creative Commons Attribution, Non-Commercial, No Derivatives 4.0 license [1] or any later International version. Excluded are self-contained third-party works, e.g., photographs, diagrams, etc., which are contained in it and which are listed along with their terms of use in the "Notice on Use of Third-Party Works". [1] http://creativecommons.org/licenses/by-nc-nd/4.0/
Licensing notice (2). Non-commercial is defined as use that: does not involve direct or indirect financial benefit from the use of the work for the distributor of the work and the licensee; does not involve a financial transaction as a condition for the use of or access to the work; does not confer on the distributor and licensee of the work indirect financial benefit (e.g., advertisements) from the display of the work on a website. The rights holder may grant the licensee a separate license to use the work for commercial use, upon request.
Attribution notice. Copyright University of Crete, Ioannis Tsamardinos 2015. "Machine Learning. Support Vector Machines". Edition: 1.0. Heraklion 2015. Available at: https://opencourses.uoc.gr/courses/course/view.php?id=362.
Retention of notices. Any reproduction or adaptation of the material must include: the Attribution Notice; the Licensing Notice; the Retention of Notices statement; the Notice on Use of Third-Party Works (if applicable); together with the accompanying hyperlinks.