Neural Network Classifiers


Pattern Recognition: Neural Network Classifiers. Christodoulos Chamzas, 2011. Part of the content of these slides comes from the slides of the corresponding course taught by Prof. Sergios Theodoridis, Department of Informatics and Telecommunications, University of Athens.

Why non-linear classifiers? For a two-class problem: if the number of patterns is smaller than the number of components of each pattern, then there always exists a hyperplane that separates them. Linear classifiers are therefore useful: in problems of very high dimensionality, and in problems of moderate dimensionality where there is a relatively small number of training patterns. Moreover, the number of components of a pattern can be increased arbitrarily by adding new components that are non-linear functions of the original ones (e.g., polynomials). However, there are many problems that cannot be solved with linear classifiers. In the previous methods, the main difficulty is finding the appropriate non-linear function. One class of non-linear classifiers is the multilayer neural networks; in these, the form of the non-linear separating function is learned from the training data.

The simple perceptron (linear classifier). Architecture: a single-layer network with N inputs and M neurons arranged in one layer of neurons. Synaptic connections connect all neurons to all inputs. Neurons: McCulloch-Pitts type, with a hard limiter and an adaptable activation threshold:

y_j = sgn( Σ_i w_ji x_i + w_j0 )

Since the outputs are independent of one another, we can study them independently, considering each neuron of the perceptron separately:

y = sgn( Σ_{i=1}^{n} w_i x_i + w_0 ) = sgn( w^T x + w_0 ), with output +1 or −1.

The problem: We are given a set of P patterns { x_i ∈ R^n, i = 1, 2, ..., P } partitioned into two categories

C_1 = { x_i, i = 1, 2, ..., K },  C_2 = { x_i, i = K+1, K+2, ..., P },

which are linearly separable, i.e., there exists a vector ŵ such that

ŵ^T x > 0 for x ∈ C_1,  ŵ^T x < 0 for x ∈ C_2.

The task is to find such a vector, one that linearly separates the two categories, in an iterative manner.
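
A minimal sketch of such an iterative scheme (the classical perceptron rule, in Python with NumPy; the toy data, learning rate and epoch limit are illustrative assumptions, not part of the original slides):

```python
import numpy as np

def train_perceptron(X, labels, eta=0.1, max_epochs=100):
    """Iteratively find a weight vector w (bias w0 appended) that
    separates two linearly separable classes, if such a vector exists."""
    X_ext = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias term
    w = np.zeros(X_ext.shape[1])
    t = np.where(labels == 1, 1, -1)                   # classes coded as +1 / -1
    for _ in range(max_epochs):
        errors = 0
        for x, ti in zip(X_ext, t):
            if ti * np.dot(w, x) <= 0:                 # x misclassified (or on the boundary)
                w += eta * ti * x                      # perceptron update
                errors += 1
        if errors == 0:                                # all patterns correctly classified
            return w
    return w

# Toy linearly separable example (assumed data)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
labels = np.array([1, 1, 0, 0])
print(train_perceptron(X, labels))
```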

The geometry of the problem: the decision hyperplane g(x) = w^T x + w_0 = 0. (Figure: the hyperplane in the (x_1, x_2) plane; its distance from the origin is |w_0|/‖w‖ and D denotes the distance of a point x from the hyperplane.)

Motivation: Natural systems perform very complex information processing tasks with completely different "hardware" than conventional (von Neumann) computers. The 100-step program constraint [Jerry Feldman]: neurons operate in about 1 ms, yet humans do sophisticated processing in about 0.1 s, i.e., only about 100 serial steps, hence massive parallelism.


Artificial Neural Network (ANN)

Definition: An Artificial Neural Network (ANN) is an information processing device consisting of a large number of highly interconnected processing elements. Each processing element (unit) performs only very simple computations. Remarks: Each unit computes a single activation value. Environmental interaction takes place through a subset of units. Behavior depends on the interconnection structure. The structure may adapt by learning.

Non-Linear Classifiers: The XOR problem

x1  x2  XOR  Class
 0   0   0     B
 0   1   1     A
 1   0   1     A
 1   1   0     B

There is no single line (hyperplane) that separates class A from class B. On the contrary, the AND and OR operations are linearly separable problems.

The Two-Layer Perceptron. For the XOR problem, draw two lines instead of one.

Then class B is located outside the shaded area and class A inside. This is a two-phase design.

Phase 1: Draw two lines (hyperplanes), g_1(x) = 0 and g_2(x) = 0. Each of them is realized by a perceptron. The outputs of the perceptrons will be y_i = f(g_i(x)) ∈ {0, 1}, i = 1, 2, depending on the position of x.

Phase 2: Find the position of x with respect to both lines, based on the values of y_1, y_2.

1st phase and 2nd phase:

x1  x2 | y1     y2    | 2nd phase
 0   0 | 0 (−)  0 (−) |  B (0)
 0   1 | 1 (+)  0 (−) |  A (1)
 1   0 | 1 (+)  0 (−) |  A (1)
 1   1 | 1 (+)  1 (+) |  B (0)

Equivalently: the computations of the first phase perform a mapping x → y = [y_1, y_2]^T.

The decision is now performed on the transformed data y, via g(y) = 0. This can be performed with a second line, which can also be realized by a perceptron.

The computations of the first phase perform a mapping that transforms the nonlinearly separable problem into a linearly separable one. The architecture:

This is known as the two-layer (*) perceptron, with one hidden and one output layer. The activation function is the step function:

f(x) = 1 if x > 0, 0 if x < 0.

The neurons (nodes) of the figure realize the following lines (hyperplanes):

g_1(x) = x_1 + x_2 − 1/2 = 0
g_2(x) = x_1 + x_2 − 3/2 = 0
g(y) = y_1 − y_2 − 1/2 = 0

(*) NOTE: Duda, Hart and Stork, in their book, call it a three-layer perceptron. In general, in their notation, what we call an N-layer network they call an (N+1)-layer one.
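
A quick check, in Python, that the hyperplanes above (with the 1/2 and 3/2 thresholds as reconstructed here) indeed solve XOR when combined by the output neuron:

```python
def step(v):                                    # hard-limiter activation f(.)
    return 1 if v > 0 else 0

def two_layer_xor(x1, x2):
    y1 = step(x1 + x2 - 0.5)                    # g1(x) = x1 + x2 - 1/2
    y2 = step(x1 + x2 - 1.5)                    # g2(x) = x1 + x2 - 3/2
    return step(y1 - y2 - 0.5)                  # g(y)  = y1 - y2 - 1/2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', two_layer_xor(x1, x2))   # prints 0, 1, 1, 0 (class A = 1)
```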

Classification capabilities of the two-layer perceptron. The mapping performed by the first-layer neurons is onto the vertices of the unit-side square, i.e., (0, 0), (0, 1), (1, 0), (1, 1). The more general case,

x ∈ R^l → y = [y_1, ..., y_p]^T,  y_i ∈ {0, 1},  i = 1, ..., p,

performs a mapping of a vector onto the vertices of the unit-side hypercube H_p. The mapping is achieved with p neurons, each realizing a hyperplane. The output of each of these neurons is 0 or 1 depending on the relative position of x with respect to the hyperplane.

Intersections of these hyperplanes form regions in the l-dimensional space. Each region corresponds to a vertex of the unit hypercube H_p.

For example, the 001 vertex corresponds to the region which is located on the (−) side of g_1(x) = 0, on the (−) side of g_2(x) = 0, and on the (+) side of g_3(x) = 0.

The output neuron realizes a hyperplane in the transformed y space that separates some of the vertices from the others. Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions. But NOT ANY union: it depends on the relative position of the corresponding vertices.

Three-layer perceptrons. The architecture (hidden units compute y_j = f(net_j), with weights w_kj to the output). This is capable of classifying vectors into classes consisting of ANY union of polyhedral regions. The idea is similar to the XOR problem: it realizes more than one hyperplane in the y ∈ R^p space.

Three-layer perceptrons with C classes: the architecture for more than two classes (hidden units compute y_j = f(net_j), with weights w_kj to the outputs).

A single bias unit is connected to each unit other than the input units.

Net activation: net_j = Σ_{i=1}^{d} x_i w_ji + w_j0 = Σ_{i=0}^{d} x_i w_ji ≡ w_j^T x,

where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_ji denotes the input-to-hidden layer weight at hidden unit j. (In neurobiology, such weights or connections are called synapses.) Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j).

The reasoning: For each vertex corresponding to class, say, A, construct a hyperplane which leaves THIS vertex on one side (+) and ALL the others on the other side (−). The output neuron then realizes an OR gate. Overall: the first layer of the network forms the hyperplanes, the second layer forms the regions and the output neuron forms the classes.

Designing Multilayer Perceptrons. One direction is to adopt the above rationale and develop a structure that classifies correctly all the training patterns. The other direction is to choose a structure and compute the synaptic weights so as to optimize a cost function.

Expressive Power of multi-layer Networks. Question: Can every decision be implemented by a three-layer network described by equation (1)? Answer: Yes (due to A. Kolmogorov). Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights:

g(x) = Σ_{j=1}^{2n+1} Ξ_j( Σ_{i=1}^{n} ψ_ij(x_i) ),  x ∈ I^n (I = [0,1]; n ≥ 2),

for properly chosen functions Ξ_j and ψ_ij.

Each of the 2n+1 hidden units j takes as input a sum of d nonlinear functions, one for each input feature x_i. Each hidden unit emits a nonlinear function Ξ_j of its total input. The output unit emits the sum of the contributions of the hidden units.

Unfortunately: Kolmogorov's theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition.

Backpropagation Algorithm. Any function from input to output can be implemented as a three-layer neural network. These results are of greater theoretical interest than practical, since the construction of such a network requires the nonlinear functions and the weight values, which are unknown!

(*) NOTE: In this figure, replace "two layer" with "one layer" and "three layer" with "two layer".

Our goal now is to set the interconnection weights based on the training patterns and the desired outputs. In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights. The power of backpropagation is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights. This is known as:

The Backpropagation Algorithm. This is an algorithmic procedure that computes the synaptic weights iteratively, so that an adopted cost function is minimized (optimized). In a large number of optimizing procedures, computation of derivatives is involved. Hence, discontinuous activation functions pose a problem, i.e.,

f(x) = 1 if x > 0, 0 if x < 0.

There is always an escape path!!! The logistic function

f(x) = 1 / (1 + exp(−a x))

is an example. Other functions are also possible and in some cases more desirable.
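
For illustration, a tiny Python sketch of the logistic activation and its derivative, which is what makes gradient-based training possible (the slope value a = 1 is an assumption):

```python
import numpy as np

def logistic(x, a=1.0):
    """Smooth replacement of the hard limiter: f(x) = 1 / (1 + exp(-a x))."""
    return 1.0 / (1.0 + np.exp(-a * x))

def logistic_deriv(x, a=1.0):
    """f'(x) = a f(x) (1 - f(x)); continuous, so usable in gradient descent."""
    f = logistic(x, a)
    return a * f * (1.0 - f)

print(logistic(0.0), logistic_deriv(0.0))   # 0.5, 0.25
```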


The steps: Adopt an optimizing cost function, e.g., the Least Squares Error or the Relative Entropy, between the desired responses and the actual responses of the network for the available training patterns. That is, from now on we have to live with errors; we only try to minimize them, using certain criteria. Adopt an algorithmic procedure for the optimization of the cost function with respect to the synaptic weights, e.g., gradient descent, Newton's algorithm, conjugate gradient.

The task is a nonlinear optimization one. For the gradient descent method:

w^r(new) = w^r(old) + Δw^r,  Δw^r = −η ∂J/∂w^r,

where η is the learning rate.

The Procedure: Initialize the unknown weights randomly with small values. Compute the gradient terms backwards, starting with the weights of the last (3rd) layer and then moving towards the first. Update the weights. Repeat the procedure until a termination criterion is met.

Two major philosophies: Batch mode: the gradients of the last layer are computed once ALL training data have appeared to the algorithm, i.e., by summing up all error terms. Pattern mode: the gradients are computed every time a new training data pair appears. Thus gradients are based on successive individual errors.


A major problem: the algorithm may converge to a local minimum.

The Cost function choice. Examples:

The Least Squares:

J = Σ_{k=1}^{N} E(k),  E(k) = (1/2) Σ_m e_m²(k) = (1/2) Σ_m ( y_m(k) − ŷ_m(k) )²,  k = 1, 2, ..., N,

where y_m(k) is the desired response of the m-th output neuron (1 or 0) for x(k), and ŷ_m(k) is the actual response of the m-th output neuron, in the interval [0, 1], for input x(k).

The cross-entropy:

J = Σ_{k=1}^{N} E(k),  E(k) = − Σ_m [ y_m(k) ln ŷ_m(k) + (1 − y_m(k)) ln(1 − ŷ_m(k)) ].

This presupposes an interpretation of y and ŷ as probabilities.

Classification error rate. This is also known as discriminative learning. Most of these techniques use a smoothed version of the classification error.

Remark 1: A common feature of all the above is the danger of local minimum convergence. "Well-formed" cost functions guarantee convergence to a "good" solution, that is, one that classifies correctly ALL training patterns, provided such a solution exists. The cross-entropy cost function is a well-formed one. The Least Squares is not.

Remark 2: Both the Least Squares and the cross-entropy criteria lead to output values ŷ_m(k) that approximate optimally the class a-posteriori probabilities:

ŷ_m(k) ≈ P(ω_m | x(k)),

that is, the probability of class ω_m given x(k). This is a very interesting result. It does not depend on the underlying distributions; it is a characteristic of certain cost functions. How good or bad the approximation is depends on the underlying model. Furthermore, it is only valid at the global minimum.

Network Learning. Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, ..., c, and let w represent all the weights of the network.

The training error (Least Squares):

J(w) = (1/2) Σ_{k=1}^{c} ( t_k − z_k )² = (1/2) ‖t − z‖².

The backpropagation learning rule is based on gradient descent. The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

Δw = −η ∂J/∂w,

where η is the learning rate, which indicates the relative size of the change in weights:

w(m + 1) = w(m) + Δw(m),

where m indexes the m-th pattern presented.

Error on the hidden-to-output weights:

∂J/∂w_kj = (∂J/∂net_k) (∂net_k/∂w_kj) = −δ_k (∂net_k/∂w_kj),

where the sensitivity of unit k is defined as δ_k = −∂J/∂net_k and describes how the overall error changes with the net activation of the unit:

δ_k = −∂J/∂net_k = −(∂J/∂z_k)(∂z_k/∂net_k) = (t_k − z_k) f'(net_k).

Since net_k = w_k^T y, therefore ∂net_k/∂w_kj = y_j.

Conclusion: the weight update (or learning rule) for the hidden-to-output weights is:

Δw_kj = η δ_k y_j = η (t_k − z_k) f'(net_k) y_j.

Error on the input-to-hidden units:

∂J/∂w_ji = (∂J/∂y_j) (∂y_j/∂net_j) (∂net_j/∂w_ji).

However,

∂J/∂y_j = ∂/∂y_j [ (1/2) Σ_{k=1}^{c} (t_k − z_k)² ] = −Σ_{k=1}^{c} (t_k − z_k) (∂z_k/∂y_j) = −Σ_{k=1}^{c} (t_k − z_k) f'(net_k) w_kj.

Similarly to the preceding case, we define the sensitivity for a hidden unit:

δ_j ≡ f'(net_j) Σ_{k=1}^{c} w_kj δ_k,

which means that the sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units, weighted by the hidden-to-output weights w_kj, all multiplied by f'(net_j).

Conclusion: the learning rule for the input-to-hidden weights is:

Δw_ji = η δ_j x_i = η [ Σ_{k=1}^{c} w_kj δ_k ] f'(net_j) x_i.
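
A minimal NumPy sketch of these two update rules for a single pattern (the network sizes, random initial weights and the logistic activation are illustrative assumptions; W_kj holds the hidden-to-output weights, W_ji the input-to-hidden ones):

```python
import numpy as np

def f(v):                                         # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

def backprop_single_pattern(x, t, W_ji, W_kj, eta=0.5):
    """One gradient step on one pattern for a d -> nH -> c network with bias units."""
    x = np.append(x, 1.0)                         # input plus bias
    net_j = W_ji @ x                              # hidden net activations
    y = np.append(f(net_j), 1.0)                  # hidden outputs plus bias
    net_k = W_kj @ y
    z = f(net_k)                                  # network outputs

    delta_k = (t - z) * f(net_k) * (1 - f(net_k))                      # output sensitivities
    delta_j = f(net_j) * (1 - f(net_j)) * (W_kj[:, :-1].T @ delta_k)   # hidden sensitivities

    W_kj += eta * np.outer(delta_k, y)            # hidden-to-output update
    W_ji += eta * np.outer(delta_j, x)            # input-to-hidden update
    return W_ji, W_kj

rng = np.random.default_rng(0)
W_ji = rng.normal(scale=0.1, size=(2, 3))         # 2 hidden units, 2 inputs + bias
W_kj = rng.normal(scale=0.1, size=(1, 3))         # 1 output, 2 hidden + bias
backprop_single_pattern(np.array([0.0, 1.0]), np.array([1.0]), W_ji, W_kj)
```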

STOCHASTIC BACKPROPAGATION
Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as:

Begin  initialize n_H; w, criterion θ, η, m ← 0
    do  m ← m + 1
        x^m ← randomly chosen pattern
        w_ji ← w_ji + η δ_j x_i;  w_kj ← w_kj + η δ_k y_j
    until ‖∇J(w)‖ < θ
    return w
End

BATCH BACKPROPAGATION
Starting with a pseudo-random weight configuration, the batch backpropagation algorithm can be written as:

Begin  initialize n_H; w, criterion θ, η, r ← 0
    do  r ← r + 1  (epoch)
        m ← 0;  Δw_ji ← 0;  Δw_kj ← 0
        do  m ← m + 1
            x^m ← select pattern
            Δw_ji ← Δw_ji + η δ_j x_i;  Δw_kj ← Δw_kj + η δ_k y_j
        until m = n
        w_ji ← w_ji + Δw_ji;  w_kj ← w_kj + Δw_kj
    until ‖∇J(w)‖ < θ
    return w
End

Stopping criterion. The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value. There are other stopping criteria that lead to better performance than this one. So far, we have considered the error on a single pattern, but we want to consider an error defined over the entirety of patterns in the training set. The total training error is the sum over the errors of the n individual patterns:

J = Σ_{p=1}^{n} J_p.    (1)

Stopping criterion (cont.): A weight update may reduce the error on the single pattern being presented, but can increase the error on the full training set. However, given a large number of such individual updates, the total error of equation (1) decreases.

Definitions. Training set: a set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier. Validation set: a set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network. Test set: a set of examples used only to assess the performance [generalization] of a fully specified classifier.

HOLD-OUT METHOD. Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold-out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set. The crucial point is that a test set is never used to choose among two or more networks, so that the error on the test set provides an unbiased estimate of the generalization error. Any data set that is used to choose the best of two or more networks is, by definition, a validation set, and the error of the chosen network on the validation set is optimistically biased.

Read more: http://www.faqs.org/faqs/ai-faq/neural-nets/part1/section-14.html
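
A small illustration of the three-way split described above, using scikit-learn's train_test_split (the data and the 60/20/20 proportions are assumptions for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 10)              # assumed feature matrix
y = np.random.randint(0, 2, size=1000)     # assumed binary labels

# 60% training, 20% validation (model selection), 20% test (final, unbiased estimate)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```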

Learning Curves. Before training starts, the error on the training set is high; through the learning process, the error becomes smaller. The error per pattern depends on the amount of training data and the expressive power (such as the number of weights) of the network. The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase. A validation set is used in order to decide when to stop training; we do not want to overfit the network and decrease the generalization power of the classifier, so we stop training at a minimum of the error on the validation set.


REGULARIZATION. Choice of the network size: how big can a network be, how many layers and how many neurons per layer? There are two major directions:

Pruning Techniques. These techniques start from a large network, and then weights and/or neurons are removed iteratively, according to a criterion.

Methods based on parameter sensitivity:

δJ = Σ_i g_i δw_i + (1/2) Σ_i h_ii δw_i² + (1/2) Σ_i Σ_{j≠i} h_ij δw_i δw_j + higher order terms,

where h_ij = ∂²J / (∂w_i ∂w_j) and g_i = ∂J/∂w_i. Near a minimum, and assuming the Hessian is approximately diagonal,

δJ ≈ (1/2) Σ_i h_ii δw_i².

Pruning is now achieved by the following procedure: Train the network using Backpropagation for a number of steps. Compute the saliencies s_i = h_ii w_i² / 2. Remove weights with small s_i. Repeat the process.

Methods based on function regularization:

J = Σ_{k=1}^{N} E(k) + a E_p(w).

The second term favours small values for the weights, e.g.,

E_p(w) = Σ_k h(w_k²),  where  h(w_k²) = w_k² / (w_0² + w_k²)

and w_0 is a preselected parameter. After some training steps, weights with small values are removed.
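
A rough sketch of this weight-elimination idea in Python: compute the penalty term and, after training, zero out the weights that stayed small (the w_0 value and the pruning threshold are assumed for illustration):

```python
import numpy as np

def weight_penalty(w, w0=1.0):
    """E_p(w) = sum_k w_k^2 / (w0^2 + w_k^2): close to 1 for large weights, close to 0 for small ones."""
    return np.sum(w**2 / (w0**2 + w**2))

def prune_small_weights(w, threshold=0.05):
    """After some training steps, remove (zero) the weights that remained small."""
    w = w.copy()
    w[np.abs(w) < threshold] = 0.0
    return w

w = np.array([1.2, -0.01, 0.4, 0.03, -0.8])
print(weight_penalty(w), prune_small_weights(w))
```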

Constructive techniques. These start with a small network and keep increasing it, according to a predetermined procedure and criterion.

Remark: Why not start with a large network and leave the algorithm to decide which weights are small? This approach is just naïve. It overlooks the fact that classifiers must have good generalization properties. A large network can result in small errors on the training set, since it can learn the particular details of the training set; on the other hand, it will not be able to perform well when presented with data unknown to it. The size of the network must be: large enough to learn what makes data of the same class similar and data from different classes dissimilar; small enough not to be able to learn the underlying differences between data of the same class. Otherwise, this leads to so-called overfitting.

Example (figure).

Overtraining is another side of the same coin, i.e., the network adapts to the peculiarities of the training set.

Generalized Linear Classifiers. Remember the XOR problem and the mapping

x → y = [ f(g_1(x)), f(g_2(x)) ]^T.

The activation function f(·) transforms the nonlinear task into a linear one. In the more general case: let x ∈ R^l and consider a nonlinear classification task, with functions f_i(·), i = 1, 2, ..., k.

Are there any functions f_i(·) and an appropriate k, so that the mapping

x → y = [ f_1(x), ..., f_k(x) ]^T

transforms the task into a linear one in the y ∈ R^k space? If this is true, then there exists a hyperplane w ∈ R^k so that:

if x ∈ ω_1:  w_0 + w^T y > 0,
if x ∈ ω_2:  w_0 + w^T y < 0.

In such a case this is equivalent to approximating the nonlinear discriminant function g(x) in terms of f_i(x), i.e.,

g(x) ≈ w_0 + Σ_{i=1}^{k} w_i f_i(x)  (>< 0).

Given the f_i(x), the task of computing the weights is a linear one. How sensible is this? From the numerical analysis point of view, this is justified if the f_i(x) are interpolation functions. From the Pattern Recognition point of view, this is justified by Cover's theorem.

Capacity of the l-dimensional space in Linear Dichotomies. Assume N points in R^l, assumed to be in general position, that is: no subset of l + 1 of them lies on an (l − 1)-dimensional space.

Cover's theorem states: the number of groupings that can be formed by (l−1)-dimensional hyperplanes to separate N points into two classes is

O(N, l) = 2 Σ_{i=0}^{l} C(N−1, i),  where  C(N−1, i) = (N−1)! / ( (N−1−i)! i! ).

Example: N = 4, l = 2, O(4, 2) = 14. Notice: the total number of possible groupings is 2^4 = 16.
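
A quick numeric check of Cover's count (illustrative Python):

```python
from math import comb

def cover_count(N, l):
    """O(N, l) = 2 * sum_{i=0}^{l} C(N-1, i): number of linearly separable
    dichotomies of N points in general position in l dimensions."""
    return 2 * sum(comb(N - 1, i) for i in range(l + 1))

print(cover_count(4, 2))      # 14, out of 2**4 = 16 possible groupings
```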

The probability of grouping N points into two linearly separable classes is

P_N^l = O(N, l) / 2^N,  with N = r(l + 1).

Thus, the probability of having N points in linearly separable classes tends to 1, for large l, provided N < 2(l + 1). Hence, by mapping to a higher-dimensional space, we increase the probability of linear separability, provided the space is not too densely populated.

Radial Basis Function Networks (RBF). Choose:

f_i(x) = exp( −‖x − c_i‖² / (2σ_i²) ).

This is equivalent to a single-layer network, with RBF activations and a linear output node.

Example: The XOR problem. Define:

c_1 = [1, 1]^T,  c_2 = [0, 0]^T,
y = [ exp(−‖x − c_1‖²), exp(−‖x − c_2‖²) ]^T.

x = (0, 0) → y = (0.135, 1)
x = (0, 1) → y = (0.368, 0.368)
x = (1, 0) → y = (0.368, 0.368)
x = (1, 1) → y = (1, 0.135)

g(y) = y_1 + y_2 − 1 = 0
g(x) = exp(−‖x − c_1‖²) + exp(−‖x − c_2‖²) − 1 = 0
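
A short check, in Python, that the RBF mapping above makes XOR linearly separable via g(y) = y_1 + y_2 − 1:

```python
import numpy as np

c1, c2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def rbf_map(x):
    return np.array([np.exp(-np.sum((x - c1) ** 2)),
                     np.exp(-np.sum((x - c2) ** 2))])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    y = rbf_map(np.array(x, dtype=float))
    g = y[0] + y[1] - 1.0                      # g(y) = y1 + y2 - 1
    print(x, np.round(y, 3), 'class B' if g > 0 else 'class A')
```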

Training of the RBF networks.

Fixed centers: choose the centers randomly among the data points; also fix the σ's. Then

g(x) = w_0 + w^T y

is a typical linear classifier design.

Training of the centers: this is a nonlinear optimization task.

Combine supervised and unsupervised learning procedures: the unsupervised part reveals clustering tendencies of the data and assigns the centers at the cluster representatives.

Universal Approximators. It has been shown that any nonlinear continuous function can be approximated arbitrarily closely, both by a two-layer perceptron with sigmoid activations and by an RBF network, provided a large enough number of nodes is used.

Multilayer Perceptrons vs. RBF networks: MLPs involve activations of a global nature; all points on a plane w^T x = c give the same response. RBF networks have activations of a local nature, due to the exponential decrease as one moves away from the centers. MLPs learn more slowly but have better generalization properties.

Support Vector Machines: The non-linear case. Recall that the probability of having linearly separable classes increases as the dimensionality of the feature vectors increases. Assume the mapping:

x ∈ R^l → y ∈ R^k,  k > l.

Then use an SVM in R^k. Recall that in this case the dual problem formulation will be:

maximize  Σ_{i=1}^{N} λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j y_i^T y_j,

i.e., the same dual as before, but with the inner products now computed between the transformed vectors y_i ∈ R^k.

Also, the classifier will be

g(y) = w^T y + w_0,  where  w = Σ_{i=1}^{N_s} λ_i y_i y_i,

the sum running over the N_s support vectors. Thus, inner products in a high-dimensional space are involved, hence high complexity.

Something clever: compute the inner products in the high-dimensional space as functions of inner products performed in the low-dimensional space!!! Is this POSSIBLE? Yes. Here is an example. Let

x = [x_1, x_2]^T ∈ R²,  y = [ x_1², √2 x_1 x_2, x_2² ]^T ∈ R³.

Then, it is easy to show that

y_i^T y_j = ( x_i^T x_j )².
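
A numerical check of this identity (Python):

```python
import numpy as np

def phi(x):
    """Explicit mapping R^2 -> R^3: [x1^2, sqrt(2) x1 x2, x2^2]."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, -1.0])
print(phi(x_i) @ phi(x_j))        # inner product in the high-dimensional space
print((x_i @ x_j) ** 2)           # same value computed in the original space: (x_i^T x_j)^2
```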

Mercer's Theorem. Let x → Φ(x) ∈ H. Then the inner product in H can be written as

Σ_r Φ_r(x) Φ_r(y) = K(x, y),

where

∫∫ K(x, y) g(x) g(y) dx dy ≥ 0

for any g(x) with ∫ g(x)² dx < +∞. K(x, y) is a symmetric function known as a kernel.

The opposite is also true: any kernel with the above properties corresponds to an inner product in SOME space!!!

Examples of kernels:
Radial Basis Functions: K(x, z) = exp( −‖x − z‖² / σ² )
Polynomial: K(x, z) = ( x^T z + 1 )^q,  q > 0
Hyperbolic Tangent: K(x, z) = tanh( β x^T z + γ ), for appropriate values of β, γ.

SVM Formulation.
Step 1: Choose an appropriate kernel. This implicitly assumes a mapping to a higher-dimensional (yet not known) space.
Step 2:

maximize  Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j K(x_i, x_j)
subject to:  0 ≤ λ_i ≤ C,  i = 1, 2, ..., N,  Σ_i λ_i y_i = 0.

This results in an implicit combination

w = Σ_{i=1}^{N_s} λ_i y_i Φ(x_i).

Step 3: Assign x to ω_1 (ω_2) if

g(x) = Σ_{i=1}^{N_s} λ_i y_i K(x_i, x) + w_0 > (<) 0.

The SVM Architecture.
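
For illustration, the three steps with an off-the-shelf solver (scikit-learn's SVC; the RBF kernel, the C and gamma values, and the toy XOR data are assumptions for the example):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                 # XOR labels

clf = SVC(kernel='rbf', gamma=1.0, C=10.0)   # Step 1: choose the kernel
clf.fit(X, y)                                # Step 2: solve the dual for the lambdas
print(clf.predict(X))                        # Step 3: sign of sum_i lambda_i y_i K(x_i, x) + w0
```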


Decision Trees. This is a family of non-linear classifiers. They are multistage decision systems, in which classes are sequentially rejected until a finally accepted class is reached. To this end: the feature space is split into unique regions in a sequential manner. Upon the arrival of a feature vector, sequential decisions, assigning features to specific regions, are performed along a path of nodes of an appropriately constructed tree. The sequence of decisions is applied to individual features, and the queries performed at each node are of the type:

is feature x_i ≤ a?

where a is a pre-chosen (during training) threshold.

The figures below are such examples. This type of tree is known as an Ordinary Binary Classification Tree (OBCT). The decision hyperplanes, splitting the space into regions, are parallel to the axes of the space. Other types of partition are also possible, yet less popular.

Design elements that define a decision tree: Each node t is associated with a subset X_t ⊆ X, where X is the training set. At each node, X_t is split into two (binary splits) disjoint descendant subsets X_{t,Y} and X_{t,N}, where

X_{t,Y} ∩ X_{t,N} = Ø,  X_{t,Y} ∪ X_{t,N} = X_t.

X_{t,Y} is the subset of X_t for which the answer to the query at node t is YES; X_{t,N} is the subset corresponding to NO. The split is decided according to an adopted question (query).

A splitting criterion must be adopted for the best split of X_t into X_{t,Y} and X_{t,N}. A stop-splitting criterion must be adopted, which controls the growth of the tree and according to which a node is declared terminal (a leaf). A rule is required that assigns each terminal node (leaf) to a class.

Set of Questions: In OBCT trees the set of questions is of the type "is x_i ≤ a?". The choice of the specific x_i and of the value of the threshold a, for each node t, are the results of a search, during training, among the features and a set of possible threshold values. The final combination is the one that results in the best value of a criterion.

Splitting Criterion: The main idea behind splitting at each node is that the resulting descendant subsets X_{t,Y} and X_{t,N} should be more class-homogeneous compared to X_t. Thus the criterion must be in harmony with such a goal. A commonly used criterion is the node impurity:

I(t) = − Σ_{i=1}^{M} P(ω_i | t) log₂ P(ω_i | t),  where  P(ω_i | t) = N_t^i / N_t

and N_t^i is the number of data points in X_t that belong to class ω_i. The decrease in node impurity is defined as:

ΔI(t) = I(t) − (N_{t,Y} / N_t) I(t_Y) − (N_{t,N} / N_t) I(t_N).
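
A small sketch of these impurity computations in Python (the class counts are illustrative):

```python
import numpy as np

def impurity(class_counts):
    """Entropy impurity I(t) = -sum_i P(w_i|t) log2 P(w_i|t)."""
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def impurity_decrease(counts_t, counts_yes, counts_no):
    """Delta I(t) = I(t) - (N_tY/N_t) I(t_Y) - (N_tN/N_t) I(t_N)."""
    n_t, n_y, n_n = sum(counts_t), sum(counts_yes), sum(counts_no)
    return impurity(counts_t) - (n_y / n_t) * impurity(counts_yes) \
                              - (n_n / n_t) * impurity(counts_no)

# Node with 10 + 10 points, split into (9, 1) and (1, 9): a large impurity decrease
print(impurity_decrease([10, 10], [9, 1], [1, 9]))
```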

The goal is to choose the parameters at each node (feature and threshold) that result in a split with the highest decrease in impurity. Why the highest decrease? Observe that the highest value of I(t) is achieved if all classes are equiprobable, i.e., X_t is the least homogeneous.

Stop-splitting rule: adopt a threshold T and stop splitting a node (i.e., assign it as a leaf) if the impurity decrease is less than T. That is, node t is "pure enough".

Class Assignment Rule: assign a leaf to class ω_j, where

j = arg max_i P(ω_i | t).

Summary of an OBCT algorithmic scheme:

Remarks: A critical factor in the design is the size of the tree. Usually one grows a tree to a large size and then applies various pruning techniques. Decision trees belong to the class of unstable classifiers; this can be overcome by a number of averaging techniques. Bagging is a popular technique: using bootstrap techniques in X, various trees are constructed, T_i, i = 1, 2, ..., B, and the decision is taken according to a majority voting rule.

Combining Classifiers. The basic philosophy behind the combination of different classifiers lies in the fact that even the best classifier fails on some patterns that other classifiers may classify correctly. Combining classifiers aims at exploiting this complementary information residing in the various classifiers. Thus, one designs different optimal classifiers and then combines the results with a specific rule. Assume that each of the, say, L designed classifiers provides at its output the posterior probabilities:

P(ω_i | x),  i = 1, 2, ..., M.

Product Rule: Assign x to class ω_i, where

i = arg max_k Π_{j=1}^{L} P_j(ω_k | x),

and P_j(ω_k | x) is the respective posterior probability of the j-th classifier.

Sum Rule: Assign x to class ω_i, where

i = arg max_k Σ_{j=1}^{L} P_j(ω_k | x).

Majority Voting Rule: Assign x to the class for which there is a consensus, or when at least ℓ_c of the classifiers agree on the class label of x, where

ℓ_c = L/2 + 1 if L is even,  (L + 1)/2 if L is odd;

otherwise the decision is rejection, that is, no decision is taken. Thus, a correct decision is made if the majority of the classifiers agree on the correct label, and a wrong decision if the majority agrees on the wrong label.
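
A compact sketch of the three combination rules for L classifiers that each output posterior probabilities (the toy posteriors are assumptions):

```python
import numpy as np

# posteriors[j, i] = P_j(w_i | x): output of classifier j for class i (L = 3 classifiers, M = 2 classes)
posteriors = np.array([[0.6, 0.4],
                       [0.3, 0.7],
                       [0.8, 0.2]])

product_rule = np.argmax(np.prod(posteriors, axis=0))
sum_rule     = np.argmax(np.sum(posteriors, axis=0))

votes = np.argmax(posteriors, axis=1)                 # each classifier's label
counts = np.bincount(votes, minlength=posteriors.shape[1])
L = posteriors.shape[0]
lc = L // 2 + 1 if L % 2 == 0 else (L + 1) // 2       # required consensus size
majority = np.argmax(counts) if counts.max() >= lc else None   # None = rejection

print(product_rule, sum_rule, majority)
```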

Dependent or not dependent classifiers? Although there are no general theoretical results, experimental evidence has shown that the more independent in their decisions the classifiers are, the higher the expectation should be for obtaining improved results after combination. However, there is no guarantee that combining classifiers results in better performance compared to the best one among the individual classifiers.

Towards Independence: a number of scenarios. Train the individual classifiers using different training data points. To this end, choose among a number of possibilities: Bootstrapping: this is a popular technique to combine unstable classifiers such as decision trees (Bagging belongs to this category of combination).

Stacking: Train the combiner with data points that have been excluded from the set used to train the individual classifiers. Use different subspaces to train individual classifiers: according to this method, each individual classifier operates in a different feature subspace, that is, different features are used for each classifier.

Remarks: The majority voting and the summation schemes rank among the most popular combination schemes. Training individual classifiers in different subspaces seems to lead to substantially better improvements compared to classifiers operating in the same subspace. Besides the above three rules, other alternatives are also possible, such as using the median value of the outputs of the individual classifiers.

The Boosting Approach. The origins: Is it possible for a weak learning algorithm (one that performs slightly better than random guessing) to be boosted into a strong algorithm? (Valiant, 1984).

The procedure to achieve it: Adopt a weak classifier, known as the base classifier. Employing the base classifier, design a series of classifiers in a hierarchical fashion, each time employing a different weighting of the training samples. Emphasis in the weighting is given to the hardest samples, i.e., the ones that keep failing. Combine the hierarchically designed classifiers by a weighted-average procedure.

The AdaBoost Algorithm. Construct an optimally designed classifier of the form:

f(x) = sign( F(x) ),  where  F(x) = Σ_{k=1}^{K} a_k φ(x; θ_k),

where φ(x; θ) denotes the base classifier, which returns a binary class label φ(x; θ) ∈ {−1, 1}, and θ is a parameter vector.

The essence of the method: Design the series of classifiers φ(x; θ_1), φ(x; θ_2), ..., φ(x; θ_K). The parameter vectors θ_k, k = 1, 2, ..., K, are optimally computed so as to minimize the error rate on the training set. Each time, the training samples are re-weighted so that the weight of each sample depends on its history: hard samples that insist on failing to be predicted correctly by the previously designed classifiers are more heavily weighted.

Updating the weights for each sample x_i, i = 1, 2, ..., N:

w_i^(m) = w_i^(m−1) exp( −y_i a_m φ(x_i; θ_m) ) / Z_m,

where Z_m is a normalizing factor common to all samples, and

a_m = (1/2) ln( (1 − P_m) / P_m ),

where P_m < 0.5 (by assumption) is the error rate of the optimal classifier φ(x; θ_m) at stage m; thus a_m > 0. The term exp( −y_i a_m φ(x_i; θ_m) ) takes a large value if y_i φ(x_i; θ_m) < 0 (wrong classification) and a small value in the case of correct classification, y_i φ(x_i; θ_m) > 0. The update equation is of a multiplicative nature; that is, successive large values of the weights (hard samples) result in a larger weight for the next iteration.

The algorithm:
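
A minimal Python sketch of the AdaBoost loop described above, assuming decision stumps as the base classifier and toy one-dimensional data (both assumptions made here for illustration):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weak learner: best single-feature threshold classifier under sample weights w."""
    best = (None, np.inf)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, f] <= thr, 1, -1)
                err = np.sum(w[pred != y])            # weighted error rate P_m
                if err < best[1]:
                    best = ((f, thr, sign), err)
    return best

def adaboost(X, y, K=10):
    N = len(y)
    w = np.ones(N) / N                       # initial sample weights
    ensemble = []
    for _ in range(K):
        (f, thr, sign), P_m = fit_stump(X, y, w)
        a_m = 0.5 * np.log((1 - P_m) / P_m)  # classifier weight, a_m > 0 since P_m < 0.5
        pred = sign * np.where(X[:, f] <= thr, 1, -1)
        w *= np.exp(-y * a_m * pred)         # multiplicative re-weighting of the samples
        w /= w.sum()                         # Z_m normalization
        ensemble.append((a_m, f, thr, sign))
    return ensemble

def predict(ensemble, X):
    F = sum(a * s * np.where(X[:, f] <= t, 1, -1) for a, f, t, s in ensemble)
    return np.sign(F)                        # f(x) = sign(F(x))

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])           # not separable by a single stump
model = adaboost(X, y, K=20)
print(predict(model, X))
```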

Remarks: The training error rate tends to zero after a few iterations. The test error levels off at some value. AdaBoost is greedy in reducing the margin that samples leave from the decision surface.