Neural Networks

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Neural Networks

Origins: algorithms that try to mimic the brain.
- 1940s and 1950s: Hebbian learning and the Perceptron
- The Perceptrons book in 1969 and the XOR problem
- Very widely used in the 1980s and early 1990s; popularity diminished in the late 1990s
- Recent resurgence: state-of-the-art technique for many applications
- Artificial neural networks are not nearly as complex or intricate as the actual brain structure

Based on slide by Andrew Ng
Single Node

With bias unit x_0 = 1:

x = [x_0; x_1; x_2; x_3],   \theta = [\theta_0; \theta_1; \theta_2; \theta_3]

h_\theta(x) = g(\theta^T x) = 1 / (1 + e^{-\theta^T x})

Sigmoid (logistic) activation function:  g(z) = 1 / (1 + e^{-z})

Based on slide by Andrew Ng
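As a concrete illustration, here is a minimal NumPy sketch of a single sigmoid node; the weight and input values are made up for the example:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def single_node(theta, x):
    """Output of one sigmoid unit: h_theta(x) = g(theta^T x).
    x includes the bias unit x_0 = 1 as its first entry."""
    return sigmoid(theta @ x)

# Illustrative weights and input (not from the slides)
theta = np.array([-1.0, 0.5, 2.0, -0.5])   # [theta_0, ..., theta_3]
x = np.array([1.0, 0.2, 0.4, 0.6])         # x_0 = 1 is the bias unit
print(single_node(theta, x))
```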
Neural Network

[Figure: a three-layer network with bias units; inputs x feed the hidden activations a^{(2)}, which feed the output h_\Theta(x)]

Layer 1 (Input Layer)   Layer 2 (Hidden Layer)   Layer 3 (Output Layer)

Slide by Andrew Ng
Neural Network Terminology

[Figure: layered feed-forward network with input units, hidden units, and output units]

- Neural networks are made up of nodes or units, connected by links
- Each link has an associated weight and activation level
- Each node has an input function (typically summing over weighted inputs), an activation function, and an output

Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
Feed-Forward Process

- Input layer units are set by external data, which causes their output links to be activated at the specified level
- Working forward through the network, the input function of each unit is applied to compute the input value
  - Usually this is just the weighted sum of the activations on the links feeding into this node
- The activation function transforms this input value into a final value
  - Typically this is a nonlinear function, often a sigmoid function corresponding to the "threshold" of that node

Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
a_i^{(j)} = "activation" of unit i in layer j
\Theta^{(j)} = weight matrix storing the parameters from layer j to layer j+1

If the network has s_j units in layer j and s_{j+1} units in layer j+1, then \Theta^{(j)} has dimension s_{j+1} \times (s_j + 1).

Here, \Theta^{(1)} \in \mathbb{R}^{3 \times 4} and \Theta^{(2)} \in \mathbb{R}^{1 \times 4}.

Slide by Andrew Ng
Vectorization

a_1^{(2)} = g(z_1^{(2)}) = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3)
a_2^{(2)} = g(z_2^{(2)}) = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3)
a_3^{(2)} = g(z_3^{(2)}) = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)
h_\Theta(x) = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)}) = g(z^{(3)})

Feed-Forward Steps:
z^{(2)} = \Theta^{(1)} x
a^{(2)} = g(z^{(2)})
Add a_0^{(2)} = 1
z^{(3)} = \Theta^{(2)} a^{(2)}
h_\Theta(x) = a^{(3)} = g(z^{(3)})

Based on slide by Andrew Ng
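A minimal NumPy sketch of these feed-forward steps; the weight values are random placeholders, matching the dimensions \Theta^{(1)} \in \mathbb{R}^{3 \times 4} and \Theta^{(2)} \in \mathbb{R}^{1 \times 4} above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Theta1, Theta2, x):
    """Vectorized feed-forward for one hidden layer.
    Theta1: s_2 x (d+1), Theta2: 1 x (s_2+1); x has no bias entry."""
    a1 = np.concatenate(([1.0], x))             # add bias unit x_0 = 1
    z2 = Theta1 @ a1                            # z^(2) = Theta^(1) x
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # a^(2) = g(z^(2)); add a_0^(2) = 1
    z3 = Theta2 @ a2                            # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                          # h_Theta(x) = g(z^(3))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
print(forward(Theta1, Theta2, np.array([0.5, -1.0, 2.0])))
```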
Other Network Architectures

[Figure: a 4-layer network ending in h_\Theta(x), e.g. s = [3, 3, 2, 1]]
Layer 1    Layer 2    Layer 3    Layer 4

- L denotes the number of layers
- s \in \mathbb{N}^L contains the number of nodes at each layer (not counting bias units)
- Typically, s_1 = d (# input features) and s_L = K (# classes)
Multiple Output Units: One-vs-Rest

[Figure: a network with four output units, one each for Pedestrian, Car, Motorcycle, Truck]

h_\Theta(x) \in \mathbb{R}^K

We want:
h_\Theta(x) \approx [1; 0; 0; 0] when pedestrian
h_\Theta(x) \approx [0; 1; 0; 0] when car
h_\Theta(x) \approx [0; 0; 1; 0] when motorcycle
h_\Theta(x) \approx [0; 0; 0; 1] when truck

Slide by Andrew Ng
Multiple Output Units: One-vs-Rest

h_\Theta(x) \in \mathbb{R}^K

We want h_\Theta(x) \approx [1; 0; 0; 0] when pedestrian, [0; 1; 0; 0] when car, [0; 0; 1; 0] when motorcycle, [0; 0; 0; 1] when truck.

Given {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, we must convert the labels to a 1-of-K representation:
e.g., y_i = [0; 0; 1; 0] when motorcycle, y_i = [0; 1; 0; 0] when car, etc.

Based on slide by Andrew Ng
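A small sketch of the 1-of-K conversion; the class-to-index mapping here is just for illustration:

```python
import numpy as np

def one_of_k(labels, K):
    """Convert integer class labels (0..K-1) to a 1-of-K (one-hot) matrix."""
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

# Example mapping: 0=pedestrian, 1=car, 2=motorcycle, 3=truck
print(one_of_k([2, 1, 3], K=4))
# [[0. 0. 1. 0.]   <- motorcycle
#  [0. 1. 0. 0.]   <- car
#  [0. 0. 0. 1.]]  <- truck
```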
Neural Network Classification

Given: {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
s \in \mathbb{N}^L contains the # nodes at each layer; s_1 = d (# features)

Binary classification:
- y = 0 or 1
- 1 output unit (s_L = 1)

Multi-class classification (K classes):
- y \in \mathbb{R}^K, e.g. [1;0;0;0], [0;1;0;0], [0;0;1;0], [0;0;0;1] for pedestrian, car, motorcycle, truck
- K output units (s_L = K)

Slide by Andrew Ng
Understanding Representations
Representing Boolean Functions

Simple example: AND

h_\Theta(x) = g(-30 + 20 x_1 + 20 x_2)

Logistic / sigmoid activation function g(z):

x_1  x_2  h_\Theta(x)
0    0    g(-30) ≈ 0
0    1    g(-10) ≈ 0
1    0    g(-10) ≈ 0
1    1    g(10)  ≈ 1

Based on slide and example by Andrew Ng
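Checking the truth table numerically with the weights from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# AND unit: h(x) = g(-30 + 20*x1 + 20*x2)
for x1 in (0, 1):
    for x2 in (0, 1):
        h = sigmoid(-30 + 20 * x1 + 20 * x2)
        print(x1, x2, round(h, 4))
# Prints ~0 for (0,0), (0,1), (1,0) and ~1 for (1,1)
```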
Representing Boolean Functions

AND:  h_\Theta(x) = g(-30 + 20 x_1 + 20 x_2)
OR:   h_\Theta(x) = g(-10 + 20 x_1 + 20 x_2)
NOT:  h_\Theta(x) = g(10 - 20 x_1)
(NOT x_1) AND (NOT x_2):  h_\Theta(x) = g(10 - 20 x_1 - 20 x_2)
Combining Representations to Create Non-Linear Functions

NOT XOR (XNOR): output 1 iff x_1 = x_2

[Figure: the positive examples lie in quadrants I and III, so no single linear unit can separate them]

- Hidden unit a_1^{(2)} computes x_1 AND x_2:  g(-30 + 20 x_1 + 20 x_2)
- Hidden unit a_2^{(2)} computes (NOT x_1) AND (NOT x_2):  g(10 - 20 x_1 - 20 x_2)
- The output unit computes a_1^{(2)} OR a_2^{(2)}:  h_\Theta(x) = g(-10 + 20 a_1^{(2)} + 20 a_2^{(2)})

A sketch of this network is given below.

Based on example by Andrew Ng
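The two-layer XNOR network above, written out directly with the weights from the previous slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xnor(x1, x2):
    """Two-layer network computing NOT XOR from the units above."""
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)    # x1 AND x2
    a2 = sigmoid(10 - 20 * x1 - 20 * x2)     # (NOT x1) AND (NOT x2)
    return sigmoid(-10 + 20 * a1 + 20 * a2)  # a1 OR a2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2), 4))
# Outputs ~1 when x1 == x2, ~0 otherwise
```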
Layering Representations

x_1 ... x_20
x_21 ... x_40
x_41 ... x_60
...
x_381 ... x_400

20 × 20 pixel images; d = 400, 10 classes

Each image is "unrolled" into a vector x of pixel intensities.
Layering Representations

x_1, x_2, x_3, x_4, x_5, ..., x_d

Input Layer → Hidden Layer → Output Layer (digit classes "0" through "9")
Neural Network Learning
Perceptron Learning Rule

\theta \leftarrow \theta + \alpha (y - h(x)) x

Intuitive rule:
- If the output is correct, don't change the weights
- If the output is low (h(x) = 0, y = 1), increment the weights for all inputs that are 1
- If the output is high (h(x) = 1, y = 0), decrement the weights for all inputs that are 1

Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge [Minsky & Papert, 1969]
Batch Perceptron

Given training data {(x^{(i)}, y^{(i)})}_{i=1}^{n}:
  Let \theta \leftarrow [0, 0, ..., 0]
  Repeat:
    Let \Delta \leftarrow [0, 0, ..., 0]
    for i = 1 ... n, do
      if y^{(i)} x^{(i)} \cdot \theta \le 0        // prediction for the i-th instance is incorrect
        \Delta \leftarrow \Delta + y^{(i)} x^{(i)}
    \Delta \leftarrow \Delta / n                    // compute average update
    \theta \leftarrow \theta + \alpha \Delta
  Until \|\alpha \Delta\|_2 < \epsilon

- Simplest case: \alpha = 1 and don't normalize; this yields the fixed increment perceptron
- Each iteration of the outer loop is called an epoch

Based on slide by Alan Fern
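A direct sketch of this pseudocode in NumPy, assuming labels y \in {-1, +1} and a bias feature already included in X; the example data is made up:

```python
import numpy as np

def batch_perceptron(X, y, alpha=1.0, eps=1e-6, max_epochs=1000):
    """Batch perceptron. X: n x d (bias feature included), y in {-1, +1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        delta = np.zeros_like(theta)
        for xi, yi in zip(X, y):
            if yi * (xi @ theta) <= 0:   # misclassified (or on the boundary)
                delta += yi * xi
        delta /= len(y)                  # average update
        theta += alpha * delta
        if alpha * np.linalg.norm(delta) < eps:  # converged
            break
    return theta

# Tiny linearly separable example (first column is the bias feature)
X = np.array([[1., 2., 1.], [1., 0., -1.], [1., -2., -1.], [1., 1., 2.]])
y = np.array([1, -1, -1, 1])
print(batch_perceptron(X, y))
```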
Learning in NN: Backpropagation

- Similar to the perceptron learning algorithm, we cycle through our examples
  - If the output of the network is correct, no changes are made
  - If there is an error, the weights are adjusted to reduce the error
- We are just performing (stochastic) gradient descent!

Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr
Cost Function

Logistic regression:

J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} [ y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) ] + \frac{\lambda}{2n} \sum_{j=1}^{d} \theta_j^2

Neural network: h_\Theta \in \mathbb{R}^K, with (h_\Theta(x))_i = the i-th output

J(\Theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} [ y_{ik} \log(h_\Theta(x_i))_k + (1 - y_{ik}) \log(1 - (h_\Theta(x_i))_k) ] + \frac{\lambda}{2n} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2

(The first bracketed term is the cost when the k-th class is the true label; the second is the cost when it is not.)

Based on slide by Andrew Ng
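A minimal sketch of this cost, assuming a `forward` helper (like the feed-forward sketch earlier, here hypothetical) that maps an input to the length-K output vector, and one-hot labels Y:

```python
import numpy as np

def nn_cost(Thetas, X, Y, lam, forward):
    """Regularized NN cost J(Theta).
    Thetas: list of weight matrices; Y: n x K one-hot labels;
    forward(Thetas, x) -> length-K output (assumed helper)."""
    n = len(X)
    J = 0.0
    for x, y in zip(X, Y):
        h = forward(Thetas, x)
        J -= np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    J /= n
    # Regularization: squared weights, excluding bias columns (j = 0)
    J += (lam / (2 * n)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J
```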
Optimizing the Neural Network

J(\Theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} [ y_{ik} \log(h_\Theta(x_i))_k + (1 - y_{ik}) \log(1 - (h_\Theta(x_i))_k) ] + \frac{\lambda}{2n} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2

Unlike before, J(\Theta) is not convex, so gradient descent on a neural net yields a local optimum.

Solve via forward propagation and backpropagation:

\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)} \delta_i^{(l+1)}    (ignoring \lambda; i.e., if \lambda = 0)

Based on slide by Andrew Ng
Forward Propagation

Given one labeled training instance (x, y):

a^{(1)} = x
z^{(2)} = \Theta^{(1)} a^{(1)};   a^{(2)} = g(z^{(2)})   [add a_0^{(2)}]
z^{(3)} = \Theta^{(2)} a^{(2)};   a^{(3)} = g(z^{(3)})   [add a_0^{(3)}]
z^{(4)} = \Theta^{(3)} a^{(3)};   a^{(4)} = h_\Theta(x) = g(z^{(4)})

Based on slide by Andrew Ng
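Generalizing this three-step pattern to any number of layers; a sketch, with bias handling (prepend 1 on every layer except the output) as one plausible convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_all(Thetas, x):
    """Forward propagation through L layers.
    Returns all activations a^(1)..a^(L), with a bias unit prepended
    on every layer except the output."""
    a = np.concatenate(([1.0], x))   # a^(1) = x with bias unit
    activations = [a]
    for l, Theta in enumerate(Thetas):
        z = Theta @ a                # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)               # a^(l+1) = g(z^(l+1))
        if l < len(Thetas) - 1:
            a = np.concatenate(([1.0], a))  # add bias unit a_0 = 1
        activations.append(a)
    return activations               # activations[-1] is h_Theta(x)
```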
Backpropagation Intuition

\delta_j^{(l)} = "error" of node j in layer l

Formally, \delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} cost(x_i),
where cost(x_i) = y_i \log h_\Theta(x_i) + (1 - y_i) \log(1 - h_\Theta(x_i))

Based on slide by Andrew Ng
Backpropagation Intuition

At the output layer:  \delta_1^{(4)} = a_1^{(4)} - y

Based on slide by Andrew Ng
Backpropagation Intuition

\delta_2^{(3)} = \Theta_{12}^{(3)} \delta_1^{(4)}

Based on slide by Andrew Ng
Backpropagation Intuition

\delta_1^{(3)} = \Theta_{11}^{(3)} \delta_1^{(4)}
\delta_2^{(3)} = \Theta_{12}^{(3)} \delta_1^{(4)}

Based on slide by Andrew Ng
Backpropagation Intuition

\delta_2^{(2)} = \Theta_{12}^{(2)} \delta_1^{(3)} + \Theta_{22}^{(2)} \delta_2^{(3)}

Each node's error is the weighted sum of the errors of the nodes it feeds into.

Based on slide by Andrew Ng
Backpropagation: Gradient Computation

Let \delta_j^{(l)} = "error" of node j in layer l   (# layers L = 4)

Backpropagation (".*" denotes the element-wise product):
\delta^{(4)} = a^{(4)} - y
\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} .* g'(z^{(3)})
\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)})
(No \delta^{(1)})

For the sigmoid:  g'(z^{(3)}) = a^{(3)} .* (1 - a^{(3)}),   g'(z^{(2)}) = a^{(2)} .* (1 - a^{(2)})

Based on slide by Andrew Ng
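These recurrences in code, as a sketch consistent with the `forward_all` convention above (biases prepended on hidden layers; the bias component of each delta is dropped since no gradient flows through a constant):

```python
import numpy as np

def backprop_deltas(Thetas, activations, y):
    """Compute the errors delta^(L), ..., delta^(2) for one instance.
    `activations` is the list returned by forward_all."""
    deltas = [activations[-1] - y]                   # delta^(L) = a^(L) - y
    for l in range(len(Thetas) - 1, 0, -1):
        a = activations[l]
        # (Theta^(l))^T delta^(l+1) .* g'(z^(l)), with g' = a .* (1 - a)
        d = (Thetas[l].T @ deltas[0]) * a * (1 - a)
        deltas.insert(0, d[1:])                      # drop the bias component
    return deltas                                    # deltas[0] = delta^(2)
```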
Backpropagation

Set \Delta_{ij}^{(l)} = 0  \forall l, i, j    (used to accumulate the gradient)
For each training instance (x_i, y_i):
  Set a^{(1)} = x_i
  Compute {a^{(2)}, ..., a^{(L)}} via forward propagation
  Compute \delta^{(L)} = a^{(L)} - y_i
  Compute errors {\delta^{(L-1)}, ..., \delta^{(2)}}
  Accumulate gradients: \Delta_{ij}^{(l)} = \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}
Compute the average regularized gradient:
  D_{ij}^{(l)} = \frac{1}{n} \Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)}   if j \ne 0
  D_{ij}^{(l)} = \frac{1}{n} \Delta_{ij}^{(l)}                               otherwise

D^{(l)} is the matrix of partial derivatives of J(\Theta).

Based on slide by Andrew Ng
Training a Neural Network via Gradient Descent with Backprop

Given: training set {(x_1, y_1), ..., (x_n, y_n)}
Initialize all \Theta^{(l)} randomly (NOT to 0!)
Loop:   // each iteration is called an epoch
  Set \Delta_{ij}^{(l)} = 0  \forall l, i, j    (used to accumulate the gradient)
  For each training instance (x_i, y_i):        // backpropagation
    Set a^{(1)} = x_i
    Compute {a^{(2)}, ..., a^{(L)}} via forward propagation
    Compute \delta^{(L)} = a^{(L)} - y_i
    Compute errors {\delta^{(L-1)}, ..., \delta^{(2)}}
    Accumulate gradients: \Delta_{ij}^{(l)} = \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}
  Compute the average regularized gradient:
    D_{ij}^{(l)} = \frac{1}{n} \Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)}  if j \ne 0,  else \frac{1}{n} \Delta_{ij}^{(l)}
  Update weights via gradient step: \Theta_{ij}^{(l)} = \Theta_{ij}^{(l)} - \alpha D_{ij}^{(l)}
Until weights converge or the max # of epochs is reached

Based on slide by Andrew Ng
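Pulling the earlier sketches together into the full training loop; this assumes the `forward_all` and `backprop_deltas` helpers defined above, and the architecture list, learning rate, and epoch count are illustrative:

```python
import numpy as np

def train(X, Y, sizes, alpha=0.5, lam=0.0, epochs=500, seed=0):
    """Gradient descent with backprop.
    sizes = [s_1, ..., s_L] (no bias units); X: n x s_1; Y: n x s_L one-hot.
    Uses forward_all / backprop_deltas from the earlier sketches."""
    rng = np.random.default_rng(seed)
    # Random initialization (NOT zero!) in a small symmetric range
    Thetas = [rng.uniform(-0.5, 0.5, size=(sizes[l + 1], sizes[l] + 1))
              for l in range(len(sizes) - 1)]
    n = len(X)
    for _ in range(epochs):
        Deltas = [np.zeros_like(T) for T in Thetas]
        for x, y in zip(X, Y):
            acts = forward_all(Thetas, x)              # forward propagation
            deltas = backprop_deltas(Thetas, acts, y)  # delta^(2)..delta^(L)
            for l in range(len(Thetas)):
                Deltas[l] += np.outer(deltas[l], acts[l])  # a_j^(l) delta_i^(l+1)
        for l in range(len(Thetas)):
            D = Deltas[l] / n
            D[:, 1:] += lam * Thetas[l][:, 1:]  # regularize non-bias weights
            Thetas[l] -= alpha * D              # gradient step
    return Thetas
```

For instance, trained on the four XNOR input/target pairs with sizes = [2, 2, 1], this sketch typically recovers a network like the hand-built one earlier, though the learning rate and epoch count may need tuning.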
Several Practical Tricks

Initialization:
- The problem is highly non-convex, and heuristics exist to start training well (at the least, randomize the initial weights)

Optimization tricks (sketched below):
- Momentum-based methods
- Decaying step size
- Dropout to avoid co-adaptation / overfitting

Minibatch:
- Use more than a single point to estimate the gradient
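Two of these tricks in isolation, as generic sketches rather than code from the slides; the function names and hyperparameter values are illustrative:

```python
import numpy as np

def momentum_step(theta, grad, velocity, alpha=0.1, mu=0.9):
    """One momentum update: v <- mu*v - alpha*grad; theta <- theta + v."""
    velocity = mu * velocity - alpha * grad
    return theta + velocity, velocity

def minibatches(X, Y, batch_size, rng):
    """Yield shuffled minibatches; each gives a gradient estimate from
    more than a single point but less than the full dataset."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        yield X[b], Y[b]
```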
Neural Networks vs Deep Learning?

DNNs are big neural networks:
- Depth: often ~5 layers (but some have many more)
  - Typically not fully connected!
- Width: hidden nodes per layer in the thousands
- Parameters: millions to billions

Algorithms / Computing:
- New algorithms (pre-training, layer-wise training, dropout, etc.)
- Heavy computing requirements (GPUs are essential)