Computing Gradient Hung-yi Lee 李宏毅
Introduction
Backpropagation: an efficient way to compute the gradient
Prerequisite
Backpropagation for feedforward net: http://speech.ee.ntu.edu.tw/~tkagk/courses/mlds_05_/lecture/DNN%0backprop.ecm.mp4/index.htm
Simple version: https://www.youtube.com/watch?v=ibjptrp5mce
Backpropagation through time for RNN: http://speech.ee.ntu.edu.tw/~tkagk/courses/mlds_05_/lecture/rnn%0training%0(v6).ecm.mp4/index.htm
Understanding backpropagation by computational graph
TensorFlow, Theano, CNTK, etc.
Computational Graph
Computational Graph
A language describing a function
Node: variable (scalar, vector, tensor, ...)
Edge: operation (simple function)
a = f(b, c): nodes b and c connect to node a through the edge f
Example: y = f(g(h(x)))
u = h(x), v = g(u), y = f(v): nodes x → u → v → y through edges h, g, f
Computational Graph
Example: e = (a+b) × (b+1)
c = a + b
d = b + 1
e = c × d
Graph: a and b feed into c (+), b feeds into d (+), c and d feed into e (×)
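As a small sketch (not part of the original slides), the graph above can be written as plain Python, one node per line; the values a = 3, b = 2 used on the later slides are plugged in here:

```python
# A minimal sketch of the example graph e = (a + b) * (b + 1),
# evaluated node by node exactly as the graph decomposes it.
def forward(a, b):
    c = a + b      # node c: addition
    d = b + 1      # node d: addition
    e = c * d      # node e: multiplication
    return e, c, d

if __name__ == "__main__":
    e, c, d = forward(3, 2)   # values used on the following slides
    print(c, d, e)            # 5 3 15
```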
Review: Chain Rule
Case 1: z = f(x), with y = g(x), z = h(y)
  x → y → z: dz/dx = (dz/dy)(dy/dx)
Case 2: z = f(s), with x = g(s), y = h(s), z = k(x, y)
  s → x → z and s → y → z: dz/ds = (∂z/∂x)(dx/ds) + (∂z/∂y)(dy/ds)
Computational Graph
Example: e = (a+b) × (b+1)
Compute ∂e/∂b = 1×(b+1) + 1×(a+b)
Sum over all paths from b to e:
∂e/∂b = (∂e/∂c)(∂c/∂b) + (∂e/∂d)(∂d/∂b)
∂e/∂c = d = b+1, ∂e/∂d = c = a+b
∂c/∂a = 1, ∂c/∂b = 1, ∂d/∂b = 1
Computational Graph
Example: e = (a+b) × (b+1)
Compute ∂e/∂b at a = 3, b = 2
∂e/∂c = d = 3, ∂e/∂d = c = 5
∂c/∂b = 1, ∂d/∂b = 1
Path through c contributes 3, path through d contributes 5
Computational Graph
Example: e = (a+b) × (b+1)
Compute ∂e/∂b at a = 3, b = 2
Start with 1 at node b, then move towards e:
∂c/∂b = 1, ∂d/∂b = 1, ∂e/∂c = d = 3, ∂e/∂d = c = 5
∂e/∂b = 1×3 + 1×5 = 8
Computational Graph
Example: e = (a+b) × (b+1)
Compute ∂e/∂a at a = 3, b = 2
Start with 1 at node a, then move towards e:
∂c/∂a = 1, ∂e/∂c = d = 3
∂e/∂a = 1×3 = 3
Computational Graph
Example: e = (a+b) × (b+1)
Reverse mode: what is the benefit?
Compute ∂e/∂a and ∂e/∂b at a = 3, b = 2
Start with 1 at node e, then move backwards:
∂e/∂c = d = 3, ∂e/∂d = c = 5
∂e/∂a = 3, ∂e/∂b = 3 + 5 = 8
One backward pass gives the derivative of e with respect to every node at once.
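A minimal Python sketch of reverse mode on this graph (the function name reverse_mode is just an illustrative choice): one forward pass stores the node values, one backward pass starting with ∂e/∂e = 1 yields both ∂e/∂a = 3 and ∂e/∂b = 8 at a = 3, b = 2.

```python
# Reverse-mode sketch for e = (a + b) * (b + 1):
# start with de/de = 1 at the output and push it backwards along every edge.
def reverse_mode(a, b):
    # forward pass: store the intermediate node values
    c = a + b
    d = b + 1
    e = c * d

    # backward pass: local edge derivative times the message from above
    de_de = 1.0
    de_dc = de_de * d                    # ∂e/∂c = d
    de_dd = de_de * c                    # ∂e/∂d = c
    de_da = de_dc * 1.0                  # ∂c/∂a = 1
    de_db = de_dc * 1.0 + de_dd * 1.0    # sum over the paths b→c→e and b→d→e
    return e, de_da, de_db

print(reverse_mode(3, 2))                # (15, 3.0, 8.0)
```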
Computational Graph
Parameter sharing: the same parameters appearing in different nodes
y = x e^(x²)
Graph: u = x × x, v = exp(u), y = x × v, so x appears in several nodes
∂u/∂x: the two edges from x into u contribute x each, 2x in total
∂v/∂u = e^u = e^(x²), ∂y/∂v = x
∂y/∂x along the direct edge into y: v = e^(x²)
Summing over all paths: ∂y/∂x = e^(x²) + x e^(x²) × 2x
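A hedged sketch of the parameter-sharing point: because x feeds several nodes, the contributions of every path are summed. The helper grad_shared and the finite-difference check are illustrative additions, not from the slides.

```python
import math

# Parameter-sharing sketch for y = x * exp(x**2): x feeds the product node y
# directly and also the node u = x * x, so ∂y/∂x sums the contribution of every path.
def grad_shared(x):
    u = x * x                  # node u (x appears twice here)
    v = math.exp(u)            # node v = exp(u)
    y = x * v                  # node y (x appears again here)

    dy_dv = x
    dv_du = v                  # d exp(u)/du = exp(u)
    du_dx = 2 * x              # the two edges x→u contribute x each
    dy_dx = v + dy_dv * dv_du * du_dx   # direct edge + path through u and v
    return y, dy_dx

x = 0.5
y, g = grad_shared(x)
# finite-difference check
eps = 1e-6
num = ((x + eps) * math.exp((x + eps) ** 2)
       - (x - eps) * math.exp((x - eps) ** 2)) / (2 * eps)
print(g, num)                  # the two numbers should agree closely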
Computational Graph for Feedforward Net
Review: Backpropagation
∂C/∂w_ij = (∂z_i/∂w_ij)(∂C/∂z_i)
Forward pass: compute ∂z_i/∂w_ij = a_j (the input a_j connected to weight w_ij)
  z^1 = W^1 x + b^1, a^1 = σ(z^1), z^2 = W^2 a^1 + b^2, ...
Backward pass: compute the error signal δ_i^l = ∂C/∂z_i^l
  δ^L = σ'(z^L) ⊙ ∇_y C
  δ^l = σ'(z^l) ⊙ (W^(l+1))^T δ^(l+1)
Review: Backpropagation
Backward pass: error signal
Layer L (output layer, we do not use softmax here): y = σ(z^L)
  δ^L = σ'(z^L) ⊙ ∇_y C, where ∇_y C = [∂C/∂y_1, ..., ∂C/∂y_n]^T
Layer L−1:
  δ^(L−1) = σ'(z^(L−1)) ⊙ (W^L)^T δ^L
∂C/∂w_ij = a_j δ_i
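To make the review concrete, here is a minimal NumPy sketch of the forward pass and the error-signal backward pass for a small two-layer net; the squared-error loss C = 0.5‖y − ŷ‖² and the layer sizes are illustrative assumptions, not fixed by the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Error-signal form of backprop for a two-layer net,
# assuming (as an illustration) a squared-error loss C = 0.5*||y - y_hat||^2.
rng = np.random.default_rng(0)
x, y_hat = rng.standard_normal(3), rng.standard_normal(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

# forward pass (also gives ∂z_i/∂w_ij = a_j, the inputs to each layer)
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y = sigmoid(z2)

# backward pass: δ^L = σ'(z^L) ⊙ ∇_y C, δ^l = σ'(z^l) ⊙ (W^{l+1})^T δ^{l+1}
grad_y = y - y_hat
delta2 = sigmoid(z2) * (1 - sigmoid(z2)) * grad_y
delta1 = sigmoid(z1) * (1 - sigmoid(z1)) * (W2.T @ delta2)

# ∂C/∂w_ij = a_j * δ_i  →  outer products
dW2 = np.outer(delta2, a1)
dW1 = np.outer(delta1, x)
db2, db1 = delta2, delta1
print(dW1.shape, dW2.shape)
```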
Feedforward Network
y = σ(W^L ⋯ σ(W^2 σ(W^1 x + b^1) + b^2) ⋯ + b^L)
z^1 = W^1 x + b^1, z^2 = W^2 a^1 + b^2
Computational graph: x → z^1 → (σ) → a^1 → z^2 → (σ) → y, with parameters W^1, b^1, W^2, b^2
Loss Function of Feedforward Network
C = L(y, ŷ)
z^1 = W^1 x + b^1, z^2 = W^2 a^1 + b^2
x → z^1 → (σ) → a^1 → z^2 → (σ) → y, then compare y with the target ŷ to get C
Gradient of Cost Function
C = L(y, ŷ)
To compute the gradient:
Compute the partial derivative on each edge (e.g., ∂a^1/∂z^1 = ?, a vector differentiated with respect to a vector)
Use reverse mode: the output C is always a scalar
Goal: ∂C/∂W^1, ∂C/∂b^1, ∂C/∂W^2, ∂C/∂b^2
Jacobian Matrix
y = f(x), x = [x1, x2, x3]^T, y = [y1, y2]^T
∂y/∂x = [∂y1/∂x1  ∂y1/∂x2  ∂y1/∂x3
         ∂y2/∂x1  ∂y2/∂x2  ∂y2/∂x3]
(number of rows = size of y, number of columns = size of x)
Example: y = f(x) = [x1 + x2·x3, 2x3]^T
∂y/∂x = [1  x3  x2
         0  0   2]
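A small NumPy check of this Jacobian, using the example as reconstructed above; one column is verified by finite differences.

```python
import numpy as np

# The Jacobian example from the slide: y = f(x) = [x1 + x2*x3, 2*x3].
def f(x):
    x1, x2, x3 = x
    return np.array([x1 + x2 * x3, 2 * x3])

def jacobian_f(x):
    x1, x2, x3 = x
    # rows indexed by y, columns indexed by x
    return np.array([[1.0, x3, x2],
                     [0.0, 0.0, 2.0]])

x = np.array([1.0, 2.0, 3.0])
J = jacobian_f(x)
# finite-difference check of the first column
eps = 1e-6
col0 = (f(x + np.array([eps, 0, 0])) - f(x)) / eps
print(J)
print(col0)   # ≈ first column of J
```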
Gradient of Cost Function
Last layer: ∂C/∂y
Cross entropy with a one-hot target ŷ (ŷ_r = 1): C = −log y_r
∂C/∂y_i = ?
i = r: ∂C/∂y_r = −1/y_r
i ≠ r: ∂C/∂y_i = 0
∂C/∂y = [0, ⋯, 0, −1/y_r, 0, ⋯, 0]^T
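A tiny NumPy illustration of this gradient; the output vector and the correct class index r = 1 are made up purely for the example.

```python
import numpy as np

# Gradient of the cross-entropy cost at the last layer, assuming a one-hot
# target with the correct class at index r: C = -log(y_r).
y = np.array([0.1, 0.7, 0.2])   # network output (already a distribution)
r = 1                           # index of the correct class (example value)
dC_dy = np.zeros_like(y)
dC_dy[r] = -1.0 / y[r]          # all other components are 0
print(dC_dy)
```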
Gradient of Cost Function
∂y/∂z^2 is a Jacobian matrix (square)
i-th row, j-th column:
i ≠ j: ∂y_i/∂z^2_j = 0
i = j: y_i = σ(z^2_i), so ∂y_i/∂z^2_i = σ'(z^2_i)
∂y/∂z^2 is a diagonal matrix
How about softmax?
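A sketch contrasting the two cases: the element-wise sigmoid gives a diagonal Jacobian, while softmax (the question on the slide) couples every output to every input; the standard result is ∂y_i/∂z_j = y_i(δ_ij − y_j), a dense matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# ∂y/∂z for an element-wise sigmoid is diagonal: y_i depends only on z_i.
z = np.array([0.5, -1.0, 2.0])
y = sigmoid(z)
J_sigmoid = np.diag(y * (1 - y))          # diag(σ'(z_i))

# For softmax every y_i depends on every z_j, so the Jacobian is dense:
# ∂y_i/∂z_j = y_i (δ_ij - y_j).
s = np.exp(z) / np.exp(z).sum()
J_softmax = np.diag(s) - np.outer(s, s)
print(J_sigmoid)
print(J_softmax)
```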
Gradient of Cost Function
z^2 = W^2 a^1 (the bias does not affect this Jacobian)
∂z^2/∂a^1 is a Jacobian matrix
z^2_i = w_i1 a_1 + w_i2 a_2 + ⋯ + w_in a_n
i-th row, j-th column: ∂z^2_i/∂a_j = W_ij
So ∂z^2/∂a^1 = W^2
Gradient of Cost Function
∂z^2/∂W^2 = ? Considering W^2 (m×n) as a vector of length m·n, ∂z^2/∂W^2 is an m × (m·n) Jacobian
z^2_i = w_i1 a_1 + w_i2 a_2 + ⋯ + w_in a_n
∂z^2_i/∂W_jk = ?
i ≠ j: ∂z^2_i/∂W_jk = 0
i = j: ∂z^2_i/∂W_ik = a_k
(the element W_jk sits at column (j−1)×n + k of the flattened vector)
(Considering ∂z^2/∂W^2 as a tensor makes things easier)
Gradient of Cost Function
∂z^2/∂W^2 = ? (m × (m·n) Jacobian, W^2 flattened row by row, W_jk at column (j−1)×n + k)
i ≠ j: ∂z^2_i/∂W_jk = 0
i = j: ∂z^2_i/∂W_ik = a_k
Block structure: row i contains (a^1)^T in the block j = i and zeros in the blocks j ≠ i
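A sketch of this block-structured Jacobian in NumPy. Since z = W a is linear in W, multiplying the Jacobian by the flattened W must reproduce z, which serves as a quick check; the sizes m = 2, n = 3 are arbitrary.

```python
import numpy as np

# Jacobian of z = W a with respect to W, flattening W row by row into a
# length m*n vector: ∂z_i/∂W_jk = a_k if i == j else 0.
rng = np.random.default_rng(0)
m, n = 2, 3
W = rng.standard_normal((m, n))
a = rng.standard_normal(n)

J = np.zeros((m, m * n))
for i in range(m):
    J[i, i * n:(i + 1) * n] = a      # block j = i of row i is a^T

# sanity check: J @ vec(W) reproduces z because z is linear in W
print(J @ W.reshape(-1))
print(W @ a)
```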
Gradient of Cost Function
C = L(y, ŷ)
∂C/∂W^2 = (∂C/∂y)(∂y/∂z^2)(∂z^2/∂W^2)
∂C/∂W^1 = (∂C/∂y)(∂y/∂z^2)(∂z^2/∂a^1)(∂a^1/∂z^1)(∂z^1/∂W^1)
Multiply the Jacobians along the path from C back to each parameter (reverse mode); element ij gives [∂C/∂W]_ij
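Putting the pieces together as a sketch: the Jacobians on the path C ← y ← z^2 ← W^2 are multiplied in reverse mode and compared with the error-signal form a_j δ_i from the review slides; the squared-error loss and the layer sizes are again illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reverse-mode product of the Jacobians on the path C ← y ← z2 ← W2,
# assuming (for illustration) the squared-error loss C = 0.5*||y - y_hat||^2.
rng = np.random.default_rng(0)
x, y_hat = rng.standard_normal(3), rng.standard_normal(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y = sigmoid(z2)

dC_dy   = y - y_hat                        # ∂C/∂y as a (row) vector
dy_dz2  = np.diag(y * (1 - y))             # diagonal Jacobian of the sigmoid
dz2_da1 = W2                               # Jacobian of z2 = W2 a1 + b2 w.r.t. a1

# ∂z2/∂W2 as an m x (m*n) Jacobian with a1^T in block i of row i
m, n = W2.shape
dz2_dW2 = np.zeros((m, m * n))
for i in range(m):
    dz2_dW2[i, i * n:(i + 1) * n] = a1

dC_dW2 = (dC_dy @ dy_dz2 @ dz2_dW2).reshape(m, n)
# same result as the error-signal form a_j * δ_i
print(np.allclose(dC_dW2, np.outer(dC_dy @ dy_dz2, a1)))   # True
```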
Question
Q: Only backward pass for computational graph?
Q: Do we get the same results from the two different approaches?
Computational Graph for Recurrent Network
Recurrent Network
y^t, h^t = f(x^t, h^(t−1); W^i, W^h, W^o)
h^t = σ(W^i x^t + W^h h^(t−1))
y^t = softmax(W^o h^t)
(biases are ignored here)
Unrolled: h^0 → f → h^1 → f → h^2 → f → h^3, with inputs x^1, x^2, x^3 and outputs y^1, y^2, y^3
Recurrent Network
y^t, h^t = f(x^t, h^(t−1); W^i, W^h, W^o)
h^t = σ(W^i x^t + W^h h^(t−1))
y^t = softmax(W^o h^t)
Computational graph of one step: x and h^0 (through W^i and W^h) feed into z (+), σ(z) gives h^1; W^o gives o, and softmax(o) gives y^1
Recurrent Network y t, h t = f x t, h t ; W i, W h, W o a t = σ W i x t + W h h t y t = softmax W o h t W o y y t = softmax W o h t h 0 h W h h t = σ W i x t + W h h t x W i
Recurrent Network
C = C^1 + C^2 + C^3
Unrolled for three steps: h^0 → h^1 → h^2 → h^3 through W^h; each step takes x^t through W^i and produces y^t through W^o; each y^t gives a cost C^t
Recurrent Network
C = C^1 + C^2 + C^3
Jacobians on the edges of the unrolled graph:
∂C^t/∂y^t, ∂y^t/∂h^t, ∂h^t/∂h^(t−1), ∂h^t/∂W^h, ∂h^t/∂W^i
Reverse mode: start from C and multiply the Jacobians backwards along every path to W^h, W^i, W^o
∂C^1/∂W^h = (∂C^1/∂y^1)(∂y^1/∂h^1)(∂h^1/∂W^h)
∂C^2/∂W^h = (∂C^2/∂y^2)(∂y^2/∂h^2)[(∂h^2/∂W^h) + (∂h^2/∂h^1)(∂h^1/∂W^h)]
∂C^3/∂W^h = (∂C^3/∂y^3)(∂y^3/∂h^3)[(∂h^3/∂W^h) + (∂h^3/∂h^2)(∂h^2/∂W^h) + (∂h^3/∂h^2)(∂h^2/∂h^1)(∂h^1/∂W^h)]
∂C/∂W^h = ∂C^1/∂W^h + ∂C^2/∂W^h + ∂C^3/∂W^h
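A hedged sketch of this summation over time steps (backpropagation through time). For brevity the output layer here is linear with a squared-error cost rather than softmax plus cross entropy; the point is that every step contributes one term to ∂C/∂W^h, and a finite-difference check confirms the accumulated gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# BPTT sketch for the unrolled graph: C = C1 + C2 + C3, h_t = σ(W_i x_t + W_h h_{t-1}).
# For brevity the output is linear (y_t = W_o h_t) with cost C_t = 0.5*||y_t - target_t||^2.
def forward(xs, targets, W_i, W_h, W_o, h0):
    hs, zs, C = [h0], [], 0.0
    for x_t, tgt in zip(xs, targets):
        z_t = W_i @ x_t + W_h @ hs[-1]
        h_t = sigmoid(z_t)
        y_t = W_o @ h_t
        zs.append(z_t); hs.append(h_t)
        C += 0.5 * np.sum((y_t - tgt) ** 2)
    return C, hs, zs

def grad_Wh(xs, targets, W_i, W_h, W_o, h0):
    C, hs, zs = forward(xs, targets, W_i, W_h, W_o, h0)
    dW_h = np.zeros_like(W_h)
    dh = np.zeros_like(h0)                      # ∂C/∂h_t arriving from later steps
    for t in reversed(range(len(xs))):
        y_t = W_o @ hs[t + 1]
        dh = dh + W_o.T @ (y_t - targets[t])    # direct path through C_t plus later steps
        dz = sigmoid(zs[t]) * (1 - sigmoid(zs[t])) * dh
        dW_h += np.outer(dz, hs[t])             # each time step contributes one term
        dh = W_h.T @ dz                         # message passed back to h_{t-1}
    return C, dW_h

rng = np.random.default_rng(0)
W_i, W_h, W_o = (rng.standard_normal((4, 3)),
                 rng.standard_normal((4, 4)),
                 rng.standard_normal((2, 4)))
xs = [rng.standard_normal(3) for _ in range(3)]
targets = [rng.standard_normal(2) for _ in range(3)]
h0 = np.zeros(4)

C, dW_h = grad_Wh(xs, targets, W_i, W_h, W_o, h0)
# finite-difference check of one entry of W_h
eps = 1e-6
W_h2 = W_h.copy(); W_h2[0, 1] += eps
C2, _ = grad_Wh(xs, targets, W_i, W_h2, W_o, h0)
print(dW_h[0, 1], (C2 - C) / eps)               # should agree closely
```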
Reference
Textbook: Deep Learning, Chapter 6.5
Calculus on Computational Graphs: Backpropagation
https://colah.github.io/posts/2015-08-Backprop/
On chain rule, computational graphs, and backpropagation
http://outlace.com/computational-graph/
Acknowledgement
Thanks to 翁丞世 for finding the errors in these slides.