Computing Gradient. Hung-yi Lee 李宏毅


Introduction
Backpropagation: an efficient way to compute the gradient.
Prerequisite:
Backpropagation for feedforward net: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
Simple version: https://www.youtube.com/watch?v=ibjptrp5mce
Backpropagation through time for RNN: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
This lecture: understanding backpropagation by computational graph, the approach used in TensorFlow, Theano, CNTK, etc.

Computational Graph

Computational Graph
A language for describing a function.
Node: a variable (scalar, vector, tensor, ...).
Edge: an operation (a simple function).
a = f(b, c): nodes b and c are connected to node a through the edge f.
Example: y = f(g(h(x))). Let u = h(x), v = g(u), y = f(v); the graph is x -> u -> v -> y with edges h, g, f.

Computational Graph
Example: e = (a + b) * (b + 1)
c = a + b, d = b + 1, e = c * d
Graph: a and b feed c through +, b feeds d through +, and c and d feed e through *. With a = 2 and b = 1: c = 3, d = 2, e = 6.

Review: Chain Rule
Case 1: z = f(x), with y = g(x) and z = h(y). The graph is x -> y -> z, and dz/dx = (dz/dy)(dy/dx).
Case 2: z = f(s), with x = g(s), y = h(s), and z = k(x, y). The graph is s -> x, s -> y, (x, y) -> z, and dz/ds = (∂z/∂x)(dx/ds) + (∂z/∂y)(dy/ds).

Computational Graph
Example: e = (a + b) * (b + 1). Compute ∂e/∂b.
∂e/∂b = (∂e/∂c)(∂c/∂b) + (∂e/∂d)(∂d/∂b) = 1 x (b + 1) + 1 x (a + b): sum over all paths from b to e.
Edge derivatives: ∂c/∂a = 1, ∂c/∂b = 1, ∂d/∂b = 1, ∂e/∂c = d = b + 1, ∂e/∂d = c = a + b.
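
A quick symbolic check of the sum-over-paths result (a sketch of my own, using sympy, which is not mentioned in the lecture):

import sympy as sp

a, b = sp.symbols('a b')
e = (a + b) * (b + 1)

# de/db should equal (b + 1) + (a + b), the sum over the two paths from b to e
print(sp.expand(sp.diff(e, b)))   # a + 2*b + 1
# de/da goes through the single path via c, so it is b + 1
print(sp.diff(e, a))              # b + 1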

Computational Graph
Example: e = (a + b) * (b + 1). Compute ∂e/∂b at a = 3, b = 2.
Edge derivatives: ∂c/∂a = 1, ∂c/∂b = 1, ∂d/∂b = 1, ∂e/∂c = d = 3, ∂e/∂d = c = 5.
Summing over the two paths from b to e: ∂e/∂b = 3 x 1 + 5 x 1 = 8.

Computational Graph
Example: e = (a + b) * (b + 1). Compute ∂e/∂b at a = 3, b = 2: start with 1 at node b and move towards e, multiplying the edge derivatives along each path and summing where paths merge. b contributes 1 to c and 1 to d, and at e this becomes 1 x 3 + 1 x 5 = 8, so ∂e/∂b = 8.

Computational Graph
Example: e = (a + b) * (b + 1). Compute ∂e/∂a at a = 3, b = 2: start with 1 at node a and move towards e. a contributes 1 to c (∂c/∂a = 1) and nothing to d, and at e this becomes 1 x 3 = 3, so ∂e/∂a = 3.

Computational Graph
Example: e = (a + b) * (b + 1). Reverse mode: compute ∂e/∂a and ∂e/∂b together at a = 3, b = 2. Start with 1 at node e and move backwards: c receives ∂e/∂c = d = 3 and d receives ∂e/∂d = c = 5; then a receives 3 x 1 = 3 and b receives 3 x 1 + 5 x 1 = 8. What is the benefit? A single backward pass yields the derivative of e with respect to every input.
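
The reverse-mode traversal above can be written out directly in a few lines. This is a minimal sketch of my own (not the lecture's code); names like de_dc are just mnemonics:

a, b = 3.0, 2.0

# forward pass
c = a + b            # 5
d = b + 1            # 3
e = c * d            # 15

# backward pass: start with 1 at e and push it back along each edge
de_de = 1.0
de_dc = de_de * d    # de/dc = d = 3
de_dd = de_de * c    # de/dd = c = 5
de_da = de_dc * 1.0                  # only path a -> c -> e
de_db = de_dc * 1.0 + de_dd * 1.0    # paths b -> c -> e and b -> d -> e

print(de_da, de_db)  # 3.0 8.0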

Computational Graph
Parameter sharing: the same parameter appears in different nodes.
y = x e^(x^2). Graph: the two copies of x multiply to give u = x^2, v = exp(u), and y = x * v.
∂y/∂x = ? Summing over all paths from x to y:
∂y/∂x = x * e^(x^2) * x + x * e^(x^2) * x + e^(x^2) = 2x^2 e^(x^2) + e^(x^2),
using the edge derivatives ∂u/∂x = x (for each of the two edges into u), ∂v/∂u = e^u = e^(x^2), ∂y/∂v = x, and ∂y/∂x = v = e^(x^2) on the direct edge.
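
As a sanity check on the parameter-sharing example (again a sketch of my own, not from the slides), a central finite difference agrees with the three-path sum e^(x^2) + 2x^2 e^(x^2):

import math

def y(x):
    return x * math.exp(x ** 2)

def dy_dx(x):                          # sum over the three paths from x to y
    return math.exp(x ** 2) + 2 * x ** 2 * math.exp(x ** 2)

x0, eps = 0.7, 1e-6
numeric = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)
print(abs(numeric - dy_dx(x0)) < 1e-5)   # True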

Computational Graph for Feedforward Net

Review: Backpropagation
∂C/∂w_ij = (∂z_i/∂w_ij)(∂C/∂z_i)
Forward pass: ∂z_i/∂w_ij = a_j, the input attached to weight w_ij (for the first layer a = x; z^1 = W^1 x + b^1, a^1 = σ(z^1), z^2 = W^2 a^1 + b^2, ...).
Backward pass: the error signal δ^l, with δ_i^l = ∂C/∂z_i^l, is computed layer by layer from the output:
δ^L = σ'(z^L) ⊙ ∂C/∂y
δ^{l-1} = σ'(z^{l-1}) ⊙ (W^l)^T δ^l

Review: Backpropagation
Layer L-1 has m neurons with pre-activations z^{L-1}; layer L has n neurons with pre-activations z^L; the output is y (we do not use softmax here) and the cost is C(y, ŷ).
∂C/∂w_ij = a_j δ_i^l with δ_i^l = ∂C/∂z_i^l.
Backward pass: δ^L = σ'(z^L) ⊙ ∂C/∂y, and the error signal is propagated back through (W^L)^T: δ^{L-1} = σ'(z^{L-1}) ⊙ (W^L)^T δ^L.

Feedforward Network
y = σ(W^L ... σ(W^2 σ(W^1 x + b^1) + b^2) ... + b^L)
As a computational graph (two layers shown): z^1 = W^1 x + b^1, a^1 = σ(z^1), z^2 = W^2 a^1 + b^2, y = σ(z^2); the graph is x -> z^1 -> a^1 -> z^2 -> y, with parameters W^1, b^1, W^2, b^2 feeding the corresponding nodes.
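
A minimal NumPy sketch of this two-layer forward pass; the shapes and weight values are my own choices for illustration:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -1.0])
W1, b1 = np.array([[1.0, -2.0], [-1.0, 1.0]]), np.array([1.0, 0.0])
W2, b2 = np.array([[1.0, 0.5], [0.0, -1.0]]), np.array([0.0, 0.5])

z1 = W1 @ x + b1      # x  -> z1
a1 = sigmoid(z1)      # z1 -> a1
z2 = W2 @ a1 + b2     # a1 -> z2
y = sigmoid(z2)       # z2 -> y
print(y)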

Loss Function of Feedforward Network
C = L(y, ŷ): the loss node compares the network output y with the target ŷ.
Graph: x -> z^1 = W^1 x + b^1 -> a^1 = σ(z^1) -> z^2 = W^2 a^1 + b^2 -> y = σ(z^2) -> C, with ŷ also feeding C and parameters W^1, b^1, W^2, b^2 feeding z^1 and z^2.

Gradient of Cost Function
C = L(y, ŷ). To compute the gradient: compute the partial derivative on each edge of the graph, then combine them using reverse mode (the output C is always a scalar).
We want ∂C/∂W^1, ∂C/∂b^1, ∂C/∂W^2, ∂C/∂b^2. The per-edge derivatives such as ∂a^1/∂z^1 = ? are derivatives of a vector with respect to a vector.

Jacobian Matrix
y = f(x), with x = [x_1 x_2 x_3]^T and y = [y_1 y_2]^T:
∂y/∂x = [ ∂y_1/∂x_1  ∂y_1/∂x_2  ∂y_1/∂x_3 ; ∂y_2/∂x_1  ∂y_2/∂x_2  ∂y_2/∂x_3 ]
(number of rows = size of y, number of columns = size of x).
Example: y = f(x) = [x_1 + x_2 x_3, 2 x_3]^T, so ∂y/∂x = [ 1  x_3  x_2 ; 0  0  2 ].
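
A numeric check of the example Jacobian (a sketch of my own; the test point is arbitrary):

import numpy as np

def f(x):
    return np.array([x[0] + x[1] * x[2], 2 * x[2]])

def jacobian(x):                        # the closed form from the slide
    return np.array([[1.0, x[2], x[1]],
                     [0.0, 0.0, 2.0]])

x, eps = np.array([1.0, 2.0, 3.0]), 1e-6
J_num = np.column_stack([
    (f(x + eps * np.eye(3)[j]) - f(x - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(J_num, jacobian(x)))   # True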

Gradient of Cost Function
Last layer: first compute ∂C/∂y for C = L(y, ŷ).
Cross entropy with a one-hot target ŷ (ŷ_r = 1, the r-th class is the correct one): C = -log y_r.
∂C/∂y_i = ? For i = r: ∂C/∂y_r = -1/y_r. For i ≠ r: ∂C/∂y_i = 0.
So ∂C/∂y is zero everywhere except for -1/y_r in position r.
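
In code, the cross-entropy gradient with a one-hot target has a single nonzero entry (a sketch of my own; the probabilities are made up):

import numpy as np

y = np.array([0.1, 0.7, 0.2])      # network output
r = 1                              # index of the correct class

dC_dy = np.zeros_like(y)
dC_dy[r] = -1.0 / y[r]             # dC/dy_r = -1/y_r, zero elsewhere
print(dC_dy)                       # [ 0.         -1.42857143  0.        ]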

Gradient of Cost Function
∂y/∂z^2 is a (square) Jacobian matrix. Entry (i, j): for i ≠ j, ∂y_i/∂z^2_j = 0; for i = j, y_i = σ(z^2_i), so ∂y_i/∂z^2_i = σ'(z^2_i). Hence ∂y/∂z^2 is a diagonal matrix. How about softmax? (Then each y_i depends on every z^2_j, and the Jacobian is no longer diagonal.)
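
To make the contrast concrete, a small sketch of my own (not from the slides) of both Jacobians: elementwise sigmoid gives a diagonal matrix, while softmax gives the dense matrix diag(y) - y y^T:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
z = np.array([0.5, -1.0, 2.0])

y_sig = sigmoid(z)
J_sigmoid = np.diag(y_sig * (1.0 - y_sig))               # zero off the diagonal

y_soft = np.exp(z) / np.exp(z).sum()
J_softmax = np.diag(y_soft) - np.outer(y_soft, y_soft)   # every entry can be nonzero

print(J_sigmoid.round(3))
print(J_softmax.round(3))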

Gradient of Cost Function
∂z^2/∂a^1 is a Jacobian matrix. From z^2 = W^2 a^1, z^2_i = w_i1 a^1_1 + w_i2 a^1_2 + ... + w_in a^1_n, so the entry in row i, column j is ∂z^2_i/∂a^1_j = W^2_ij. Hence ∂z^2/∂a^1 = W^2.

Gradient of Cost Function
∂z^2/∂W^2 = ? z^2 has m components and W^2 is m x n; considering W^2 as a vector of length m*n, ∂z^2/∂W^2 is an m x (m*n) Jacobian whose entry in row i and column (j-1)*n + k is ∂z^2_i/∂W^2_jk. Since z^2_i = w_i1 a^1_1 + w_i2 a^1_2 + ... + w_in a^1_n: for i ≠ j, ∂z^2_i/∂W^2_jk = 0; for i = j, ∂z^2_i/∂W^2_ik = a^1_k. (Considering ∂z^2/∂W^2 as a tensor makes things easier.)

z W = z i W jk =? m m i j: i = j: j = j = j = 3 a 0 i= a i= a m n mxn (j-)xn+k i z i W jk = 0 z i W ik = a k 0 a W z z y σ n z m W mxn z i W jk z = W a W y C = L y, y Τ C y y Considering W as a mxn vector C

Gradient of Cost Function
Putting everything together with reverse mode, starting from the scalar C:
∂C/∂W^2 = (∂C/∂y)(∂y/∂z^2)(∂z^2/∂W^2)
∂C/∂W^1 = (∂C/∂y)(∂y/∂z^2)(∂z^2/∂a^1)(∂a^1/∂z^1)(∂z^1/∂W^1)
The resulting row vector over the flattened W is rearranged back into the matrix [∂C/∂W_ij].
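
The chain of Jacobian products above collapses to outer products once the flattened W is reshaped back into a matrix. A minimal sketch (my own, with arbitrary sizes and the cross-entropy loss from the earlier slide), checked against a finite difference:

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
r = 0                                    # correct class

# forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y = sigmoid(z2)
C = -np.log(y[r])

# reverse mode
dC_dy = np.zeros_like(y); dC_dy[r] = -1.0 / y[r]
delta2 = dC_dy * y * (1 - y)             # (dC/dy)(dy/dz2): diagonal Jacobian
delta1 = (W2.T @ delta2) * a1 * (1 - a1) # through dz2/da1 = W2 and da1/dz1

dC_dW2 = np.outer(delta2, a1)            # dz2/dW2 contributes a1
dC_dW1 = np.outer(delta1, x)             # dz1/dW1 contributes x

# numeric check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Cp = -np.log(sigmoid(W2 @ sigmoid(W1p @ x + b1) + b2)[r])
print(np.isclose((Cp - C) / eps, dC_dW1[0, 0], atol=1e-4))   # True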

Question
Q: Only backward pass for computational graph?
Q: Do we get the same results from the two different approaches?

Computational Graph for Recurrent Network

Recurrent Network
y^t, h^t = f(x^t, h^{t-1}; W^i, W^h, W^o)
h^t = σ(W^i x^t + W^h h^{t-1})
y^t = softmax(W^o h^t)
(biases are ignored here)
Unrolled over time: h^0 -> f -> h^1 -> f -> h^2 -> f -> h^3, with inputs x^1, x^2, x^3 and outputs y^1, y^2, y^3.
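
A minimal sketch of this recurrence in NumPy (sizes, weights, and inputs are my own choices; biases are ignored as in the slide):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

Wi = np.array([[0.5, -0.3], [0.1, 0.8]])   # W^i
Wh = np.array([[0.2, 0.0], [-0.4, 0.3]])   # W^h
Wo = np.array([[1.0, -1.0], [0.5, 0.5]])   # W^o

xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
h = np.zeros(2)                            # h^0
for t, x in enumerate(xs, 1):
    h = sigmoid(Wi @ x + Wh @ h)           # h^t
    y = softmax(Wo @ h)                    # y^t
    print(t, y)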

Recurrent Network
One step of y^t, h^t = f(x^t, h^{t-1}; W^i, W^h, W^o) as a computational graph: W^i x^t and W^h h^{t-1} (x^t of size n, h of size m) are summed to give z^t, h^t = σ(z^t), and y^t = softmax(W^o h^t).

Recurrent Network
Abstracting the step: h^{t-1} and x^t enter through W^h and W^i to produce h^t = σ(W^i x^t + W^h h^{t-1}), and y^t = softmax(W^o h^t) leaves through W^o. This single block is what gets replicated over time.

Recurrent Network
C = C^1 + C^2 + C^3
Unrolled for three time steps: at each step t, x^t (through W^i) and h^{t-1} (through W^h) produce h^t, which produces y^t through W^o, and y^t gives the per-step cost C^t. The same W^i, W^h, W^o appear at every time step.

Recurrent Network
C = C^1 + C^2 + C^3
Edge derivatives on the unrolled graph: ∂C^t/∂y^t, ∂y^t/∂h^t, ∂h^t/∂h^{t-1}, ∂h^t/∂W^h, and ∂h^t/∂W^i for t = 1, 2, 3. Because W^h (and W^i, W^o) is shared across steps, reverse mode sums its gradient contributions over every time step.

∂C^1/∂W^h = (∂C^1/∂y^1)(∂y^1/∂h^1)(∂h^1/∂W^h)
∂C^2/∂W^h = (∂C^2/∂y^2)(∂y^2/∂h^2)[ ∂h^2/∂W^h + (∂h^2/∂h^1)(∂h^1/∂W^h) ]
∂C^3/∂W^h = (∂C^3/∂y^3)(∂y^3/∂h^3)[ ∂h^3/∂W^h + (∂h^3/∂h^2)( ∂h^2/∂W^h + (∂h^2/∂h^1)(∂h^1/∂W^h) ) ]
∂C/∂W^h = ∂C^1/∂W^h + ∂C^2/∂W^h + ∂C^3/∂W^h
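
A numeric sketch of these sums (my own code, with made-up sizes and targets; biases ignored as in the slides). It accumulates every ∂h^s/∂W^h term reached from each C^t and checks one entry against a finite difference:

import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

Wi, Wh, Wo = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
xs = [rng.normal(size=2) for _ in range(3)]
targets = [0, 1, 0]                        # correct class at each step

def total_cost(Wh):
    h, C = np.zeros(3), 0.0
    for x, r in zip(xs, targets):
        h = sigmoid(Wi @ x + Wh @ h)
        C += -np.log(softmax(Wo @ h)[r])
    return C

# forward pass, storing what the backward pass needs
h_prev, states, dC_dhs = np.zeros(3), [], []
for x, r in zip(xs, targets):
    h = sigmoid(Wi @ x + Wh @ h_prev)
    y = softmax(Wo @ h)
    dC_dy = np.zeros_like(y); dC_dy[r] = -1.0 / y[r]
    dC_dhs.append(Wo.T @ ((np.diag(y) - np.outer(y, y)) @ dC_dy))  # dC^t/dh^t
    states.append((h_prev, h))
    h_prev = h

# for each C^t, walk back through h^t, h^{t-1}, ... and collect dh^s/dWh terms
dC_dWh = np.zeros_like(Wh)
for t, grad_h in enumerate(dC_dhs):
    for s in range(t, -1, -1):
        h_prev, h = states[s]
        dz = grad_h * h * (1 - h)          # through sigma'
        dC_dWh += np.outer(dz, h_prev)     # dh^s/dWh contributes h^{s-1}
        grad_h = Wh.T @ dz                 # move back to h^{s-1}

eps = 1e-6
Wh_p = Wh.copy(); Wh_p[0, 0] += eps
print(np.isclose((total_cost(Wh_p) - total_cost(Wh)) / eps, dC_dWh[0, 0], atol=1e-4))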

Reference
Textbook: Deep Learning, Chapter 6.5
Calculus on Computational Graphs: Backpropagation, https://colah.github.io/posts/2015-08-Backprop/
On chain rule, computational graphs, and backpropagation, http://outlace.com/computational-graph/

Acknowledgement
Thanks to classmate 翁丞世 for finding the errors in the slides.