Tutorial on Multinomial Logistic Regression

Javier R. Movellan

June 19, 2013

1 General Model

The inputs are n-dimensional vectors and the outputs are c-dimensional vectors. The training sample consists of m input-output pairs. We organize the example inputs as an m × n matrix x; the corresponding example outputs are organized as an m × c matrix y. The models under consideration make predictions

    ŷ = h(u)
    u = xθ

where θ is an n × c weight matrix. Note that θ_{ij} can be seen as the connection strength from input variable X_i to output variable U_j.

Figure 1: There are m examples, n input variables and c output variables.

We evaluate the optimality of ŷ, and thus of θ, using the following criterion

    L(θ) = Φ(θ) + Γ(θ)

    Φ(θ) = Σ_{j=1}^m w_j ρ_j

where Γ(θ) is a prior penalty function of the parameters,

    ρ_j = f(y_j, ŷ_j)

measures the mismatch between the j-th rows of y and ŷ, and the terms w_1, …, w_m are positive weights that capture the relative importance of each of the m input-output pairs.
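
To fix ideas, the following short numpy sketch implements these definitions directly. It is an illustration rather than part of the original derivation: the link function h, the mismatch function f, and the penalty gamma are placeholders supplied by the caller, and all names are this example's own.

    import numpy as np

    # Minimal sketch of the general model (illustration only).
    # x: m x n inputs, y: m x c outputs, theta: n x c weights,
    # w: length-m vector of example weights.

    def predictions(x, theta, h):
        u = x @ theta          # u = x theta, an m x c matrix
        return h(u)            # yhat = h(u)

    def criterion(x, y, theta, w, h, f, gamma):
        """L(theta) = sum_j w_j * f(y_j, yhat_j) + Gamma(theta)."""
        yhat = predictions(x, theta, h)
        rho = np.array([f(y[j], yhat[j]) for j in range(x.shape[0])])
        return w @ rho + gamma(theta)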

Our goal is to find a matrix θ that minimizes L. A standard approach for minimizing L is the Newton-Raphson algorithm, which calls for computation of the gradient and the Hessian matrix.

1.1 Gradient Vector

Let θ_i ∈ ℝ^n be the i-th column of θ, i.e., the set of connection strengths from the n input units to the i-th output unit. The overall gradient of Φ with respect to θ is obtained by stacking the gradients with respect to the columns of θ:

    ∇_{vec[θ]} Φ = (∇_{θ_1} Φ; ∇_{θ_2} Φ; … ; ∇_{θ_c} Φ)

where vec[θ] is the vectorized version of θ (its columns stacked into a single vector; semicolons denote vertical stacking) and the gradient of a vector q with respect to a vector p is defined as follows

    ∇_p q = ∂q′/∂p

where a prime denotes transpose. Using the chain rule for gradients we get

    ∇_{θ_i} Φ = ∇_{θ_i} u_i ∇_{u_i} ρ ∇_ρ Φ

Note

    u_i = (xθ)_i = x θ_i

Moreover

    ∇_{θ_i} u_i = ∂u_i′/∂θ_i = ∂(θ_i′ x′)/∂θ_i = x′

and ∇_{u_i} ρ is the m × m diagonal matrix of partial derivatives

    ∇_{u_i} ρ = diag(∂ρ_1/∂u_{1i}, ∂ρ_2/∂u_{2i}, …, ∂ρ_m/∂u_{mi})

and

    ∇_ρ Φ = (w_1, …, w_m)′

Thus

    ∇_{θ_i} Φ = x′ diag(∂ρ_1/∂u_{1i}, …, ∂ρ_m/∂u_{mi}) (w_1, w_2, …, w_m)′

Equivalently

    ∇_{θ_i} Φ = x′ w Ψ_i

where

    w = diag(w_1, w_2, …, w_m)

and Ψ_i is the m-dimensional vector of partial derivatives

    Ψ_i = (∂ρ_1/∂u_{1i}, …, ∂ρ_m/∂u_{mi})′

so that

    ∇_{vec[θ]} Φ = (x′ w Ψ_1; … ; x′ w Ψ_c)
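
In code, this gradient can be assembled in a few lines. The sketch below is an illustration under the assumption that Ψ is available as an m × c matrix whose (j, i) entry is ∂ρ_j/∂u_{ji} and that w is given as the length-m vector of example weights; the function name is ours.

    import numpy as np

    # Sketch of the stacked gradient grad_{theta_i} Phi = x' w Psi_i.
    # Psi: m x c matrix with (j, i) entry d rho_j / d u_ji.
    # w:   length-m vector holding the diagonal of the weight matrix.

    def phi_gradient(x, w, Psi):
        G = x.T @ (w[:, None] * Psi)    # n x c matrix; column i equals x' w Psi_i
        return G.ravel(order="F")       # vec[] stacks the columns theta_1, ..., theta_c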

1.2 Hessian Matrix

The Hessian of a scalar v with respect to a vector u is defined as follows

    ∇²_u v = ∇_u (∇_u v)

Thus

    ∇²_{vec[θ]} Φ = ∇_{vec[θ]} (x′ w Ψ_1; … ; x′ w Ψ_c)
                  = (∇_{vec[θ]} (x′ w Ψ_1), …, ∇_{vec[θ]} (x′ w Ψ_c))

Moreover

    ∇_{vec[θ]} (x′ w Ψ_i) = ∇_{vec[θ]} Ψ_i ∇_{Ψ_i} (x′ w Ψ_i) = (∇_{vec[θ]} Ψ_i) w x

so that

    ∇²_{vec[θ]} Φ = ((∇_{vec[θ]} Ψ_1) w x, …, (∇_{vec[θ]} Ψ_c) w x)

i.e., ∇²_{vec[θ]} Φ is the c × c block matrix whose (i, j) block is

    (∇_{θ_i} Ψ_j) w x

Note

    ∇_{θ_i} Ψ_j = ∇_{θ_i} u_i ∇_{u_i} Ψ_j = x′ Λ_ij

where Λ_ij is the m × m diagonal matrix

    Λ_ij = ∇_{u_i} Ψ_j = diag(∂Ψ_{1j}/∂u_{1i}, ∂Ψ_{2j}/∂u_{2i}, …, ∂Ψ_{mj}/∂u_{mi})

Written out element by element, ∇_{θ_i} Ψ_j is the n × m matrix whose (l, k) entry is x_{kl} ∂Ψ_{kj}/∂u_{ki}, that is,

    ∇_{θ_i} Ψ_j = x′ Λ_ij

    Λ_ij = diag(∂Ψ_{1j}/∂u_{1i}, ∂Ψ_{2j}/∂u_{2i}, …, ∂Ψ_{mj}/∂u_{mi})

Therefore

    ∇²_{vec[θ]} Φ =
        [ x′ Λ_11 w x   …   x′ Λ_1c w x ]
        [      ⋮         ⋱        ⋮      ]
        [ x′ Λ_c1 w x   …   x′ Λ_cc w x ]

2 Quadratic Priors

Let

    Γ(θ) = (1/2) (vec[θ] − vec[µ])′ σ⁻¹ (vec[θ] − vec[µ])

then

    ∇_{vec[θ]} Γ(θ) = σ⁻¹ (vec[θ] − vec[µ])

    ∇²_{vec[θ]} Γ(θ) = ∇_{vec[θ]} ∇_{vec[θ]} Γ(θ) = σ⁻¹

3 Newton-Raphson Optimization

    vec[θ^(k+1)] = vec[θ^(k)] − (∇²_{vec[θ]} L(θ^(k)))⁻¹ ∇_{vec[θ]} L(θ^(k))

where θ^(k) is the value of θ after k iterations of the optimization algorithm.
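
A single update of this rule, combined with the quadratic prior of Section 2, can be sketched as follows. Here grad_phi and hess_phi stand for the data-term gradient and Hessian computed with the formulas of Section 1; the helper name is ours.

    import numpy as np

    # One Newton-Raphson step for L = Phi + Gamma with a quadratic prior.
    # grad_phi, hess_phi: gradient and Hessian of Phi at vec_theta (computed elsewhere).
    # sigma_inv, vec_mu:  inverse covariance and mean of the prior Gamma.

    def newton_step(vec_theta, grad_phi, hess_phi, sigma_inv, vec_mu):
        grad_L = grad_phi + sigma_inv @ (vec_theta - vec_mu)   # gradient of L
        hess_L = hess_phi + sigma_inv                          # Hessian of L
        return vec_theta - np.linalg.solve(hess_L, grad_L)     # vec[theta^(k+1)]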

4 Multivariate Linear Regression

In this case

    ŷ_i = u_i

    ρ_i = (1/2) Σ_k (ŷ_{ik} − y_{ik})²

so that

    Ψ_i = ŷ_i − y_i

    Λ_ii = I_m

where I_m is the m × m identity matrix (for i ≠ j, Ψ_j does not depend on u_i, so Λ_ij = 0). The nonzero blocks of the Hessian are therefore

    ∂²Φ/∂θ_i ∂θ_i = x′ w x

5 Multinomial Logistic Regression

Let

    ŷ_{ij} = f_j(u_i) = e^{u_{ij}} / Σ_{k=1}^c e^{u_{ik}}

and let ρ_i be the negative log-likelihood of the output vector y_i given the input vector x_i

    ρ_i = − Σ_{k=1}^c y_{ik} log ŷ_{ik}

Note

    ∂ŷ_{ij}/∂u_{ik} = ŷ_{ij} ∂log ŷ_{ij}/∂u_{ik} = ŷ_{ij} (δ_{jk} − ŷ_{ik})

so

    ∂ρ_i/∂u_{ik} = − Σ_{j=1}^c (y_{ij}/ŷ_{ij}) ∂ŷ_{ij}/∂u_{ik} = − Σ_{j=1}^c y_{ij} (δ_{jk} − ŷ_{ik}) = ŷ_{ik} − y_{ik}

where we used the fact that Σ_j y_{ij} = 1.
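
The identity ∂ρ_i/∂u_{ik} = ŷ_{ik} − y_{ik} is easy to confirm numerically. The following small check is ours, not part of the original text; it compares the analytic gradient against finite differences for a single example.

    import numpy as np

    # Softmax link and negative log-likelihood for one example (u_i, y_i of length c).

    def softmax(u_i):
        z = np.exp(u_i - u_i.max())          # subtract the max for numerical stability
        return z / z.sum()

    def rho(y_i, u_i):
        """Negative log-likelihood rho_i = -sum_k y_ik log yhat_ik."""
        return -np.dot(y_i, np.log(softmax(u_i)))

    # Finite-difference check that d rho_i / d u_i equals yhat_i - y_i.
    rng = np.random.default_rng(0)
    u_i = rng.normal(size=4)
    y_i = np.array([0.0, 1.0, 0.0, 0.0])
    eps = 1e-6
    numeric = np.array([(rho(y_i, u_i + eps * e) - rho(y_i, u_i - eps * e)) / (2 * eps)
                        for e in np.eye(4)])
    analytic = softmax(u_i) - y_i
    print(np.allclose(numeric, analytic, atol=1e-5))   # expected: True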

Moreover,

    Ψ_i = (ŷ_{1i} − y_{1i}, ŷ_{2i} − y_{2i}, …, ŷ_{mi} − y_{mi})′ = ŷ_i − y_i

    Λ_ij = diag(∂Ψ_{1j}/∂u_{1i}, ∂Ψ_{2j}/∂u_{2i}, …, ∂Ψ_{mj}/∂u_{mi})

with

    ∂Ψ_{kj}/∂u_{ki} = ∂ŷ_{kj}/∂u_{ki} = ŷ_{kj} (δ_{ij} − ŷ_{ki})

5.1 Relationship to Linear Regression

Note that the gradient in multinomial logistic regression is identical to the gradient in multivariate linear regression

    ∇_{θ_i} Φ = x′ w (ŷ_i − y_i)

The Hessians are also very similar. In linear regression the nonzero Hessian blocks are

    ∂²Φ/∂θ_i ∂θ_i = x′ w x

and in logistic regression

    ∂²Φ/∂θ_i ∂θ_j = x′ w Λ_ij x

which can be seen as the linear-regression form with the weight matrix w replaced by the matrix w Λ_ij.
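
Putting the pieces together, the sketch below (ours, with synthetic data and our own names) assembles the gradient x′ w Ψ_i and the Hessian blocks x′ w Λ_ij x and runs the Newton-Raphson iteration of Section 3 with Γ = 0.

    import numpy as np

    def softmax_rows(u):
        z = np.exp(u - u.max(axis=1, keepdims=True))        # numerically stable softmax
        return z / z.sum(axis=1, keepdims=True)

    def gradient_and_hessian(x, y, w, theta):
        n, c = theta.shape
        yhat = softmax_rows(x @ theta)                       # m x c predictions
        Psi = yhat - y                                       # column i is Psi_i = yhat_i - y_i
        grad = (x.T @ (w[:, None] * Psi)).ravel(order="F")   # stacks x' w Psi_i, i = 1..c
        hess = np.zeros((n * c, n * c))
        for i in range(c):
            for j in range(c):
                lam = yhat[:, j] * ((i == j) - yhat[:, i])   # diagonal of Lambda_ij
                hess[i*n:(i+1)*n, j*n:(j+1)*n] = x.T @ ((w * lam)[:, None] * x)
        return grad, hess

    # Synthetic data: rows of y are label probabilities (non-negative, summing to 1).
    rng = np.random.default_rng(1)
    m, n, c = 200, 3, 4
    x = rng.normal(size=(m, n))
    true_theta = 0.5 * rng.normal(size=(n, c))
    y = softmax_rows(x @ true_theta)
    w = np.ones(m)

    theta = np.zeros((n, c))
    for _ in range(20):
        grad, hess = gradient_and_hessian(x, y, w, theta)
        # The softmax Hessian is singular (adding the same vector to every column of
        # theta leaves yhat unchanged), so use a least-squares solve for the Newton step.
        step = np.linalg.lstsq(hess, grad, rcond=None)[0]
        theta = theta - step.reshape((n, c), order="F")

    # theta is identified only up to a per-row shift; compare after centering the rows.
    center = lambda t: t - t.mean(axis=1, keepdims=True)
    print(np.max(np.abs(center(theta) - center(true_theta))))   # should be near zero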

5.2 Summary

Training Data

Inputs: x ∈ ℝ^{m×n}. Rows are examples, columns are input variables.

Outputs: y ∈ ℝ^{m×c}. Rows are examples, columns are labels or label probabilities. All entries are non-negative and each row adds up to 1.

Example Weights: w ∈ ℝ^{m×m}. Diagonal matrix. Each diagonal term is positive and represents the relative importance of the corresponding example.

Gradients

    ∇_{vec[θ]} L = (∇_{θ_1} Φ; ∇_{θ_2} Φ; … ; ∇_{θ_c} Φ) + ∇_{vec[θ]} Γ

where

    ∇_{θ_i} Φ = x′ w Ψ_i

    Ψ_i = ŷ_i − y_i

and

    ∇_{vec[θ]} Γ = σ⁻¹ (vec[θ] − vec[µ])

Hessian Matrices

    ∇²_{vec[θ]} L = ∇²_{vec[θ]} Φ + ∇²_{vec[θ]} Γ

where ∇²_{vec[θ]} Φ is the c × c block matrix

    ∇²_{vec[θ]} Φ =
        [ ∂²Φ/∂θ_1∂θ_1   …   ∂²Φ/∂θ_1∂θ_c ]
        [       ⋮          ⋱        ⋮       ]
        [ ∂²Φ/∂θ_c∂θ_1   …   ∂²Φ/∂θ_c∂θ_c ]

with

    ∂²Φ/∂θ_i ∂θ_j = x′ w Λ_ij x

    Λ_ij = diag(∂Ψ_{1j}/∂u_{1i}, ∂Ψ_{2j}/∂u_{2i}, …, ∂Ψ_{mj}/∂u_{mi})

    ∂Ψ_{kj}/∂u_{ki} = ∂ŷ_{kj}/∂u_{ki} = ŷ_{kj} (δ_{ij} − ŷ_{ki})

and

    ∇²_{vec[θ]} Γ = σ⁻¹

Learning Rule (Newton-Raphson)

    vec[θ^(k+1)] = vec[θ^(k)] − (∇²_{vec[θ]} L(θ^(k)))⁻¹ ∇_{vec[θ]} L(θ^(k))

where θ^(k) is the value of θ after k iterations of the learning rule.

6 Appendix: L-p Priors

From a Bayesian point of view the Φ function can be seen as a negative log-likelihood term. At times it may be useful to add a negative log-prior term over θ, also known as a regularization term. Most prior terms of interest have the following form

    Γ = α Σ_{i=1}^n Σ_{j=1}^c (1/p) h(θ_{ij})^p

where α is a non-negative constant. If h is the absolute value function then Γ is proportional to the p-th power of the L-p norm of θ. Two popular options are p = 2 and p = 1. Note that the absolute value function is not differentiable, and thus when p is odd it is useful to approximate it with a differentiable function. Here we will use the log hyperbolic cosine function

    h(x) = (1/β) log(2 cosh(βx))

where β > 0 is a gain parameter and

    cosh(x) = (e^x + e^{−x}) / 2

Note that

    lim_{β→∞} h(x) = abs(x)

    h′(x) = dh(x)/dx = (e^{βx} − e^{−βx}) / (e^{βx} + e^{−βx}) = tanh(βx)

    lim_{β→∞} dh(x)/dx = sign(x)

    h″(x) = d²h(x)/dx² = d tanh(βx)/dx = β (1 − tanh²(βx))

6.1 Gradient and Hessian

    ∂Γ/∂θ_{ij} = α h(θ_{ij})^{p−1} h′(θ_{ij})
               = α ((1/β) log(2 cosh(βθ_{ij})))^{p−1} tanh(βθ_{ij})
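
Numerically, h, h′ and h″ can be evaluated in a stable way with a log-sum-exp. The sketch below (gain β = 4) is an illustration of ours, not part of the original text.

    import numpy as np

    # Smooth absolute-value surrogate h and its derivatives.
    # np.logaddexp(a, b) = log(e^a + e^b), so h(x) = (1/beta) log(2 cosh(beta x)).

    def h(x, beta=4.0):
        return np.logaddexp(beta * x, -beta * x) / beta

    def h_prime(x, beta=4.0):
        return np.tanh(beta * x)

    def h_second(x, beta=4.0):
        return beta * (1.0 - np.tanh(beta * x) ** 2)

    x = np.linspace(-4, 4, 9)
    print(np.round(h(x) - np.abs(x), 3))        # approximation error, largest near x = 0
    print(np.round(h_prime(x) - np.sign(x), 3)) # tanh(beta x) approaches sign(x)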

The nc × nc Hessian matrix of Γ is diagonal. Each term in the diagonal has the following form

    ∂²Γ/∂θ_{ij}² = α [ (p−1) h(θ_{ij})^{p−2} (h′(θ_{ij}))² + h(θ_{ij})^{p−1} h″(θ_{ij}) ]
                 = α [ (p−1) ((1/β) log(2 cosh(βθ_{ij})))^{p−2} tanh²(βθ_{ij})
                       + ((1/β) log(2 cosh(βθ_{ij})))^{p−1} β (1 − tanh²(βθ_{ij})) ]

Note that for p = 1 we get

    ∂Γ/∂θ_{ij} = α tanh(βθ_{ij})

    ∂²Γ/∂θ_{ij}² = α β (1 − tanh²(βθ_{ij}))

and for p = 1, in the limit β → ∞, we get

    ∂Γ/∂θ_{ij} = α θ_{ij} / |θ_{ij}| = α sign(θ_{ij})

    ∂²Γ/∂θ_{ij}² = 0   (for θ_{ij} ≠ 0)
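
These expressions translate directly into code. The sketch below (an illustration with arbitrary α, β and p of our choosing) returns the prior's gradient and the diagonal of its Hessian, entry by entry.

    import numpy as np

    # Gradient and diagonal Hessian of the L-p prior with the log-cosh surrogate h.

    def lp_prior_grad_hess(theta, alpha=0.1, beta=4.0, p=1):
        bt = beta * theta
        h = np.logaddexp(bt, -bt) / beta                # h(theta_ij) > 0 everywhere
        h1 = np.tanh(bt)                                # h'(theta_ij)
        h2 = beta * (1.0 - h1 ** 2)                     # h''(theta_ij)
        grad = alpha * h ** (p - 1) * h1
        hess_diag = alpha * ((p - 1) * h ** (p - 2) * h1 ** 2 + h ** (p - 1) * h2)
        return grad, hess_diag                          # both have the same shape as theta

    theta = np.array([[-2.0, 0.0], [0.5, 3.0]])
    g, d = lp_prior_grad_hess(theta, p=1)
    print(np.round(g, 3))   # approximately alpha * sign(theta) away from 0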

Figure 2: The absolute value function can be approximated by the log of a hyperbolic cosine function. The derivative of the absolute value is the sign function; it can be approximated by the derivative of the log cosh function, which is the hyperbolic tangent. (Top panel: h(x) versus abs(x); bottom panel: dh(x)/dx versus sign(x).) In this figure we used gain β = 4.