Tutorial on Multinomial Logistic Regression

Javier R. Movellan

June 19, 2013
1 General Model

The inputs are n-dimensional vectors and the outputs are c-dimensional vectors. The training sample consists of m input-output pairs. We organize the example inputs as an m × n matrix x. The corresponding example outputs are organized as an m × c matrix y. The models under consideration make predictions of the form

ŷ = h(u)   (1)

u = x θ   (2)

where θ is an n × c weight matrix. Note that θ_{ij} can be seen as the connection strength from input variable X_i to output variable U_j.

[Figure 1: There are m examples, n input variables and c output variables.]

We evaluate the optimality of ŷ, and thus of θ, using the following criterion

L(θ) = Φ(θ) + Γ(θ)   (3)

Φ(θ) = \sum_{j=1}^{m} w_j ρ_j   (4)

where Γ(θ) is a prior penalty function of the parameters and

ρ_j = f(y_j, ŷ_j)   (5)

measures the mismatch between the j-th rows of y and ŷ. The terms w_1, …, w_m are positive weights that capture the relative importance of each of the m input-output pairs.
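The following Python sketch (using NumPy; the function names predict and weighted_loss are ours, not from the text) shows how the quantities in equations (1)-(5) fit together. The link function h, the mismatch f, and the prior are left as arguments, to be specialized in later sections.

```python
import numpy as np

def predict(x, theta, h):
    """Compute u = x theta and the prediction yhat = h(u)."""
    u = x @ theta          # (2): m x c matrix of linear responses
    return h(u)            # (1): link function applied to the responses

def weighted_loss(x, y, theta, h, f, w, prior=lambda th: 0.0):
    """L(theta) = Phi(theta) + Gamma(theta), eqs. (3)-(5).

    x : m x n inputs, y : m x c outputs, w : length-m vector of example weights.
    """
    yhat = predict(x, theta, h)
    rho = np.array([f(y[j], yhat[j]) for j in range(x.shape[0])])  # (5)
    phi = np.sum(w * rho)                                          # (4)
    return phi + prior(theta)                                      # (3)
```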
Our goal is to find a matrix θ that minimizes L. A standard approach for minimizing L is the Newton-Raphson algorithm, which calls for the computation of the gradient and the Hessian matrix.

1.1 Gradient Vector

Let θ_i ∈ R^n be the i-th column of θ, i.e., the set of connection strengths from the n input units to the i-th output unit. The overall gradient of Φ with respect to θ is

∇_{vec[θ]} Φ = \begin{bmatrix} ∇_{θ_1} Φ \\ ∇_{θ_2} Φ \\ \vdots \\ ∇_{θ_c} Φ \end{bmatrix}   (6)

where vec[θ] is the vectorized version of θ, obtained by stacking its columns, and the gradient of a vector q with respect to a vector p is defined componentwise as

(∇_p q)_{ij} = ∂q_j / ∂p_i   (7)

Using the chain rule for gradients we get

∇_{θ_i} Φ = ∇_{θ_i} u_i ∇_{u_i} ρ ∇_ρ Φ   (8)

Note that

u_i = (xθ)_i = x θ_i   (9)

so

∇_{θ_i} u_i = x'   (10)

Moreover, since ρ_j depends on u_i only through u_{ji},

∇_{u_i} ρ = \begin{bmatrix} ∂ρ_1/∂u_{1i} & & 0 \\ & ∂ρ_2/∂u_{2i} & \\ & & \ddots \\ 0 & & ∂ρ_m/∂u_{mi} \end{bmatrix}   (11)
and

∇_ρ Φ = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}   (12)

Thus

∇_{θ_i} Φ = x' \begin{bmatrix} ∂ρ_1/∂u_{1i} & & 0 \\ & ∂ρ_2/∂u_{2i} & \\ & & \ddots \\ 0 & & ∂ρ_m/∂u_{mi} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix}   (13)

Equivalently,

∇_{θ_i} Φ = x' w Ψ_i   (14)

where

w = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & w_m \end{bmatrix}   (15)

and

Ψ_i = \begin{bmatrix} ∂ρ_1/∂u_{1i} \\ ∂ρ_2/∂u_{2i} \\ \vdots \\ ∂ρ_m/∂u_{mi} \end{bmatrix}   (16)

Stacking the column gradients we get

∇_{vec[θ]} Φ = \begin{bmatrix} x' w Ψ_1 \\ \vdots \\ x' w Ψ_c \end{bmatrix}   (17)
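As a concrete sketch of equations (14)-(17) (NumPy; the helper name stacked_gradient is ours), the gradient can be assembled column by column, given the m × c matrix Ψ whose i-th column is Ψ_i:

```python
import numpy as np

def stacked_gradient(x, W, Psi):
    """Gradient of Phi with respect to vec[theta], eqs. (14)-(17).

    x   : m x n input matrix
    W   : length-m vector of example weights (the diagonal of w)
    Psi : m x c matrix whose i-th column is Psi_i = d rho / d u_i
    Returns the (n*c)-vector stacking x' w Psi_1, ..., x' w Psi_c.
    """
    grad_blocks = [x.T @ (W * Psi[:, i]) for i in range(Psi.shape[1])]  # x' w Psi_i
    return np.concatenate(grad_blocks)
```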
1.2 Hessian Matrix

The Hessian of a scalar v with respect to a vector u is defined as

∇²_u v = ∇_u (∇_u v)'   (18)

Thus

∇²_{vec[θ]} Φ = ∇_{vec[θ]} \begin{bmatrix} x' w Ψ_1 \\ \vdots \\ x' w Ψ_c \end{bmatrix}'   (19)

= ( ∇_{vec[θ]} (x' w Ψ_1), …, ∇_{vec[θ]} (x' w Ψ_c) )   (20)

Using the chain rule,

∇_{vec[θ]} (x' w Ψ_i) = ∇_{vec[θ]} Ψ_i ∇_{Ψ_i} (x' w Ψ_i) = (∇_{vec[θ]} Ψ_i) w x   (21)

so

∇²_{vec[θ]} Φ = ( (∇_{vec[θ]} Ψ_1) w x, …, (∇_{vec[θ]} Ψ_c) w x )   (22)

= \left( \begin{bmatrix} ∇_{θ_1} Ψ_1 \\ \vdots \\ ∇_{θ_c} Ψ_1 \end{bmatrix} w x, …, \begin{bmatrix} ∇_{θ_1} Ψ_c \\ \vdots \\ ∇_{θ_c} Ψ_c \end{bmatrix} w x \right)   (23)

= \begin{bmatrix} (∇_{θ_1} Ψ_1) w x & \cdots & (∇_{θ_1} Ψ_c) w x \\ \vdots & & \vdots \\ (∇_{θ_c} Ψ_1) w x & \cdots & (∇_{θ_c} Ψ_c) w x \end{bmatrix}   (24)

Note that

∇_{θ_i} Ψ_j = ∇_{θ_i} u_i ∇_{u_i} Ψ_j = x' Λ_{ij}   (25)

where

Λ_{ij} = ∇_{u_i} Ψ_j = \begin{bmatrix} ∂Ψ_{1j}/∂u_{1i} & & 0 \\ & ∂Ψ_{2j}/∂u_{2i} & \\ & & \ddots \\ 0 & & ∂Ψ_{mj}/∂u_{mi} \end{bmatrix}   (26)
Writing this out entry by entry,

∇_{θ_i} Ψ_j = \begin{bmatrix} x_{11} ∂Ψ_{1j}/∂u_{1i} & \cdots & x_{m1} ∂Ψ_{mj}/∂u_{mi} \\ \vdots & & \vdots \\ x_{1n} ∂Ψ_{1j}/∂u_{1i} & \cdots & x_{mn} ∂Ψ_{mj}/∂u_{mi} \end{bmatrix} = x' Λ_{ij}   (27)

where, as above,

Λ_{ij} = \begin{bmatrix} ∂Ψ_{1j}/∂u_{1i} & & 0 \\ & ∂Ψ_{2j}/∂u_{2i} & \\ & & \ddots \\ 0 & & ∂Ψ_{mj}/∂u_{mi} \end{bmatrix}   (28)

Therefore

∇²_{vec[θ]} Φ = \begin{bmatrix} x' Λ_{11} w x & \cdots & x' Λ_{1c} w x \\ \vdots & & \vdots \\ x' Λ_{c1} w x & \cdots & x' Λ_{cc} w x \end{bmatrix}   (29)

2 Quadratic Priors

Let

Γ(θ) = \frac{1}{2} (vec[θ] - vec[µ])' σ^{-1} (vec[θ] - vec[µ])   (30)

Then

∇_{vec[θ]} Γ(θ) = σ^{-1} (vec[θ] - vec[µ])   (31)

∇²_{vec[θ]} Γ = ∇_{vec[θ]} (∇_{vec[θ]} Γ)' = σ^{-1}   (32)

3 Newton-Raphson Optimization

vec[θ^{(k+1)}] = vec[θ^{(k)}] - ( ∇²_{vec[θ]} L(θ^{(k)}) )^{-1} ∇_{vec[θ]} L(θ^{(k)})   (33)

where θ^{(k)} is the value of θ after k iterations of the optimization algorithm.
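The following sketch (NumPy; the names full_hessian and newton_step are ours) assembles the blocked Hessian of equation (29), adds the quadratic-prior terms of equations (31)-(32), and performs one Newton-Raphson update as in equation (33):

```python
import numpy as np

def full_hessian(x, W, Lambda):
    """Hessian of Phi, eq. (29).

    x      : m x n input matrix
    W      : length-m vector of example weights (the diagonal of w)
    Lambda : c x c x m array; Lambda[i, j] holds the diagonal of Lambda_ij
    Returns the (n*c) x (n*c) block matrix with blocks x' Lambda_ij w x.
    """
    c, _, m = Lambda.shape
    n = x.shape[1]
    H = np.zeros((n * c, n * c))
    for i in range(c):
        for j in range(c):
            block = x.T @ ((Lambda[i, j] * W)[:, None] * x)   # x' Lambda_ij w x
            H[i * n:(i + 1) * n, j * n:(j + 1) * n] = block
    return H

def newton_step(theta, grad_phi, H_phi, mu, sigma_inv):
    """One Newton-Raphson update with a quadratic prior, eqs. (31)-(33).

    theta, mu : n x c matrices; grad_phi : (n*c)-vector; H_phi, sigma_inv : (n*c) x (n*c).
    """
    n, c = theta.shape
    v = theta.reshape(-1, order="F")                                 # vec[theta], column stacking
    grad = grad_phi + sigma_inv @ (v - mu.reshape(-1, order="F"))    # (31)
    H = H_phi + sigma_inv                                            # (32)
    v_new = v - np.linalg.solve(H, grad)                             # (33)
    return v_new.reshape(n, c, order="F")
```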
4 Multivariate Linear Regression

In this case

ŷ_i = u_i   (34)

ρ_i = \frac{1}{2} \sum_k (ŷ_{ik} - y_{ik})²   (35)

so that

Ψ_i = ŷ_i - y_i   (36)

Λ_{ij} = δ_{ij} I_m   (37)

where I_m is the m × m identity matrix, and the Hessian blocks become

∇_{θ_i} ∇_{θ_j} Φ = δ_{ij} x' w x   (38)

5 Multinomial Logistic Regression

Let

ŷ_{ij} = h_j(u_i) = \frac{e^{u_{ij}}}{\sum_{k=1}^{c} e^{u_{ik}}}   (39)

and let ρ_i be the negative log-likelihood of the output vector y_i given the input vector x_i

ρ_i = - \sum_{k=1}^{c} y_{ik} \log ŷ_{ik}   (40)

Note that

\frac{∂ŷ_{ij}}{∂u_{ik}} = ŷ_{ij} \frac{∂ \log ŷ_{ij}}{∂u_{ik}} = ŷ_{ij} (δ_{jk} - ŷ_{ik})   (41)

so

\frac{∂ρ_i}{∂u_{ik}} = - \sum_{j=1}^{c} \frac{y_{ij}}{ŷ_{ij}} \frac{∂ŷ_{ij}}{∂u_{ik}} = - \sum_{j=1}^{c} y_{ij} (δ_{jk} - ŷ_{ik}) = ŷ_{ik} - y_{ik}   (42)

where we used the fact that \sum_j y_{ij} = 1.
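A small sketch of equations (39)-(42) in NumPy (the function names are ours; the subtraction of the row maximum is a standard numerical-stability detail not discussed in the text): the softmax predictions, the per-example negative log-likelihood, and the matrix whose columns are Ψ_i = ŷ_i - y_i.

```python
import numpy as np

def softmax_rows(u):
    """yhat_ij = exp(u_ij) / sum_k exp(u_ik), eq. (39), row by row."""
    z = u - u.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def neg_log_likelihood(y, yhat):
    """rho_i = -sum_k y_ik log yhat_ik, eq. (40), for every example."""
    return -np.sum(y * np.log(yhat), axis=1)

def psi(y, yhat):
    """Matrix with columns Psi_i = yhat_i - y_i, eq. (42)."""
    return yhat - y
```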
Collecting these derivatives into vector form,

Ψ_i = \begin{bmatrix} ŷ_{1i} - y_{1i} \\ ŷ_{2i} - y_{2i} \\ \vdots \\ ŷ_{mi} - y_{mi} \end{bmatrix} = ŷ_i - y_i   (43)

and

Λ_{ij} = \begin{bmatrix} ∂Ψ_{1j}/∂u_{1i} & & 0 \\ & ∂Ψ_{2j}/∂u_{2i} & \\ & & \ddots \\ 0 & & ∂Ψ_{mj}/∂u_{mi} \end{bmatrix}   (44)

with

\frac{∂Ψ_{kj}}{∂u_{ki}} = \frac{∂ŷ_{kj}}{∂u_{ki}} = ŷ_{kj} (δ_{ij} - ŷ_{ki})   (45)

5.1 Relationship to Linear Regression

Note that the gradient in multinomial logistic regression has the same form as the gradient in multivariate linear regression:

∇_{θ_i} Φ = x' w (ŷ_i - y_i)   (46)

The Hessians are also very similar. In linear regression

∇_{θ_i} ∇_{θ_j} Φ = δ_{ij} x' w x   (47)

and in multinomial logistic regression

∇_{θ_i} ∇_{θ_j} Φ = x' Λ_{ij} w x   (48)

which has the same form as the linear regression case, with the weight matrix w substituted by the matrix Λ_{ij} w.
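Putting the pieces together, here is a self-contained sketch (NumPy; the function name fit_multinomial_logistic and the small ridge term are ours, not from the tutorial — the quadratic prior of Section 2 with µ = 0 plays the same role) of a Newton-Raphson fit for multinomial logistic regression, using the gradient blocks x' w Ψ_i and the Hessian blocks x' Λ_{ij} w x with the diagonal entries of equation (45):

```python
import numpy as np

def fit_multinomial_logistic(x, y, W=None, n_iter=20, ridge=1e-6):
    """Newton-Raphson fit of multinomial logistic regression.

    x : m x n inputs, y : m x c label-probability rows, W : length-m example weights.
    ridge adds a small quadratic prior (sigma^{-1} = ridge * I) to keep the Hessian
    invertible; the softmax Hessian alone is singular because adding the same vector
    to every column of theta leaves the predictions unchanged.
    """
    m, n = x.shape
    c = y.shape[1]
    W = np.ones(m) if W is None else W
    theta = np.zeros((n, c))
    for _ in range(n_iter):
        u = x @ theta
        e = np.exp(u - u.max(axis=1, keepdims=True))
        yhat = e / e.sum(axis=1, keepdims=True)                           # eq. (39)
        Psi = yhat - y                                                    # eq. (43)
        grad = np.concatenate([x.T @ (W * Psi[:, i]) for i in range(c)])  # eq. (14)
        H = np.zeros((n * c, n * c))
        for i in range(c):
            for j in range(c):
                lam = yhat[:, j] * ((i == j) - yhat[:, i])                # eq. (45)
                H[i*n:(i+1)*n, j*n:(j+1)*n] = x.T @ ((lam * W)[:, None] * x)
        H += ridge * np.eye(n * c)                    # quadratic prior, Hessian term
        grad += ridge * theta.reshape(-1, order="F")  # quadratic prior, gradient term
        v = theta.reshape(-1, order="F") - np.linalg.solve(H, grad)       # eq. (33)
        theta = v.reshape(n, c, order="F")
    return theta
```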
5.2 Summary

Training Data

Inputs: x ∈ R^{m×n}. Rows are examples, columns are input variables.

Outputs: y ∈ R^{m×c}. Rows are examples, columns are labels or label probabilities. All entries are non-negative and each row adds up to 1.

Example Weights: w ∈ R^{m×m}. Diagonal matrix. Each diagonal term is positive and represents the relative importance of the corresponding example.

Gradients

∇_{vec[θ]} L = \begin{bmatrix} ∇_{θ_1} Φ \\ ∇_{θ_2} Φ \\ \vdots \\ ∇_{θ_c} Φ \end{bmatrix} + ∇_{vec[θ]} Γ   (49)

where

∇_{θ_i} Φ = x' w Ψ_i   (50)

Ψ_i = \begin{bmatrix} ∂ρ_1/∂u_{1i} \\ \vdots \\ ∂ρ_m/∂u_{mi} \end{bmatrix} = ŷ_i - y_i   (51)

and

∇_{vec[θ]} Γ = σ^{-1} (vec[θ] - vec[µ])   (52)

Hessian Matrices

∇²_{vec[θ]} L = ∇²_{vec[θ]} Φ + ∇²_{vec[θ]} Γ   (53)
2 vec[θ]φ = θ 1 Φ θ 1 Φ θ c θ 1 θ 1 θ c Φ θ c Φ θ c (54) Φ θ j = x Λ ij wx (55) Λ ij = Ψ 1j 0 0 Ψ 0 2j 0 0 0 Ψ mj (56) and Ψ kj u ki = ŷ kj u ki = ŷ kj (δ ij ŷ kj ) (57) and 2 vec[θ]γ = σ 1 (58) Learning Rule (Newton-Raphson) vec[θ (k+1) ] = vec[θ (k) ] ( 2 vec[θ]l(θ (k)) 1 vec[θ] L(θ (k) ) (59) θ (k) is the value of θ after k iterations of the learning rule 6 Appendix: L-p priors From a Bayesian point of view the Φ function can be seen as a log-likelihood term At times it may be useful to add a log prior term over θ, also known as a regularization term Most prior terms of interest have the following form 10
Γ = α \sum_{i=1}^{n} \sum_{j=1}^{c} \frac{1}{p} \big( h(θ_{ij}) \big)^p   (60)

where α is a non-negative constant. If h is the absolute value function then Γ is proportional to the p-th power of the L-p norm of θ. Two popular options are p = 2 and p = 1. Note that the absolute value function is not differentiable, and thus when p is odd it is useful to approximate it with a differentiable function. Here we will use the log hyperbolic cosine function

h(x) = \frac{1}{β} \log(2 \cosh(βx))   (61)

where β > 0 is a gain parameter. Note that

\cosh(x) = \frac{e^x + e^{-x}}{2}   (62)

\lim_{β → ∞} h(x) = |x|   (63)

h'(x) = \frac{dh(x)}{dx} = \frac{e^{βx} - e^{-βx}}{e^{βx} + e^{-βx}} = \tanh(βx)   (64)

\lim_{β → ∞} \frac{dh(x)}{dx} = \mathrm{sign}(x)   (65)

h''(x) = \frac{d²h(x)}{dx²} = \frac{d \tanh(βx)}{dx} = β \big( 1 - \tanh²(βx) \big)   (66)

6.1 Gradient and Hessian

\frac{∂Γ}{∂θ_{ij}} = α \, h(θ_{ij})^{p-1} h'(θ_{ij})   (67)

= α \left( \frac{1}{β} \log(2 \cosh(βθ_{ij})) \right)^{p-1} \tanh(βθ_{ij})   (68)
The nc × nc Hessian matrix of Γ is diagonal. Each term in the diagonal has the following form

\frac{∂²Γ}{∂θ_{ij}²} = α \Big( (p-1) h(θ_{ij})^{p-2} \big(h'(θ_{ij})\big)² + h(θ_{ij})^{p-1} h''(θ_{ij}) \Big)   (69)

= α \Big( (p-1) \Big( \tfrac{1}{β} \log(2\cosh(βθ_{ij})) \Big)^{p-2} \tanh²(βθ_{ij})   (70)

+ \Big( \tfrac{1}{β} \log(2\cosh(βθ_{ij})) \Big)^{p-1} β \big(1 - \tanh²(βθ_{ij})\big) \Big)   (71)

Note that for p = 1 we get

\frac{∂Γ}{∂θ_{ij}} = α \tanh(βθ_{ij})   (72)

\frac{∂²Γ}{∂θ_{ij}²} = α β \big(1 - \tanh²(βθ_{ij})\big)   (73)

and for p = 2 in the limit β → ∞ we recover the quadratic prior of Section 2, with

\frac{∂Γ}{∂θ_{ij}} = α θ_{ij}   (75)

\frac{∂²Γ}{∂θ_{ij}²} = α   (76)
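A short sketch of the log-cosh prior (NumPy; the function name lp_prior_terms is ours): it returns Γ together with the elementwise gradient and diagonal Hessian terms of equations (60), (67) and (69).

```python
import numpy as np

def lp_prior_terms(theta, alpha, beta, p):
    """Log-cosh L-p prior: value, gradient, and diagonal Hessian terms.

    Implements Gamma = alpha * sum (1/p) h(theta_ij)^p with
    h(x) = (1/beta) log(2 cosh(beta x)), eqs. (60)-(61), (67), (69).
    Returns (Gamma, dGamma/dtheta, d^2Gamma/dtheta^2), the last two elementwise.
    """
    h = np.log(2.0 * np.cosh(beta * theta)) / beta      # (61)
    h1 = np.tanh(beta * theta)                          # h'(x) = tanh(beta x), (64)
    h2 = beta * (1.0 - h1 ** 2)                         # h''(x), (66)
    gamma = alpha * np.sum(h ** p) / p                  # (60)
    grad = alpha * h ** (p - 1) * h1                    # (67)
    hess = alpha * ((p - 1) * h ** (p - 2) * h1 ** 2 + h ** (p - 1) * h2)  # (69)
    return gamma, grad, hess
```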
[Figure 2 (two panels): top, h(x) (logcosh) vs. abs; bottom, dh(x)/dx (tanh) vs. sign. The absolute value function can be approximated by the log of a hyperbolic cosine. The derivative of the absolute value is the sign function; it can be approximated by the derivative of the log cosh function, which is the hyperbolic tangent. In this figure we used gain β = 4.]
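As a quick numerical check of the approximation plotted in Figure 2 (a sketch with β = 4; the printed values are approximate), h(x) stays close to |x| and tanh(βx) close to sign(x) away from x = 0:

```python
import numpy as np

# Check that h(x) = (1/beta) log(2 cosh(beta x)) approximates |x|,
# and that its derivative tanh(beta x) approximates sign(x), for beta = 4.
beta = 4.0
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
h = np.log(2.0 * np.cosh(beta * x)) / beta
print(np.abs(x))           # [2.   0.5  0.   0.5  2.  ]
print(np.round(h, 3))      # approx [2.     0.505  0.173  0.505  2.   ]
print(np.tanh(beta * x))   # approx [-1.    -0.964  0.     0.964  1.  ]
```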