Lecture 34: Ridge regression and LASSO

Transcript

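1 Lecture 34: Ridge regression and LASSO

Ridge regression

Consider the linear model $X = Z\beta + \varepsilon$, $\beta \in \mathbb{R}^p$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$. The LSE is obtained from the minimization problem

$$\min_{\beta \in \mathbb{R}^p} \|X - Z\beta\|^2 \qquad (1)$$

A type of shrinkage estimator is obtained through (1) by adding a penalty on $\|\beta\|^2$, i.e.,

$$\min_{\beta \in \mathbb{R}^p} \left( \|X - Z\beta\|^2 + \lambda \|\beta\|^2 \right) \qquad (2)$$

where $\lambda \ge 0$ is a constant controlling the penalization. The gradient of the objective is

$$\frac{\partial}{\partial \beta} \left( \|X - Z\beta\|^2 + \lambda \|\beta\|^2 \right) = -2 Z^\tau (X - Z\beta) + 2\lambda\beta,$$

which gives the solution to (2) as

$$\hat\beta_\lambda = (Z^\tau Z + \lambda I_p)^{-1} Z^\tau X$$

This estimator is better than the LSE when $Z^\tau Z$ is nearly singular.

UW-Madison (Statistics) Stat 709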
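The closed form for the ridge estimator can be tried on a toy problem. The sketch below is only illustrative: the helper name `ridge_2d`, the two-column design, and the data are assumptions made here, not part of the lecture. It uses the explicit inverse of a 2 x 2 matrix so that it runs with the Python standard library alone.

```python
def ridge_2d(Z, X, lam):
    """Ridge estimate (Z'Z + lam*I)^{-1} Z'X for a design with two columns."""
    # Entries of the symmetric 2x2 matrix Z'Z + lam*I and of the vector Z'X.
    a = sum(z[0] * z[0] for z in Z) + lam
    b = sum(z[0] * z[1] for z in Z)
    d = sum(z[1] * z[1] for z in Z) + lam
    u = sum(z[0] * x for z, x in zip(Z, X))
    v = sum(z[1] * x for z, x in zip(Z, X))
    det = a * d - b * b
    # Explicit inverse of the 2x2 matrix applied to (u, v).
    return [(d * u - b * v) / det, (a * v - b * u) / det]

# Two nearly collinear columns, so Z'Z is close to singular (hypothetical data).
Z = [[1.0, 1.0], [1.0, 1.001], [1.0, 0.999], [1.0, 1.0]]
X = [2.1, 1.9, 2.0, 2.2]

beta_lse = ridge_2d(Z, X, 0.0)     # lam = 0 gives the LSE
beta_ridge = ridge_2d(Z, X, 0.1)   # a small penalty stabilizes the solution
print(beta_lse, beta_ridge)
```

With these numbers the LSE coefficients are roughly 52 and -50, while the ridge coefficients are both close to 1, which illustrates the remark about nearly singular $Z^\tau Z$.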

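2 This gives a class of estimators called ridge regression estimators; in particular, $\lambda = 0$ gives the LSE.

Bias and covariance matrix

$$E(\hat\beta_\lambda) = (Z^\tau Z + \lambda I_p)^{-1} Z^\tau E(X) = (Z^\tau Z + \lambda I_p)^{-1} Z^\tau Z \beta$$

The bias of $\hat\beta_\lambda$ is then

$$b(\beta) = (Z^\tau Z + \lambda I_p)^{-1} Z^\tau Z \beta - \beta = -\lambda (Z^\tau Z + \lambda I_p)^{-1} \beta$$

The bias is not 0, but converges to 0 as $\lambda \to 0$.

$$\begin{aligned}
\mathrm{Var}(\hat\beta_\lambda) &= (Z^\tau Z + \lambda I_p)^{-1} Z^\tau \mathrm{Var}(X) Z (Z^\tau Z + \lambda I_p)^{-1} \\
&= \sigma^2 (Z^\tau Z + \lambda I_p)^{-1} Z^\tau Z (Z^\tau Z + \lambda I_p)^{-1} \\
&= \sigma^2 (Z^\tau Z + \lambda I_p)^{-1} - \sigma^2 \lambda (Z^\tau Z + \lambda I_p)^{-2}
\end{aligned}$$

It can be seen that the variance converges to 0 if $\lambda \to \infty$ and to $\sigma^2 (Z^\tau Z)^{-1}$ if $\lambda \to 0$.

Combining the bias and variance, we get

$$\begin{aligned}
E\|\hat\beta_\lambda - \beta\|^2 &= \|b(\beta)\|^2 + E\|\hat\beta_\lambda - E(\hat\beta_\lambda)\|^2 \\
&= \lambda^2 \|(Z^\tau Z + \lambda I_p)^{-1} \beta\|^2 + \sigma^2 \, \mathrm{tr}\!\left[ Z^\tau Z (Z^\tau Z + \lambda I_p)^{-2} \right]
\end{aligned}$$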
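The bias and variance trade-off in the last formula is easy to evaluate when $Z^\tau Z$ is diagonal. The sketch below assumes $Z^\tau Z = \mathrm{diag}(d_1, d_2)$, in which case the squared bias is $\sum_j \lambda^2 \beta_j^2 / (d_j + \lambda)^2$ and the trace term is $\sum_j \sigma^2 d_j / (d_j + \lambda)^2$. The function name `ridge_mse_diag` and all numbers are hypothetical.

```python
def ridge_mse_diag(d, beta, sigma2, lam):
    """E||beta_hat_lam - beta||^2 when Z'Z = diag(d): squared bias plus variance."""
    bias2 = sum((lam * b / (dj + lam)) ** 2 for dj, b in zip(d, beta))
    var = sum(sigma2 * dj / (dj + lam) ** 2 for dj in d)
    return bias2 + var

d = [100.0, 0.01]      # second eigenvalue tiny: Z'Z nearly singular
beta = [1.0, 1.0]
sigma2 = 1.0

mse_lse = ridge_mse_diag(d, beta, sigma2, 0.0)     # equals sigma^2 * tr[(Z'Z)^{-1}]
mse_ridge = ridge_mse_diag(d, beta, sigma2, 0.5)   # lam = 0.5
print(mse_lse, mse_ridge)
```

Here the LSE has mean squared error 100.01, driven almost entirely by the small eigenvalue, while the ridge estimator with $\lambda = 0.5$ has mean squared error close to 1.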

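3 Theorem (Comparison between ridge regression and LSE)

Let $\hat\beta = \hat\beta_0$ be the LSE.

(i) If $0 < \lambda < 2\sigma^2 / \|\beta\|^2$, then $E\|\hat\beta_\lambda - \beta\|^2 < E\|\hat\beta - \beta\|^2$.

(ii) Assume that the smallest eigenvalue of $Z^\tau Z$ is $O(n)$. If $\lambda > 2\sigma^2 / \|\beta\|^2$, then $E\|\hat\beta_\lambda - \beta\|^2 > E\|\hat\beta - \beta\|^2$ for sufficiently large $n$; if $\lambda = 2\sigma^2 / \|\beta\|^2$, then $E\|\hat\beta_\lambda - \beta\|^2 = E\|\hat\beta - \beta\|^2 + O(n^{-3})$.

Proof. Let

$$A = \sigma^2 (Z^\tau Z)^{-1} - \sigma^2 (Z^\tau Z + \lambda I_p)^{-1} Z^\tau Z (Z^\tau Z + \lambda I_p)^{-1} - \lambda^2 (Z^\tau Z + \lambda I_p)^{-1} \beta\beta^\tau (Z^\tau Z + \lambda I_p)^{-1}$$

Then

$$\begin{aligned}
(Z^\tau Z + \lambda I_p) A (Z^\tau Z + \lambda I_p) &= \sigma^2 (Z^\tau Z + \lambda I_p)(Z^\tau Z)^{-1}(Z^\tau Z + \lambda I_p) - \sigma^2 Z^\tau Z - \lambda^2 \beta\beta^\tau \\
&= 2\lambda\sigma^2 I_p + \lambda^2 \sigma^2 (Z^\tau Z)^{-1} - \lambda^2 \beta\beta^\tau
\end{aligned}$$

Hence

$$A = (Z^\tau Z + \lambda I_p)^{-1} \left[ 2\lambda\sigma^2 I_p - \lambda^2 \beta\beta^\tau + \lambda^2 \sigma^2 (Z^\tau Z)^{-1} \right] (Z^\tau Z + \lambda I_p)^{-1}$$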

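4 Assume $\lambda > 0$ and $\beta \ne 0$. Then

$$A > \lambda^2 \sigma^2 (Z^\tau Z + \lambda I_p)^{-1} (Z^\tau Z)^{-1} (Z^\tau Z + \lambda I_p)^{-1}$$

if and only if $2\sigma^2 \lambda^{-1} I_p - \beta\beta^\tau > 0$, which is equivalent to $\lambda < 2\sigma^2 / \|\beta\|^2$.

This can be shown as follows. If $2\sigma^2 \lambda^{-1} I_p - \beta\beta^\tau > 0$, then

$$0 < \beta^\tau (2\sigma^2 \lambda^{-1} I_p - \beta\beta^\tau) \beta = 2\sigma^2 \lambda^{-1} \|\beta\|^2 - \|\beta\|^4,$$

which means $\lambda < 2\sigma^2 / \|\beta\|^2$. On the other hand, if $\lambda < 2\sigma^2 / \|\beta\|^2$, then

$$(2\sigma^2 \lambda^{-1} I_p - \beta\beta^\tau) / \|\beta\|^2 = (2\sigma^2 \lambda^{-1} \|\beta\|^{-2} - 1) I_p + \left( I_p - \beta\beta^\tau / \|\beta\|^2 \right) > 0,$$

because $I_p - \beta\beta^\tau / \|\beta\|^2$ is a projection matrix whose eigenvalues are either 0 or 1.

Since $\mathrm{Var}(\hat\beta) = \sigma^2 (Z^\tau Z)^{-1}$, using the formula for $\mathrm{Var}(\hat\beta_\lambda)$ we obtain

$$E\|\hat\beta - \beta\|^2 - E\|\hat\beta_\lambda - \beta\|^2 = \mathrm{tr}(A)$$

Thus, (i) follows, and (ii) follows from

$$\lambda^2 \sigma^2 (Z^\tau Z + \lambda I_p)^{-1} (Z^\tau Z)^{-1} (Z^\tau Z + \lambda I_p)^{-1} \le \lambda^2 \sigma^2 (Z^\tau Z)^{-3}$$

The ridge regression is better if the noise to signal ratio is large.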
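A scalar case makes the threshold $2\sigma^2 / \|\beta\|^2$ concrete. The worked equation below assumes $p = 1$ and writes $s = Z^\tau Z > 0$; the symbol $s$ is introduced here only for illustration. It specializes the bias and variance formulas and the quantity $\mathrm{tr}(A)$ to this case.

```latex
\begin{aligned}
E(\hat\beta - \beta)^2 &= \frac{\sigma^2}{s}, \qquad
E(\hat\beta_\lambda - \beta)^2 = \frac{\lambda^2 \beta^2 + \sigma^2 s}{(s + \lambda)^2}, \\
E(\hat\beta - \beta)^2 - E(\hat\beta_\lambda - \beta)^2
&= \frac{\sigma^2 (s + \lambda)^2 - s \lambda^2 \beta^2 - \sigma^2 s^2}{s (s + \lambda)^2}
 = \frac{\lambda \left[ s \, (2\sigma^2 - \lambda \beta^2) + \lambda \sigma^2 \right]}{s (s + \lambda)^2}.
\end{aligned}
```

If $0 < \lambda < 2\sigma^2 / \beta^2$, the bracket is positive and ridge wins for every $s$. If $\lambda > 2\sigma^2 / \beta^2$, the bracket becomes negative once $s > \lambda \sigma^2 / (\lambda \beta^2 - 2\sigma^2)$, so the LSE wins when $s$ grows with $n$. At $\lambda = 2\sigma^2 / \beta^2$ the difference is $\lambda^2 \sigma^2 / [s (s + \lambda)^2]$, which is of order $s^{-3}$.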

5 High dimension problems

The dimension of $\beta$ in a linear model is $p$ ($Z$ is $n \times p$).
In traditional applications: $p \ll n$; $p$ is fixed when $n \to \infty$.
In modern applications, $p$ is large; $p = p_n$ increases as $n$ increases.
$p = O(n^k)$: polynomial-type divergence rate.
$p = O(e^{n^\nu})$: ultra-high dimension, where $\nu$ is a constant $< 1$.

Non-identifiability of $\beta$

$r = r_n$: rank of $Z$. The dimension of $\mathcal{R}(Z)$ is $r_n$.
If $p > n$, then $\beta$ is not identifiable. This means that there are $\beta$ and $\tilde\beta$ with $\beta \ne \tilde\beta$ but $Z\beta = Z\tilde\beta$, so that the data generated under the models with $\beta$ and $\tilde\beta$ are the same.
It is not possible to estimate all components of $\beta$ consistently; we are not able to estimate something out of the data range.
We can estimate consistently some useful functions of $\beta$.
We can estimate the projection of $\beta$ onto $\mathcal{R}(Z)$.
Estimation of the projection is sufficient for many problems.

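6 Projection

Singular value decomposition: $Z = P D Q^\tau$
$P$: $n \times r$ matrix with $P^\tau P = I_r$ (identity matrix)
$Q$: $p \times r$ matrix with $Q^\tau Q = I_r$
$D$: $r \times r$ diagonal matrix of full rank

Projection of $\beta$ onto $\mathcal{R}(Z)$ (the span of the columns of $Q$, i.e., the row space of $Z$):

$$\theta = Z^\tau (Z Z^\tau)^{-} Z \beta = Q Q^\tau \beta \in \mathcal{R}(Z)$$

$$Z\theta = P D Q^\tau (Q Q^\tau \beta) = P D Q^\tau \beta = Z\beta$$

The model $X = Z\beta + \varepsilon$ is the same as $X = Z\theta + \varepsilon$.

Ridge regression estimator of $\theta$

$$\hat\theta = (Z^\tau Z + h_n I_p)^{-1} Z^\tau X, \qquad h_n > 0$$

We only need to invert an $n \times n$ matrix, because

$$(Z^\tau Z + h_n I_p)^{-1} Z^\tau = Z^\tau (Z Z^\tau + h_n I_n)^{-1}$$

$\hat\theta$ is always in $\mathcal{R}(Z)$.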
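The inversion identity can be checked numerically on a small case with $p > n$. The sketch below is a plain-Python illustration under assumed data ($n = 2$, $p = 4$, $h_n = 0.5$). The helpers `matmul`, `transpose`, `add_ridge`, and `solve` are hypothetical names written for this example, not library functions.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add_ridge(A, h):
    """Return A + h*I for a square matrix A."""
    m = len(A)
    return [[A[i][j] + (h if i == j else 0.0) for j in range(m)] for i in range(m)]

def solve(A, B):
    """Solve A W = B by Gauss-Jordan elimination with partial pivoting."""
    m = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(m):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[m:] for row in M]

Z = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 1.0]]      # n = 2 rows, p = 4 columns
X = [[1.0], [2.0]]              # response as a column vector
h = 0.5
Zt = transpose(Z)

# p x p route: (Z'Z + h I_p)^{-1} Z'X
theta_p = solve(add_ridge(matmul(Zt, Z), h), matmul(Zt, X))
# n x n route: Z'(ZZ' + h I_n)^{-1} X
theta_n = matmul(Zt, solve(add_ridge(matmul(Z, Zt), h), X))

# (6, -3, 1, 0) satisfies Z v = 0, so it is orthogonal to the row space of Z.
null_vec = [6.0, -3.0, 1.0, 0.0]
inner = sum(t[0] * v for t, v in zip(theta_n, null_vec))
print(theta_p, theta_n, inner)
```

Both routes give the same vector up to rounding, and its inner product with a null-space vector of $Z$ is numerically zero, in line with the statement that $\hat\theta$ lies in $\mathcal{R}(Z)$.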

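7 Derivation of the bias of ridge regression estimator

Let $\Gamma = (Q \;\; Q_\perp)$, $Q^\tau Q_\perp = 0$, $\Gamma\Gamma^\tau = \Gamma^\tau\Gamma = I_p$. Then

$$\begin{aligned}
\mathrm{bias}(\hat\theta) &= E(\hat\theta) - \theta \\
&= (Z^\tau Z + h_n I_p)^{-1} Z^\tau Z \theta - \theta \\
&= -(h_n^{-1} Z^\tau Z + I_p)^{-1} \theta \\
&= -\Gamma (h_n^{-1} \Gamma^\tau Z^\tau Z \Gamma + I_p)^{-1} \Gamma^\tau Q Q^\tau \theta \\
&= -(Q \;\; Q_\perp) \begin{pmatrix} (h_n^{-1} D^2 + I_r)^{-1} & 0 \\ 0 & I_{p-r} \end{pmatrix} \begin{pmatrix} Q^\tau \\ Q_\perp^\tau \end{pmatrix} Q Q^\tau \theta \\
&= -\left( Q (h_n^{-1} D^2 + I_r)^{-1} \;\; Q_\perp \right) \begin{pmatrix} Q^\tau \theta \\ 0 \end{pmatrix} \\
&= -Q (h_n^{-1} D^2 + I_r)^{-1} Q^\tau \theta \\
&= -Q \begin{pmatrix} (1 + d_{1n}/h_n)^{-1} & & \\ & \ddots & \\ & & (1 + d_{rn}/h_n)^{-1} \end{pmatrix} Q^\tau \theta
\end{aligned}$$

where $d_{jn} > 0$ is the $j$th diagonal element of $D^2$ (eigenvalue of $Z^\tau Z$).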

8 Thus,

$$\|\mathrm{bias}(\hat\theta)\|^2 = \theta^\tau Q (h_n^{-1} D^2 + I_r)^{-2} Q^\tau \theta \le \max_{1 \le j \le r} (1 + d_{jn}/h_n)^{-2} \, \theta^\tau Q Q^\tau \theta \le h_n^2 d_{1n}^{-2} \|\theta\|^2$$

For the variance,

$$\mathrm{Var}(\hat\theta) = \sigma^2 (Z^\tau Z + h_n I_p)^{-1} Z^\tau Z (Z^\tau Z + h_n I_p)^{-1} \le \sigma^2 h_n^{-1} I_p$$

Theorem (Consistency of $\hat\theta$)

Assume that
(C1) $d_{1n}^{-1} = O(n^{-\eta})$, $\eta \le 1$ and $\eta$ does not depend on $n$.
(C2) $\|\theta\| = O(n^{\tau})$, $\tau < \eta$ and $\tau$ does not depend on $n$.

Then
(i) As $n \to \infty$, $E(l^\tau \hat\theta - l^\tau \theta)^2 = O(h_n^{-1}) + O(h_n^2 n^{-2(\eta - \tau)})$ uniformly over $p$-dimensional deterministic vectors $l$ with $\|l\| = 1$.
(ii) $n^{-1} E\|Z\hat\theta - Z\theta\|^2 = O(r_n n^{-1}) + O(h_n^2 n^{-(1 + \eta - 2\tau)})$.

9 Remarks

(C2) means that $\theta$ is sparse; without any condition, the order of $\|\theta\|^2$ could be $p$.
$\|\theta\| \le \|\beta\|$, so that (C2) holds if $\beta$ is sparse.
For any fixed $l$, $l^\tau \hat\theta$ is consistent for $l^\tau \theta$ if $h_n \to \infty$ and $h_n n^{-(\eta - \tau)} \to 0$.
$\hat\theta$ is not sparse even if $\theta$ is sparse.
Typically $r_n / n \not\to 0$, so $\hat\theta$ is not $L_2$-consistent.
The reason (ii) is interesting is that

$$n^{-1} E\|Z\hat\theta - Z\theta\|^2 = n^{-1} E\|X^* - Z\hat\theta\|^2 - \sigma^2,$$

where $X^*$ is an independent copy of $X$ and $n^{-1} E\|X^* - Z\hat\theta\|^2$ is the average prediction mean squared error.

Problem of the ridge regression estimator

When $p < n$ and $\theta = \beta$ has many zero components, the ridge regression estimator does not have any zero components, although it has many small components.

10 LASSO estimator

Consider the linear model $X = Z\beta + \varepsilon$, $\beta \in \mathbb{R}^p$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$. The ridge regression estimator of $\beta$ is obtained from

$$\min_{\beta \in \mathbb{R}^p} \left( \|X - Z\beta\|^2 + \lambda \|\beta\|^2 \right)$$

If we change the $L_2$ penalty $\|\beta\|^2$ to the $L_1$ penalty $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$, where $\beta_j$ is the $j$th component of $\beta$, then the LASSO estimator is obtained from

$$\min_{\beta \in \mathbb{R}^p} \left( \|X - Z\beta\|^2 + \lambda \|\beta\|_1 \right)$$

Difference between LASSO and ridge regression:
The LASSO estimator does not have an explicit form.
When a component of $\beta$ is 0, its LASSO estimator may be 0, but its ridge regression estimator is never 0.
The minimization for LASSO is still for a convex objective function, but the objective function is not always differentiable.
Although LASSO is still defined when $p > n$, it is usually used in the case where $p < n$.
If $p < n$, $Z$ can be deterministic or random.
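Since the LASSO estimator has no explicit form in general, it is computed numerically. One common approach, not discussed in the lecture and shown here only as a sketch, is cyclic coordinate descent with soft thresholding. For the objective $\|X - Z\beta\|^2 + \lambda\|\beta\|_1$ the one-dimensional update thresholds at $\lambda/2$ because the squared loss carries no factor $1/2$. The function names and the data are assumptions made for this example.

```python
def soft_threshold(a, t):
    """Shrink a toward 0 by t; returns exactly 0 when |a| <= t."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def lasso_cd(Z, X, lam, n_sweeps=200):
    """Cyclic coordinate descent for min ||X - Z b||^2 + lam * ||b||_1."""
    n, p = len(Z), len(Z[0])
    beta = [0.0] * p
    col_sq = [sum(Z[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: remove the fit of every column except column j.
            r = [X[i] - sum(Z[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(Z[i][j] * r[i] for i in range(n))
            # Minimizing col_sq*b^2 - 2*rho*b + lam*|b| thresholds rho at lam/2.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Hypothetical data; the two columns of Z happen to be orthogonal.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
X = [3.0, 0.1, 3.2, 2.9]
beta_lasso = lasso_cd(Z, X, lam=2.0)
print(beta_lasso)
```

In this example the second coefficient is set exactly to 0 while the first is shrunk from the LSE value of about 3.03 to about 2.7, which is the behavior that distinguishes LASSO from ridge regression.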

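11 Notation

In this part, $X$ denotes the $n \times p$ design matrix (written $Z$ above) and $Y$ the response.

$A$ = the set of indices of non-zero coefficients of $\beta$
$\beta = (\beta_A, \beta_{A^c})$, $\dim(\beta_A) = q$, $\dim(\beta_{A^c}) = p - q$; $X = (X_A, X_{A^c})$

$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \frac{1}{n} \begin{pmatrix} X_A^\tau X_A & X_A^\tau X_{A^c} \\ X_{A^c}^\tau X_A & X_{A^c}^\tau X_{A^c} \end{pmatrix} = \frac{1}{n} X^\tau X$$

Consistency

The LASSO estimator $\hat\beta$ of $\beta$ is strongly sign consistent if there exists $\lambda = \lambda_n$ not depending on $Y$ or $X$ such that

$$\lim_{n \to \infty} P\left( \mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta) \right) = 1,$$

which implies variable selection consistency (since $\mathrm{sign}(a) = 0$ if $a = 0$),

$$\lim_{n \to \infty} P\left( \hat{A} = A \right) = 1,$$

where $\hat{A}$ is the index set of nonzero components of $\hat\beta$.

Strong Irrepresentable Condition (SIC)

There exists a vector $\eta$ whose components are positive such that

$$\left| C_{21} C_{11}^{-1} \mathrm{sign}(\beta_A) \right| \le \mathbf{1} - \eta \quad \text{component-wise},$$

where $|a| = (|a_1|, |a_2|, \ldots)$ for $a = (a_1, a_2, \ldots)$ and $\mathbf{1}$ is the vector of ones.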
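The SIC can be checked directly once $C$ and the signs of $\beta_A$ are given. The sketch below assumes that $A$ consists of the first $q$ indices and uses two hypothetical $3 \times 3$ matrices $C$ with $q = 2$. The helper names `solve_vec` and `sic_values` are made up for this illustration.

```python
def solve_vec(A, b):
    """Solve A w = b for a small nonsingular matrix A (Gauss-Jordan, partial pivoting)."""
    m = len(A)
    M = [list(A[i]) + [b[i]] for i in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(m):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[m] for row in M]

def sic_values(C, q, signs):
    """Components of |C21 C11^{-1} sign(beta_A)| when A is the first q indices."""
    C11 = [row[:q] for row in C[:q]]
    C21 = [row[:q] for row in C[q:]]
    w = solve_vec(C11, signs)
    return [abs(sum(c * wi for c, wi in zip(row, w))) for row in C21]

# Hypothetical correlation-type matrices: the third variable is irrelevant.
C_ok = [[1.0, 0.2, 0.5], [0.2, 1.0, 0.5], [0.5, 0.5, 1.0]]
C_bad = [[1.0, 0.2, 0.7], [0.2, 1.0, 0.7], [0.7, 0.7, 1.0]]
signs = [1.0, 1.0]

vals_ok = sic_values(C_ok, 2, signs)     # about 0.83: SIC holds with eta about 0.17
vals_bad = sic_values(C_bad, 2, signs)   # about 1.17: SIC fails
print(vals_ok, vals_bad)
```

The second matrix shows how the condition breaks down when an irrelevant variable is too strongly correlated with the relevant ones.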

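12 Critical Lemma

Under the SIC,

$$P\left( \mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta) \right) \ge P(A_n \cap B_n),$$

where

$$A_n = \left\{ \left| C_{11}^{-1} W_A \right| < \sqrt{n}\, |\beta_A| - \frac{\lambda_n}{2\sqrt{n}} \left| C_{11}^{-1} \mathrm{sign}(\beta_A) \right| \right\}$$

$$B_n = \left\{ \left| C_{21} C_{11}^{-1} W_A - W_{A^c} \right| \le \frac{\lambda_n}{2\sqrt{n}}\, \eta \right\}$$

$$W_A = \frac{1}{\sqrt{n}} X_A^\tau \varepsilon, \qquad W_{A^c} = \frac{1}{\sqrt{n}} X_{A^c}^\tau \varepsilon$$

Karush-Kuhn-Tucker (KKT) condition

$\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)$ is the LASSO estimator if and only if

$$\left. \frac{\partial \|Y - X\beta\|^2}{\partial \beta_j} \right|_{\beta_j = \hat\beta_j} \begin{cases} = -\lambda\, \mathrm{sign}(\hat\beta_j) & \text{if } \hat\beta_j \ne 0 \\ \text{is bounded by } \lambda \text{ in absolute value} & \text{if } \hat\beta_j = 0 \end{cases}$$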
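The KKT condition gives a direct way to test whether a candidate vector is a LASSO solution. The sketch below is an illustration under assumed data: the function name `kkt_holds` and the tolerance are hypothetical. The candidate $(2.7, 0)$ comes from the fact that the two design columns are orthogonal with squared norm 3, so each coordinate is a soft-thresholded value, $(9.1 - \lambda/2)/3 = 2.7$ and $0$ for $\lambda = 2$.

```python
def kkt_holds(X, Y, beta, lam, tol=1e-8):
    """Check the LASSO KKT condition for the objective ||Y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(beta)
    resid = [Y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for j in range(p):
        # j-th partial derivative of ||Y - X b||^2 evaluated at b = beta.
        g = -2.0 * sum(X[i][j] * resid[i] for i in range(n))
        if beta[j] != 0.0:
            sign = 1.0 if beta[j] > 0 else -1.0
            if abs(g + lam * sign) > tol:      # need g = -lam * sign(beta_j)
                return False
        elif abs(g) > lam + tol:               # need |g| <= lam
            return False
    return True

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]   # design, orthogonal columns
Y = [3.0, 0.1, 3.2, 2.9]                                  # response
lam = 2.0

ok_lasso = kkt_holds(X, Y, [2.7, 0.0], lam)               # soft-thresholded solution
ok_lse = kkt_holds(X, Y, [9.1 / 3.0, 0.4 / 3.0], lam)     # unpenalized LSE
print(ok_lasso, ok_lse)
```

The soft-thresholded vector passes the check, while the LSE fails it because its gradient is 0 rather than $-\lambda\,\mathrm{sign}(\hat\beta_j)$.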

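13 Proof of the Lemma

Let $\hat{u} = \hat\beta - \beta$ and

$$V_n(u) = \sum_{i=1}^n \left[ (\varepsilon_i - X_i u)^2 - \varepsilon_i^2 \right] + \lambda_n \|u + \beta\|_1,$$

where $X_i$ is the $i$th row of $X$. Then $\hat{u} = \arg\min_u V_n(u)$.

It can be verified that the KKT condition is equivalent to

$$C_{11} (\sqrt{n}\, \hat{u}_A) - W_A = -\frac{\lambda_n}{2\sqrt{n}}\, \mathrm{sign}(\beta_A), \qquad (3)$$

$$-\frac{\lambda_n}{2\sqrt{n}}\, \mathbf{1} \le C_{21} (\sqrt{n}\, \hat{u}_A) - W_{A^c} \le \frac{\lambda_n}{2\sqrt{n}}\, \mathbf{1}, \qquad (4)$$

$$|\hat{u}_A| < |\beta_A| \qquad (5)$$

We now show that on $A_n \cap B_n$, a solution $\hat{u}$ satisfying (3) and $\hat{u}_{A^c} = 0$ must satisfy (4) and (5), and hence $\hat\beta = \hat{u} + \beta$ is a LASSO estimator. In fact, the LASSO estimator is unique.

First, (3) and $A_n$ imply (5).
Second, (3), $B_n$ and the SIC imply (4).
Finally, a sufficient condition for $\mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta)$ is $|\hat{u}_A| < |\beta_A|$ and $\hat{u}_{A^c} = 0$.

This proves that if $A_n \cap B_n$ holds, then $\mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta)$.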

14 Theorem (strong sign consistency of LASSO)

(i) Assume that the $\varepsilon_i$'s are iid with $E(\varepsilon_i^{2k}) < \infty$ for an integer $k > 0$, and there are positive constants $c_1 < c_2 \le 1$, $M_1$, $M_2$, $M_3$, such that
C1: $n^{-1} \|Z_j\|^2 \le M_1$ for any $j = 1, \ldots, p$, where $Z_j$ is the $j$th column of $Z$;
C2: the smallest eigenvalue of $C_{11}$ is $\ge M_2$;
C3: $q = O(n^{c_1})$;
C4: $n^{(1 - c_2)/2} \min_{j \in A} |\beta_j| \ge M_3$;
C5: $p = o(n^{(c_2 - c_1)k})$.
Under SIC, if $\lambda$ is chosen with $\lambda = o(n^{(1 + c_2 - c_1)/2})$ and $p n^k / \lambda^{2k} = o(1)$, then

$$P\left( \mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta) \right) \ge 1 - O(p n^k / \lambda^{2k})$$

(ii) Assume that the $\varepsilon_i$'s are iid normal and C1-C4 hold, and
C5a: $p = O(e^{n^{c_3}})$ with a constant $c_3$, $0 \le c_3 < c_2 - c_1$.
Under SIC, if $\lambda$ is chosen with $\lambda \propto n^{(1 + c_4)/2}$, where $c_4$ is a constant with $c_3 < c_4 < c_2 - c_1$, then

$$P\left( \mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta) \right) \ge 1 - O(e^{-n^{c_3}})$$

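15 Proof.

$z_j$ = the $j$th component of $C_{11}^{-1} W_A$, $j = 1, \ldots, q$
$\zeta_j$ = the $j$th component of $C_{21} C_{11}^{-1} W_A - W_{A^c}$, $j = 1, \ldots, p - q$
$b_j$ = the $j$th component of $|C_{11}^{-1} \mathrm{sign}(\beta_A)|$, $j = 1, \ldots, q$

The condition $E(\varepsilon_i^{2k}) < \infty$ implies that $E(z_j^{2k}) < \infty$ and $E(\zeta_j^{2k}) < \infty$.

By the lemma and the Markov inequality,

$$\begin{aligned}
P\left( \mathrm{sign}(\hat\beta) \ne \mathrm{sign}(\beta) \right) &\le 1 - P(A_n \cap B_n) \\
&\le \sum_{j \in A} P\left( |z_j| \ge \sqrt{n}\, |\beta_j| - \frac{\lambda b_j}{2\sqrt{n}} \right) + \sum_{j \in A^c} P\left( |\zeta_j| \ge \frac{\lambda \eta_j}{2\sqrt{n}} \right) \\
&\le (1 + o(1)) \sum_{j \in A} \frac{E|z_j|^{2k}}{n^k |\beta_j|^{2k}} + \sum_{j \in A^c} \frac{2^{2k}\, E|\zeta_j|^{2k}}{(\lambda \eta_j)^{2k} / n^k} \\
&= q\, O(n^{-k c_2}) + (p - q)\, O(n^k / \lambda^{2k}) \\
&= o(p n^k / \lambda^{2k}) + O(p n^k / \lambda^{2k}) \\
&= O(p n^k / \lambda^{2k})
\end{aligned}$$

The factor $1 + o(1)$ in the first sum reflects that, under C2-C4 and the choice of $\lambda$, the term $\lambda b_j / (2\sqrt{n})$ is of smaller order than $\sqrt{n}\, |\beta_j|$.

This proves (i).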

16 For (ii), the normality of $\varepsilon_i$ implies that $z_j$ and $\zeta_j$ are normal. Instead of using the Markov inequality, using

$$1 - \Phi(t) \le t^{-1} e^{-t^2/2}$$

leads to the result (ii).

Advantage and disadvantage of using LASSO

Variable selection and parameter estimation at the same time.
It is very good in estimation and prediction, but it is often too conservative in variable selection.
Need SIC.
Population version of SIC:

$$\left| \Sigma_{21} \Sigma_{11}^{-1} \mathrm{sign}(\beta_A) \right| \le \mathbf{1} - \eta,$$

where the $\Sigma_{kj}$ are submatrices of $\Sigma = \mathrm{Var}(z_j)$, if the $z_j$'s are iid and $z_j$ is the $j$th row of $Z$.

Improvements

Adaptive LASSO
Group LASSO
Elastic net (other penalties)
LASSO plus thresholding (ridge regression plus thresholding)