Lecture 34: Ridge regression and LASSO


Ridge regression

Consider the linear model $X = Z\beta + \varepsilon$, where $\beta \in \mathbb{R}^p$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$. The LSE is obtained from the minimization problem
$$\min_{\beta \in \mathbb{R}^p} \|X - Z\beta\|^2 \qquad (1)$$
A type of shrinkage estimator is obtained from (1) by adding a penalty on $\|\beta\|^2$, i.e.,
$$\min_{\beta \in \mathbb{R}^p} \left( \|X - Z\beta\|^2 + \lambda \|\beta\|^2 \right) \qquad (2)$$
where $\lambda \geq 0$ is a constant controlling the penalization. Differentiating,
$$\frac{\partial}{\partial \beta} \left( \|X - Z\beta\|^2 + \lambda \|\beta\|^2 \right) = -2 Z^{\tau}(X - Z\beta) + 2\lambda \beta,$$
which gives the solution to (2) as
$$\hat{\beta}_{\lambda} = (Z^{\tau} Z + \lambda I_p)^{-1} Z^{\tau} X.$$
This estimator is better than the LSE when $Z^{\tau} Z$ is nearly singular.
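To make the closed form concrete, here is a minimal numpy sketch of $\hat{\beta}_{\lambda} = (Z^{\tau}Z + \lambda I_p)^{-1} Z^{\tau} X$; the simulated data and the function name `ridge` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def ridge(Z, X, lam):
    """Ridge estimator (Z'Z + lam*I_p)^{-1} Z'X; lam = 0 recovers the LSE."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ X)

# Hypothetical simulated data for illustration
rng = np.random.default_rng(0)
n, p = 50, 5
Z = rng.standard_normal((n, p))
beta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
X = Z @ beta + rng.standard_normal(n)

beta_lse = ridge(Z, X, lam=0.0)    # ordinary least squares
beta_ridge = ridge(Z, X, lam=5.0)  # shrunk toward zero
```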

This gives a class of estimators called ridge regression estimators; in particular, $\lambda = 0$ gives the LSE.

Bias and covariance matrix
$$E(\hat{\beta}_{\lambda}) = (Z^{\tau}Z + \lambda I_p)^{-1} Z^{\tau} E(X) = (Z^{\tau}Z + \lambda I_p)^{-1} Z^{\tau} Z \beta$$
The bias of $\hat{\beta}_{\lambda}$ is then
$$b(\beta) = (Z^{\tau}Z + \lambda I_p)^{-1} Z^{\tau} Z \beta - \beta = -\lambda (Z^{\tau}Z + \lambda I_p)^{-1} \beta$$
The bias is not 0, but converges to 0 as $\lambda \to 0$.
$$\mathrm{Var}(\hat{\beta}_{\lambda}) = (Z^{\tau}Z + \lambda I_p)^{-1} Z^{\tau} \mathrm{Var}(X) Z (Z^{\tau}Z + \lambda I_p)^{-1} = \sigma^2 (Z^{\tau}Z + \lambda I_p)^{-1} Z^{\tau} Z (Z^{\tau}Z + \lambda I_p)^{-1} = \sigma^2 (Z^{\tau}Z + \lambda I_p)^{-1} - \sigma^2 \lambda (Z^{\tau}Z + \lambda I_p)^{-2}$$
It can be seen that the variance converges to 0 if $\lambda \to \infty$ and to $\sigma^2 (Z^{\tau}Z)^{-1}$ if $\lambda \to 0$. Combining the bias and variance, we get
$$E\|\hat{\beta}_{\lambda} - \beta\|^2 = \|b(\beta)\|^2 + E\|\hat{\beta}_{\lambda} - E(\hat{\beta}_{\lambda})\|^2 = \lambda^2 \|(Z^{\tau}Z + \lambda I_p)^{-1} \beta\|^2 + \sigma^2 \mathrm{tr}[Z^{\tau}Z (Z^{\tau}Z + \lambda I_p)^{-2}]$$
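The decomposition above can be checked by simulation. The following sketch compares the exact formula $\lambda^2\|(Z^{\tau}Z + \lambda I_p)^{-1}\beta\|^2 + \sigma^2\mathrm{tr}[Z^{\tau}Z(Z^{\tau}Z + \lambda I_p)^{-2}]$ with a Monte Carlo estimate of $E\|\hat{\beta}_{\lambda} - \beta\|^2$; all data-generating choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 100, 4, 3.0, 1.0
Z = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.0, 0.5])
A = np.linalg.inv(Z.T @ Z + lam * np.eye(p))

# Exact MSE from the bias-variance decomposition above
mse_exact = lam**2 * np.sum((A @ beta)**2) + sigma**2 * np.trace(Z.T @ Z @ A @ A)

# Monte Carlo estimate over repeated draws of the noise
errors = []
for _ in range(5000):
    X = Z @ beta + sigma * rng.standard_normal(n)
    errors.append(np.sum((A @ Z.T @ X - beta)**2))
print(mse_exact, np.mean(errors))  # the two values should agree closely
```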

Theorem (Comparison between ridge regression and LSE)
Let $\hat{\beta} = \hat{\beta}_0$ be the LSE.
(i) If $0 < \lambda < 2\sigma^2/\|\beta\|^2$, then $E\|\hat{\beta}_{\lambda} - \beta\|^2 < E\|\hat{\beta} - \beta\|^2$.
(ii) Assume that the smallest eigenvalue of $Z^{\tau}Z$ is of the order $n$. If $\lambda > 2\sigma^2/\|\beta\|^2$, then $E\|\hat{\beta}_{\lambda} - \beta\|^2 > E\|\hat{\beta} - \beta\|^2$ for sufficiently large $n$.
(iii) Under the assumption of (ii), if $\lambda = 2\sigma^2/\|\beta\|^2$, then $E\|\hat{\beta}_{\lambda} - \beta\|^2 = E\|\hat{\beta} - \beta\|^2 + O(n^{-3})$.

Proof. Let
$$A = \sigma^2 (Z^{\tau}Z)^{-1} - \sigma^2 (Z^{\tau}Z + \lambda I_p)^{-1} Z^{\tau}Z (Z^{\tau}Z + \lambda I_p)^{-1} - \lambda^2 (Z^{\tau}Z + \lambda I_p)^{-1} \beta \beta^{\tau} (Z^{\tau}Z + \lambda I_p)^{-1}$$
Then
$$(Z^{\tau}Z + \lambda I_p) A (Z^{\tau}Z + \lambda I_p) = \sigma^2 (Z^{\tau}Z + \lambda I_p)(Z^{\tau}Z)^{-1}(Z^{\tau}Z + \lambda I_p) - \sigma^2 Z^{\tau}Z - \lambda^2 \beta\beta^{\tau} = 2\lambda\sigma^2 I_p + \lambda^2 \sigma^2 (Z^{\tau}Z)^{-1} - \lambda^2 \beta\beta^{\tau}$$
Hence
$$A = (Z^{\tau}Z + \lambda I_p)^{-1} [2\lambda\sigma^2 I_p - \lambda^2 \beta\beta^{\tau} + \lambda^2 \sigma^2 (Z^{\tau}Z)^{-1}] (Z^{\tau}Z + \lambda I_p)^{-1}$$

Assume $\lambda > 0$ and $\beta \neq 0$. Then
$$A > \lambda^2 \sigma^2 (Z^{\tau}Z + \lambda I_p)^{-1} (Z^{\tau}Z)^{-1} (Z^{\tau}Z + \lambda I_p)^{-1}$$
if and only if $2\sigma^2 \lambda^{-1} I_p - \beta\beta^{\tau} > 0$, which is equivalent to $\lambda < 2\sigma^2/\|\beta\|^2$. This can be shown as follows. If $2\sigma^2\lambda^{-1} I_p - \beta\beta^{\tau} > 0$, then $0 < \beta^{\tau}(2\sigma^2\lambda^{-1} I_p - \beta\beta^{\tau})\beta = 2\sigma^2\lambda^{-1}\|\beta\|^2 - \|\beta\|^4$, which means $\lambda < 2\sigma^2/\|\beta\|^2$. On the other hand, if $\lambda < 2\sigma^2/\|\beta\|^2$, then
$$(2\sigma^2\lambda^{-1} I_p - \beta\beta^{\tau})/\|\beta\|^2 = (2\sigma^2\lambda^{-1}\|\beta\|^{-2} - 1) I_p + I_p - \beta\beta^{\tau}/\|\beta\|^2 > 0,$$
because $I_p - \beta\beta^{\tau}/\|\beta\|^2$ is a projection matrix whose eigenvalues are either 0 or 1.
Since $\mathrm{Var}(\hat{\beta}) = \sigma^2 (Z^{\tau}Z)^{-1}$, using the formula for $\mathrm{Var}(\hat{\beta}_{\lambda})$ we obtain
$$E\|\hat{\beta} - \beta\|^2 - E\|\hat{\beta}_{\lambda} - \beta\|^2 = \mathrm{tr}(A)$$
Thus, (i) follows, and (ii) and (iii) follow from
$$\lambda^2 \sigma^2 (Z^{\tau}Z + \lambda I_p)^{-1} (Z^{\tau}Z)^{-1} (Z^{\tau}Z + \lambda I_p)^{-1} \leq \lambda^2 \sigma^2 (Z^{\tau}Z)^{-3}$$
Ridge regression is better when the noise-to-signal ratio $\sigma^2/\|\beta\|^2$ is large.
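Part (i) of the theorem can be illustrated numerically with the exact MSE formula from the earlier slide: for any $\lambda$ below the threshold $2\sigma^2/\|\beta\|^2$, the ridge MSE is smaller than the LSE MSE. The setup below (sizes, signal, noise level) is an assumed example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 60, 5, 2.0
Z = rng.standard_normal((n, p))
beta = 0.3 * np.ones(p)                  # weak signal, so noise-to-signal is large
thresh = 2 * sigma**2 / np.sum(beta**2)  # threshold 2*sigma^2 / ||beta||^2

def mse(lam):
    """Exact E||beta_hat_lam - beta||^2 from the bias-variance decomposition."""
    A = np.linalg.inv(Z.T @ Z + lam * np.eye(p))
    return lam**2 * np.sum((A @ beta)**2) + sigma**2 * np.trace(Z.T @ Z @ A @ A)

print(mse(0.0), mse(0.9 * thresh))  # the ridge value should be the smaller one
```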

High dimension problems
The dimension of $\beta$ in a linear model is $p$ ($Z$ is $n \times p$).
- In traditional applications, $p \ll n$ and $p$ is fixed as $n \to \infty$.
- In modern applications, $p$ is large; $p = p_n$ increases as $n$ increases.
- $p = O(n^k)$: polynomial-type divergence rate.
- $p = O(e^{n^{\nu}})$: ultra-high dimension, where $\nu < 1$ is a constant.

Non-identifiability of $\beta$
Let $r = r_n$ be the rank of $Z$; the dimension of $R(Z)$ is $r_n$. If $p > n$, then $\beta$ is not identifiable: there are $\beta$ and $\tilde{\beta}$ with $\beta \neq \tilde{\beta}$ but $Z\beta = Z\tilde{\beta}$, so that the data generated under the models with $\beta$ and $\tilde{\beta}$ are the same (see the sketch below). It is not possible to estimate all components of $\beta$ consistently; we are not able to estimate something outside the data range. We can, however, consistently estimate some useful functions of $\beta$, e.g., the projection of $\beta$ onto $R(Z)$. Estimation of the projection is sufficient for many problems.
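As referenced above, a quick sketch of non-identifiability when $p > n$: adding any vector in the null space of $Z$ to $\beta$ leaves $Z\beta$ unchanged. The toy dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 4, 7                        # p > n, so Z has a nontrivial null space
Z = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

v = np.linalg.svd(Z)[2][-1]        # a right singular vector with Z @ v = 0
beta_tilde = beta + v              # a different parameter value
print(np.allclose(Z @ beta, Z @ beta_tilde))  # True: the models are identical
```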

Projection
Singular value decomposition: $Z = PDQ^{\tau}$, where
- $P$: $n \times r$ matrix with $P^{\tau}P = I_r$ (identity matrix);
- $Q$: $p \times r$ matrix with $Q^{\tau}Q = I_r$;
- $D$: $r \times r$ diagonal matrix of full rank.

Projection of $\beta$ onto $R(Z)$:
$$\theta = Z^{\tau}(ZZ^{\tau})^{-} Z\beta = QQ^{\tau}\beta \in R(Z), \qquad Z\theta = PDQ^{\tau}(QQ^{\tau}\beta) = PDQ^{\tau}\beta = Z\beta$$
The model $X = Z\beta + \varepsilon$ is the same as $X = Z\theta + \varepsilon$.

Ridge regression estimator of $\theta$:
$$\hat{\theta} = (Z^{\tau}Z + h_n I_p)^{-1} Z^{\tau} X, \qquad h_n > 0$$
We only need to invert an $n \times n$ matrix, because
$$(Z^{\tau}Z + h_n I_p)^{-1} Z^{\tau} = Z^{\tau} (ZZ^{\tau} + h_n I_n)^{-1}$$
$\hat{\theta}$ is always in $R(Z)$.
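The $n \times n$ inversion identity can be verified directly; the sketch below computes $\hat{\theta}$ both ways on an assumed wide design ($p > n$) and confirms they match.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, h = 10, 200, 0.5
Z = rng.standard_normal((n, p))
X = rng.standard_normal(n)

theta_p = np.linalg.solve(Z.T @ Z + h * np.eye(p), Z.T @ X)  # p x p solve
theta_n = Z.T @ np.linalg.solve(Z @ Z.T + h * np.eye(n), X)  # n x n solve
print(np.allclose(theta_p, theta_n))  # True: only an n x n inversion is needed
```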

Derivation of the bias of the ridge regression estimator
Let $\Gamma = (Q \;\; \tilde{Q})$ with $\tilde{Q}^{\tau}Q = 0$ and $\Gamma\Gamma^{\tau} = \Gamma^{\tau}\Gamma = I_p$. Then
$$\mathrm{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta = (Z^{\tau}Z + h_n I_p)^{-1} Z^{\tau}Z\,\theta - \theta = -(h_n^{-1} Z^{\tau}Z + I_p)^{-1}\theta = -\Gamma(h_n^{-1}\Gamma^{\tau}Z^{\tau}Z\Gamma + I_p)^{-1}\Gamma^{\tau} QQ^{\tau}\theta$$
$$= -(Q \;\; \tilde{Q}) \begin{pmatrix} (h_n^{-1}D^2 + I_r)^{-1} & 0 \\ 0 & I_{p-r} \end{pmatrix} \begin{pmatrix} Q^{\tau} \\ \tilde{Q}^{\tau} \end{pmatrix} QQ^{\tau}\theta = -Q(h_n^{-1}D^2 + I_r)^{-1} Q^{\tau}\theta$$
$$= -Q \begin{pmatrix} (1 + d_{1n}/h_n)^{-1} & & \\ & \ddots & \\ & & (1 + d_{rn}/h_n)^{-1} \end{pmatrix} Q^{\tau}\theta$$
where $d_{jn} > 0$ is the $j$th diagonal element of $D^2$ (an eigenvalue of $Z^{\tau}Z$).
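A numerical check of the final bias expression, on an assumed $p > n$ design: the direct bias $(Z^{\tau}Z + h_n I_p)^{-1} Z^{\tau}Z\beta - \theta$ should equal $-Q(h_n^{-1}D^2 + I_r)^{-1}Q^{\tau}\theta$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, h = 8, 20, 2.0
Z = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

P, d, Qt = np.linalg.svd(Z, full_matrices=False)  # Z = P diag(d) Q', r = n here
Q = Qt.T
theta = Q @ (Q.T @ beta)                          # projection of beta onto R(Z)

bias_direct = np.linalg.solve(Z.T @ Z + h * np.eye(p), Z.T @ Z @ beta) - theta
bias_formula = -Q @ np.linalg.solve(np.diag(d**2) / h + np.eye(len(d)), Q.T @ theta)
print(np.allclose(bias_direct, bias_formula))     # True
```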

Thus,
$$\|\mathrm{bias}(\hat{\theta})\|^2 = \theta^{\tau} Q (h_n^{-1}D^2 + I_r)^{-2} Q^{\tau}\theta \leq \max_{1 \leq j \leq r}(1 + d_{jn}/h_n)^{-2}\, \theta^{\tau}QQ^{\tau}\theta \leq h_n^2 d_{1n}^{-2} \|\theta\|^2,$$
where $d_{1n}$ denotes the smallest $d_{jn}$. For the variance,
$$\mathrm{Var}(\hat{\theta}) = \sigma^2 (Z^{\tau}Z + h_n I_p)^{-1} Z^{\tau}Z (Z^{\tau}Z + h_n I_p)^{-1} \leq \sigma^2 h_n^{-1} I_p$$

Theorem (Consistency of $\hat{\theta}$)
Assume that
(C1) $d_{1n}^{-1} = O(n^{-\eta})$, where $\eta \leq 1$ does not depend on $n$;
(C2) $\|\theta\| = O(n^{\tau})$, where $\tau < \eta$ does not depend on $n$.
Then
(i) As $n \to \infty$, $E(l^{\tau}\hat{\theta} - l^{\tau}\theta)^2 = O(h_n^{-1}) + O(h_n^2 n^{-2(\eta - \tau)})$ uniformly over $p$-dimensional deterministic vectors $l$ with $\|l\| = 1$.
(ii) $n^{-1} E\|Z\hat{\theta} - Z\theta\|^2 = O(r_n n^{-1}) + O(h_n^2 n^{-(1 + \eta - 2\tau)})$.

Remarks
- (C2) means that $\theta$ is sparse; without any condition, the order of $\|\theta\|^2$ could be $p$.
- $\|\theta\| \leq \|\beta\|$, so (C2) holds if $\beta$ is sparse.
- For any fixed $l$, $l^{\tau}\hat{\theta}$ is consistent if $h_n \to \infty$ and $h_n n^{-(\eta - \tau)} \to 0$.
- $\hat{\theta}$ is not sparse even if $\theta$ is sparse.
- Typically $r_n/n$ does not converge to 0, so $\hat{\theta}$ is not $L_2$-consistent.
- The reason (ii) is interesting is that $n^{-1}E\|Z\hat{\theta} - Z\theta\|^2 = n^{-1}E\|\tilde{X} - Z\hat{\theta}\|^2 - \sigma^2$, where $\tilde{X}$ is an independent copy of $X$ and $n^{-1}E\|\tilde{X} - Z\hat{\theta}\|^2$ is the average prediction mean squared error.

Problem with the ridge regression estimator
When $p < n$, $\theta = \beta$. Even if $\beta$ has many zero components, the ridge regression estimator does not have any zero components, although it may have many small components.

LASSO estimator
Consider the linear model $X = Z\beta + \varepsilon$, $\beta \in \mathbb{R}^p$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$. The ridge regression estimator of $\beta$ is obtained from
$$\min_{\beta \in \mathbb{R}^p} \left( \|X - Z\beta\|^2 + \lambda \|\beta\|^2 \right)$$
If we change the $L_2$ penalty $\|\beta\|^2$ to the $L_1$ penalty $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$, where $\beta_j$ is the $j$th component of $\beta$, then the LASSO estimator is obtained from
$$\min_{\beta \in \mathbb{R}^p} \left( \|X - Z\beta\|^2 + \lambda \|\beta\|_1 \right)$$
Differences between LASSO and ridge regression:
- The LASSO estimator does not have an explicit form.
- When a component of $\beta$ is 0, its LASSO estimator may be 0, but its ridge regression estimator is never 0.
- The minimization for LASSO is still over a convex objective function, but the objective function is not always differentiable.
- Although LASSO is still defined when $p > n$, it is usually used in the case where $p < n$. If $p < n$, $Z$ can be deterministic or random.
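Since the LASSO has no explicit form, it is usually computed iteratively. Below is a minimal coordinate-descent sketch for the objective $\|X - Z\beta\|^2 + \lambda\|\beta\|_1$, where each coordinate update is a soft-thresholding step; the function name `lasso_cd` and the simulated data are assumptions for illustration. Note that exact zeros appear in the output, unlike with ridge regression.

```python
import numpy as np

def lasso_cd(Z, X, lam, n_iter=500):
    """Coordinate descent for ||X - Z b||^2 + lam * ||b||_1 (a sketch)."""
    n, p = Z.shape
    b = np.zeros(p)
    col_sq = np.sum(Z**2, axis=0)      # squared column norms ||Z_j||^2
    r = X.copy()                       # current residual X - Z @ b
    for _ in range(n_iter):
        for j in range(p):
            r += Z[:, j] * b[j]        # remove coordinate j from the fit
            rho = Z[:, j] @ r
            # exact minimizer over the jth coordinate: soft-thresholding
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[j]
            r -= Z[:, j] * b[j]
    return b

rng = np.random.default_rng(6)
n, p = 100, 8
Z = rng.standard_normal((n, p))
beta = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.7, 0.0])
X = Z @ beta + rng.standard_normal(n)
print(lasso_cd(Z, X, lam=20.0))        # exact zeros in several coordinates
```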

Notation
- $A$ = the set of indices of non-zero coefficients of $\beta$.
- $\beta = (\beta_A, \beta_{A^c})$, $\dim(\beta_A) = q$, $\dim(\beta_{A^c}) = p - q$; $Z = (Z_A, Z_{A^c})$.
$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \frac{1}{n}\begin{pmatrix} Z_A^{\tau} Z_A & Z_A^{\tau} Z_{A^c} \\ Z_{A^c}^{\tau} Z_A & Z_{A^c}^{\tau} Z_{A^c} \end{pmatrix} = \frac{1}{n} Z^{\tau} Z$$

Consistency
The LASSO estimator $\hat{\beta}$ of $\beta$ is strongly sign consistent if there exists $\lambda = \lambda_n$, not depending on the data, such that
$$\lim_{n \to \infty} P\big(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)\big) = 1,$$
which implies variable selection consistency (since $\mathrm{sign}(a) = 0$ iff $a = 0$):
$$\lim_{n \to \infty} P\big(\hat{A} = A\big) = 1,$$
where $\hat{A}$ is the index set of nonzero components of $\hat{\beta}$.

Strong Irrepresentable Condition (SIC)
There exists a vector $\eta$ whose components are positive such that
$$\big| C_{21} C_{11}^{-1}\, \mathrm{sign}(\beta_A) \big| \leq \mathbf{1} - \eta \quad \text{component-wise},$$
where $|a| = (|a_1|, |a_2|, \dots)$ for $a = (a_1, a_2, \dots)$ and $\mathbf{1}$ is the vector of ones.
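The SIC is easy to check numerically for a given design and support. The following sketch evaluates $|C_{21}C_{11}^{-1}\mathrm{sign}(\beta_A)|$ on simulated data; the sizes and coefficients are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, q = 200, 6, 3
Z = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:q] = [1.5, -2.0, 0.8]           # A = the first q coordinates

C = Z.T @ Z / n
C11, C21 = C[:q, :q], C[q:, :q]
lhs = np.abs(C21 @ np.linalg.solve(C11, np.sign(beta[:q])))
print(lhs, np.all(lhs < 1))           # SIC holds if every entry is below 1
```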

Critical Lemma
Under the SIC,
$$P\big(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)\big) \geq P(A_n \cap B_n),$$
where, with the inequalities holding component-wise,
$$A_n = \left\{ |C_{11}^{-1} W_A| < \sqrt{n}\,|\beta_A| - \frac{\lambda_n}{2\sqrt{n}}\, |C_{11}^{-1} \mathrm{sign}(\beta_A)| \right\}, \qquad B_n = \left\{ |C_{21} C_{11}^{-1} W_A - W_{A^c}| \leq \frac{\lambda_n}{2\sqrt{n}}\, \eta \right\},$$
$$W_A = \frac{1}{\sqrt{n}} Z_A^{\tau} \varepsilon, \qquad W_{A^c} = \frac{1}{\sqrt{n}} Z_{A^c}^{\tau} \varepsilon$$

Karush-Kuhn-Tucker (KKT) condition
$\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_p)$ is the LASSO estimator if and only if
$$\frac{\partial}{\partial \beta_j} \|X - Z\beta\|^2 \Big|_{\beta = \hat{\beta}} = -\lambda\, \mathrm{sign}(\hat{\beta}_j) \quad \text{if } \hat{\beta}_j \neq 0, \qquad \left| \frac{\partial}{\partial \beta_j} \|X - Z\beta\|^2 \Big|_{\beta = \hat{\beta}} \right| \leq \lambda \quad \text{if } \hat{\beta}_j = 0.$$
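The KKT characterization can be verified in a case where the LASSO has a closed form. With orthonormal design columns ($Z^{\tau}Z = I_p$), the minimizer is the soft-thresholded vector $\hat{\beta}_j = \mathrm{sign}(b_j)\max(|b_j| - \lambda/2, 0)$ with $b = Z^{\tau}X$; this closed form and the toy data are assumptions used only to illustrate the two KKT cases.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, lam = 50, 5, 4.0
Z, _ = np.linalg.qr(rng.standard_normal((n, p)))   # Z'Z = I_p
X = Z @ np.array([3.0, -0.5, 0.0, 1.0, 0.0]) + rng.standard_normal(n)

b = Z.T @ X
beta_hat = np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0.0)

grad = -2 * Z.T @ (X - Z @ beta_hat)   # gradient of ||X - Z beta||^2 at beta_hat
nz = beta_hat != 0
print(np.allclose(grad[nz], -lam * np.sign(beta_hat[nz])))  # = -lam * sign
print(np.all(np.abs(grad[~nz]) <= lam + 1e-9))              # bounded by lam
```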

Proof of the Lemma
Let $\hat{u} = \hat{\beta} - \beta$ and
$$V_n(u) = \sum_{i=1}^n \left[ (\varepsilon_i - z_i^{\tau} u)^2 - \varepsilon_i^2 \right] + \lambda_n \|u + \beta\|_1,$$
where $z_i^{\tau}$ is the $i$th row of $Z$. Then $\hat{u} = \mathrm{argmin}\, V_n(u)$. It can be verified that the KKT condition is equivalent to
$$C_{11}(\sqrt{n}\,\hat{u}_A) - W_A = -\frac{\lambda_n}{2\sqrt{n}}\, \mathrm{sign}(\beta_A), \qquad (3)$$
$$-\frac{\lambda_n}{2\sqrt{n}}\, \mathbf{1} \leq C_{21}(\sqrt{n}\,\hat{u}_A) - W_{A^c} \leq \frac{\lambda_n}{2\sqrt{n}}\, \mathbf{1}, \qquad (4)$$
$$|\hat{u}_A| < |\beta_A| \qquad (5)$$
We now show that on $A_n \cap B_n$, a solution $\hat{u}$ satisfying (3) and $\hat{u}_{A^c} = 0$ must satisfy (4) and (5), and hence $\hat{\beta} = \hat{u} + \beta$ is a LASSO estimator (in fact, the LASSO estimator is unique). First, (3) and $A_n$ imply (5). Second, (3), $B_n$, and the SIC imply (4). Finally, a sufficient condition for $\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)$ is $|\hat{u}_A| < |\beta_A|$ and $\hat{u}_{A^c} = 0$. This proves that if $A_n \cap B_n$ holds, then $\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)$.

Theorem (strong sign consistency of LASSO)
(i) Assume that the $\varepsilon_i$'s are iid with $E(\varepsilon_i^{2k}) < \infty$ for an integer $k > 0$, and that there are positive constants $c_1 < c_2 \leq 1$ and $M_1, M_2, M_3$ such that
C1: $n^{-1}\|Z_j\|^2 \leq M_1$ for any $j = 1, \dots, p$, where $Z_j$ is the $j$th column of $Z$;
C2: the smallest eigenvalue of $C_{11}$ is $\geq M_2$;
C3: $q = O(n^{c_1})$;
C4: $n^{(1 - c_2)/2} \min_{j \in A} |\beta_j| \geq M_3$;
C5: $p = o(n^{(c_2 - c_1)k})$.
Under the SIC, if $\lambda$ is chosen with $\lambda = o(n^{(1 + c_2 - c_1)/2})$ and $p n^k / \lambda^{2k} = o(1)$, then
$$P\big(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)\big) \geq 1 - O(p n^k / \lambda^{2k})$$
(ii) Assume that the $\varepsilon_i$'s are iid normal, that C1-C4 hold, and that
C5a: $p = O(e^{n^{c_3}})$ with a constant $c_3$, $0 \leq c_3 < c_2 - c_1$.
Under the SIC, if $\lambda$ is chosen with $\lambda \propto n^{(1 + c_4)/2}$, where $c_4$ is a constant with $c_3 < c_4 < c_2 - c_1$, then
$$P\big(\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)\big) \geq 1 - O(e^{-n^{c_3}})$$

Proof.
Let
- $z_j$ = the $j$th component of $C_{11}^{-1} W_A$, $j = 1, \dots, q$;
- $\zeta_j$ = the $j$th component of $C_{21} C_{11}^{-1} W_A - W_{A^c}$, $j = 1, \dots, p - q$;
- $b_j$ = the $j$th component of $C_{11}^{-1} \mathrm{sign}(\beta_A)$, $j = 1, \dots, q$.
The condition $E(\varepsilon_i^{2k}) < \infty$ implies that $E(z_j^{2k}) < \infty$ and $E(\zeta_j^{2k}) < \infty$. By the lemma,
$$P\big(\mathrm{sign}(\hat{\beta}) \neq \mathrm{sign}(\beta)\big) \leq 1 - P(A_n \cap B_n) \leq \sum_{j \in A} P\left( |z_j| \geq \sqrt{n}\,|\beta_j| - \frac{\lambda |b_j|}{2\sqrt{n}} \right) + \sum_{j \in A^c} P\left( |\zeta_j| \geq \frac{\lambda \eta_j}{2\sqrt{n}} \right)$$
$$\leq \sum_{j \in A} \frac{E|z_j|^{2k}}{n^k \beta_j^{2k}}\, O(1) + \sum_{j \in A^c} \frac{(2\sqrt{n})^{2k}\, E|\zeta_j|^{2k}}{(\lambda \eta_j)^{2k}} = q\, O(n^{-k c_2}) + (p - q)\, O(n^k / \lambda^{2k}) = o(p n^k / \lambda^{2k}) + O(p n^k / \lambda^{2k}) = O(p n^k / \lambda^{2k})$$
This proves (i).

For (ii), the normality of the $\varepsilon_i$'s implies that $z_j$ and $\zeta_j$ are normal. Instead of using the Markov inequality, using $1 - \Phi(t) \leq t^{-1} e^{-t^2/2}$ leads to the result in (ii).

Advantages and disadvantages of using LASSO
- Variable selection and parameter estimation are done at the same time.
- It is very good in estimation and prediction, but it is often too conservative in variable selection.
- It needs the SIC. Population version of the SIC: $|\Sigma_{21} \Sigma_{11}^{-1}\, \mathrm{sign}(\beta_A)| \leq \mathbf{1} - \eta$, where the $\Sigma_{kj}$ are submatrices of $\Sigma = \mathrm{Var}(z_j)$ if the $z_j$'s are iid, $z_j$ being the $j$th row of $Z$.

Improvements
- Adaptive LASSO
- Group LASSO
- Elastic net (other penalties)
- LASSO plus thresholding (ridge regression plus thresholding)