Homework 1: Solutions


Statistics 63, Fall 2017

Data Analysis:

Note: All data analysis results are provided by Xuyan Lu.

Baseball Data:

(a) What are the most important features for predicting a player's salary?

(i) Fit and visualize regularization paths for the following methods:

Lasso:

[Figure: lasso coefficient paths (Coefficients versus Log Lambda, Lambda on the top axis); labeled paths: Hits, CRuns, Years, Walks, CAtBat, League, HmRun, CHits, CWalks, AtBat.]

## [1] "Hits"    "Walks"   "Years"   "CRuns"   "PutOuts"

The top 5 predictors selected are: Hits, Walks, Years, CRuns and PutOuts.
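A minimal sketch of how a lasso path like the one above could be produced with glmnet. The data source is not stated in this write-up, so the use of the Hitters data from the ISLR package and log(Salary) as the response are assumptions, and the entry-order ranking at the end is just one heuristic for listing top predictors.

```r
# Sketch only: assumes the baseball data are ISLR's Hitters with log(Salary) as response.
library(ISLR)
library(glmnet)

hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., data = hitters)[, -1]   # drop the intercept column
y <- log(hitters$Salary)

fit_lasso <- glmnet(x, y, alpha = 1)            # alpha = 1 gives the lasso
plot(fit_lasso, xvar = "lambda", label = TRUE)  # coefficient paths vs log(lambda)

# One heuristic for ranking predictors: the order in which they enter the path
# (first nonzero coefficient along the lambda sequence, largest lambda first).
B <- as.matrix(fit_lasso$beta) != 0
entry <- apply(B, 1, function(r) if (any(r)) which(r)[1] else Inf)
head(names(sort(entry)), 5)
```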

Elastic Net:

For alpha = 0.4:

[Figure: elastic-net coefficient paths (Coefficients versus Log Lambda, Lambda on the top axis); labeled paths: Hits, CRuns, Years, Walks, CAtBat, League, HmRun, PutOuts, CWalks, AtBat.]

## [1] "Hits"   "Walks"  "CAtBat" "CHits"  "CRuns"

The top 5 predictors selected are: CRuns, Hits, CAtBat, Walks and CHits.

For alpha = 0.8:

[Figure: elastic-net coefficient paths (Coefficients versus Log Lambda, Lambda on the top axis); labeled paths: Hits, CRuns, Years, Walks, CAtBat, League, HmRun, CHits, CWalks, AtBat.]

## [1] "Hits"  "Walks" "Years" "CHits" "CRuns"

The top 5 predictors selected are: CRuns, Hits, Years, Walks and CHits.
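Elastic net paths like the two above could come from the same glmnet call with the mixing parameter alpha set to 0.4 and 0.8; a short sketch reusing the assumed x and y constructed in the lasso sketch:

```r
# Sketch only: reuses the assumed x and y from the lasso sketch above.
library(glmnet)

fit_en4 <- glmnet(x, y, alpha = 0.4)   # elastic net, 40% lasso / 60% ridge penalty
fit_en8 <- glmnet(x, y, alpha = 0.8)   # heavier lasso weight
plot(fit_en4, xvar = "lambda", label = TRUE)
plot(fit_en8, xvar = "lambda", label = TRUE)
```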

Adaptive Lasso:

I choose to use the LS solution and $\gamma = 1$ to get the weights: $\hat{w} = 1/|\hat{\beta}_{OLS}|$.

[Figure: adaptive lasso coefficient paths (Coefficients versus Log Lambda, Lambda on the top axis); labeled paths: Hits, CRuns, Years, Walks, CAtBat, League, HmRun, CHits, CWalks, AtBat.]

## [1] "Hits"  "Walks" "Years" "CHits" "CRuns"

The top 5 predictors selected are: CRuns, Hits, Years, Walks and CHits.
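A sketch of one way this adaptive lasso could be implemented with glmnet's penalty.factor argument, again reusing the assumed x and y from above:

```r
# Sketch only: adaptive lasso via per-coefficient penalty factors in glmnet.
library(glmnet)

beta_ols <- coef(lm(y ~ x))[-1]          # OLS estimates, intercept dropped
w <- 1 / abs(beta_ols)                   # weights with gamma = 1
fit_alasso <- glmnet(x, y, alpha = 1, penalty.factor = w)
plot(fit_alasso, xvar = "lambda", label = TRUE)
```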

SCAD:

[Figure: SCAD coefficient paths, $\hat{\beta}$ versus $\lambda$.]

## [1] "Hits"      "Walks"     "CHits"     "DivisionW" "PutOuts"

The top 5 predictors selected are: Hits, Walks, CHits, DivisionW and PutOuts.
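A SCAD path like this one could be fit with the ncvreg package; the MC+ (MCP) path in the next item would be the same call with penalty = "MCP". A sketch under the same data assumption as above:

```r
# Sketch only: nonconvex penalized paths via ncvreg (SCAD and MCP), reusing x and y.
library(ncvreg)

fit_scad <- ncvreg(x, y, penalty = "SCAD")
fit_mcp  <- ncvreg(x, y, penalty = "MCP")
plot(fit_scad)   # beta-hat versus lambda
plot(fit_mcp)
```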

MC+:

[Figure: MC+ coefficient paths, $\hat{\beta}$ versus $\lambda$.]

## [1] "Hits"      "Walks"     "CHits"     "DivisionW" "PutOuts"

The top 5 predictors selected are: Hits, Walks, CHits, DivisionW and PutOuts.

(ii) What are the top predictors selected by each method? Are they different? If so, why?

I count the number of predictors selected by each model and take the top selected predictors of each model; the results are given in the previous question. The top predictors selected by all models are Hits and Walks. The other top selections are not the same, because the penalty terms of the different models are not the same. For the convex models (Lasso, Elastic Net, Adaptive Lasso) the top selected predictors are very similar, and for the non-convex models (SCAD, MC+) the top selected predictors are very similar.

(b) Which linear method is best at predicting a player's salary?

(i) Compare the average prediction MSE on the test set for the following methods:

                LS          Ridge       Best Subset   Lasso
Averaged MSE    0.446366    0.389322    0.434976      0.3976438

                Elastic Net   Adaptive Lasso   SCAD       MCP
Averaged MSE    0.3965297     0.405767         0.403857   0.40577

The averaged prediction MSE is shown in the table above. Ridge, Lasso and Elastic Net give relatively good predictions; Least Squares performs the worst among all methods.

(ii) Visualize the results of your comparisons.

[Figure: boxplots of test MSEs by method: LS, Ridge, B.S., Lasso, E.N., A.L., SCAD, MC+.]

I made boxplots of the MSEs of all methods (10 times 10 MSEs for each method) and got the plot above.

(iii) Reflection.

From the boxplot above we can see that Ridge gives the best prediction error, and Lasso and Elastic Net also give good prediction error. These methods do well because they are shrinkage methods that constrain $\hat{\beta}$ to keep it from overfitting the training set, so they often have lower MSE than Least Squares. Least Squares places no constraint on $\hat{\beta}$, so it overfits easily, especially when there are too many features. Best Subset is a discrete process, so it often exhibits high variance, which results in poor performance on the test set.

Not all the methods choose the same subset of variables. Least Squares and Ridge use all the variables, but the others use only some of them. The difference in subsets is due to the difference in penalty terms.
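A sketch of how the repeated train/test comparison behind the table and boxplot might be organized. The exact resampling scheme is not stated above, so the ten repeats of a random 50/50 split are an assumption, and only two of the eight methods are shown to keep the sketch short; x and y are as constructed in the lasso sketch.

```r
# Sketch only: repeated random splits, test MSE per method, then a boxplot.
library(glmnet)

set.seed(1)
n_rep <- 10
mse <- matrix(NA, n_rep, 2, dimnames = list(NULL, c("LS", "Ridge")))

for (r in 1:n_rep) {
  idx <- sample(nrow(x), floor(nrow(x) / 2))    # assumed 50/50 split
  xtr <- x[idx, ];  ytr <- y[idx]
  xte <- x[-idx, ]; yte <- y[-idx]

  # Least squares
  ls_fit  <- lm(ytr ~ xtr)
  pred_ls <- cbind(1, xte) %*% coef(ls_fit)
  mse[r, "LS"] <- mean((yte - pred_ls)^2)

  # Ridge, lambda chosen by cross-validation on the training half
  cv_r   <- cv.glmnet(xtr, ytr, alpha = 0)
  pred_r <- predict(cv_r, newx = xte, s = "lambda.min")
  mse[r, "Ridge"] <- mean((yte - pred_r)^2)
}

colMeans(mse)     # averaged test MSE per method
boxplot(mse)      # compare the spread of test MSEs across splits
```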

Math Problems:

2. We consider the problem $y = X\beta + \epsilon$ where $y \in \mathbb{R}^n$, $\beta \in \mathbb{R}^p$, $X \in \mathbb{R}^{n \times p}$, $\epsilon \sim (0, \sigma^2 I_{n \times n})$ and $n \geq p$ with $X$ full-rank. Recall that, for a general linear estimator $\widehat{X\beta} = X\hat{\beta}$ of $X\beta$:

$$
\begin{aligned}
\operatorname{MSE}(X\hat{\beta})
&= E\left[(X\hat{\beta} - X\beta)^T (X\hat{\beta} - X\beta)\right] \\
&= E\left[(X\hat{\beta} - E[X\hat{\beta}] + E[X\hat{\beta}] - X\beta)^T (X\hat{\beta} - E[X\hat{\beta}] + E[X\hat{\beta}] - X\beta)\right] \\
&= E\left[(X\hat{\beta} - E[X\hat{\beta}])^T (X\hat{\beta} - E[X\hat{\beta}])\right]
 + E\left[(X\hat{\beta} - E[X\hat{\beta}])^T (E[X\hat{\beta}] - X\beta)\right] \\
&\quad + E\left[(E[X\hat{\beta}] - X\beta)^T (X\hat{\beta} - E[X\hat{\beta}])\right]
 + E\left[(E[X\hat{\beta}] - X\beta)^T (E[X\hat{\beta}] - X\beta)\right] \\
&= \operatorname{Tr}\left(\operatorname{Var}(X\hat{\beta})\right)
 + \left(E[X\hat{\beta}] - E[X\hat{\beta}]\right)^T \left(E[X\hat{\beta}] - X\beta\right)
 + \left(E[X\hat{\beta}] - X\beta\right)^T \left(E[X\hat{\beta}] - E[X\hat{\beta}]\right)
 + \|\operatorname{Bias}(X\hat{\beta})\|_2^2 \\
&= \operatorname{Tr}\left(\operatorname{Var}(X\hat{\beta})\right) + 0 + 0 + \|\operatorname{Bias}(X\hat{\beta})\|_2^2 \\
&= \operatorname{Tr}\left(\operatorname{Var}(X\hat{\beta})\right) + \|\operatorname{Bias}(X\hat{\beta})\|_2^2
\end{aligned}
$$

For $\hat{\beta}_{OLS}$, we know $\hat{\beta}_{OLS}$ is unbiased ($E[\hat{\beta}_{OLS}] = \beta$), so

$$
\operatorname{Bias}(X\hat{\beta}_{OLS}) = E[X\hat{\beta}_{OLS} - X\beta] = X\,E[\hat{\beta}_{OLS} - \beta] = X \operatorname{Bias}(\hat{\beta}_{OLS}) = X \cdot 0 = 0
$$

Hence the MSE is a function of the variance only:

$$
\begin{aligned}
\operatorname{Var}(\hat{\beta}_{OLS})
&= \operatorname{Var}\left((X^T X)^{-1} X^T y\right)
 = (X^T X)^{-1} X^T \operatorname{Var}(y) \left((X^T X)^{-1} X^T\right)^T \\
&= (X^T X)^{-1} X^T (\sigma^2 I_n) X (X^T X)^{-1}
 = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1}
 = \sigma^2 (X^T X)^{-1} \\
\Rightarrow \operatorname{Var}(X\hat{\beta}_{OLS})
&= X \operatorname{Var}(\hat{\beta}_{OLS}) X^T = \sigma^2 X (X^T X)^{-1} X^T \\
\operatorname{MSE}(X\hat{\beta}_{OLS})
&= \operatorname{Tr}\left(\operatorname{Var}(X\hat{\beta}_{OLS})\right) + \|\operatorname{Bias}(X\hat{\beta}_{OLS})\|_2^2 \\
&= \operatorname{Tr}\left(\sigma^2 X (X^T X)^{-1} X^T\right) + \|0\|_2^2
 = \operatorname{Tr}\left(\sigma^2 (X^T X)^{-1} X^T X\right)
 = \operatorname{Tr}(\sigma^2 I_{p \times p}) = \sigma^2 p
\end{aligned}
$$

(If the OLS solution is unique, $X^T X$ is a real, symmetric, positive-definite matrix.)

For $\hat{\beta}_{Ridge}(\lambda)$, we have to analyze both terms. We first note:

$$
\begin{aligned}
E\left[\hat{\beta}_{Ridge}(\lambda)\right]
&= E\left[(X^T X + \lambda I)^{-1} X^T y\right]
 = (X^T X + \lambda I)^{-1} X^T E[y]
 = (X^T X + \lambda I)^{-1} X^T X \beta \\
&= (X^T X + \lambda I)^{-1} (X^T X + \lambda I - \lambda I)\beta
 = (X^T X + \lambda I)^{-1} (X^T X + \lambda I)\beta - (X^T X + \lambda I)^{-1} \lambda I \beta \\
&= \beta - \lambda (X^T X + \lambda I)^{-1} \beta
\end{aligned}
$$

so

$$
E[X\hat{\beta}_{Ridge}(\lambda)] = X\beta - \lambda X (X^T X + \lambda I)^{-1} \beta
$$

giving

$$
\begin{aligned}
\|\operatorname{Bias}(X\hat{\beta}_{Ridge}(\lambda))\|_2^2
&= \left(E[X\hat{\beta}_{Ridge}(\lambda)] - X\beta\right)^T \left(E[X\hat{\beta}_{Ridge}(\lambda)] - X\beta\right) \\
&= \left(\lambda X (X^T X + \lambda I)^{-1} \beta\right)^T \left(\lambda X (X^T X + \lambda I)^{-1} \beta\right) \\
&= \lambda^2 \beta^T (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} \beta
\end{aligned}
$$

and

$$
\begin{aligned}
\operatorname{Var}(\hat{\beta}_{Ridge}(\lambda))
&= \operatorname{Var}\left((X^T X + \lambda I)^{-1} X^T y\right)
 = (X^T X + \lambda I)^{-1} X^T \operatorname{Var}(y) \left((X^T X + \lambda I)^{-1} X^T\right)^T \\
&= (X^T X + \lambda I)^{-1} X^T (\sigma^2 I) X (X^T X + \lambda I)^{-1}
 = \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} \\
\Rightarrow \operatorname{Var}(X\hat{\beta}_{Ridge}(\lambda))
&= X \operatorname{Var}(\hat{\beta}_{Ridge}(\lambda)) X^T
 = \sigma^2 X (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} X^T
\end{aligned}
$$

giving

$$
\begin{aligned}
\operatorname{MSE}(X\hat{\beta}_{Ridge}(\lambda))
&= \operatorname{Tr}\left(\operatorname{Var}(X\hat{\beta}_{Ridge}(\lambda))\right) + \|\operatorname{Bias}(X\hat{\beta}_{Ridge}(\lambda))\|_2^2 \\
&= \operatorname{Tr}\left(\sigma^2 X (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} X^T\right)
 + \lambda^2 \beta^T (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} \beta \\
&= \sigma^2 \operatorname{Tr}\left(\left[(X^T X + \lambda I)^{-1} X^T X\right]^2\right)
 + \lambda^2 \beta^T (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} \beta
\end{aligned}
$$

As expected, if $\lambda = 0$, we recover $\operatorname{MSE}(X\hat{\beta}_{OLS})$ from above. To simplify $\operatorname{MSE}(X\hat{\beta}_{Ridge}(\lambda))$ further, we take the eigendecomposition $X^T X = P^T D P$ where

- $P \in \mathbb{R}^{p \times p}$ is an orthogonal matrix ($P^T P = P P^T = I_{p \times p}$);
- $D \in \mathbb{R}^{p \times p}$ is a diagonal matrix with all strictly positive elements.

Note that, with this decomposition, $(X^T X + \lambda I)^{-1}$ can be simplified:

$$
(X^T X + \lambda I)^{-1} = (P^T D P + \lambda P^T P)^{-1} = \left(P^T (D + \lambda I) P\right)^{-1} = P^T (D + \lambda I)^{-1} P
$$

where $(D + \lambda I)^{-1}$ is a diagonal matrix with elements $\frac{1}{\lambda_i + \lambda}$, where $\{\lambda_i\}$ are the eigenvalues of $X^T X$.

We use this to simplify the expression above:

$$
\begin{aligned}
\operatorname{MSE}(X\hat{\beta}_{Ridge}(\lambda))
&= \sigma^2 \operatorname{Tr}\left(\left[P^T (D + \lambda I)^{-1} P\, P^T D P\right]^2\right)
 + \lambda^2 \beta^T P^T (D + \lambda I)^{-1} P\, P^T D P\, P^T (D + \lambda I)^{-1} P \beta \\
&= \sigma^2 \operatorname{Tr}\left(\left[P^T (D + \lambda I)^{-1} D P\right]^2\right)
 + \lambda^2 \beta^T P^T (D + \lambda I)^{-1} D (D + \lambda I)^{-1} P \beta \\
&= \sigma^2 \operatorname{Tr}\left(P^T (D + \lambda I)^{-1} D (D + \lambda I)^{-1} D P\right)
 + \lambda^2 \beta^T P^T (D + \lambda I)^{-1} D (D + \lambda I)^{-1} P \beta \\
&= \sigma^2 \operatorname{Tr}\left((D + \lambda I)^{-1} D (D + \lambda I)^{-1} D\, P P^T\right)
 + \lambda^2 \beta^T P^T (D + \lambda I)^{-1} D (D + \lambda I)^{-1} P \beta \\
&= \sigma^2 \sum_{i=1}^{p} \frac{\lambda_i^2}{(\lambda + \lambda_i)^2}
 + \lambda^2 \sum_{i=1}^{p} \frac{\lambda_i (P\beta)_i^2}{(\lambda + \lambda_i)^2}
 = \sum_{i=1}^{p} \frac{\sigma^2 \lambda_i^2 + \lambda^2 \lambda_i (P\beta)_i^2}{(\lambda + \lambda_i)^2}
\end{aligned}
$$

for some fixed but unknown $(P\beta)_i$. As before, note that, if $\lambda = 0$, we simply get

$$
\operatorname{MSE}(X\hat{\beta}_{Ridge}(0))
= \sum_{i=1}^{p} \frac{\sigma^2 \lambda_i^2 + 0 \cdot \lambda_i (P\beta)_i^2}{(0 + \lambda_i)^2}
= \sum_{i=1}^{p} \frac{\sigma^2 \lambda_i^2}{\lambda_i^2}
= \sigma^2 p
$$

which is $\operatorname{MSE}(X\hat{\beta}_{OLS})$. Since $\operatorname{MSE}(X\hat{\beta}_{Ridge}(\lambda)) = \operatorname{MSE}(X\hat{\beta}_{OLS})$ at $\lambda = 0$, we know that a $\lambda$ with lower MSE must exist if the derivative of $\operatorname{MSE}(X\hat{\beta}_{Ridge}(\lambda))$ with respect to $\lambda$ is negative at $\lambda = 0$. We differentiate with respect to $\lambda$:

$$
\begin{aligned}
\frac{d\,\operatorname{MSE}(X\hat{\beta}_{Ridge}(\lambda))}{d\lambda}
&= \sum_{i=1}^{p} \frac{(\lambda + \lambda_i)^2 \cdot 2\lambda \lambda_i (P\beta)_i^2
 - \left(\sigma^2 \lambda_i^2 + \lambda^2 \lambda_i (P\beta)_i^2\right) \cdot 2(\lambda + \lambda_i)}{(\lambda + \lambda_i)^4} \\
&= \sum_{i=1}^{p} \frac{(\lambda + \lambda_i) \cdot 2\lambda \lambda_i (P\beta)_i^2
 - 2\sigma^2 \lambda_i^2 - 2\lambda^2 \lambda_i (P\beta)_i^2}{(\lambda + \lambda_i)^3} \\
&= \sum_{i=1}^{p} \frac{2\lambda^2 \lambda_i (P\beta)_i^2 + 2\lambda \lambda_i^2 (P\beta)_i^2
 - 2\sigma^2 \lambda_i^2 - 2\lambda^2 \lambda_i (P\beta)_i^2}{(\lambda + \lambda_i)^3} \\
&= \sum_{i=1}^{p} \frac{2\lambda \lambda_i^2 (P\beta)_i^2 - 2\sigma^2 \lambda_i^2}{(\lambda + \lambda_i)^3}
\end{aligned}
$$

Note that the denominator is strictly positive:

- $X^T X$ has strictly positive eigenvalues $\{\lambda_i\}$ by assumption; and
- $\lambda > 0$ for all penalized regression problems.

A sufficient condition for a sum to be negative is for each of its terms to be negative, so a sufficient condition for the derivative to be negative is

$$
2\lambda \lambda_i^2 (P\beta)_i^2 - 2\sigma^2 \lambda_i^2 < 0
\;\Longleftrightarrow\;
2\lambda \lambda_i^2 (P\beta)_i^2 < 2\sigma^2 \lambda_i^2
\;\Longleftrightarrow\;
\lambda < \frac{\sigma^2}{(P\beta)_i^2}
$$

for all $i$. Equivalently, the derivative is negative if

$$
\lambda \in \left(0, \min_i \left\{\frac{\sigma^2}{(P\beta)_i^2}\right\}\right).
$$

Hence, since $\operatorname{MSE}(X\hat{\beta}_{Ridge}(\lambda))$ is continuous in $\lambda$, we have

$$
\operatorname{MSE}(X\hat{\beta}_{Ridge}(\lambda)) < \operatorname{MSE}(X\hat{\beta}_{OLS})
\quad \text{for} \quad
\lambda \in \left(0, \min_i \frac{\sigma^2}{(P\beta)_i^2}\right)
$$

which proves the MSE existence theorem.
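A small numerical check of this existence result. The design matrix, beta, and sigma^2 below are arbitrary illustrative choices; the sketch evaluates the closed-form MSE(X beta-hat_Ridge(lambda)) derived above on a few values of lambda and compares them with sigma^2 * p.

```r
# Sketch only: evaluate the closed-form ridge MSE formula on an arbitrary example.
set.seed(1)
n <- 50; p <- 5; sigma2 <- 1
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)

XtX <- crossprod(X)                       # X^T X
ridge_mse <- function(lambda) {
  M <- solve(XtX + lambda * diag(p))      # (X^T X + lambda I)^{-1}
  bias2 <- lambda^2 * drop(t(beta) %*% M %*% XtX %*% M %*% beta)
  vartr <- sigma2 * sum(diag((M %*% XtX) %*% (M %*% XtX)))
  vartr + bias2
}

ridge_mse(0)                              # equals sigma2 * p, the OLS MSE
sapply(c(0.01, 0.05, 0.1), ridge_mse)     # typically dips below sigma2 * p for small lambda > 0
```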

3. In standard form,

$$
\min_{\beta^+, \beta^-}\; L(\beta) + \lambda \sum_j \left(\beta_j^+ + \beta_j^-\right)
\quad \text{subject to} \quad \beta_j^+ \geq 0,\; \beta_j^- \geq 0.
$$

The Lagrange form is

$$
\min_{\beta^+, \beta^-}\; L(\beta) + \lambda \sum_j \left(\beta_j^+ + \beta_j^-\right)
 - \sum_j \lambda_j^+ \beta_j^+ - \sum_j \lambda_j^- \beta_j^-
$$

The KKT conditions are

$$
\begin{aligned}
\frac{\partial}{\partial \beta_j^+}\left[L(\beta) + \lambda \sum_j (\beta_j^+ + \beta_j^-) - \sum_j \lambda_j^+ \beta_j^+ - \sum_j \lambda_j^- \beta_j^-\right]
 &= \nabla L(\beta)_j + \lambda - \lambda_j^+ = 0 \\
\frac{\partial}{\partial \beta_j^-}\left[L(\beta) + \lambda \sum_j (\beta_j^+ + \beta_j^-) - \sum_j \lambda_j^+ \beta_j^+ - \sum_j \lambda_j^- \beta_j^-\right]
 &= -\nabla L(\beta)_j + \lambda - \lambda_j^- = 0 \\
\lambda_j^+ \beta_j^+ &= 0 \\
\lambda_j^- \beta_j^- &= 0
\end{aligned}
$$

(b) The KKT conditions state

$$
\nabla L(\beta)_j + \lambda - \lambda_j^+ = 0, \qquad
-\nabla L(\beta)_j + \lambda - \lambda_j^- = 0
$$

which implies

$$
\nabla L(\beta)_j = -(\lambda - \lambda_j^+), \qquad
\nabla L(\beta)_j = \lambda - \lambda_j^-
$$

By dual feasibility, $\lambda_j^+ \geq 0$ along with $\lambda_j^- \geq 0$ imply

$$
\nabla L(\beta)_j = -(\lambda - \lambda_j^+) \geq -\lambda, \qquad
\nabla L(\beta)_j = \lambda - \lambda_j^- \leq \lambda
$$

Therefore

$$
|\nabla L(\beta)_j| \leq \lambda.
$$

Suppose $\lambda = 0$. Then $\lambda = 0$ and $|\nabla L(\beta)_j| \leq \lambda$ imply $\nabla L(\beta)_j = 0$.

Suppose $\beta_j^+ > 0$ and $\lambda > 0$. Then $\beta_j^+ > 0$ and $\lambda > 0$ along with $\lambda_j^+ \beta_j^+ = 0$ imply $\lambda_j^+ = 0$. Hence

$$
\nabla L(\beta)_j + \lambda - \lambda_j^+ = 0 \;\Longrightarrow\; \nabla L(\beta)_j = -\lambda < 0
$$

Also

$$
-\nabla L(\beta)_j + \lambda - \lambda_j^- = 0 \;\Longrightarrow\; \lambda_j^- = 2\lambda > 0
$$

so that $\lambda_j^- \beta_j^- = 0 \Rightarrow \beta_j^- = 0$.

Suppose $\beta_j^- > 0$ and $\lambda > 0$. Then $\beta_j^- > 0$ and $\lambda > 0$ along with $\lambda_j^- \beta_j^- = 0$ imply $\lambda_j^- = 0$. Hence

$$
-\nabla L(\beta)_j + \lambda - \lambda_j^- = 0 \;\Longrightarrow\; \nabla L(\beta)_j = \lambda > 0
$$

Also

$$
\nabla L(\beta)_j + \lambda - \lambda_j^+ = 0 \;\Longrightarrow\; \lambda_j^+ = 2\lambda > 0
$$

so that $\lambda_j^+ \beta_j^+ = 0 \Rightarrow \beta_j^+ = 0$.

Hence for any active predictor, $\beta_j \neq 0$, so that $\nabla L(\beta)_j = -\lambda$ for $\beta_j > 0$ ($\beta_j^+ > 0$) and $\nabla L(\beta)_j = \lambda$ for $\beta_j < 0$ ($\beta_j^- > 0$). Or, $\nabla L(\beta)_j = -\operatorname{sgn}(\beta_j)\,\lambda$. Finally note that

$$
X_j^T (Y - X\beta) = -\nabla L(\beta)_j = \operatorname{sgn}(\beta_j)\,\lambda,
$$

which relates the correlation of the $j$th variable with the current residuals to $\lambda$.

(c) Let $A(\lambda) = \{j : \hat{\beta}_j(\lambda) \neq 0\}$ denote our active set. We assume $A(\lambda)$ does not change for $\lambda \in [\lambda_0, \lambda_1]$, and denote it $A$. Consider $\lambda \in [\lambda_0, \lambda_1]$. Note from part (b) we have, for active $\hat{\beta}_j(\lambda)$ (i.e. $\hat{\beta}_j(\lambda)$ such that $j \in A$),

$$
X_j^T\left(Y - X\hat{\beta}(\lambda)\right) = \operatorname{sgn}(\hat{\beta}_j(\lambda))\,\lambda
$$

or more compactly

$$
X_A^T\left(Y - X\hat{\beta}(\lambda)\right) = \operatorname{sgn}(\hat{\beta}_A(\lambda))\,\lambda
$$

where $X_A$, $\hat{\beta}_A(\lambda)$ denote the submatrix/subvector with components corresponding to the active set $A$. Let $s = \operatorname{sgn}(\hat{\beta}_A(\lambda))$. Note that $\hat{\beta}_j(\lambda) = 0$ for $j \notin A$, so we may rewrite the above as

$$
X_A^T\left(Y - X_A \hat{\beta}_A(\lambda)\right) = s\lambda
$$

Hence, for $X_A^T X_A \succ 0$, we have

$$
\hat{\beta}_A(\lambda) = (X_A^T X_A)^{-1}\left(X_A^T Y - s\lambda\right)
$$

and therefore

$$
\hat{\beta}_A(\lambda) - \hat{\beta}_A(\lambda_0) = -(X_A^T X_A)^{-1} s\,(\lambda - \lambda_0)
$$

which delivers our result for the active set. For the non-active set we are done, since $\hat{\beta}_{A^C}(\lambda) = 0$ for all $\lambda \in [\lambda_0, \lambda_1]$.
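A numerical illustration of the part (b) identity X_j^T(Y - X beta-hat) = sgn(beta-hat_j) lambda on the active set, using a minimal hand-rolled coordinate descent for the objective L(beta) = (1/2)||Y - X beta||^2 + lambda ||beta||_1. The simulated data and the value of lambda are arbitrary; this is a sketch, not a production solver.

```r
# Sketch only: tiny coordinate-descent lasso for 0.5*||y - X b||^2 + lambda*||b||_1,
# used to check the KKT identity X_j^T (y - X b) = sgn(b_j) * lambda on the active set.
set.seed(2)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(2, -1.5, rep(0, p - 2)) + rnorm(n))

soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)   # soft-thresholding

lasso_cd <- function(X, y, lambda, iters = 500) {
  b  <- rep(0, ncol(X))
  xs <- colSums(X^2)
  for (it in 1:iters) {
    for (j in seq_along(b)) {
      r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]       # partial residual
      b[j] <- soft(sum(X[, j] * r_j), lambda) / xs[j]
    }
  }
  b
}

lambda <- 20
b_hat  <- lasso_cd(X, y, lambda)
act    <- which(b_hat != 0)
cbind(grad = drop(t(X[, act]) %*% (y - X %*% b_hat)),   # X_j^T (y - X b-hat)
      kkt  = sign(b_hat[act]) * lambda)                 # the two columns should nearly match
```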

4. For this problem, we use the following definition of the Lasso:

$$
\hat{\beta}_{Lasso}(\lambda)
= \arg\min_{\beta} \underbrace{\left\{\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1\right\}}_{L(\beta, \lambda)}
$$

We assume that $X$ is such that the above problem is strictly convex with a unique solution. (That is, we assume the columns of $X$ are in general position; see [Tib13, Section 2.2] for details.) We will need the subdifferential of $L$ with respect to $\beta$:

$$
\partial_\beta L(\beta, \lambda)
= \partial_\beta \left(\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1\right)
= -\frac{2}{2n} X^T (y - X\beta) + \lambda\, s(\beta)
= \frac{X^T X \beta - X^T y}{n} + \lambda\, s(\beta)
$$

where $s(\cdot)$ is the subgradient of the $\ell_1$-norm applied elementwise:

$$
s(x) = \begin{cases}
\{1\} & x > 0 \\
[-1, 1] & x = 0 \\
\{-1\} & x < 0
\end{cases}
$$

(Here, and throughout, we interpret the sum of a scalar and a set as a Minkowski-style sum: $b + A = \{b + a : a \in A\}$.) The subdifferential with respect to a specific element $\beta_i$ is given by the $i$-th element of the above:

$$
\partial_{\beta_i} L(\beta, \lambda) = \frac{(X^T X \beta - X^T y)_i}{n} + \lambda\, s(\beta_i).
$$

Suppose that $\lambda$ is sufficiently large that $\hat{\beta}_{Lasso}(\lambda) = 0$. At the solution, the KKT conditions require that zero be in the subdifferential of $L$ with respect to each $\beta_i$:

$$
0 \in \partial_{\beta_i} L(\beta = 0, \lambda) = -\frac{(X^T y)_i}{n} + \lambda\, s(\beta_i).
$$

Since $|s(\beta_i)| \leq 1$, we must have

$$
\lambda \geq \left|\frac{(X^T y)_i}{n}\right|.
$$

Suppose this were not true, that is, $\lambda < |(X^T y)_i|/n$; then, even at $s(\beta_i) = \operatorname{sign}((X^T y)_i/n)$, we would have

$$
\left|-\frac{(X^T y)_i}{n} + \lambda\, s(\beta_i)\right| > 0
$$

for all elements of $s(\beta_i)$. This implies that $0$ is not in the subdifferential and thus that $\beta_i = 0$ could not be a solution. Since this argument holds for any $i$, we have

$$
\lambda \geq \max_i \left|\frac{(X^T y)_i}{n}\right| = \left\|\frac{X^T y}{n}\right\|_\infty.
$$

The smallest $\lambda$ which satisfies this for all $i$ gives

$$
\lambda_{\max} = \left\|\frac{X^T y}{n}\right\|_\infty.
$$
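A quick numerical check of lambda_max under the 1/(2n) scaling used in this problem, again with a minimal coordinate-descent sketch on simulated data: solving just above lambda_max should give the all-zero solution, and solving just below it should activate at least one coefficient.

```r
# Sketch only: check lambda_max = ||X^T y / n||_inf for (1/(2n))||y - X b||^2 + lambda*||b||_1.
set.seed(3)
n <- 200; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p) + rnorm(n))

soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

lasso_cd <- function(X, y, lambda, iters = 500) {
  b <- rep(0, ncol(X)); nn <- nrow(X); xs <- colSums(X^2) / nn
  for (it in 1:iters)
    for (j in seq_along(b)) {
      r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]
      b[j] <- soft(sum(X[, j] * r_j) / nn, lambda) / xs[j]
    }
  b
}

lambda_max <- max(abs(crossprod(X, y))) / n          # ||X^T y / n||_inf
sum(lasso_cd(X, y, 1.01 * lambda_max) != 0)          # expect 0 active coefficients
sum(lasso_cd(X, y, 0.99 * lambda_max) != 0)          # expect at least 1 active coefficient
```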

Under an alternate scaling of the Lasso,

$$
\hat{\beta}_{Lasso}(\lambda) = \arg\min_{\beta} \left\{\frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1\right\}
$$

a similar argument leads to $\lambda_{\max} = \|X^T y\|_\infty$.

5. (a) Since $\beta^*$ is feasible and $\hat{\beta} = \arg\min \|\beta\|_1$ is optimal:

$$
\|\hat{\beta}\|_1 \leq \|\beta^*\|_1 = \|\beta^*_S\|_1 + \|\beta^*_{S^C}\|_1 = \|\beta^*_S\|_1
$$

Writing $\hat{\beta} = \beta^* + \hat{v}$,

$$
\|\hat{\beta}\|_1 = \|\beta^* + \hat{v}\|_1
= \|\beta^*_S + \hat{v}_S\|_1 + \|\beta^*_{S^C} + \hat{v}_{S^C}\|_1
= \|\beta^*_S + \hat{v}_S\|_1 + \|\hat{v}_{S^C}\|_1
\geq \|\beta^*_S\|_1 - \|\hat{v}_S\|_1 + \|\hat{v}_{S^C}\|_1
$$

Hence we get:

$$
\|\beta^*_S\|_1 \geq \|\beta^*_S\|_1 - \|\hat{v}_S\|_1 + \|\hat{v}_{S^C}\|_1
\quad \Longrightarrow \quad
\|\hat{v}_{S^C}\|_1 \leq \|\hat{v}_S\|_1
$$

(b) Since $\hat{\beta}$ is feasible:

$$
\begin{aligned}
\|y - X\hat{\beta}\|_2^2 &\leq C \\
\|X\beta^* + w - X\hat{\beta}\|_2^2 &\leq C \\
\|w - X\hat{v}\|_2^2 &\leq C \\
\|w\|_2^2 - 2\hat{v}^T X^T w + \hat{v}^T X^T X \hat{v} &\leq C \\
\|X\hat{v}\|_2^2 &\leq 2\hat{v}^T X^T w + \left(C - \|w\|_2^2\right)
\end{aligned}
$$

(c) By part (a), $\hat{v} = \hat{\beta} - \beta^*$ lies in the cone $\mathcal{C}$, and we have:

$$
\|X\hat{v}\|_2^2 \leq 2\hat{v}^T X^T w + \left(C - \|w\|_2^2\right)
\leq 2\|X^T w\|_\infty \|\hat{v}\|_1 + \left(C - \|w\|_2^2\right)
$$

with

$$
\|\hat{v}\|_1 = \|\hat{v}_S\|_1 + \|\hat{v}_{S^C}\|_1 \leq 2\|\hat{v}_S\|_1 \leq 2\sqrt{k}\,\|\hat{v}\|_2
$$

By the $\gamma$-RE condition on $X$:

$$
\gamma \|\hat{v}\|_2^2 \leq \|X\hat{v}\|_2^2
\leq 2\|X^T w\|_\infty \|\hat{v}\|_1 + \left(C - \|w\|_2^2\right)
\leq 4\sqrt{k}\,\|X^T w\|_\infty \|\hat{v}\|_2 + \left(C - \|w\|_2^2\right)
$$

Therefore, any estimate $\hat{\beta}$ based on the constrained lasso with $C = \|y - X\beta^*\|_2^2$ satisfies the bound below: with this choice $C - \|w\|_2^2 = 0$, so dividing the last display by $\|\hat{v}\|_2$ gives

$$
\|\hat{v}\|_2 \leq \frac{4\sqrt{k}}{\gamma}\,\|X^T w\|_\infty,
\quad \text{i.e.} \quad
\|\hat{\beta} - \beta^*\|_2 \leq \frac{4\sqrt{k}}{\gamma}\,\|X^T w\|_\infty.
$$

References

[Tib13] Ryan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456-1490, 2013. http://projecteuclid.org/euclid.ejs/36948600; arXiv:1206.0313.