Homework: Solutions
Statistics 63, Fall 2017

Data Analysis

Note: All data analysis results are provided by Xuyan Lu.

Baseball Data:

(a) What are the most important features for predicting a player's salary?

(i) Fit and visualize regularization paths for the following methods:

- Lasso:

[Figure: Lasso regularization path (coefficients vs. log λ), with Hits, CRuns, Years, Walks, CAtBat, League, HmRun, CHits, CWalks and AtBat labeled]

## [1] "Hits"    "Walks"   "Years"   "CRuns"   "PutOuts"

The top 5 predictors selected are: Hits, Walks, Years, CRuns and PutOuts.
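The path fits above appear to be done in R. As a rough cross-check of the idea, a lasso path can be sketched in Python with scikit-learn; the data below are a synthetic stand-in (the feature names are borrowed from the baseball data purely for illustration, and the planted signal is an assumption of this sketch, not the homework's result):

```python
# Minimal sketch of a lasso regularization path on synthetic data,
# assuming scikit-learn (the original analysis appears to use R).
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 200, 8
# feature names borrowed from the baseball data for illustration only
feature_names = ["Hits", "Walks", "Years", "CRuns", "PutOuts",
                 "AtBat", "HmRun", "CWalks"]
X = rng.standard_normal((n, p))
# planted signal on the first five features only (an assumption of this sketch)
beta_true = np.array([1.0, 0.8, 0.6, 0.5, 0.4, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# coefs has shape (p, n_alphas); columns follow the decreasing alpha grid
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# rank features by the largest |coefficient| attained anywhere along the path
order = np.argsort(-np.abs(coefs).max(axis=1))
top5 = [feature_names[j] for j in order[:5]]
print(top5)
```

Plotting each row of `coefs` against `np.log(alphas)` reproduces the kind of path plot summarized by the figure placeholders in this section.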
- Elastic Net:

For α = 0.4:

[Figure: Elastic Net (α = 0.4) regularization path (coefficients vs. log λ)]

## [1] "Hits"   "Walks"  "CAtBat" "CHits"  "CRuns"

The top 5 predictors selected are: CRuns, Hits, CAtBat, Walks and CHits.
For α = 0.8:

[Figure: Elastic Net (α = 0.8) regularization path (coefficients vs. log λ)]

## [1] "Hits"  "Walks" "Years" "CHits" "CRuns"

The top 5 predictors selected are: CRuns, Hits, Years, Walks and CHits.
- Adaptive Lasso: I choose to use the LS solution and γ = 1 to get the weights: ŵ = 1/|β̂_OLS|.

[Figure: Adaptive Lasso regularization path (coefficients vs. log λ)]

## [1] "Hits"  "Walks" "Years" "CHits" "CRuns"

The top 5 predictors selected are: CRuns, Hits, Years, Walks and CHits.
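The adaptive lasso with γ = 1 can be computed by absorbing the weights into the design: scaling column j of X by |β̂_OLS,j| and running an ordinary lasso is equivalent to penalizing Σ_j |β_j| / |β̂_OLS,j|. A Python sketch of this rescaling trick (the data, dimensions, and penalty level here are illustrative assumptions, not the homework's setup):

```python
# Sketch of the adaptive lasso (gamma = 1) via column rescaling,
# assuming scikit-learn; data and alpha are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.standard_normal((n, p))
beta_true = np.array([1.5, -1.0, 0.0, 0.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# step 1: OLS fit gives the weights w_j = 1 / |beta_ols_j|
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w = 1.0 / np.abs(beta_ols)

# step 2: the weighted lasso equals an ordinary lasso on X with
# column j multiplied by |beta_ols_j| (i.e., divided by w_j)
X_scaled = X / w
fit = Lasso(alpha=0.05, fit_intercept=False).fit(X_scaled, y)
beta_adaptive = fit.coef_ / w          # undo the scaling

print(np.nonzero(beta_adaptive)[0])    # indices of selected predictors
```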
- SCAD:

[Figure: SCAD solution path (β̂ vs. λ)]

## [1] "Hits"      "Walks"     "CHits"     "DivisionW" "PutOuts"

The top 5 predictors selected are: Hits, Walks, CHits, DivisionW and PutOuts.
- MC+:

[Figure: MC+ solution path (β̂ vs. λ)]

## [1] "Hits"      "Walks"     "CHits"     "DivisionW" "PutOuts"

The top 5 predictors selected are: Hits, Walks, CHits, DivisionW and PutOuts.

(ii) What are the top predictors selected by each method? Are they different? If so, why?

I count the number of predictors selected by each model and get its top selected predictors; the results are given in the previous question. The top predictors selected by all models are *Hits* and *Walks*, but the remaining top selections differ because the penalty terms of the models differ. The convex models (Lasso, Elastic Net, Adaptive Lasso) select very similar top predictors, and the non-convex models (SCAD, MCP) likewise select very similar top predictors.

(b) Which linear method is best at predicting a player's salary?

(i) Compare the average prediction MSE on the test set for the following methods:

Method        LS         Ridge      Best Subset   Lasso
Averaged MSE  0.446366   0.389322   0.434976      0.3976438

Method        Elastic Net   Adaptive Lasso   SCAD       MCP
Averaged MSE  0.3965297     0.405767         0.403857   0.40577

The averaged prediction MSEs are shown in the tables above. Ridge, Lasso and Elastic Net give relatively good predictions; Least Squares performs the worst among all methods.

(ii) Visualize the results of your comparisons
[Figure: Boxplots of test MSEs for LS, Ridge, Best Subset, Lasso, Elastic Net, Adaptive Lasso, SCAD and MC+]

I made boxplots of the 10 × 10 test MSEs for each method and got the plot above.

(iii) Reflection.

From the boxplot above we can see that Ridge gives the best prediction error, and Lasso and Elastic Net also give good prediction errors. These methods do well because they are shrinkage methods that constrain β̂ to prevent it from overfitting the training set, so they often have lower MSE than Least Squares. Least Squares places no constraint on β̂, so it overfits easily, especially when there are many features. Best Subset is a discrete process, so it often exhibits high variance, which results in poor performance on the test set.

Not all the methods choose the same subset of variables. Least Squares and Ridge use all the variables, but the others use only some of them. The difference in subsets is due to the difference in penalty terms.
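The comparison in (b) can be sketched as repeated random train/test splits. Assuming scikit-learn and synthetic sparse data (so the numbers will not match the table above, and only three of the eight methods are shown), something like:

```python
# Sketch of an averaged test-MSE comparison over repeated random splits,
# assuming scikit-learn and synthetic data (the original uses R and the
# baseball data; these numbers will not match the table above).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n, p = 120, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.0                           # sparse truth favors shrinkage
y = X @ beta + rng.standard_normal(n)

models = {"LS": LinearRegression(), "Ridge": Ridge(alpha=10.0),
          "Lasso": Lasso(alpha=0.1)}
mse = {name: [] for name in models}
for rep in range(10):                    # 10 random train/test splits
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                          random_state=rep)
    for name, m in models.items():
        m.fit(Xtr, ytr)
        mse[name].append(mean_squared_error(yte, m.predict(Xte)))

avg = {name: float(np.mean(v)) for name, v in mse.items()}
print(avg)
```

Boxplots of the per-split MSE lists in `mse` (e.g. with `matplotlib.pyplot.boxplot`) give a plot of the kind described above.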
Math Problems:

2. We consider the problem $y = X\beta + \epsilon$ where $y \in \mathbb{R}^n$, $\beta \in \mathbb{R}^p$, $X \in \mathbb{R}^{n \times p}$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I_{n \times n})$ and $n \geq p$ with $X$ full-rank. Recall that, for a general linear estimator $X\hat\beta$ of $X\beta$:

\begin{align*}
\mathrm{MSE}(X\hat\beta)
&= E\big[(X\hat\beta - X\beta)^T (X\hat\beta - X\beta)\big] \\
&= E\big[(X\hat\beta - E[X\hat\beta] + E[X\hat\beta] - X\beta)^T (X\hat\beta - E[X\hat\beta] + E[X\hat\beta] - X\beta)\big] \\
&= E\big[(X\hat\beta - E[X\hat\beta])^T (X\hat\beta - E[X\hat\beta])\big]
 + 2\, E\big[X\hat\beta - E[X\hat\beta]\big]^T \big(E[X\hat\beta] - X\beta\big)
 + \big\| E[X\hat\beta] - X\beta \big\|_2^2 \\
&= \mathrm{Tr}\,\mathrm{Var}(X\hat\beta) + 0 + \|\mathrm{Bias}(X\hat\beta)\|_2^2 \\
&= \mathrm{Tr}\,\mathrm{Var}(X\hat\beta) + \|\mathrm{Bias}(X\hat\beta)\|_2^2,
\end{align*}

where the cross term vanishes because $E\big[X\hat\beta - E[X\hat\beta]\big] = 0$.

For $\hat\beta_{OLS}$, we know $\hat\beta_{OLS}$ is unbiased ($E[\hat\beta_{OLS}] = \beta$), so

\[
\mathrm{Bias}(X\hat\beta_{OLS}) = E[X\hat\beta_{OLS} - X\beta] = X\, E[\hat\beta_{OLS} - \beta] = X \cdot 0 = 0.
\]

Hence the MSE is a function of the variance only:

\begin{align*}
\mathrm{Var}(\hat\beta_{OLS})
&= \mathrm{Var}\big((X^T X)^{-1} X^T y\big)
 = (X^T X)^{-1} X^T\, \mathrm{Var}(y)\, \big((X^T X)^{-1} X^T\big)^T \\
&= (X^T X)^{-1} X^T (\sigma^2 I_n) X (X^T X)^{-1}
 = \sigma^2 (X^T X)^{-1} \\
\Longrightarrow \mathrm{Var}(X\hat\beta_{OLS})
&= X\, \mathrm{Var}(\hat\beta_{OLS})\, X^T = \sigma^2 X (X^T X)^{-1} X^T \\
\Longrightarrow \mathrm{MSE}(X\hat\beta_{OLS})
&= \mathrm{Tr}\,\mathrm{Var}(X\hat\beta_{OLS}) + \|\mathrm{Bias}(X\hat\beta_{OLS})\|_2^2 \\
&= \mathrm{Tr}\big(\sigma^2 X (X^T X)^{-1} X^T\big) + 0
 = \mathrm{Tr}\big(\sigma^2 (X^T X)^{-1} X^T X\big)
 = \mathrm{Tr}\big(\sigma^2 I_{p \times p}\big) = \sigma^2 p.
\end{align*}

If the OLS solution is unique, $X^T X$ is a real, symmetric, positive-definite matrix.
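As a quick numerical sanity check (an addition, not part of the original solution): $\mathrm{Tr}(\sigma^2 X (X^T X)^{-1} X^T) = \sigma^2 p$ because the hat matrix has trace $p$. A minimal Python sketch, with arbitrary assumed dimensions:

```python
# Numerical check that Tr Var(X beta_ols) = sigma^2 * p.
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 50, 7, 2.5
X = rng.standard_normal((n, p))          # full rank with probability 1

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix, trace equals p
mse_ols = sigma2 * np.trace(H)           # Tr Var(X beta_ols)
print(mse_ols, sigma2 * p)
```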
For $\hat\beta_{Ridge}(\lambda)$, we have to analyze both terms. We first note:

\begin{align*}
E\big[\hat\beta_{Ridge}(\lambda)\big]
&= E\big[(X^T X + \lambda I)^{-1} X^T y\big]
 = (X^T X + \lambda I)^{-1} X^T E[y]
 = (X^T X + \lambda I)^{-1} X^T X \beta \\
&= (X^T X + \lambda I)^{-1} \big(X^T X + \lambda I - \lambda I\big) \beta \\
&= (X^T X + \lambda I)^{-1} (X^T X + \lambda I) \beta - (X^T X + \lambda I)^{-1} \lambda I \beta \\
&= \beta - \lambda (X^T X + \lambda I)^{-1} \beta,
\end{align*}

so

\[
E\big[X \hat\beta_{Ridge}(\lambda)\big] = X\beta - \lambda X (X^T X + \lambda I)^{-1} \beta,
\]

giving

\begin{align*}
\|\mathrm{Bias}(X \hat\beta_{Ridge}(\lambda))\|_2^2
&= \big(E[X \hat\beta_{Ridge}(\lambda)] - X\beta\big)^T \big(E[X \hat\beta_{Ridge}(\lambda)] - X\beta\big) \\
&= \big(\lambda X (X^T X + \lambda I)^{-1} \beta\big)^T \big(\lambda X (X^T X + \lambda I)^{-1} \beta\big) \\
&= \lambda^2 \beta^T (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} \beta
\end{align*}

and

\begin{align*}
\mathrm{Var}\big(\hat\beta_{Ridge}(\lambda)\big)
&= \mathrm{Var}\big((X^T X + \lambda I)^{-1} X^T y\big)
 = (X^T X + \lambda I)^{-1} X^T\, \mathrm{Var}(y)\, \big((X^T X + \lambda I)^{-1} X^T\big)^T \\
&= (X^T X + \lambda I)^{-1} X^T (\sigma^2 I) X (X^T X + \lambda I)^{-1}
 = \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} \\
\Longrightarrow \mathrm{Var}\big(X \hat\beta_{Ridge}(\lambda)\big)
&= X\, \mathrm{Var}(\hat\beta_{Ridge}(\lambda))\, X^T
 = \sigma^2 X (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} X^T,
\end{align*}

giving

\begin{align*}
\mathrm{MSE}(X \hat\beta_{Ridge}(\lambda))
&= \mathrm{Tr}\,\mathrm{Var}\big(X \hat\beta_{Ridge}(\lambda)\big) + \|\mathrm{Bias}(X \hat\beta_{Ridge}(\lambda))\|_2^2 \\
&= \mathrm{Tr}\big(\sigma^2 X (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} X^T\big)
 + \lambda^2 \beta^T (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} \beta \\
&= \sigma^2\, \mathrm{Tr}\Big(\big[(X^T X + \lambda I)^{-1} X^T X\big]^2\Big)
 + \lambda^2 \beta^T (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} \beta.
\end{align*}

As expected, if $\lambda = 0$, we recover $\mathrm{MSE}(X\hat\beta_{OLS})$ from above.

To simplify $\mathrm{MSE}(X \hat\beta_{Ridge}(\lambda))$ further, we will take the eigendecomposition $X^T X = P^T D P$, where

- $P \in \mathbb{R}^{p \times p}$ is an orthogonal matrix ($P^T P = P P^T = I_{p \times p}$);
- $D \in \mathbb{R}^{p \times p}$ is a diagonal matrix with all strictly positive elements.

Note that, with this decomposition, $X^T X + \lambda I$ can be simplified:

\[
X^T X + \lambda I = P^T D P + \lambda P^T P = P^T (D + \lambda I) P
\quad\Longrightarrow\quad
(X^T X + \lambda I)^{-1} = P^T (D + \lambda I)^{-1} P,
\]

where $(D + \lambda I)^{-1}$ is a diagonal matrix with elements $\frac{1}{\lambda_i + \lambda}$, where $\{\lambda_i\}$ are the eigenvalues of $X^T X$.
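The matrix form of the ridge MSE just derived can be checked numerically (an addition, not part of the original solution): at $\lambda = 0$ it reduces to $\sigma^2 p$, and for a small positive $\lambda$ it dips below, as the existence result below asserts. Dimensions and the test value of $\lambda$ are arbitrary assumptions:

```python
# Numerical check of MSE(X beta_ridge(lambda)) in matrix form:
#   sigma^2 Tr([(G + lam I)^{-1} G]^2)
#     + lam^2 beta^T (G + lam I)^{-1} G (G + lam I)^{-1} beta,  G = X^T X.
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 60, 5, 1.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
G = X.T @ X

def mse_ridge(lam):
    A = np.linalg.inv(G + lam * np.eye(p))
    var = sigma2 * np.trace(A @ G @ A @ G)        # trace-of-variance term
    bias2 = lam**2 * (beta @ A @ G @ A @ beta)    # squared-bias term
    return float(var + bias2)

print(mse_ridge(0.0))    # equals sigma^2 * p = 5.0
print(mse_ridge(0.02))   # a small ridge penalty lowers the MSE
```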
We use this to simplify the expression above:

\begin{align*}
\mathrm{MSE}(X \hat\beta_{Ridge}(\lambda))
&= \sigma^2\, \mathrm{Tr}\Big(\big[P^T (D + \lambda I)^{-1} P\, P^T D P\big]^2\Big)
 + \lambda^2 \beta^T P^T (D + \lambda I)^{-1} P\, P^T D P\, P^T (D + \lambda I)^{-1} P \beta \\
&= \sigma^2\, \mathrm{Tr}\Big(\big[P^T (D + \lambda I)^{-1} D P\big]^2\Big)
 + \lambda^2 \beta^T P^T (D + \lambda I)^{-1} D (D + \lambda I)^{-1} P \beta \\
&= \sigma^2\, \mathrm{Tr}\big(P^T (D + \lambda I)^{-1} D (D + \lambda I)^{-1} D P\big)
 + \lambda^2 \beta^T P^T (D + \lambda I)^{-1} D (D + \lambda I)^{-1} P \beta \\
&= \sigma^2\, \mathrm{Tr}\big((D + \lambda I)^{-1} D (D + \lambda I)^{-1} D\, P P^T\big)
 + \lambda^2 \beta^T P^T (D + \lambda I)^{-1} D (D + \lambda I)^{-1} P \beta \\
&= \sigma^2 \sum_{i=1}^p \frac{\lambda_i^2}{(\lambda + \lambda_i)^2}
 + \lambda^2 \sum_{i=1}^p \frac{\lambda_i (P\beta)_i^2}{(\lambda + \lambda_i)^2} \\
&= \sum_{i=1}^p \frac{\sigma^2 \lambda_i^2 + \lambda^2 \lambda_i (P\beta)_i^2}{(\lambda + \lambda_i)^2}
\end{align*}

for some fixed but unknown $(P\beta)_i$. As before, note that, if $\lambda = 0$, we simply get:

\[
\mathrm{MSE}(X \hat\beta_{Ridge}(0))
= \sum_{i=1}^p \frac{\sigma^2 \lambda_i^2 + 0 \cdot \lambda_i (P\beta)_i^2}{(0 + \lambda_i)^2}
= \sum_{i=1}^p \frac{\sigma^2 \lambda_i^2}{\lambda_i^2}
= \sigma^2 p,
\]

which is $\mathrm{MSE}(X\hat\beta_{OLS})$.

Since $\mathrm{MSE}(X \hat\beta_{Ridge}(\lambda)) = \mathrm{MSE}(X\hat\beta_{OLS})$ at $\lambda = 0$, we know that a $\lambda$ with lower MSE must exist if the derivative of $\mathrm{MSE}(X \hat\beta_{Ridge}(\lambda))$ with respect to $\lambda$ is negative at $\lambda = 0$. We differentiate with respect to $\lambda$:

\begin{align*}
\frac{d\, \mathrm{MSE}(X \hat\beta_{Ridge}(\lambda))}{d\lambda}
&= \sum_{i=1}^p \frac{(\lambda + \lambda_i)^2 \cdot 2\lambda \lambda_i (P\beta)_i^2
 - \big(\sigma^2 \lambda_i^2 + \lambda^2 \lambda_i (P\beta)_i^2\big) \cdot 2(\lambda + \lambda_i)}{(\lambda + \lambda_i)^4} \\
&= \sum_{i=1}^p \frac{(\lambda + \lambda_i) \cdot 2\lambda \lambda_i (P\beta)_i^2
 - 2\sigma^2 \lambda_i^2 - 2\lambda^2 \lambda_i (P\beta)_i^2}{(\lambda + \lambda_i)^3} \\
&= \sum_{i=1}^p \frac{2\lambda^2 \lambda_i (P\beta)_i^2 + 2\lambda \lambda_i^2 (P\beta)_i^2
 - 2\sigma^2 \lambda_i^2 - 2\lambda^2 \lambda_i (P\beta)_i^2}{(\lambda + \lambda_i)^3} \\
&= \sum_{i=1}^p \frac{2\lambda \lambda_i^2 (P\beta)_i^2 - 2\sigma^2 \lambda_i^2}{(\lambda + \lambda_i)^3}.
\end{align*}

Note that the denominator is strictly positive:

- $X^T X$ has strictly positive eigenvalues $\{\lambda_i\}$ by assumption; and
- $\lambda > 0$ for all penalized regression problems.

A sufficient condition for a sum to be negative is for each of its terms to be negative, so a sufficient
condition for the derivative to be negative is

\[
2\lambda \lambda_i^2 (P\beta)_i^2 - 2\sigma^2 \lambda_i^2 < 0
\;\Longleftrightarrow\;
2\lambda \lambda_i^2 (P\beta)_i^2 < 2\sigma^2 \lambda_i^2
\;\Longleftrightarrow\;
\lambda < \frac{\sigma^2}{(P\beta)_i^2}
\]

for all $i$. Equivalently, the derivative is negative if:

\[
\lambda \in \Big(0,\; \min_i \Big\{ \frac{\sigma^2}{(P\beta)_i^2} \Big\}\Big).
\]

Hence, since $\mathrm{MSE}(X \hat\beta_{Ridge}(\lambda))$ is continuous in $\lambda$, we have

\[
\mathrm{MSE}(X \hat\beta_{Ridge}(\lambda)) < \mathrm{MSE}(X \hat\beta_{OLS})
\quad\text{for}\quad
\lambda \in \Big(0,\; \min_i \Big\{ \frac{\sigma^2}{(P\beta)_i^2} \Big\}\Big),
\]

which proves the MSE existence theorem.

3. (a) In standard form:

\[
\min_{\beta^+, \beta^-}\; L(\beta) + \lambda \sum_j \big(\beta_j^+ + \beta_j^-\big)
\quad \text{subject to} \quad \beta_j^+ \geq 0,\; \beta_j^- \geq 0.
\]

The Lagrange form is

\[
\min_{\beta^+, \beta^-}\; L(\beta) + \lambda \sum_j \big(\beta_j^+ + \beta_j^-\big)
 - \sum_j \lambda_j^+ \beta_j^+ - \sum_j \lambda_j^- \beta_j^-.
\]

The KKT conditions are stationarity,

\[
\nabla_{\beta_j^+}: \;\; \nabla L(\beta)_j + \lambda - \lambda_j^+ = 0,
\qquad
\nabla_{\beta_j^-}: \;\; -\nabla L(\beta)_j + \lambda - \lambda_j^- = 0,
\]

and complementary slackness,

\[
\lambda_j^+ \beta_j^+ = 0, \qquad \lambda_j^- \beta_j^- = 0.
\]

(b) The KKT conditions state

\[
\nabla L(\beta)_j + \lambda - \lambda_j^+ = 0
\;\Longrightarrow\;
\nabla L(\beta)_j = \lambda_j^+ - \lambda,
\qquad
-\nabla L(\beta)_j + \lambda - \lambda_j^- = 0
\;\Longrightarrow\;
\nabla L(\beta)_j = \lambda - \lambda_j^-.
\]

By dual feasibility, $\lambda_j^+ \geq 0$ along with $\lambda_j^- \geq 0$ imply

\[
\nabla L(\beta)_j = \lambda_j^+ - \lambda \geq -\lambda,
\qquad
\nabla L(\beta)_j = \lambda - \lambda_j^- \leq \lambda.
\]
Therefore

\[
|\nabla L(\beta)_j| \leq \lambda.
\]

Suppose $\lambda = 0$. Then $\lambda = 0$ and $|\nabla L(\beta)_j| \leq \lambda$ imply $\nabla L(\beta)_j = 0$.

Suppose $\beta_j^+ > 0$ and $\lambda > 0$. Then $\beta_j^+ > 0$ and $\lambda > 0$ along with $\lambda_j^+ \beta_j^+ = 0$ imply $\lambda_j^+ = 0$. Hence

\[
\nabla L(\beta)_j + \lambda - \lambda_j^+ = 0 \;\Longrightarrow\; \nabla L(\beta)_j = -\lambda < 0.
\]

Also

\[
-\nabla L(\beta)_j + \lambda - \lambda_j^- = 0 \;\Longrightarrow\; \lambda_j^- = 2\lambda > 0,
\]

so that $\lambda_j^- \beta_j^- = 0 \Rightarrow \beta_j^- = 0$.

Suppose $\beta_j^- > 0$ and $\lambda > 0$. Then $\beta_j^- > 0$ and $\lambda > 0$ along with $\lambda_j^- \beta_j^- = 0$ imply $\lambda_j^- = 0$. Hence

\[
-\nabla L(\beta)_j + \lambda - \lambda_j^- = 0 \;\Longrightarrow\; \nabla L(\beta)_j = \lambda > 0.
\]

Also

\[
\nabla L(\beta)_j + \lambda - \lambda_j^+ = 0 \;\Longrightarrow\; \lambda_j^+ = 2\lambda > 0,
\]

so that $\lambda_j^+ \beta_j^+ = 0 \Rightarrow \beta_j^+ = 0$.

Hence, for any active predictor $\beta_j \neq 0$, we have $\nabla L(\beta)_j = -\lambda$ for $\beta_j > 0$ and $\nabla L(\beta)_j = \lambda$ for $\beta_j < 0$; that is, $\nabla L(\beta)_j = -\mathrm{sgn}(\beta_j)\,\lambda$. Finally, note that $X_j^T (Y - X\beta) = -\nabla L(\beta)_j = \mathrm{sgn}(\beta_j)\,\lambda$, which relates the correlation of the $j$th variable with the current residuals to $\lambda$.

(c) Let $A(\lambda) = \{ j : \hat\beta_j(\lambda) \neq 0 \}$ denote our active set. We assume $A(\lambda)$ does not change for $\lambda \in [\lambda_0, \lambda_1]$, and denote it $A$. Consider $\lambda \in [\lambda_0, \lambda_1]$. Note from part (b) we have, for active $\hat\beta_j(\lambda)$ (i.e., $\hat\beta_j(\lambda)$ such that $j \in A$),

\[
X_j^T \big(Y - X\hat\beta(\lambda)\big) = \mathrm{sgn}(\hat\beta_j(\lambda))\,\lambda,
\]

or more compactly

\[
X_A^T \big(Y - X\hat\beta(\lambda)\big) = \mathrm{sgn}(\hat\beta_A(\lambda))\,\lambda,
\]

where $X_A$, $\hat\beta_A(\lambda)$ denote the submatrix/subvector with components corresponding to the active set $A$. Let $s = \mathrm{sgn}(\hat\beta_A(\lambda))$. Note that $\hat\beta_j(\lambda) = 0$ for $j \notin A$, so that we may rewrite the above as

\[
X_A^T \big(Y - X_A \hat\beta_A(\lambda)\big) = s\lambda.
\]

Hence, for $X_A^T X_A \succ 0$, we have

\[
\hat\beta_A(\lambda) = (X_A^T X_A)^{-1} \big(X_A^T Y - s\lambda\big)
\]

and therefore

\[
\hat\beta_A(\lambda) - \hat\beta_A(\lambda_0) = -(X_A^T X_A)^{-1} s\, (\lambda - \lambda_0),
\]

which is linear in $\lambda$ and delivers our result for the active set. For the non-active set we are done, since $\hat\beta_{A^C}(\lambda) = 0$ for all $\lambda \in [\lambda_0, \lambda_1]$.
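The correlation condition from part (b) can be checked numerically (an addition to the original solution) with scikit-learn's `Lasso`, which uses the $\frac{1}{2n}$ loss scaling, so the condition reads $X_j^T (y - X\hat\beta)/n = \mathrm{sgn}(\hat\beta_j)\,\alpha$ for active $j$; the data here are synthetic assumptions:

```python
# Numerical check of the KKT correlation condition for the lasso,
# assuming scikit-learn's 1/(2n) loss scaling.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 100, 6
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0]) + 0.1 * rng.standard_normal(n)

lam = 0.1
fit = Lasso(alpha=lam, fit_intercept=False, tol=1e-12, max_iter=100000).fit(X, y)
corr = X.T @ (y - X @ fit.coef_) / n     # per-feature residual correlation

active = np.nonzero(fit.coef_)[0]
print(corr[active])                      # equals sgn(beta_j) * lam for active j
```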
4. For this problem, we use the following definition of the Lasso:

\[
\hat\beta_{Lasso}(\lambda) = \arg\min_\beta \underbrace{\Big\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \Big\}}_{L(\beta, \lambda)}
\]

We assume that $X$ is such that the above problem is strictly convex with a unique solution.¹ We will need the subdifferential of $L$ with respect to $\beta$:

\begin{align*}
\partial_\beta L(\beta, \lambda)
&= \partial_\beta \Big( \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \Big) \\
&= -\frac{1}{n} X^T (y - X\beta) + \lambda\, s(\beta) \\
&= \frac{X^T X \beta - X^T y}{n} + \lambda\, s(\beta),
\end{align*}

where $s$ is the subgradient of the $\ell_1$-norm applied elementwise:²

\[
s(x) = \begin{cases} \{1\} & x > 0 \\ [-1, 1] & x = 0 \\ \{-1\} & x < 0. \end{cases}
\]

The subdifferential with respect to a specific element $\beta_i$ is given by the $i$-th element of the above:

\[
\partial_{\beta_i} L(\beta, \lambda) = \frac{(X^T X \beta - X^T y)_i}{n} + \lambda\, s(\beta_i).
\]

Suppose that $\lambda$ is sufficiently large that $\hat\beta_{Lasso}(\lambda) = 0$. At the solution, the KKT conditions require that zero be in the subdifferential of $L$ with respect to each $\beta_i$:

\[
0 \in \partial_{\beta_i} L(\beta = 0, \lambda) = -\frac{(X^T y)_i}{n} + \lambda\, s(\beta_i).
\]

Since $|s(\beta_i)| \leq 1$, we must have:

\[
\lambda \geq |(X^T y)_i| / n.
\]

Suppose this were not true, that is, $\lambda < |(X^T y)_i| / n$; then, even at $s(\beta_i) = \mathrm{sign}\big((X^T y)_i\big)$, we would have

\[
\Big| -\frac{(X^T y)_i}{n} + \lambda\, s(\beta_i) \Big| > 0
\]

for all elements of $s(\beta_i)$. This implies that $0$ is not in the subdifferential and thus that $\beta_i = 0$ could not be a solution. Since this argument holds for any $i$, we have:

\[
\lambda \geq \max_i |(X^T y)_i| / n = \|X^T y\|_\infty / n.
\]

The smallest $\lambda$ which satisfies this for all $i$ gives

\[
\lambda_{max} = \|X^T y\|_\infty / n.
\]

¹ That is, we assume the columns of $X$ are in general position. See [Tib13, Section 2.2] for details.

² Here, and throughout, we interpret the sum of a scalar and a set to be a Minkowski-style sum: that is, $b + A = \{b + a : a \in A\}$.
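A quick numerical check of $\lambda_{max} = \|X^T y\|_\infty / n$ (an addition to the original solution, on assumed synthetic data): at this penalty the lasso solution is exactly zero, and just below it at least one coefficient becomes active.

```python
# Check that lambda_max = ||X^T y||_inf / n is the smallest penalty
# giving the all-zero lasso solution (scikit-learn's 1/(2n) scaling).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 80, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)

lam_max = np.max(np.abs(X.T @ y)) / n

at = Lasso(alpha=lam_max, fit_intercept=False).fit(X, y).coef_
below = Lasso(alpha=0.9 * lam_max, fit_intercept=False).fit(X, y).coef_
print(at)        # all zeros at lambda_max
print(below)     # nonzero just below lambda_max
```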
Under an alternate scaling of the Lasso,

\[
\hat\beta_{Lasso}(\lambda) = \arg\min_\beta \Big\{ \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \Big\},
\]

a similar argument leads to $\lambda_{max} = \|X^T y\|_\infty$.

5. (a) Since $\beta$ is feasible and $\hat\beta$ minimizes the $\ell_1$-norm over the feasible set:

\[
\|\hat\beta\|_1 \leq \|\beta\|_1.
\]

Hence, writing $\hat\beta = \beta + \hat v$ and using that $\beta$ is supported on $S$ (so $\|\beta\|_1 = \|\beta_S\|_1$), we get:

\begin{align*}
\|\beta_S\|_1 = \|\beta\|_1 \geq \|\hat\beta\|_1
&= \|\beta + \hat v\|_1
 = \|\beta_S + \hat v_S\|_1 + \|\hat v_{S^C}\|_1 \\
&\geq \|\beta_S\|_1 - \|\hat v_S\|_1 + \|\hat v_{S^C}\|_1,
\end{align*}

so

\[
\|\hat v_{S^C}\|_1 \leq \|\hat v_S\|_1.
\]

(b) Since $\hat\beta$ is feasible:

\begin{align*}
\|y - X\hat\beta\|_2^2 \leq C
&\;\Longrightarrow\; \|X\beta + w - X\hat\beta\|_2^2 \leq C \\
&\;\Longrightarrow\; \|w - X\hat v\|_2^2 \leq C \\
&\;\Longrightarrow\; \|w\|_2^2 - 2\hat v^T X^T w + \hat v^T X^T X \hat v \leq C \\
&\;\Longrightarrow\; \|X\hat v\|_2^2 \leq 2\hat v^T X^T w + \big(C - \|w\|_2^2\big) \\
&\;\Longrightarrow\; \|X\hat v\|_2^2 \leq 2\|X^T w\|_\infty \|\hat v\|_1 + \big(C - \|w\|_2^2\big),
\end{align*}

where the last step is Hölder's inequality.

(c) From part (a), $\hat v$ satisfies $\|\hat v_{S^C}\|_1 \leq \|\hat v_S\|_1$, and we have:

\[
\|\hat v\|_1 = \|\hat v_S\|_1 + \|\hat v_{S^C}\|_1 \leq 2\|\hat v_S\|_1 \leq 2\sqrt{k}\, \|\hat v\|_2.
\]

By the $\gamma$-RE condition on $X$:

\[
\gamma \|\hat v\|_2^2 \leq \|X\hat v\|_2^2
\leq 2\|X^T w\|_\infty \|\hat v\|_1 + \big(C - \|w\|_2^2\big)
\leq 4\sqrt{k}\, \|X^T w\|_\infty \|\hat v\|_2 + \big(C - \|w\|_2^2\big).
\]
Therefore, with $C = \|y - X\beta\|_2^2 = \|w\|_2^2$, the last term vanishes and

\[
\gamma \|\hat v\|_2^2 \leq 4\sqrt{k}\, \|X^T w\|_\infty \|\hat v\|_2,
\]

so any estimate $\hat\beta$ based on the constrained lasso satisfies the bound

\[
\|\hat v\|_2 \leq \frac{4\sqrt{k}}{\gamma} \|X^T w\|_\infty,
\quad\text{i.e.,}\quad
\|\hat\beta - \beta\|_2 \leq \frac{4\sqrt{k}}{\gamma} \|X^T w\|_\infty.
\]

References

[Tib13] Ryan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013. arXiv:1206.0313.