Nondifferentiable Convex Functions DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda
Applications Subgradients Optimization methods
Regression

The aim is to learn a function $h$ that relates a response or dependent variable $y$ to several observed variables $x_1, x_2, \ldots, x_p$, known as covariates, features, or independent variables.

The response is assumed to be of the form
$$y = h(x) + z,$$
where $x \in \mathbb{R}^p$ contains the features and $z$ is noise.
Linear regression

The regression function $h$ is assumed to be linear:
$$y^{(i)} = (x^{(i)})^T \beta + z^{(i)}, \quad 1 \le i \le n$$

Our aim is to estimate $\beta \in \mathbb{R}^p$ from the data.
Linear regression

In matrix form,
$$\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_p \\ x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_p \\ \vdots & \vdots & & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_p \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}$$

Equivalently, $y = X\beta + z$.
Sparse linear regression

Only a subset of the features is relevant: a model-selection problem.

Two objectives:
- Good fit to the data: $\|X\beta - y\|_2^2$ should be as small as possible
- Using a small number of features: $\beta$ should be as sparse as possible
Sparse linear regression

If only features $j$ and $l$ are relevant,
$$\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_j & x^{(1)}_l \\ x^{(2)}_j & x^{(2)}_l \\ \vdots & \vdots \\ x^{(n)}_j & x^{(n)}_l \end{bmatrix} \begin{bmatrix} \beta_j \\ \beta_l \end{bmatrix} + \begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 & \cdots & x^{(1)}_j & \cdots & x^{(1)}_l & \cdots & x^{(1)}_p \\ \vdots & & \vdots & & \vdots & & \vdots \\ x^{(n)}_1 & \cdots & x^{(n)}_j & \cdots & x^{(n)}_l & \cdots & x^{(n)}_p \end{bmatrix} \begin{bmatrix} 0 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_l \\ \vdots \\ 0 \end{bmatrix} + z = X\beta + z$$
Sparse linear regression with 2 features

$$y := \alpha x_1 + z, \qquad X := \begin{bmatrix} x_1 & x_2 \end{bmatrix}, \qquad \|x_1\|_2 = 1, \quad \|x_2\|_2 = 1, \quad \langle x_1, x_2 \rangle = \rho$$
Least squares: not sparse

$$\beta_{\text{LS}} = (X^T X)^{-1} X^T y = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} \begin{bmatrix} x_1^T y \\ x_2^T y \end{bmatrix} = \frac{1}{1 - \rho^2} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \begin{bmatrix} \alpha + x_1^T z \\ \alpha\rho + x_2^T z \end{bmatrix} = \begin{bmatrix} \alpha + \dfrac{\langle x_1 - \rho x_2, z \rangle}{1 - \rho^2} \\[4pt] \dfrac{\langle x_2 - \rho x_1, z \rangle}{1 - \rho^2} \end{bmatrix} \neq \begin{bmatrix} \alpha \\ 0 \end{bmatrix}$$

The second coefficient is nonzero in general: least squares is not sparse.
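A minimal numpy check of this computation (a sketch; the sample size, noise level, and seed are our arbitrary choices, not from the slides): the least-squares coefficient of the irrelevant feature $x_2$ comes out nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, rho = 100, 1.0, 0.5

# Build unit-norm features with inner product exactly rho
x1 = rng.standard_normal(n)
x1 /= np.linalg.norm(x1)
u = rng.standard_normal(n)
u -= (u @ x1) * x1
u /= np.linalg.norm(u)
x2 = rho * x1 + np.sqrt(1 - rho**2) * u

z = 0.1 * rng.standard_normal(n)
y = alpha * x1 + z                      # data generated by the first feature only
X = np.column_stack([x1, x2])

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_ls)                          # second entry is small but nonzero: not sparse
```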
The lasso

Idea: use $\ell_1$-norm regularization to promote sparse coefficients:
$$\beta_{\text{lasso}} := \arg\min_\beta \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
Nonnegative weighted sums

The weighted sum of $m$ convex functions $f_1, \ldots, f_m$,
$$f := \sum_{i=1}^m \alpha_i f_i,$$
is convex if $\alpha_1, \ldots, \alpha_m \in \mathbb{R}$ are nonnegative.

Proof: For any $x, y$ and $\theta \in (0, 1)$,
$$f(\theta x + (1 - \theta) y) = \sum_{i=1}^m \alpha_i f_i(\theta x + (1 - \theta) y) \le \sum_{i=1}^m \alpha_i \left( \theta f_i(x) + (1 - \theta) f_i(y) \right) = \theta f(x) + (1 - \theta) f(y)$$
Regularized least-squares

Regularized least-squares cost functions are convex:
$$\|Ax - y\|_2^2 + \|x\|$$
for any norm $\|\cdot\|$, since this is a nonnegative weighted sum of convex functions.
It works

[Figure: lasso coefficients as a function of the regularization parameter ($10^{-3}$ to $10^0$, log scale).]
Ridge regression doesn't work

[Figure: ridge-regression coefficients as a function of the regularization parameter ($10^{-3}$ to $10^3$, log scale).]
Prostate cancer data set

8 features (age, weight, analysis results)
Response: prostate-specific antigen (PSA), associated with prostate cancer
Training set: 60 patients
Test set: 37 patients
Prostate cancer data set

[Figure: coefficient values, training loss, and test loss as a function of $\lambda$ ($10^{-2}$ to $10^0$).]
Principal component analysis

Given $n$ data vectors $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$:

1. Center the data: $c_i = x_i - \text{av}(x_1, x_2, \ldots, x_n)$, $1 \le i \le n$
2. Group the centered data as the columns of a matrix $C = \begin{bmatrix} c_1 & c_2 & \cdots & c_n \end{bmatrix}$
3. Compute the SVD of $C$

The left singular vectors are the principal directions. The principal values are the coefficients of the centered vectors in the basis of principal directions.
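A minimal numpy sketch of the three steps above (the function and variable names are ours, not from the slides):

```python
import numpy as np

def pca(X):
    """X has the data vectors as columns; returns principal directions and coefficients."""
    C = X - X.mean(axis=1, keepdims=True)             # 1. center the data
    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # 2.-3. group as columns, compute SVD
    # Columns of U: principal directions; rows of S V^T: coefficients in that basis
    return U, s[:, None] * Vt

rng = np.random.default_rng(0)
directions, coeffs = pca(rng.standard_normal((3, 5)))
```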
Example

$$C := \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix}$$
Principal component analysis
Example

$$C := \begin{bmatrix} -2 & -1 & 5 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix}$$
Principal component analysis
Outliers

Problem: outliers distort the principal directions

Model: data equals low-rank component plus sparse component,
$$Y = L + S$$

Idea: fit the model to the data, then apply PCA to $L$
Robust PCA

Data: $Y \in \mathbb{R}^{n \times m}$

Robust PCA estimator of the low-rank component:
$$L_{\text{RPCA}} := \arg\min_L \|L\|_* + \lambda \|Y - L\|_1,$$
where $\lambda > 0$ is a regularization parameter and $\|\cdot\|_1$ is the $\ell_1$ norm of the vectorized matrix.

Robust PCA estimator of the sparse component:
$$S_{\text{RPCA}} := Y - L_{\text{RPCA}}$$
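A hedged sketch of this estimator using the cvxpy modeling package (our choice of tool; the slides do not specify a solver), which handles the nondifferentiable nuclear and $\ell_1$ norms directly:

```python
import cvxpy as cp
import numpy as np

def robust_pca(Y, lam):
    """Solve min_L ||L||_* + lam * ||Y - L||_1; returns (L, S) estimates."""
    L = cp.Variable(Y.shape)
    cost = cp.normNuc(L) + lam * cp.sum(cp.abs(Y - L))  # l1 norm of vectorized Y - L
    cp.Problem(cp.Minimize(cost)).solve()
    return L.value, Y - L.value
```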
Example

[Figure: entries of the estimated low-rank and sparse components as a function of $\lambda$ ($10^{-1}$ to $10^0$).]
$\lambda = 1/\sqrt{n}$

[Figure: estimated low-rank component $L$ and sparse component $S$.]
Large $\lambda$

[Figure: estimated low-rank component $L$ and sparse component $S$.]

Small $\lambda$

[Figure: estimated low-rank component $L$ and sparse component $S$.]
Background subtraction

[Figure: video frame.]
Background subtraction

Matrix with vectorized frames as columns.

Static image:
$$Y = \begin{bmatrix} x & x & \cdots & x \end{bmatrix} = x \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}$$

Slowly varying background: low rank
Rapidly varying foreground: sparse
Frame 17

[Figure: original frame.]

Low-rank component

[Figure: low-rank component of frame 17.]

Sparse component

[Figure: sparse component of frame 17.]

Frame 42

[Figure: original frame.]

Low-rank component

[Figure: low-rank component of frame 42.]

Sparse component

[Figure: sparse component of frame 42.]

Frame 75

[Figure: original frame.]

Low-rank component

[Figure: low-rank component of frame 75.]

Sparse component

[Figure: sparse component of frame 75.]
Applications Subgradients Optimization methods
Gradient

A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if for every $x, y \in \mathbb{R}^n$,
$$f(y) \ge f(x) + \nabla f(x)^T (y - x)$$
Gradient

[Figure: the first-order approximation at $x$ is a global lower bound on $f$.]
Subgradient

The subgradient of $f : \mathbb{R}^n \to \mathbb{R}$ at $x \in \mathbb{R}^n$ is a vector $g \in \mathbb{R}^n$ such that
$$f(y) \ge f(x) + g^T (y - x) \quad \text{for all } y \in \mathbb{R}^n$$

Geometrically, the hyperplane
$$H_g := \left\{ y \in \mathbb{R}^{n+1} \,:\, y[n+1] = f(x) + g^T \left( \begin{bmatrix} y[1] \\ \vdots \\ y[n] \end{bmatrix} - x \right) \right\}$$
is a supporting hyperplane of the epigraph at $x$.

The set of all subgradients at $x$ is called the subdifferential.
Subgradients
Subgradient of differentiable function If a function is differentiable, the only subgradient at each point is the gradient
Proof

Assume $g$ is a subgradient at $x$. For any $\alpha \ge 0$,
$$f(x + \alpha e_i) \ge f(x) + g^T (\alpha e_i) = f(x) + g[i]\,\alpha$$
$$f(x) \le f(x - \alpha e_i) + g^T (\alpha e_i) = f(x - \alpha e_i) + g[i]\,\alpha$$

Combining both inequalities,
$$\frac{f(x) - f(x - \alpha e_i)}{\alpha} \le g[i] \le \frac{f(x + \alpha e_i) - f(x)}{\alpha}$$

Letting $\alpha \to 0$ implies
$$g[i] = \frac{\partial f(x)}{\partial x[i]}$$
Subgradient

A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if it has a subgradient at every point.

It is strictly convex if and only if for all $x \in \mathbb{R}^n$ there exists $g \in \mathbb{R}^n$ such that
$$f(y) > f(x) + g^T (y - x) \quad \text{for all } y \ne x$$
Optimality condition for nondifferentiable functions

If 0 is a subgradient of $f$ at $x$, then for all $y \in \mathbb{R}^n$
$$f(y) \ge f(x) + 0^T (y - x) = f(x),$$
so $x$ is a minimum.

Under strict convexity the minimum is unique.
Sum of subgradients

Let $g_1$ and $g_2$ be subgradients at $x \in \mathbb{R}^n$ of $f_1 : \mathbb{R}^n \to \mathbb{R}$ and $f_2 : \mathbb{R}^n \to \mathbb{R}$. Then $g := g_1 + g_2$ is a subgradient of $f := f_1 + f_2$ at $x$.

Proof: For any $y \in \mathbb{R}^n$,
$$f(y) = f_1(y) + f_2(y) \ge f_1(x) + g_1^T (y - x) + f_2(x) + g_2^T (y - x) = f(x) + g^T (y - x)$$
Subgradient of scaled function

Let $g_1$ be a subgradient at $x \in \mathbb{R}^n$ of $f_1 : \mathbb{R}^n \to \mathbb{R}$. For any $\eta \ge 0$, $g_2 := \eta\, g_1$ is a subgradient of $f_2 := \eta f_1$ at $x$.

Proof: For any $y \in \mathbb{R}^n$,
$$f_2(y) = \eta f_1(y) \ge \eta \left( f_1(x) + g_1^T (y - x) \right) = f_2(x) + g_2^T (y - x)$$
Subdifferential of absolute value

[Figure: $f(x) = |x|$.]
Subdifferential of absolute value

At $x \ne 0$, $f(x) = |x|$ is differentiable, so $g = \text{sign}(x)$.

At $x = 0$ we need, for all $y$,
$$f(0 + y) \ge f(0) + g\,(y - 0), \quad \text{i.e.} \quad |y| \ge g\,y,$$
which holds if and only if $|g| \le 1$.
Subdifferential of $\ell_1$ norm

$g$ is a subgradient of the $\ell_1$ norm at $x \in \mathbb{R}^n$ if and only if
$$g[i] = \text{sign}(x[i]) \quad \text{if } x[i] \ne 0,$$
$$|g[i]| \le 1 \quad \text{if } x[i] = 0.$$
Proof

$g$ is a subgradient of $\|\cdot\|_1$ at $x$ if and only if $g[i]$ is a subgradient of $|\cdot|$ at $x[i]$ for all $1 \le i \le n$.
Proof

If $g$ is a subgradient of $\|\cdot\|_1$ at $x$, then for any $y \in \mathbb{R}$,
$$|y| = |x[i]| + \|x + (y - x[i]) e_i\|_1 - \|x\|_1 \ge |x[i]| + g^T (y - x[i]) e_i = |x[i]| + g[i] (y - x[i]),$$
so $g[i]$ is a subgradient of $|\cdot|$ at $x[i]$ for all $1 \le i \le n$.
Proof

If $g[i]$ is a subgradient of $|\cdot|$ at $x[i]$ for $1 \le i \le n$, then for any $y \in \mathbb{R}^n$,
$$\|y\|_1 = \sum_{i=1}^n |y[i]| \ge \sum_{i=1}^n \left( |x[i]| + g[i] (y[i] - x[i]) \right) = \|x\|_1 + g^T (y - x),$$
so $g$ is a subgradient of $\|\cdot\|_1$ at $x$.
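The characterization can be sanity-checked numerically; a small sketch (the vector $x$, the number of trials, and the tolerance are our arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.5, 0.0, -2.0])
# sign(x[i]) where x[i] != 0, anything in [-1, 1] where x[i] = 0
g = np.where(x != 0, np.sign(x), rng.uniform(-1, 1, size=x.shape))

for _ in range(1000):
    y = rng.standard_normal(x.shape)
    # subgradient inequality: ||y||_1 >= ||x||_1 + g^T (y - x)
    assert np.abs(y).sum() >= np.abs(x).sum() + g @ (y - x) - 1e-12
```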
Subdifferential of $\ell_1$ norm

[Figures.]
Subdifferential of the nuclear norm

Let $X \in \mathbb{R}^{m \times n}$ be a rank-$r$ matrix with SVD $X = USV^T$, where $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, and $S \in \mathbb{R}^{r \times r}$. A matrix $G$ is a subgradient of the nuclear norm at $X$ if and only if
$$G := UV^T + W,$$
where $W$ satisfies
$$\|W\| \le 1, \qquad U^T W = 0, \qquad W V = 0.$$
Proof

By Pythagoras' theorem, for any $x \in \mathbb{R}^n$ with unit $\ell_2$ norm we have
$$\|P_{\text{row}(X)}\, x\|_2^2 + \|P_{\text{row}(X)^\perp}\, x\|_2^2 = \|x\|_2^2 = 1$$

The rows of $UV^T$ are in $\text{row}(X)$ and the rows of $W$ in $\text{row}(X)^\perp$, so
$$\|G\|_2^2 := \max_{\{\|x\|_2 = 1,\, x \in \mathbb{R}^n\}} \|Gx\|_2^2 = \max_{\{\|x\|_2 = 1\}} \|UV^T x\|_2^2 + \|Wx\|_2^2 = \max_{\{\|x\|_2 = 1\}} \|UV^T P_{\text{row}(X)}\, x\|_2^2 + \|W P_{\text{row}(X)^\perp}\, x\|_2^2 \le \max_{\{\|x\|_2 = 1\}} \|P_{\text{row}(X)}\, x\|_2^2 + \|P_{\text{row}(X)^\perp}\, x\|_2^2 \le 1$$
Hölder's inequality for matrices

For any matrix $A \in \mathbb{R}^{m \times n}$,
$$\|A\|_* = \sup_{\{\|B\| \le 1,\, B \in \mathbb{R}^{m \times n}\}} \langle A, B \rangle.$$
Proof

For any matrix $Y \in \mathbb{R}^{m \times n}$, since $\|G\| \le 1$, Hölder's inequality gives
$$\|Y\|_* \ge \langle G, Y \rangle = \langle G, X \rangle + \langle G, Y - X \rangle = \langle UV^T, X \rangle + \langle W, X \rangle + \langle G, Y - X \rangle$$

$U^T W = 0$ implies
$$\langle W, X \rangle = \langle W, USV^T \rangle = \langle U^T W, SV^T \rangle = 0,$$
and
$$\langle UV^T, X \rangle = \text{tr}\left( VU^T X \right) = \text{tr}\left( VU^T USV^T \right) = \text{tr}\left( V^T V S \right) = \text{tr}(S) = \|X\|_*$$

Hence
$$\|Y\|_* \ge \|X\|_* + \langle G, Y - X \rangle,$$
so $G$ is a subgradient of the nuclear norm at $X$.
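A numerical sanity check of the characterization with the simplest choice $W = 0$ (a sketch; the dimensions, seed, and tolerance are our arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
G = U[:, :r] @ Vt[:r]          # G = U V^T, i.e. the subgradient with W = 0

nuc = lambda A: np.linalg.svd(A, compute_uv=False).sum()       # nuclear norm
for _ in range(1000):
    Y = rng.standard_normal((m, n))
    # subgradient inequality: ||Y||_* >= ||X||_* + <G, Y - X>
    assert nuc(Y) >= nuc(X) + np.sum(G * (Y - X)) - 1e-9
```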
Sparse linear regression with 2 features

$$y := \alpha x_1 + z, \qquad X := \begin{bmatrix} x_1 & x_2 \end{bmatrix}, \qquad \|x_1\|_2 = 1, \quad \|x_2\|_2 = 1, \quad \langle x_1, x_2 \rangle = \rho$$
Analysis of lasso estimator

Let $\alpha \ge 0$. Then
$$\beta_{\text{lasso}} = \begin{bmatrix} \alpha + x_1^T z - \lambda \\ 0 \end{bmatrix}$$
as long as
$$\frac{|x_2^T z - \rho\, x_1^T z|}{1 - \rho} \le \lambda \le \alpha + x_1^T z$$
Lasso estimator

[Figure: lasso coefficients as a function of the regularization parameter (0 to 0.20).]
Optimality condition for nondifferentiable functions

If 0 is a subgradient of $f$ at $x$, then for all $y \in \mathbb{R}^n$
$$f(y) \ge f(x) + 0^T (y - x) = f(x),$$
so $x$ is a minimum. Under strict convexity the minimum is unique.
Proof

The cost function is strictly convex if $n \ge 2$ and $|\rho| < 1$.

Aim: show that there is a subgradient equal to 0 at a 1-sparse solution.
Proof

The gradient of the quadratic term
$$q(\beta) := \frac{1}{2}\|X\beta - y\|_2^2$$
at $\beta_{\text{lasso}}$ equals
$$\nabla q(\beta_{\text{lasso}}) = X^T \left( X \beta_{\text{lasso}} - y \right)$$
Proof

If only the first entry is nonzero and nonnegative,
$$g_{\ell_1} := \begin{bmatrix} 1 \\ \gamma \end{bmatrix}$$
is a subgradient of the $\ell_1$ norm at $\beta_{\text{lasso}}$ for any $\gamma \in \mathbb{R}$ such that $|\gamma| \le 1$.

In that case $g_{\text{lasso}} := \nabla q(\beta_{\text{lasso}}) + \lambda\, g_{\ell_1}$ is a subgradient of the cost function at $\beta_{\text{lasso}}$.

If $g_{\text{lasso}} = 0$, then $\beta_{\text{lasso}}$ is the unique solution.
Proof

$$g_{\text{lasso}} := X^T \left( X \beta_{\text{lasso}} - y \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix} = X^T \left( \beta_{\text{lasso}}[1]\, x_1 - \alpha x_1 - z \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix} = \begin{bmatrix} x_1^T \left( \beta_{\text{lasso}}[1]\, x_1 - \alpha x_1 - z \right) + \lambda \\ x_2^T \left( \beta_{\text{lasso}}[1]\, x_1 - \alpha x_1 - z \right) + \lambda\gamma \end{bmatrix} = \begin{bmatrix} \beta_{\text{lasso}}[1] - \alpha - x_1^T z + \lambda \\ \rho\, \beta_{\text{lasso}}[1] - \rho\alpha - x_2^T z + \lambda\gamma \end{bmatrix}$$
Proof

$$g_{\text{lasso}} = \begin{bmatrix} \beta_{\text{lasso}}[1] - \alpha - x_1^T z + \lambda \\ \rho\, \beta_{\text{lasso}}[1] - \rho\alpha - x_2^T z + \lambda\gamma \end{bmatrix}$$

is equal to 0 if
$$\beta_{\text{lasso}}[1] = \alpha + x_1^T z - \lambda, \qquad \gamma = \frac{\rho\alpha + x_2^T z - \rho\, \beta_{\text{lasso}}[1]}{\lambda} = \frac{x_2^T z - \rho\, x_1^T z}{\lambda} + \rho$$
Proof

We still need to check that this is a valid subgradient at $\beta_{\text{lasso}}$, i.e.

$\beta_{\text{lasso}}[1]$ is nonnegative:
$$\lambda \le \alpha + x_1^T z$$

$|\gamma| \le 1$, which holds if
$$\frac{|x_2^T z - \rho\, x_1^T z|}{\lambda} + \rho \le 1, \quad \text{i.e. if} \quad \frac{|x_2^T z - \rho\, x_1^T z|}{1 - \rho} \le \lambda$$
Robust PCA

Data: $Y \in \mathbb{R}^{n \times m}$

Robust PCA estimator of the low-rank component:
$$L_{\text{RPCA}} := \arg\min_L \|L\|_* + \lambda \|Y - L\|_1,$$
where $\lambda > 0$ is a regularization parameter and $\|\cdot\|_1$ is the $\ell_1$ norm of the vectorized matrix.

Robust PCA estimator of the sparse component:
$$S_{\text{RPCA}} := Y - L_{\text{RPCA}}$$
Example

$$Y := \begin{bmatrix} -2 & -1 & \alpha & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix}$$
Analysis of robust PCA estimator

The robust PCA estimates of both components are exact for any value of $\alpha$ as long as
$$\frac{2}{\sqrt{30}} < \lambda < \sqrt{\frac{2}{3}}$$
Example

[Figure: entries of the estimated low-rank and sparse components as a function of $\lambda$ ($10^{-1}$ to $10^0$).]
Optimality + uniqueness condition

Let $Y := L + S$, where $L, S \in \mathbb{R}^{m \times n}$ and $L = U_L S_L V_L^T$ has rank $r$, with $U_L \in \mathbb{R}^{m \times r}$, $V_L \in \mathbb{R}^{n \times r}$, $S_L \in \mathbb{R}^{r \times r}$.

Assume there exists $G := U_L V_L^T + W$, where $W$ satisfies
$$\|W\| < 1, \qquad U_L^T W = 0, \qquad W V_L = 0,$$
and there also exists a matrix $G_{\ell_1}$ satisfying
$$G_{\ell_1}[i,j] = \text{sign}(S[i,j]) \quad \text{if } S[i,j] \ne 0,$$
$$|G_{\ell_1}[i,j]| < 1 \quad \text{otherwise},$$
such that $G + \lambda G_{\ell_1} = 0$.

Then the solution to the robust PCA problem is unique and equal to $L$.
Optimality + uniqueness condition

$G := U_L V_L^T + W$ is a subgradient of the nuclear norm at $L$.

$G_{\ell_1}$ is a subgradient of $\|Y - L\|_1$ at $L$.

$G + \lambda G_{\ell_1}$ is a subgradient of the cost function at $L$.

$G + \lambda G_{\ell_1} = 0$ implies that $L$ is a solution (uniqueness is more difficult to prove).
Example

$$Y := \begin{bmatrix} -2 & -1 & \alpha & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix}$$

We want to show that the solution is
$$L := \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix}, \qquad S := \begin{bmatrix} 0 & 0 & \alpha & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
Example

$$L := \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} = \frac{1}{\sqrt{3}} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \cdot \sqrt{30} \cdot \frac{1}{\sqrt{10}} \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \end{bmatrix}$$

$$U_L V_L^T = \frac{1}{\sqrt{30}} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \end{bmatrix}$$
Example

$$G + \lambda G_{\ell_1} = \frac{1}{\sqrt{30}} \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} + W + \lambda \begin{bmatrix} g_1 & g_2 & \text{sign}(\alpha) & g_3 & g_4 \\ g_5 & g_6 & g_7 & g_8 & g_9 \\ g_{10} & g_{11} & g_{12} & g_{13} & g_{14} \end{bmatrix},$$
where $|g_i| < 1$. Setting $G + \lambda G_{\ell_1} = 0$ entry by entry:

The $(1,3)$ entry forces $W[1,3] = -\lambda\, \text{sign}(\alpha)$. Choosing $W$ supported on the third column satisfies $W V_L = 0$ (the third entry of $V_L$ is zero), and $U_L^T W = 0$ then requires the column to sum to zero; we can take
$$W = \lambda\, \text{sign}(\alpha) \begin{bmatrix} 0 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1/2 & 0 & 0 \\ 0 & 0 & 1/2 & 0 & 0 \end{bmatrix}$$

The remaining entries of $G_{\ell_1}$ are then determined: $g_7 = g_{12} = -\text{sign}(\alpha)/2$, and the entries over the support of $U_L V_L^T$ have magnitude $1/(\lambda\sqrt{30})$ or $2/(\lambda\sqrt{30})$.

Is $|G_{\ell_1}[i,j]| < 1$ for $S[i,j] = 0$? Yes, if $\lambda > 2/\sqrt{30}$.

Is $\|W\| < 1$? $\|W\| = \lambda\sqrt{3/2}$, so yes, if $\lambda < \sqrt{2/3}$.
Applications Subgradients Optimization methods
Subgradient method

Optimization problem:
$$\text{minimize} \quad f(x),$$
where $f$ is convex but nondifferentiable.

Subgradient-method iteration:
$$x^{(0)} = \text{arbitrary initialization}$$
$$x^{(k+1)} = x^{(k)} - \alpha_k\, g^{(k)},$$
where $g^{(k)}$ is a subgradient of $f$ at $x^{(k)}$.
Least-squares regression with $\ell_1$-norm regularization

$$\text{minimize} \quad \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$$

Subgradient at $x^{(k)}$:
$$g^{(k)} = A^T \left( A x^{(k)} - y \right) + \lambda\, \text{sign}\left( x^{(k)} \right)$$

Subgradient-method iteration:
$$x^{(0)} = \text{arbitrary initialization}$$
$$x^{(k+1)} = x^{(k)} - \alpha_k \left( A^T \left( A x^{(k)} - y \right) + \lambda\, \text{sign}\left( x^{(k)} \right) \right)$$
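A minimal numpy sketch of this iteration (the step-size schedule $\alpha_0/\sqrt{k+1}$ and the iteration count are our choices; the slides only require diminishing steps):

```python
import numpy as np

def subgradient_method(A, y, lam, alpha0=1.0, iters=1000):
    x = np.zeros(A.shape[1])
    for k in range(iters):
        g = A.T @ (A @ x - y) + lam * np.sign(x)  # subgradient of the cost at x
        x = x - alpha0 / np.sqrt(k + 1) * g       # diminishing step size
    return x
```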
Convergence of subgradient method

It is not a descent method.
The convergence rate can be shown to be $O(1/\epsilon^2)$.
Diminishing step sizes are necessary for convergence.

Experiment: minimize
$$\frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1,$$
with $A \in \mathbb{R}^{2000 \times 1000}$, $y = Ax + z$, where $x$ is 100-sparse and $z$ is iid Gaussian.
Convergence of subgradient method

[Figure: relative error $(f(x^{(k)}) - f(x^*))/f(x^*)$ over 100 iterations for step sizes $\alpha_0$, $\alpha_0/\sqrt{k}$, and $\alpha_0/k$.]
Convergence of subgradient method

[Figure: relative error $(f(x^{(k)}) - f(x^*))/f(x^*)$ over 5000 iterations for step sizes $\alpha_0$, $\alpha_0/\sqrt{n}$, and $\alpha_0/n$.]
Composite functions

An interesting class of functions for data analysis:
$$f(x) + h(x),$$
$f$ convex and differentiable, $h$ convex but not differentiable.

Example:
$$\frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$$
Motivation

Aim: minimize a convex differentiable function $f$.

Idea: iteratively minimize the first-order approximation, while staying close to the current point:
$$x^{(0)} = \text{arbitrary initialization}$$
$$x^{(k+1)} = \arg\min_x f\left( x^{(k)} \right) + \nabla f\left( x^{(k)} \right)^T \left( x - x^{(k)} \right) + \frac{1}{2\alpha_k} \left\| x - x^{(k)} \right\|_2^2,$$
where $\alpha_k$ is a parameter that determines how close we stay.
Motivation

The linear approximation plus the $\ell_2$ term is convex:
$$f\left( x^{(k)} \right) + \nabla f\left( x^{(k)} \right)^T \left( x - x^{(k)} \right) + \frac{1}{2\alpha_k} \left\| x - x^{(k)} \right\|_2^2 = \frac{1}{2\alpha_k} \left\| x - x^{(k)} + \alpha_k \nabla f\left( x^{(k)} \right) \right\|_2^2 + \text{constant}$$

Setting the gradient to zero,
$$x^{(k+1)} = \arg\min_x f\left( x^{(k)} \right) + \nabla f\left( x^{(k)} \right)^T \left( x - x^{(k)} \right) + \frac{1}{2\alpha_k} \left\| x - x^{(k)} \right\|_2^2 = x^{(k)} - \alpha_k \nabla f\left( x^{(k)} \right),$$
which is a gradient-descent step.
Proximal gradient method

Idea: minimize the local first-order approximation plus $h$:
$$x^{(k+1)} = \arg\min_x f\left( x^{(k)} \right) + \nabla f\left( x^{(k)} \right)^T \left( x - x^{(k)} \right) + \frac{1}{2\alpha_k} \left\| x - x^{(k)} \right\|_2^2 + h(x)$$
$$= \arg\min_x \frac{1}{2} \left\| x - \left( x^{(k)} - \alpha_k \nabla f\left( x^{(k)} \right) \right) \right\|_2^2 + \alpha_k h(x)$$
$$= \text{prox}_{\alpha_k h} \left( x^{(k)} - \alpha_k \nabla f\left( x^{(k)} \right) \right)$$

Proximal operator:
$$\text{prox}_h(y) := \arg\min_x h(x) + \frac{1}{2}\|y - x\|_2^2$$
Proximal gradient method

Method to solve the optimization problem
$$\text{minimize} \quad f(x) + h(x),$$
where $f$ is differentiable and $\text{prox}_h$ is tractable.

Proximal-gradient iteration:
$$x^{(0)} = \text{arbitrary initialization}$$
$$x^{(k+1)} = \text{prox}_{\alpha_k h} \left( x^{(k)} - \alpha_k \nabla f\left( x^{(k)} \right) \right)$$
Interpretation as a fixed-point method

A vector $x^*$ is a solution to
$$\text{minimize} \quad f(x) + h(x)$$
if and only if it is a fixed point of the proximal-gradient iteration for any $\alpha > 0$:
$$x^* = \text{prox}_{\alpha h} \left( x^* - \alpha \nabla f(x^*) \right)$$
Proof

$x^*$ is the solution to
$$\min_x \alpha h(x) + \frac{1}{2} \left\| x^* - \alpha \nabla f(x^*) - x \right\|_2^2$$
if and only if there is a subgradient $g$ of $h$ at $x^*$ such that
$$\alpha \nabla f(x^*) + \alpha g = 0$$

$x^*$ minimizes $f + h$ if and only if there is a subgradient $g$ of $h$ at $x^*$ such that
$$\nabla f(x^*) + g = 0$$

The two conditions are equivalent.
Proximal operator of $\ell_1$ norm

The proximal operator of the $\ell_1$ norm is the soft-thresholding operator
$$\text{prox}_{\alpha \|\cdot\|_1}(y) = \mathcal{S}_\alpha(y),$$
where $\alpha > 0$ and
$$\mathcal{S}_\alpha(y)[i] := \begin{cases} y[i] - \text{sign}(y[i])\,\alpha & \text{if } |y[i]| \ge \alpha \\ 0 & \text{otherwise} \end{cases}$$
Proof

$$\alpha \|x\|_1 + \frac{1}{2}\|y - x\|_2^2 = \sum_{i=1}^m \left( \alpha |x[i]| + \frac{1}{2}\left( y[i] - x[i] \right)^2 \right),$$
so we can minimize over each entry separately. Consider
$$w(x) := \alpha |x| + \frac{1}{2}(y - x)^2 = \frac{y^2}{2} + \frac{x^2}{2} + \alpha |x| - yx$$

If $x \ge 0$,
$$w(x) = \frac{y^2}{2} + \frac{x^2}{2} - (y - \alpha)x, \qquad w'(x) = x - (y - \alpha)$$
If $y \ge \alpha$, the minimum over $x \ge 0$ is at $x := y - \alpha$; if $y < \alpha$, the minimum over $x \ge 0$ is at 0.

If $x < 0$,
$$w(x) = \frac{y^2}{2} + \frac{x^2}{2} - (y + \alpha)x, \qquad w'(x) = x - (y + \alpha)$$
If $y \le -\alpha$, the minimum over $x \le 0$ is at $x := y + \alpha$; if $y > -\alpha$, the minimum over $x \le 0$ is at 0.

If $-\alpha \le y \le \alpha$, the minimum is at $x := 0$.

If $y \ge \alpha$, the minimum is at $x := y - \alpha$ or at $x := 0$, but
$$w(y - \alpha) = \alpha(y - \alpha) + \frac{\alpha^2}{2} = \alpha y - \frac{\alpha^2}{2} \le \frac{y^2}{2} = w(0)$$
because $(y - \alpha)^2 \ge 0$. The same argument applies for $y \le -\alpha$.
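In numpy the operator is one line, vectorized over the entries of $y$ (a sketch with our own function name):

```python
import numpy as np

def soft_threshold(y, alpha):
    # y[i] - sign(y[i]) * alpha if |y[i]| >= alpha, and 0 otherwise
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)
```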
Iterative Shrinkage-Thresholding Algorithm (ISTA)

The proximal gradient method for the problem
$$\text{minimize} \quad \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$$
is called ISTA.

ISTA iteration:
$$x^{(0)} = \text{arbitrary initialization}$$
$$x^{(k+1)} = \mathcal{S}_{\alpha_k \lambda} \left( x^{(k)} - \alpha_k A^T \left( A x^{(k)} - y \right) \right)$$
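A minimal ISTA sketch; the constant step size $\alpha = 1/\|A\|_2^2$ is a standard choice for this quadratic (our assumption; the slides leave $\alpha_k$ unspecified):

```python
import numpy as np

def soft_threshold(y, alpha):
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def ista(A, y, lam, iters=500):
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the spectral norm of A
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - alpha * A.T @ (A @ x - y), alpha * lam)
    return x
```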
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)

ISTA can be accelerated using Nesterov's accelerated gradient method.

FISTA iteration:
$$x^{(0)} = \text{arbitrary initialization}$$
$$z^{(0)} = x^{(0)}$$
$$x^{(k+1)} = \mathcal{S}_{\alpha_k \lambda} \left( z^{(k)} - \alpha_k A^T \left( A z^{(k)} - y \right) \right)$$
$$z^{(k+1)} = x^{(k+1)} + \frac{k}{k+3} \left( x^{(k+1)} - x^{(k)} \right)$$
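The same sketch with the momentum term from the slide added (same assumed step size as the ISTA sketch above):

```python
import numpy as np

def fista(A, y, lam, iters=500):
    soft = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a, 0.0)
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(iters):
        x_next = soft(z - alpha * A.T @ (A @ z - y), alpha * lam)
        z = x_next + k / (k + 3) * (x_next - x)   # Nesterov momentum, k/(k+3) factor
        x = x_next
    return x
```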
Convergence of proximal gradient method

Without acceleration:
- Descent method
- Convergence rate can be shown to be $O(1/\epsilon)$ with a constant step or backtracking line search

With acceleration:
- Not a descent method
- Convergence rate can be shown to be $O(1/\sqrt{\epsilon})$ with a constant step or backtracking line search

Experiment: minimize
$$\frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1,$$
with $A \in \mathbb{R}^{2000 \times 1000}$, $y = Ax_0 + z$, $x_0$ 100-sparse and $z$ iid Gaussian.
Convergence of proximal gradient method

[Figure: relative error $(f(x^{(k)}) - f(x^*))/f(x^*)$ over 100 iterations for the subgradient method ($\alpha_0/\sqrt{k}$), ISTA, and FISTA.]
Coordinate descent

Idea: solve the $n$-dimensional problem
$$\text{minimize} \quad c(x[1], x[2], \ldots, x[n])$$
by solving a sequence of 1D problems.

Coordinate-descent iteration:
$$x^{(0)} = \text{arbitrary initialization}$$
$$x^{(k+1)}[i] = \arg\min_\alpha c\left( x^{(k)}[1], \ldots, \alpha, \ldots, x^{(k)}[n] \right) \quad \text{for some } 1 \le i \le n$$
Coordinate descent

Convergence is guaranteed for functions of the form
$$f(x) + \sum_{i=1}^n h_i(x[i]),$$
where $f$ is convex and differentiable and $h_1, \ldots, h_n$ are convex.
Least-squares regression with $\ell_1$-norm regularization

$$h(x) := \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$$

The solution to the subproblem $\min_{x[i]} h(x[1], \ldots, x[i], \ldots, x[n])$ is
$$\hat{x}[i] = \frac{\mathcal{S}_\lambda(\gamma_i)}{\|A_i\|_2^2},$$
where $A_i$ is the $i$-th column of $A$ and
$$\gamma_i := \sum_{l=1}^m A_{li} \left( y[l] - \sum_{j \ne i} A_{lj}\, x[j] \right)$$
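A minimal coordinate-descent sketch for this problem, cycling over the coordinates and maintaining the residual so that each $\gamma_i$ costs $O(m)$ (the sweep count is our arbitrary choice):

```python
import numpy as np

def coordinate_descent(A, y, lam, sweeps=100):
    m, n = A.shape
    x = np.zeros(n)
    r = y - A @ x                       # residual y - A x
    col_sq = np.sum(A ** 2, axis=0)     # ||A_i||_2^2 for every column
    for _ in range(sweeps):
        for i in range(n):
            # gamma_i = A_i^T (y - sum_{j != i} A_j x[j]) = A_i^T r + ||A_i||^2 x[i]
            gamma = A[:, i] @ r + col_sq[i] * x[i]
            x_new = np.sign(gamma) * max(abs(gamma) - lam, 0.0) / col_sq[i]
            r += A[:, i] * (x[i] - x_new)   # keep the residual consistent
            x[i] = x_new
    return x
```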