Nondifferentiable Convex Functions


Nondifferentiable Convex Functions DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda

Applications Subgradients Optimization methods

Regression
The aim is to learn a function h that relates a response or dependent variable y to several observed variables x_1, x_2, …, x_p, known as covariates, features, or independent variables. The response is assumed to be of the form

  y = h(x) + z

where x ∈ R^p contains the features and z is noise.

Linear regression
The regression function h is assumed to be linear:

  y^(i) = (x^(i))^T β + z^(i), 1 ≤ i ≤ n

Our aim is to estimate β ∈ R^p from the data

Linear regression
In matrix form:

  [ y^(1) ]   [ x^(1)_1  x^(1)_2  ⋯  x^(1)_p ] [ β_1 ]   [ z^(1) ]
  [ y^(2) ] = [ x^(2)_1  x^(2)_2  ⋯  x^(2)_p ] [ β_2 ] + [ z^(2) ]
  [   ⋮   ]   [    ⋮        ⋮           ⋮    ] [  ⋮  ]   [   ⋮   ]
  [ y^(n) ]   [ x^(n)_1  x^(n)_2  ⋯  x^(n)_p ] [ β_p ]   [ z^(n) ]

Equivalently, y = Xβ + z

Sparse linear regression
Only a subset of the features are relevant (a model-selection problem)
Two objectives:
Good fit to the data: ‖Xβ − y‖₂² should be as small as possible
Using a small number of features: β should be as sparse as possible

Sparse linear regression
If only the entries β_j and β_l are nonzero, then y = Xβ + z only depends on two columns of X:

  y = x_j β_j + x_l β_l + z = Xβ + z,  with β = (0, …, 0, β_j, 0, …, 0, β_l, 0, …, 0)

Sparse linear regression with 2 features

  y := α x₁ + z,  X := [x₁ x₂],  ‖x₁‖₂ = ‖x₂‖₂ = 1,  ⟨x₁, x₂⟩ = ρ

Least squares: not sparse

  β_LS = (X^T X)^{-1} X^T y
       = (1/(1 − ρ²)) [ 1  −ρ; −ρ  1 ] [ x₁^T y; x₂^T y ]
       = (1/(1 − ρ²)) [ 1  −ρ; −ρ  1 ] [ α + x₁^T z; αρ + x₂^T z ]
       = [ α + ⟨x₁ − ρ x₂, z⟩ / (1 − ρ²) ; ⟨x₂ − ρ x₁, z⟩ / (1 − ρ²) ]

The second entry is nonzero in general

The lasso
Idea: use ℓ₁-norm regularization to promote sparse coefficients

  β_lasso := arg min_β (1/2) ‖y − Xβ‖₂² + λ ‖β‖₁
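To make the contrast with least squares concrete, here is a minimal numerical sketch (assuming NumPy; the two-feature setup mirrors the example above, and the lasso is solved with proximal gradient iterations, covered later in these slides — all names and parameter values are illustrative):

```python
import numpy as np

def soft_threshold(v, t):
    # Entrywise soft-thresholding: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso(X, y, lam, iters=5000):
    # Minimize (1/2)||y - X b||_2^2 + lam*||b||_1 by proximal gradient
    # iterations (ISTA, described later in the slides).
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft_threshold(b - step * X.T @ (X @ b - y), step * lam)
    return b

rng = np.random.default_rng(0)
# Two unit-norm features with correlation rho = 0.9; y = alpha*x1 + noise.
x1 = rng.standard_normal(100)
x1 /= np.linalg.norm(x1)
u = rng.standard_normal(100)
u -= (u @ x1) * x1
u /= np.linalg.norm(u)
rho = 0.9
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * u
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 0.001 * rng.standard_normal(100)

b_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # both entries nonzero
b_lasso = lasso(X, y, lam=0.1)                # second entry exactly zero
```

The soft-thresholding step sets the irrelevant coefficient exactly to zero, while least squares spreads a small spurious weight onto the correlated feature.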

Nonnegative weighted sums
The weighted sum of m convex functions f₁, …, f_m,

  f := Σ_{i=1}^m α_i f_i,

is convex if α₁, …, α_m ∈ R are nonnegative
Proof: For any x, y and θ ∈ [0, 1],

  f(θx + (1 − θ)y) = Σ_{i=1}^m α_i f_i(θx + (1 − θ)y)
                   ≤ Σ_{i=1}^m α_i (θ f_i(x) + (1 − θ) f_i(y))
                   = θ f(x) + (1 − θ) f(y)
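As a quick numerical sanity check of this fact (evidence, not a proof; NumPy assumed, the functions and weights are illustrative), one can test the convexity inequality on random points for a nonnegative combination of two convex functions:

```python
import numpy as np

# f = a1*|x| + a2*x^2 is a nonnegative weighted sum of convex functions,
# so f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y) must hold.
a1, a2 = 0.5, 2.0
f = lambda x: a1 * abs(x) + a2 * x ** 2

rng = np.random.default_rng(1)
violations = 0
for _ in range(1000):
    x, y = rng.uniform(-5, 5, size=2)
    theta = rng.uniform()
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    if lhs > rhs + 1e-9:  # small tolerance for floating-point error
        violations += 1
```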

Regularized least squares
Regularized least-squares cost functions of the form ‖Ax − y‖₂² + ‖x‖ are convex (for any norm ‖·‖)

It works
[Figure: lasso coefficient values as a function of the regularization parameter]

Ridge regression doesn't work
[Figure: ridge-regression coefficient values as a function of the regularization parameter]

Prostate cancer data set
8 features (age, weight, analysis results)
Response: prostate-specific antigen (PSA), associated with cancer
Training set: 60 patients
Test set: 37 patients

Prostate cancer data set
[Figure: coefficient values, training loss, and test loss as a function of λ]

Principal component analysis
Given n data vectors x₁, x₂, …, x_n ∈ R^d:
1. Center the data: c_i = x_i − av(x₁, x₂, …, x_n), 1 ≤ i ≤ n
2. Group the centered data as the columns of a matrix C = [c₁ c₂ ⋯ c_n]
3. Compute the SVD of C
The left singular vectors are the principal directions
The principal values are the coefficients of the centered vectors in the basis of principal directions
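The three steps translate directly into code; a minimal sketch with NumPy (the synthetic data set is illustrative):

```python
import numpy as np

def pca(X):
    # X has one data vector per column.
    mean = X.mean(axis=1, keepdims=True)               # step 1: center
    C = X - mean                                       # step 2: group as columns
    U, s, Vt = np.linalg.svd(C, full_matrices=False)   # step 3: SVD
    return U, s, mean   # principal directions are the columns of U

rng = np.random.default_rng(2)
# 500 points in R^2 spread mostly along the direction (1, 1)/sqrt(2).
d = np.array([1.0, 1.0]) / np.sqrt(2)
X = np.outer(d, 3.0 * rng.standard_normal(500))
X += 0.1 * rng.standard_normal((2, 500))
U, s, mean = pca(X)
# U[:, 0] should align with d (up to sign), and s[0] should dominate s[1].
```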

Example

  C := [  2  1  0 −1 −2
          2  1  0 −1 −2
         −2 −1  0  1  2 ]

Principal component analysis

Example

  C := [  2  1  5 −1 −2
          2  1  0 −1 −2
         −2 −1  0  1  2 ]

Principal component analysis

Outliers
Problem: outliers distort the principal directions
Model: data equals low-rank component + sparse component, Y = L + S
Idea: fit the model to the data, then apply PCA to L

Robust PCA
Data: Y ∈ R^{n×m}
Robust PCA estimator of the low-rank component:

  L_RPCA := arg min_L ‖L‖_* + λ ‖Y − L‖₁

where λ > 0 is a regularization parameter and ‖·‖₁ is the ℓ₁ norm of the vectorized matrix
Robust PCA estimator of the sparse component: S_RPCA := Y − L_RPCA

Example
[Figure: entries of the low-rank and sparse components as a function of λ]

λ = 1/√n
[Figures: estimated L and S]

Large λ
[Figures: estimated L and S]

Small λ
[Figures: estimated L and S]

Background subtraction
[Figure: video frame]

Background subtraction
Matrix with vectorized frames as columns
Static image: Y = [x x ⋯ x] = x [1 1 ⋯ 1]
Slowly varying background: low rank
Rapidly varying foreground: sparse

Frames 17, 42, and 75
[Figures: each frame shown with its low-rank (background) and sparse (foreground) components]

Applications Subgradients Optimization methods

Gradient
A differentiable function f : R^n → R is convex if and only if for every x, y ∈ R^n

  f(y) ≥ f(x) + ∇f(x)^T (y − x)

Gradient
[Figure: the first-order approximation at x is a global lower bound on f]

Subgradient
A subgradient of f : R^n → R at x ∈ R^n is a vector g ∈ R^n such that

  f(y) ≥ f(x) + g^T (y − x), for all y ∈ R^n

Geometrically, the hyperplane

  H_g := { y ∈ R^{n+1} : y[n + 1] = f(x) + g^T ([y[1], …, y[n]]^T − x) }

is a supporting hyperplane of the epigraph at x
The set of all subgradients at x is called the subdifferential
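The defining inequality is easy to probe numerically. A sketch (NumPy assumed; random sampling gives evidence, not a proof) using the ℓ₁ norm, whose subdifferential is derived later in the slides:

```python
import numpy as np

def looks_like_subgradient(f, g, x, trials=2000, tol=1e-9):
    # Check f(y) >= f(x) + g^T (y - x) on random y.
    rng = np.random.default_rng(3)
    for _ in range(trials):
        y = x + rng.uniform(-5, 5, size=x.shape)
        if f(y) < f(x) + g @ (y - x) - tol:
            return False
    return True

f = lambda v: np.abs(v).sum()      # the l1 norm
x = np.array([0.0, 2.0])
g_valid = np.array([0.3, 1.0])     # |g[0]| <= 1 at the zero entry: valid
g_invalid = np.array([1.5, 1.0])   # |g[0]| > 1: not a subgradient
```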

Subgradients

Subgradient of differentiable function If a function is differentiable, the only subgradient at each point is the gradient

Proof
Assume g is a subgradient at x. For any α ≥ 0,

  f(x + α e_i) ≥ f(x) + g^T α e_i = f(x) + g[i] α
  f(x) ≤ f(x − α e_i) + g^T α e_i = f(x − α e_i) + g[i] α

Combining both inequalities,

  (f(x) − f(x − α e_i)) / α ≤ g[i] ≤ (f(x + α e_i) − f(x)) / α

Letting α → 0 implies g[i] = ∂f(x)/∂x[i]

Subgradient
A function f : R^n → R is convex if and only if it has a subgradient at every point
It is strictly convex if and only if for all x ∈ R^n there exists g ∈ R^n such that

  f(y) > f(x) + g^T (y − x), for all y ≠ x

Optimality condition for nondifferentiable functions
If 0 is a subgradient of f at x, then for all y ∈ R^n

  f(y) ≥ f(x) + 0^T (y − x) = f(x)

so x is a minimum
Under strict convexity the minimum is unique

Sum of subgradients
Let g₁ and g₂ be subgradients at x ∈ R^n of f₁ : R^n → R and f₂ : R^n → R
Then g := g₁ + g₂ is a subgradient of f := f₁ + f₂ at x
Proof: For any y ∈ R^n

  f(y) = f₁(y) + f₂(y)
       ≥ f₁(x) + g₁^T (y − x) + f₂(x) + g₂^T (y − x)
       = f(x) + g^T (y − x)

Subgradient of scaled function
Let g₁ be a subgradient at x ∈ R^n of f₁ : R^n → R
For any η ≥ 0, g₂ := η g₁ is a subgradient of f₂ := η f₁ at x
Proof: For any y ∈ R^n

  f₂(y) = η f₁(y)
        ≥ η (f₁(x) + g₁^T (y − x))
        = f₂(x) + g₂^T (y − x)

Subdifferential of absolute value
[Figure: f(x) = |x|]

Subdifferential of absolute value
At x ≠ 0, f(x) = |x| is differentiable, so g = sign(x)
At x = 0, we need

  f(0 + y) ≥ f(0) + g (y − 0), i.e. |y| ≥ g y

which holds if and only if |g| ≤ 1

Subdifferential of ℓ₁ norm
g is a subgradient of the ℓ₁ norm at x ∈ R^n if and only if

  g[i] = sign(x[i])  if x[i] ≠ 0
  |g[i]| ≤ 1         if x[i] = 0

Proof
g is a subgradient of ‖·‖₁ at x if and only if g[i] is a subgradient of |·| at x[i] for all 1 ≤ i ≤ n

Proof
If g is a subgradient of ‖·‖₁ at x, then for any y ∈ R

  |y| − |x[i]| = ‖x + (y − x[i]) e_i‖₁ − ‖x‖₁
              ≥ g^T (y − x[i]) e_i
              = g[i] (y − x[i])

so g[i] is a subgradient of |·| at x[i] for all 1 ≤ i ≤ n

Proof
If g[i] is a subgradient of |·| at x[i] for 1 ≤ i ≤ n, then for any y ∈ R^n

  ‖y‖₁ = Σ_{i=1}^n |y[i]|
       ≥ Σ_{i=1}^n ( |x[i]| + g[i] (y[i] − x[i]) )
       = ‖x‖₁ + g^T (y − x)

so g is a subgradient of ‖·‖₁ at x

Subdifferential of ℓ₁ norm
[Figures]

Subdifferential of the nuclear norm
Let X ∈ R^{m×n} be a rank-r matrix with SVD X = U S V^T, where U ∈ R^{m×r}, V ∈ R^{n×r}, and S ∈ R^{r×r}
A matrix G is a subgradient of the nuclear norm at X if and only if

  G := U V^T + W

where W satisfies

  ‖W‖ ≤ 1,  U^T W = 0,  W V = 0
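This construction can be checked numerically. The following sketch (NumPy assumed; sizes and scales are illustrative) builds a rank-1 X, a valid W via projections onto the orthogonal complements, and tests the subgradient inequality on random matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
nuclear = lambda M: np.linalg.svd(M, compute_uv=False).sum()

# Rank-1 X = s * u v^T, so U = u and V = v in the statement above.
u = rng.standard_normal(4)
u /= np.linalg.norm(u)
v = rng.standard_normal(5)
v /= np.linalg.norm(v)
X = 3.0 * np.outer(u, v)

# Build W with u^T W = 0 and W v = 0 by projecting a random matrix onto
# the orthogonal complements, then scale it to have spectral norm 1.
M = rng.standard_normal((4, 5))
W = (np.eye(4) - np.outer(u, u)) @ M @ (np.eye(5) - np.outer(v, v))
W /= np.linalg.norm(W, 2)
G = np.outer(u, v) + W

# Subgradient inequality ||Y||_* >= ||X||_* + <G, Y - X> on random Y.
holds = all(
    nuclear(Y) >= nuclear(X) + np.sum(G * (Y - X)) - 1e-8
    for Y in (rng.standard_normal((4, 5)) for _ in range(200))
)
```

Note that ⟨G, X⟩ = ‖X‖_* by construction, which is the key identity used in the proof below.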

Proof
By Pythagoras' theorem, for any x ∈ R^n with unit ℓ₂ norm we have

  ‖P_row(X) x‖₂² + ‖P_row(X)^⊥ x‖₂² = ‖x‖₂² = 1

The rows of U V^T are in row(X) and the rows of W are in row(X)^⊥, so

  ‖G‖² = max_{‖x‖₂ = 1, x ∈ R^n} ‖G x‖₂²
       = max_{‖x‖₂ = 1, x ∈ R^n} ‖U V^T x‖₂² + ‖W x‖₂²
       = max_{‖x‖₂ = 1, x ∈ R^n} ‖U V^T P_row(X) x‖₂² + ‖W P_row(X)^⊥ x‖₂²
       ≤ max_{‖x‖₂ = 1, x ∈ R^n} ‖U V^T‖² ‖P_row(X) x‖₂² + ‖W‖² ‖P_row(X)^⊥ x‖₂²
       ≤ 1

Hölder's inequality for matrices
For any matrix A ∈ R^{m×n},

  ‖A‖_* = sup_{‖B‖ ≤ 1, B ∈ R^{m×n}} ⟨A, B⟩

Proof
U^T W = 0 implies

  ⟨W, X⟩ = ⟨W, U S V^T⟩ = ⟨U^T W, S V^T⟩ = 0

and

  ⟨U V^T, X⟩ = tr(V U^T X) = tr(V U^T U S V^T) = tr(V^T V S) = tr(S) = ‖X‖_*

So for any matrix Y ∈ R^{m×n}, since ‖G‖ ≤ 1,

  ‖Y‖_* ≥ ⟨G, Y⟩
        = ⟨G, X⟩ + ⟨G, Y − X⟩
        = ⟨U V^T, X⟩ + ⟨W, X⟩ + ⟨G, Y − X⟩
        = ‖X‖_* + ⟨G, Y − X⟩

Sparse linear regression with 2 features

  y := α x₁ + z,  X := [x₁ x₂],  ‖x₁‖₂ = ‖x₂‖₂ = 1,  ⟨x₁, x₂⟩ = ρ

Analysis of lasso estimator
Let α ≥ 0. Then

  β_lasso = [ α + x₁^T z − λ ; 0 ]

as long as

  |x₂^T z − ρ x₁^T z| / (1 − ρ) ≤ λ ≤ α + x₁^T z

Lasso estimator
[Figure: lasso coefficients as a function of the regularization parameter]

Optimality condition for nondifferentiable functions
If 0 is a subgradient of f at x, then for all y ∈ R^n

  f(y) ≥ f(x) + 0^T (y − x) = f(x)

so x is a minimum
Under strict convexity the minimum is unique

Proof
The cost function is strictly convex if n ≥ 2 and ρ ≠ ±1
Aim: show that there is a subgradient equal to 0 at a 1-sparse solution

Proof
The gradient of the quadratic term q(β) := (1/2) ‖Xβ − y‖₂² at β_lasso equals

  ∇q(β_lasso) = X^T (X β_lasso − y)

Proof
If only the first entry of β_lasso is nonzero and nonnegative, then

  g_{ℓ₁} := [ 1 ; γ ]

is a subgradient of the ℓ₁ norm at β_lasso for any γ ∈ R such that |γ| ≤ 1
In that case g_lasso := ∇q(β_lasso) + λ g_{ℓ₁} is a subgradient of the cost function at β_lasso
If g_lasso = 0 then β_lasso is the unique solution

Proof

  g_lasso := X^T (X β_lasso − y) + λ [ 1 ; γ ]
           = X^T (β_lasso[1] x₁ − α x₁ − z) + λ [ 1 ; γ ]
           = [ x₁^T (β_lasso[1] x₁ − α x₁ − z) + λ ;
               x₂^T (β_lasso[1] x₁ − α x₁ − z) + λγ ]
           = [ β_lasso[1] − α − x₁^T z + λ ;
               ρ β_lasso[1] − ρα − x₂^T z + λγ ]

Proof

  g_lasso = [ β_lasso[1] − α − x₁^T z + λ ; ρ β_lasso[1] − ρα − x₂^T z + λγ ]

is equal to 0 if

  β_lasso[1] = α + x₁^T z − λ
  γ = (ρα + x₂^T z − ρ β_lasso[1]) / λ = (x₂^T z − ρ x₁^T z) / λ + ρ

Proof
We still need to check that this is a valid subgradient at β_lasso, i.e. that β_lasso[1] is nonnegative, which holds if λ ≤ α + x₁^T z, and that |γ| ≤ 1, i.e.

  |x₂^T z − ρ x₁^T z| / λ + ρ ≤ 1

which holds if λ ≥ |x₂^T z − ρ x₁^T z| / (1 − ρ)

Robust PCA
Data: Y ∈ R^{n×m}
Robust PCA estimator of the low-rank component:

  L_RPCA := arg min_L ‖L‖_* + λ ‖Y − L‖₁

where λ > 0 is a regularization parameter and ‖·‖₁ is the ℓ₁ norm of the vectorized matrix
Robust PCA estimator of the sparse component: S_RPCA := Y − L_RPCA

Example

  Y := [  2  1  α −1 −2
          2  1  0 −1 −2
         −2 −1  0  1  2 ]

Analysis of robust PCA estimator
The robust PCA estimates of both components are exact for any value of α as long as 2/√30 < λ < √(2/3)

Example
[Figure: entries of the low-rank and sparse components as a function of λ]

Optimality + uniqueness condition
Let Y := L + S, where L, S ∈ R^{m×n} and L = U_L S_L V_L^T has rank r, with U_L ∈ R^{m×r}, V_L ∈ R^{n×r}, S_L ∈ R^{r×r}
Assume there exists G := U_L V_L^T + W, where W satisfies ‖W‖ < 1, U_L^T W = 0, W V_L = 0, and there also exists a matrix G_{ℓ₁} satisfying

  G_{ℓ₁}[i, j] = sign(S[i, j])  if S[i, j] ≠ 0,   (1)
  |G_{ℓ₁}[i, j]| < 1            otherwise,        (2)

such that G + λ G_{ℓ₁} = 0
Then the solution to the robust PCA problem is unique and equal to L

Optimality + uniqueness condition
G := U_L V_L^T + W is a subgradient of the nuclear norm at L
G_{ℓ₁} is a subgradient of ‖Y − ·‖₁ at L
G + λ G_{ℓ₁} is a subgradient of the cost function at L
G + λ G_{ℓ₁} = 0 implies that L is a solution (uniqueness is more difficult to prove)

Example

  Y := [  2  1  α −1 −2
          2  1  0 −1 −2
         −2 −1  0  1  2 ]

We want to show that the solution is

  L := [  2  1  0 −1 −2        S := [ 0 0 α 0 0
          2  1  0 −1 −2               0 0 0 0 0
         −2 −1  0  1  2 ]             0 0 0 0 0 ]

The SVD of L is

  L = (1/√3) [1; 1; −1] · √30 · (1/√10) [2 1 0 −1 −2]

so

  U_L V_L^T = (1/√30) [1; 1; −1] [2 1 0 −1 −2]

Example

  G + λ G_{ℓ₁} = (1/√30) [  2  1  0 −1 −2              [ g₁  g₂  sign(α) g₃  g₄
                            2  1  0 −1 −2    + W + λ     g₅  g₆  g₇      g₈  g₉
                           −2 −1  0  1  2 ]              g₁₀ g₁₁ g₁₂     g₁₃ g₁₄ ]

Setting the (1, 3) entry to zero requires W[1, 3] = −λ sign(α)
W V_L = 0 then holds automatically (the third entry of the right singular vector is zero), and U_L^T W = 0 is satisfied by choosing

  W[2, 3] = λ sign(α)/2,  W[3, 3] = −λ sign(α)/2

with all other entries of W equal to zero; the remaining free entries g_i are chosen to cancel the corresponding entries of U_L V_L^T + W
Is |G_{ℓ₁}[i, j]| < 1 for S[i, j] = 0? The largest such entry is 2/(√30 λ), so we need λ > 2/√30
Is ‖W‖ < 1? ‖W‖ = λ √(3/2), so we need λ < √(2/3)

Applications Subgradients Optimization methods

Subgradient method
Optimization problem: minimize f(x), where f is convex but nondifferentiable
Subgradient-method iteration:

  x^(0) = arbitrary initialization
  x^(k+1) = x^(k) − α_k g^(k)

where g^(k) is a subgradient of f at x^(k)

Least-squares regression with ℓ₁-norm regularization

  minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁

Subgradient at x^(k):

  g^(k) = A^T (A x^(k) − y) + λ sign(x^(k))

Subgradient-method iteration:

  x^(0) = arbitrary initialization
  x^(k+1) = x^(k) − α_k (A^T (A x^(k) − y) + λ sign(x^(k)))
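A direct implementation of this iteration with diminishing steps (a sketch, NumPy assumed; the problem sizes are illustrative, and sign(0) = 0 conveniently picks a valid subgradient of |·| at zero entries):

```python
import numpy as np

def subgradient_method(A, y, lam, alpha0=0.1, iters=500):
    x = np.zeros(A.shape[1])
    best_x, best_cost = x.copy(), np.inf
    for k in range(iters):
        g = A.T @ (A @ x - y) + lam * np.sign(x)   # a subgradient at x(k)
        x = x - alpha0 / np.sqrt(k + 1) * g        # diminishing step size
        cost = 0.5 * np.sum((A @ x - y) ** 2) + lam * np.abs(x).sum()
        if cost < best_cost:   # not a descent method: track the best iterate
            best_x, best_cost = x.copy(), cost
    return best_x, best_cost

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 20)) / np.sqrt(50)
x_true = np.zeros(20)
x_true[:3] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(50)
x_best, cost_best = subgradient_method(A, y, lam=0.05)
```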

Convergence of subgradient method
It is not a descent method
Convergence rate can be shown to be O(1/ε²)
Diminishing step sizes are necessary for convergence
Experiment: minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁ with A ∈ R^{2000×1000}, y = A x₀ + z, where x₀ is 100-sparse and z is iid Gaussian

Convergence of subgradient method
[Figure: f(x^(k)) − f(x*) over 100 iterations for step sizes α₀, α₀/√k, and α₀/k]

Convergence of subgradient method
[Figure: f(x^(k)) − f(x*) over 5000 iterations for the same step-size choices]

Composite functions
An interesting class of functions for data analysis:

  f(x) + h(x)

with f convex and differentiable, h convex but not differentiable
Example: (1/2) ‖Ax − y‖₂² + λ ‖x‖₁

Motivation
Aim: minimize a convex differentiable function f
Idea: iteratively minimize the first-order approximation, while staying close to the current point

  x^(0) = arbitrary initialization
  x^(k+1) = arg min_x f(x^(k)) + ∇f(x^(k))^T (x − x^(k)) + (1/(2 α_k)) ‖x − x^(k)‖₂²

where α_k is a parameter that determines how close we stay

Motivation
The linear approximation + ℓ₂ term is convex:

  f(x^(k)) + ∇f(x^(k))^T (x − x^(k)) + (1/(2 α_k)) ‖x − x^(k)‖₂²
    = (1/(2 α_k)) ‖x − x^(k) + α_k ∇f(x^(k))‖₂² + constant

Setting the gradient to zero,

  x^(k+1) = arg min_x f(x^(k)) + ∇f(x^(k))^T (x − x^(k)) + (1/(2 α_k)) ‖x − x^(k)‖₂²
          = x^(k) − α_k ∇f(x^(k))

so the method reduces to gradient descent

Proximal gradient method
Idea: minimize the local first-order approximation + h

  x^(k+1) = arg min_x f(x^(k)) + ∇f(x^(k))^T (x − x^(k)) + (1/(2 α_k)) ‖x − x^(k)‖₂² + h(x)
          = arg min_x (1/2) ‖x − (x^(k) − α_k ∇f(x^(k)))‖₂² + α_k h(x)
          = prox_{α_k h} (x^(k) − α_k ∇f(x^(k)))

Proximal operator:

  prox_h(y) := arg min_x h(x) + (1/2) ‖y − x‖₂²

Proximal gradient method
Method to solve the optimization problem minimize f(x) + h(x), where f is differentiable and prox_h is tractable
Proximal-gradient iteration:

  x^(0) = arbitrary initialization
  x^(k+1) = prox_{α_k h} (x^(k) − α_k ∇f(x^(k)))
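The iteration is generic: it only needs ∇f and the proximal operator of h. A sketch (NumPy assumed; the example and sizes are illustrative) using nonnegative least squares, where h is the indicator of the nonnegative orthant and prox_h reduces to entrywise clipping:

```python
import numpy as np

def proximal_gradient(grad_f, prox_h, x0, alpha, iters=2000):
    # x(k+1) = prox_{alpha*h}(x(k) - alpha * grad_f(x(k)))
    x = x0
    for _ in range(iters):
        x = prox_h(x - alpha * grad_f(x), alpha)
    return x

rng = np.random.default_rng(6)
A = rng.standard_normal((30, 10))
x_true = np.abs(rng.standard_normal(10))
y = A @ x_true                                   # noiseless for simplicity

grad_f = lambda x: A.T @ (A @ x - y)             # f(x) = 0.5*||Ax - y||^2
prox_h = lambda v, a: np.maximum(v, 0.0)         # projection onto x >= 0
alpha = 1.0 / np.linalg.norm(A, 2) ** 2          # constant step 1/L
x_hat = proximal_gradient(grad_f, prox_h, np.zeros(10), alpha)
```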

Interpretation as a fixed-point method
A vector x* is a solution to minimize f(x) + h(x) if and only if it is a fixed point of the proximal-gradient iteration for any α > 0:

  x* = prox_{α h} (x* − α ∇f(x*))

Proof
x* is the solution to

  min_x α h(x) + (1/2) ‖x* − α ∇f(x*) − x‖₂²    (3)

if and only if there is a subgradient g of h at x* such that α ∇f(x*) + α g = 0
x* minimizes f + h if and only if there is a subgradient g of h at x* such that ∇f(x*) + g = 0
The two conditions are equivalent

Proximal operator of ℓ₁ norm
The proximal operator of the ℓ₁ norm is the soft-thresholding operator

  prox_{α ‖·‖₁}(y) = S_α(y)

where α > 0 and

  S_α(y)[i] := y[i] − sign(y[i]) α   if |y[i]| ≥ α
               0                     otherwise
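A sketch that checks this closed form against brute-force minimization of α|x| + (1/2)(y − x)² on a fine grid (NumPy assumed; the test values of y are arbitrary):

```python
import numpy as np

def soft_threshold(y, alpha):
    # Entrywise: y - sign(y)*alpha where |y| >= alpha, and 0 otherwise.
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

alpha = 0.9
grid = np.linspace(-10.0, 10.0, 200001)   # grid spacing 1e-4
max_err = 0.0
for y in [-3.2, -0.4, 0.0, 0.7, 2.5]:
    objective = alpha * np.abs(grid) + 0.5 * (y - grid) ** 2
    x_brute = grid[np.argmin(objective)]          # brute-force minimizer
    x_closed = soft_threshold(np.array([y]), alpha)[0]
    max_err = max(max_err, abs(x_closed - x_brute))
```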

Proof

  α ‖x‖₁ + (1/2) ‖y − x‖₂² = Σ_{i=1}^m ( α |x[i]| + (1/2) (y[i] − x[i])² )

decomposes entrywise, so we can just consider

  w(x) := α |x| + (1/2) (y − x)²

If x ≥ 0: w(x) = α x + (1/2) (y − x)², so w′(x) = x − (y − α). If y ≥ α the minimum over x ≥ 0 is at x := y − α; if y < α it is at 0
If x < 0: w′(x) = x − (y + α). If y ≤ −α the minimum over x < 0 is at x := y + α; if y > −α it is at 0
If −α ≤ y ≤ α the minimum is at x := 0
If y ≥ α the minimum is at x := y − α or at x := 0, but

  w(y − α) = α (y − α) + α²/2 = α y − α²/2 ≤ y²/2 = w(0)

because (y − α)² ≥ 0
The same argument applies for y < −α

Iterative Shrinkage-Thresholding Algorithm (ISTA)
The proximal gradient method for the problem minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁ is called ISTA
ISTA iteration:

  x^(0) = arbitrary initialization
  x^(k+1) = S_{α_k λ} (x^(k) − α_k A^T (A x^(k) − y))

Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)
ISTA can be accelerated using Nesterov's accelerated gradient method
FISTA iteration:

  x^(0) = arbitrary initialization
  z^(0) = x^(0)
  x^(k+1) = S_{α_k λ} (z^(k) − α_k A^T (A z^(k) − y))
  z^(k+1) = x^(k+1) + (k/(k + 3)) (x^(k+1) − x^(k))
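A compact implementation of the FISTA iteration with a constant step 1/L (a sketch, NumPy assumed; the problem sizes, sparsity pattern, and λ are illustrative):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(A, y, lam, iters=300):
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2    # constant step 1/L
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(iters):
        x_new = soft_threshold(z - alpha * A.T @ (A @ z - y), alpha * lam)
        z = x_new + k / (k + 3) * (x_new - x)  # Nesterov momentum
        x = x_new
    return x

rng = np.random.default_rng(7)
A = rng.standard_normal((100, 40)) / 10.0      # roughly unit-norm columns
x_true = np.zeros(40)
x_true[[3, 17, 25]] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.001 * rng.standard_normal(100)
x_hat = fista(A, y, lam=0.01)
```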

Convergence of proximal gradient method
Without acceleration: a descent method; convergence rate can be shown to be O(1/ε) with a constant step or backtracking line search
With acceleration: not a descent method; convergence rate can be shown to be O(1/√ε) with a constant step or backtracking line search
Experiment: minimize (1/2) ‖Ax − y‖₂² + λ ‖x‖₁ with A ∈ R^{2000×1000}, y = A x₀ + z, x₀ 100-sparse and z iid Gaussian

Convergence of proximal gradient method
[Figure: f(x^(k)) − f(x*) for the subgradient method (α₀/√k), ISTA, and FISTA]

Coordinate descent
Idea: solve the n-dimensional problem minimize c(x[1], x[2], …, x[n]) by solving a sequence of 1D problems
Coordinate-descent iteration:

  x^(0) = arbitrary initialization
  x^(k+1)[i] = arg min_α c(x^(k)[1], …, α, …, x^(k)[n])

for some 1 ≤ i ≤ n

Coordinate descent
Convergence is guaranteed for functions of the form

  f(x) + Σ_{i=1}^n h_i(x[i])

where f is convex and differentiable and h₁, …, h_n are convex

Least-squares regression with ℓ₁-norm regularization

  h(x) := (1/2) ‖Ax − y‖₂² + λ ‖x‖₁

The solution to the subproblem min_{x[i]} h(x[1], …, x[i], …, x[n]) is

  x̂[i] = S_λ(γ_i) / ‖A_i‖₂²

where A_i is the i-th column of A and

  γ_i := Σ_{l=1}^m A_{li} ( y[l] − Σ_{j≠i} A_{lj} x[j] )
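Putting this closed-form coordinate update into a cyclic sweep gives a complete lasso solver (a sketch, NumPy assumed; the data and λ are illustrative):

```python
import numpy as np

def lasso_coordinate_descent(A, y, lam, sweeps=100):
    n, p = A.shape
    x = np.zeros(p)
    col_sq = (A ** 2).sum(axis=0)           # ||A_i||_2^2 for each column
    for _ in range(sweeps):
        for i in range(p):
            # Residual with feature i excluded.
            r = y - A @ x + A[:, i] * x[i]
            gamma = A[:, i] @ r
            # 1D solution: soft-threshold gamma, divide by ||A_i||_2^2.
            x[i] = np.sign(gamma) * max(abs(gamma) - lam, 0.0) / col_sq[i]
    return x

rng = np.random.default_rng(8)
A = rng.standard_normal((60, 15))
x_true = np.zeros(15)
x_true[[1, 8]] = [1.5, -2.0]
y = A @ x_true                              # noiseless for simplicity
x_hat = lasso_coordinate_descent(A, y, lam=0.5)
```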