Gap Safe Screening Rules for Sparse-Group Lasso

Transcript:

Gap Safe Screening Rules for Sparse-Group Lasso
Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon
LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France

Sparse regression
$y \in \mathbb{R}^n$: signal, $X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$: design matrix,
$y = X\beta + \varepsilon$ where $\varepsilon$ is a noise term and $p \gg n$.
Objective: approximate $y \approx X\hat\beta$ with a sparse vector $\hat\beta \in \mathbb{R}^p$.
The Lasso estimator, Tibshirani (1996): $\Omega(\beta) = \|\beta\|_1$ and
$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{data fitting}} + \underbrace{\lambda\,\Omega(\beta)}_{\text{sparsity}}$

Sparsity inducing norms
$\ell_1$: $\Omega(\beta) = \|\beta\|_1$
$\ell_1/\ell_2$: $\Omega(\beta) = \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2$
$\ell_1 + \ell_1/\ell_2$: $\Omega(\beta) = \tau \|\beta\|_1 + (1-\tau) \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2$
where $w_g \ge 0$ is the weight of the group $g$ and $\tau \in (0, 1)$.
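As a concrete illustration of these three penalties, here is a minimal NumPy sketch, assuming the groups are given as a list of index arrays with a matching list of weights (the names and interface are ours, not the paper's code):

```python
import numpy as np

def lasso_norm(beta):
    """l1 penalty: sum of absolute values."""
    return np.sum(np.abs(beta))

def group_lasso_norm(beta, groups, weights):
    """l1/l2 penalty: weighted sum of group-wise Euclidean norms."""
    return sum(w * np.linalg.norm(beta[g]) for g, w in zip(groups, weights))

def sparse_group_lasso_norm(beta, groups, weights, tau):
    """Convex combination of the two penalties, with tau in (0, 1)."""
    return tau * lasso_norm(beta) + (1 - tau) * group_lasso_norm(beta, groups, weights)

# Example: p = 6 features split into two groups of size 3
beta = np.array([1.0, 0.0, -2.0, 0.0, 0.0, 0.0])
groups = [np.arange(0, 3), np.arange(3, 6)]
weights = [np.sqrt(3), np.sqrt(3)]
print(sparse_group_lasso_norm(beta, groups, weights, tau=0.5))
```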

Climate dataset Group of 7 features [Air Temperature, Precipitable water, Relative humidity, Pressure, Sea Level Pressure, Horizontal Wind Speed, Vertical Wind Speed]

Convex optimization problem
$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{data fitting}} + \underbrace{\lambda\,\Omega(\beta)}_{\text{sparsity}}$
Algorithms:
ISTA/FISTA, Beck & Teboulle (2009)
Coordinate descent, Friedman et al. (2007)

How can we speed up those algorithms by using sparsity information?
Support: $\hat S_\lambda := \{ j \in [p] : \hat\beta_j \neq 0 \}$. Sparsity: $|\hat S_\lambda| \ll p$.
Idea: solve the optimization problem by restricting it to the support.
But the support $\hat S_\lambda$ is unknown!!!

Optimality condition
Sub-differential: $\partial f(x^\star) = \{ z \in \mathbb{R}^d : \forall y \in \mathbb{R}^d,\ f(y) \ge f(x^\star) + z^\top (y - x^\star) \}$
If $f$ is differentiable at $x^\star$: $\partial f(x^\star) = \{ \nabla f(x^\star) \}$.
Fermat's rule: for any convex function $f : \mathbb{R}^d \to \mathbb{R}$,
$x^\star \in \arg\min_{x \in \mathbb{R}^d} f(x) \iff 0 \in \partial f(x^\star)$.

Critical threshold: $\lambda_{\max}$
Objective primal function: $P_\lambda(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\,\Omega(\beta)$.
Fermat's rule: $x^\star \in \arg\min_{x \in \mathbb{R}^d} f(x) \iff 0 \in \partial f(x^\star)$, hence
$0 \in \arg\min_{\beta \in \mathbb{R}^p} P_\lambda(\beta) \iff 0 \in \partial P_\lambda(0) = \{ -X^\top y \} + \lambda\,\partial\Omega(0) \iff \Omega^D(X^\top y) \le \lambda$
Dual norm: $\Omega^D(\xi) := \max_{\Omega(\beta) \le 1} \langle \beta, \xi \rangle$

First screening rule
$0 \in \arg\min_{\beta \in \mathbb{R}^p} P_\lambda(\beta) \iff \Omega^D(X^\top y) \le \lambda$
Let $\lambda_{\max} := \Omega^D(X^\top y)$. For all $\lambda \ge \lambda_{\max}$, we have $\hat\beta = 0$.
From now on, we only consider the case $\lambda < \lambda_{\max}$.
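Since $\lambda_{\max}$ only involves the dual norm of $X^\top y$, it is cheap to compute. Below is a minimal NumPy sketch for the Lasso and Group Lasso cases (the Sparse-Group Lasso case additionally needs the $\epsilon$-norm introduced later); the function names are ours:

```python
import numpy as np

def lambda_max_lasso(X, y):
    """Dual norm of X^T y for the l1 penalty: the l-infinity norm."""
    return np.max(np.abs(X.T @ y))

def lambda_max_group_lasso(X, y, groups, weights):
    """Dual norm of X^T y for the l1/l2 penalty: max over groups of ||X_g^T y||_2 / w_g."""
    Xty = X.T @ y
    return max(np.linalg.norm(Xty[g]) / w for g, w in zip(groups, weights))

# Any lambda >= lambda_max gives the all-zero solution, so a regularization
# path is typically a decreasing grid of values below lambda_max.
```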

Duality
Primal problem: $\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\,\Omega(\beta)}_{P_\lambda(\beta)}$
Dual problem: $\hat\theta = \arg\max_{\theta \in \Delta_X} \underbrace{\tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\big\|\theta - \tfrac{y}{\lambda}\big\|_2^2}_{D_\lambda(\theta)}$
Feasible set: $\Delta_X = \{ \theta \in \mathbb{R}^n : \Omega^D(X^\top \theta) \le 1 \}$
Strong duality: $P_\lambda(\hat\beta) = D_\lambda(\hat\theta)$
KKT optimality conditions: $\lambda\hat\theta = y - X\hat\beta$ (link equation), $X^\top \hat\theta \in \partial\Omega(\hat\beta)$ (sub-differential inclusion).

Screening rules for separable norms
$\Omega(\beta) = \sum_{g \in \mathcal{G}} \Omega_g(\beta_g)$, $\quad \Omega^D(\beta) = \max_{g \in \mathcal{G}} \Omega^D_g(\beta_g)$
$\forall \lambda > 0, \forall g \in \mathcal{G}$: $\Omega^D_g(X_g^\top \hat\theta) < 1 \implies \hat\beta_g = 0$.
Proof: sub-differential inclusion: $\forall g \in \mathcal{G}$, $X_g^\top \hat\theta \in \partial\Omega_g(\hat\beta_g)$.
Sub-differential of a norm:
$\partial\Omega(x) = \{ z \in \mathbb{R}^d : \Omega^D(z) \le 1 \} = B_{\Omega^D}$ if $x = 0$,
$\partial\Omega(x) = \{ z \in \mathbb{R}^d : \Omega^D(z) = 1 \text{ and } z^\top x = \Omega(x) \}$ otherwise.
Hence $\hat\beta_g \neq 0 \implies \Omega^D_g(X_g^\top \hat\theta) = 1$, which gives the rule by contraposition.

Screening rule for the Lasso
$\Omega(\beta) = \|\beta\|_1$, $\quad \Omega^D(\xi) = \max_{j \in [p]} |\xi_j|$
$\forall j \in [p]$: $|X_j^\top \hat\theta| < 1 \implies \hat\beta_j = 0$

Screening rule for the Group Lasso
$\Omega(\beta) = \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2$, $\quad \Omega^D(\xi) = \max_{g \in \mathcal{G}} \dfrac{\|\xi_g\|_2}{w_g}$
$\forall g \in \mathcal{G}$: $\dfrac{\|X_g^\top \hat\theta\|_2}{w_g} < 1 \implies \hat\beta_g = 0$

Screening rule for the Sparse-Group Lasso
$\Omega(\beta) = \sum_{g \in \mathcal{G}} \tau \|\beta_g\|_1 + (1-\tau) w_g \|\beta_g\|_2$, $\quad \Omega^D(\xi) = \max_{g \in \mathcal{G}} \dfrac{\|\xi_g\|_{\epsilon_g}}{\tau + (1-\tau) w_g}$
$\epsilon$-norm, Burdakov (1988), Burdakov & Merkulov (2002): for $\epsilon \in [0, 1]$, $\|x\|_\epsilon$ is the solution in $\nu$ of
$\sum_{i=1}^{d} \big( |x_i| - (1-\epsilon)\nu \big)_+^2 = (\epsilon \nu)^2$.

Screening for the Sparse-Group Lasso
Here $\Omega^D_g(\xi_g) = \dfrac{\|\xi_g\|_{\epsilon_g}}{\tau + (1-\tau) w_g}$, so the rule $\Omega^D_g(X_g^\top \hat\theta) < 1 \implies \hat\beta_g = 0$ yields two tests.
Group level screening: $\forall g \in \mathcal{G}$, $\dfrac{\|X_g^\top \hat\theta\|_{\epsilon_g}}{\tau + (1-\tau) w_g} < 1 \implies \hat\beta_g = 0$.
Feature level screening: $\forall j \in g$, $|X_j^\top \hat\theta| < \tau \implies \hat\beta_j = 0$.
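To make the $\epsilon$-norm and the two tests concrete, here is a small sketch using SciPy root-finding on the defining equation; the released implementation uses a dedicated routine, $\epsilon_g$ is left as a parameter (its expression is specified in the paper), and all names below are ours:

```python
import numpy as np
from scipy.optimize import brentq

def epsilon_norm(x, eps):
    """Value nu solving sum_i (|x_i| - (1 - eps) * nu)_+^2 = (eps * nu)^2.

    Simple root-finding sketch of the definition above; we assume the
    convention that eps = 0 gives the l-infinity norm and eps = 1 the l2 norm.
    """
    x = np.abs(np.asarray(x, dtype=float))
    if not np.any(x):
        return 0.0
    if eps == 0.0:
        return x.max()
    if eps == 1.0:
        return np.linalg.norm(x)

    def f(nu):
        return np.sum(np.maximum(x - (1 - eps) * nu, 0.0) ** 2) - (eps * nu) ** 2

    hi = x.max() / (1 - eps)   # f(0) > 0 and f(hi) < 0, so a root exists in (0, hi)
    return brentq(f, 0.0, hi)

def sgl_group_test(Xg, theta, eps_g, tau, w_g):
    """Group-level test at a known dual point theta (illustrative)."""
    return epsilon_norm(Xg.T @ theta, eps_g) / (tau + (1 - tau) * w_g) < 1.0

def sgl_feature_test(xj, theta, tau):
    """Feature-level test: |x_j^T theta| < tau implies beta_j = 0."""
    return abs(xj @ theta) < tau
```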

Screening rule for separable norms
$\forall g \in \mathcal{G}$: $\Omega^D_g(X_g^\top \hat\theta) < 1 \implies \hat\beta_g = 0$.
But $\hat\theta$ IS UNKNOWN!!!

Safe screening rules
Find a safe region $\mathcal{R}$ such that $\hat\theta \in \mathcal{R}$:
$\Omega^D_g(X_g^\top \hat\theta) \le \max_{\theta \in \mathcal{R}} \Omega^D_g(X_g^\top \theta) < 1 \implies \hat\beta_g = 0$.
Desirable properties of a safe region $\mathcal{R}$:
as small as possible;
computation of $\max_{\theta \in \mathcal{R}} \Omega^D_g(X_g^\top \theta)$ is cheap.
Ball as a safe region: for $\mathcal{R} = B(c, r)$,
$\Omega^D_g(X_g^\top c) + r\,\Omega^D_g(X_g) < 1 \implies \hat\beta_g = 0$.
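For the Lasso, the maximum of $|x_j^\top \theta|$ over a ball $B(c, r)$ has the closed form $|x_j^\top c| + r\,\|x_j\|_2$, which gives a one-line test; the sketch below uses our own naming:

```python
import numpy as np

def safe_sphere_test_lasso(xj, center, radius):
    """max over theta in B(center, radius) of |x_j^T theta| is
    |x_j^T center| + radius * ||x_j||_2; if it is below 1,
    feature j can be safely discarded (Lasso case)."""
    return abs(xj @ center) + radius * np.linalg.norm(xj) < 1.0
```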

How to construct a safe region $\mathcal{R}$ such that $\hat\theta \in \mathcal{R}$?

Geometrical interpretation of the dual
$\hat\theta = \arg\max_{\theta \in \Delta_X} \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\big\|\theta - \tfrac{y}{\lambda}\big\|_2^2$, $\quad \Delta_X = \{ \theta \in \mathbb{R}^n : \Omega^D(X^\top \theta) \le 1 \}$.
Equivalently, $\hat\theta = \arg\min_{\theta \in \Delta_X} \big\|\theta - \tfrac{y}{\lambda}\big\| = \Pi_{\Delta_X}\big(\tfrac{y}{\lambda}\big)$: the dual solution is the projection of $y/\lambda$ onto the feasible set.

Visualization of the feasible set $\Delta_X = \{ \theta \in \mathbb{R}^n : \Omega^D(X^\top \theta) \le 1 \}$.
Figure: (a) Lasso, (b) Group-Lasso, (c) Sparse-Group Lasso.

Seminal safe region, El Ghaoui et al. (2012): $\hat\theta \in B\Big(\tfrac{y}{\lambda},\ \big\|\tfrac{y}{\lambda_{\max}} - \tfrac{y}{\lambda}\big\|\Big)$

Dynamic safe region, Bonnefoy et al. (2014): $\hat\theta \in B\Big(\tfrac{y}{\lambda},\ \big\|\theta_k - \tfrac{y}{\lambda}\big\|\Big)$

Critical limitations of those methods. Can we do better?

Yes we can! Take $\mathcal{R} = B(c, r)$.
Theoretical screening rule: $\Omega^D_g(X_g^\top \hat\theta) < 1 \implies \hat\beta_g = 0$.
Safe sphere test: $\Omega^D_g(X_g^\top c) + r\,\Omega^D_g(X_g) < 1 \implies \hat\beta_g = 0$.
Objectives: $c$ as close as possible to $\hat\theta$; $r$ as small as possible.

Mind the duality gap ;-) Fercoq, Gramfort & Salmon (ICML 2015)
For all $\theta \in \Delta_X$ and all $\beta \in \mathbb{R}^p$: $\hat\theta \in B\big(\theta, r_\lambda(\beta, \theta)\big)$ where
$r_\lambda(\beta, \theta) = \sqrt{\dfrac{2\big(P_\lambda(\beta) - D_\lambda(\theta)\big)}{\lambda^2}} = \sqrt{\dfrac{2\,G_\lambda(\beta, \theta)}{\lambda^2}}$.

How to compute a $\theta \in \Delta_X$?
$\Delta_X = \{ \theta \in \mathbb{R}^n : \Omega^D(X^\top \theta) \le 1 \}$ (feasible set)
$\hat\theta = \dfrac{y - X\hat\beta}{\lambda}$ (link equation)
Mimic the link equation with the current iterate: $\theta_k = \dfrac{y - X\beta_k}{\alpha}$ with $\alpha$ such that $\Omega^D(X^\top \theta_k) \le 1$,
e.g. $\alpha = \max\big(\lambda, \Omega^D(X^\top \rho_k)\big)$ where $\rho_k = y - X\beta_k$.
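Putting the link-equation rescaling and the gap radius together gives, for the Lasso case, the following sketch (our own naming; other penalties only change the dual norm used for rescaling and the penalty term in the primal objective):

```python
import numpy as np

def dual_point_and_radius_lasso(X, y, beta, lam):
    """Feasible dual point by residual rescaling and the Gap Safe radius (Lasso case)."""
    residual = y - X @ beta
    alpha = max(lam, np.max(np.abs(X.T @ residual)))   # ensures Omega^D(X^T theta) <= 1
    theta = residual / alpha

    primal = 0.5 * residual @ residual + lam * np.sum(np.abs(beta))
    dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    gap = primal - dual                                 # duality gap G_lambda(beta, theta)
    radius = np.sqrt(2 * max(gap, 0.0)) / lam           # guard against tiny negative values
    return theta, radius
```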

Convergence of the Gap Safe region
$\lim_{k \in \mathbb{N}} \beta_k = \hat\beta \implies \lim_{k \in \mathbb{N}} \theta_k = \hat\theta$
$\lim_{k \in \mathbb{N}} G_\lambda(\beta_k, \theta_k) = G_\lambda(\hat\beta, \hat\theta) = 0$
$\lim_{k \in \mathbb{N}} r_\lambda(\beta_k, \theta_k) = \lim_{k \in \mathbb{N}} \sqrt{\dfrac{2\,G_\lambda(\beta_k, \theta_k)}{\lambda^2}} = 0$
$B\Big(\theta_k, \sqrt{\dfrac{2\,G_\lambda(\beta_k, \theta_k)}{\lambda^2}}\Big) \longrightarrow \{\hat\theta\}$

Gap safe sphere
$\hat\theta \in B\Big(\theta_k, \sqrt{\dfrac{2\,G_\lambda(\beta_k, \theta_k)}{\lambda^2}}\Big)$

Algorithm with Gap Safe rules

Algorithm 1: Gap Safe screening algorithm
Input: $X$, $y$, $K$, $f_{ce}$, $\lambda$
for $k \in [K]$ do
    if $k \bmod f_{ce} = 1$ then
        Compute $\theta \in \Delta_X$ and set $\mathcal{R} = B\Big(\theta, \sqrt{\tfrac{2(P_\lambda(\beta) - D_\lambda(\theta))}{\lambda^2}}\Big)$
        Compute the active set $\mathcal{A}_{\mathcal{R}}$
        if the stopping criterion is met then break
    $\beta \leftarrow$ SolverUpdate$(X_{\mathcal{A}_{\mathcal{R}}}, y, \beta_{\mathcal{A}_{\mathcal{R}}})$  // restricted to the active set
Output: $\beta$
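For concreteness, here is our own minimal NumPy instantiation of this pseudocode for the plain Lasso with coordinate descent; it is only a sketch of the idea, not the released package (which handles the Sparse-Group Lasso, sparse matrices, etc.). The dual point, duality gap, and active set are recomputed every f_ce iterations, and only active coordinates are updated in between.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding, the proximal operator of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd_gap_safe(X, y, lam, n_iter=200, f_ce=10, tol=1e-8):
    """Coordinate descent for the Lasso with Gap Safe sphere screening.

    Assumes the columns of X are non-zero. Illustrative only.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_norm2 = (X ** 2).sum(axis=0)          # ||x_j||_2^2
    active = np.arange(p)                     # current active set

    for k in range(n_iter):
        if k % f_ce == 0:
            residual = y - X @ beta
            # Feasible dual point by rescaling the residual
            alpha = max(lam, np.max(np.abs(X.T @ residual)))
            theta = residual / alpha
            # Duality gap and Gap Safe radius
            primal = 0.5 * residual @ residual + lam * np.abs(beta).sum()
            dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
            gap = primal - dual
            if gap < tol:                     # stopping criterion
                break
            radius = np.sqrt(2 * max(gap, 0.0)) / lam
            # Safe sphere test: |x_j^T theta| + r * ||x_j||_2 < 1  =>  beta_j = 0
            scores = np.abs(X.T @ theta) + radius * np.sqrt(col_norm2)
            beta[scores < 1.0] = 0.0
            active = np.where(scores >= 1.0)[0]

        # One coordinate descent pass restricted to the active set
        for j in active:
            residual_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ residual_j, lam) / col_norm2[j]
    return beta
```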

Numerical experiments
Figure: Proportion of screened variables on the Leukemia dataset ($n = 72$ samples and $p = 7129$ features).

Numerical experiments
Figure: Computational time on the Climate dataset ($n = 814$, $p = 73577$).

GitHub: https://github.com/eugenendiaye
Web page: http://perso.telecom-paristech.fr/endiaye