Gap Safe Screening Rules for Sparse-Group Lasso
Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon
LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France
Sparse regression
$y \in \mathbb{R}^n$: signal
$X = [x_1, \ldots, x_p] \in \mathbb{R}^{n \times p}$: design matrix
$y = X\beta + \varepsilon$ where $\varepsilon$ is a noise term and $p \gg n$.
Objective: approximate $y \approx X\hat{\beta}$ with a sparse vector $\hat{\beta} \in \mathbb{R}^p$.
The Lasso estimator, Tibshirani (1996): $\Omega(\beta) = \|\beta\|_1$
$$\hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{data fitting}} + \lambda \underbrace{\Omega(\beta)}_{\text{sparsity}}$$
Sparsity-inducing norms
$\ell_1$ (Lasso): $\Omega(\beta) = \|\beta\|_1$
$\ell_1/\ell_2$ (Group Lasso): $\Omega(\beta) = \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2$
$\ell_1 + \ell_1/\ell_2$ (Sparse-Group Lasso): $\Omega(\beta) = \tau \|\beta\|_1 + (1 - \tau) \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2$
where $w_g \geq 0$ is the weight of group $g$ and $\tau \in (0, 1)$.
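As a concrete illustration, here is a minimal NumPy sketch of the three penalties; the group partition, weights, and example values below are ours, not from the slides:

```python
import numpy as np

def lasso_norm(beta):
    """l1 norm: sum of absolute values."""
    return np.abs(beta).sum()

def group_lasso_norm(beta, groups, weights):
    """l1/l2 norm: weighted sum of per-group l2 norms."""
    return sum(w * np.linalg.norm(beta[g]) for g, w in zip(groups, weights))

def sparse_group_lasso_norm(beta, groups, weights, tau):
    """Convex combination of the l1 and l1/l2 penalties."""
    return tau * lasso_norm(beta) + (1 - tau) * group_lasso_norm(beta, groups, weights)

# Example: p = 5 features split into two groups
beta = np.array([0.0, 1.5, -2.0, 0.0, 0.3])
groups = [np.array([0, 1]), np.array([2, 3, 4])]
weights = [np.sqrt(2), np.sqrt(3)]  # a common choice: sqrt of the group size
print(sparse_group_lasso_norm(beta, groups, weights, tau=0.5))
```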
Climate dataset
Group of 7 features: [Air Temperature, Precipitable Water, Relative Humidity, Pressure, Sea Level Pressure, Horizontal Wind Speed, Vertical Wind Speed]
Convex optimization problem
$$\hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{data fitting}} + \lambda \underbrace{\Omega(\beta)}_{\text{sparsity}}$$
Algorithms:
ISTA/FISTA, Beck & Teboulle (2009)
Coordinate descent, Friedman et al. (2007)
How can we speed up these algorithms by using sparsity information?
Support: $\hat{S}_\lambda := \{j \in [p] : \hat{\beta}_j \neq 0\}$. Sparsity: $|\hat{S}_\lambda| \ll p$.
Idea: solve the optimization problem by restricting it to the support.
But the support $\hat{S}_\lambda$ is unknown!!!
Optimality condition
Sub-differential: $\partial f(x^\star) = \{z \in \mathbb{R}^d : \forall y \in \mathbb{R}^d,\ f(y) \geq f(x^\star) + z^\top (y - x^\star)\}$
If $f$ is differentiable at $x^\star$: $\partial f(x^\star) = \{\nabla f(x^\star)\}$.
Fermat's rule: for any convex function $f : \mathbb{R}^d \to \mathbb{R}$,
$$x^\star \in \operatorname*{arg\,min}_{x \in \mathbb{R}^d} f(x) \iff 0 \in \partial f(x^\star).$$
Critical threshold: $\lambda_{\max}$
Primal objective: $P_\lambda(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \Omega(\beta)$.
By Fermat's rule:
$$0 \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} P_\lambda(\beta) \iff 0 \in \partial P_\lambda(0) = \{-X^\top y\} + \lambda\, \partial\Omega(0) \iff \Omega^D(X^\top y) \leq \lambda$$
Dual norm: $\Omega^D(\xi) := \max_{\Omega(\beta) \leq 1} \langle \beta, \xi \rangle$
First screening rule
$$0 \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} P_\lambda(\beta) \iff \Omega^D(X^\top y) \leq \lambda$$
Let $\lambda_{\max} := \Omega^D(X^\top y)$. For all $\lambda \geq \lambda_{\max}$, we have $\hat{\beta} = 0$.
From now on, we only consider the case $\lambda < \lambda_{\max}$.
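In code, $\lambda_{\max}$ costs one matrix-vector product plus a dual norm evaluation. A sketch for the Lasso and Group Lasso cases (function names are ours; the Sparse-Group Lasso case needs the $\epsilon$-norm introduced later):

```python
import numpy as np

def lambda_max_lasso(X, y):
    """Lasso: the dual norm of the l1 norm is the l-infinity norm."""
    return np.max(np.abs(X.T @ y))

def lambda_max_group_lasso(X, y, groups, weights):
    """Group Lasso: the dual norm is the max over groups of ||.||_2 / w_g."""
    xty = X.T @ y
    return max(np.linalg.norm(xty[g]) / w for g, w in zip(groups, weights))
```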
Duality
Primal problem: $\hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\Omega(\beta)}_{P_\lambda(\beta)}$
Dual problem: $\hat{\theta} = \operatorname*{arg\,max}_{\theta \in \Delta_X} \underbrace{\tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\big\|\theta - \tfrac{y}{\lambda}\big\|_2^2}_{D_\lambda(\theta)}$
Feasible set: $\Delta_X = \{\theta \in \mathbb{R}^n : \Omega^D(X^\top \theta) \leq 1\}$
Strong duality: $P_\lambda(\hat{\beta}) = D_\lambda(\hat{\theta})$
KKT optimality conditions:
$\lambda\hat{\theta} = y - X\hat{\beta}$ (link equation)
$X^\top \hat{\theta} \in \partial\Omega(\hat{\beta})$ (sub-differential inclusion)
Screening rules for separable norms
$$\Omega(\beta) = \sum_{g \in \mathcal{G}} \Omega_g(\beta_g), \qquad \Omega^D(\beta) = \max_{g \in \mathcal{G}} \Omega_g^D(\beta_g)$$
$$\forall \lambda > 0,\ \forall g \in \mathcal{G}, \quad \Omega_g^D(X_g^\top \hat{\theta}) < 1 \implies \hat{\beta}_g = 0.$$
Proof: sub-differential inclusion $\forall g \in \mathcal{G},\ X_g^\top \hat{\theta} \in \partial\Omega_g(\hat{\beta}_g)$.
Sub-differential of a norm:
$$\partial\Omega(x) = \begin{cases} \{z \in \mathbb{R}^d : \Omega^D(z) \leq 1\} = B_{\Omega^D}, & \text{if } x = 0,\\ \{z \in \mathbb{R}^d : \Omega^D(z) = 1 \text{ and } z^\top x = \Omega(x)\}, & \text{otherwise.} \end{cases}$$
Hence $\hat{\beta}_g \neq 0 \implies \Omega_g^D(X_g^\top \hat{\theta}) = 1$, and the rule follows by contraposition.
Screening rule for the Lasso
$$\Omega(\beta) = \|\beta\|_1, \qquad \Omega^D(\xi) = \max_{j \in [p]} |\xi_j|$$
$$\forall j \in [p], \quad |X_j^\top \hat{\theta}| < 1 \implies \hat{\beta}_j = 0$$
Screening rule for the Group Lasso
$$\Omega(\beta) = \sum_{g \in \mathcal{G}} w_g \|\beta_g\|_2, \qquad \Omega^D(\xi) = \max_{g \in \mathcal{G}} \frac{\|\xi_g\|_2}{w_g}$$
$$\forall g \in \mathcal{G}, \quad \frac{\|X_g^\top \hat{\theta}\|_2}{w_g} < 1 \implies \hat{\beta}_g = 0$$
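Both tests are one-liners once a dual point is available. A sketch assuming we were handed the dual optimum $\hat{\theta}$ (which, as the next slides stress, we are not); names are illustrative:

```python
import numpy as np

def lasso_screen(X, theta):
    """True where feature j can be discarded: |X_j^T theta| < 1."""
    return np.abs(X.T @ theta) < 1.0

def group_lasso_screen(X, theta, groups, weights):
    """True where group g can be discarded: ||X_g^T theta||_2 / w_g < 1."""
    xtth = X.T @ theta
    return np.array([np.linalg.norm(xtth[g]) / w < 1.0
                     for g, w in zip(groups, weights)])
```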
Screening rule for the Sparse-Group Lasso
$$\Omega(\beta) = \sum_{g \in \mathcal{G}} \tau\|\beta_g\|_1 + (1 - \tau) w_g \|\beta_g\|_2, \qquad \Omega^D(\xi) = \max_{g \in \mathcal{G}} \frac{\|\xi_g\|_{\epsilon_g}}{\tau + (1 - \tau) w_g}$$
$\epsilon$-norm, Burdakov (1988), Burdakov & Merkulov (2002): for $\epsilon \in [0, 1]$, $\|x\|_\epsilon$ is the solution in $\nu$ of
$$\sum_{i=1}^{d} \big(|x_i| - (1 - \epsilon)\nu\big)_+^2 = (\epsilon \nu)^2.$$
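The left-hand side minus the right-hand side of the defining equation is decreasing in $\nu \geq 0$, so bisection gives a simple (if not the fastest) way to evaluate the $\epsilon$-norm. A minimal sketch; the bracket and iteration count are our choices, and faster exact solvers exist:

```python
import numpy as np

def epsilon_norm(x, eps, n_iter=100):
    """Burdakov's epsilon-norm: the solution nu >= 0 of
        sum_i (|x_i| - (1 - eps) * nu)_+^2 = (eps * nu)^2.
    It interpolates between the l-infinity norm (eps = 0) and the
    l2 norm (eps = 1)."""
    ax = np.abs(x)
    if eps == 0.0:
        return ax.max()
    f = lambda nu: np.sum(np.maximum(ax - (1 - eps) * nu, 0.0) ** 2) - (eps * nu) ** 2
    lo, hi = 0.0, np.linalg.norm(x) / eps  # f(lo) >= 0 >= f(hi)
    for _ in range(n_iter):                # f is decreasing on [0, inf)
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```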
Screening for the Sparse-Group Lasso
$$\Omega_g^D(\xi_g) = \frac{\|\xi_g\|_{\epsilon_g}}{\tau + (1 - \tau) w_g}$$
Group level screening:
$$\forall g \in \mathcal{G}, \quad \frac{\|X_g^\top \hat{\theta}\|_{\epsilon_g}}{\tau + (1 - \tau) w_g} < 1 \implies \hat{\beta}_g = 0.$$
Feature level screening:
$$\forall j \in g, \quad |X_j^\top \hat{\theta}| < \tau \implies \hat{\beta}_j = 0.$$
Screening rules for separable norms
$$\forall g \in \mathcal{G}, \quad \Omega_g^D(X_g^\top \hat{\theta}) < 1 \implies \hat{\beta}_g = 0.$$
But $\hat{\theta}$ IS UNKNOWN!!!
Safe screening rules
Find a safe region $\mathcal{R}$ such that $\hat{\theta} \in \mathcal{R}$:
$$\Omega_g^D(X_g^\top \hat{\theta}) \leq \max_{\theta \in \mathcal{R}} \Omega_g^D(X_g^\top \theta) < 1 \implies \hat{\beta}_g = 0.$$
Desirable properties of a safe region $\mathcal{R}$:
- as small as possible
- the computation of $\max_{\theta \in \mathcal{R}} \Omega_g^D(X_g^\top \theta)$ is cheap
Ball as a safe region, $\mathcal{R} = B(c, r)$:
$$\Omega_g^D(X_g^\top c) + r\, \Omega_g^D(X_g) < 1 \implies \hat{\beta}_g = 0.$$
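Specialized to the Lasso, the ball test reads $|X_j^\top c| + r\,\|X_j\|_2 < 1$, since $|X_j^\top \theta| \leq |X_j^\top c| + r\,\|X_j\|_2$ for every $\theta \in B(c, r)$ by Cauchy-Schwarz. A vectorized sketch of ours:

```python
import numpy as np

def safe_sphere_screen_lasso(X, center, radius):
    """True where feature j is safely discarded: if the dual optimum lies
    in B(center, radius), then |X_j^T c| + r * ||X_j||_2 < 1 guarantees
    |X_j^T theta_hat| < 1, hence beta_hat_j = 0."""
    return np.abs(X.T @ center) + radius * np.linalg.norm(X, axis=0) < 1.0
```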
How can we construct a safe region $\mathcal{R}$ such that $\hat{\theta} \in \mathcal{R}$?
Geometrical interpretation of the dual
$$\hat{\theta} = \operatorname*{arg\,max}_{\theta \in \Delta_X} \tfrac{1}{2}\|y\|_2^2 - \tfrac{\lambda^2}{2}\Big\|\theta - \frac{y}{\lambda}\Big\|_2^2, \qquad \Delta_X = \{\theta \in \mathbb{R}^n : \Omega^D(X^\top \theta) \leq 1\}.$$
Equivalently,
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta \in \Delta_X} \Big\|\theta - \frac{y}{\lambda}\Big\| = \Pi_{\Delta_X}\Big(\frac{y}{\lambda}\Big):$$
the dual optimum is the projection of $y/\lambda$ onto the feasible set.
Figure: Visualization of the feasible set $\Delta_X = \{\theta \in \mathbb{R}^n : \Omega^D(X^\top \theta) \leq 1\}$: (a) Lasso, (b) Group Lasso, (c) Sparse-Group Lasso.
Seminal safe region, El Ghaoui et al. (2012):
$$\hat{\theta} \in B\Big(\frac{y}{\lambda},\ \Big\|\frac{y}{\lambda_{\max}} - \frac{y}{\lambda}\Big\|\Big)$$
This is safe because $y/\lambda_{\max} \in \Delta_X$ while $\hat{\theta}$ is the projection of $y/\lambda$ onto $\Delta_X$, hence at least as close to $y/\lambda$.
Dynamic safe region, Bonnefoy et al. (2014):
$$\hat{\theta} \in B\Big(\frac{y}{\lambda},\ \Big\|\theta_k - \frac{y}{\lambda}\Big\|\Big)$$
where $\theta_k \in \Delta_X$ is a dual feasible point built along the iterations.
Critical limitations of those methods
The radius of these balls does not shrink to zero as the solver converges.
Can we do better?
Yes we can!
Take $\mathcal{R} = B(c, r)$.
Theoretical screening rule: $\Omega_g^D(X_g^\top \hat{\theta}) < 1 \implies \hat{\beta}_g = 0$.
Safe sphere test: $\Omega_g^D(X_g^\top c) + r\, \Omega_g^D(X_g) < 1 \implies \hat{\beta}_g = 0$.
Objectives:
- $c$ as close as possible to $\hat{\theta}$
- $r$ as small as possible
Mind the duality gap ;-)
Fercoq, Gramfort & Salmon (ICML 2015): for all $\theta \in \Delta_X$ and all $\beta \in \mathbb{R}^p$,
$$\hat{\theta} \in B(\theta, r_\lambda(\beta, \theta)) \quad \text{where} \quad r_\lambda(\beta, \theta) = \sqrt{\frac{2(P_\lambda(\beta) - D_\lambda(\theta))}{\lambda^2}} = \sqrt{\frac{2 G_\lambda(\beta, \theta)}{\lambda^2}}.$$
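Computing the radius only requires the duality gap at the current primal-dual pair. A Lasso specialization, as a sketch; clamping a slightly negative gap to zero is a numerical precaution of ours:

```python
import numpy as np

def gap_safe_radius_lasso(X, y, beta, theta, lam):
    """Gap Safe radius r = sqrt(2 * gap / lam^2), where gap is the duality
    gap P_lam(beta) - D_lam(theta); theta must be dual feasible, i.e.
    ||X^T theta||_inf <= 1."""
    residual = y - X @ beta
    primal = 0.5 * residual @ residual + lam * np.abs(beta).sum()
    dual = 0.5 * (y @ y) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    gap = primal - dual
    return np.sqrt(2 * max(gap, 0.0) / lam ** 2)
```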
How to compute a $\theta \in \Delta_X$?
$$\Delta_X = \{\theta \in \mathbb{R}^n : \Omega^D(X^\top \theta) \leq 1\} \quad \text{(feasible set)}$$
$$\hat{\theta} = \frac{y - X\hat{\beta}}{\lambda} \quad \text{(link equation)} \qquad \leadsto \qquad \theta_k = \frac{y - X\beta_k}{\alpha} \quad \text{with } \alpha \text{ s.t. } \Omega^D(X^\top \theta_k) \leq 1$$
Take $\alpha = \max(\lambda, \Omega^D(X^\top \rho_k))$ where $\rho_k = y - X\beta_k$.
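This rescaling is cheap: one residual and one dual norm evaluation. A Lasso sketch (function name is ours):

```python
import numpy as np

def dual_feasible_point_lasso(X, y, beta, lam):
    """Rescale the residual to obtain a dual feasible point:
    theta_k = (y - X beta_k) / max(lam, ||X^T (y - X beta_k)||_inf),
    so that ||X^T theta_k||_inf <= 1 by construction."""
    rho = y - X @ beta
    alpha = max(lam, np.max(np.abs(X.T @ rho)))
    return rho / alpha
```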
Convergence of the Gap Safe region
$$\lim_{k \to \infty} \beta_k = \hat{\beta} \implies \lim_{k \to \infty} \theta_k = \hat{\theta}$$
$$\lim_{k \to \infty} G_\lambda(\beta_k, \theta_k) = G_\lambda(\hat{\beta}, \hat{\theta}) = 0$$
$$\lim_{k \to \infty} r_\lambda(\beta_k, \theta_k) = \lim_{k \to \infty} \sqrt{\frac{2 G_\lambda(\beta_k, \theta_k)}{\lambda^2}} = 0$$
$$B\left(\theta_k, \sqrt{\frac{2 G_\lambda(\beta_k, \theta_k)}{\lambda^2}}\right) \longrightarrow \{\hat{\theta}\}$$
Gap safe sphere
$$\hat{\theta} \in B\left(\theta_k, \sqrt{\frac{2 G_\lambda(\beta_k, \theta_k)}{\lambda^2}}\right)$$
Algorithm with Gap Safe rules
Algorithm 1: Gap Safe screening algorithm
Input: $X, y, K, f^{ce}, \lambda$
for $k \in [K]$ do
    if $k \bmod f^{ce} = 1$ then
        Compute $\theta \in \Delta_X$ and set $\mathcal{R} = B\big(\theta, \sqrt{2(P_\lambda(\beta) - D_\lambda(\theta))/\lambda^2}\big)$
        Compute the active set $\mathcal{A}_{\mathcal{R}}$
        if stopping criterion is met then break
    $\beta \leftarrow$ SolverUpdate$(X_{\mathcal{A}_{\mathcal{R}}}, y, \beta_{\mathcal{A}_{\mathcal{R}}})$  // restricted to the active set
Output: $\beta$
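Putting the pieces together, here is a self-contained sketch of the loop for the Lasso, with ISTA standing in for the inner solver (the slides mention coordinate descent; all names and defaults are illustrative, and for simplicity the safe test is re-run on all features at each screening step):

```python
import numpy as np

def gap_safe_lasso(X, y, lam, n_iter=1000, f_ce=10, tol=1e-8):
    """Minimal Gap Safe screening loop for the Lasso with ISTA updates."""
    n, p = X.shape
    beta = np.zeros(p)
    active = np.arange(p)                 # current active set
    L = np.linalg.norm(X, ord=2) ** 2     # Lipschitz constant of the gradient
    col_norms = np.linalg.norm(X, axis=0)

    for k in range(n_iter):
        if k % f_ce == 0:
            # Dual feasible point by residual rescaling (link equation)
            rho = y - X[:, active] @ beta[active]
            alpha = max(lam, np.max(np.abs(X.T @ rho)))
            theta = rho / alpha
            # Duality gap and Gap Safe radius
            primal = 0.5 * rho @ rho + lam * np.abs(beta).sum()
            dual = 0.5 * (y @ y) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
            gap = primal - dual
            if gap < tol:
                break
            r = np.sqrt(2 * max(gap, 0.0) / lam ** 2)
            # Safe sphere test: discard j if |X_j^T theta| + r * ||X_j||_2 < 1
            keep = np.abs(X.T @ theta) + r * col_norms >= 1.0
            beta[~keep] = 0.0
            active = np.where(keep)[0]
        # ISTA step restricted to the active set
        Xa = X[:, active]
        grad = Xa.T @ (Xa @ beta[active] - y)
        z = beta[active] - grad / L
        beta[active] = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta
```

Because the sphere test is safe, screened coordinates are guaranteed zero at the optimum, so the inner updates run on an ever-smaller design matrix; this is where the speed-up comes from.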
Numerical experiments
Figure: Proportion of screened variables on the Leukemia dataset ($n = 72$ samples, $p = 7129$ features).
Numerical experiments
Figure: Computational time on the Climate dataset ($n = 814$, $p = 73577$).
GitHub: https://github.com/eugenendiaye
Web page: http://perso.telecom-paristech.fr/endiaye