Gradient Descent for Optimization Problems With Sparse Solutions

Citation: Chen, Hsieh-Chung. 2016. Gradient Descent for Optimization Problems With Sparse Solutions. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Citable link: http://nrs.harvard.edu/urn-3:hul.instrepos:33493549

Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:hul.instrepos:dash.current.terms-of-use#laa
[Figure: sparse solution path and standard solution path toward the optimal solution; labels $\ell_0$ and $\ell_1$.]
1
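$$\min_{z \in \mathbb{R}^n} \{ Q(z) = f(z) + \lambda g(z) \}$$

where $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^n \to \mathbb{R}$, and $\lambda \ge 0$. When $f$ is defined over $N$ samples $\{x_i, y_i\}^N$, a model $M_z$ with parameters $z$ predicts $y$ from $x$: $\hat{y} = M_z(x_i)$ is the prediction for $y_i$.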
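$$\min_z \Big\{ f(z) \equiv \frac{1}{N} \sum_i^N \ell(M_z(x_i), y_i) \Big\}$$

Loss functions:

$$\ell(\hat{y}, y) = (\hat{y} - y)^2$$

$$\ell(\hat{y}, y) = \log(1 + \exp(-\hat{y} y))$$

$$\ell(\hat{y}, y) = (1 - \hat{y} y)_+$$

A constraint $z \in C$ is expressed through $g$ as

$$g(z) = \begin{cases} 0 & : z \in C \\ \infty & : z \notin C \end{cases}$$

The $\ell_0$ penalty is $g(z) = \|z\|_0$; the $\ell_1$ penalty belongs to the $\ell_p$ family with $0 \le p \le 1$.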
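The three losses above can be written out directly. The following is a minimal sketch in plain Python; the function names and the `empirical_risk` helper are illustrative assumptions, not names taken from the dissertation.

```python
import math

def squared_loss(y_hat, y):
    # (y_hat - y)^2
    return (y_hat - y) ** 2

def logistic_loss(y_hat, y):
    # log(1 + exp(-y_hat * y)), computed in a numerically stable way
    m = -y_hat * y
    if m > 0:
        return m + math.log1p(math.exp(-m))
    return math.log1p(math.exp(m))

def hinge_loss(y_hat, y):
    # (1 - y_hat * y)_+
    return max(0.0, 1.0 - y_hat * y)

def empirical_risk(loss, predictions, labels):
    # f(z) = (1/N) * sum_i loss(M_z(x_i), y_i), given precomputed predictions
    return sum(loss(p, t) for p, t in zip(predictions, labels)) / len(labels)
```

Here `predictions` stands in for the model outputs $M_z(x_i)$, so the sketch is independent of any particular model.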
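[Figure: $\ell_p$ for $p = \infty, 2, 1, 0.5, 0$.]

An $\ell_1$ penalty capped at $\tau$:

$$g(z) = \sum_i^n \min(|z_i|, \tau), \qquad \tau > 0$$

A logarithmic penalty between $\ell_1$ and $\ell_0$:

$$g(z) = \sum_i^n \log\Big(1 + \frac{|z_i|}{\tau}\Big), \qquad \tau > 0$$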
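$$g(z) = \sum_i^n \hat{g}(z_i), \qquad \hat{g}(y) = \begin{cases} \theta |y| & : |y| < \theta \\ \dfrac{-y^2 + 2\tau\theta|y| - \theta^2}{2(\tau - 1)} & : \theta \le |y| < \theta\tau \\ \frac{1}{2}(\tau + 1)\theta^2 & : \theta\tau \le |y| \end{cases}$$

with $\theta > 0$ and $\tau > 2$.

$$g(z) = \sum_i^n \hat{g}(z_i), \qquad \hat{g}(y) = \begin{cases} \theta |y| - \frac{1}{2\tau} y^2 & : |y| < \tau\theta \\ \frac{1}{2}\tau\theta^2 & : \tau\theta \le |y| \end{cases}$$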
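The two piecewise penalties above are easy to get wrong at the breakpoints, so a small sketch may help. The names `piecewise_penalty_a` and `piecewise_penalty_b` are placeholders chosen here (the source text does not name the penalties in the surviving lines); both take the scalar argument $y$ and the parameters $\theta$, $\tau$.

```python
def piecewise_penalty_a(y, theta, tau):
    # First penalty: linear, then quadratic, then constant (theta > 0, tau > 2).
    a = abs(y)
    if a < theta:
        return theta * a
    if a < theta * tau:
        return (-a * a + 2.0 * tau * theta * a - theta * theta) / (2.0 * (tau - 1.0))
    return 0.5 * (tau + 1.0) * theta * theta

def piecewise_penalty_b(y, theta, tau):
    # Second penalty: concave quadratic up to tau*theta, then constant.
    a = abs(y)
    if a < tau * theta:
        return theta * a - a * a / (2.0 * tau)
    return 0.5 * tau * theta * theta

def separable(penalty, z, theta, tau):
    # g(z) = sum_i g_hat(z_i)
    return sum(penalty(zi, theta, tau) for zi in z)
```

Both functions are continuous at their breakpoints, which is a quick sanity check on the reconstructed formulas.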
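Combining $\ell_1$ with a squared $\ell_2$ term:

$$g(z) = \|z\|_1 + \tau \|z\|_2^2$$

The $\ell_1 / \ell_p$ penalty over groups:

$$g(z) = \sum_{g \in G} w_g \|z_g\|_p$$

where $G$ is the set of groups, the $\ell_p$ norm is typically $\ell_2$ or $\ell_\infty$, and $w_g$ is the weight of group $g$.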
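[Figure: example groupings of variable indices.]

A function $f$ on $\Omega$ is $L$-Lipschitz if $\forall x, y \in \Omega: |f(x) - f(y)| \le L \|x - y\|$. $f$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz, in which case

$$|f(z) - f(x) - \langle \nabla f(x), z - x \rangle| \le \frac{L}{2} \|z - x\|_2^2.$$

The dual of the $\ell_p$ norm is the $\ell_q$ norm with $\frac{1}{p} + \frac{1}{q} = 1$; the $\ell_2$ norm is its own dual.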
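For every $x$, an $L$-smooth $f$ satisfies

$$f(z) \le f(x) + \langle \nabla f(x), z - x \rangle + \frac{L}{2} \|z - x\|_2^2.$$

$f$ is $\alpha$-strongly convex if $g(x) = f(x) - \frac{\alpha}{2}\|x\|_2^2$ is convex. For such $f$ and every $x$,

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|_2^2.$$

$$\min_z \|x - Az\|_2^2 + \lambda \|z\|_1$$

where $x$ is the signal, $A$ the matrix, and $z$ the coefficient vector.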
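Given data $X$:

$$D = \arg\min_{D, Z} \|X - DZ\|_2^2 + g(Z)$$

Given $D$ and an input $x$, the code $z$ is

$$z = \arg\min_z \|x - Dz\|_2^2 + g(z)$$

With an $\ell_1$ penalty and $A \in \mathbb{R}^{m \times n}$:

$$\min_z \|y - Az\|_2^2 + \lambda \|z\|_1$$

$$\min_z \|y - Az\|_2^2 \quad \text{s.t.} \quad \|z\|_0 < k$$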
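$$\min_z \|y - T(Az)\|_2^2 + \lambda \|z\|_1$$

$$\min_z \|y - T(Az)\|_2^2 \quad \text{s.t.} \quad \|z\|_0 < k$$

where $T(\cdot) : \mathbb{R}^m \to \mathbb{R}^m$.

$$z^* = \arg\min_z \frac{1}{n} \sum_i \log(1 + \exp(-y_i x_i^T z)) + g(z)$$

$$z^* = \arg\min_z \frac{1}{2n} \sum_i \max(0, 1 - b_i x_i^T z)^2 + g(z)$$

with samples $(x_i, y_i) \in (\mathbb{R}^n, \mathbb{R})$ and parameters $z$.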
l p ell p p<1
2
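At a point $y$, $Q(x)$ is approximated by

$$Q_{y,L}(x) \equiv Q(y) + \nabla Q(y)^T (x - y) + \frac{L}{2} \|x - y\|_2^2$$

where $Q : \mathbb{R}^n \to \mathbb{R}$ has Lipschitz constant $L_Q$. For $L > L_Q$, $Q_{y,L}(x) \ge Q(x)$. The minimizer of $Q_{y,L}$ is

$$\arg\min_{x \in \mathbb{R}^n} Q_{y,L}(x) = \arg\min_{x \in \mathbb{R}^n} \Big\| x - \Big(y - \frac{1}{L} \nabla Q(y)\Big) \Big\|_2^2 = y - \frac{1}{L} \nabla Q(y)$$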
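$$z^{k+1} = \arg\min_{z \in \mathbb{R}^n} Q_{z^k,L}(z) = z^k - \frac{1}{L} \nabla Q(z^k)$$

$$Q(z^k) - Q(z^*) \le \frac{c L_Q}{2k} \|z^0 - z^*\|_2^2$$

When $Q = f + g$, the approximation at $y$ is

$$\hat{Q}_{y,L}(x) \equiv f(y) + \nabla f(y)^T (x - y) + \frac{L}{2} \|x - y\|_2^2 + g(x)$$

with $L \ge L_f$, and

$$\arg\min_{x \in \mathbb{R}^n} \hat{Q}_{y,L}(x) = \arg\min_{x \in \mathbb{R}^n} g(x) + \frac{L}{2} \Big\| x - \Big(y - \frac{1}{L} \nabla f(y)\Big) \Big\|_2^2$$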
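The proximal operator of a function $h$ is

$$\mathrm{prox}_h(v) \equiv \arg\min_{x \in \mathbb{R}^n} h(x) + \frac{1}{2} \|x - v\|_2^2$$

$$z^{k+1} = \mathcal{P}_L(z^k) \equiv \arg\min_{z \in \mathbb{R}^n} \hat{Q}_{z^k,L}(z) = \arg\min_{z \in \mathbb{R}^n} g(z) + \frac{L}{2} \Big\| z - \Big(z^k - \frac{1}{L} \nabla f(z^k)\Big) \Big\|_2^2 = \mathrm{prox}_{\frac{1}{L} g}\Big(z^k - \frac{1}{L} \nabla f(z^k)\Big)$$

Proximal step $\mathcal{P}_L$: given $x$, compute $y := \mathrm{prox}_{\frac{1}{L} g}(x - \frac{1}{L} \nabla f(x))$ and return $y$.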
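The proximal gradient iteration above can be sketched with the gradient of $f$ and the proximal operator of $g$ passed in as functions. This is an illustrative sketch only: the function names, the test problem $f(z) = \frac{1}{2}\|z - b\|_2^2$, and the choice of $g$ as the indicator of the nonnegative orthant are assumptions made here for a small runnable demonstration.

```python
def proximal_gradient(grad_f, prox_g, L, z0, iters):
    """Iterate z <- prox_{g/L}(z - grad_f(z)/L) for a fixed L >= L_f."""
    z = list(z0)
    for _ in range(iters):
        g = grad_f(z)
        v = [zi - gi / L for zi, gi in zip(z, g)]  # gradient step on f
        z = prox_g(v, 1.0 / L)                     # proximal step on g
    return z

# Hypothetical example: f(z) = 0.5 * ||z - b||^2 and g = indicator of {z >= 0},
# whose proximal operator is the projection onto the nonnegative orthant.
b = [3.0, -0.5, 1.0]
grad_f = lambda z: [zi - bi for zi, bi in zip(z, b)]
project_nonneg = lambda v, step: [max(vi, 0.0) for vi in v]

z_hat = proximal_gradient(grad_f, project_nonneg, L=2.0,
                          z0=[0.0, 0.0, 0.0], iters=60)
```

For this toy problem the minimizer is the projection of `b` onto the nonnegative orthant, so `z_hat` approaches `[3.0, 0.0, 1.0]`.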
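For the $\ell_1$ penalty $g(z) = \lambda \|z\|_1$, the proximal operator is

$$[\mathcal{S}_\lambda(v)]_i = \mathrm{sign}(v_i)(|v_i| - \lambda)_+$$

When $g$ is the indicator of a set $C$, the proximal operator is the projection onto $C$. For the $\ell_0$ constraint $C = \{x : \|x\|_0 < K\}$:

$$[\mathcal{H}_K(v)]_i = \begin{cases} v_i & : |v_i| \text{ is among the largest-magnitude entries permitted by } C \\ 0 & : \text{otherwise} \end{cases}$$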
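The two operators above, one for the $\ell_1$ penalty and one for the $\ell_0$ constraint, can be sketched as follows. The function names are illustrative, and `hard_threshold_top_k` takes the number of entries to keep as an explicit argument `k`, which is an assumption about how the constraint set is parameterized.

```python
import math

def soft_threshold(v, lam):
    # [S_lam(v)]_i = sign(v_i) * max(|v_i| - lam, 0)
    return [math.copysign(max(abs(vi) - lam, 0.0), vi) for vi in v]

def hard_threshold_top_k(v, k):
    # Keep the k largest-magnitude entries of v and zero out the rest.
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [vi if i in keep else 0.0 for i, vi in enumerate(v)]
```

Soft thresholding shrinks every surviving entry by `lam`, while hard thresholding leaves the surviving entries unchanged.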
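The step size $\frac{1}{L}$ requires $L \ge L_f$. When $L_f$ is unknown, $L$ is found by backtracking: $L$ is multiplied by $\eta$ until $\hat{Q}_L$ upper-bounds $Q$ at the new point, with $\eta = 2$ a typical choice.

Backtracking step, with inputs $x, L, \eta$:
repeat
  $y := \mathcal{P}_L(x)$, the proximal gradient step with step size $\frac{1}{L}$
  $L := \eta L$
until $Q(y) \le \hat{Q}_{x,L}(y)$
return $y, \frac{1}{\eta} L$

Accelerated iteration, with auxiliary sequence $y^k$ and iterates $z^k$:

$$z^{k+1} = \mathcal{P}_L(y^k)$$

$$t_{k+1} = \Big(1 + \sqrt{1 + (2 t_k)^2}\Big) / 2$$

$$y^{k+1} = z^{k+1} + \frac{t_k - 1}{t_{k+1}} (z^{k+1} - z^k)$$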
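The accelerated iteration above can be sketched as a short loop. This version assumes a fixed, known $L$ (no backtracking), and the demonstration problem, a separable quadratic $f(z) = \frac{1}{2}\sum_i d_i (z_i - b_i)^2$ with an $\ell_1$ penalty, is a hypothetical choice made here because its minimizer has a closed form.

```python
import math

def fista(grad_f, prox_g, L, z0, iters):
    """Accelerated proximal gradient with the t_k momentum sequence."""
    z = list(z0)
    y = list(z0)
    t = 1.0
    for _ in range(iters):
        g = grad_f(y)
        z_next = prox_g([yi - gi / L for yi, gi in zip(y, g)], 1.0 / L)
        t_next = (1.0 + math.sqrt(1.0 + (2.0 * t) ** 2)) / 2.0
        momentum = (t - 1.0) / t_next
        y = [zn + momentum * (zn - zo) for zn, zo in zip(z_next, z)]
        z, t = z_next, t_next
    return z

# Hypothetical demo: f(z) = 0.5 * sum_i d_i * (z_i - b_i)^2, g = lam * ||z||_1.
d, b, lam = [1.0, 4.0], [2.0, -1.0], 1.0
grad_f = lambda z: [di * (zi - bi) for di, zi, bi in zip(d, z, b)]
prox_l1 = lambda v, step: [math.copysign(max(abs(vi) - lam * step, 0.0), vi)
                           for vi in v]

z_hat = fista(grad_f, prox_l1, L=max(d), z0=[0.0, 0.0], iters=2000)
```

For this problem each coordinate has the closed-form minimizer $\mathrm{sign}(b_i)(|b_i| - \lambda / d_i)_+$, so `z_hat` should be close to `[1.0, -0.75]`.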
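For the accelerated iteration, $Q$ converges as

$$Q(z^k) - Q(z^*) \le \frac{c L_f}{(k + 1)^2} \|z^0 - z^*\|_2^2$$

but $Q(z^k) \le Q(z^{k-1})$ is not guaranteed. A monotone variant computes $s^{k+1} = \mathcal{P}_L(y^k)$ and sets

$$z^{k+1} = \begin{cases} s^{k+1} & : Q(s^{k+1}) < Q(z^k) \\ z^k & : \text{otherwise} \end{cases}$$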
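$$v^{k+1} = \mathcal{P}_L(y^k)$$

$$m^{k+1} = \mathcal{P}_L(x^k)$$

$$z^{k+1} = \begin{cases} v^{k+1} & : f(v^{k+1}) \le f(m^{k+1}) \\ m^{k+1} & : \text{otherwise} \end{cases}$$

$$t_{k+1} = \Big(1 + \sqrt{1 + (2 t_k)^2}\Big) / 2$$

$$y^{k+1} = z^{k+1} + \frac{t_k - 1}{t_{k+1}} (z^{k+1} - z^k)$$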
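$$\hat{Q}_y(x) \equiv f(y) + \nabla f(y)^T (x - y) + D(x, y) + g(x)$$

The choice $D(x, y) = \frac{L}{2} \|x - y\|_2^2$ recovers the proximal gradient step. Choosing $D(x, y) = \frac{1}{2} \|x - y\|_H$, where $H$ approximates the Hessian of $f$ and $\|w\|_A = w^T A w$, gives

$$z^{k+1} = \arg\min_z \hat{Q}_{z^k}(z) = \mathrm{prox}^H_g\big(z^k - H^{-1} \nabla f(z^k)\big)$$

with the scaled proximal operator

$$\mathrm{prox}^H_h(v) \equiv \arg\min_{x \in \mathbb{R}^n} h(x) + \frac{1}{2} \|x - v\|_H$$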
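Beyond $D(x, y) = \frac{1}{2} \|x - y\|_H$, $D(x, y)$ can be chosen to match $g$. Taking $D(x, y)$ to be the Bregman divergence of a function $\psi$:

$$D(x, y) := B_\psi(x, y) \equiv \psi(x) - \psi(y) - \langle x - y, \nabla \psi(y) \rangle$$

With $\psi(x) = \|x\|_2^2$, $D(x, y) = \|x - y\|_2^2$. Another choice is $\psi(p) = \sum_i \{p_i \log p_i - p_i\}$.

A separable $g$ has the form $g(z) = \sum_i^n g(z_i)$.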
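Coordinate descent, starting from $z^0 \in \mathbb{R}^n$:
for $i \in \{1, 2, \ldots, n\}$
  $z_i := \mathrm{prox}_{\frac{1}{L} g}\big(z_i - \frac{1}{L} \nabla f(z)_i\big)$
return $z$

For separable $g$ such as $\ell_1$, coordinate $i_k$ can also be minimized exactly:

$$[z^{k+1}]_{i_k} := \arg\min_{z_{i_k}} f(z) + g(z) \quad \text{s.t.} \quad z_j = [z^k]_j \;\; \forall j \ne i_k$$

Updating coordinate $i_k$ requires only $[\nabla f(z^k)]_{i_k}$. For block-separable $g$,

$$g(z) = \sum_{\Omega_i \in \Omega} g(z_{\Omega_i})$$

and the update is applied to one block $\Omega_i \in \Omega$ at a time.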
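As an illustration of the coordinate update above, here is a minimal cyclic coordinate descent sketch for $\frac{1}{2}\|x - Az\|_2^2 + \lambda\|z\|_1$. It uses the per-coordinate constant $\|A_i\|_2^2$ in place of a global $L$ and keeps a running residual so each update touches one column; both are implementation choices assumed here, and `lasso_cd` is a hypothetical name.

```python
import math

def lasso_cd(A, x, lam, sweeps):
    """Cyclic coordinate descent for 0.5*||x - A z||^2 + lam*||z||_1.

    A is a list of rows; the residual r = x - A z is maintained incrementally.
    """
    m, n = len(A), len(A[0])
    z = [0.0] * n
    r = list(x)
    col_sq = [sum(A[j][i] ** 2 for j in range(m)) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if col_sq[i] == 0.0:
                continue
            grad_i = -sum(A[j][i] * r[j] for j in range(m))  # [grad f(z)]_i
            v = z[i] - grad_i / col_sq[i]
            t = lam / col_sq[i]
            new = math.copysign(max(abs(v) - t, 0.0), v)     # soft threshold
            delta = new - z[i]
            if delta != 0.0:
                for j in range(m):
                    r[j] -= A[j][i] * delta
                z[i] = new
    return z
```

With an identity matrix the problem decouples and a single sweep reaches the soft-thresholded solution, which makes a convenient check.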
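$$\min_z \Big\{ f(z) \equiv \frac{1}{N} \sum_i^N \ell(M_z(x_i), y_i) \Big\}$$

Given $N$ samples $(x_i, y_i)$, $f$ is an average over samples,

$$f(z) = \frac{1}{N} \sum_i^N \ell(M_z(x_i), y_i),$$

so the gradient of $f$ can be estimated from a subset of the samples.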
0 l 1
3 λ 0 >λ 0 λ k λ K = λ {λ 0,...,λ K = λ}
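$$\min_z \Big\{ \frac{1}{2} \|y - Xz\|_2^2 + \lambda \|z\|_1 \Big\} \tag{LASSO}$$

where $y \in \mathbb{R}^N$ and $X \in \mathbb{R}^{N \times n}$. Let $z(\lambda)$ denote the solution at $\lambda$ and $S$ its sign pattern. While $S$ is fixed, $z(\lambda)$ is linear in $\lambda$.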
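Moving from $z(\lambda_k)$ to $z(\lambda_{k+1})$: the path starts at $\lambda_0 = \|X^T y\|_\infty$, where $z(\lambda_0) = 0$. While the sign pattern $S$ of $z(\lambda_k)$ is unchanged, with support $\Omega$,

$$[z(\lambda')]_\Omega = (X_\Omega^T X_\Omega)^{-1} \big(X_\Omega^T y - \lambda' \, \mathrm{sign}(z(\lambda_k))_\Omega\big), \qquad [z(\lambda')]_{\Omega^c} = 0$$

The next breakpoint between $z(\lambda_k)$ and $z(\lambda')$ occurs when either $|X_i^T (y - X z(\lambda'))| = \lambda'$ for some $i \in \Omega^c$, so that $i$ enters $\Omega$, or $z(\lambda')_i = 0$ for some $i \in \Omega$, so that $i$ leaves $\Omega$. The path consists of $K$ breakpoints $\lambda_k$.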
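$z^0 = 0$, $\Omega = \{\}$, $k = 0$
repeat
  $i = \arg\max_i |X_i^T (y - X z^k)|$
  $\Omega := \Omega \cup i$
  $\theta := \mathrm{sign}(z^k)$
  $[\theta]_i := \mathrm{sign}(X_i^T (y - X z^k))$
  $\hat{z} := 0$
  $[\hat{z}]_\Omega := (X_\Omega^T X_\Omega)^{-1} (X_\Omega^T y - \lambda \theta_\Omega)$
  move from $z^k$ toward $\hat{z}$ to obtain $z^{k+1}$
  $\Omega := \{ j : [z^{k+1}]_j \ne 0 \}$
  $k := k + 1$
return $z^k$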
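Continuation on $\lambda$:
$z = 0 \in \mathbb{R}^n$, $\Omega = \{\}$
for $k = 0, 1, 2, \ldots, \log_\eta(\lambda / \lambda_0)$
  $z^k = \arg\min_z \|x - Az\|_2^2 + \lambda_k \|z\|_1$, warm-started from the previous solution
  $\lambda_{k+1} = \eta \lambda_k$
return $z$

The penalty is reduced by a factor $\eta$ at each stage until $\eta \lambda$ reaches the target $\lambda$, giving $K$ stages $\lambda_k$.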
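The continuation scheme above is essentially an outer loop over a geometric schedule of penalties with warm starts. The sketch below shows only that outer loop; `solve` is a hypothetical callback standing in for any inner solver, and the closed-form inner solver in the demo (for $A = I$ with the $\frac{1}{2}$-scaled squared loss) is an assumption made so the example runs on its own.

```python
import math

def continuation(solve, lam0, lam_target, eta, z0):
    """Solve a sequence of problems with lam_{k+1} = eta * lam_k, warm-started.

    solve(lam, z_warm) must return a solution for penalty lam.
    Returns the list of (lam_k, z_k) pairs along the path.
    """
    assert 0.0 < eta < 1.0 and lam0 >= lam_target > 0.0
    z, lam, path = list(z0), lam0, []
    while True:
        z = solve(lam, z)
        path.append((lam, z))
        if lam <= lam_target:
            break
        lam = max(eta * lam, lam_target)  # never overshoot the target
    return path

# Hypothetical inner solver: min_z 0.5*||x - z||^2 + lam*||z||_1 has the
# closed-form solution soft_threshold(x, lam); the warm start is unused here.
x = [3.0, -0.5]
solve = lambda lam, z_warm: [math.copysign(max(abs(xi) - lam, 0.0), xi)
                             for xi in x]

path = continuation(solve, lam0=4.0, lam_target=1.0, eta=0.5, z0=[0.0, 0.0])
```

In this demo the support grows along the path: the solution is all zeros at the largest penalty and gains a nonzero entry as the penalty shrinks.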
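For the $\ell_0$-constrained problem

$$\min_z \Big\{ \frac{1}{2} \|x - Az\|_2^2 \Big\} \quad \text{s.t.} \quad \|z\|_0 < \lambda$$

$\lambda$ bounds the number of nonzeros in $z$, and columns of $A$ are selected greedily until $\lambda$ is reached.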
$z = 0 \in \mathbb{R}^n$, $\Omega = \{\}$
for $k = 1, 2, 3, \ldots, \lambda$
  $i = \arg\max_i |A_i^T (x - Az)|$
  $\Omega := \Omega \cup i$
  $z_\Omega := (A_\Omega^T A_\Omega)^{-1} (A_\Omega^T x)$
return $z$
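The greedy procedure above can be sketched in plain Python. The small Gaussian-elimination helper `solve_linear` is an implementation detail assumed here to keep the example dependency-free; it is not part of the source, and a real implementation would use a numerical linear algebra library.

```python
def solve_linear(M, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    sol = [0.0] * n
    for c in reversed(range(n)):
        s = aug[c][n] - sum(aug[c][k] * sol[k] for k in range(c + 1, n))
        sol[c] = s / aug[c][c]
    return sol

def greedy_pursuit(A, x, k):
    """Select k columns greedily; refit least squares on the support each time."""
    m, n = len(A), len(A[0])
    cols = [[A[j][i] for j in range(m)] for i in range(n)]
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    z, support, r = [0.0] * n, [], list(x)
    for _ in range(k):
        # Pick the unused column most correlated with the residual.
        candidates = [i for i in range(n) if i not in support]
        best = max(candidates, key=lambda i: abs(dot(cols[i], r)))
        support.append(best)
        gram = [[dot(cols[a], cols[c]) for c in support] for a in support]
        rhs = [dot(cols[a], x) for a in support]
        coef = solve_linear(gram, rhs)
        z = [0.0] * n
        for idx, c in zip(support, coef):
            z[idx] = c
        r = [x[j] - sum(A[j][i] * z[i] for i in support) for j in range(m)]
    return z
```

The sketch assumes the selected columns are linearly independent, so that the Gram matrix on the support is invertible.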
4
z 0 =0 [ K (z k+1 z k )] i =0 k K ( ) i (z k ) i ρ τ k 1
τ 1 =min j Ωz k ( (zk ) j ) τ 2 = 1 2 ( δ Ω z k + δ ) δ = (z k ) z k τ k 1 τ 1 τ 2 τ k =min({ρ τ k 1,τ 1,τ 2 }) 0 τ min x {Q(x)} z i τ z i > 0 τ =0 τ z i [ (z k ) z k ] i τ 2 zi k i τ 2 zi k =0
τ τ i τ 0 τ Q τ τ τ k = ρ τ k 1 ρ [0, 1] τ τ τ τ 1 = min (z k ) i i Ω(z k ) τ 2 = 1 2 ( δ Ω(z k ) + δ ) δ = (z k ) z k τ k =min({ρ τ k 1,τ 1,τ 2 }) τ 1 τ 2 τ ρ τ 0 ξ max i [ (z 0 ) z 0 ] i ξ [0, 1] ξ = 1
y i y i τ [ τ (y)] i 0 y i <τ () () x, L, τ, η τ x + := L (x) y := τ (x + ) L := ηl min(q(y),q(x + )) ˆQ x,l (x + ) y, 1 η L z 0 z 0 l 0 g
z 0 =0 k =1, 2, 3,... z k z k τ z k k log ρ 2ϵ L nτ 0 Q((z k )) Q( (z k )) ϵ τ k 0 Q((z k )) Q( (z k )) L 2 (zk ) (z k ) L n i 2 (τ k ) 2 L 2 nρ k τ 0 L f L 2 nρ k τ 0 ϵ k log ρ 2ϵ L nτ 0 z k Ω((z))
g ˆΩ z = { i i Ω(z) a i > 1 2 ( a Ω(z) + a ) } a =((z) f(z)) + 0 z 3 z 0 z1 z 2 f ( f,1) z 0 z 1 z 2 z 3 τ
g i i ˆΩ z [ z (g)] i = 0 i/ ˆΩ z ˆΩ z y yˆωx := 1 L g(xˆωx 1 L f(x)ˆωx ) x, L, τ, η ˆΩ x y := 1 L g(x 1 L τ( f(x))) L := ηl Q(y) Q(x) y, 1 η L z 0 =0 z k = (z k 1 ) Q(z k ) Q(z ) 2nLR2 k +4 R z 0 z
Q(z k ) Q(z k+1 ) 1 2L [ f(zk )] Ω(z k ) 2 1 2nL f(zk ) 2 1 2nLR 2 (Q(z k ) Q(z )) 2 z i z =0 R n τ i ˆΩ z z i := 1 L g(z i 1 L f(z) i) z
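[Figure: $z^k$ for LARS, FSS, FISTA, and ASH:FISTA; $A \in \mathbb{R}^{150 \times 200}$, $\|z\|_0 = 10$.]

[Figure: $z$ for $A \in \mathbb{R}^{700 \times 1000}$; $\ell_0$.]

[Figure: $\tau$ and $\ell_0$.]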
5
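Experimental setting: $A \in \mathbb{R}^{2000 \times 20000}$ with $\|z\|_0 = 200$ (1% of 20000).

[Figure: effect of $\rho$ for $\|z\|_0 = 50$ and $\|z\|_0 = 10$, with $A$ of size $2000 \times 500$ and $500 \times 2000$; $\ell_0$ of the iterates.]

[Table: FSS, MPL, PGH, FISTA, ASH:FISTA, CD, and ASH:CD compared at $\lambda$ and $\lambda / 2$; $A \in \mathbb{R}^{500 \times 2000}$, $\|z\|_0 = 50$; tolerances $10^{-4}$ and $10^{-8}$ relative to $Q(z^*)$.]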
x t t x t+1 = D t(x t ) D t {x 0,x 1,...,x N } x i R p
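$$D = \arg\min_{D \in \mathbb{R}^{p \times n},\, z_0, z_1, \ldots, z_N} \sum_i^N \big\{ \|x_i - D z_i\|_2^2 + g(z_i) \big\}$$

Over the $N$ samples, $g$ is the sparsity penalty. The dictionary $D \in \mathbb{R}^{p \times n}$ is overcomplete, $p < n$. The minimization alternates between the codes $\{z_i\}$ for fixed $D$ and $D$ for fixed $\{z_i\}$, with $g$ an $\ell_1$ or $\ell_0$ penalty.

Given $D$, the encoding of an input $x$ is

$$\arg\min_z \|x - Dz\|_2^2 + g(z)$$

where $g$ is typically the $\ell_1$ penalty.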
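$D \in \mathbb{R}^{500 \times 2000}$, with the group penalty

$$g(z) = \sum_{g \in G} \|z_g\|_p$$

Max pooling over $\{x_0, x_1, \ldots, x_p\}$:

$$[x]_i = \max\{[x_0]_i, [x_1]_i, \ldots, [x_p]_i\}$$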
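The pooling rule above takes the elementwise maximum across a set of vectors. A one-function sketch, with `max_pool` as an illustrative name:

```python
def max_pool(vectors):
    # [x]_i = max over the input vectors of their i-th entry
    return [max(column) for column in zip(*vectors)]

pooled = max_pool([[1.0, 0.0, 3.0],
                   [0.0, 2.0, 1.0]])
```

All input vectors are assumed to have the same length; `zip` would silently truncate to the shortest one otherwise.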
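[Figure: samples from the AM-FED, GENKI, and CK+ datasets; original data and data after pre-processing.]

[Figure: processing pipeline with sparse coding (LASSO/OMP), split, max pooling, and flip stages.]

[Figure: comparison of MNNSC-LASSO, NNSC-LASSO, SC-LASSO, MNNSC-OMP, NNSC-OMP, SC-OMP, and Gabor features.]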
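A signal $x \in \mathbb{R}^n$ is modeled as $x = Dz + \epsilon$, where $D \in \mathbb{R}^{n \times t}$, $z \in \mathbb{R}^t$ with $\|z\|_0 \le k$ and $k \ll t$, and $\epsilon$ is noise. The measurement $y \in \mathbb{R}^m$ of $x$ is taken with $\Phi$: $y = \Phi x$, $\Phi \in \mathbb{R}^{m \times n}$. Given $y$, $\Phi$, and $D$, the code is estimated as

$$\hat{z} = \arg\min_z \|y - \Phi D z\|_2^2 + \lambda g(z)$$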
g l 1 l 0 x ˆx = Dẑ (ΦD) 2k z z 0 k Φ D (ΦD) D m k n t y x Φ l 1 l 1 l 0 l 0
l 0 y
24.5 9.25% 28.1 16.3% 31 24.5% 3 T (x) = 1 α (αx) y = T (Φx)
where $T(\cdot) : \mathbb{R}^m \to \mathbb{R}^m$.

[Figure: ISTA, FISTA, mAP, ASH:FISTA, and ASH:mAP compared.]

$$\hat{z} = \arg\min_z \|y - T(\Phi D z)\|_2^2 + \lambda g(z)$$

with $g$ the $\ell_1$ penalty.
$$z^* = \arg\min_z \frac{1}{N} \sum_i^N \log(1 + \exp(-y_i x_i^T z)) + g(z)$$

with samples $(x_i, y_i) \in (\mathbb{R}^n, \mathbb{R})$ and $g$ the $\ell_1$ penalty.
l 1 10 20 l 1 11.59 12 18 24 l 0
f(w) = i (y i, (w, x i )) w f w f 90% l 0 10%
l 0 10% l 0
ASH:Adam ASH:Adam T 50
dense net sparse net sparse net (ASH) l 0
6 f g f
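The proximal step on $g$ is cheap once $\hat{\Omega}_z$ is known, so the cost is dominated by the gradient. For $A \in \mathbb{R}^{p \times n}$ and $f(z^k) = \|x - A z^k\|_2^2$,

$$\nabla f(z^k) = -2 \big(A^T x - (A^T A) z^k\big)$$

$A^T x$ is computed once. $A^T A z^k$ must be recomputed at every iteration, either with a precomputed $A^T A$ or as two matrix-vector products, and a sparse $z^k$ restricted to $\hat{\Omega}_z$ reduces the cost of both.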
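Computing $(A^T A)_\Omega z_\Omega$ reduces the cost from $n^2$ to $nk$, and computing $A^T (A_\Omega z_\Omega)$ reduces it from $2pn$ to $pn + pk$, where $k \ll n$ is the number of nonzeros, $k = \|z^k\|_0$. The choice between a precomputed $A^T A$ and products with $A$ depends on $p$, $n$, and $k$.

For coordinate descent, the update of coordinate $s$ is

$$[z^{k+1}]_s = \mathrm{prox}_g\big([z^k]_s + 2 A_s^T (x - A z^k)\big)$$

and only coordinate $s$ changes, by $([z^{k+1}]_s - [z^k]_s)$.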
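The saving from restricting $A^T A z$ to the support of $z$ can be seen in a short sketch. It assumes a precomputed Gram matrix `G` standing for $A^T A$ and a precomputed vector `Atx` standing for $A^T x$; the function names are illustrative only.

```python
def gram_times_sparse(G, z):
    # (A^T A) z using only the columns of G on the support of z:
    # roughly n*k multiplications instead of n*n when z has k nonzeros.
    support = [j for j, zj in enumerate(z) if zj != 0.0]
    return [sum(G[i][j] * z[j] for j in support) for i in range(len(G))]

def least_squares_gradient(Atx, G, z):
    # grad f(z) = -2 * (A^T x - (A^T A) z) for f(z) = ||x - A z||^2
    Gz = gram_times_sparse(G, z)
    return [-2.0 * (a - g) for a, g in zip(Atx, Gz)]

# Hypothetical 2-variable example with a 1-sparse z.
G = [[2.0, 1.0], [1.0, 3.0]]
Atx = [1.0, 1.0]
grad = least_squares_gradient(Atx, G, [0.0, 2.0])
```

Only the second column of `G` is touched in this example, since the first entry of `z` is zero.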
l 0 l 0 l 0 80%
7