Supplementary Materials: Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Tianbao Yang
NEC Labs America, Cupertino, CA 95014
tyang@nec-labs.com

1 Proof of Theorem 1 and Theorem 2

For the proof of Theorem 1 and Theorem 2, we first prove the following lemma.

Lemma 1. Assume that $\phi^*_i(z)$ is a $\gamma$-strongly convex function, where $\gamma$ can be zero. Then for any $t>0$ and $s\in[0,1]$, we have
\[
\mathrm{E}\big[D(\alpha^t)-D(\alpha^{t-1})\big]\;\geq\;\frac{smK}{n}\,\mathrm{E}\Big[P(w^{t-1})-D(\alpha^{t-1})-\frac{smK}{n}\cdot\frac{G^t}{2\lambda}\Big],
\]
where
\[
G^t=\frac{1}{n}\sum_{k=1}^{K}\sum_{i=1}^{n_k}\Big(\|x_{k,i}\|^2-\frac{\gamma(1-s)\lambda n}{smK}\Big)\big(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i}\big)^2
\quad\text{and}\quad u^{t-1}_{k,i}=-\phi'_{k,i}(x_{k,i}^{\top}w^{t-1}).
\]

Proof of Lemma 1. For simplicity, we let $I^m_k$ denote the random index set $\{i_1,\ldots,i_m\}$ sampled at iteration $t$ in machine $k$, and we write the update as $\alpha^t_{k,i}=\alpha^{t-1}_{k,i}+\Delta\alpha_{k,i}$ for $i\in I^m_k$ (the remaining dual variables are unchanged). We begin by bounding the improvement in the dual objective. By the definition of $D(\alpha)$, we have
\[
n\big[D(\alpha^t)-D(\alpha^{t-1})\big]
=\underbrace{\Big[-\sum_{k=1}^K\sum_{i\in I^m_k}\phi^*_{k,i}(-\alpha^t_{k,i})-\lambda n\, g^*(v^t)\Big]}_{A}
-\underbrace{\Big[-\sum_{k=1}^K\sum_{i\in I^m_k}\phi^*_{k,i}(-\alpha^{t-1}_{k,i})-\lambda n\, g^*(v^{t-1})\Big]}_{B}.
\]
By the definition of the update, $v^t=v^{t-1}+\frac{1}{\lambda n}\sum_{k=1}^K\sum_{i\in I^m_k}\Delta\alpha_{k,i}x_{k,i}$, and therefore
\[
g^*(v^t)\;\leq\; g^*(v^{t-1})+\frac{1}{\lambda n}\sum_{k=1}^K\sum_{i\in I^m_k}\Delta\alpha_{k,i}\,x_{k,i}^{\top}\nabla g^*(v^{t-1})+\frac{1}{2\lambda^2 n^2}\Big\|\sum_{k=1}^K\sum_{i\in I^m_k}\Delta\alpha_{k,i}x_{k,i}\Big\|^2
\]
\[
=\; g^*(v^{t-1})+\frac{1}{\lambda n}\sum_{k=1}^K\sum_{i\in I^m_k}\Delta\alpha_{k,i}\,x_{k,i}^{\top}w^{t-1}+\frac{1}{2\lambda^2 n^2}\Big\|\sum_{k=1}^K\sum_{i\in I^m_k}\Delta\alpha_{k,i}x_{k,i}\Big\|^2
\]
\[
\leq\; g^*(v^{t-1})+\frac{1}{\lambda n}\sum_{k=1}^K\sum_{i\in I^m_k}\Delta\alpha_{k,i}\,x_{k,i}^{\top}w^{t-1}+\frac{mK}{2\lambda^2 n^2}\sum_{k=1}^K\sum_{i\in I^m_k}\Delta\alpha^2_{k,i}\|x_{k,i}\|^2,
\]
where the first inequality uses the strong convexity of $g$ (equivalently, the $1$-smoothness of $g^*$), the second equality uses the fact $w^{t-1}=\nabla g^*(v^{t-1})$, and the last inequality uses the Cauchy-Schwarz inequality. Then we can bound $A$ by
\[
A\;\geq\;\sum_{k=1}^K\sum_{i\in I^m_k}\Big[-\phi^*_{k,i}\big(-(\alpha^{t-1}_{k,i}+\Delta\alpha_{k,i})\big)-\Delta\alpha_{k,i}\,x_{k,i}^{\top}w^{t-1}-\frac{mK\|x_{k,i}\|^2}{2\lambda n}\Delta\alpha^2_{k,i}\Big]-\lambda n\, g^*(v^{t-1}).
\]
We can further bound $A$ as follows if we maximize the R.H.S. of the above inequality over $\Delta\alpha_{k,i}$, or if we restrict $\Delta\alpha_{k,i}=s_{k,i}\big(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i}\big)$, where $u^{t-1}_{k,i}=-\phi'_{k,i}(x_{k,i}^{\top}w^{t-1})$:
\[
A\;\geq\;\sum_{k=1}^K\sum_{i\in I^m_k}\Big[-\phi^*_{k,i}\big(-(\alpha^{t-1}_{k,i}+s_{k,i}(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i}))\big)-s_{k,i}(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})\,x_{k,i}^{\top}w^{t-1}-\frac{mK\|x_{k,i}\|^2}{2\lambda n}s_{k,i}^2(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})^2\Big]-\lambda n\, g^*(v^{t-1})
\]
\[
\geq\;\sum_{k=1}^K\sum_{i\in I^m_k}\Big[-(1-s_{k,i})\phi^*_{k,i}(-\alpha^{t-1}_{k,i})-s_{k,i}\phi^*_{k,i}(-u^{t-1}_{k,i})+\frac{\gamma}{2}s_{k,i}(1-s_{k,i})(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})^2
-s_{k,i}(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})\,x_{k,i}^{\top}w^{t-1}-\frac{mK\|x_{k,i}\|^2}{2\lambda n}s_{k,i}^2(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})^2\Big]-\lambda n\, g^*(v^{t-1}),
\]
where in the last inequality we use the $\gamma$-strong convexity of $\phi^*_{k,i}$. Since the $s_{k,i}$ are maximized over the R.H.S. of the above inequalities, we have for any common value $s\in[0,1]$
\[
A\;\geq\;\sum_{k=1}^K\sum_{i\in I^m_k}\Big[-(1-s)\phi^*_{k,i}(-\alpha^{t-1}_{k,i})-s\,\phi^*_{k,i}(-u^{t-1}_{k,i})-s(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})\,x_{k,i}^{\top}w^{t-1}
+\Big(\frac{\gamma(1-s)}{2}-\frac{smK\|x_{k,i}\|^2}{2\lambda n}\Big)s\,(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})^2\Big]-\lambda n\, g^*(v^{t-1}).
\]
Since $u^{t-1}_{k,i}=-\phi'_{k,i}(x_{k,i}^{\top}w^{t-1})$, the Fenchel-Young equality gives $-\phi^*_{k,i}(-u^{t-1}_{k,i})=\phi_{k,i}(x_{k,i}^{\top}w^{t-1})+u^{t-1}_{k,i}\,x_{k,i}^{\top}w^{t-1}$, and hence
\[
-s\,\phi^*_{k,i}(-u^{t-1}_{k,i})-s(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})\,x_{k,i}^{\top}w^{t-1}
= s\big[\phi_{k,i}(x_{k,i}^{\top}w^{t-1})+\alpha^{t-1}_{k,i}\,x_{k,i}^{\top}w^{t-1}\big].
\]
Combining the last two displays with the definition of $B$, we thus have
\[
A-B\;\geq\; s\sum_{k=1}^K\sum_{i\in I^m_k}\Big[\phi_{k,i}(x_{k,i}^{\top}w^{t-1})+\phi^*_{k,i}(-\alpha^{t-1}_{k,i})+\alpha^{t-1}_{k,i}\,x_{k,i}^{\top}w^{t-1}\Big]
-\frac{s}{2}\sum_{k=1}^K\sum_{i\in I^m_k}\Big(\frac{smK\|x_{k,i}\|^2}{\lambda n}-\gamma(1-s)\Big)(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})^2.
\]
Taking expectation over the randomly sampled sets $I^m_k$, $k=1,\ldots,K$ (each example on machine $k$ is included with probability $m/n_k=mK/n$, where $n_k=n/K$ is the number of examples on machine $k$), we have
\[
\mathrm{E}[A-B]\;\geq\;\frac{smK}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\Big[\phi_{k,i}(x_{k,i}^{\top}w^{t-1})+\phi^*_{k,i}(-\alpha^{t-1}_{k,i})+\alpha^{t-1}_{k,i}\,x_{k,i}^{\top}w^{t-1}\Big]
-\frac{smK}{2n}\sum_{k=1}^K\sum_{i=1}^{n_k}\Big(\frac{smK\|x_{k,i}\|^2}{\lambda n}-\gamma(1-s)\Big)(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})^2
\]
\[
=\;\frac{smK}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\Big[\phi_{k,i}(x_{k,i}^{\top}w^{t-1})+\phi^*_{k,i}(-\alpha^{t-1}_{k,i})+\alpha^{t-1}_{k,i}\,x_{k,i}^{\top}w^{t-1}\Big]
-\Big(\frac{smK}{n}\Big)^2\frac{n}{2\lambda}\,\underbrace{\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\Big(\|x_{k,i}\|^2-\frac{\gamma(1-s)\lambda n}{smK}\Big)(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i})^2}_{G^t}.
\]
Note that with $w^{t-1}=\nabla g^*(v^{t-1})$ we have $g(w^{t-1})+g^*(v^{t-1})=(w^{t-1})^{\top}v^{t-1}$, and
\[
P(w^{t-1})-D(\alpha^{t-1})
=\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\phi_{k,i}(x_{k,i}^{\top}w^{t-1})+\lambda g(w^{t-1})+\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\phi^*_{k,i}(-\alpha^{t-1}_{k,i})+\lambda g^*(v^{t-1})
\]
\[
=\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\Big[\phi_{k,i}(x_{k,i}^{\top}w^{t-1})+\phi^*_{k,i}(-\alpha^{t-1}_{k,i})\Big]+\lambda\,(w^{t-1})^{\top}v^{t-1}
=\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\Big[\phi_{k,i}(x_{k,i}^{\top}w^{t-1})+\phi^*_{k,i}(-\alpha^{t-1}_{k,i})+\alpha^{t-1}_{k,i}\,x_{k,i}^{\top}w^{t-1}\Big].
\]
Therefore, we have
\[
\mathrm{E}\big[D(\alpha^t)-D(\alpha^{t-1})\big]=\frac{1}{n}\mathrm{E}[A-B]\;\geq\;\frac{smK}{n}\,\mathrm{E}\Big[P(w^{t-1})-D(\alpha^{t-1})-\frac{smK}{n}\cdot\frac{G^t}{2\lambda}\Big],
\]
which completes the proof of Lemma 1.

Routine IncDual($\hat w$, scl): for $j=1,\ldots,m$, update the sampled dual variable $\alpha_{k,i_j}$ by one of the following options.

Option I: Let $\Delta\alpha_{k,i_j}=\arg\max_{\Delta\alpha}\;-\phi^*_{k,i_j}\big(-(\alpha^{t-1}_{k,i_j}+\Delta\alpha)\big)-\Delta\alpha\, x_{k,i_j}^{\top}\hat w-\frac{\mathrm{scl}\,\|x_{k,i_j}\|^2}{2\lambda n}\Delta\alpha^2$.

Option II: Let $z^{t-1}_{k,i_j}=-\phi'_{k,i_j}(x_{k,i_j}^{\top}\hat w)$. Let $s_{k,i_j}\in[0,1]$ maximize the objective in Option I restricted to $\Delta\alpha=s\big(z^{t-1}_{k,i_j}-\alpha^{t-1}_{k,i_j}\big)$, and let $\Delta\alpha_{k,i_j}=s_{k,i_j}\big(z^{t-1}_{k,i_j}-\alpha^{t-1}_{k,i_j}\big)$.

Option III: Let $z^{t-1}_{k,i_j}=-\phi'_{k,i_j}(x_{k,i_j}^{\top}\hat w)$ and $\Delta\alpha_{k,i_j}=s_{k,i_j}\big(z^{t-1}_{k,i_j}-\alpha^{t-1}_{k,i_j}\big)$, where $s_{k,i_j}\in[0,1]$ maximizes the lower bound obtained from the $\gamma$-strong convexity of $\phi^*_{k,i_j}$ (as in the proof of Lemma 1).

Option IV (only for smooth loss functions): the same as Options II and III, but with $s_{k,i_j}$ computed in closed form from the resulting quadratic in $s$.

Figure 1: The Basic Variant of the DisDCA Algorithm (more options).
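To make the basic variant concrete, the following single-process sketch follows the setting analyzed in Lemma 1: every machine samples $m$ examples, applies Option I with the scaling $\mathrm{scl}=mK$ against the fixed $w^{t-1}$, and the resulting dual increments are aggregated in one communication step per iteration. The closed-form inner update is instantiated here for a smoothed hinge loss whose conjugate is $\phi^*(-a)=-a+\frac{\gamma}{2}a^2$ on $a\in[0,1]$ (as in the SDCA literature), with labels folded into the rows of X. This instantiation and all names are illustrative assumptions, not the author's reference implementation.

import numpy as np

def disdca_basic(X, lam, gamma, K, m, T, seed=0):
    """Single-process sketch of one run of the basic DisDCA variant (Option I)
    for a smoothed hinge loss; labels are assumed folded into the rows of X."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    parts = np.array_split(np.arange(n), K)   # data partitioned over K "machines"
    alpha = np.zeros(n)                       # dual variables, each kept in [0, 1]
    w = np.zeros(X.shape[1])                  # w^{t-1} = v^{t-1} for l2 regularization
    scl = m * K                               # scaling used by the basic variant
    for _ in range(T):
        delta_v = np.zeros_like(w)
        for k in range(K):                    # in a real deployment this loop runs in parallel
            idx = rng.choice(parts[k], size=m, replace=False)
            for i in idx:
                x = X[i]
                q = scl * (x @ x) / (lam * n)
                # exact maximizer of -phi*(-(alpha_i+da)) - da*x'w - (q/2)*da^2
                da = (1.0 - x @ w - gamma * alpha[i]) / (gamma + q)
                da = np.clip(alpha[i] + da, 0.0, 1.0) - alpha[i]   # keep alpha_i in [0, 1]
                alpha[i] += da
                delta_v += da * x / (lam * n)
        w = w + delta_v                       # one round of communication (all-reduce)
    return w, alpha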
1.1 Proof of Theorem 1

We first prove the case of a smooth loss function. We can apply Lemma 1 with $s=\frac{\lambda n\gamma}{\lambda n\gamma+mK}\in[0,1]$. In this case, assuming $\|x_{k,i}\|\leq 1$, every term in $G^t$ is non-positive, i.e., $G^t\leq 0$. Then we have
\[
\mathrm{E}\big[D(\alpha^t)-D(\alpha^{t-1})\big]\;\geq\;\frac{smK}{n}\,\mathrm{E}\big[P(w^{t-1})-D(\alpha^{t-1})\big].
\]
Since $P(w^{t-1})-D(\alpha^{t-1})\geq D(\alpha^*)-D(\alpha^{t-1})=\epsilon^{t-1}_D$ and $D(\alpha^t)-D(\alpha^{t-1})=\epsilon^{t-1}_D-\epsilon^{t}_D$, we obtain
\[
\mathrm{E}[\epsilon^t_D]\;\leq\;\Big(1-\frac{smK}{n}\Big)\mathrm{E}[\epsilon^{t-1}_D]
=\Big(1-\frac{\lambda\gamma mK}{\lambda\gamma n+mK}\Big)\mathrm{E}[\epsilon^{t-1}_D]
\;\leq\;\Big(1-\frac{smK}{n}\Big)^{t}\epsilon^0_D\;\leq\;\exp\Big(-\frac{smKt}{n}\Big),
\]
where we use the fact $\epsilon^0_D=D(\alpha^*)-D(\alpha^0)\leq 1$, due to that $D(\alpha^0)\geq 0$ and $D(\alpha^*)\leq P(w^*)\leq P(0)\leq 1$. It is then easy to check that in order to have $\mathrm{E}[\epsilon^T_D]\leq\epsilon_D$, it suffices to have
\[
T\;\geq\;\Big(\frac{n}{mK}+\frac{1}{\lambda\gamma}\Big)\log\frac{1}{\epsilon_D}.
\]
Furthermore,
\[
\mathrm{E}\big[P(w^{t})-D(\alpha^{t})\big]\;\leq\;\frac{n}{smK}\,\mathrm{E}\big[D(\alpha^{t+1})-D(\alpha^{t})\big]=\frac{n}{smK}\,\mathrm{E}\big[\epsilon^{t}_D-\epsilon^{t+1}_D\big]\;\leq\;\frac{n}{smK}\,\mathrm{E}[\epsilon^{t}_D],
\]
and therefore $\mathrm{E}\big[P(w^{T})-D(\alpha^{T})\big]\leq\frac{n}{smK}\,\mathrm{E}[\epsilon^{T}_D]$. We can complete the proof of Theorem 1 for the smooth loss function by plugging in $\frac{n}{smK}=\frac{n}{mK}+\frac{1}{\lambda\gamma}$ and requiring $\mathrm{E}[\epsilon^T_D]\leq\frac{smK}{n}\epsilon_P$, which holds when
\[
T\;\geq\;\Big(\frac{n}{mK}+\frac{1}{\lambda\gamma}\Big)\log\Big(\Big(\frac{n}{mK}+\frac{1}{\lambda\gamma}\Big)\frac{1}{\epsilon_P}\Big).
\]
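The bound above is what quantifies the trade-off between computation and communication: increasing the number of dual updates per round ($m$) or the number of machines ($K$) shrinks the $\frac{n}{mK}$ term, but the number of communication rounds cannot drop below the $\frac{1}{\lambda\gamma}$ term. The following minimal sketch simply evaluates the bound as written above for hypothetical parameter values (the numbers are illustrative, not taken from the paper's experiments).

import math

def rounds_bound(n, m, K, lam, gamma, eps):
    """Evaluate T >= (n/(mK) + 1/(lam*gamma)) * log((n/(mK) + 1/(lam*gamma)) / eps)."""
    c = n / (m * K) + 1.0 / (lam * gamma)
    return c * math.log(c / eps)

# Hypothetical example values, only to show the trend as m grows.
n, K, lam, gamma, eps = 10**6, 10, 1e-5, 1.0, 1e-3
for m in (1, 10, 100, 1000):
    print(m, round(rounds_bound(n, m, K, lam, gamma, eps)))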
1.2 Proof of Theorem 2

Next, we prove the case of an $L$-Lipschitz continuous loss function. We let $\gamma=0$ in Lemma 1 and obtain
\[
\mathrm{E}\big[D(\alpha^t)-D(\alpha^{t-1})\big]\;\geq\;\frac{smK}{n}\,\mathrm{E}\Big[P(w^{t-1})-D(\alpha^{t-1})-\frac{smK}{n}\cdot\frac{G^t}{2\lambda}\Big],
\]
where $G^t$ is equal to
\[
G^t=\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\|x_{k,i}\|^2\big(u^{t-1}_{k,i}-\alpha^{t-1}_{k,i}\big)^2.
\]
Following Lemma 4 in the analysis of SDCA, we can bound $G^t$ by $G^t\leq 4L^2$. Let $\bar G=\max_t G^t$. We have
\[
\mathrm{E}\big[D(\alpha^t)-D(\alpha^{t-1})\big]\;\geq\;\frac{smK}{n}\,\mathrm{E}\big[P(w^{t-1})-D(\alpha^{t-1})\big]-\Big(\frac{smK}{n}\Big)^2\frac{\bar G}{2\lambda}.
\]
Similarly, the inequality above indicates the following inequality:
\[
\mathrm{E}[\epsilon^t_D]\;\leq\;\Big(1-\frac{smK}{n}\Big)\mathrm{E}[\epsilon^{t-1}_D]+\Big(\frac{smK}{n}\Big)^2\frac{\bar G}{2\lambda}.
\]
We can prove similarly that the following inequality holds:
\[
\epsilon^t_D\;\leq\;\frac{2\bar G}{\lambda\big(\frac{2n}{mK}+t-t_0\big)}\quad\text{for all }t\geq t_0=\max\Big(0,\Big\lceil\frac{n}{mK}\log\frac{2\lambda n\epsilon^0_D}{\bar G\, mK}\Big\rceil\Big).
\]
In order to have $\epsilon^T_D\leq\epsilon_D$, it suffices to have
\[
T\;\geq\;\frac{2\bar G}{\lambda\epsilon_D}-\frac{2n}{mK}+t_0.
\]
Furthermore, by summing the inequality above over $t=T_0+1,\ldots,T$ and using the convexity of $P$ and the concavity of $D$, we have
\[
\mathrm{E}\big[P(\bar w_T)-D(\bar\alpha_T)\big]\;\leq\;\frac{n}{smK(T-T_0)}\,\mathrm{E}\big[D(\alpha^{T})-D(\alpha^{T_0})\big]+\frac{smK\bar G}{2\lambda n},
\]
where $\bar w_T=\sum_{t=T_0+1}^{T}w^{t-1}/(T-T_0)$ and $\bar\alpha_T=\sum_{t=T_0+1}^{T}\alpha^{t-1}/(T-T_0)$. If $T\geq T_0+\frac{n}{mK}$, we can set $s=\frac{n}{mK(T-T_0)}\leq 1$ and obtain
\[
\mathrm{E}\big[P(\bar w_T)-D(\bar\alpha_T)\big]\;\leq\;\mathrm{E}\big[D(\alpha^{T})-D(\alpha^{T_0})\big]+\frac{\bar G}{2\lambda(T-T_0)}\;\leq\;\mathrm{E}[\epsilon^{T_0}_D]+\frac{\bar G}{2\lambda(T-T_0)}.
\]
In order to have $\mathrm{E}\big[P(\bar w_T)-D(\bar\alpha_T)\big]\leq\epsilon_P$, it suffices to have
\[
T_0\;\geq\;\frac{4\bar G}{\lambda\epsilon_P}-\frac{2n}{mK}+t_0,\qquad\text{and}\qquad T\;\geq\;T_0+\max\Big(\frac{n}{mK},\;\frac{\bar G}{\lambda\epsilon_P}\Big),
\]
so that $\mathrm{E}[\epsilon^{T_0}_D]\leq\epsilon_P/2$ and $\frac{\bar G}{2\lambda(T-T_0)}\leq\epsilon_P/2$. This completes the proof of Theorem 2.

2 Derivations of Subproblems in DisDCA and ADMM

We consider the $\ell_2$ regularization where $g(w)=\frac{1}{2}\|w\|^2$, so that $g^*(v)=\frac{1}{2}\|v\|^2$ and $w^{t}=\nabla g^*(v^{t})=v^{t}$. Then at the $t$-th iteration, we aim to maximize the dual objective on the sampled data points, i.e.,
\[
\max_{\Delta\alpha}\;-\frac{1}{n}\sum_{k=1}^K\sum_{i\in I^m_k}\phi^*_{k,i}\big(-(\alpha^{t-1}_{k,i}+\Delta\alpha_{k,i})\big)-\frac{\lambda}{2}\Big\|w^{t-1}+\frac{1}{\lambda n}\sum_{k=1}^K\sum_{i\in I^m_k}\Delta\alpha_{k,i}x_{k,i}\Big\|^2.
\]
Writing the second term as the squared norm of an average over the $K$ machines and applying Jensen's inequality,
\[
\frac{\lambda}{2}\Big\|\frac{1}{K}\sum_{k=1}^K\Big(w^{t-1}+\frac{K}{\lambda n}\sum_{i\in I^m_k}\Delta\alpha_{k,i}x_{k,i}\Big)\Big\|^2\;\leq\;\frac{\lambda}{2K}\sum_{k=1}^K\Big\|w^{t-1}+\frac{K}{\lambda n}\sum_{i\in I^m_k}\Delta\alpha_{k,i}x_{k,i}\Big\|^2,
\]
the objective above is lower bounded by the sum of $K$ decoupled terms
\[
\sum_{k=1}^K\Big[-\frac{1}{n}\sum_{i\in I^m_k}\phi^*_{k,i}\big(-(\alpha^{t-1}_{k,i}+\Delta\alpha_{k,i})\big)-\frac{\lambda}{2K}\Big\|w^{t-1}+\frac{K}{\lambda n}\sum_{i\in I^m_k}\Delta\alpha_{k,i}x_{k,i}\Big\|^2\Big]
=\sum_{k=1}^K\Big[-\frac{1}{n}\sum_{i\in I^m_k}\phi^*_{k,i}\big(-(\alpha^{t-1}_{k,i}+\Delta\alpha_{k,i})\big)-\frac{\lambda}{2K}\Big\|\hat w^{t-1}+\frac{K}{\lambda n}\sum_{i\in I^m_k}\big(\alpha^{t-1}_{k,i}+\Delta\alpha_{k,i}\big)x_{k,i}\Big\|^2\Big],
\]
where $\hat w^{t-1}=w^{t-1}-\frac{K}{\lambda n}\sum_{i\in I^m_k}\alpha^{t-1}_{k,i}x_{k,i}$. Therefore, in each machine the goal of each update is to maximize the following:
\[
\max_{\alpha}\;-\frac{1}{n}\sum_{i=1}^m\phi^*_i(-\alpha_i)-\frac{\lambda}{2K}\Big\|\hat w^{t-1}+\frac{K}{\lambda n}\sum_{i=1}^m\alpha_i x_i\Big\|^2,
\]
where we suppress the machine subscript $k$ and optimize directly over the new values $\alpha_i=\alpha^{t-1}_i+\Delta\alpha_i$ of the $m$ sampled dual variables.
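For concreteness, the following is a minimal sketch of the per-machine step implied by this local subproblem: coordinate ascent over the $m$ sampled dual variables while maintaining the point $u=\hat w^{t-1}+\frac{K}{\lambda n}\sum_i\alpha_i x_i$, instantiated for the same smoothed hinge conjugate used earlier. The choice of coordinate ascent, the single-pass default, and all names are assumptions made for illustration.

import numpy as np

def local_dual_step(w, X_s, alpha_s, lam, gamma, n, K, passes=1):
    """Approximately solve the per-machine subproblem derived above by coordinate
    ascent over the m sampled points (labels folded into the rows of X_s)."""
    m = X_s.shape[0]
    # w is w^{t-1}; since hat_w = w - (K/(lam*n)) * sum_i alpha_i x_i, the running
    # point u = hat_w + (K/(lam*n)) * sum_i alpha_i x_i starts at w itself.
    u = w.copy()
    for _ in range(passes):
        for i in range(m):
            x = X_s[i]
            q = K * (x @ x) / (lam * n)
            da = (1.0 - x @ u - gamma * alpha_s[i]) / (gamma + q)
            da = np.clip(alpha_s[i] + da, 0.0, 1.0) - alpha_s[i]
            alpha_s[i] += da
            u += (K / (lam * n)) * da * x
    return alpha_s, u - w      # new local duals and the local direction (K/(lam*n)) * sum_i d_alpha_i x_i

Averaging the returned directions over the $K$ machines reproduces the global update implied by the definition of $v^t$, i.e., $w^t=w^{t-1}+\frac{1}{\lambda n}\sum_{k}\sum_{i\in I^m_k}\Delta\alpha_{k,i}x_{k,i}=w^{t-1}+\frac{1}{K}\sum_k(u_k-w^{t-1})$.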
The local subproblem above has the following primal problem:
\[
\min_{w}\;\frac{1}{n}\sum_{i=1}^m\phi_i(x_i^{\top}w)+\frac{\lambda}{2K}\Big\|w-\Big(w^{t-1}-\frac{K}{\lambda n}\sum_{i=1}^m\alpha^{t-1}_i x_i\Big)\Big\|^2
=\min_{w}\;\frac{1}{n}\sum_{i=1}^m\phi_i(x_i^{\top}w)+\frac{\lambda}{2K}\big\|w-\hat w^{t-1}\big\|^2.
\]
To see this, we have
\[
\min_{w}\;\frac{1}{n}\sum_{i=1}^m\phi_i(x_i^{\top}w)+\frac{\lambda}{2K}\big\|w-\hat w^{t-1}\big\|^2
=\min_{w}\max_{\alpha}\;\frac{1}{n}\sum_{i=1}^m\big[-\alpha_i x_i^{\top}w-\phi^*_i(-\alpha_i)\big]+\frac{\lambda}{2K}\big\|w-\hat w^{t-1}\big\|^2
\]
\[
=\max_{\alpha}\min_{w}\;\frac{1}{n}\sum_{i=1}^m\big[-\alpha_i x_i^{\top}w-\phi^*_i(-\alpha_i)\big]+\frac{\lambda}{2K}\big\|w-\hat w^{t-1}\big\|^2.
\]
We can first compute the minimization to obtain $w=\hat w^{t-1}+\frac{K}{\lambda n}\sum_{i=1}^m\alpha_i x_i$, and then we plug this into the above problem, yielding (up to an additive constant independent of $\alpha$)
\[
\max_{\alpha}\;-\frac{1}{n}\sum_{i=1}^m\phi^*_i(-\alpha_i)-\frac{\lambda}{2K}\Big\|\hat w^{t-1}+\frac{K}{\lambda n}\sum_{i=1}^m\alpha_i x_i\Big\|^2,
\]
which is exactly the local dual subproblem above.

In ADMM, the primal objective is decomposed across the $K$ machines by imposing equality constraints:
\[
\min_{w_1,\ldots,w_K,\,w}\;\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\phi_{k,i}(w_k^{\top}x_{k,i})+\frac{\lambda}{2}\|w\|^2
\qquad\text{s.t.}\quad w_1=\cdots=w_K=w.
\]
To solve the problem, an augmented Lagrangian function is constructed:
\[
L(w_1,\ldots,w_K,w,u_1,\ldots,u_K)=\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^{n_k}\phi_{k,i}(w_k^{\top}x_{k,i})+\frac{\lambda}{2}\|w\|^2+\rho\sum_{k=1}^K u_k^{\top}(w_k-w)+\frac{\rho}{2}\sum_{k=1}^K\|w_k-w\|^2.
\]
The $\{w_k\}$, $w$, and $\{u_k\}$ are optimized alternately by
\[
w_k^t=\arg\min_{w_k}\;\frac{1}{n}\sum_{i=1}^{n_k}\phi_{k,i}(w_k^{\top}x_{k,i})+\frac{\rho}{2}\big\|w_k-w^{t-1}+u_k^{t-1}\big\|^2,
\]
\[
w^t=\arg\min_{w}\;\frac{\lambda}{2}\|w\|^2+\frac{\rho}{2}\sum_{k=1}^K\big\|w_k^t-w+u_k^{t-1}\big\|^2=\frac{\rho}{\lambda+\rho K}\sum_{k=1}^K\big(w_k^t+u_k^{t-1}\big),
\]
\[
u_k^t=u_k^{t-1}+w_k^t-w^t.
\]
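A compact sketch of one round of the alternating updates written above; local_solver stands in for whatever routine minimizes the per-machine proximal subproblem (for instance a few gradient or SDCA steps over that machine's examples) and is a placeholder, not an existing library call. All names are illustrative assumptions.

def admm_round(X_parts, w, u_list, rho, lam, local_solver):
    """One round of the consensus ADMM scheme above over K machines.
    local_solver(X_k, z) is assumed to return
    argmin_{w_k} (1/n) sum_i phi(w_k' x_{k,i}) + (rho/2) * ||w_k - z||^2."""
    K = len(X_parts)
    w_list = []
    for k in range(K):
        z = w - u_list[k]                      # target of the local proximal problem
        w_list.append(local_solver(X_parts[k], z))
    # closed-form consensus update: argmin_w lam/2 ||w||^2 + rho/2 sum_k ||w_k + u_k - w||^2
    w_new = rho * sum(wk + uk for wk, uk in zip(w_list, u_list)) / (lam + rho * K)
    for k in range(K):                          # multiplier updates
        u_list[k] = u_list[k] + w_list[k] - w_new
    return w_new, u_list

Note the contrast with the DisDCA subproblem above: here each machine solves a proximal minimization over all of its local examples before every consensus step, whereas DisDCA only touches the m sampled points per communication round.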
3 More Experimental Results

We show more experiments in this section. Figure 2 compares the practical variant of DisDCA vs ADMM with different penalty parameters. Figures 3 and 4 show more experiments obtained by varying m and K under different settings. Figure 5 compares the different variants of DisDCA.

[Figure 2 here: two panels plotting the primal objective against the number of iterations on the kdd data for two values of λ, comparing DisDCA with ADMM run with several penalty parameters ρ.]
Figure 2: DisDCA vs ADMM.

[Figure 3 here: six panels (a)-(f) on the covtype and kdd data, plotting convergence against the number of iterations for several values of m under different settings of λ.]
Figure 3: with different m.
[Figure 4 here: six panels (a)-(f) plotting the duality gap against the number of iterations under different settings of λ and m, comparing two values of K.]
Figure 4: with different K.
[Figure 5 here: eight panels (a)-(h) plotting convergence against the number of iterations, comparing two variants of DisDCA under different values of m and λ.]
Figure 5: comparison of different variants of DisDCA with different m and λ on the kdd data.