Supplemental Material: Scaling Up Sparse Support Vector Machines by Simultaneous Feature and Sample Reduction

Weizhong Zhang *1,2, Bin Hong *1,3, Wei Liu 2, Jieping Ye 3, Deng Cai 1, Xiaofei He 1, Jie Wang 3
1 State Key Lab of CAD&CG, Zhejiang University, China; 2 Tencent AI Lab, Shenzhen, China; 3 University of Michigan, USA

In this supplement, we first present the detailed proofs of all the theorems in the main text and then report the remaining experimental results, which are omitted from the experiment section due to the space limitation.

A. Proof for Theorem 1

Proof of Theorem 1:
(i): Let $\bar{X} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$ and $z = 1 - \bar{X}^\top w$; the primal problem (P$^*$) is then equivalent to
\[
\min_{w \in \mathbb{R}^p, z \in \mathbb{R}^n} \frac{\alpha}{2}\|w\|^2 + \beta\|w\|_1 + \frac{1}{n}\sum_{i=1}^{n} l([z]_i), \quad \text{s.t. } z = 1 - \bar{X}^\top w.
\]
The Lagrangian then becomes
\[
L(w, z, \theta) = \frac{\alpha}{2}\|w\|^2 + \beta\|w\|_1 + \frac{1}{n}\sum_{i=1}^{n} l([z]_i) + \frac{1}{n}\langle 1 - \bar{X}^\top w - z, \theta \rangle \tag{17}
\]
\[
= \underbrace{\frac{\alpha}{2}\|w\|^2 + \beta\|w\|_1 - \frac{1}{n}\langle \bar{X}\theta, w \rangle}_{:= f_1(w)} + \underbrace{\frac{1}{n}\sum_{i=1}^{n} l([z]_i) - \frac{1}{n}\langle z, \theta \rangle}_{:= f_2(z)} + \frac{1}{n}\langle 1, \theta \rangle.
\]
We first consider the subproblem $\min_{w} L(w, z, \theta)$:
\[
0 \in \partial_w L(w, z, \theta) = \partial_w f_1(w) = \alpha w - \frac{1}{n}\bar{X}\theta + \beta \partial\|w\|_1 \tag{18}
\]
\[
\Longrightarrow \frac{1}{n}\bar{X}\theta \in \alpha w + \beta \partial\|w\|_1 \Longrightarrow w = \frac{1}{\alpha} S_\beta\!\Big(\frac{1}{n}\bar{X}\theta\Big). \tag{19}
\]
By substituting (19) into $f_1(w)$, we get
\[
\min_w f_1(w) = \frac{\alpha}{2}\|w\|^2 + \beta\|w\|_1 - \langle \alpha w + \beta \partial\|w\|_1, w \rangle = -\frac{\alpha}{2}\|w\|^2 = -\frac{1}{2\alpha}\Big\|S_\beta\!\Big(\frac{1}{n}\bar{X}\theta\Big)\Big\|^2. \tag{20}
\]
Then, we consider the subproblem $\min_z L(w, z, \theta)$:
\[
0 = \partial_{[z]_i} L(w, z, \theta) = \partial_{[z]_i} f_2(z) \Longrightarrow [\theta]_i = \begin{cases} 0, & \text{if } [z]_i < 0, \\ \frac{1}{\gamma}[z]_i, & \text{if } 0 \le [z]_i \le \gamma, \\ 1, & \text{if } [z]_i > \gamma, \end{cases} \tag{21}
\]
and at such a stationary point,
\[
l([z]_i) - [z]_i[\theta]_i = \begin{cases} 0, & \text{if } [z]_i < 0, \\ \frac{[z]_i^2}{2\gamma} - \frac{[z]_i^2}{\gamma} = -\frac{\gamma}{2}[\theta]_i^2, & \text{if } 0 \le [z]_i \le \gamma, \\ -\frac{\gamma}{2}, & \text{if } [z]_i > \gamma, \end{cases}
\]
which equals $-\frac{\gamma}{2}[\theta]_i^2$ in all three cases. Thus, we have
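To make (19) and (20) concrete, the following minimal NumPy sketch (illustrative only, not from the original supplement; all names and sizes are made up) checks numerically that $w = \frac{1}{\alpha}S_\beta(\frac{1}{n}\bar{X}\theta)$ minimizes $f_1$ and that the minimum equals $-\frac{1}{2\alpha}\|S_\beta(\frac{1}{n}\bar{X}\theta)\|^2$:

```python
import numpy as np

def soft_threshold(u, beta):
    # [S_beta(u)]_i = sign([u]_i) * max(|[u]_i| - beta, 0)
    return np.sign(u) * np.maximum(np.abs(u) - beta, 0.0)

rng = np.random.default_rng(0)
n, p, alpha, beta = 50, 10, 0.7, 0.3
Xbar = rng.standard_normal((p, n))       # column i is x_bar_i
theta = rng.uniform(0.0, 1.0, size=n)    # an arbitrary dual point in [0, 1]^n
v = Xbar @ theta / n                     # (1/n) * Xbar * theta

# f1(w) = alpha/2 * ||w||^2 + beta * ||w||_1 - <v, w>
f1 = lambda w: 0.5 * alpha * (w @ w) + beta * np.abs(w).sum() - v @ w

w_min = soft_threshold(v, beta) / alpha                         # Eq. (19)
val_min = -(soft_threshold(v, beta) ** 2).sum() / (2 * alpha)   # Eq. (20)
assert np.isclose(f1(w_min), val_min)

# w_min beats random perturbations -- a cheap sanity check of optimality
for _ in range(1000):
    assert f1(w_min) <= f1(w_min + 1e-3 * rng.standard_normal(p)) + 1e-12
print("closed-form minimizer of f1 verified")
```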
\[
\min_z f_2(z) = \begin{cases} -\frac{\gamma}{2n}\|\theta\|^2, & \text{if } [\theta]_i \in [0, 1], \forall i \in [n], \\ -\infty, & \text{otherwise.} \end{cases} \tag{22}
\]
Combining Eq. (17), Eq. (20) and Eq. (22), we obtain the dual problem:
\[
\min_{\theta \in [0,1]^n} D(\theta; \alpha, \beta) = \frac{1}{2\alpha}\Big\|S_\beta\!\Big(\frac{1}{n}\bar{X}\theta\Big)\Big\|^2 + \frac{\gamma}{2n}\|\theta\|^2 - \frac{1}{n}\langle 1, \theta \rangle. \tag{23}
\]
(ii): From Eq. (19) and Eq. (21), we get the KKT conditions:
\[
w^*(\alpha, \beta) = \frac{1}{\alpha} S_\beta\!\Big(\frac{1}{n}\bar{X}\theta^*(\alpha, \beta)\Big), \tag{KKT-1}
\]
\[
[\theta^*(\alpha, \beta)]_i = \begin{cases} 0, & \text{if } 1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle < 0, \\ \frac{1}{\gamma}\big(1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle\big), & \text{if } 0 \le 1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle \le \gamma, \\ 1, & \text{if } 1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle > \gamma, \end{cases} \quad i = 1, \ldots, n. \tag{KKT-2}
\]

B. Proof for Lemma 1

Proof of Lemma 1:
1) It is the conclusion of the analysis above.
2) After feature screening, the primal problem (P$^*$) is scaled into:
\[
\min_{\tilde{w} \in \mathbb{R}^{|\hat{F}^c|}} \frac{\alpha}{2}\|\tilde{w}\|^2 + \beta\|\tilde{w}\|_1 + \frac{1}{n}\sum_{i=1}^{n} l\big(1 - \langle [\bar{x}_i]_{\hat{F}^c}, \tilde{w} \rangle\big). \tag{scaled-P$^*$-1}
\]
Thus, we can easily derive the dual problem of (scaled-P$^*$-1):
\[
\min_{\tilde{\theta} \in [0,1]^n} \tilde{D}(\tilde{\theta}; \alpha, \beta) = \frac{1}{2\alpha}\Big\|S_\beta\!\Big(\frac{1}{n}[\bar{X}]_{\hat{F}^c}\tilde{\theta}\Big)\Big\|^2 + \frac{\gamma}{2n}\|\tilde{\theta}\|^2 - \frac{1}{n}\langle 1, \tilde{\theta} \rangle, \tag{scaled-D$^*$-1}
\]
and also the KKT conditions:
\[
\tilde{w}^*(\alpha, \beta) = \frac{1}{\alpha} S_\beta\!\Big(\frac{1}{n}[\bar{X}]_{\hat{F}^c}\tilde{\theta}^*(\alpha, \beta)\Big), \tag{scaled-KKT-1}
\]
\[
[\tilde{\theta}^*(\alpha, \beta)]_i = \begin{cases} 0, & \text{if } 1 - \langle [\bar{x}_i]_{\hat{F}^c}, \tilde{w}^*(\alpha, \beta) \rangle < 0, \\ \frac{1}{\gamma}\big(1 - \langle [\bar{x}_i]_{\hat{F}^c}, \tilde{w}^*(\alpha, \beta) \rangle\big), & \text{if } 0 \le 1 - \langle [\bar{x}_i]_{\hat{F}^c}, \tilde{w}^*(\alpha, \beta) \rangle \le \gamma, \\ 1, & \text{if } 1 - \langle [\bar{x}_i]_{\hat{F}^c}, \tilde{w}^*(\alpha, \beta) \rangle > \gamma. \end{cases} \tag{scaled-KKT-2}
\]
Then, it is obvious that $\tilde{w}^*(\alpha, \beta) = [w^*(\alpha, \beta)]_{\hat{F}^c}$, since problem (scaled-P$^*$-1) can essentially be derived by substituting 0 for the weights of the eliminated features in problem (P$^*$) and optimizing over the remaining weights. Since the solutions $w^*(\alpha, \beta)$ and $\theta^*(\alpha, \beta)$ satisfy the conditions KKT-1 and KKT-2, and $\langle [\bar{x}_i]_{\hat{F}^c}, [w^*(\alpha, \beta)]_{\hat{F}^c} \rangle = \langle \bar{x}_i, w^*(\alpha, \beta) \rangle$ for all $i \in [n]$, we know that $[w^*(\alpha, \beta)]_{\hat{F}^c}$ and $\theta^*(\alpha, \beta)$ satisfy the conditions scaled-KKT-1 and scaled-KKT-2. So they are the solutions of problems (scaled-P$^*$-1) and (scaled-D$^*$-1). Thus, due to the uniqueness of the solution of problem (scaled-D$^*$-1), we have
\[
\tilde{\theta}^*(\alpha, \beta) = \theta^*(\alpha, \beta). \tag{24}
\]
From 1) we have $[\tilde{\theta}^*(\alpha, \beta)]_{\hat{R}} = 0$ and $[\tilde{\theta}^*(\alpha, \beta)]_{\hat{L}} = 1$. Therefore, from the dual problem (scaled-D$^*$-1), we can see that $[\tilde{\theta}^*(\alpha, \beta)]_{\hat{D}^c}$ can be recovered from the following problem:
\[
\min_{\hat{\theta} \in [0,1]^{|\hat{D}^c|}} \frac{1}{2\alpha}\Big\|S_\beta\!\Big(\frac{1}{n}\big(\hat{G}_1\hat{\theta} + \hat{G}_2 1\big)\Big)\Big\|^2 + \frac{\gamma}{2n}\|\hat{\theta}\|^2 - \frac{1}{n}\langle 1, \hat{\theta} \rangle,
\]
where $\hat{G}_1 = [\bar{X}]_{\hat{F}^c, \hat{D}^c}$ and $\hat{G}_2 = [\bar{X}]_{\hat{F}^c, \hat{L}}$. Since $[\tilde{\theta}^*(\alpha, \beta)]_{\hat{D}^c} = [\theta^*(\alpha, \beta)]_{\hat{D}^c}$, the proof is therefore completed.
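The index bookkeeping behind Lemma 1 is easy to mirror in code. The following NumPy sketch (hypothetical index sets and sizes; only the bookkeeping is shown, not a dual solver) builds the data $\hat{G}_1$, $\hat{G}_2$ of the reduced dual problem and assembles the full dual solution from a reduced one:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 8, 6
Xbar = rng.standard_normal((p, n))        # column i is x_bar_i

# hypothetical screening output:
F_hat = np.array([2, 5])                  # features with [w*]_F = 0
L_hat = np.array([0])                     # samples with [theta*]_L = 1
R_hat = np.array([4])                     # samples with [theta*]_R = 0

Fc = np.setdiff1d(np.arange(p), F_hat)                     # surviving features
Dc = np.setdiff1d(np.arange(n), np.union1d(R_hat, L_hat))  # undetermined samples

# data of the reduced dual problem:
G1 = Xbar[np.ix_(Fc, Dc)]     # G1 = [Xbar]_{F^c, D^c}, multiplies the unknown theta_hat
G2 = Xbar[np.ix_(Fc, L_hat)]  # G2 = [Xbar]_{F^c, L}, contributes G2 @ 1 since theta_L = 1

# once the reduced dual has been solved for theta_hat (placeholder value here),
# the full dual solution is assembled coordinate-wise:
theta_hat = np.full(Dc.size, 0.5)
theta = np.empty(n)
theta[Dc], theta[L_hat], theta[R_hat] = theta_hat, 1.0, 0.0
```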
C. Proof for Lemma 2

Proof: Due to the $\alpha$-strong convexity of the objective $P(w; \alpha, \beta_0)$ in $w$, we have
\[
P(w^*(\alpha_0, \beta_0); \alpha, \beta_0) \ge P(w^*(\alpha, \beta_0); \alpha, \beta_0) + \frac{\alpha}{2}\|w^*(\alpha_0, \beta_0) - w^*(\alpha, \beta_0)\|^2,
\]
\[
P(w^*(\alpha, \beta_0); \alpha_0, \beta_0) \ge P(w^*(\alpha_0, \beta_0); \alpha_0, \beta_0) + \frac{\alpha_0}{2}\|w^*(\alpha, \beta_0) - w^*(\alpha_0, \beta_0)\|^2,
\]
which are equivalent to
\[
\frac{\alpha}{2}\|w^*(\alpha_0, \beta_0)\|^2 + \beta_0\|w^*(\alpha_0, \beta_0)\|_1 + \frac{1}{n}\sum_{i=1}^{n} l\big(1 - \langle \bar{x}_i, w^*(\alpha_0, \beta_0) \rangle\big) \ge \frac{\alpha}{2}\|w^*(\alpha, \beta_0)\|^2 + \beta_0\|w^*(\alpha, \beta_0)\|_1 + \frac{1}{n}\sum_{i=1}^{n} l\big(1 - \langle \bar{x}_i, w^*(\alpha, \beta_0) \rangle\big) + \frac{\alpha}{2}\|w^*(\alpha_0, \beta_0) - w^*(\alpha, \beta_0)\|^2,
\]
\[
\frac{\alpha_0}{2}\|w^*(\alpha, \beta_0)\|^2 + \beta_0\|w^*(\alpha, \beta_0)\|_1 + \frac{1}{n}\sum_{i=1}^{n} l\big(1 - \langle \bar{x}_i, w^*(\alpha, \beta_0) \rangle\big) \ge \frac{\alpha_0}{2}\|w^*(\alpha_0, \beta_0)\|^2 + \beta_0\|w^*(\alpha_0, \beta_0)\|_1 + \frac{1}{n}\sum_{i=1}^{n} l\big(1 - \langle \bar{x}_i, w^*(\alpha_0, \beta_0) \rangle\big) + \frac{\alpha_0}{2}\|w^*(\alpha, \beta_0) - w^*(\alpha_0, \beta_0)\|^2.
\]
Adding the above two inequalities together, we get
\[
\frac{\alpha_0 - \alpha}{2}\|w^*(\alpha, \beta_0)\|^2 - \frac{\alpha_0 - \alpha}{2}\|w^*(\alpha_0, \beta_0)\|^2 \ge \frac{\alpha_0 + \alpha}{2}\|w^*(\alpha, \beta_0) - w^*(\alpha_0, \beta_0)\|^2,
\]
which, by completing the square, is equivalent to
\[
\Big\|w^*(\alpha, \beta_0) - \frac{\alpha_0 + \alpha}{2\alpha} w^*(\alpha_0, \beta_0)\Big\|^2 \le \frac{(\alpha_0 - \alpha)^2}{4\alpha^2}\|w^*(\alpha_0, \beta_0)\|^2. \tag{25}
\]
Substituting the prior $[w^*(\alpha, \beta_0)]_{\hat{F}} = 0$ into (25), we get
\[
\Big\|[w^*(\alpha, \beta_0)]_{\hat{F}^c} - \frac{\alpha_0 + \alpha}{2\alpha}[w^*(\alpha_0, \beta_0)]_{\hat{F}^c}\Big\|^2 \le \frac{(\alpha_0 - \alpha)^2}{4\alpha^2}\|w^*(\alpha_0, \beta_0)\|^2 - \frac{(\alpha_0 + \alpha)^2}{4\alpha^2}\|[w^*(\alpha_0, \beta_0)]_{\hat{F}}\|^2.
\]
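Lemma 2 turns a single reference solution into a ball that contains the unknown $[w^*(\alpha, \beta_0)]_{\hat{F}^c}$. The center and squared radius can be read off the final inequality above; the sketch below (a hypothetical helper, not code from the paper) computes them:

```python
import numpy as np

def primal_ball(w_ref, F_hat, alpha0, alpha):
    """Ball B(c, r) containing [w*(alpha, beta0)]_{F^c} per Lemma 2.

    w_ref is the reference solution w*(alpha0, beta0); F_hat indexes features
    already known to satisfy [w*(alpha, beta0)]_F = 0.
    """
    Fc = np.setdiff1d(np.arange(w_ref.size), F_hat)
    c = (alpha0 + alpha) / (2.0 * alpha) * w_ref[Fc]
    r2 = ((alpha0 - alpha) ** 2 * (w_ref @ w_ref)
          - (alpha0 + alpha) ** 2 * (w_ref[F_hat] @ w_ref[F_hat])) / (4.0 * alpha ** 2)
    return c, np.sqrt(max(r2, 0.0))   # clip tiny negative r2 from round-off
```

Note that the squared radius shrinks as $\|[w^*(\alpha_0, \beta_0)]_{\hat{F}}\|$ grows, i.e., a larger prior set $\hat{F}$ yields a tighter primal estimate; this is one half of the synergy effect discussed in Section L.1.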
D. Proof for Lemma 3

Proof: Firstly, we need to extend the definition of $D(\theta; \alpha, \beta)$ to $\mathbb{R}^n$:
\[
\bar{D}(\theta; \alpha, \beta) = \begin{cases} D(\theta; \alpha, \beta), & \text{if } \theta \in [0, 1]^n, \\ +\infty, & \text{otherwise.} \end{cases} \tag{26}
\]
Due to the $\frac{\gamma}{n}$-strong convexity of the objective $\bar{D}(\theta; \alpha, \beta)$, we have
\[
\bar{D}(\theta^*(\alpha_0, \beta_0); \alpha, \beta_0) \ge \bar{D}(\theta^*(\alpha, \beta_0); \alpha, \beta_0) + \frac{\gamma}{2n}\|\theta^*(\alpha_0, \beta_0) - \theta^*(\alpha, \beta_0)\|^2,
\]
\[
\bar{D}(\theta^*(\alpha, \beta_0); \alpha_0, \beta_0) \ge \bar{D}(\theta^*(\alpha_0, \beta_0); \alpha_0, \beta_0) + \frac{\gamma}{2n}\|\theta^*(\alpha, \beta_0) - \theta^*(\alpha_0, \beta_0)\|^2.
\]
Since $\theta^*(\alpha, \beta_0), \theta^*(\alpha_0, \beta_0) \in [0, 1]^n$, the above inequalities are equivalent to
\[
\frac{1}{2\alpha}\Big\|S_{\beta_0}\!\Big(\frac{1}{n}\bar{X}\theta^*(\alpha_0, \beta_0)\Big)\Big\|^2 + \frac{\gamma}{2n}\|\theta^*(\alpha_0, \beta_0)\|^2 - \frac{1}{n}\langle 1, \theta^*(\alpha_0, \beta_0) \rangle \ge \frac{1}{2\alpha}\Big\|S_{\beta_0}\!\Big(\frac{1}{n}\bar{X}\theta^*(\alpha, \beta_0)\Big)\Big\|^2 + \frac{\gamma}{2n}\|\theta^*(\alpha, \beta_0)\|^2 - \frac{1}{n}\langle 1, \theta^*(\alpha, \beta_0) \rangle + \frac{\gamma}{2n}\|\theta^*(\alpha_0, \beta_0) - \theta^*(\alpha, \beta_0)\|^2,
\]
\[
\frac{1}{2\alpha_0}\Big\|S_{\beta_0}\!\Big(\frac{1}{n}\bar{X}\theta^*(\alpha, \beta_0)\Big)\Big\|^2 + \frac{\gamma}{2n}\|\theta^*(\alpha, \beta_0)\|^2 - \frac{1}{n}\langle 1, \theta^*(\alpha, \beta_0) \rangle \ge \frac{1}{2\alpha_0}\Big\|S_{\beta_0}\!\Big(\frac{1}{n}\bar{X}\theta^*(\alpha_0, \beta_0)\Big)\Big\|^2 + \frac{\gamma}{2n}\|\theta^*(\alpha_0, \beta_0)\|^2 - \frac{1}{n}\langle 1, \theta^*(\alpha_0, \beta_0) \rangle + \frac{\gamma}{2n}\|\theta^*(\alpha, \beta_0) - \theta^*(\alpha_0, \beta_0)\|^2.
\]
Multiplying the first inequality by $n\alpha$, the second by $n\alpha_0$, and adding them together (the soft-thresholding terms cancel), we get
\[
\frac{\gamma(\alpha_0 - \alpha)}{2}\|\theta^*(\alpha, \beta_0)\|^2 - (\alpha_0 - \alpha)\langle 1, \theta^*(\alpha, \beta_0) \rangle \ge \frac{\gamma(\alpha_0 - \alpha)}{2}\|\theta^*(\alpha_0, \beta_0)\|^2 - (\alpha_0 - \alpha)\langle 1, \theta^*(\alpha_0, \beta_0) \rangle + \frac{\gamma(\alpha_0 + \alpha)}{2}\|\theta^*(\alpha, \beta_0) - \theta^*(\alpha_0, \beta_0)\|^2.
\]
That is equivalent to
\[
\|\theta^*(\alpha, \beta_0)\|^2 - \Big\langle \frac{\alpha_0 + \alpha}{\alpha}\theta^*(\alpha_0, \beta_0) - \frac{\alpha_0 - \alpha}{\gamma\alpha} 1, \theta^*(\alpha, \beta_0) \Big\rangle \le -\frac{\alpha_0}{\alpha}\|\theta^*(\alpha_0, \beta_0)\|^2 + \frac{\alpha_0 - \alpha}{\gamma\alpha}\langle 1, \theta^*(\alpha_0, \beta_0) \rangle. \tag{27}
\]
That is,
\[
\Big\|\theta^*(\alpha, \beta_0) - \Big(\frac{\alpha - \alpha_0}{2\gamma\alpha} 1 + \frac{\alpha_0 + \alpha}{2\alpha}\theta^*(\alpha_0, \beta_0)\Big)\Big\|^2 \le \Big(\frac{\alpha_0 - \alpha}{2\alpha}\Big)^2 \Big\|\theta^*(\alpha_0, \beta_0) - \frac{1}{\gamma} 1\Big\|^2. \tag{28}
\]
Substituting the priors $[\theta^*(\alpha, \beta_0)]_{\hat{R}} = 0$ and $[\theta^*(\alpha, \beta_0)]_{\hat{L}} = 1$ into (28), we have
\[
\Big\|[\theta^*(\alpha, \beta_0)]_{\hat{D}^c} - \Big(\frac{\alpha - \alpha_0}{2\gamma\alpha} 1 + \frac{\alpha_0 + \alpha}{2\alpha}[\theta^*(\alpha_0, \beta_0)]_{\hat{D}^c}\Big)\Big\|^2 \le \Big(\frac{\alpha_0 - \alpha}{2\alpha}\Big)^2 \Big\|\theta^*(\alpha_0, \beta_0) - \frac{1}{\gamma} 1\Big\|^2 - \Big\|\frac{(2\gamma - 1)\alpha + \alpha_0}{2\gamma\alpha} 1 - \frac{\alpha_0 + \alpha}{2\alpha}[\theta^*(\alpha_0, \beta_0)]_{\hat{L}}\Big\|^2 - \Big\|\frac{\alpha - \alpha_0}{2\gamma\alpha} 1 + \frac{\alpha_0 + \alpha}{2\alpha}[\theta^*(\alpha_0, \beta_0)]_{\hat{R}}\Big\|^2.
\]
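Analogously to Lemma 2, Lemma 3 yields a ball containing $[\theta^*(\alpha, \beta_0)]_{\hat{D}^c}$. A small sketch of the center/radius computation from the final inequality above (a hypothetical helper; the shrinking terms for $\hat{L}$ and $\hat{R}$ follow the display verbatim):

```python
import numpy as np

def dual_ball(theta_ref, R_hat, L_hat, alpha0, alpha, gamma):
    """Ball B(c, r) containing [theta*(alpha, beta0)]_{D^c} per Lemma 3.

    theta_ref is the reference solution theta*(alpha0, beta0); R_hat / L_hat
    index samples already known to satisfy theta*_i = 0 / theta*_i = 1.
    """
    n = theta_ref.size
    Dc = np.setdiff1d(np.arange(n), np.union1d(R_hat, L_hat))
    shift = (alpha - alpha0) / (2.0 * gamma * alpha)  # coefficient of the all-ones vector
    scale = (alpha0 + alpha) / (2.0 * alpha)
    c = shift + scale * theta_ref[Dc]
    r2 = ((alpha0 - alpha) / (2.0 * alpha)) ** 2 * np.sum((theta_ref - 1.0 / gamma) ** 2)
    # the squared radius shrinks by the mass spent on the already-fixed blocks:
    r2 -= np.sum((((2.0 * gamma - 1.0) * alpha + alpha0) / (2.0 * gamma * alpha)
                  - scale * theta_ref[L_hat]) ** 2)
    r2 -= np.sum((shift + scale * theta_ref[R_hat]) ** 2)
    return c, np.sqrt(max(r2, 0.0))
```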
E. Proof for Lemma 4

Before the proof of Lemma 4, we first show that the optimization problem in (11) is equivalent to
\[
s_i(\alpha, \beta_0) = \max_{\theta \in \Theta_1}\Big\{\langle [\bar{x}^i]_{\hat{D}^c}, \theta \rangle + \langle [\bar{x}^i]_{\hat{L}}, 1 \rangle\Big\}, \quad i \in \hat{F}^c. \tag{29}
\]
To avoid notational confusion, we denote the feasible region in (11) by $\Theta$; it constrains $[\theta]_{\hat{L}} = 1$, $[\theta]_{\hat{R}} = 0$, and $[\theta]_{\hat{D}^c} \in \Theta_1$. Then,
\[
\max_{\theta \in \Theta}\big\{[\bar{X}\theta]_i\big\} = \max_{\theta \in \Theta}\big\{\langle \bar{x}^i, \theta \rangle\big\} = \max_{\theta \in \Theta}\Big\{\langle [\bar{x}^i]_{\hat{D}^c}, [\theta]_{\hat{D}^c} \rangle + \langle [\bar{x}^i]_{\hat{L}}, [\theta]_{\hat{L}} \rangle + \langle [\bar{x}^i]_{\hat{R}}, [\theta]_{\hat{R}} \rangle\Big\} = \max_{[\theta]_{\hat{D}^c} \in \Theta_1}\Big\{\langle [\bar{x}^i]_{\hat{D}^c}, [\theta]_{\hat{D}^c} \rangle + \langle [\bar{x}^i]_{\hat{L}}, 1 \rangle\Big\} = s_i(\alpha, \beta_0).
\]
The last equation holds since $[\theta]_{\hat{L}} = 1$, $[\theta]_{\hat{R}} = 0$ and $[\theta]_{\hat{D}^c} \in \Theta_1$.

Proof of Lemma 4: With $\Theta_1 = B(c, r)$, we have
\[
s_i(\alpha, \beta_0) = \max_{\theta \in B(c, r)}\Big\{\langle [\bar{x}^i]_{\hat{D}^c}, \theta \rangle + \langle [\bar{x}^i]_{\hat{L}}, 1 \rangle\Big\} = \max_{\eta \in B(0, r)}\Big\{\langle [\bar{x}^i]_{\hat{D}^c}, c \rangle + \langle [\bar{x}^i]_{\hat{L}}, 1 \rangle + \langle [\bar{x}^i]_{\hat{D}^c}, \eta \rangle\Big\} = \langle [\bar{x}^i]_{\hat{D}^c}, c \rangle + \langle [\bar{x}^i]_{\hat{L}}, 1 \rangle + \|[\bar{x}^i]_{\hat{D}^c}\| r.
\]
The last equality holds since $-\|[\bar{x}^i]_{\hat{D}^c}\| r \le \langle [\bar{x}^i]_{\hat{D}^c}, \eta \rangle \le \|[\bar{x}^i]_{\hat{D}^c}\| r$.

F. Proof for Theorem 4

Proof: (1) It can be obtained from the rule (R1). (2) It follows from the definition of $\hat{F}$.

G. Proof for Lemma 5

Firstly, we need to point out that the optimization problems in (12) and (13) are equivalent to the problems:
\[
u_i(\alpha, \beta_0) = \max_{w \in \mathcal{W}}\big\{1 - \langle [\bar{x}_i]_{\hat{F}^c}, w \rangle\big\}, \quad i \in \hat{D}^c, \tag{30}
\]
\[
l_i(\alpha, \beta_0) = \min_{w \in \mathcal{W}}\big\{1 - \langle [\bar{x}_i]_{\hat{F}^c}, w \rangle\big\}, \quad i \in \hat{D}^c. \tag{31}
\]
They follow from the fact that $[w]_{\hat{F}^c} \in \mathcal{W}$ and
\[
1 - \langle w, \bar{x}_i \rangle = 1 - \langle [w]_{\hat{F}^c}, [\bar{x}_i]_{\hat{F}^c} \rangle - \langle [w]_{\hat{F}}, [\bar{x}_i]_{\hat{F}} \rangle = 1 - \langle [w]_{\hat{F}^c}, [\bar{x}_i]_{\hat{F}^c} \rangle \quad (\text{since } [w]_{\hat{F}} = 0).
\]
Proof of Lemma 5: With $\mathcal{W} = B(c, r)$, we have
\[
u_i(\alpha, \beta_0) = \max_{w \in B(c, r)}\big\{1 - \langle [\bar{x}_i]_{\hat{F}^c}, w \rangle\big\} = \max_{\eta \in B(0, r)}\big\{1 - \langle [\bar{x}_i]_{\hat{F}^c}, c \rangle - \langle [\bar{x}_i]_{\hat{F}^c}, \eta \rangle\big\} = 1 - \langle [\bar{x}_i]_{\hat{F}^c}, c \rangle + \|[\bar{x}_i]_{\hat{F}^c}\| r,
\]
\[
l_i(\alpha, \beta_0) = \min_{w \in B(c, r)}\big\{1 - \langle [\bar{x}_i]_{\hat{F}^c}, w \rangle\big\} = \min_{\eta \in B(0, r)}\big\{1 - \langle [\bar{x}_i]_{\hat{F}^c}, c \rangle - \langle [\bar{x}_i]_{\hat{F}^c}, \eta \rangle\big\} = 1 - \langle [\bar{x}_i]_{\hat{F}^c}, c \rangle - \|[\bar{x}_i]_{\hat{F}^c}\| r.
\]

H. Proof for Theorem 5

Proof: (1) It can be obtained from the rule (R2). (2) It follows from the definitions of $\hat{R}$ and $\hat{L}$.
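Lemmas 4 and 5 are exactly what a single screening pass evaluates. The sketch below is illustrative: the exact forms of rules (R1) and (R2) are in the main text, and the thresholds used here are the ones implied by KKT-1 and KKT-2. It computes the ball extrema for features and the bounds $u_i$, $l_i$ for samples, collecting the newly identified index sets:

```python
import numpy as np

def screen_once(Xbar, c_w, r_w, c_th, r_th, Fc, Dc, L_hat, n, beta, gamma):
    """One joint screening pass. B(c_w, r_w) is the primal ball for [w*]_{F^c}
    (Lemma 2); B(c_th, r_th) is the dual ball for [theta*]_{D^c} (Lemma 3)."""
    new_F, new_R, new_L = [], [], []
    # IFS: by KKT-1, [w*]_i = 0 whenever |<xbar^i, theta*>| / n <= beta; this is
    # certain if it holds over the whole dual ball (extrema as in Lemma 4).
    for i in Fc:
        xi_D = Xbar[i, Dc]                         # [xbar^i]_{D^c} (row i of Xbar)
        base = xi_D @ c_th + Xbar[i, L_hat].sum()  # <., c> + <[xbar^i]_L, 1>
        rad = np.linalg.norm(xi_D) * r_th
        if max(abs(base + rad), abs(base - rad)) / n < beta:
            new_F.append(i)                        # feature i is inactive
    # ISS: by KKT-2, u_i < 0 forces theta*_i = 0, and l_i > gamma forces
    # theta*_i = 1, with u_i and l_i as in Lemma 5.
    for i in Dc:
        xi = Xbar[Fc, i]                           # [xbar_i]_{F^c} (column i)
        rad = np.linalg.norm(xi) * r_w
        u_i = 1.0 - xi @ c_w + rad
        l_i = 1.0 - xi @ c_w - rad
        if u_i < 0:
            new_R.append(i)
        elif l_i > gamma:
            new_L.append(i)
    return new_F, new_R, new_L
```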
I. Proof for Theorem 2

Proof of Theorem 2: We prove this theorem by verifying that the solutions $w^*(\alpha, \beta) = 0$ and $\theta^*(\alpha, \beta) = 1$ satisfy the conditions KKT-1 and KKT-2. Firstly, since $\beta \ge \beta_{\max} = \|\frac{1}{n}\bar{X}1\|_\infty$, we have $S_\beta(\frac{1}{n}\bar{X}1) = 0$. Thus, $w^*(\alpha, \beta) = 0$ and $\theta^*(\alpha, \beta) = 1$ satisfy the condition KKT-1. Then, for all $i \in [n]$, we have $1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle = 1 > \gamma$. Thus, $w^*(\alpha, \beta) = 0$ and $\theta^*(\alpha, \beta) = 1$ satisfy the condition KKT-2. Hence, they are the solutions of the primal problem (P$^*$) and the dual problem (D$^*$), respectively.

J. Proof for Theorem 3

Proof of Theorem 3: Similar to the proof of Theorem 2, we prove this theorem by verifying that the solutions $w^*(\alpha, \beta) = \frac{1}{\alpha}S_\beta(\frac{1}{n}\bar{X}\theta^*(\alpha, \beta))$ and $\theta^*(\alpha, \beta) = 1$ satisfy the conditions KKT-1 and KKT-2.

1. Case 1: $\alpha_{\max}(\beta) \le 0$. Then for all $\alpha > 0$, we have
\[
\min_{i \in [n]}\big\{1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle\big\} = \min_{i \in [n]}\Big\{1 - \frac{1}{\alpha}\Big\langle \bar{x}_i, S_\beta\Big(\frac{1}{n}\bar{X}\theta^*(\alpha, \beta)\Big)\Big\rangle\Big\} = \min_{i \in [n]}\Big\{1 - \frac{1}{\alpha}\Big\langle \bar{x}_i, S_\beta\Big(\frac{1}{n}\bar{X}1\Big)\Big\rangle\Big\} = 1 - \frac{1}{\alpha}\max_{i \in [n]}\Big\langle \bar{x}_i, S_\beta\Big(\frac{1}{n}\bar{X}1\Big)\Big\rangle = 1 - (1 - \gamma)\frac{\alpha_{\max}(\beta)}{\alpha} > \gamma,
\]
since $\alpha_{\max}(\beta) \le 0$ and $\gamma \in (0, 1)$. Then, $1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle > \gamma$ for all $i \in [n]$, and $w^*(\alpha, \beta) = \frac{1}{\alpha}S_\beta(\frac{1}{n}\bar{X}\theta^*(\alpha, \beta))$ and $\theta^*(\alpha, \beta) = 1$ satisfy the conditions KKT-1 and KKT-2. Hence, they are the optimal solutions of the primal and dual problems (P$^*$) and (D$^*$).

2. Case 2: $\alpha_{\max}(\beta) > 0$. Then for any $\alpha \ge \alpha_{\max}(\beta)$, we have
\[
\min_{i \in [n]}\big\{1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle\big\} = \min_{i \in [n]}\Big\{1 - \frac{1}{\alpha}\Big\langle \bar{x}_i, S_\beta\Big(\frac{1}{n}\bar{X}\theta^*(\alpha, \beta)\Big)\Big\rangle\Big\} = \min_{i \in [n]}\Big\{1 - \frac{1}{\alpha}\Big\langle \bar{x}_i, S_\beta\Big(\frac{1}{n}\bar{X}1\Big)\Big\rangle\Big\} = 1 - \frac{1}{\alpha}\max_{i \in [n]}\Big\langle \bar{x}_i, S_\beta\Big(\frac{1}{n}\bar{X}1\Big)\Big\rangle = 1 - (1 - \gamma)\frac{\alpha_{\max}(\beta)}{\alpha} \ge 1 - (1 - \gamma) = \gamma.
\]
Thus, $1 - \langle \bar{x}_i, w^*(\alpha, \beta) \rangle \ge \gamma$ for all $i \in [n]$, and $w^*(\alpha, \beta) = \frac{1}{\alpha}S_\beta(\frac{1}{n}\bar{X}\theta^*(\alpha, \beta))$ and $\theta^*(\alpha, \beta) = 1$ satisfy the conditions KKT-1 and KKT-2. Hence, they are the optimal solutions of the primal and dual problems (P$^*$) and (D$^*$).
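Theorems 2 and 3 delimit the region of the $(\alpha, \beta)$ grid where the solutions are available in closed form, so no solver needs to be run there. A sketch of the two boundary quantities follows (reconstructed notation; the exact statements are in the main text):

```python
import numpy as np

def soft_threshold(u, beta):
    return np.sign(u) * np.maximum(np.abs(u) - beta, 0.0)

def trivial_region(Xbar, gamma):
    """beta_max and alpha_max(beta) as used in Theorems 2 and 3: for
    beta >= beta_max, w* = 0 and theta* = 1; for alpha >= alpha_max(beta) > 0,
    w* = S_beta((1/n) Xbar 1) / alpha and theta* = 1."""
    n = Xbar.shape[1]
    v = Xbar @ np.ones(n) / n                  # (1/n) * Xbar * 1
    beta_max = np.abs(v).max()                 # || (1/n) Xbar 1 ||_inf
    def alpha_max(beta):
        # max_i <xbar_i, S_beta(v)>, rescaled by 1 / (1 - gamma)
        return (Xbar.T @ soft_threshold(v, beta)).max() / (1.0 - gamma)
    return beta_max, alpha_max
```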
K. Proof for Theorem 6

Proof of Theorem 6: (1) Given the reference solution pair $w^*(\alpha_{i-1,j}, \beta_j)$ and $\theta^*(\alpha_{i-1,j}, \beta_j)$, suppose we do ISS first in SIFS and apply ISS and IFS alternately an infinite number of times. If after $p$ triggerings no new inactive features or samples are identified, then we can denote the sequence of $\hat{F}$, $\hat{R}$ and $\hat{L}$ as:
\[
\hat{F}_0^A = \emptyset, \hat{R}_0^A = \emptyset, \hat{L}_0^A = \emptyset \xrightarrow{\text{ISS}} \hat{F}_1^A, \hat{R}_1^A, \hat{L}_1^A \xrightarrow{\text{IFS}} \hat{F}_2^A, \hat{R}_2^A, \hat{L}_2^A \xrightarrow{\text{ISS}} \cdots \hat{F}_p^A, \hat{R}_p^A, \hat{L}_p^A \xrightarrow{\text{IFS/ISS}} \cdots \tag{32}
\]
with
\[
\hat{F}_p^A = \hat{F}_{p+1}^A = \hat{F}_{p+2}^A = \cdots, \quad \hat{R}_p^A = \hat{R}_{p+1}^A = \hat{R}_{p+2}^A = \cdots, \quad \hat{L}_p^A = \hat{L}_{p+1}^A = \hat{L}_{p+2}^A = \cdots \tag{33}
\]
In the same way, if we do IFS first in SIFS and no new inactive features or samples are identified after $q$ triggerings of ISS and IFS, then the sequence can be denoted as:
\[
\hat{F}_0^B = \emptyset, \hat{R}_0^B = \emptyset, \hat{L}_0^B = \emptyset \xrightarrow{\text{IFS}} \hat{F}_1^B, \hat{R}_1^B, \hat{L}_1^B \xrightarrow{\text{ISS}} \hat{F}_2^B, \hat{R}_2^B, \hat{L}_2^B \xrightarrow{\text{IFS}} \cdots \hat{F}_q^B, \hat{R}_q^B, \hat{L}_q^B \xrightarrow{\text{IFS/ISS}} \cdots \tag{34}
\]
with
\[
\hat{F}_q^B = \hat{F}_{q+1}^B = \hat{F}_{q+2}^B = \cdots, \quad \hat{R}_q^B = \hat{R}_{q+1}^B = \hat{R}_{q+2}^B = \cdots, \quad \hat{L}_q^B = \hat{L}_{q+1}^B = \hat{L}_{q+2}^B = \cdots \tag{35}
\]
We first prove that $\hat{F}_k^B \subseteq \hat{F}_{k+1}^A$, $\hat{R}_k^B \subseteq \hat{R}_{k+1}^A$ and $\hat{L}_k^B \subseteq \hat{L}_{k+1}^A$ hold for all $k \ge 0$ by induction.
1) When $k = 0$, the inclusions $\hat{F}_0^B \subseteq \hat{F}_1^A$, $\hat{R}_0^B \subseteq \hat{R}_1^A$ and $\hat{L}_0^B \subseteq \hat{L}_1^A$ hold since $\hat{F}_0^B = \hat{R}_0^B = \hat{L}_0^B = \emptyset$.
2) If $\hat{F}_k^B \subseteq \hat{F}_{k+1}^A$, $\hat{R}_k^B \subseteq \hat{R}_{k+1}^A$ and $\hat{L}_k^B \subseteq \hat{L}_{k+1}^A$ hold, then by the synergy effect of ISS and IFS, $\hat{F}_{k+1}^B \subseteq \hat{F}_{k+2}^A$, $\hat{R}_{k+1}^B \subseteq \hat{R}_{k+2}^A$ and $\hat{L}_{k+1}^B \subseteq \hat{L}_{k+2}^A$ hold.
Thus, $\hat{F}_k^B \subseteq \hat{F}_{k+1}^A$, $\hat{R}_k^B \subseteq \hat{R}_{k+1}^A$ and $\hat{L}_k^B \subseteq \hat{L}_{k+1}^A$ hold for all $k \ge 0$. Similar to the analysis above, we can also prove that $\hat{F}_k^A \subseteq \hat{F}_{k+1}^B$, $\hat{R}_k^A \subseteq \hat{R}_{k+1}^B$ and $\hat{L}_k^A \subseteq \hat{L}_{k+1}^B$ hold for all $k \ge 0$.
Combining the two claims, we get
\[
\hat{F}_0^B \subseteq \hat{F}_1^A \subseteq \hat{F}_2^B \subseteq \hat{F}_3^A \subseteq \cdots \tag{36}
\]
\[
\hat{F}_0^A \subseteq \hat{F}_1^B \subseteq \hat{F}_2^A \subseteq \hat{F}_3^B \subseteq \cdots \tag{37}
\]
\[
\hat{R}_0^B \subseteq \hat{R}_1^A \subseteq \hat{R}_2^B \subseteq \hat{R}_3^A \subseteq \cdots \tag{38}
\]
\[
\hat{R}_0^A \subseteq \hat{R}_1^B \subseteq \hat{R}_2^A \subseteq \hat{R}_3^B \subseteq \cdots \tag{39}
\]
\[
\hat{L}_0^B \subseteq \hat{L}_1^A \subseteq \hat{L}_2^B \subseteq \hat{L}_3^A \subseteq \cdots \tag{40}
\]
\[
\hat{L}_0^A \subseteq \hat{L}_1^B \subseteq \hat{L}_2^A \subseteq \hat{L}_3^B \subseteq \cdots \tag{41}
\]
By the first equalities of (33) and (35) together with (36) and (37), we can get $\hat{F}_p^A = \hat{F}_q^B$. Similarly, we can get $\hat{R}_p^A = \hat{R}_q^B$ and $\hat{L}_p^A = \hat{L}_q^B$.
(2) If $p$ is odd, then by (36), (38) and (40), we have $\hat{F}_p^A \subseteq \hat{F}_{p+1}^B$, $\hat{R}_p^A \subseteq \hat{R}_{p+1}^B$ and $\hat{L}_p^A \subseteq \hat{L}_{p+1}^B$. Thus $q \le p + 1$. Otherwise, if $p$ is even, then by (37), (39) and (41), we have $\hat{F}_p^A \subseteq \hat{F}_{p+1}^B$, $\hat{R}_p^A \subseteq \hat{R}_{p+1}^B$ and $\hat{L}_p^A \subseteq \hat{L}_{p+1}^B$, and thus $q \le p + 1$ as well. Doing the same analysis for $q$, we can get $p \le q + 1$. Hence, $|p - q| \le 1$.
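The alternating structure analyzed in this proof is a plain fixed-point iteration, which can be phrased schematically as below (a sketch, not code from the paper; `iss` and `ifs` stand for the two screening rules, each mapping the current index sets to enlarged ones). Theorem 6 guarantees that the returned sets are identical for both orders and that the two trigger counts differ by at most 1:

```python
def sifs_loop(iss, ifs, iss_first=True):
    """Alternate ISS and IFS until one trigger identifies nothing new."""
    F, R, L = frozenset(), frozenset(), frozenset()
    rules = (iss, ifs) if iss_first else (ifs, iss)
    triggers = 0
    while True:
        F_new, R_new, L_new = rules[triggers % 2](F, R, L)
        if (F_new, R_new, L_new) == (F, R, L):
            return F, R, L, triggers   # stabilized after `triggers` triggerings
        F, R, L = F_new, R_new, L_new
        triggers += 1
```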
L. Experiment Results

L.1. Verification of the Synergy Effect

Here, we verify the synergy effect between ISS and IFS in SIFS with the experimental results on the dataset real-sim. In Fig. 4, SIFS performs ISS (sample screening) first, while in Fig. 5, it performs IFS (feature screening) first. All the rejection ratios (Fig. 4(a)-(d)) of the 1st triggering of IFS when SIFS performs ISS first are much higher than (at least equal to) those (Fig. 5(a)-(d)) when SIFS performs IFS first. In turn, all the rejection ratios (Fig. 5(e)-(h)) of the 1st triggering of ISS when SIFS performs IFS first are also much higher than those (Fig. 4(e)-(h)) when SIFS performs ISS first. This demonstrates that the screening result of ISS can reinforce the capability of IFS and vice versa, which is the so-called synergy effect. Finally, in Fig. 4 and Fig. 5, we can see that the overall rejection ratios at the end of SIFS are the same; hence, no matter which rule (ISS or IFS) SIFS performs first, it has the same screening performance in the end. This is consistent with Theorem 6.

L.2. The Rest of the Experiment Results

Below, we report the rejection ratios of SIFS on syn1 (Fig. 6), syn3 (Fig. 7), rcv1-train (Fig. 8), rcv1-test (Fig. 9), url (Fig. 10), and kddb (Fig. 11), which are omitted in the main text due to the space limitation.

[Figures 4-6: each figure contains eight panels of rejection-ratio curves, (a)-(d) for feature screening and (e)-(h) for sample screening, at four values of β/β_max from 0.05 to 0.9.]

Figure 4. Rejection ratios of SIFS on real-sim when it performs ISS first (first row: Feature Screening, second row: Sample Screening).

Figure 5. Rejection ratios of SIFS on real-sim when it performs IFS first (first row: Feature Screening, second row: Sample Screening).

Figure 6. Rejection ratios of SIFS on syn1 (first row: Feature Screening, second row: Sample Screening).
[Figures 7-9: eight panels of rejection-ratio curves each, laid out as in Figures 4-6.]

Figure 7. Rejection ratios of SIFS on syn3 (first row: Feature Screening, second row: Sample Screening).

Figure 8. Rejection ratios of SIFS on the rcv1-train dataset (first row: Feature Screening, second row: Sample Screening).

Figure 9. Rejection ratios of SIFS on the rcv1-test dataset (first row: Feature Screening, second row: Sample Screening).
[Figures 10-11: eight panels of rejection-ratio curves each, laid out as in Figures 4-6.]

Figure 10. Rejection ratios of SIFS on the url dataset (first row: Feature Screening, second row: Sample Screening).

Figure 11. Rejection ratios of SIFS on the kddb dataset (first row: Feature Screening, second row: Sample Screening).