Computer Engineering and Applications, 2009, 45(36): 165-169

Feature reduction on high-dimensional small-sample data

YOU Wen-jie 1,2, JI Guo-li 1, YUAN Ming-shun 2
1. Department of Automation, Xiamen University, Xiamen, Fujian 361005, China
2. Fuqing Branch, Fujian Normal University, Fuqing, Fujian 350300, China
E-mail: glji@xmu.edu.cn

YOU Wen-jie, JI Guo-li, YUAN Ming-shun. Feature reduction on high-dimensional small-sample data. Computer Engineering and Applications, 2009, 45(36): 165-169.

Abstract: In view of the characteristics of small-sample, high-dimensional data, Generalized Small Samples (GSS) are defined. Two ways of reducing the information features of GSS are considered: feature extraction (dimension extraction) and feature selection (dimension selection). Firstly, unsupervised feature extraction based on Principal Component Analysis (PCA) and supervised feature extraction based on Partial Least Squares (PLS) are introduced. Secondly, by analyzing the structure of the first component, new global PCA-based and PLS-based feature selection approaches are presented, and recursive feature elimination on PLS (PLS-RFE) is realized. Finally, the approaches are applied to the classification of the MIT AML/ALL data: feature extraction with PCA and PLS is performed, and feature selection is compared with PLS-RFE, realizing the information compression of GSS.

Key words: generalized small samples; Principal Component Analysis (PCA); Partial Least Squares (PLS); feature extraction; feature selection

DOI: 10.3778/j.issn.1002-8331.2009.36.049    Article ID: 1002-8331(2009)36-0165-05    Document code: A    CLC number: TP391

Foundation items: No.273843; No.JB8244. Biographies: YOU Wen-jie (born 1974); JI Guo-li; YUAN Ming-shun (born 1979). Received: 2009-08-24.

1 Introduction
In many applications, such as DNA microarray gene expression analysis, the number of measured variables is far larger than the number n of available samples; for such data, feature ranking with simple statistics such as the t-statistic is widely used [1-4].
Feature extraction methods such as PCA and PLS compress the original variables into a small number of components $t_1, t_2, \dots$, while feature selection methods such as recursive feature elimination (RFE) [5-6] retain a subset of the original variables; combining PLS with RFE yields PLS-RFE. The statistical background of PCA can be found in [7-8] and that of PLS in [9-10].

2 Feature extraction

Let $X=[X_1, X_2, \dots, X_p]$ denote the $n \times p$ data matrix of $p$ variables observed on $n$ samples, and let $Y$ denote the corresponding $n \times q$ response (class) matrix.

2.1 Principal Component Analysis (PCA)

PCA extracts components $t_h = Xw_h$ that successively maximize the variance of $X$:

$$\max\ \mathrm{var}(Xw_i) \quad \text{s.t.}\ \ w_i^{\mathrm T}w_i=1,\ \ w_i^{\mathrm T}X^{\mathrm T}Xw_j=0,\ 1 \le j < i$$

The component matrix is $T=XW$, and each weight vector satisfies the eigen-equation

$$(X^{\mathrm T}X-\lambda_i I)\,w_i = 0$$

so that $\lambda_i=\mathrm{var}(t_i)$ with $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_p$, and the weighting matrix $W$ collects $w_1, w_2, \dots$ The cumulative contribution of the first $m$ components is $\sum_{k=1}^{m}\lambda_k \big/ \sum_{i=1}^{p}\lambda_i$. The variation of $X$ explained by the components is measured by the redundancy

$$Rd(x_j;t_h)=r^2(x_j,t_h),\qquad Rd(X;t_h)=\frac{1}{p}\sum_{j=1}^{p}Rd(x_j;t_h),\qquad Rd(X;t_1,\dots,t_m)=\sum_{h=1}^{m}Rd(X;t_h)$$

2.2 Partial Least Squares (PLS)

PLS is a supervised method: it extracts pairs of components $t_i=Xw_i$, $u_i=Yc_i$ that maximize the covariance between $X$ and $Y$:

$$\max\ \mathrm{cov}(Xw_i,\,Yc_i) \quad \text{s.t.}\ \ w_i^{\mathrm T}w_i=1,\ c_i^{\mathrm T}c_i=1,\ \ w_i^{\mathrm T}X^{\mathrm T}Xw_j=0,\ c_i^{\mathrm T}Y^{\mathrm T}Yc_j=0,\ j<i$$

The solution $w_i$ is the dominant eigenvector of $X^{\mathrm T}YY^{\mathrm T}X$, with $c_i \propto Y^{\mathrm T}Xw_i$. For $i>1$ the same problem is solved on the deflated matrices $(I-P_X)X$ and $(I-P_Y)Y$, where

$$P_X=XW\big[(XW)^{\mathrm T}XW\big]^{-1}(XW)^{\mathrm T},\qquad P_Y=YC\big[(YC)^{\mathrm T}YC\big]^{-1}(YC)^{\mathrm T}$$

and $W=(w_{ij})$, $C=(c_{ij})$ collect the weight vectors already extracted. The first pair $(t_1,u_1)$ carries the strongest covariance between $X$ and $Y$: $t_1$ summarizes $X$ and $u_1$ summarizes $Y$. The variation of $Y$ explained by the PLS components is measured analogously:

$$Rd(y_k;t_h)=r^2(y_k,t_h),\qquad Rd(Y;t_h)=\frac{1}{q}\sum_{k=1}^{q}Rd(y_k;t_h),\qquad Rd(Y;t_1,\dots,t_m)=\sum_{h=1}^{m}Rd(Y;t_h)$$
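The quantities above can be computed directly from the eigen-equations. The sketch below is a Python/NumPy illustration, not the Matlab implementation used later in the paper; the data shapes, the random stand-in data and the choice of three components are assumptions made only for demonstration. It extracts PCA components from the eigen-decomposition of $X^{\mathrm T}X$, the first PLS pair from the dominant eigenvector of $X^{\mathrm T}YY^{\mathrm T}X$, and evaluates the Rd measures.

```python
import numpy as np

def pca_components(X, m):
    """PCA via the eigen-equation (X'X - lambda_i I) w_i = 0 on the centered data."""
    Xc = X - X.mean(axis=0)
    lam, vecs = np.linalg.eigh(Xc.T @ Xc)      # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:m]          # indices of the m largest eigenvalues
    W = vecs[:, order]                         # weight vectors w_1, ..., w_m
    T = Xc @ W                                 # component scores t_h = X w_h
    return W, T, lam[order]

def pls_first_pair(X, Y):
    """First PLS weights: w_1 is the dominant eigenvector of X'Y Y'X, c_1 ~ Y'X w_1."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    lam, vecs = np.linalg.eigh(Xc.T @ Yc @ Yc.T @ Xc)
    w1 = vecs[:, np.argmax(lam)]
    c1 = Yc.T @ Xc @ w1
    c1 = c1 / np.linalg.norm(c1)
    return w1, c1, Xc @ w1, Yc @ c1            # w_1, c_1, t_1 = X w_1, u_1 = Y c_1

def rd(V, t):
    """Rd(V; t): mean squared correlation r^2(v_j, t) over the columns v_j of V."""
    Vc, tc = V - V.mean(axis=0), t - t.mean()
    r = (Vc.T @ tc) / (np.linalg.norm(Vc, axis=0) * np.linalg.norm(tc))
    return float(np.mean(r ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((38, 200))         # n = 38 samples, p = 200 variables (p >> n)
    Y = rng.integers(0, 2, size=(38, 1)).astype(float)
    W, T, lam = pca_components(X, m=3)
    total = ((X - X.mean(axis=0)) ** 2).sum()  # sum of all eigenvalues of X'X
    print("cumulative contribution of 3 PCs:", lam.cumsum() / total)
    w1, c1, t1, u1 = pls_first_pair(X, Y)
    print("Rd(X; t1)  PCA vs PLS:", rd(X, T[:, 0]), rd(X, t1))
    print("Rd(Y; t1)  PCA vs PLS:", rd(Y, T[:, 0]), rd(Y, t1))
```

On real data the PLS score $t_1$ typically explains less of $X$ than the first principal component but considerably more of $Y$, which is the behaviour exploited in the following sections.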
3 Feature selection

3.1 Global feature selection based on the component structure

3.1.1 PCA-based selection
For GSS data the number of variables far exceeds the number of samples ($p \gg n$), so the variables are ranked globally through the structure of the leading principal components:
(1) perform PCA on $X$ and obtain the weight vectors $w_1, w_2, \dots$;
(2) retain the first $m$ components whose cumulative contribution $\sum_{k=1}^{m}\lambda_k/\sum_{i=1}^{p}\lambda_i$ reaches a threshold $1-\alpha$ (0.8 is used here);
(3) form $T=(t_{ij})=\langle X_i, w_j\rangle$, where $t_{ij}$ measures the contribution of variable $X_i$ to the $j$-th component;
(4) rank the variables according to $T$ and keep the top-ranked ones.

3.1.2 PLS-based selection
The supervised counterpart replaces PCA by PLS on $(X, Y)$: the PLS weight vectors are extracted, the matrix $T=(t_{ij})=\langle X_i, w_j\rangle$ is formed from them, and the variables are ranked by $T$.

3.2 Interpretation of the first component and recursive elimination
Filter-type selection ranks the variables independently of the classifier, whereas wrapper-type selection evaluates variable subsets with the classifier itself.

3.2.1 PCA
The first principal component $t_1$ is the best single summary of $X$; for standardized variables the squared correlations satisfy $\sum_{j=1}^{p}\rho^2(t_1, x_j)=\lambda_1$, so the correlation of each variable with $t_1$ indicates its share of the leading variance.

3.2.2 PLS
As in Section 2.2, the first PLS pair $(t_1, u_1)$ maximizes the covariance between $X$ and $Y$: $t_1$ summarizes $X$, $u_1$ summarizes $Y$, and the weight vector $w_1$ gives the contribution of each variable to $t_1$, i.e. its relevance to the classes.

3.2.3 PLS-RFE
Following the RFE scheme of Guyon et al. [5], PLS-RFE (Recursive Feature Elimination on PLS) alternates two steps: (1) fit a PLS model and rank the features by their weights (feature ranking); (2) eliminate the lowest-ranked features and refit on the remaining ones. The loop stops when the desired number of features is reached (an illustrative sketch of this loop is given below, after Section 4.2).

4 Experiments

4.1 Data
The MIT AML/ALL data of Golub et al. [1] are used: expression levels of 7129 genes, with a training set of 38 samples (27 Acute Lymphoblastic Leukemia, ALL; 11 Acute Myeloid Leukemia, AML) and an independent test set of 34 samples (20 ALL, 14 AML).

4.2 Experimental setup
Support vector machines (SVMs) are used as the classifier, implemented in Matlab with the OSU_SVM3.00 toolbox (http://www.kernelmethods.net/) and its linear classifier LinearSVC. The number of PLS factors (nfac) is chosen with the PRESS criterion (Prob > 0.1). PCA and PLS feature extraction and the feature selection methods of Section 3, including PLS-RFE, are compared on this data set.
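As a concrete illustration of the PLS-RFE loop outlined in Section 3.2.3, the sketch below ranks features by the magnitude of their PLS weights and removes a fixed fraction per round. It is written in Python with scikit-learn's PLSRegression as a stand-in for the Matlab environment actually used by the authors; the per-round drop fraction, the number of PLS components and the synthetic data are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_rfe(X, y, n_keep, n_components=3, drop_frac=0.1):
    """PLS-based recursive feature elimination (illustrative sketch).

    Each round fits a PLS model on the surviving features, ranks the features
    by the magnitude of their PLS weights (summed over the components), and
    removes the lowest-ranked fraction, until n_keep features remain.
    """
    surviving = np.arange(X.shape[1])
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    while surviving.size > n_keep:
        ncomp = min(n_components, surviving.size - 1)
        pls = PLSRegression(n_components=ncomp).fit(X[:, surviving], y)
        score = np.abs(pls.x_weights_).sum(axis=1)        # importance of each surviving feature
        n_drop = min(max(1, int(drop_frac * surviving.size)),
                     surviving.size - n_keep)
        weakest = np.argsort(score)[:n_drop]              # lowest-ranked features this round
        surviving = np.delete(surviving, weakest)
    return surviving

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((38, 500))                    # stand-in for a 38-sample training set
    y = rng.integers(0, 2, size=38)
    print("kept feature indices:", pls_rfe(X, y, n_keep=20))
```

Dropping a fraction of the features per round (rather than one at a time) keeps the number of PLS refits manageable when starting from thousands of genes; eliminating a single feature per round is the more conservative variant of the same scheme.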
4.2.1 Feature extraction
PCA and PLS are first applied to all 7129 genes. Figure 1 shows, for an increasing number of components, the cumulative percentage of the variance of X and of Y explained by the extracted components.

Figure 1  Cumulative explained variance (%) of X and Y versus the number of extracted components: (a) PCA, (b) PLS (all 7129 genes)

With k = 2-3 components the PLS components explain only about 23% of the variance of X but already about 91% of the variance of Y; the PCA components explain a larger share of the variance of X but are extracted without reference to the class labels.

4.2.2 Classification with the extracted components
The first k components (k = 2, ..., 9) are used as inputs to the linear SVM, trained on the 38 training samples and evaluated on the 34 test samples; Table 1 reports the results.

Table 1  Test-set accuracy and number of misclassified samples of the SVM (OSU_SVM3.00, LinearSVC) using the first k components, k = 2-9, extracted by PCA and PLS from all 7129 genes of the MIT AML/ALL data

PLS reaches its best accuracy with fewer components than PCA, and the best test accuracy obtained with the extracted components is 97.06% (33 of the 34 test samples), in line with the PLS results reported by Nguyen and Rocke [2-4].
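The extraction-plus-classification experiment of Sections 4.2.1-4.2.2 can be outlined as follows. This is not the authors' Matlab/OSU_SVM3.00 code: it is a minimal Python sketch that uses scikit-learn's PCA, PLSRegression and LinearSVC as stand-ins, with synthetic data of the same 38/34 train/test shape; the SVM parameters (C=1.0, max_iter) are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import LinearSVC

def test_accuracy(X_tr, y_tr, X_te, y_te, k, method="pca"):
    """Extract k components on the training set only, then classify with a linear SVM."""
    if method == "pca":
        red = PCA(n_components=k).fit(X_tr)                               # unsupervised extraction
    else:
        red = PLSRegression(n_components=k).fit(X_tr, y_tr.astype(float)) # supervised extraction
    T_tr, T_te = red.transform(X_tr), red.transform(X_te)
    clf = LinearSVC(C=1.0, max_iter=10000).fit(T_tr, y_tr)
    return clf.score(T_te, y_te)

if __name__ == "__main__":
    # Synthetic stand-in with the same shape as the MIT AML/ALL split
    # (38 training / 34 test samples, high-dimensional expression profiles).
    rng = np.random.default_rng(2)
    X_tr, X_te = rng.standard_normal((38, 2000)), rng.standard_normal((34, 2000))
    y_tr, y_te = rng.integers(0, 2, 38), rng.integers(0, 2, 34)
    for k in range(2, 10):
        print(f"k={k}  PCA: {test_accuracy(X_tr, y_tr, X_te, y_te, k, 'pca'):.3f}"
              f"  PLS: {test_accuracy(X_tr, y_tr, X_te, y_te, k, 'pls'):.3f}")
```

Note that the components are fitted on the training samples only and the test samples are merely projected onto them, mirroring the holdout protocol of the paper.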
Table 2  Classification accuracy (%) of the SVM with k genes selected by the PCA-based, PLS-based and PLS-RFE methods (MIT AML/ALL data, OSU_SVM3.00, LinearSVC); the last row uses all 7129 genes

Table 2 shows that the supervised methods (PLS-based selection and PLS-RFE) clearly outperform the unsupervised PCA-based selection, and that with only a small number of selected genes they reach an accuracy comparable to, or better than, the accuracy obtained with all 7129 genes.

4.3 Cross-validation
Besides the holdout split of Golub et al. [1], leave-one-out cross-validation (LOOCV) and k-fold (4-fold) cross-validation are used to assess the stability of PLS-based selection and PLS-RFE; the genes selected most frequently across the folds include genes already identified as informative by Golub et al. [1] and in [3].

5 Conclusion
PCA extracts components that summarize X without using the response Y, whereas PLS uses Y to guide the extraction, so the PLS components are more relevant to classification. For generalized small samples, both feature extraction and feature selection based on PLS, including PLS-RFE, compress the information in X effectively and give better classification than the PCA-based counterparts.

References
[1] Golub T R, Slonim D K, Tamayo P, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring[J]. Science, 1999, 286(5439): 531-537.
[2] Nguyen D V, Rocke D M. Tumor classification by partial least squares using microarray gene expression data[J]. Bioinformatics, 2002, 18(1): 39-50.
[3] Nguyen D V, Rocke D M. Multi-class cancer classification via partial least squares with gene expression profiles[J]. Bioinformatics, 2002, 18(9): 1216-1226.
[4] Nguyen D V, Rocke D M. On partial least squares dimension reduction for microarray-based classification: A simulation study[J]. Computational Statistics & Data Analysis, 2004, 46: 407-425.
[5] Guyon I, Weston J, Barnhill S, et al. Gene selection for cancer classification using support vector machines[J]. Machine Learning, 2002, 46(1/3): 389-422.
[6] (reference in Chinese)[J]. 2006, 36(1): 86-96.
[7] (monograph in Chinese)[M]: 2?6-277.
[8] Massy W F. Principal components regression in exploratory statistical research[J]. Journal of the American Statistical Association, 1965, 60: 234-246.
[9] Wold S, Ruhe A, Wold H, et al. The collinearity problem in linear regression: The partial least squares (PLS) approach to generalized inverses[J]. SIAM Journal on Scientific and Statistical Computing, 1984, 5(3): 735-743.
[10] Lorber A, Wangen L E, Kowalski B R. A theoretical foundation for the PLS algorithm[J]. Journal of Chemometrics, 1987, 1(1): 19-31.