Bayesian Discriminant Feature Selection

Tanaka Yusuke^{1,a)}   Ueda Naonori^{2}   Tanaka Toshiyuki^{1}

Abstract: Focusing on categorical data, we propose a Bayesian feature selection method in which a set of class-specific features is selected for each class in order to improve the generalization ability of classification. In the proposed model, we introduce a latent variable for each pair of a feature and a class that decides whether the feature is specific to the class or common to all classes. The latent variables are estimated from the given training data within the framework of Bayesian inference. Unlike conventional feature selection methods, this enables us to obtain class-dependent feature subsets that are effective for classification. Through experiments with synthetic and real DNA data sets, we demonstrate that the proposed method can be superior to conventional methods in terms of generalization ability. We also show that the proposed method can attain higher classification accuracy than the Lasso and the Support Vector Machine, which realize feature selection only indirectly.

Keywords: Bayesian models, feature selection, classification

^1 Presently with Kyoto University
^2 Presently with NTT Communication Science Laboratories
^a) ytanaka@sys.i.kyoto-u.ac.jp
1. Introduction

Feature selection aims at finding a subset of features that is effective for classification; it can improve the generalization ability of a classifier and also makes the learned result easier to interpret. Guyon and Elisseeff [1] divide feature selection methods into three categories. Wrappers evaluate candidate feature subsets by the performance of a given learning machine and search over subsets. Filters select features as a preprocessing step, independently of the classifier, using criteria such as feature-class correlation [2] or mutual information [3]. Embedded methods [1] perform feature selection as part of the training of a specific learning machine; SVM-RFE [4] is a representative example.

These conventional methods select a single feature subset that is common to all classes. For categorical data such as DNA sequences, however, the features that characterize one class are not necessarily those that characterize another, so a common subset cannot capture this class-dependent structure. Class- or cluster-dependent feature relevance has been studied in unsupervised settings such as bi-clustering [5] and subset clustering [6], [7]. In particular, the subset clustering model [6] introduces binary latent variables that indicate which features are relevant to each cluster; however, [6] handles only binary (two-valued) sequences and addresses clustering rather than classification.

In this paper we propose a Bayesian feature selection method for the classification of categorical data in which a class-specific feature subset is selected for each class. A binary latent variable is introduced for every pair of a class and a feature, and these latent variables are inferred from the training data by Bayesian inference. We compare the proposed method with conventional feature selection baselines, and also with methods that realize feature selection only indirectly, such as the Lasso [8] and support vector machines.

2. Proposed Method

2.1 Problem Setting

[Fig. 1  Examples of categorical data (DNA data). Five DNA samples are shown; the colored portions indicate the features that are relevant to the class.]

Figure 1 shows an example of the categorical data treated in this paper: DNA sequences, in which each position takes one of four values (A, G, C, T). Let $x^{(k)}_j$ denote the $j$-th feature of a sample belonging to class $k$; it takes one of the $L_j$ categorical values $1, \ldots, L_j$. For every class $k$ and feature $j$ we introduce a binary latent variable $r_{k,j} \in \{1, 0\}$ indicating whether feature $j$ is relevant to class $k$. If $r_{k,j} = 1$ (relevant), $x^{(k)}_j$ follows a distribution specific to class $k$; if $r_{k,j} = 0$ (irrelevant), $x^{(k)}_j$ follows a distribution common to all classes.

2.2 Generative Model

Let $X^{(k)} = \{x^{(k)}_{i,j}\}$ $(i = 1, \ldots, N_k,\; j = 1, \ldots, M)$ denote the training data of class $k$, where $N_k$ $(k = 1, \ldots, K)$ is the number of training samples of class $k$ and $M$ is the number of features. Each $x^{(k)}_{i,j}$ takes one of the $L$ categorical values $\{1, \ldots, L\}$ and is modeled by one of two multinomial distributions, with parameters $\boldsymbol{\phi} = \{\phi_1, \ldots, \phi_L\}$ and $\boldsymbol{\theta}_{k,j} = \{\theta_{k,j,1}, \ldots, \theta_{k,j,L}\}$, respectively. Here $\phi_l$ is the probability that categorical value $l$ is generated from the common distribution, $\theta_{k,j,l}$ is the probability that value $l$ is generated for feature $j$ of class $k$, and $\sum_{l=1}^{L} \phi_l = 1$, $\sum_{l=1}^{L} \theta_{k,j,l} = 1$. Strictly speaking, the number of categories may depend on the feature, so the common distribution should be defined per feature as $\boldsymbol{\phi}_j = \{\phi_{j,1}, \ldots, \phi_{j,L_j}\}$; for notational simplicity we assume below that every feature has $L$ categories and share a single $\boldsymbol{\phi}$ among all features, but the extension is straightforward.

Let $\mathbf{r}_k \in \{0,1\}^M$ collect the relevance indicators of class $k$. Each $r_{k,j}$ is generated from a Bernoulli distribution with parameter $\lambda$, and $x^{(k)}_{i,j}$ is generated from the multinomial distribution selected by $r_{k,j}$:

$$
r_{k,j} \mid \lambda \sim \mathrm{Bernoulli}(r_{k,j}; \lambda), \qquad
x^{(k)}_{i,j} \mid r_{k,j}, \boldsymbol{\phi}, \boldsymbol{\theta}_{k,j} \sim
\begin{cases}
\mathrm{Multinomial}(x^{(k)}_{i,j}; \boldsymbol{\theta}_{k,j}), & r_{k,j} = 1, \\
\mathrm{Multinomial}(x^{(k)}_{i,j}; \boldsymbol{\phi}), & r_{k,j} = 0.
\end{cases}
$$

We place conjugate priors on the parameters:

$$
\lambda \mid a, b \sim \mathrm{Beta}(\lambda; a, b), \qquad
\boldsymbol{\phi} \mid \boldsymbol{\alpha} \sim \mathrm{Dirichlet}(\boldsymbol{\phi}; \boldsymbol{\alpha}), \qquad
\boldsymbol{\theta}_{k,j} \mid \boldsymbol{\beta}_{k,j} \sim \mathrm{Dirichlet}(\boldsymbol{\theta}_{k,j}; \boldsymbol{\beta}_{k,j}),
$$

where $a$ and $b$ are the hyperparameters of the Beta prior on $\lambda$, and $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}_{k,j}$ are the hyperparameters of the Dirichlet priors on $\boldsymbol{\phi}$ and $\boldsymbol{\theta}_{k,j}$, respectively. Writing $\mathbf{X} = \{X^{(1)}, \ldots, X^{(K)}\}$, the likelihood is

$$
p(\mathbf{X} \mid \mathbf{R}, \boldsymbol{\phi}, \boldsymbol{\Theta})
= \prod_{k=1}^{K} \prod_{i=1}^{N_k} \prod_{j=1}^{M} \prod_{l=1}^{L}
\bigl(\theta_{k,j,l}^{\,r_{k,j}} \, \phi_l^{\,1-r_{k,j}}\bigr)^{I(x^{(k)}_{i,j}=l)}
= \prod_{k=1}^{K} \prod_{j=1}^{M} \prod_{l=1}^{L}
\bigl(\theta_{k,j,l}^{\,r_{k,j}} \, \phi_l^{\,1-r_{k,j}}\bigr)^{n^{(k)}_{j,l}},
\tag{1}
$$

where $I(f)$ equals 1 if $f$ is true and 0 otherwise, and $n^{(k)}_{j,l} = \sum_{i=1}^{N_k} I(x^{(k)}_{i,j} = l)$. We write $\mathbf{R} = \{r_{k,j}\}$ and $\boldsymbol{\Theta} = \{\boldsymbol{\theta}_{k,j}\}$ $(k = 1, \ldots, K,\; j = 1, \ldots, M)$.
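To make the generative process concrete, the following Python sketch samples one data set from the model above (this is also how the synthetic data of Sec. 3.2.1 can be produced). It is a minimal sketch, not the authors' code: the function name `sample_synthetic` and the use of a single symmetric Dirichlet value `beta` for both $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}_{k,j}$ are our own assumptions.

```python
import numpy as np

def sample_synthetic(K, N_k, M, L, beta, a, b, seed=None):
    """Draw one data set from the generative model of Sec. 2.2.

    Returns X, a list of K integer arrays of shape (N_k, M) with values
    in {0, ..., L-1}, and the relevance indicators R of shape (K, M).
    """
    rng = np.random.default_rng(seed)
    lam = rng.beta(a, b)                       # lambda ~ Beta(a, b)
    phi = rng.dirichlet(np.full(L, beta))      # common distribution phi
    R = rng.random((K, M)) < lam               # r_{k,j} ~ Bernoulli(lambda)
    X = []
    for k in range(K):
        Xk = np.empty((N_k, M), dtype=int)
        for j in range(M):
            if R[k, j]:                        # relevant: class-specific theta_{k,j}
                theta = rng.dirichlet(np.full(L, beta))
                Xk[:, j] = rng.choice(L, size=N_k, p=theta)
            else:                              # irrelevant: common phi
                Xk[:, j] = rng.choice(L, size=N_k, p=phi)
        X.append(Xk)
    return X, R
```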
2.3 Posterior Inference by Gibbs Sampling

By Bayes' theorem, the posterior is proportional to the product of the likelihood (1) and the priors,

$$
p(\mathbf{R}, \boldsymbol{\phi}, \boldsymbol{\Theta}, \lambda \mid \mathbf{X})
\propto p(\mathbf{X} \mid \mathbf{R}, \boldsymbol{\phi}, \boldsymbol{\Theta})\,
p(\boldsymbol{\phi})\, p(\boldsymbol{\Theta})\, p(\mathbf{R} \mid \lambda)\, p(\lambda).
$$

Thanks to conjugacy, the parameters $\boldsymbol{\phi}$, $\boldsymbol{\Theta}$, and $\lambda$ can be marginalized out analytically. The marginal likelihood is

$$
p(\mathbf{X} \mid \mathbf{R})
= \int p(\mathbf{X} \mid \mathbf{R}, \boldsymbol{\phi}, \boldsymbol{\Theta})\,
p(\boldsymbol{\phi})\, p(\boldsymbol{\Theta})\, d\boldsymbol{\phi}\, d\boldsymbol{\Theta}
= \left[ \prod_{k=1}^{K} \prod_{j=1}^{M}
\frac{1}{B(\boldsymbol{\beta}_{k,j})}
\frac{\prod_{l} \Gamma\!\bigl(I(r_{k,j}=1)\, n^{(k)}_{j,l} + \beta_{k,j,l}\bigr)}
{\Gamma\!\bigl(\sum_{l} (I(r_{k,j}=1)\, n^{(k)}_{j,l} + \beta_{k,j,l})\bigr)} \right]
\times
\frac{1}{B(\boldsymbol{\alpha})}
\frac{\prod_{l} \Gamma\!\bigl(\sum_{k}\sum_{j} I(r_{k,j}=0)\, n^{(k)}_{j,l} + \alpha_l\bigr)}
{\Gamma\!\bigl(\sum_{l} (\sum_{k}\sum_{j} I(r_{k,j}=0)\, n^{(k)}_{j,l} + \alpha_l)\bigr)},
\tag{2}
$$

where $B(\cdot)$ denotes the normalizing constant of the Dirichlet distribution, e.g. $B(\boldsymbol{\beta}_{k,j}) = \prod_{l=1}^{L} \Gamma(\beta_{k,j,l}) / \Gamma\bigl(\sum_{l=1}^{L} \beta_{k,j,l}\bigr)$ and similarly for $B(\boldsymbol{\alpha})$, and $\Gamma(\cdot)$ is the gamma function. Likewise, marginalizing $\lambda$ out of the prior on $\mathbf{R}$ gives

$$
p(\mathbf{R})
= \int \prod_{k=1}^{K} \prod_{j=1}^{M} p(r_{k,j} \mid \lambda)\, p(\lambda; a, b)\, d\lambda
= \frac{1}{B(a,b)} \cdot
\frac{\Gamma\!\bigl(\sum_{k}\sum_{j} r_{k,j} + a\bigr)\,
\Gamma\!\bigl(\sum_{k}\sum_{j} (1 - r_{k,j}) + b\bigr)}{\Gamma(KM + a + b)},
\tag{3}
$$

where $B(a,b)$ is the beta function. The product of (2) and (3) is proportional to the posterior $p(\mathbf{R} \mid \mathbf{X})$. Since optimizing it exactly over all $2^{KM}$ configurations of $\mathbf{R}$ is intractable, we use Gibbs sampling: each $r_{k,j}$ is resampled in turn from its full conditional $p(r_{k,j} \mid \mathbf{r}_{\backslash(k,j)}, \mathbf{X})$, where $\mathbf{r}_{\backslash(k,j)}$ denotes all indicators in $\mathbf{R}$ other than $r_{k,j}$. Let $\gamma$ be the ratio of the conditional probabilities of $r_{k,j} = 1$ and $r_{k,j} = 0$; from (2) and (3),

$$
\begin{aligned}
\gamma &= \frac{p(r_{k,j}=1 \mid \mathbf{r}_{\backslash(k,j)}, \mathbf{X})}
{p(r_{k,j}=0 \mid \mathbf{r}_{\backslash(k,j)}, \mathbf{X})} \\
&= \left( \prod_{l=1}^{L} \frac{\Gamma(n^{(k)}_{j,l} + \beta_{k,j,l})}{\Gamma(\beta_{k,j,l})} \right)
\frac{\Gamma\!\bigl(\sum_{l=1}^{L} \beta_{k,j,l}\bigr)}
{\Gamma\!\bigl(\sum_{l=1}^{L} (n^{(k)}_{j,l} + \beta_{k,j,l})\bigr)} \\
&\quad\times
\left( \prod_{l=1}^{L}
\frac{\Gamma\!\bigl(\sum_{(s,t)\neq(k,j)} I(r_{s,t}=0)\, n^{(s)}_{t,l} + \alpha_l\bigr)}
{\Gamma\!\bigl(n^{(k)}_{j,l} + \sum_{(s,t)\neq(k,j)} I(r_{s,t}=0)\, n^{(s)}_{t,l} + \alpha_l\bigr)} \right)
\frac{\Gamma\!\bigl(\sum_{l=1}^{L} (n^{(k)}_{j,l} + \sum_{(s,t)\neq(k,j)} I(r_{s,t}=0)\, n^{(s)}_{t,l} + \alpha_l)\bigr)}
{\Gamma\!\bigl(\sum_{l=1}^{L} (\sum_{(s,t)\neq(k,j)} I(r_{s,t}=0)\, n^{(s)}_{t,l} + \alpha_l)\bigr)} \\
&\quad\times
\frac{\sum_{(s,t)\neq(k,j)} I(r_{s,t}=1) + a}{\sum_{(s,t)\neq(k,j)} I(r_{s,t}=0) + b}.
\end{aligned}
\tag{4}
$$

Since $p(r_{k,j}=1 \mid \mathbf{r}_{\backslash(k,j)}, \mathbf{X}) + p(r_{k,j}=0 \mid \mathbf{r}_{\backslash(k,j)}, \mathbf{X}) = 1$, (4) yields

$$
p(r_{k,j}=1 \mid \mathbf{r}_{\backslash(k,j)}, \mathbf{X}) = \frac{\gamma}{1+\gamma},
\tag{5}
$$

$$
p(r_{k,j}=0 \mid \mathbf{r}_{\backslash(k,j)}, \mathbf{X}) = \frac{1}{1+\gamma}.
\tag{6}
$$
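A minimal sketch of one collapsed Gibbs sweep implementing Eqs. (4)-(6) is shown below, computing $\log\gamma$ with `gammaln` for numerical stability. The function name, the array layout (counts `n[k, j, l]` precomputed from the training data), and the use of a `numpy` random generator `rng` are our own assumptions.

```python
import numpy as np
from scipy.special import gammaln, expit

def gibbs_sweep(R, n, alpha, beta, a, b, rng):
    """One collapsed Gibbs sweep over all r_{k,j} (Eqs. (4)-(6)).

    R     : current indicators, boolean array (K, M), updated in place
    n     : counts n[k, j, l] = #{i : x^{(k)}_{i,j} = l}, shape (K, M, L)
    alpha : Dirichlet parameters of the common phi, shape (L,)
    beta  : Dirichlet parameters of theta_{k,j}, shape (K, M, L)
    a, b  : Beta prior parameters of lambda
    """
    K, M, _ = n.shape
    # category counts currently assigned to the common distribution phi
    m = (n * (~R)[:, :, None]).sum(axis=(0, 1)).astype(float)
    for k in range(K):
        for j in range(M):
            if not R[k, j]:
                m -= n[k, j]                  # exclude (k, j) itself
            nkj, bkj = n[k, j], beta[k, j]
            # class-specific (r_{k,j}=1) factor of Eq. (4)
            log_g = (gammaln(nkj + bkj) - gammaln(bkj)).sum() \
                    + gammaln(bkj.sum()) - gammaln((nkj + bkj).sum())
            # common (r_{k,j}=0) factor of Eq. (4)
            log_g += (gammaln(m + alpha) - gammaln(nkj + m + alpha)).sum() \
                     + gammaln((nkj + m + alpha).sum()) - gammaln((m + alpha).sum())
            # Beta-Bernoulli factor: counts of the remaining indicators
            n1 = R.sum() - R[k, j]
            log_g += np.log(n1 + a) - np.log(K * M - 1 - n1 + b)
            # resample r_{k,j} with probability gamma / (1 + gamma), Eq. (5)
            R[k, j] = rng.random() < expit(log_g)
            if not R[k, j]:
                m += n[k, j]                  # counts return to phi
    return R
```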
2.4 Classification of New Samples

Given a new sample $\mathbf{x} = (x_1, \ldots, x_M)$ with unknown class label, we assign it to the class that maximizes the posterior class probability given the training data $\mathbf{X}$:

$$
k^{*} = \arg\max_{k}\; p(C_k \mid \mathbf{x}, \mathbf{X})
= \arg\max_{k}\; p(\mathbf{x} \mid C_k, \mathbf{X})\, p(C_k),
\tag{7}
$$

where $p(C_k)$ is the prior class probability. The class-conditional likelihood $p(\mathbf{x} \mid C_k, \mathbf{X})$ is computed as

$$
p(\mathbf{x} \mid C_k, \mathbf{X})
= \prod_{j=1}^{M} \sum_{r_{k,j}} p(x_j \mid r_{k,j})\, p(r_{k,j} \mid \mathbf{X})
\tag{8}
$$

$$
\approx \prod_{j=1}^{M} \frac{1}{T - t_0} \sum_{t=t_0+1}^{T} p\bigl(x_j \mid r^{(t)}_{k,j}\bigr),
\tag{9}
$$

where the posterior $p(r_{k,j} \mid \mathbf{X})$ in (8) is approximated by the Gibbs samples: $r^{(t)}_{k,j}$ denotes the value of $r_{k,j}$ in the $t$-th Gibbs sample, $T$ is the total number of samples, and the first $t_0$ samples $(t = 1, \ldots, t_0)$ are discarded as burn-in. The predictive distributions appearing in (9) are obtained by marginalizing over the parameters:

$$
p(x_j \mid r_{k,j}=1)
= \int p(x_j \mid \boldsymbol{\theta}_{k,j})\, p(\boldsymbol{\theta}_{k,j} \mid \mathbf{X})\, d\boldsymbol{\theta}_{k,j}
= \frac{\prod_{l=1}^{L} \bigl(n^{(k)}_{j,l} + \beta_{k,j,l}\bigr)^{I(x_j=l)}}
{\sum_{l=1}^{L} \bigl(n^{(k)}_{j,l} + \beta_{k,j,l}\bigr)},
\tag{10}
$$

$$
p(x_j \mid r_{k,j}=0)
= \int p(x_j \mid \boldsymbol{\phi})\, p(\boldsymbol{\phi} \mid \mathbf{X})\, d\boldsymbol{\phi}
= \frac{\prod_{l=1}^{L} \bigl(n^{(\cdot)}_{j,l} + \alpha_l\bigr)^{I(x_j=l)}}
{\sum_{l=1}^{L} \bigl(n^{(\cdot)}_{j,l} + \alpha_l\bigr)},
\tag{11}
$$

where $n^{(\cdot)}_{j,l} = \sum_{k=1}^{K} n^{(k)}_{j,l}$. That is, (10) uses the counts of category $l$ in feature $j$ of class $k$, whereas (11) uses the counts of category $l$ in feature $j$ pooled over all classes.
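The prediction rule of Eqs. (7)-(11) can be sketched as follows. Again the function name and array layout are our own assumptions; categories are 0-indexed, and `R_samples` stacks the post-burn-in Gibbs samples.

```python
import numpy as np

def predict_class(x, R_samples, n, alpha, beta, log_prior):
    """Classify one new sample x by Eqs. (7)-(11).

    x         : integer sample, shape (M,), categories 0, ..., L-1
    R_samples : post-burn-in Gibbs samples of R, boolean, shape (S, K, M)
    n         : training counts n[k, j, l], shape (K, M, L)
    log_prior : log p(C_k), e.g. log class proportions, shape (K,)
    """
    j_idx = np.arange(x.shape[0])
    # Eq. (10): class-specific predictive p(x_j | r_{k,j}=1), shape (K, M)
    nb = n + beta
    p1 = nb[:, j_idx, x] / nb.sum(axis=2)
    # Eq. (11): common predictive p(x_j | r_{k,j}=0), shape (M,)
    m = n.sum(axis=0) + alpha
    p0 = m[j_idx, x] / m.sum(axis=1)
    # Eq. (9): average over Gibbs samples, then product over features, Eq. (8)
    pj = np.where(R_samples, p1[None, :, :], p0[None, None, :])  # (S, K, M)
    loglik = np.log(pj.mean(axis=0)).sum(axis=1)                 # (K,)
    return int(np.argmax(log_prior + loglik))                    # Eq. (7)
```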
3. Experiments

3.1 Comparison Methods

We refer to the proposed method as BDFS (Bayesian Discriminant Feature Selection). We compare it with a naive Bayes classifier that treats every feature as relevant ($r_{k,j} = 1$ for all $k, j$), with naive Bayes combined with backward feature selection (a wrapper method in the taxonomy of [1]), with $L_1$-regularized logistic regression, and with support vector machines. A sketch of the latter two baselines is given at the end of this section.

3.1.1 Naive Bayes (NB)
NB corresponds to the proposed model with all features forced to be relevant, i.e., without the common distribution:

$$
x^{(k)}_{i,j} \mid \boldsymbol{\theta}_{k,j} \sim \mathrm{Multinomial}(x^{(k)}_{i,j}; \boldsymbol{\theta}_{k,j}), \qquad
\boldsymbol{\theta}_{k,j} \mid \boldsymbol{\beta}_{k,j} \sim \mathrm{Dirichlet}(\boldsymbol{\theta}_{k,j}; \boldsymbol{\beta}_{k,j}).
$$

The parameters $\boldsymbol{\theta}_{k,j}$ are obtained by maximum a posteriori (MAP) estimation, and the hyperparameters are chosen by 5-fold cross-validation.

3.1.2 Backward Selection (BS)
BS combines naive Bayes with a wrapper-type backward search: starting from the full feature set, features are removed one at a time. Note that BS selects a single feature subset common to all classes. The number of retained features is determined by 5-fold cross-validation.

3.1.3 $L_1$-Regularized Logistic Regression
Feature selection can also be realized indirectly by $L_1$ regularization [8]; we refer to this baseline as LR+Lasso. The multiclass logistic regression model is

$$
p(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})},
\tag{12}
$$

where $a_k = \ln\bigl(p(\mathbf{x} \mid C_k)\, p(C_k)\bigr)$ is modeled as a linear function $a_k = \mathbf{w}_k^{\mathrm{T}} \mathbf{x} + w_{k,0}$, with the categorical features encoded as 1-of-$L$ binary vectors $\mathbf{x}$. Given the training labels $\mathbf{T}$, the weights are estimated by maximizing the $L_1$-penalized log-likelihood

$$
\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} \left\{ \ln p(\mathbf{T} \mid \mathbf{W})
- \eta \sum_{k=1}^{K} \sum_{j=1}^{M} |w_{k,j}| \right\},
\tag{13}
$$

where $\eta > 0$ is the regularization parameter, chosen by 5-fold cross-validation.

3.1.4 Support Vector Machine (SVM)
The SVM [10] is a binary classifier; multiclass classification is performed by combining binary SVMs. We use linear and RBF kernels, and the kernel and regularization parameters are chosen by 5-fold cross-validation (cf. [9]).
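For reproducibility, the baselines of Secs. 3.1.3 and 3.1.4 can be approximated with scikit-learn as sketched below. The exact solver and parameter grids used in the original experiments are not recoverable from the paper, so those shown here are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

def make_baselines():
    """LR+Lasso, SVM(L), and SVM(R) with 5-fold CV over their main parameters."""
    grid_C = {"C": np.logspace(-2, 2, 9)}   # C plays the role of 1/eta in Eq. (13)
    return {
        "LR+Lasso": GridSearchCV(
            LogisticRegression(penalty="l1", solver="saga", max_iter=5000),
            grid_C, cv=5),
        "SVM(L)": GridSearchCV(SVC(kernel="linear"), grid_C, cv=5),
        "SVM(R)": GridSearchCV(
            SVC(kernel="rbf"),
            {"C": np.logspace(-2, 2, 9), "gamma": np.logspace(-3, 1, 9)}, cv=5),
    }

# Categorical features are 1-of-L encoded before fitting, e.g.:
#   Xb = OneHotEncoder(sparse_output=False).fit_transform(X_cat)
#   clf = make_baselines()["LR+Lasso"].fit(Xb, y)
```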
3.2 Data Sets

We evaluate the methods on synthetic data and on real DNA data. For every data set, the following procedure is repeated 10 times: the data are split at random into 2/3 for training and 1/3 for testing, and the classification accuracy on the test set is recorded. We report means and standard deviations over the 10 runs.

3.2.1 Synthetic Data
The synthetic data sets are sampled from the proposed generative model, which lets us control the number of classes, features, and categories as well as the proportion of relevant features. Table 1 lists the parameter values used to sample each data set; the Dirichlet hyperparameters are symmetric, with the common value shown.

Table 1  Values of the parameters used to sample the synthetic data

          K    N_k   M     L    β_{k,j,l}   λ     (a, b)
data1     5    50    50    5    10          0.2   (1, 8)
data2     5    50    50    10   10          0.3   (1, 6)
data3     10   50    100   5    10          0.2   (1, 6)
data4     10   100   50    10   10          0.2   (1, 8)
data5     30   50    50    10   10          0.2   (1, 6)

3.2.2 DNA Data
We use two DNA data sets, Promoter and Splice. A DNA sequence is a string over the four nucleotides {A, G, C, T}, so each feature (sequence position) takes one of four categorical values. Both data sets were compiled for the evaluation of KBANN [11].

The Promoter data set consists of sequences of 57 nucleotides and has two classes, promoter and non-promoter. A promoter is a region of DNA that initiates the transcription of DNA into RNA. Each class contains 53 samples.

The Splice data set consists of sequences of 60 nucleotides and has three classes. When DNA is transcribed into RNA, superfluous segments of the precursor RNA are removed; this process is called splicing. The task is to recognize the boundaries at which splicing occurs: class 1 is an exon-intron boundary, class 2 is an intron-exon boundary, and class 3 is neither, with 767, 768, and 1655 samples, respectively.

4. Experimental Results

4.1 Synthetic Data
The synthetic data sets are generated with the parameters in Table 1 and evaluated over 10 random splits as described in Sec. 3.2.

4.1.1 Comparison with the Feature Selection Baselines
Table 2 shows the test classification accuracies of BDFS, NB, and BS on the synthetic data. BDFS outperforms NB and BS on all five data sets; on data2 and data5, the margin over NB exceeds 15 percentage points. On data2, NB attains higher training accuracy (99.3%) than BDFS (96.4%) but much lower test accuracy, which indicates that fitting class-specific distributions to irrelevant features causes overfitting. On data5, which has 30 classes, the gap is largest: BDFS pools the irrelevant features of all classes to estimate the common distribution, whereas NB must estimate every class-feature distribution separately from few samples.

Table 2  Test accuracy of each feature selection method (synthetic data, %)

          BDFS          NB            BS
data1     96.6 ± 1.8    88.2 ± 4.1    92.1 ± 1.8
data2     96.0 ± 1.1    80.8 ± 4.1    80.7 ± 3.0
data3     92.3 ± 0.3    90.8 ± 1.0    91.0 ± 1.0
data4     93.6 ± 0.6    86.0 ± 1.2    85.2 ± 0.9
data5     87.7 ± 0.1    61.6 ± 1.6    62.2 ± 1.6

4.1.2 Comparison with the Discriminative Classifiers
Table 3 shows the test accuracies of BDFS, $L_1$-regularized logistic regression (LR+Lasso), the linear-kernel SVM (SVM(L)), and the RBF-kernel SVM (SVM(R)). BDFS achieves the highest accuracy on all data sets except data3, where LR+Lasso is slightly better; the advantage of BDFS is particularly large on data2 and data5.

Table 3  Test accuracy of each classifier (synthetic data, %)

          BDFS          LR+Lasso      SVM(L)        SVM(R)
data1     96.6 ± 1.8    90.5 ± 2.8    79.4 ± 1.9    87.4 ± 2.3
data2     96.0 ± 1.1    78.6 ± 4.6    71.0 ± 3.0    79.9 ± 3.6
data3     92.3 ± 0.3    93.1 ± 3.4    82.3 ± 2.2    87.0 ± 1.9
data4     93.6 ± 0.6    91.0 ± 2.9    84.5 ± 3.1    90.2 ± 2.3
data5     87.7 ± 0.1    74.4 ± 2.5    65.1 ± 1.9    69.1 ± 0.9

4.2 DNA Data
We next evaluate the methods on the Promoter and Splice data described in Sec. 3.2.2.

4.2.1 Comparison with the Feature Selection Baselines
Table 4 shows the test accuracies on the DNA data. On Promoter, BDFS outperforms both NB and BS. As with the synthetic data, NB attains higher training accuracy (99.6%) than BDFS (98.4%) but lower test accuracy, again suggesting overfitting. On Splice, BDFS is also the most accurate; here the training accuracies are close (95.8% for BDFS, 96.1% for NB), and the advantage of BDFS in test accuracy is correspondingly smaller than on Promoter.

Table 4  Test accuracy of each feature selection method (DNA data, %)

            BDFS          NB            BS
Promoter    91.6 ± 5.2    88.8 ± 3.5    89.4 ± 4.5
Splice      95.7 ± 1.4    94.7 ± 0.7    95.1 ± 0.7

4.2.2 Comparison with the Discriminative Classifiers
Table 5 compares BDFS with LR+Lasso and the SVMs on the DNA data. BDFS is comparable to LR+Lasso on both data sets and clearly better than both SVMs.

Table 5  Test accuracy of each classifier (DNA data, %)

            BDFS          LR+Lasso      SVM(L)        SVM(R)
Promoter    91.1 ± 5.2    90.8 ± 5.1    79.6 ± 4.3    86.4 ± 5.2
Splice      95.6 ± 1.4    95.8 ± 0.6    88.3 ± 3.5    94.0 ± 5.2

5. Conclusion

Focusing on categorical data, we proposed a Bayesian feature selection method that selects a class-specific feature subset for each class. Binary latent relevance indicators are introduced for every pair of a class and a feature and are inferred from the training data by Gibbs sampling, which yields class-dependent feature subsets that conventional common-subset methods cannot provide. Experiments on synthetic and real DNA data showed that the proposed method generalizes better than naive Bayes with and without backward feature selection, and that it can also outperform $L_1$-regularized logistic regression and support vector machines, which realize feature selection only indirectly.

References

[1] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[2] M. A. Hall, "Correlation-based Feature Selection for Machine Learning," PhD Thesis, Department of Computer Science, Waikato University, New Zealand, 1999.
[3] H. Peng, F. Long and C. Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[4] I. Guyon, J. Weston, S. Barnhill and V. Vapnik, "Gene Selection for Cancer Classification using Support Vector Machines," Machine Learning, vol. 46, no. 1-3, pp. 389-422, 2002.
[5] H. Shan and A. Banerjee, "Bayesian Co-clustering," Proc. IEEE International Conference on Data Mining (ICDM), pp. 530-539, 2008.
[6] P. D. Hoff, "Subset Clustering of Binary Sequences, with an Application to Genomic Abnormality Data," Biometrics, vol. 61, no. 4, pp. 1027-1036, 2005.
[7] K. Ishiguro, N. Ueda and H. Sawada, "Subset Infinite Relational Models," Proc. 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012, to appear.
[8] R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267-288, 1996.
[9] C. W. Hsu, C. C. Chang and C. J. Lin, "A Practical Guide to Support Vector Classification," Technical Report, Department of Computer Science, National Taiwan University, Taipei, 2003.
[10] V. N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[11] G. G. Towell and J. W. Shavlik, "Knowledge-Based Artificial Neural Networks," Artificial Intelligence, vol. 70, no. 1-2, pp. 119-165, 1994.