Vol. 35, No. 7                ACTA AUTOMATICA SINICA                July, 2009

Gaussian Processes Classification Combined with Semi-supervised Kernels

LI Hong-Wei^1    LIU Yang^1    LU Han-Qing^1    FANG Yi-Kai^2

Abstract  In this paper, we present a semi-supervised algorithm for learning Gaussian process classifiers that combines nonparametric semi-supervised kernels with unlabeled data. The algorithm consists of three main steps: 1) the spectral decomposition of the graph Laplacian is used to obtain kernel matrices that incorporate both labeled and unlabeled data; 2) convex optimization is employed to learn the optimal weights of the kernel matrix eigenvectors, which form the nonparametric semi-supervised kernel; 3) the proposed semi-supervised learning algorithm is obtained by incorporating the semi-supervised kernel into the Gaussian process model. The main characteristic of the proposed algorithm is that the nonparametric semi-supervised kernel, built from the entire dataset, is embedded in a Gaussian process model that has an explicit probabilistic interpretation, can model the uncertainty in the data, and can solve complex nonlinear inference problems. The effectiveness of the proposed algorithm is demonstrated by experimental comparison with related methods in the literature.

Key words  Gaussian processes, semi-supervised learning, kernel method, convex optimization

Received June 16, 2008; in revised form November 16, 2008
Supported by National High Technology Research and Development Program of China (863 Program) (2006AA01Z315), Key Project of National Natural Science Foundation of China (60833006), and National Natural Science Foundation of China (60605004)
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China    2. Nokia Research Center, Beijing 100176, China
DOI: 10.3724/SP.J.1004.2009.00888

In many pattern recognition applications, labeled examples are scarce because manual annotation is expensive, while unlabeled examples are abundant and cheap to collect. Semi-supervised learning [1] addresses this situation by exploiting the unlabeled data together with the few available labels, and it has become an active research topic in machine learning. At the same time, Gaussian processes [2-9] have attracted wide attention as a nonparametric Bayesian approach to regression and classification: they carry an explicit probabilistic interpretation, quantify the uncertainty of their predictions, and can handle complex nonlinear inference problems. The standard Gaussian process classifier, however, is a purely supervised model whose kernel is specified a priori and fitted to the labeled data only, so it cannot benefit from the abundant unlabeled data. This paper studies how to bring these two lines together: we learn a nonparametric semi-supervised kernel from both labeled and unlabeled data and incorporate it into the Gaussian process classification model.
Existing work on semi-supervised classification that is relevant here falls roughly into two categories. 1) Methods that exploit the low-density (margin) assumption, as the support vector machine (SVM) does. Lawrence proposed the null-category noise model (NCNM) [1], which augments the Gaussian process likelihood with a "null" category that no data point is allowed to take; the NCNM thereby produces a margin effect similar to that of the transductive support vector machine (TSVM) [1,10], pushing the decision boundary away from the unlabeled data. Rogers and Girolami [4] extended Lawrence's NCNM to the multi-class case by combining it with a multinomial probit likelihood. 2) Methods that encode the unlabeled data in the kernel [11-14]. These methods build a graph over the labeled and unlabeled points and derive data-dependent kernels from it, for example cluster kernels based on spectral clustering [12], diffusion kernels [13], and kernels derived from Gaussian random fields [14]; Zhu et al. [11] further learn a nonparametric kernel through spectral transforms of the graph Laplacian. Sindhwani et al. [15] deform the norm of a reproducing kernel Hilbert space (RKHS) with a graph-based smoothness term, obtaining a modified kernel that reflects the geometry of the whole dataset; this construction has also been applied to Gaussian process classification [16].

In this paper, we follow the second line. We use kernel alignment and convex optimization to learn the weights of the graph Laplacian eigenvectors, obtaining a nonparametric semi-supervised kernel; we then map this kernel into the deformed RKHS of [15] and plug the result into the Gaussian process classification model. The resulting classifier inherits both the geometric information carried by the unlabeled data and the probabilistic interpretation of the Gaussian process framework. The remainder of this paper is organized as follows: Section 1 reviews Gaussian process classification, Section 2 describes the construction of the semi-supervised kernels, Section 3 presents the combined model and analyzes its complexity, Section 4 reports the experiments, and Section 5 concludes.

1 Gaussian process classification

1.1 Gaussian processes

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution [2]. For a finite set of inputs, the latent function values follow a multivariate Gaussian with mean $\mu$ and covariance $\Sigma$,

$$\boldsymbol{f} = (f_1, \cdots, f_n)^{\mathrm T} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \tag{1}$$

where $f_i$ denotes the latent function value at the $i$-th input. A Gaussian process $f(x)$ is therefore fully specified by its mean function $m(x)$ and covariance (kernel) function $K(x, x')$:

$$f(x) \sim GP\,(m(x), K(x, x')) \tag{2}$$

where, for any inputs $x$ and $x'$, $m(x) = \mathrm E[f(x)]$ and $K(x, x') = \mathrm E[(f(x) - m(x))(f(x') - m(x'))]$. The Gaussian process thus places a prior directly over functions $f(x)$: rather than parameterizing a family of functions, it specifies the correlation structure of the function values, which makes it a flexible nonparametric model.

1.2 The classification model

Consider a binary classification problem. Let $X = \{X_l, X_u\}$ denote the inputs, where $X_l = \{x_1, \cdots, x_l\}$ are the labeled points with labels $Y_l = \{y_1, \cdots, y_l\}$, $y_i \in \{-1, +1\}$, and $X_u = \{x_{l+1}, \cdots, x_{l+u}\}$ are the unlabeled points. A zero-mean Gaussian process prior is placed on the latent function: $f(x; \theta) \sim GP(0, K)$, where $K$ is the covariance (kernel) matrix and $\theta$ denotes its hyperparameters; the latent function $f$ connects the inputs $X$ with the labels $Y$. The label is linked to the latent function through a noise model,

$$p(y = +1 \,|\, f(x)) = \Phi(f(x)) \tag{3}$$

where $\Phi$ is a sigmoid-shaped squashing function, typically the logistic function or the cumulative Gaussian (probit). Assuming the labels are conditionally independent given the latent function values, the likelihood factorizes over the observed labels,

$$p(\boldsymbol y \,|\, \boldsymbol f) = \prod_i p(y_i \,|\, f_i) = \prod_i \Phi(y_i f_i) \tag{4}$$

From the prior $p(\boldsymbol f)$ and the likelihood $p(\boldsymbol y \,|\, \boldsymbol f)$, inference proceeds by Bayes' rule.
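To make the model concrete, the following minimal Python sketch draws a latent function from the zero-mean Gaussian process prior of (1) and (2) and samples labels through the probit link of (3) and (4). The RBF covariance and all numerical settings here are illustrative assumptions, not choices made in the paper (the paper's baseline RBF kernel appears later as (27)).

```python
import numpy as np
from scipy.stats import norm

def rbf(X, Z, theta1=1.0, theta2=1.0):
    # An RBF covariance function, the same form later used as the baseline kernel (27)
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return theta1 * np.exp(-d2 / (2 * theta2**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))               # illustrative 1-D inputs
K = rbf(X, X) + 1e-8 * np.eye(len(X))              # covariance matrix plus jitter
f = rng.multivariate_normal(np.zeros(len(X)), K)   # f ~ N(0, K), eqs (1)-(2)
p_pos = norm.cdf(f)                                # probit link, eq (3)
y = np.where(rng.random(len(X)) < p_pos, 1, -1)    # labels drawn from p(y|f), eq (4)
```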
The posterior over the latent function values is

$$p(\boldsymbol f \,|\, \boldsymbol y, X, \theta) = \frac{p(\boldsymbol y \,|\, \boldsymbol f)\, p(\boldsymbol f \,|\, X, \theta)}{p(\boldsymbol y \,|\, X, \theta)} \tag{5}$$

For a test point $x_t$, the distribution of the latent value $f(x_t)$ is obtained by marginalizing over $\boldsymbol f$,

$$p(f_t \,|\, \boldsymbol y, X, \theta, x_t) = \int p(f_t \,|\, \boldsymbol f, X, \theta, x_t)\, p(\boldsymbol f \,|\, \boldsymbol y, X, \theta)\, \mathrm d\boldsymbol f \tag{6}$$

and the predictive distribution of the test label is

$$p(y_t \,|\, \boldsymbol y, X, \theta, x_t) = \int p(y_t \,|\, f_t)\, p(f_t \,|\, \boldsymbol y, X, \theta, x_t)\, \mathrm d f_t \tag{7}$$

Note that because the noise model (3) is sigmoidal rather than Gaussian, the posterior (5) and the integrals (6) and (7) are analytically intractable and must be approximated; the approximation used in this paper is described in Section 3.

1.3 Graphical representation

The Gaussian process classification model can be represented graphically in a discriminative framework, as shown in Fig. 1.

Fig. 1 The graphical representation of Gaussian processes in the discriminative framework

In Fig. 1, for a labeled point $x_i \in X_l$, the latent value $f_i$ connects $x_i$, the kernel matrix $K$, and the label $y_i$: the label depends on the input only through $f_i$ and $K$. For an unlabeled point $x_j \in X_u$, no label is observed, but $x_j$ still enters the kernel matrix $K$ and is thereby coupled to the latent vector $\boldsymbol f$. The model structure itself therefore suggests how unlabeled points can influence the classifier: through the kernel. This observation motivates the semi-supervised kernels of the next section.

2 Semi-supervised kernels

2.1 Warping an RKHS with unlabeled data

Consider first the standard supervised learning problem in a reproducing kernel Hilbert space (RKHS) $\mathcal H$. Given $X = \{X_l, X_u\}$, a supervised learner uses only the labeled part and solves

$$f^{*} = \arg\min_{f \in \mathcal H} \left\{ \frac{1}{l} \sum_{i=1}^{l} C(f, x_i, y_i) + \Omega(\|f\|_{\mathcal H}) \right\} \tag{8}$$

where $C$ is a loss function measuring the fit of $f$ to the labeled data and $\Omega(\|f\|_{\mathcal H})$ is a regularizer penalizing the RKHS norm of $f$, i.e., its complexity. By the representer theorem [17], the minimizer admits the finite expansion

$$f^{*}(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i) \tag{9}$$

so $f^{*}$ is supported on the labeled points $\{x_i\}$ only, and the unlabeled data play no role. Given both the labeled data $\{(x_i, y_i)\}_{i=1}^{l}$ and the unlabeled data $\{x_i\}_{i=l+1}^{l+u}$, one keeps the same form of the problem but replaces $\mathcal H$ by a deformed space $\tilde{\mathcal H}$ that depends on all the data:

$$f^{*} = \arg\min_{f \in \tilde{\mathcal H}} \left\{ \frac{1}{l} \sum_{i=1}^{l} C(f, x_i, y_i) + \Omega(\|f\|_{\tilde{\mathcal H}}) \right\} \tag{10}$$

Sindhwani et al. [15] showed how to construct such a space. Let $V$ be a space encoding the geometry of the point cloud and $S$ an operator that restricts a function to the data points; define $\tilde{\mathcal H}$ to contain the same functions as $\mathcal H$ but with the modified inner product

$$\langle f, g \rangle_{\tilde{\mathcal H}} = \langle f, g \rangle_{\mathcal H} + \langle Sf, Sg \rangle_{V}$$

Then $\tilde{\mathcal H}$ is again an RKHS, and its reproducing kernel has a closed form in terms of the original kernel [15]:

$$\tilde K(x_i, x_j) = K(x_i, x_j) - K_{x_i}^{\mathrm T} (I + MK)^{-1} M K_{x_j} \tag{11}$$

where $K_x = [K(x_1, x), \cdots, K(x_{l+u}, x)]^{\mathrm T}$, $K$ is the Gram matrix of the original kernel on all $l+u$ points, and $M$ is a positive semidefinite matrix specifying the deformation (for example, a graph regularizer). The value of the new kernel between any two points thus depends on the entire dataset, so minimizing (10) in $\tilde{\mathcal H}$ is a semi-supervised procedure even though the loss only touches the labeled points.
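A small sketch of the deformed kernel (11) in matrix form, assuming the base Gram matrix and the deformation matrix $M$ are supplied; the function name and interface are illustrative, not the paper's implementation.

```python
import numpy as np

def deformed_kernel(K, M):
    """Deformed RKHS Gram matrix of eq (11) on the l+u data points:
    K_tilde = K - K (I + M K)^{-1} M K, obtained by stacking the vectors
    K_x over all data points. K: base Gram matrix (symmetric PSD);
    M: PSD deformation matrix, e.g., the graph Laplacian as in eq (15)."""
    n = K.shape[0]
    A = np.linalg.solve(np.eye(n) + M @ K, M @ K)   # (I + MK)^{-1} M K
    return K - K @ A
```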
The deformation (11), however, still starts from a base kernel $K$ that is fixed a priori, and the choice of this base kernel can strongly affect the final performance. In this paper, we instead learn a nonparametric base kernel from the data and then map it into the deformed RKHS.

2.2 Spectral kernels from the graph Laplacian

The kernels we build start from a graph over the whole dataset. The data $X = \{X_l, X_u\}$ induce a graph $G = \{V, E\}$ whose vertices $V$ are the data points and whose edges $E$ carry weights $W = \{w_{ij}\}$: $w_{ij}$ measures the similarity between $x_i$ and $x_j$, and $w_{ij} = 0$ if there is no edge between them. Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$. The graph Laplacian is $L = D - W$, with spectral decomposition

$$L = \sum_{i=1}^{l+u} \lambda_i \phi_i \phi_i^{\mathrm T} \tag{12}$$

where $\lambda_i$ are the eigenvalues of $L$, sorted as $\lambda_1 \le \cdots \le \lambda_{l+u}$, and $\phi_i$ are the corresponding eigenvectors. A graph kernel is obtained by transforming the spectrum with a nonnegative function $r$, $\tilde\lambda_i = r(\lambda_i)$:

$$K = \sum_{i=1}^{l+u} r(\lambda_i)\, \phi_i \phi_i^{\mathrm T}, \qquad r(\lambda_i) \ge 0 \tag{13}$$

The RKHS norm induced by $K$ is $\|f\|_K^2 = \boldsymbol f^{\mathrm T} K^{-1} \boldsymbol f$, which penalizes functions that vary strongly across edges of the graph. Since small eigenvalues of $L$ correspond to eigenvectors that are smooth over the graph, $r(\lambda)$ should be decreasing in $\lambda$, so that the smooth eigenvectors receive large weight in $K$. Different parametric choices of $r$ recover existing graph kernels, such as the diffusion kernel [13] and the Gaussian field kernel [14], but fixing a parametric form of $r$ is restrictive. Following [11], we instead learn the spectral weights nonparametrically:

$$K^{*} = \sum_{i=1}^{l+u} \alpha_i \phi_i \phi_i^{\mathrm T}, \qquad \alpha_i \ge 0 \tag{14}$$

with order constraints $\alpha_i \ge \alpha_{i+1}$, which encode the preference for smooth eigenvectors while leaving the individual weights free. The learned kernel $K^{*}$ is then mapped through the RKHS deformation of Section 2.1, with the deformation matrix $M$ taken to be the graph Laplacian $L$:

$$\tilde K(x_i, x_j) = K^{*}(x_i, x_j) - K_{x_i}^{\mathrm T} (I + L K^{*})^{-1} L K_{x_j} \tag{15}$$

2.3 Kernel alignment

To learn the weights $\alpha_i$, we need a measure of how well a kernel fits the labeled data. We adopt the kernel-target alignment [18-19], defined between two matrices as

$$A(K, T) = \frac{\langle K, T \rangle_{\mathrm F}}{\sqrt{\langle K, K \rangle_{\mathrm F}\, \langle T, T \rangle_{\mathrm F}}} \tag{16}$$

where $\langle \cdot, \cdot \rangle_{\mathrm F}$ is the Frobenius inner product, $K$ is the kernel matrix restricted to the labeled points, and $T$ is the target matrix with $T_{ij} = y_i y_j$; that is, $T_{ij} = 1$ if $y_i = y_j$ and $T_{ij} = -1$ otherwise. A large alignment means the kernel regards two points as similar exactly when they share a label. Alignment is attractive here for several reasons: it is cheap to compute from the labeled block of the kernel matrix, it is sharply concentrated around its expectation and related to generalization performance [18], and maximizing it over the convex family of kernels (14) leads to a tractable optimization problem.

2.4 Learning the weights by convex optimization

Combining the alignment (16) with the nonparametric spectral kernel (14) and the order constraints yields the following optimization problem:

$$\max_{K}\; A(K_{\mathrm{tr}}, T) \tag{17}$$
$$\text{s.t.}\quad K = \sum_i \alpha_i K_i \tag{18}$$
$$\operatorname{tr}(K) = 1 \tag{19}$$
$$\alpha_i \ge 0 \tag{20}$$
$$\alpha_i \ge \alpha_{i+1}, \quad i = 1, \cdots, l+u-1 \tag{21}$$

where $K_i = \phi_i \phi_i^{\mathrm T}$ is the outer product of the $i$-th eigenvector of the Laplacian and $K_{\mathrm{tr}}$ denotes the block of $K$ on the labeled points. The trace constraint (19) and the nonnegativity constraints (20) normalize $K$ and keep it positive semidefinite. In (17), $\langle T, T \rangle_{\mathrm F}$ is a constant because $T_{ij} = \pm 1$, and $K$ enters the alignment through $\langle K_{\mathrm{tr}}, T \rangle_{\mathrm F}$, which by (18) is linear in the weights $\alpha_i$; with the scale of $K$ fixed by (19), maximizing the alignment therefore reduces, up to normalization, to maximizing this linear term, and the problem is convex and can be solved efficiently. The optimal $K^{*}$ is then substituted into the RKHS deformation (15) to give the final semi-supervised kernel $\tilde K$.
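The following sketch assembles Sections 2.2 to 2.4: it builds the graph Laplacian, extracts its smoothest eigenvectors, and solves (17)-(21) for the spectral weights with cvxpy. It follows the reduction described above, treating the trace constraint as fixing the scale of $K$ and maximizing the linear term $\langle K_{\mathrm{tr}}, T \rangle_{\mathrm F}$ as a linear program; the variable names, the truncation to m eigenvectors, and the use of cvxpy are our own choices, not the paper's implementation.

```python
import numpy as np
import cvxpy as cp

def semi_supervised_kernel(W, y_l, labeled_idx, m=200):
    """Learn K* = sum_i alpha_i phi_i phi_i^T, eqs (12)-(14) and (17)-(21).
    W: symmetric graph weight matrix on all l+u points; y_l: +/-1 labels;
    labeled_idx: indices of the labeled points; m: eigenvectors kept."""
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian, eq (12)
    lam, phi = np.linalg.eigh(L)                    # ascending: smoothest first
    phi = phi[:, :m]
    T = np.outer(y_l, y_l)                          # target matrix, T_ij = y_i y_j
    P = phi[labeled_idx]                            # eigenvectors on labeled points
    # <K_tr, T>_F = sum_i alpha_i * (p_i^T T p_i): linear in the weights alpha
    c = np.array([P[:, i] @ T @ P[:, i] for i in range(m)])
    alpha = cp.Variable(m)
    constraints = [alpha >= 0,                      # eq (20)
                   alpha[:-1] >= alpha[1:],         # order constraint, eq (21)
                   cp.sum(alpha) == 1]              # tr(K)=1, since ||phi_i||=1, eq (19)
    cp.Problem(cp.Maximize(c @ alpha), constraints).solve()
    return phi @ np.diag(alpha.value) @ phi.T       # K*, eq (14)
```

The final kernel of (15) is then obtained as `deformed_kernel(K_star, L)` with the sketch given after (11).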
3 Gaussian process classification with semi-supervised kernels

3.1 Model and inference

We now combine the semi-supervised kernel of Section 2 with the Gaussian process classifier of Section 1; the graphical structure of the combined model is shown in Fig. 2. Given the inputs $X$, the labels $Y$, and the latent values $\boldsymbol f$, the only change with respect to the model of Section 1.2 is that the covariance of the Gaussian process prior is the semi-supervised RKHS kernel $\tilde K$ of (15), i.e., $\boldsymbol f \sim GP(0, \tilde K)$, so that the prior itself carries the information of the unlabeled data.

Fig. 2 The graphical structure of Gaussian processes combined with semi-supervised kernels

Following (4) and (5), the posterior over the latent values becomes

$$p(\boldsymbol f \,|\, \boldsymbol y, X, \tilde K) = \frac{N(\boldsymbol f \,|\, \boldsymbol 0, \tilde K) \prod_i \Phi(y_i f_i)}{p(\boldsymbol y \,|\, X, \tilde K)} \tag{22}$$

and, as in (7), the predictive distribution at a test point $x_t$ is

$$p(y_t \,|\, \boldsymbol y, X, \tilde K, x_t) = \int p(y_t \,|\, f_t)\, p(f_t \,|\, \boldsymbol y, X, \tilde K, x_t)\, \mathrm d f_t \tag{23}$$

Because the likelihood terms $\Phi(y_i f_i)$ in (4) are sigmoidal rather than Gaussian, the posterior (22) and the predictive distribution (23) have no closed form. We therefore approximate the posterior by a Gaussian,

$$p(\boldsymbol f \,|\, \boldsymbol y, X, \tilde K) \approx q(\boldsymbol f \,|\, \boldsymbol y, X, \tilde K) = N(\boldsymbol f \,|\, \boldsymbol m, \boldsymbol\Sigma)$$

whose mean $\boldsymbol m$ and covariance $\boldsymbol\Sigma$ are computed with expectation propagation (EP) [20]. EP iteratively refines local Gaussian approximations of the likelihood factors so as to approximately minimize the Kullback-Leibler divergence [21] between the true posterior and $q$, and in practice it gives accurate posterior approximations for Gaussian process classification. Substituting $q$ into (6), the latent value at a test point $x_t$ is again Gaussian, $q(f_t \,|\, \boldsymbol y, X, \tilde K, x_t) = N(f_t \,|\, \mu_t, \sigma_t^2)$, with

$$\mu_t = \boldsymbol k_t^{\mathrm T} \tilde K^{-1} \boldsymbol m \tag{24}$$

$$\sigma_t^2 = \tilde k(x_t, x_t) - \boldsymbol k_t^{\mathrm T} (\tilde K^{-1} - \tilde K^{-1} \boldsymbol\Sigma \tilde K^{-1}) \boldsymbol k_t \tag{25}$$

where $\boldsymbol k_t$ is the vector of kernel values between $x_t$ and the training points $X$. For the probit noise model, the predictive integral (23) can then be evaluated analytically:

$$q(y_t = +1 \,|\, \boldsymbol y, X, \tilde K, x_t) = \int \Phi(f_t)\, N(f_t \,|\, \mu_t, \sigma_t^2)\, \mathrm d f_t = \Phi\!\left( \frac{\mu_t}{\sqrt{1 + \sigma_t^2}} \right) \tag{26}$$

The complete algorithm can be summarized as follows (a code sketch of the prediction step is given after this list):
1) Build a graph over the labeled and unlabeled data and compute its Laplacian $L$;
2) Compute the spectral decomposition of $L$ to obtain the eigenvectors $\phi_i$, and learn the spectral weights $\alpha_i$ by solving the convex problem (17)-(21), which gives the nonparametric kernel $K^{*}$;
3) Deform $K^{*}$ through (15) to obtain the semi-supervised RKHS kernel $\tilde K$;
4) Use $\tilde K$ as the covariance of the Gaussian process classifier, approximate the posterior with EP, and predict the test labels with (26) (cf. (7)).
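A minimal sketch of the prediction equations (24)-(26), assuming the EP posterior moments $\boldsymbol m$ and $\boldsymbol\Sigma$ have already been computed (the EP loop itself [20] is not shown); all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def predict_proba(K_tilde, m_post, Sigma, k_t, k_tt):
    """Predictive probability q(y_t = +1 | ...) of eqs (24)-(26).
    K_tilde: semi-supervised kernel matrix on the training points;
    m_post, Sigma: mean and covariance of the EP Gaussian posterior q(f);
    k_t: kernel values between the test point and the training points;
    k_tt: kernel value of the test point with itself."""
    K_inv = np.linalg.inv(K_tilde)
    mu_t = k_t @ K_inv @ m_post                                    # eq (24)
    var_t = k_tt - k_t @ (K_inv - K_inv @ Sigma @ K_inv) @ k_t     # eq (25)
    return norm.cdf(mu_t / np.sqrt(1.0 + var_t))                   # eq (26)
```

In practice, one would use a Cholesky factorization instead of the explicit inverse, which is also what the complexity analysis of Section 3.2 assumes.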
3.2 Complexity analysis

The computational cost of the proposed algorithm has two main parts. 1) Kernel construction: building the semi-supervised kernel matrix and evaluating the kernel vector of a test point $x_t$ through (15) takes $\mathrm O(N^2)$ operations, where $N = l + u$ is the total number of data points. 2) Inference: the Gaussian process computations operate on $N \times N$ matrices and cost $\mathrm O(N^3)$; in particular, each EP sweep (for fixed hyperparameters $\theta$) requires a Cholesky factorization of the kernel matrix $\tilde K$, taking about $N^3/6$ multiplications. Learning the spectral weights $\alpha_i$ under the constraints $\alpha_i \ge \alpha_{i+1} \ge 0$ is a convex problem whose cost lies between $\mathrm O(N^2)$ and $\mathrm O(N^3)$ depending on the solver, and the spectral decomposition of the Laplacian is $\mathrm O(N^3)$ in general, although only the smoothest eigenvectors are needed. The dominant matrix computations can be accelerated, for example with domain decomposition techniques [22], to roughly $\mathrm O(N \log N)$. Finally, note that although the kernel is learned from the given point cloud, the deformed kernel (15) is defined for arbitrary inputs, so the classifier extends to unseen test points (induction) without rebuilding the graph, unlike purely transductive graph-based methods.

4 Experiments

4.1 Datasets

Two groups of experiments are conducted. The first group uses four binary classification tasks taken from [11], summarized in Table 1. One vs Two and Odd vs Even are handwritten digit recognition tasks on the Cedar Buffalo binary digit database: One vs Two discriminates the digit "1" from the digit "2" and contains 2 200 examples; Odd vs Even discriminates the even digits 0, 2, 4, 6, 8 from the odd digits 1, 3, 5, 7, 9 and contains 4 000 examples. Pc vs Mac and Baseball vs Hockey are binary text classification tasks from the 20-newsgroups corpus, with 1 943 and 1 993 documents, respectively. Following [11], for One vs Two and Odd vs Even the graph is a Euclidean 10-nearest-neighbor (10NN) graph, and for Pc vs Mac and Baseball vs Hockey it is a cosine-similarity 10NN graph (see the sketch after Table 2 for the graph construction).

Table 1 The datasets of Experiment 1
Dataset              Size    Classes  Graph
One vs Two           2 200   2        Euclidean 10NN
Odd vs Even          4 000   2        Euclidean 10NN
Pc vs Mac            1 943   2        Cosine similarity 10NN
Baseball vs Hockey   1 993   2        Cosine similarity 10NN

The second group uses three datasets taken from [15], summarized in Table 2. G50c is an artificial dataset drawn from two Gaussians in 50 dimensions; Coil20 contains 32 x 32 gray-scale images of 20 objects; Uspst consists of the test portion of the USPS handwritten digit database.

Table 2 The datasets of Experiment 2
Dataset   Classes  Size    Labeled  Dimension
G50c      2        550     50       50
Coil20    20       1 440   40       1 024
Uspst     10       2 007   50       256
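The following sketch shows the graph construction used above. The paper specifies 10NN graphs with Euclidean distance or cosine similarity but not the edge weights, so the unweighted (0/1) symmetrized edges here are our assumption.

```python
import numpy as np

def knn_graph(X, k=10, metric="euclidean"):
    """Symmetrized k-NN graph weight matrix W for the Laplacian of Sec. 2.2;
    0/1 edge weights are assumed, matching the 10NN graphs of Sec. 4.1."""
    if metric == "cosine":
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        D = 1.0 - Xn @ Xn.T                                  # cosine distance
    else:
        sq = np.sum(X**2, axis=1)
        D = sq[:, None] + sq[None, :] - 2 * X @ X.T          # squared Euclidean
    np.fill_diagonal(D, np.inf)                              # exclude self-edges
    W = np.zeros_like(D)
    idx = np.argsort(D, axis=1)[:, :k]                       # k nearest neighbors
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = 1.0
    return np.maximum(W, W.T)                                # symmetrize
```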
4.2 Experimental setup and results

In Experiment 1, five sizes of the labeled set are considered for each dataset (the rows of Tables 3 to 6), and each reported accuracy is the average over 20 random draws of the labeled points. As the supervised baseline kernel, we use the RBF (radial basis function) kernel

$$K(x, x') = \theta_1 \exp\!\left( -\frac{(x - x')^{\mathrm T} (x - x')}{2 \theta_2^2} \right) \tag{27}$$

where $\theta_1$ and $\theta_2$ are the RBF hyperparameters. The hyperparameters $\theta$ are chosen by maximizing the approximate marginal likelihood

$$q(\boldsymbol y \,|\, X, \theta) = \int p(\boldsymbol y \,|\, \boldsymbol f)\, q(\boldsymbol f \,|\, X, \theta)\, \mathrm d\boldsymbol f \tag{28}$$

whose gradients with respect to the $\theta_i$ are available from the EP approximation; the tuned RBF kernel is then used in the Gaussian process predictions (6) and (7).

Tables 3 to 6 report the average accuracies on the four datasets of Experiment 1 for three kernels, each combined with both the Gaussian process (GP) classifier and the SVM: the proposed semi-supervised kernel, the supervised RBF kernel (27), and the nonparametric graph kernel of Zhu et al. [11]. The proposed method achieves the best accuracy in almost all settings, and its advantage over the supervised RBF baselines is most pronounced when only a few labeled points are available (for example, 96.1 % versus 67.2 % on One vs Two with 10 labeled points); as the number of labeled points grows, the gap narrows, as expected.

Table 3 The average accuracies on the One vs Two dataset (%)
Labeled  GP (ours)  SVM (ours)  GP (RBF)  SVM (RBF)  GP ([11])  SVM ([11])
10       96.1       96.2        67.2      65.3       85.1       78.7
20       97.3       96.4        81.4      85.6       91.6       90.4
30       98.3       98.2        92.5      93.4       94.9       93.6
40       98.7       98.3        95.6      94.7       95.6       94.0
50       98.9       98.4        96.9      95.9       97.1       96.1

Table 4 The average accuracies on the Odd vs Even dataset (%)
Labeled  GP (ours)  SVM (ours)  GP (RBF)  SVM (RBF)  GP ([11])  SVM ([11])
10       69.7       69.6        62.7      60.1       69.3       65.0
30       83.7       82.4        80.4      76.3       79.6       77.7
50       88.2       87.6        85.2      82.6       83.6       81.8
70       90.1       89.2        87.3      85.3       86.4       84.4
90       92.6       91.5        90.1      89.5       87.2       86.1

Table 5 The average accuracies on the Pc vs Mac dataset (%)
Labeled  GP (ours)  SVM (ours)  GP (RBF)  SVM (RBF)  GP ([11])  SVM ([11])
10       87.8       87.0        55.2      53.4       54.9       51.6
30       90.9       90.3        75.7      72.6       65.4       62.6
50       92.2       91.3        79.3      77.2       70.2       67.8
70       93.4       91.5        83.5      80.7       76.8       74.7
90       93.6       91.5        85.8      83.2       81.2       79.0

Table 6 The average accuracies on the Baseball vs Hockey dataset (%)
Labeled  GP (ours)  SVM (ours)  GP (RBF)  SVM (RBF)  GP ([11])  SVM ([11])
10       96.2       95.7        61.2      59.4       60.2       53.6
30       98.2       98.0        69.9      68.7       72.6       69.3
50       98.6       97.9        77.5      79.6       80.4       77.7
70       98.8       97.9        84.1      85.3       85.5       83.9
90       99.1       98.0        90.4      92.6       89.7       88.5

Table 7 The average accuracies of Experiment 2 (%)
Dataset  Ours  GP (RBF)  SVM (RBF)  SVM [15]  LapSVM [15]
G50c     94.6  88.9      85.2       90.3      95.0
Coil20   87.8  72.1      73.5       77.4      85.4
Uspst    84.2  74.2      71.6       77.9      82.3

Table 7 compares the proposed method, which combines the nonparametric spectral kernel with the RKHS deformation, against the supervised GP and SVM baselines and against the SVM and Laplacian SVM (LapSVM) results reported in [15]. The proposed method clearly outperforms the supervised baselines on all three datasets, and it also exceeds the SVM and LapSVM results of [15] on Coil20 and Uspst. On G50c, the LapSVM result of [15] is slightly higher (95.0 % versus 94.6 %); G50c is an artificial Gaussian dataset with little manifold structure, so the graph-based spectral kernel brings a smaller benefit there than on the image and digit data.
5 Conclusion

In this paper, we have presented a semi-supervised algorithm for Gaussian process classification. A nonparametric semi-supervised kernel is learned from the labeled and unlabeled data through the spectral decomposition of the graph Laplacian and convex optimization of the kernel-target alignment, mapped into a deformed RKHS, and then used as the covariance of a Gaussian process classifier. The resulting model keeps the explicit probabilistic interpretation of Gaussian processes while exploiting the geometry of the unlabeled data, and experiments on handwritten digit, text, and image datasets show that it compares favorably with related methods. A main limitation is the $\mathrm O(N^3)$ cost of kernel learning and inference discussed in Section 3.2; reducing this cost so that the method scales to larger datasets is left for future work.

References

1 Chapelle O, Schölkopf B, Zien A. Semi-Supervised Learning. Cambridge: MIT Press, 2006
2 Rasmussen C E, Williams C K I. Gaussian Processes for Machine Learning. Cambridge: MIT Press, 2006
3 Rasmussen C E. Advances in Gaussian processes. In: Proceedings of the Neural Information Processing Systems Conference. Cambridge, USA: MIT Press, 2006. 1-55
4 Rogers S, Girolami M. Multi-class semi-supervised learning with the E-truncated multinomial probit Gaussian process. In: Proceedings of the Gaussian Processes in Practice Workshop. Berlin, Germany: Springer, 2007. 17-32
5 Lawrence N D. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 2005, 6: 1783-1816
6 Lawrence N D, Candela J Q. Local distance preservation in the GP-LVM through back constraints. In: Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, USA: ACM, 2006. 513-520
7 Lawrence N D. Learning for larger datasets with the Gaussian process latent variable model. In: Proceedings of the 11th International Workshop on Artificial Intelligence and Statistics. San Juan, Puerto Rico: Morgan Kaufmann Publishers, 2007. 1-8
8 Lawrence N D, Moore A J. Hierarchical Gaussian process latent variable models. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA: ACM, 2007. 481-488
9 Urtasun R, Darrell T. Discriminative Gaussian process latent variable model for classification. In: Proceedings of the 24th International Conference on Machine Learning. Corvallis, USA: ACM, 2007. 927-934
10 Joachims T. Transductive inference for text classification using support vector machines. In: Proceedings of the 16th International Conference on Machine Learning. San Francisco, USA: Morgan Kaufmann Publishers, 1999. 200-209
11 Zhu X J, Kandola J, Ghahramani Z, Lafferty J. Nonparametric transforms of graph kernels for semi-supervised learning. In: Proceedings of the Conference on Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 2005. 1641-1648
12 Chapelle O, Weston J, Schölkopf B. Cluster kernels for semi-supervised learning. In: Proceedings of the Conference on Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 2002. 585-592
13 Kondor R I, Lafferty J D. Diffusion kernels on graphs and other discrete input spaces. In: Proceedings of the 19th International Conference on Machine Learning. San Francisco, USA: Morgan Kaufmann Publishers, 2002. 315-322
14 Zhu X J, Ghahramani Z, Lafferty J D. Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th International Conference on Machine Learning. Washington D. C., USA: AAAI Press, 2003. 912-919
15 Sindhwani V, Niyogi P, Belkin M. Beyond the point cloud: from transductive to semi-supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning. Bonn, Germany: ACM, 2005. 824-831
16 Sindhwani V, Chu W, Keerthi S S. Semi-supervised Gaussian process classifiers. In: Proceedings of the International Joint Conference on Artificial Intelligence. San Francisco, USA: Morgan Kaufmann Publishers, 2007. 1059-1064
17 Kimeldorf G S, Wahba G. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 1971, 33(1): 82-95
18 Cristianini N, Shawe-Taylor J, Elisseeff A, Kandola J. On kernel-target alignment. In: Proceedings of the Conference on Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 2002. 367-373
19 Lanckriet G R G, Cristianini N, Bartlett P, El Ghaoui L, Jordan M I. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 2004, 5: 27-72
20 Minka T P. A Family of Algorithms for Approximate Bayesian Inference [Ph.D. dissertation], Massachusetts Institute of Technology, USA, 2001
21 Kullback S. Information Theory and Statistics. New York: Dover, 1959
22 Smith B, Bjørstad P, Gropp W. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge: Cambridge University Press, 1996

LI Hong-Wei  Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. His research interest covers image and video processing, and machine learning. Corresponding author of this paper. E-mail: hwli@nlpr.ia.ac.cn

LIU Yang  Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences. His research interest covers image and video processing, and machine learning. E-mail: liuyang@nlpr.ia.ac.cn

LU Han-Qing  Professor at the Institute of Automation, Chinese Academy of Sciences. His research interest covers image and video processing, and multimedia techniques. E-mail: luhq@nlpr.ia.ac.cn

FANG Yi-Kai  Researcher at Nokia Research Center, China. He received his Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences in 2008. His research interest covers image processing and pattern recognition. E-mail: ykfang@gmail.com