E-mail: nakayama@ci.i.u-tokyo.ac.jp

Abstract

We propose an efficient method for learning the convolution filters of deep networks using the Fisher weight map (FWM). Instead of backpropagation, the filters are obtained analytically by solving an eigenvalue problem that maximizes the class separability of the convolved feature maps. Networks built this way can be trained layer by layer on a CPU and achieve competitive accuracy on standard object recognition benchmarks.

1 Introduction

Deep learning [12, 17], and in particular convolutional neural networks (CNNs) [25, 30, 19], has recently achieved remarkable results in visual recognition [5, 21, 23]. The convolutional architecture of CNNs was originally inspired by the receptive fields found in the visual cortex [18]. Training deep CNNs by backpropagation is computationally expensive, however, and state-of-the-art systems typically rely on GPU implementations [5, 21]. Good performance also depends on techniques such as unsupervised pre-training [11] and dropout [16], and on careful tuning of many hyperparameters [2]. In this paper we propose learning the convolution filters of CNN-like architectures with the Fisher weight map (FWM): the filters are computed in closed form from labeled data, which makes training fast and largely free of hyperparameter search. We describe FWM-based convolution and evaluate it against unsupervised alternatives.

2 Related work

The Fisher weight map was originally proposed by Shinohara and Otsu for facial expression recognition [29]. It belongs to the family of subspace methods such as eigenfaces [33] and Fisherfaces [1], which apply PCA or Fisher discriminant analysis directly to image pixels; the FWM instead learns a discriminative spatial weighting over local features.
[Figure 1: Overall architecture. Layer 0 (raw image) is encoded by a patch descriptor; Layers 1-3 alternate convolutions using the Fisher weight map with pooling and rectification; Layer 4 is fully connected (logistic regression) and produces the output.]

Another related line of work is the discriminative spatial pyramid [15], which learns discriminative weights over the regions of a spatial pyramid [22] used in pyramid matching. Shinohara and Otsu [29] also described an unsupervised counterpart of the FWM based on PCA, the eigen weight map (EWM). In this paper we use both the EWM and the FWM not on whole images but as a means of learning the convolution filters of CNN-like deep networks.

3 Proposed method

[Figure 2: Convolution using a weight map. Each n × n window of the m_k feature maps of layer k is stacked into a vector; the weight vectors w_1, ..., w_{m_{k+1}} (Fisher weight map or eigen weight map) map it to the m_{k+1} feature maps z_1, ..., z_{m_{k+1}} of layer k+1.]

Figure 1 shows the overall architecture. The network is built greedily, layer by layer, in a feed-forward manner [17].

3.1 Convolution with weight maps

Figure 2 illustrates the convolution step. Let $f^{(k)}(x, y) \in \mathbb{R}^{m_k}$ denote the feature vector at position $(x, y)$ of layer $k$, which consists of $m_k$ feature maps of size $P_k \times P_k$ (layer 0 is the raw RGB image, with $m_0 = 3$). For an $n \times n$ convolution window, we concatenate the $n^2$ feature vectors inside the window into $x^{(k)}(x, y) \in \mathbb{R}^{m_k n^2}$. There are $(P_k - n + 1)^2$ window positions, and we collect the corresponding vectors into the matrix

$$X = \left( x^{(k)}(1,1) \;\; x^{(k)}(2,1) \;\; \cdots \;\; x^{(k)}(P_k - n + 1,\, P_k - n + 1) \right). \quad (1)$$

A weight vector $w \in \mathbb{R}^{m_k n^2}$ then produces one output feature map $z = X^T w$, one output pixel per window position; $m_{k+1}$ weight vectors $w_1, \ldots, w_{m_{k+1}}$ produce the $m_{k+1}$ maps of layer $k+1$. The weight vectors are learned with the eigen weight map or the Fisher weight map, described below; a toy sketch of the convolution step itself follows.
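To make Eq. (1) and the mapping $z = X^T w$ concrete, here is a minimal NumPy sketch (our illustration, not the authors' code; the channel-first array layout and the function names are assumptions):

```python
import numpy as np

def stack_windows(f, n):
    """Build the matrix X of Eq. (1): one column x^(k)(x, y) in
    R^{m_k n^2} per n x n window position over the m_k input maps."""
    m, P, _ = f.shape                     # m_k maps of size P_k x P_k
    Q = P - n + 1                         # window positions per axis
    cols = [f[:, y:y + n, x:x + n].reshape(-1)
            for y in range(Q) for x in range(Q)]
    return np.stack(cols, axis=1)         # shape (m * n * n, Q * Q)

def weight_map_convolution(f, W, n):
    """Compute the next layer's maps z_j = X^T w_j for every column
    w_j of W, reshaped back into Q x Q feature maps."""
    X = stack_windows(f, n)
    Q = f.shape[1] - n + 1
    return (X.T @ W).T.reshape(W.shape[1], Q, Q)

# Example: layer 0 is a 3-channel 32 x 32 image, 100 weight vectors.
f0 = np.random.rand(3, 32, 32)
W = np.random.randn(3 * 5 * 5, 100)       # placeholder weights
print(weight_map_convolution(f0, W, 5).shape)  # (100, 28, 28)
```

This is an ordinary valid convolution written in "im2col" form, which is exactly the form in which the EWM and FWM learn the weights.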
Eigen weight map (EWM). The EWM chooses $w$ to maximize the variance of the output map $z$ over the $N$ training images:

$$J_E(w) = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^T (z_i - \bar{z}) = w^T \left\{ \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})^T \right\} w = w^T \Sigma_X w, \quad (2)$$

where $z_i = X_i^T w$ is the output map of the $i$-th image and $\bar{z}$, $\bar{X}$ are the means of the $z_i$ and $X_i$. Maximizing $J_E(w)$ under $\|w\| = 1$ leads to the eigenvalue problem

$$\Sigma_X w = \lambda w, \quad (3)$$

whose leading eigenvectors are taken as the weight vectors. The EWM is thus the weight-map analogue of PCA.

Fisher weight map (FWM). Whereas the EWM is unsupervised, the FWM chooses $w$ to maximize the Fisher criterion $J_F(w)$, i.e., the between-class separability of the output maps $z$ relative to their within-class scatter. Define the within-class and between-class covariance matrices of $z$:

$$\Sigma_W = \frac{1}{N} \sum_{j=1}^{C} \sum_{i=1}^{N_j} (z_i^{(j)} - \bar{z}^{(j)}) (z_i^{(j)} - \bar{z}^{(j)})^T, \quad (4)$$

$$\Sigma_B = \frac{1}{N} \sum_{j=1}^{C} N_j (\bar{z}^{(j)} - \bar{z}) (\bar{z}^{(j)} - \bar{z})^T, \quad (5)$$

where $C$ is the number of classes, $N_j$ the number of training images of class $j$, $z_i^{(j)}$ the output map of the $i$-th image of class $j$, and $\bar{z}^{(j)}$ the mean over class $j$. Their traces are quadratic forms in $w$:

$$\mathrm{tr}\, \Sigma_W = \frac{1}{N} \sum_{j=1}^{C} \sum_{i=1}^{N_j} (z_i^{(j)} - \bar{z}^{(j)})^T (z_i^{(j)} - \bar{z}^{(j)}) = w^T \left\{ \frac{1}{N} \sum_{j=1}^{C} \sum_{i=1}^{N_j} (X_i^{(j)} - \bar{X}^{(j)})(X_i^{(j)} - \bar{X}^{(j)})^T \right\} w = w^T \tilde{\Sigma}_W w, \quad (6)$$

$$\mathrm{tr}\, \Sigma_B = \frac{1}{N} \sum_{j=1}^{C} N_j (\bar{z}^{(j)} - \bar{z})^T (\bar{z}^{(j)} - \bar{z}) = w^T \left\{ \frac{1}{N} \sum_{j=1}^{C} N_j (\bar{X}^{(j)} - \bar{X})(\bar{X}^{(j)} - \bar{X})^T \right\} w = w^T \tilde{\Sigma}_B w. \quad (7)$$

The Fisher criterion is therefore

$$J_F(w) = \frac{\mathrm{tr}\, \Sigma_B}{\mathrm{tr}\, \Sigma_W} = \frac{w^T \tilde{\Sigma}_B w}{w^T \tilde{\Sigma}_W w}, \quad (8)$$

and is maximized by the leading eigenvectors of the generalized eigenvalue problem

$$\tilde{\Sigma}_B w = \lambda \tilde{\Sigma}_W w. \quad (9)$$

(A small numerical sketch of the EWM and FWM computations is given at the end of this section.)

3.2 Rectification

Classical CNNs use saturating nonlinearities such as tanh [25, 19]; more recently, Rectified Linear Units (ReLU) [27, 21], $R(x) = \max(0, x)$, have become standard. Because the plain ReLU discards the negative half of each response, we also evaluate, following Coates and Ng [8], a two-sided rectifier that splits every response into its positive and negative parts:

$$R^2(x) = \begin{pmatrix} \max(0, x) \\ \max(0, -x) \end{pmatrix}, \quad (10)$$

which doubles the number of feature maps.

3.3 Pooling

Pooling (sub-sampling) [30] makes the representation robust to small translations and reduces its dimensionality, and is a standard component of deep architectures [6, 5, 34]. We compare average pooling, max pooling, and L2 pooling (the square root of the average squared activation within the window); their trade-offs have been analyzed theoretically by Boureau et al. [4].
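Before moving on, here is a small numerical sketch of the EWM and FWM computations of Eqs. (2)-(9). This is a minimal rendering under our own data layout, not the paper's implementation; the ridge term is an assumption added for numerical stability:

```python
import numpy as np
from scipy.linalg import eigh

def learn_weight_maps(Xs, labels, m_out, method="fwm", ridge=1e-6):
    """Return the top m_out weight vectors of Eq. (3) (EWM) or
    Eq. (9) (FWM). Xs: list of per-image matrices X_i of shape
    (d, Q*Q) with d = m_k n^2; labels: class index of each image."""
    N, d = len(Xs), Xs[0].shape[0]
    X_bar = np.mean(Xs, axis=0)                      # grand mean matrix
    if method == "ewm":                              # Eqs. (2)-(3)
        S = sum((X - X_bar) @ (X - X_bar).T for X in Xs) / N
        evals, evecs = eigh(S)
    else:                                            # FWM, Eqs. (4)-(9)
        Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
        for c in set(labels):
            Xc = [X for X, l in zip(Xs, labels) if l == c]
            Xc_bar = np.mean(Xc, axis=0)             # class-mean matrix
            Sw += sum((X - Xc_bar) @ (X - Xc_bar).T for X in Xc)
            Sb += len(Xc) * (Xc_bar - X_bar) @ (Xc_bar - X_bar).T
        # generalized eigenproblem Sb w = lambda Sw w (Eq. (9));
        # a small ridge keeps Sw positive definite
        evals, evecs = eigh(Sb / N, Sw / N + ridge * np.eye(d))
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:m_out]]                   # shape (d, m_out)
```

The two-sided rectifier of Eq. (10) then simply replaces each output map z with the pair np.maximum(z, 0), np.maximum(-z, 0).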
3.4 First-layer filters

The first layer encodes raw image patches; we use 5 × 5 patches, i.e., 5 × 5 × 3 = 75-dimensional descriptors for RGB images. The first-layer filters are obtained in one of two unsupervised ways:

Random: filter weights drawn uniformly from [-0.05, 0.05] [19]; random first-layer filters are known to work surprisingly well on simple datasets such as MNIST [24].

K-means: following Coates et al. [6, 9], a bag-of-words [10] style pipeline. Patches are whitened with zero component analysis (ZCA), K-means produces visual words from the whitened patches, and each patch is encoded against the visual words with triangular encoding. (A simplified sketch of this pipeline is given after Section 3.5.)

3.5 Notation

We describe architectures with the following shorthand:

Rand(n, d): d random filters of size n × n.
K_m(n, d): d filters of size n × n learned by K-means (bag-of-words style).
C(n, m): convolution producing m maps with n × n filters learned by the EWM or the FWM (C_EWM, C_FWM); plain C denotes C_FWM.
R, R^2: rectified linear units, the one-sided ReLU and the two-sided rectifier of Eq. (10).
AP[MP, L2P](n, s): average [max, L2] pooling with n × n windows and stride s.
AP[MP, L2P]_p: final average [max, L2] pooling into p × p spatial regions.

For example, Rand(5, 200)-R-AP(4, 4)-C(3, 100)-R-AP_2 denotes a network that (1) convolves the input with 200 random 5 × 5 filters, (2) applies ReLU, (3) average-pools with 4 × 4 windows and stride 4, (4) convolves with 100 maps of 3 × 3 FWM filters, (5) applies ReLU, and (6) finally average-pools into 2 × 2 regions.
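As referenced in Section 3.4, here is a simplified sketch of the K_m first-layer pipeline (ZCA whitening, K-means dictionary, triangular encoding), following the general recipe of Coates et al. [6, 9]. All function names are illustrative, and details such as per-patch normalization are omitted:

```python
import numpy as np

def zca_whiten(patches, eps=0.1):
    """ZCA-whiten flattened patches (one patch per row)."""
    mean = patches.mean(axis=0)
    cov = np.cov(patches - mean, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return (patches - mean) @ W

def kmeans_dictionary(patches, d, iters=10, seed=0):
    """Plain Lloyd's K-means on whitened patches; the d centroids
    are the 'visual words' used as first-layer filters."""
    rng = np.random.default_rng(seed)
    D = patches[rng.choice(len(patches), d, replace=False)].copy()
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8
    for _ in range(iters):
        # with unit-norm centroids, the largest dot product picks
        # the nearest centroid in Euclidean distance
        assign = np.argmax(patches @ D.T, axis=1)
        for j in range(d):
            members = patches[assign == j]
            if len(members):
                D[j] = members.mean(axis=0)
        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8
    return D

def triangular_encoding(patches, D):
    """Coates-style triangular encoding: f_j = max(0, mu - dist_j),
    where mu is the patch's mean distance to all centroids."""
    dists = np.linalg.norm(patches[:, None, :] - D[None, :, :], axis=2)
    return np.maximum(0.0, dists.mean(axis=1, keepdims=True) - dists)
```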
4 Experiments

We evaluate the proposed method on three benchmarks: STL-10 [6], CIFAR-10/100 [20], and MNIST [24]. All experiments were run on a PC with a 12-core Xeon 2.7 GHz CPU; no GPGPU was used.

4.1 Datasets

STL-10 [6] contains 96 × 96 images of 10 classes and is designed for learning with few labels: each of its 10 predefined folds provides only 100 labeled training images per class. Following Gens and Domingos [13], we report accuracy averaged over the 10 folds. CIFAR-10/100 [20] are labeled subsets of the Tiny Images dataset [32] with 10 and 100 classes of 32 × 32 images; CIFAR-10 provides 5000 training images per class and CIFAR-100 provides 500. MNIST [24] contains 28 × 28 images of the handwritten digits 0-9 (10 classes), with about 6000 training and 1000 test images per class.

[Figure 4: Accuracy (%, 40-60) on STL-10 for weight maps X in {PCA, PCAW, EWM, EWMW, FWM} and rectifiers Y, with architectures K_m(9, 256)-AP(4, 2)-C_X(3, 256)-Y-AP_2 and the bag-of-words baseline K_m(9, 256)-AP_2.]

[Figure 5: Accuracy (%, 64-78) on CIFAR-10 for weight maps X and rectifiers Y in {R, R^2}, with architectures K_m(5, 256)-AP(3, 2)-C_X(3, 512)-Y-AP_2 and the baseline K_m(5, 256)-AP_2.]

4.2 Comparison of weight maps, rectifiers, and pooling

We first compare two-layer networks with different final pooling (AP_2, MP_2, L2P_2), intermediate pooling, and rectifiers on STL-10 and CIFAR-10 (Tables 1 and 2). Figures 4 and 5 compare the weight maps: besides the EWM and FWM we test PCA filters, and PCAW and EWMW denote PCA and EWM followed by whitening; the baseline is the single-layer bag-of-words network K_m(n, 256)-AP_2. The FWM consistently outperforms PCA and the EWM. On STL-10, whitening (PCAW, EWMW) improves over plain PCA and EWM but does not close the gap to the FWM, and the two-sided rectifier R^2 outperforms R.

Table 1: Classification accuracy (%) on STL-10.

Architecture                                     Acc.
K_m(9, 256)-AP_2                                 48.4
K_m(9, 256)-MP_2                                 54.1
K_m(9, 256)-L2P_2                                51.4
K_m(9, 256)-AP(4, 2)-C(3, 256)-AP_2              56.4
K_m(9, 256)-MP(4, 2)-C(3, 256)-L2P_2             50.9
K_m(9, 256)-MP(4, 2)-C(3, 256)-AP_2              58.4
K_m(9, 256)-MP(4, 2)-C(3, 256)-MP_2              44.6
K_m(9, 256)-MP(4, 2)-C(3, 256)-R-L2P_2           59.6
K_m(9, 256)-MP(4, 2)-C(3, 256)-R-AP_2            60.0
K_m(9, 256)-MP(4, 2)-C(3, 256)-R^2-L2P_2         61.0
K_m(9, 256)-MP(4, 2)-C(3, 256)-R^2-AP_2          61.2

Table 2: Classification accuracy (%) on CIFAR-10.

Architecture                                     Acc.
K_m(5, 256)-AP_2                                 72.2
K_m(5, 256)-MP_2                                 68.3
K_m(5, 256)-L2P_2                                71.9
K_m(5, 256)-MP(3, 2)-C(3, 512)-AP_2              70.4
K_m(5, 256)-AP(3, 2)-C(3, 512)-L2P_2             64.3
K_m(5, 256)-AP(3, 2)-C(3, 512)-AP_2              71.0
K_m(5, 256)-AP(3, 2)-C(3, 512)-MP_2              66.6
K_m(5, 256)-AP(3, 2)-C(3, 512)-R-L2P_2           73.8
K_m(5, 256)-AP(3, 2)-C(3, 512)-R-AP_2            74.5
K_m(5, 256)-AP(3, 2)-C(3, 512)-R^2-L2P_2         76.3
K_m(5, 256)-AP(3, 2)-C(3, 512)-R^2-AP_2          76.6

Overall, FWM convolution with the two-sided rectifier R^2 performs best. The best pooling after the K-means layer differs between datasets: max pooling on STL-10 and average pooling on CIFAR-10. Table 3 examines the filter size n and the number of first-layer maps d on CIFAR-10.
Table 3: Effect of the FWM filter size n and the number of first-layer maps d on CIFAR-10, for K_m(5, d)-AP(3, 2)-C(n, 512)-R^2-AP_2. Accuracy (%); parentheses give the dimensionality n^2 d of the stacked input vectors x; '-' marks configurations that were not evaluated.

n \ d        256             512             1024
3        76.6 (2304)     77.1 (4608)     78.1 (9216)
4        77.3 (4096)     78.3 (8192)     -
5        77.2 (6400)     77.8 (12800)    -

4.3 Stacking convolution layers

Starting from the best settings of Section 4.2, we stack further FWM convolution layers; Tables 4 and 5 show the results. Adding layers improves accuracy on both STL-10 and CIFAR-10, and averaging the outputs of several architectures ((2)+(3)+(6)) helps further. Inserting rectification between intermediate convolution layers does not help: the plain ReLU slightly hurts, and on CIFAR-10 even R^2 between layers gives no gain.

4.4 Comparison with previous work

Table 6 compares our best architectures with published results; rows marked (*) combine several architectures as in Section 4.3. On STL-10 and MNIST the proposed networks are competitive with the state of the art. On CIFAR-10 and CIFAR-100 they fall behind the best backpropagation-trained CNNs such as Maxout [14] and stochastic pooling [34]. A plausible explanation is the amount of labeled data: CIFAR-10 provides 5000 labeled training images per class and CIFAR-100 provides 500, whereas STL-10 provides only 100; fully supervised CNNs profit most from abundant labels, while the FWM is most attractive when labels are scarce.

5 Conclusion

We proposed learning the convolution filters of deep networks analytically with the Fisher weight map. Combined with the two-sided rectifier R^2 and pooling, the resulting networks reach competitive accuracy while being trained layer by layer on a CPU, without backpropagation. Fine-tuning the learned networks with supervised backpropagation is left for future work.

Acknowledgments. This work was supported by JST CREST.

References

[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. PAMI, 19(7):711-720, 1997.
[2] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. Neural Networks: Tricks of the Trade, 2012.
[3] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. In Proc. ISER, 2012.
[4] Y. L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Proc. ICML, 2010.
[5] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proc. IEEE CVPR, 2012.
Table 4: Stacking FWM layers on STL-10: accuracy (%).

(1) K_m(9, 256)-MP_2                                                              54.1
(2) K_m(9, 256)-MP(4, 2)-C(3, 256)-R^2-AP_2                                       61.2
(3) K_m(9, 256)-MP(4, 2)-C(3, 256)-AP(4, 2)-C(3, 256)-R^2-AP_2                    64.0
(4) K_m(9, 256)-MP(4, 2)-C(3, 256)-R-AP(4, 2)-C(3, 256)-R^2-AP_2                  63.3
(5) K_m(9, 256)-MP(4, 2)-C(3, 256)-R^2-AP(4, 2)-C(3, 256)-R^2-AP_2                64.2
(6) K_m(9, 256)-MP(4, 2)-C(3, 256)-AP(4, 2)-C(3, 256)-AP(4, 2)-C(3, 256)-R^2-AP_2 65.7
(2)+(3)+(6)                                                                       66.0

Table 5: Stacking FWM layers on CIFAR-10: accuracy (%).

(1) K_m(5, 256)-AP_2                                                              72.2
(2) K_m(5, 256)-C(3, 512)-R^2-AP_2                                                76.3
(3) K_m(5, 256)-C(3, 256)-AP(3, 2)-C(3, 512)-R^2-AP_2                             77.0
(4) K_m(5, 256)-C(3, 256)-R-AP(3, 2)-C(3, 512)-R^2-AP_2                           76.4
(5) K_m(5, 256)-C(3, 256)-R^2-AP(3, 2)-C(3, 512)-R^2-AP_2                         76.4
(6) K_m(5, 256)-C(3, 256)-AP(3, 2)-C(3, 256)-AP(3, 2)-C(3, 512)-R^2-AP_2          76.4
(2)+(3)+(6)                                                                       79.1

[6] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Proc. AISTATS, 2011.
[7] A. Coates and A. Ng. Selecting receptive fields in deep networks. In Proc. NIPS, 2011.
[8] A. Coates and A. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proc. ICML, 2011.
[9] A. Coates and A. Ng. Learning feature representations with K-means. Neural Networks: Tricks of the Trade, 2012.
[10] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Proc. ECCV Workshop on Statistical Learning in Computer Vision, 2004.
[11] D. Erhan, Y. Bengio, A. Courville, P. A. Manzagol, and P. Vincent. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625-660, 2010.
[12] K. Fukushima. Neocognitron for handwritten digit recognition. Neurocomputing, 51:161-180, 2003.
[13] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Proc. NIPS, 2012.
[14] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In Proc. ICML, 2013.
[15] T. Harada, Y. Ushiku, Y. Yamashita, and Y. Kuniyoshi. Discriminative spatial pyramid. In Proc. IEEE CVPR, pages 1617-1624, 2011.
[16] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint, 2012.
[17] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.
[18] D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology, 148:574-591, 1959.
[19] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proc. IEEE ICCV, 2009.
[20] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[22] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE CVPR, volume 2, 2006.
[23] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In Proc. ICML, 2012.
[24] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 1998.
[26] M. Lin, Q. Chen, and S. Yan. Network in network. In Proc.
ICLR, 2014.
[27] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, 2010.
[28] M. A. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Proc. NIPS, 2006.
[29] Y. Shinohara and N. Otsu. Facial expression recognition using Fisher weight maps. In Proc. IEEE FG, 2004.
[30] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proc. ICDAR, 2003.
[31] N. Srivastava and R. Salakhutdinov. Discriminative transfer learning with tree-based priors. In Proc. NIPS, 2013.
[32] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large dataset for nonparametric object and scene recognition. IEEE Trans. PAMI, 30(11):1958-1970, Nov. 2008.
[33] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. IEEE CVPR, 1991.
[34] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint, 2013.
Table 6: Comparison with published results on STL-10, CIFAR-10/100, and MNIST (accuracy, %). Rows marked (*) combine several architectures as in Section 4.3.

STL-10
1-layer Sparse Coding [8]                                                          59.0
3-layer Learned Receptive Field [7]                                                60.1
Discriminative Sum-Product Network [13]                                            62.3
Hierarchical Matching Pursuit [3]                                                  64.5
K_m(9, 256)-MP(4, 2)-C(3, 256)-AP(4, 2)-C(3, 256)-AP(4, 2)-C(3, 256)-R^2-AP_2      65.7
K_m(9, 1024)-MP(4, 2)-C(3, 512)-AP(4, 2)-C(3, 256)-AP(4, 2)-C(3, 256)-R^2-AP_2     66.4
K_m(9, 1024)-MP(4, 2)-C(3, 512)-AP(4, 2)-C(3, 256)-AP(4, 2)-C(3, 256)-R^2-AP_2 (*) 66.9

CIFAR-10
3-layer Learned Receptive Field [7]                                                82.0
CNN [16]                                                                           83.4
Discriminative Sum-Product Network [13]                                            84.0
CNN (1 locally connected layer) [16]                                               84.4
CNN + Stochastic Pooling [34]                                                      84.9
CNN + Maxout [14]                                                                  88.3
Network in Network [26]                                                            89.6
K_m(5, 1024)-C(3, 256)-AP(3, 2)-C(3, 512)-R^2-AP_3                                 80.4
K_m(5, 1024)-C(3, 256)-AP(3, 2)-C(3, 256)-AP(3, 2)-C(3, 512)-R^2-AP_3 (*)          81.9

CIFAR-100
CNN + Stochastic Pooling [34]                                                      57.49
CNN + Maxout [14]                                                                  61.43
CNN + Tree-based prior [31]                                                        63.15
Network in Network [26]                                                            64.32
K_m(5, 6400)-C(1, 1000)-AP(4, 2)-C(3, 1000)-AP(3, 2)-C(3, 1000)-R^2-AP_3           60.80
K_m(5, 6400)-C(1, 1000)-AP(4, 2)-C(3, 1000)-AP(3, 2)-C(3, 1000)-R^2-AP_3 (*)       62.05

MNIST
CNN (unsupervised pretraining) [28]                                                99.40
CNN (unsupervised pretraining) [19]                                                99.47
CNN + Stochastic Pooling [34]                                                      99.53
Network in Network [26]                                                            99.53
CNN + Maxout [14]                                                                  99.55
Rand(5, 1024)-R-AP(4, 2)-C(3, 512)-R^2-AP_4                                        99.50
Rand(5, 2048)-R-C(1, 1024)-AP(4, 2)-C(3, 512)-R^2-AP_4                             99.60