On Represenations for Concept Operations of Convolutional Neural Networks

P3-21 On Represenations for Concept Operations of Convolutional Neural Networks Shin Asakawa Center for Information Sciences, Tokyo Woman s Christian University asakawa@ieee.org Abstract This article was aimed to prompt us to understand several key concepts of deep learnings progressing a great deal recently. Those were LeNet, AlexNet, GoogleLeNet, VGG, MSRA(SPP-net), NIN, R-CNN, and Selective search. We also described not only these state-of-the-art algorithms in detail but also the pros and cons of them. Furthermore, these models would show us importand feed backs to understandings of our brains themselves. Such mutual interactions, between progression of neural networks and knowledge about our brains, would give us deeper understanding ourselves. Keywords Convolutional Neural Networks, Max-pooling, Higer-order Cognitive Processes, Concept acquisitions 1. [Hubel 59, Hubel 68] (Convolutional Neural Networks: CNN) [LeCun 98] [Russakovsky 15] [Russakovsky 15] [Zhou 14] [Karayev 14] [Girshick 14] [Donahue 14] [Girshick 14, Donahue 14, Long 15] [Levine 15] [Fukushima 82] CNN+Max-pooling CNN CNN+ (Max-pooling) [Rogers 04] [Kemp 09] [Osherson 90] CNN Max-pooling 2. CNN LeNet (deeper successors, [Long 15] ) AlexNet, GoogleLeNet, VGG, MSRA 2.1 LeNet 1 LeNet [LeCun 98] 2 Max-pooling CNN[Ranzato 07] ( 1) f (x χ) G (χ) dχ d f (x) x G [sin, cos] (x)exp( x) ( 2) [Chatfield 14] CNN 622

P3-21 2 f : what q : where [Ranzato 07] 2 2 CNN ( ) ( ( 2π X 2 G (x, y) =cos λ X + γ 2 Y 2) ) exp 2σ 2, (1) X = x cos θ + y sin θ, Y = x sin θ + y cos θ θ σ λ [Serre 05] CNN x w f ( what ) q ( where ) CNN+Maxpooling 2 f = f f (x; w e ) q = f q (x; w e ) f f f q w e f d (f, q; w d ) x H d = x f d (f, q; w d ) 2 H e = f f e (x, q; w e ) 2 EM f H d + αh e, (α >0) f f α =1 w e w d 1. x q f 0 = f e (x, q; w e ) q 2. q f 0 H d + αh e f f 3. : Δw d x f d (f, q; w d ) 2 w d 1 4. f 1 Δw e f f e (x, q; w e ) 2 w e. f f 1 f 2 CNN Max-pooling [LeCun 98] [Ranzato 07, Scherer 10] Max-pooling ( where ) Max-pooling (1) Max-pooling (2) (3) 623

P3-21 2.2 AlexNet 2010 (ILSVRC) 2012 SVM [Haussler 92] 10% AlexNet [Krizhevsky 12] ILSVRC2012 1 ( ) 26.2% 5 15.3% AlexNet (1) 3 [Krizhevsky 12] 2 AlexNet CNN s z z s = z s <z AlexNet s =2,z =3 3 2 2 GPU 3 4 4 5 (Rectified Linear Unit: ReLU), (2) (Local Responese Normalization: LRN), (3) dropout, (4) (5) GPU f (x) =(1+exp( x)) 1 f (x) =tanh(x) ReLU f (x) =max(0,x) ReLU x = f (x) = ReLU a b b i = a i ( κ + α min(n 1,i+n/2) j=max(0,i n/2) (a j) 2) β, (2) (Local Response Normalization: LRN) n N κ = 2, n = 5, α =10 4, β =0.75 a S (1 + exp ( x)) 1 tanh (x) LRN ReLU 2.3 GoogleLeNet GoogleLeNet ILSVRC2014 LeNet 5 [LeCun 98] ILSVRC2014 CPU R-CNN GoogleLeNet 4 4 GoogleLeNet [Szegedy 15] 3 5 ( ) (1) (2) CPU GPU 624

P3-21 GoogleLeNet ( 5) 5 5 GoogleLeNet [Szegedy 15] 2 Max-pooling GoogleLeNet 1. 5 5 3 2. 1 1 ReLU 3. 1024 ReLU 4. dropout 0.7 5. softmax 1000 6. (Stochastic gradient descent) 0.9, 8 0.04 2.4 Very Deep (VGG) VGG ILSVRC2014 1 1 2 VGG 16 19 CNN 3 3 224 224 RGB RGB 2 1 1 8 3 3 1 Google DeepMind VGG 3 3 ( ) 1 (spatial padding) 1 3 3 Max-pooling 2 2 2 VGG 3 5 Max-pooling 4 softmax VGG A E 5 LRN A-LRN 6 GoogleLeNet VGG 22 1 1, 3 3, 5 5 (L2 ) 5 10 4 dropout 0.5 GoogleLeNet VGG 5 Google- LeNet 6.7%, VGG 6.7% 5.1% [Russakovsky 15] 4.94% arxiv [He 15a] 2 2.5 MSRA MSRA ILSVRC2014 SPP-net ( Spatial Pyramid Pooling) [He 15b] 2 3 SPP 6 GoogleLeNet ( 5) GoogleLeNet SPP-net AlexNet ( 3) (5 ) (6 ) SPP ( 6) 5 6 SPP SPP 2 (http://www. technologyreview.com/view/538111/ why-and-how-baidu-cheated-an-artificial-intelligence-test/) 625

P3-21 NIN Vapnik[Vapnik 71] [Vapnik 95] 8 [Lin 14] 2 6 [He 15b] 3 7 SPP 7 1. 2., 3. SPP 3. Regional CNN (R-CNN) SNS (MRF), (CRF) BOW (Bag-Of-Words) BOVW (Bag-of-Visual-Words) 7 [He 15a] 5 9 R-CNN[Girshick 14] 1 2.6 Network-In-Network (NIN) Network-in-Network(NIN) [Lin 14] Max-pooling (Global average pooling) Maxpooling Maxout [Goodfellow 13] Maxout Girshick [Girshick 14] CNN SVM[Haussler 92] 3.1 [Uijlings 13] 626

P3-21 [Girshick 15] 3.2 DeepFace 10 LRCN (Long Recurrent Convolutional Netowrks) [Donahue 14] 1 R-CNN [Susskind08,Taigman14,Liu13] [Vinyals 15, Fang 15] DeepFace [Taigman 14] 4030 440 DeepFace 12 [Taigman 14] 1 11 [Uijlings 13] 2, 3 Object Hypothesis R-CNN CNN [Long 15] CNN CNN (Fully Connected Networks: FCN) CNN CNN [Hariharan 14] R-CNN (1) 2 (2) 3 3 (3) (4) 2 6 Local Binary Pattern Support Vector Regression 3 67 3D 3 3 DeepFace DeepFace 3 1 Max-pooling LFW (Labeled Faces in the 627

P3-21 [Donahue 14] ( 15) 13 [Taigman 14] 2 Wild) 97.25% 97.53% 14 [Fan 14] 14 CNN [Fan 14] 2 CNN CNN [Bromley 94] 3 2 CNN 2 ID CNN CNN CNN greedy 1 LFW 97.3% 4 CNN 3 2 4 FACE++, http://www.faceplusplus.org 15 [Donahue 14] 3 4. CNN what where R-CNN ( [Rosch 76]) [Warrington 75] CNN+Max-pooling Max-pooling 15 628

P3-21 QOL [Bromley 94] Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., and Shah, R.: Signature Verification using a Siamese Time Delay Neural Network, in Cowan, J., Tesauro, G., and Alspector, J. eds., Advances in Neural Information Processing Systems 6, pp. 737 744, Morgan-Kaufmann (1994) [Chatfield 14] Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A.: Return of the Devil in the Details: Delving Deep into Convolutional Nets, in British Machine Vision Conference (2014) [Donahue 14] Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalany, S., Saenko, K., and Darrell, T.: Long-term Recurrent Convolutional Networks for Visual Recognition and Description,. (2014) [Fan 14] Fan, H., Cao, Z., Jiang, Y., Yin, Q., and Doudou, C.: Learning Deep Face Representation, CoRR, Vol. abs/1403.2802, (2014) [Fang 15] Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollar, P., Gao, J., He, X., Mitchell, M., Platt, J., Zitnick, L., and Zweig, G.: From Captions to Visual Concepts and Back, in The proceedings of CVPR, IEEE Institute of Electrical and Electronics Engineers, Boston, MA, USA (2015) [Fukushima 82] Fukushima, K. and Miyake, S.: Neocognitron: A New Algorithm for Pattern Recognition tolerant of Deformations and Shifts in Position, Pattern Recognition, Vol. 15, pp. 455 469 (1982) [Girshick 14] Girshick, R., Donahue, J., Darrell, T., and Malik, J.: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, in Proceedings of Computer Vision and Pattern Recognition Conference (CVPR), Columbus, Ohio, USA (2014) [Girshick 15] Girshick, R.: Fast R-CNN (2015) [Goodfellow 13] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y.: Maxout Networks, arxiv:1302.4389v4, stat.ml (2013) [Hariharan 14] Hariharan, B., Arbeláaez, P., Girshick, R., and Malik, J.: Simultaneous Detection and Segmentation, in Proceedings of the European Conference on Computer Vision, Zurich, Switterland (2014) [Haussler 92] Haussler, D. ed.: A Training Algorithm for Optimal Margin ClassifiersACM press (1992) [He 15a] He, K., Zhang, X., Ren, S., and Sun, J.: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Technical report, Microsoft (2015) [He 15b] He, K., Zhang, X., Ren, S., and Sun, J.: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015, pp. 1 1 (2015) [Hubel 59] Hubel, D. and Wiesel, T. N.: Receptive Fields of Single Neurones in the Cat s Striate Cortex, Journal of Physiology, Vol. 148, pp. 574 591 (1959) [Hubel 68] Hubel, D. and Wiesel, T. N.: Receptive Fields and Functional Architecture of Monkey Striate Cortex, Journal of Physiology, Vol. 195, pp. 215 243 (1968) [Karayev 14] Karayev, S., Trentacoste, M., Han, H., Agarwala, A., Darrell, T., Hertzmann, A., and Winnemoeller, H.: Recognizing Image Style, in Proceedings of the British Machine Vision Conference, BMVAPress (2014) [Kemp 09] Kemp, C. and Tenenbaum, J. B.: Structured Statistical Models of Inductive Reasoning, Psychological Review, Vol. 116, No. 1, pp. 20 58 (2009) [Krizhevsky 12] Krizhevsky, A., Sutskever, I., and Hinton, G. E.: ImageNet classification with deep convolutional neural networks, in Bartlett, P. L., Pereira, F. C. N., Burges, C. J. C., Bottou, L., and Weinberger,K.Q.eds.,Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, USA (2012) [LeCun 98] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P.: Gradient-based learning applied to document recognition, Proceedings of the IEEE, Vol. 86, pp. 2278 2324 (1998) [Levine 15] Levine, S., Finn, C., Darrell, T., and Abbeel, P.: End-to-End Training of Deep Visuomotor Policies, Technical report, Berkely Vison and Learning Center BVLC, Report Series 100 (2015) [Lin 14] Lin, M., Chen, Q., and Yan, S.: Network In Network, arxiv:1312.4400v3 (2014) [Liu 13] Liu, M., Li, S., Shan, S., and Chen, X.: AU-aware Deep Networks for facial expression recognition, in Automatic Face and Gesture Recognition (FG) on the 10th IEEE International Conference and Workshops, pp. 1 6, Shanhai, China (2013), IEEE [Long 15] Long, J., Shelhamer, E., and Darrell, T.: Fully Convolutional Networks for Semantic Segmentation, Computer Vision and Pattern Recognition (CVPR) (2015) [Osherson 90] Osherson, D. N., Wilkie, O., Smith, E. E., and López, A.: Category-based induction, Psychological Review, Vol. 97, No. 2, pp. 185 200 (1990) [Ranzato 07] Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y.: Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, in 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, Minnesota, USA (2007) [Rogers 04] Rogers, T. T. and McClelland, J. L.: Semantic Cognition: A Parallel Distributed Processing Approach, The MIT press, Cambridge, MA (2004) [Rosch 76] Rosch, E., Mervis, C. B., Gray, W. D., M, D., and Boyes-braem, P.: Basic objects in natural categories, Cognitive Psychology (1976) [Russakovsky 15] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (2015) [Scherer 10] Scherer, D., Müller, A., and Behnke, S.: Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition, in 20th International Conference on Artificial Neural Networks (ICANN), pp. 629

P3-21 92 101, Thessaloniki, Greece (2010) [Serre 05] Serre, T., Wolf, L., and Poggio, T.: Object Recognition with Features Inspired by Visual Cortex, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 994 1000, San Diego, CA, USA (2005) [Susskind 08] Susskind, J. M., Anderson, A. K., Hinton, G. E., and Movellan, J. R.: Generating Facial Expressions with Deep Belief Nets, in Or, J. ed., Affective Computing, chapter 23, pp. 421 440, INTECH Open Access Publisher (2008) [Szegedy 15] Szegedy, C., Liu, W., Jia, Y., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A.: Going deeper with convolutions, in Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA (2015) [Taigman 14] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L.: DeepFace: Closing the Gap to Human- Level Performance in Face Verification, in Proceedings of Computer Vision and Pattern Recognition Conference (CVPR), Columbus, Ohio, USA (2014) [Uijlings 13] Uijlings, J. R. R., Sande, van de K. E. A., Gevers, T., and Smeulders, A. W. M.: Selective Search for Object Recognition, International Journal of Computer Vision, Vol. 104, No. 2, pp. 154 171 (2013) [Vapnik 71] Vapnik, V. N. and Chervonenkis, A. Y.: On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities, Theory of Probability and Its Applications, Vol. 16, No. 2, pp. 264 280 (1971) [Vapnik 95] Vapnik, V. N.: The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, USA (1995) [Vinyals 15] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D.: Show and Tell: A Neural Image Caption Generator, in Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA (2015) [Warrington 75] Warrington, E. K.: The Selective impairment of semantic memory, Quarterly Journal of Experimental Psychology, Vol. 27, pp. 635 657 (1975) [Zhou 14] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A.: Learning Deep Features for Scene Recognition using Places Database, in Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. eds., Advances in Neural Information Processing Systems 27, pp. 487 495, Curran Associates, Inc. (2014) 630