
[...] 1,a)  [...] 1  [...] 2  [...] 1

1 School of Science and Technology, Gunma University, Tenjin-cho 1-5-1, Kiryu-shi, Gunma, 376-8515 Japan
2 Graduate School of Engineering, Tohoku University, 6-6-05, Aramaki Aza Aoba, Aoba-ku, Sendai, 980-8579, Japan
a) matsuzawa-tomoki@kato-lab.cs.gunma-u.ac.jp

© 2015 Information Processing Society of Japan

[...] SIFT [...] Bag-of-Visual-Words [...] Bag-of-Visual-Words [...]

1. [...]

BoVW [2] [...] BoVW [...] Dense [3] [...] Interest Point [4] [...]

Encoding methods used with BoVW include Histogram Encoding [2], [5], [6], [...] [7], [8], Fisher Encoding [3], VLAD [9], Super Vector [10], and Locality Constrained Linear Encoding [11]. [...] [11], [12], [13] [...] Average Pooling [...] Max Pooling [...] [11], [12], [13] [...] [5], [13] [...] [12] [...]. Chatfield et al. [14] [...] [3] [...] [6] [...] [15] [...] BoVW [...]

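[...] [16], [17]. [...] BoVW [...] K-means [...] Visual Word [...] [18] [...]

2. [...]: BoVW and Fisher Vector (FV)

2.1 Bag-of-Visual-Words (BoVW)

BoVW [2] [...]. Local descriptors (e.g. SIFT [19]) are extracted from an image, and each descriptor is assigned to a Visual Word [14]. [...] Visual Word [...] [12] [...]. With K Visual Words, an image is represented by a K-dimensional vector [...] Visual Word [...].

2.2 Fisher Vector (FV)

FV [...] a generative model $p(x \mid \theta)$ with parameter vector $\theta = [\theta_1, \ldots, \theta_m]^\top$ [20]. For the $T$ local descriptors $X := [x_1, \ldots, x_T]$ of an image, the FV is

$$
f(X) := \left[
\frac{\nabla_{\theta_1} \log p(X \mid \theta)}{\sqrt{E_X\left( (\nabla_{\theta_1} \log p(X \mid \theta))^2 \right)}},
\; \ldots, \;
\frac{\nabla_{\theta_m} \log p(X \mid \theta)}{\sqrt{E_X\left( (\nabla_{\theta_m} \log p(X \mid \theta))^2 \right)}}
\right]^\top .
$$

For $p(X \mid \theta)$, a Gaussian mixture with $K$ components is used:

$$
p(X \mid \theta) = \prod_{t=1}^{T} \sum_{k=1}^{K} \pi_k \, \mathcal{N}\left(x_t ; \mu_k, \mathrm{diag}(\sigma_k)^2\right),
$$

where $\pi_k$ is the mixing coefficient of the $k$-th component with $\sum_k \pi_k = 1$, $\mu_k \in \mathbb{R}^d$ and $\sigma_k \in \mathbb{R}^d$. The parameter vector

$$
\theta = \left[ \pi_1, \ldots, \pi_K, \mu_1^\top, \ldots, \mu_K^\top, \sigma_1^\top, \ldots, \sigma_K^\top \right]^\top
$$

has $(2d+1)K$ entries, so the FV is $(2d+1)K$-dimensional. The $K$ entries for $\pi_1, \ldots, \pi_K$ are [...], which gives a $2dK$-dimensional FV [21], [22], [23]. For the $2dK$-dimensional FV, [24] [...] $E_X\left( (\nabla_{\theta_i} \log p(X \mid \theta))^2 \right)$ [...] (a) [...] responsibility [25] [...] (b) [...] $T$ [...]. The resulting FV is

$$
f_{\mathrm{euc}}(X) := \left[ f^{\mu_1}_{\mathrm{euc}}(X)^\top, \ldots, f^{\mu_K}_{\mathrm{euc}}(X)^\top, f^{\sigma_1}_{\mathrm{euc}}(X)^\top, \ldots, f^{\sigma_K}_{\mathrm{euc}}(X)^\top \right]^\top ,
$$

with the blocks $f^{\mu_k}_{\mathrm{euc}}(X)$ and $f^{\sigma_k}_{\mathrm{euc}}(X)$ defined for $k = 1, \ldots, K$.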
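The diagonal-covariance mixture density of Section 2.2 and the responsibility it induces can be sketched in a few lines. This is only an illustrative sketch: the function names (`log_gauss_diag`, `responsibilities`, `gmm_log_likelihood`, `fv_dimension`) are hypothetical and not taken from the paper, and plain Python lists stand in for vectors. The log-sum-exp step is an implementation choice for numerical stability, not something the text specifies.

```python
import math

def log_gauss_diag(x, mu, sigma):
    """Log of N(x; mu, diag(sigma)^2); sigma holds per-dimension standard deviations."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
        for xi, m, s in zip(x, mu, sigma)
    )

def responsibilities(x, pi, mus, sigmas):
    """Return (gamma, log p(x | theta)) for one descriptor x.

    gamma[k] = pi_k N(x; mu_k, .) / sum_k' pi_k' N(x; mu_k', .).
    """
    logs = [math.log(p) + log_gauss_diag(x, m, s)
            for p, m, s in zip(pi, mus, sigmas)]
    top = max(logs)  # log-sum-exp shift (assumption: added for stability)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights], top + math.log(total)

def gmm_log_likelihood(X, pi, mus, sigmas):
    """log p(X | theta) = sum_t log sum_k pi_k N(x_t; mu_k, diag(sigma_k)^2)."""
    return sum(responsibilities(x, pi, mus, sigmas)[1] for x in X)

def fv_dimension(d, K, drop_mixing_weights=True):
    """FV length: 2dK without the pi-gradients, (2d+1)K with them."""
    return 2 * d * K if drop_mixing_weights else (2 * d + 1) * K
```

For example, with 128-dimensional SIFT descriptors and K = 256 components, `fv_dimension(128, 256)` gives the 2dK count discussed above.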

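The blocks of the FV are

$$
f^{\mu_k}_{\mathrm{euc}}(X) \approx \frac{1}{T\sqrt{\pi_k}} \, \mathrm{diag}(\sigma_k)^{-1} \, Y_{k,\mathrm{euc}} \, \gamma_{k,\mathrm{euc}},
\qquad
f^{\sigma_k}_{\mathrm{euc}}(X) \approx \frac{1}{\sqrt{2}\,T\sqrt{\pi_k}} \left( \mathrm{diag}(\sigma_k)^{-2} \left( Y_{k,\mathrm{euc}} \odot Y_{k,\mathrm{euc}} \right) - 1_d 1_T^\top \right) \gamma_{k,\mathrm{euc}},
$$

where $Y_{k,\mathrm{euc}} := X - \mu_k 1_T^\top$ and the $t$-th entry of $\gamma_{k,\mathrm{euc}} \in \mathbb{R}^T$ is

$$
\frac{\pi_k \, \mathcal{N}\left(x_t ; \mu_k, \mathrm{diag}(\sigma_k)^2\right)}{\sum_{k'} \pi_{k'} \, \mathcal{N}\left(x_t ; \mu_{k'}, \mathrm{diag}(\sigma_{k'})^2\right)} .
$$

Comparing BoVW and FV, [...] $\mathcal{N}\left(x_t ; \mu_k, \mathrm{diag}(\sigma_k)^2\right)$ [...] Visual Word [...] FV [...].

3. [...]: BoVW and FV [...]

[...] [1], [26] [...]. For $x, x' \in \mathbb{R}^d$, the Mahalanobis distance is

$$
D(x, x' ; A) := (x - x')^\top A \, (x - x'),
$$

where $A$ [...].

3.1 BoVW [...]

In [18], [...] BoVW [...] the $K$ Visual Words $V := [v_1, \ldots, v_K] \in \mathbb{R}^{d \times K}$ [...]

$$
J_{\text{cb-ho}}(V ; A) := \sum_{t=1}^{T} \min_{k_t \in \{1, \ldots, K\}} D(x_t, v_{k_t} ; A) .
$$

[...] $K$ [...] Greedy [...] $T$ [...]. This BoVW is called [...] Bag-of-Visual-Words (HoMahaBoVW). [...] $d$ [...] each Visual Word $v_k$ [...] its own $A_k$ [...] BoVW [...] Bag-of-Visual-Words (HeMahaBoVW). HeMahaBoVW assigns a descriptor $x$ to the Visual Word

$$
k \in \operatorname*{argmin}_{k' \in \{1, \ldots, K\}} D(x, v_{k'} ; A_{k'}) .
$$

[...] Visual Word $v_k$ [...].

3.2 FV [...]

In [18], FV [...] $A = U \Lambda U^\top$, $W := \Lambda^{1/2} U^\top$, and $x$ [...] $Wx$ (i.e. $D(x, x' ; A) = \lVert W x - W x' \rVert^2$). [18] uses the generative model

$$
p_{\mathrm{ho}}(X \mid \theta) := \det(W)^{T} \prod_{t=1}^{T} \sum_{k=1}^{K} \pi_k \, \mathcal{N}\left(W x_t ; \mu_k, \mathrm{diag}(\sigma_k)^2\right) .
$$

The FV $f_{\mathrm{ho}}(X)$ derived from this model is called [...] (HoMahaFV). HoMahaFV is

$$
f_{\mathrm{ho}}(X) := \left[ f^{\mu_1}_{\mathrm{ho}}(X)^\top, \ldots, f^{\mu_K}_{\mathrm{ho}}(X)^\top, f^{\sigma_1}_{\mathrm{ho}}(X)^\top, \ldots, f^{\sigma_K}_{\mathrm{ho}}(X)^\top \right]^\top .
$$

With $E_X\left( (\nabla_{\theta_i} \log p_{\mathrm{ho}}(X \mid \theta))^2 \right)$ [...] as for the standard FV (Section 2.2),

$$
f^{\mu_k}_{\mathrm{ho}}(X) \approx \frac{1}{T\sqrt{\pi_k}} \, \mathrm{diag}(\sigma_k)^{-1} \, Y_k \, \gamma_k,
\qquad
f^{\sigma_k}_{\mathrm{ho}}(X) \approx \frac{1}{\sqrt{2}\,T\sqrt{\pi_k}} \left( \mathrm{diag}(\sigma_k)^{-2} \left( Y_k \odot Y_k \right) - 1_d 1_T^\top \right) \gamma_k,
$$

where $Y_k := W X - \mu_k 1_T^\top$ and the $t$-th entry of $\gamma_k \in \mathbb{R}^T$ is

$$
\frac{\pi_k \, \mathcal{N}\left(W x_t ; \mu_k, \mathrm{diag}(\sigma_k)^2\right)}{\sum_{k'} \pi_{k'} \, \mathcal{N}\left(W x_t ; \mu_{k'}, \mathrm{diag}(\sigma_{k'})^2\right)} .
$$

HoMahaFV [...] $W$ [...] HoMahaFV [...]. (HeMahaFV) [...] the generative model of FV [...] uses $W_1, \ldots, W_K$, one per component.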
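The assignment rule of Section 3.1, where each Visual Word $v_k$ carries its own metric $A_k$, can be illustrated as follows. This is a minimal sketch under stated assumptions: the helper names (`mahalanobis`, `assign_heterogeneous`, `bovw_histogram`) are hypothetical, matrices are nested Python lists, and the final histogram-counting step is the usual BoVW encoding rather than something recovered from the text. Passing the same matrix for every word gives the homogeneous (HoMahaBoVW) case.

```python
def mahalanobis(x, v, A):
    """D(x, v; A) = (x - v)^T A (x - v), with A a d x d nested list."""
    diff = [xi - vi for xi, vi in zip(x, v)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))

def assign_heterogeneous(x, words, metrics):
    """Index k minimizing D(x, v_k; A_k), using one metric A_k per Visual Word."""
    dists = [mahalanobis(x, v, A) for v, A in zip(words, metrics)]
    return min(range(len(dists)), key=dists.__getitem__)

def bovw_histogram(X, words, metrics):
    """Count the descriptors assigned to each Visual Word (assumed encoding step)."""
    hist = [0] * len(words)
    for x in X:
        hist[assign_heterogeneous(x, words, metrics)] += 1
    return hist

# Usage: the same point changes its Visual Word once word 1 gets its own metric.
identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[0.01, 0.0], [0.0, 1.0]]   # hypothetical A_1: tolerant along axis 0
words = [[0.0, 0.0], [4.0, 0.0]]
homogeneous = assign_heterogeneous([1.0, 0.0], words, [identity, identity])
heterogeneous = assign_heterogeneous([1.0, 0.0], words, [identity, stretched])
```

With the shared identity metric the point `[1.0, 0.0]` is closest to word 0 (distance 1 versus 9), whereas under the per-word metrics its distance to word 1 shrinks to 0.09, so the assignment changes.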

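The HeMahaFV generative model is

$$
p_{\mathrm{he}}(X \mid \theta) := \prod_{t=1}^{T} \sum_{k=1}^{K} \pi_k \det(W_k) \, \mathcal{N}\left(W_k x_t ; \mu_k, \mathrm{diag}(\sigma_k)^2\right),
$$

where $W_k$ is [...] of the $k$-th [...] $A_k$. With $Y_k := W_k X - \mu_k 1_T^\top$, the FV blocks of $p_{\mathrm{he}}(X \mid \theta)$ for $\mu_k$ and $\sigma_k$ are

$$
f^{\mu_k}_{\mathrm{he}}(X) \approx \frac{1}{T\sqrt{\pi_k}} \, \mathrm{diag}(\sigma_k)^{-1} \, Y_k \, \gamma_k,
\qquad
f^{\sigma_k}_{\mathrm{he}}(X) \approx \frac{1}{\sqrt{2}\,T\sqrt{\pi_k}} \left( \mathrm{diag}(\sigma_k)^{-2} \left( Y_k \odot Y_k \right) - 1_d 1_T^\top \right) \gamma_k,
$$

where the $t$-th entry of $\gamma_k \in \mathbb{R}^T$ is

$$
\frac{\pi_k \det(W_k) \, \mathcal{N}\left(W_k x_t ; \mu_k, \mathrm{diag}(\sigma_k)^2\right)}{\sum_{k'} \pi_{k'} \det(W_{k'}) \, \mathcal{N}\left(W_{k'} x_t ; \mu_{k'}, \mathrm{diag}(\sigma_{k'})^2\right)} .
$$

4. [...] HeMahaFV [...]

4.1 HeMahaBoVW [...]

HeMahaBoVW [...] $K$ Visual Words [...] $A_1, \ldots, A_K$ [...]. HeMahaBoVW [...] for each class $c$, a metric $A^{(c)}$ [...] $c$ [...] $A^{(c)}$ [...] 0.05 [...] Visual Word [...] Visual Word [...] K [...] K 2 [...] HeMahaBoVW [...] $A^{(c)}$ [...] $W^{(c)}$ [...] K 2 [...] K 2 [...] K [...] $p_{\mathrm{he}}(X \mid \theta)$ [...] Greedy [...] EM [...]. EM alternates an E-step and an M-step. For class $c$, the parameters are $\mu_{1,c}, \ldots, \mu_{K,c} \in \mathbb{R}^d$, $\sigma_{1,c}, \ldots, \sigma_{K,c} \in \mathbb{R}^d$ and $\pi_c \in \mathbb{R}^K$. Given $T$ [...] $x_1, \ldots, x_T$, [...]:

$$
L(\theta^{(c)}) := \det(W^{(c)})^{T} \prod_{t=1}^{T} \sum_{k=1}^{K} \pi_{k,c} \, \mathcal{N}\left(W^{(c)} x_t ; \mu_{k,c}, \mathrm{diag}(\sigma_{k,c})^2\right),
$$

where $\theta^{(c)} := \{\mu_{1,c}, \ldots, \mu_{K,c}, \sigma_{1,c}, \ldots, \sigma_{K,c}, \pi_c\}$. [...] EM [...]

Algorithm 1: EM Algorithm for HeMahaFV Method

Input: Observations $x_1, \ldots, x_T \in \mathbb{R}^d$ and initial values $\mu^{(0)}_{1,c}, \ldots, \mu^{(0)}_{K,c} \in \mathbb{R}^d$, $\sigma^{(0)}_{1,c}, \ldots, \sigma^{(0)}_{K,c} \in \mathbb{R}^d$, $\pi^{(0)}_c \in \mathbb{R}^K$.

for $l = 1, 2, \ldots$ do

  E step: Compute, for $(t, k) \in \{1, \ldots, T\} \times \{1, \ldots, K\}$,

  $$
  \gamma^{(l)}_{t,k} := \frac{\pi^{(l-1)}_{k,c} \, \mathcal{N}\left(W^{(c)} x_t ; \mu^{(l-1)}_{k,c}, \mathrm{diag}(\sigma^{(l-1)}_{k,c})^2\right)}{\sum_{k'} \pi^{(l-1)}_{k',c} \, \mathcal{N}\left(W^{(c)} x_t ; \mu^{(l-1)}_{k',c}, \mathrm{diag}(\sigma^{(l-1)}_{k',c})^2\right)} ;
  $$

  M step: Update the parameter values, for $k = 1, \ldots, K$, by

  $$
  \mu^{(l)}_{k,c} := \frac{\sum_t \gamma^{(l)}_{t,k} \, W^{(c)} x_t}{\sum_t \gamma^{(l)}_{t,k}}, \qquad
  (\sigma^{(l)}_{k,c})^2 := \frac{\sum_t \gamma^{(l)}_{t,k} \left( (W^{(c)} x_t - \mu^{(l)}_{k,c}) \odot (W^{(c)} x_t - \mu^{(l)}_{k,c}) \right)}{\sum_t \gamma^{(l)}_{t,k}}, \qquad
  \pi^{(l)}_{k,c} := \frac{1}{T} \sum_t \gamma^{(l)}_{t,k} ;
  $$

end for

4.2 HeMahaFV [...]

HeMahaFV [...] K 2 [...] K 2 [...]

5. [...]

HeMahaBoVW and HeMahaFV [...] 6 [...].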
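The E and M steps of Algorithm 1 (Section 4.1) can be sketched in plain Python. Because $W^{(c)}$ is held fixed, the factor $\det(W^{(c)})^T$ does not depend on $\mu$, $\sigma$ or $\pi$, so the updates are those of a diagonal-covariance mixture fitted to the transformed descriptors $z_t = W^{(c)} x_t$. Everything not in Algorithm 1 is an assumption of this sketch: the function name `em_diag_gmm`, the fixed iteration count `n_iter` (the algorithm states no stopping rule), the variance floor `var_floor`, and the log-sum-exp normalization.

```python
import math

def log_gauss_diag(z, mu, var):
    """Log of N(z; mu, diag(var)); var holds per-dimension variances sigma^2."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (zi - m) ** 2 / v)
               for zi, m, v in zip(z, mu, var))

def em_diag_gmm(X, W, pi, mus, variances, n_iter=20, var_floor=1e-6):
    """EM for a diagonal GMM on z_t = W x_t, with W fixed (hypothetical helper).

    X: list of d-dim descriptors, W: d x d nested list,
    pi / mus / variances: initial mixing weights, means and variances.
    """
    Z = [[sum(wij * xj for wij, xj in zip(row, x)) for row in W] for x in X]
    pi = list(pi)                                # copy so inputs stay untouched
    mus = [list(m) for m in mus]
    variances = [list(v) for v in variances]
    T, K, d = len(Z), len(pi), len(Z[0])
    for _ in range(n_iter):
        # E step: gamma[t][k], normalized over k (log-sum-exp is an assumption).
        gamma = []
        for z in Z:
            logs = [math.log(pi[k]) + log_gauss_diag(z, mus[k], variances[k])
                    for k in range(K)]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            gamma.append([w / total for w in weights])
        # M step: weighted mean, then variance around the *updated* mean, then pi.
        for k in range(K):
            nk = sum(g[k] for g in gamma)
            mus[k] = [sum(g[k] * z[j] for g, z in zip(gamma, Z)) / nk
                      for j in range(d)]
            variances[k] = [
                max(var_floor,                   # floor is an added safeguard
                    sum(g[k] * (z[j] - mus[k][j]) ** 2
                        for g, z in zip(gamma, Z)) / nk)
                for j in range(d)
            ]
            pi[k] = nk / T
    return pi, mus, variances
```

The sketch assumes no component collapses to zero weight; a production implementation would need to guard `math.log(pi[k])` and the division by `nk` for that case.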

Figure 1: Categorization Performance. (a) FMD. (b) LSP15.

[...] the standard BoVW and FV (HoEucBoVW, HoEucFV), HeMahaBoVW [...] (HeEucBoVW), HeMahaFV [...] $\theta$ [...] (HeEucFV), and the methods of [18] (HoMahaBoVW, HoMahaFV).

5.1 [...]

[...] 3 [...] Dense SIFT [...] Visual Word [...]: K = 1024 for BoVW and K = 256 for FV. For FV, Power normalization and L2 normalization [27] [...]; for HeEucFV and HeMahaFV, [...] L2 [...]. The datasets are Flickr Material Database (FMD) [28] and LSP15 [5]. [...] One-vs-rest SVM [...]. FMD [...] Flickr.com [...] 10 [...] 100 [...]. LSP15 [...] 15 [...] 200 [...] 30 [...] 10 [...].

5.2 [...]

Figure 1 [...] FMD [...] LSP15 [...] BoVW [...] FV [...] HeMahaFV [...].

5.3 [...]

[...] 101 [...] CalTech 101 (Cal 101) [29] [...] (Cal 10, Cal 20, ..., Cal 50) [...] (Cal 10, Cal 20, ..., Cal 50) [...] Cal 101 [...] 10 [...] 20 [...] 10 [...] 50 [...] Cal 101 [...] Top-N Accuracy [...] Figure 2 [...] N [...] 10 [...] Figures 1 and 2 [...] HeMahaFV [...] FV (HoEucFV) [...].

6. [...]

[...] BoVW [...] FV [...] HeMahaBoVW [...] HeMahaFV [...] HeMahaBoVW [...] HeMahaFV [...] HeMahaFV [...] BoVW [...] 1 [...] (background clutter) [...] Cinbis et al. [22] [...] Fraz et al. [4] [...] Mid-Level [...] BoVW [...] Visual Word [...] [11], [12], [13] [...] Visual Word [...].

References

[1] Kato, T., Takei, W. and Omachi, S.: A Discriminative Metric Learning Algorithm for Face Recognition, IPSJ Transactions on Computer Vision and Applications, Presented at MIRU2013 as Oral Presentation, Vol. 5, pp. 85-89 (2013).
[2] Csurka, G., Dance, C., Fan, L., Willamowski, J. and Bray, C.: Visual categorization with bags of keypoints, Workshop on statistical learning in computer vision, ECCV, Vol. 1, p. 22 (2004).
[3] Sánchez, J., Perronnin, F., Mensink, T. and Verbeek, J.: Image classification with the Fisher vector: Theory and practice, International journal of computer vision, Vol. 105, No. 3, pp. 222-245 (2013).
[4] Fraz, M., Edirisinghe, E. A. and Sarfraz, M. S.: Mid-level-Representation Based Lexicon for Vehicle Make and Model Recognition, Pattern Recognition (ICPR), 2014 22nd International Conference on, IEEE, pp. 393-398 (2014).
[5] Lazebnik, S., Schmid, C. and Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, Vol. 2, IEEE, pp. 2169-2178 (2006).
[6] Sivic, J. and Zisserman, A.: Efficient Visual Search of Videos Cast as Text Retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 4, pp. 591-606 (2009).
[7] Farquhar, J., Szedmak, S., Meng, H. and Shawe-Taylor, J.: Improving bag-of-keypoints image categorisation: Generative models and pdf-kernels (2005).
[8] Winn, J., Criminisi, A. and Minka, T.: Object categorization by learned universal visual dictionary, Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 2, IEEE, pp. 1800-1807 (2005).
[9] Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P. and Schmid, C.: Aggregating local image descriptors into compact codes, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Vol. 34, No. 9, pp. 1704-1716 (2012).
[10] Zhou, X., Yu, K., Zhang, T. and Huang, T. S.: Image classification using super-vector coding of local image descriptors, Computer Vision - ECCV 2010, Springer, pp. 141-154 (2010).
[11] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. and Gong, Y.: Locality-constrained linear coding for image classification, Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, pp. 3360-3367 (2010).
[12] Boureau, Y.-L., Bach, F., LeCun, Y. and Ponce, J.: Learning mid-level features for recognition, Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, pp. 2559-2566 (2010).
[13] Yang, J., Yu, K., Gong, Y. and Huang, T.: Linear spatial pyramid matching using sparse coding for image classification, Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, pp. 1794-1801 (2009).
[14] Chatfield, K., Lempitsky, V., Vedaldi, A. and Zisserman, A.: The devil is in the details: an evaluation of recent feature encoding methods (2011).
[15] Boiman, O., Shechtman, E. and Irani, M.: In defense of nearest-neighbor based image classification, Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, pp. 1-8 (2008).
[16] Cinbis, R. G., Verbeek, J. and Schmid, C.: Image categorization using Fisher kernels of non-iid image models, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, pp. 2184-2191 (2012).
[17] Tanaka, M., Torii, A. and Okutomi, M.: Fisher Vector based on Full-covariance Gaussian Mixture Model, IPSJ Transactions on Computer Vision and Applications (CVA), Vol. 5, pp. 50-54 (2013).
[18] [...]: [...], PRMU, Vol. 113, No. 403, pp. 201-206 (2014).
[19] Lowe, D. G.: Distinctive image features from scale-invariant keypoints, International journal of computer vision, Vol. 60, No. 2, pp. 91-110 (2004).
[20] Jaakkola, T., Haussler, D. et al.: Exploiting generative models in discriminative classifiers, Advances in neural information processing systems, pp. 487-493 (1999).
[21] Ji, Z.: Decoupling Sparse Coding with Fusion of Fisher Vectors and Scalable SVMs for Large-Scale Visual Recognition, Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, IEEE, pp. 450-457 (2013).
[22] Cinbis, R. G., Verbeek, J. and Schmid, C.: Segmentation driven object detection with Fisher vectors, Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 2968-2975 (2013).
[23] Sydorov, V., Sakurada, M. and Lampert, C. H.: Deep Fisher Kernels - End to End Learning of the Fisher Kernel GMM Parameters.
[24] Perronnin, F. and Dance, C.: Fisher kernels on visual vocabularies for image categorization, Computer Vision and Pattern Recognition, 2007. CVPR 07. IEEE Conference on, IEEE, pp. 1-8 (2007).
[25] Bishop, C. M.: Pattern Recognition and Machine Learning, Springer Science+Business Media, LLC, New York, USA (2006).
[26] Weinberger, K. Q. and Saul, L. K.: Distance Metric Learning for Large Margin Nearest Neighbor Classification, J. Mach. Learn. Res., Vol. 10, pp. 207-244 (online), available from http://dl.acm.org/citation.cfm?id=1577069.1577078 (2009).
[27] Perronnin, F., Sánchez, J. and Mensink, T.: Improving the Fisher kernel for large-scale image classification, Computer Vision - ECCV 2010, Springer, pp. 143-156 (2010).
[28] Sharan, L., Rosenholtz, R. and Adelson, E.: Material perception: What can you see in a brief glance?, Journal of Vision, Vol. 9, No. 8, pp. 784-784 (2009).
[29] Fei-Fei, L., Fergus, R. and Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories, Computer Vision and Image Understanding, Vol. 106, No. 1, pp. 59-70 (2007).