A Survey of Recent Clustering Methods for Data Mining (Part 1)
-- Try Clustering! --

Toshihiro Kamishima
National Institute of Advanced Industrial Science and Technology (AIST)
mail@kamishima.net, http://www.kamishima.net/

keywords: Clustering, Unsupervised Learning, Survey, Data Mining

1. Introduction

Clustering is the task of partitioning a given set of objects into groups, called clusters, so that objects in the same cluster are mutually similar (internal cohesion) while objects in different clusters are dissimilar from one another (external isolation) [Everitt 93, 85]. Because it requires no labeled examples it is a typical form of unsupervised learning, and it is counted among the central tasks of data mining and knowledge discovery in databases [Fayyad 96].

This survey is published in two parts. The present part consists of Sections 1 through 7; the second part will consist of Sections 8 through 12. Section 2 lists textbooks, surveys, and other sources of information. Section 3 reviews the classical methods, namely agglomerative hierarchical clustering and the k-means method. Section 4 discusses how clustering is used in practice and what should be kept in mind when applying it, and Section 5 points to software and other resources available on the Web. Section 6 covers methods for objects described by categorical attributes, and Section 7 covers methods based on probability models. The notation used throughout the paper is summarized in Table 1.

2. Textbooks, Surveys, and Other Sources

Standard textbooks on cluster analysis include Everitt [Everitt 93] and Jain and Dubes [Jain 88]. Jain et al. have also written a comprehensive review of clustering [Jain 99] and a review of statistical pattern recognition that includes clustering [Jain 00], and [ 99] is a textbook in Japanese. In the machine learning literature, closely related work has been done under the name of concept formation [Fisher 91].

Table 1  Notation used in this paper
  x_i : an object
  N : the number of objects
  X = {x_1, ..., x_N} : the set of objects to be clustered
  d : the number of attributes
  x_i = (x_i1, ..., x_id) : the attribute-vector representation of the object x_i
  D_1, ..., D_d : the domains of the d attributes
  k : the number of clusters
  C_1, ..., C_k : the clusters
  n_i : the number of objects in the cluster C_i
  c_i : the centroid of the cluster C_i
  D(x_i, x_j) : the dissimilarity between the objects x_i and x_j
Representative conceptual clustering systems of that line of work include CLUSTER/2 [Michalski 83] and COBWEB [Fisher 87]; see also [ 96]. For clustering aimed at large databases, tutorials have been given by Hinneburg and Keim at KDD [Keim 99] and at ACM SIGMOD [Hinneburg 99]; their tutorial slides are available at http://hawaii.informatik.uni-halle.de/~hinnebur/clustertutorial/. In Japanese, [ 01] describes recent data-mining-oriented methods such as BIRCH.

3. Classical Methods

Clustering methods are traditionally divided into hierarchical methods and partitioning-optimization methods such as the k-means method. Hierarchical methods are further divided into divisive methods, which start from a single cluster containing all the objects and repeatedly split it, and agglomerative methods, which start from N singleton clusters and repeatedly merge them.

An agglomerative method begins with N clusters, each containing one object, and repeatedly merges the pair of clusters C_1 and C_2 whose inter-cluster dissimilarity D(C_1, C_2) is smallest, until all the objects belong to a single cluster; the whole merging history can be drawn as a dendrogram like the one in Fig. 3(b). Different definitions of D(C_1, C_2) yield different methods:

Nearest neighbor method (single linkage method):
  D(C_1, C_2) = \min_{x_1 \in C_1, x_2 \in C_2} D(x_1, x_2)

Furthest neighbor method (complete linkage method):
  D(C_1, C_2) = \max_{x_1 \in C_1, x_2 \in C_2} D(x_1, x_2)

Group average method:
  D(C_1, C_2) = \frac{1}{n_1 n_2} \sum_{x_1 \in C_1} \sum_{x_2 \in C_2} D(x_1, x_2)

Ward's method:
  D(C_1, C_2) = E(C_1 \cup C_2) - E(C_1) - E(C_2),  where  E(C_i) = \sum_{x \in C_i} ( D(x, c_i) )^2

When D(x_i, x_j) is the Euclidean distance, Ward's method merges at every step the pair of clusters whose merger increases the within-cluster sum of squared errors E the least, which is essentially the criterion that the k-means method below tries to minimize.

Partitioning-optimization methods instead divide the N objects directly into k clusters so as to optimize some criterion. The best known is the k-means method, which seeks a partition that minimizes the sum of squared errors

  \sum_{i=1}^{k} \sum_{x \in C_i} ( D(x, c_i) )^2

by the simple iterative procedure shown in Fig. 1; the resulting partition is in general only a local optimum and depends on the choice of the initial centroids.

Fig. 1  The k-means algorithm
  1. Choose k initial centroids c_1, ..., c_k.
  2. Assign each object x in X to the cluster C_i whose centroid minimizes D(x, c_i).
  3. Recompute the centroid c_i of every cluster; if any centroid has changed, go back to step 2, otherwise stop.
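To make the three steps of Fig. 1 concrete, the following is a minimal from-scratch sketch of the k-means iteration in R, the same environment used in Fig. 5 below. The random initialization, the use of squared Euclidean distance as D, and the toy data are assumptions made for this sketch; for real use the built-in kmeans() of Fig. 5 is preferable.

  # A from-scratch sketch of the k-means iteration of Fig. 1.
  # (It does not handle the rare case of a cluster becoming empty.)
  simple.kmeans <- function(x, k, max.iter = 100) {
    x <- as.matrix(x)
    # Step 1: choose k initial centroids (here, k objects drawn at random)
    centers <- x[sample(nrow(x), k), , drop = FALSE]
    for (iter in 1:max.iter) {
      # Step 2: assign each object to the cluster with the nearest centroid
      d2 <- sapply(1:k, function(i) colSums((t(x) - centers[i, ])^2))
      cluster <- max.col(-d2)          # index of the smallest squared distance
      # Step 3: recompute the centroids; stop if none of them moved
      new.centers <- t(sapply(1:k, function(i)
        colMeans(x[cluster == i, , drop = FALSE])))
      if (all(abs(new.centers - centers) < 1e-9)) break
      centers <- new.centers
    }
    list(cluster = cluster, centers = centers)
  }

  # Example on two well-separated groups of points
  set.seed(1)
  x <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 5), ncol = 2))
  cl <- simple.kmeans(x, 2)
  table(cl$cluster)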
4. Using Clustering

4.1 Exploratory Data Analysis

A typical use of clustering is exploratory data analysis: by grouping the objects and inspecting the resulting clusters, an analyst can grasp the overall structure of a data set about which little is known in advance. Cutting et al.'s Scatter/Gather [Cutting 92] is a well-known example of this style of use. It is a cluster-based interface for browsing large document collections, illustrated in the original paper by scattering roughly 5,000 news articles from August 1990 into eight topical clusters; the user selects the clusters that look interesting and scatters them again, gradually narrowing down to the articles being sought.

Because clustering is used in such an exploratory fashion, it is intrinsically difficult to say whether a given clustering result is good; the validity of clustering results [Dubes 79] and procedures for determining the number of clusters [Milligan 85] have therefore been studied extensively.

4.2 Caveats in Applying Clustering

Several points should be kept in mind when clustering is applied in practice [ 02].

One is the so-called curse of dimensionality [ 98]. Consider a d-dimensional hypercube S_1 with sides of length r, and place inside it a hypercube S_2 with sides of length ar (0 < a < 1), as in Fig. 2. Let V be the volume of S_1 and let δv be the volume of the part of S_1 lying outside S_2. Then

  \delta v / V = 1 - a^d,

which approaches 1 as the dimension d grows; even for a = 0.9 the ratio exceeds 0.9 once d is larger than about 22. In other words, in a high-dimensional space almost all of the volume of S_1 is concentrated in a thin shell near its surface, so intuitions about distances and volumes formed in two or three dimensions can be badly misleading when the objects are described by many attributes.

[Fig. 2: the hypercube S_1 with side r and the inner hypercube S_2 with side ar; the shell between them has volume δv.]

Another point concerns how the output of each method should be read. An agglomerative method does not return a single partition but a dendrogram such as the one in Fig. 3(b); a flat partition like the one in Fig. 3(a) is obtained only by cutting the dendrogram at some level, and the choice of that level, or equivalently of the number of clusters, is left to the analyst (a short illustration in R follows Fig. 3).

[Fig. 3: (a) a partition of a set of objects; (b) a dendrogram produced by an agglomerative method; cutting (b) at a given level yields a partition such as (a).]
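The dendrogram-and-cut workflow can be tried directly with R's hclust function, which implements the four criteria of Section 3. The toy data, the linkage choice, and the number of clusters below are assumptions made for this sketch:

  # Agglomerative clustering in R: build a dendrogram, then cut it.
  x <- matrix(rnorm(60), ncol = 2)     # toy data: 30 objects with 2 attributes
  dx <- dist(x)                        # pairwise dissimilarities D(x_i, x_j)
  # hclust's method argument selects the criterion of Section 3:
  #   "single"  = nearest neighbor,   "complete" = furthest neighbor,
  #   "average" = group average,      "ward.D2"  = Ward's method ("ward" in older R)
  hc <- hclust(dx, method = "average")
  plot(hc)                             # the dendrogram, as in Fig. 3(b)
  cluster <- cutree(hc, k = 3)         # cut it into a flat partition, as in Fig. 3(a)
  table(cluster)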
[Fig. 4: an example, due to [Guha 98], of cluster shapes for which the k-means method fails.]

Fig. 5  Running the k-means method and Ward's method in R
  library(mva)                  # load the package providing kmeans (in recent versions of R these functions live in the default stats package)
  x <- read.table("datafile")   # read the objects, one per row
  cl <- kmeans(x, 2, 20)        # k-means with k = 2 clusters and at most 20 iterations
  plot(x, col = cl$cluster)     # plot the objects, colored by assigned cluster

For the interpretation of dendrograms such as the one in Fig. 3(b), see [Everitt 93]. The k-means method has weaknesses of its own: it does not work well for cluster shapes like those in Fig. 4 [Guha 98], and it is taken up again from other angles in Sections 7 and 10. Computational cost also matters when the number of objects N is large: agglomerative methods need the dissimilarities between all pairs of objects and therefore require at least O(N^2) time and memory, whereas each iteration of the k-means method costs only about O(Nk) dissimilarity computations, so k-means-type methods scale to much larger data sets.

5. Resources on the WWW

A good deal of information about clustering, including free software, is available on the WWW. The statistical computing environment R, for example, provides the classical methods of Section 3 as standard functions, so experiments such as the one in Fig. 5 are easy to run. Useful starting points include:

  The R Project: http://www.r-project.org/
    A free environment compatible with the S language that runs on many operating systems; its standard functions include kmeans and hclust (a short Ward's-method sketch follows this list).
  Netlib: http://www.netlib.org/
  Scientific Applications on LINUX: http://sal.kachinatech.com/
    A collection of links to scientific software for LINUX.
  StatPages.net: http://www.statpages.net/
  Recursive-Partitioning.com: http://www.recursive-partitioning.com/
  KDnuggets: http://www.kdnuggets.com/
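As a companion to the k-means commands shown in Fig. 5, the Ward's-method half of such a session can be written with hclust. This is a sketch only: the file name follows Fig. 5, while the number of clusters and the ward.D2 variant are assumptions, not taken from the original figure.

  x <- read.table("datafile")                 # read the objects, as in Fig. 5
  hc <- hclust(dist(x), method = "ward.D2")   # Ward's method ("ward" in older versions of R)
  plot(hc)                                    # inspect the dendrogram
  cl <- cutree(hc, k = 2)                     # cut it into two clusters
  plot(x, col = cl)                           # plot the objects, colored by cluster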
6. Clustering Categorical Data

The methods described so far assume that a dissimilarity such as the Euclidean distance can be computed between objects. When the objects are described by categorical (nominal) attributes, as is common for Web and transaction data, such a dissimilarity is not directly available, and specialized methods have been developed.

6.1 The k-modes Method

One approach is to define a dissimilarity suited to categorical attributes and then apply the classical methods; for binary attributes the Jaccard coefficient, the number of attributes that take the value 1 in both objects divided by the number that take the value 1 in at least one of them, is a standard choice [Jain 88]. Huang's k-modes method [Huang 98] instead adapts the k-means method itself: the dissimilarity between objects is the simple matching distance, i.e. the number of attributes on which the two objects disagree, and the representative of each cluster is not the mean but the mode, the vector of the most frequent attribute values within the cluster.

6.2 ROCK

Guha et al.'s ROCK (RObust Clustering using links) [Guha 99] is an agglomerative method designed for categorical data. Two objects are called neighbors if their similarity, measured for instance by the Jaccard coefficient, is at least a threshold θ, and link(x_q, x_r) denotes the number of neighbors that x_q and x_r have in common. Clusters are merged so as to increase the criterion

  \sum_{i=1}^{k} n_i \sum_{x_q, x_r \in C_i} \frac{link(x_q, x_r)}{n_i^{1 + 2 f(\theta)}},

where f(θ) is a user-chosen function; clusters whose members share many neighbors are thus preferred, while the divisor n_i^{1+2f(θ)} keeps the criterion from being maximized trivially by piling everything into a few large clusters.
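The quantities ROCK is built on are easy to compute for a small example. The sketch below, in R, computes the Jaccard similarities between a handful of binary objects, thresholds them at an assumed θ to obtain the neighbor relation, and counts common neighbors to obtain link(x_q, x_r). It only illustrates these definitions; it is not an implementation of ROCK's merging procedure.

  # Toy data: 5 objects described by 4 binary attributes
  x <- rbind(c(1,1,0,0), c(1,1,1,0), c(1,0,1,0), c(0,0,1,1), c(0,1,1,1))
  n <- nrow(x)
  # Jaccard similarity between every pair of objects
  jac <- matrix(0, n, n)
  for (q in 1:n) for (r in 1:n) {
    both   <- sum(x[q, ] == 1 & x[r, ] == 1)   # attributes that are 1 in both
    either <- sum(x[q, ] == 1 | x[r, ] == 1)   # attributes that are 1 in at least one
    jac[q, r] <- both / either
  }
  theta <- 0.5                     # an assumed neighbor threshold
  nbr <- (jac >= theta) * 1        # 0/1 neighbor matrix (each object is its own neighbor)
  link <- nbr %*% nbr              # link[q, r] = number of common neighbors of x_q and x_r
  link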
6.3 CACTUS

Because ROCK has to examine the relations among all pairs of objects and their shared neighbors, its time complexity can reach O(N^3), which is prohibitive for large databases. Ganti et al.'s CACTUS (CAtegorical ClusTering Using Summaries) [Ganti 99] instead works with compact summary statistics that can be gathered in a small number of scans over the data. Its key notion is the strength of the connection between attribute values: for values a_i ∈ D_i and a_j ∈ D_j of two different attributes, let σ(a_i, a_j) be the number of objects in which a_i and a_j co-occur, and let E[σ(a_i, a_j)] be the number of co-occurrences expected if the two attributes were independent. The values a_i and a_j are said to be strongly connected if σ(a_i, a_j) > α E[σ(a_i, a_j)] for a user-specified constant α > 1. A cluster is then, roughly, a region S = S_1 × ... × S_d with S_i ⊆ D_i satisfying three conditions: (1) every pair S_i, S_j (i ≠ j) is strongly connected, meaning that every value in S_i is strongly connected with every value in S_j; (2) no S_i can be enlarged without violating condition (1); and (3) the support σ(S), the number of objects falling in S, satisfies σ(S) > α E[σ(S)]. CACTUS first computes the pairwise summaries over the attribute domains D_i, D_j, then combines them to generate candidate regions of this form, and finally validates the candidates against the data.

6.4 STIRR

Gibson et al.'s STIRR (Sieving Through Iterated Relational Reinforcement) [Gibson 98] treats the clustering of categorical data as the analysis of a dynamical system. Every value of every attribute becomes a node, each object is viewed as connecting the attribute values it contains, and each node v carries a weight w(v). The weights are updated by repeated propagation, as sketched in Fig. 6: for every object x_τ that contains the value v, the current weights of its other attribute values (x_i2 and x_i3 in the figure) are combined, for example by adding them, into a weight w(x_τ); the w(x_τ) of all the objects containing v are summed to give the new w(v); and the whole weight vector is normalized. Iterating this update drives the weights toward a fixed point, and attribute values that frequently co-occur end up with similar weights, so groups of strongly associated values, and hence clusters, can be read off from the converged weights.

[Fig. 6: the STIRR weight update; the attribute values x_i1, x_i2, x_i3 of an object x_τ are nodes, and the weight of the node v is recomputed from the weights of the values it co-occurs with.]

6.5 Methods Based on Graph Partitioning

Ding et al.'s Mcut (min-max cut) method [Ding 01] formulates clustering as graph partitioning: the objects are the vertices of a graph whose edge weights express similarity, and a partition of the vertices into C_1 and C_2 is sought. Let cut(C_1, C_2) be the total weight of the edges running between C_1 and C_2, and let W(C_1) be the total weight of the edges inside C_1. The partition is chosen to minimize

  Mcut = \frac{cut(C_1, C_2)}{W(C_1)} + \frac{cut(C_1, C_2)}{W(C_2)},

which makes the two clusters weakly connected to each other and strongly connected internally at the same time. Minimizing Mcut exactly is NP-hard, so the indicator vector that assigns 0 to the objects of C_1 and 1 to those of C_2 is relaxed to take continuous values; the relaxed problem reduces to an eigenvector computation on a matrix derived from the edge weights, and each object is finally assigned to C_1 or C_2 according to whether its component of that eigenvector falls below or above a threshold such as 0. See also [Tsuda 96] and [ 00]; graph-partitioning formulations have likewise been used to cluster Web pages by topic [He 01] and to cluster documents and words simultaneously by bipartite spectral graph partitioning [Dhillon 01].

7. Methods Based on Probability Models

7.1 Mixture Models

In model-based clustering each cluster i is modeled by a probability distribution f_i(x | θ_i) with parameters θ_i, and an object is assumed to be generated by first choosing one of the k clusters and then drawing the object from that cluster's distribution. With mixing weights α_i > 0, \sum_{i=1}^{k} \alpha_i = 1, the objects therefore follow the mixture model

  \Pr[x \mid \theta_1, ..., \theta_k] = \sum_{i=1}^{k} \alpha_i f_i(x \mid \theta_i).

The weights α_i and the parameters θ_i are estimated from the observed set of objects X by maximum likelihood, typically with the EM algorithm [Dempster 77], and each object is then assigned to the cluster of largest posterior probability (a small numerical sketch of this iteration is given at the end of this section). The k-means method can be regarded as the special case in which every f_i is a spherical Gaussian of common variance and each object is assigned outright to its most probable cluster. Meilă and Heckerman experimentally compared model-based clustering algorithms of this kind, including EM and k-means-like variants [Meilă 01]; a more general probabilistic framework for clustering individuals and objects is given in [Cadez 00]; see also [ 02].

7.2 AutoClass

Cheeseman et al.'s AutoClass [Cheeseman 96, Hanson 91] is a clustering system built on this kind of probability model. Instead of the maximum likelihood estimate it uses the maximum a posteriori (MAP) estimate, treating the model in a Bayesian way, which also allows models with different numbers of clusters to be compared. AutoClass can be obtained from http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/. Paliouras et al. used AutoClass, together with Kohonen's self-organizing maps [Kohonen 97], to cluster the users of large WWW sites into communities [Paliouras 00].
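The following is the small numerical sketch promised in Section 7.1: the EM iteration for a mixture of two one-dimensional Gaussians, written in R. The synthetic data, the number of components, the random initialization, and the fixed number of iterations are all assumptions made for the sketch; it illustrates the E and M steps for the mixture model above rather than the EM algorithm of [Dempster 77] in full generality.

  # EM for a two-component 1-D Gaussian mixture:
  #   Pr[x] = a[1]*N(m[1], s[1]^2) + a[2]*N(m[2], s[2]^2)
  set.seed(3)
  x <- c(rnorm(150, 0, 1), rnorm(100, 4, 1))   # synthetic data drawn from two clusters
  k <- 2
  a <- rep(1 / k, k)          # mixing weights alpha_i
  m <- sample(x, k)           # initial means (two objects chosen at random)
  s <- rep(sd(x), k)          # initial standard deviations
  for (step in 1:50) {
    # E step: posterior probability that each object was generated by each cluster
    p <- sapply(1:k, function(i) a[i] * dnorm(x, m[i], s[i]))
    p <- p / rowSums(p)
    # M step: re-estimate the mixing weights and the Gaussian parameters
    a <- colMeans(p)
    m <- colSums(p * x) / colSums(p)
    s <- sqrt(colSums(p * outer(x, m, "-")^2) / colSums(p))
  }
  p <- sapply(1:k, function(i) a[i] * dnorm(x, m[i], s[i]))
  cluster <- max.col(p)       # assign each object to its most probable cluster
  rbind(alpha = a, mean = m, sd = s)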
References

[Cadez 00] Cadez, I. V., Gaffney, S., and Smyth, P.: A General Probabilistic Framework for Clustering Individuals and Objects, in Proc. of The 6th Int'l Conf. on Knowledge Discovery and Data Mining, pp. 140-149 (2000)
[Cheeseman 96] Cheeseman, P. and Stutz, J.: Bayesian Classification (AutoClass): Theory and Results, in Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. eds., Advances in Knowledge Discovery and Data Mining, chapter 6, pp. 153-180, AAAI Press/The MIT Press (1996)
[Cutting 92] Cutting, D. R., Karger, D. R., Pedersen, J. O., and Tukey, J. W.: Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, in Proc. of the 15th Annual ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 318-329 (1992)
[Dempster 77] Dempster, A. P., Laird, N. M., and Rubin, D. B.: Maximum Likelihood from Incomplete Data via The EM Algorithm, Journal of the Royal Statistical Society (B), Vol. 39, No. 1, pp. 1-38 (1977)
[Dhillon 01] Dhillon, I. S.: Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning, in Proc. of The 7th Int'l Conf. on Knowledge Discovery and Data Mining, pp. 269-274 (2001)
[Ding 01] Ding, C. H. Q., He, X., Zha, H., Gu, M., and Simon, H. D.: A Min-max Cut Algorithm for Graph Partitioning and Data Clustering, in Proc. of the IEEE Int'l Conf. on Data Mining, pp. 107-114 (2001)
[Dubes 79] Dubes, R. and Jain, A. K.: Validity Studies in Clustering Methodologies, Pattern Recognition, Vol. 11, pp. 235-254 (1979)
[Everitt 93] Everitt, B. S.: Cluster Analysis, Edward Arnold, third edition (1993)
[Fayyad 96] Fayyad, U. M., Piatetsky-Shapiro, G., and Smyth, P.: From Data Mining to Knowledge Discovery: An Overview, in Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. eds., Advances in Knowledge Discovery and Data Mining, chapter 1, pp. 1-34, AAAI Press/The MIT Press (1996)
[Fisher 87] Fisher, D. H.: Knowledge Acquisition via Incremental Conceptual Clustering, Machine Learning, Vol. 2, pp. 139-172 (1987)
[Fisher 91] Fisher, D. H. and Pazzani, M. J.: Computational Models of Concept Learning, in Fisher, D. H., Pazzani, M. J., and Langley, P. eds., Concept Formation: Knowledge and Experience in Unsupervised Learning, chapter 1, pp. 3-43, Morgan Kaufmann (1991)
[ 01] (in Japanese) (2001)
[Ganti 99] Ganti, V., Gehrke, J., and Ramakrishnan, R.: CACTUS - Clustering Categorical Data Using Summaries, in Proc. of The 5th Int'l Conf. on Knowledge Discovery and Data Mining, pp. 73-83 (1999)
[Gibson 98] Gibson, D., Kleinberg, J., and Raghavan, P.: Clustering Categorical Data: An Approach Based on Dynamical Systems, in Proc. of the 24th Very Large Database Conf., pp. 311-322 (1998)
[Guha 98] Guha, S., Rastogi, R., and Shim, K.: CURE: An Efficient Clustering Algorithm for Large Databases, in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 73-80 (1998)
[Guha 99] Guha, S., Rastogi, R., and Shim, K.: ROCK: A Robust Clustering Algorithm for Categorical Attributes, in Proc. of the 15th Int'l Conf. on Data Engineering, pp. 512-521 (1999)
[Hanson 91] Hanson, R., Stutz, J., and Cheeseman, P.: Bayesian Classification with Correlation and Inheritance, in Proc. of the 12th Int'l Joint Conf. on Artificial Intelligence, pp. 692-698 (1991)
[He 01] He, X., Ding, C. H. Q., Zha, H., and Simon, H. D.: Automatic Topic Identification Using Webpage Clustering, in Proc. of the IEEE Int'l Conf. on Data Mining, pp. 195-202 (2001)
[Hinneburg 99] Hinneburg, A. and Keim, D. A.: Clustering Methods for Large Databases: From the Past to the Future, in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, p. 509 (1999)
[Huang 98] Huang, Z.: Extensions to the k-means Algorithm for Clustering Large Data with Categorical Values, Journal of Data Mining and Knowledge Discovery, Vol. 2, pp. 283-304 (1998)
[ 00] (in Japanese), IEICE Trans. D-II, Vol. J83-D-II, No. 3, pp. 957-966 (2000)
[ 98] (in Japanese) (1998)
[Jain 88] Jain, A. K. and Dubes, R. C.: Algorithms for Clustering Data, Prentice Hall (1988)
[Jain 99] Jain, A. K., Murty, M. N., and Flynn, P. J.: Data Clustering: A Review, ACM Computing Surveys, Vol. 31, No. 3 (1999)
[Jain 00] Jain, A. K., Duin, R. P. W., and Mao, J.: Statistical Pattern Recognition: A Review, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, pp. 4-37 (2000)
[Keim 99] Keim, D. A. and Hinneburg, A.: Tutorial 3. Clustering Techniques for Large Data Sets: From the Past to the Future, in Tutorial Notes of The 5th Int'l Conf. on Knowledge Discovery and Data Mining, pp. 141-181 (1999)
[Kohonen 97] Kohonen, T.: Self-Organizing Maps, Springer-Verlag, second edition (1997)
[Meilă 01] Meilă, M. and Heckerman, D.: An Experimental Comparison of Model-Based Clustering Methods, Machine Learning, Vol. 42, pp. 9-29 (2001)
[Michalski 83] Michalski, R. S. and Stepp, R. E.: Learning from Observation: Conceptual Clustering, in Michalski, R. S., Carbonell, J. G., and Mitchell, T. M. eds., Machine Learning I: An Artificial Intelligence Approach, chapter 11, pp. 331-363, Morgan Kaufmann (1983)
[Milligan 85] Milligan, G. W. and Cooper, M. C.: An Examination of Procedures for Determining The Number of Clusters in A Data Set, Psychometrika, Vol. 50, No. 2, pp. 159-179 (1985)
[ 99] (in Japanese) (1999)
[ 85] (in Japanese), Vol. 24, No. 11, pp. 999-1006 (1985)
[Paliouras 00] Paliouras, G., Papatheodorou, C., and Karkaletsis, V.: Clustering the Users of Large Web Sites into Communities, in Proc. of the 17th Int'l Conf. on Machine Learning, pp. 719-726 (2000)
[ 02] (in Japanese), Vol. 43, No. 5, pp. 562-567 (2002)
[ 96] (in Japanese), Vol. 8, No. 3, pp. 463-467 (1996)
[Tsuda 96] Tsuda, K., Minoh, M., and Ikeda, K.: Extracting Straight Lines by Sequential Fuzzy Clustering, Pattern Recognition Letters, Vol. 17, pp. 643-649 (1996)
[ 02] (in Japanese), 5, pp. 196-201 (2002)

Toshihiro Kamishima: born in 1968; currently with the National Institute of Advanced Industrial Science and Technology (AIST); a member of the ACM.