An Effective and Efficient Algorithm for Text Categorization
Shi Zhi-wei 1,2, Liu Tao 2, Wu Gong-yi 2
(1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300074; 2. Department of Information Science, Nankai University, Tianjin 30007)
shi.zw@mail.nankai.edu.cn

Abstract: In recent years, spam has become one of the severe problems that disturb us. One effective solution to this problem is a content-based email filter using text categorization (TC) methods. Classical TC methods often stress effectiveness rather than efficiency, so they cannot adequately serve the email-filtering application, which requires both. In this paper we discuss two popular algorithms for text categorization: the Vector Space Model (VSM) and k-Nearest Neighbor (kNN). The former is simple and fast, but its precision is often unsatisfying. The latter, on the contrary, spends much time determining the class label of a query document, but often achieves better categorization performance. We propose a new algorithm, a hybrid of VSM and kNN, that combines the strengths of the two. We also perform an experimental evaluation of its effectiveness. The results demonstrate that the new algorithm achieves performance competitive with (or even better than) the well-known kNN algorithm at the cost of much less computation.

Key words: text categorization; VSM; kNN
2 Two Classical Algorithms

2.1 Vector Space Model (VSM)

Text categorization has been applied to tasks such as extracting knowledge from the Web [1], threading electronic mail [2], and filtering netnews [3]; comparative evaluations [4] cover methods ranging from Bayesian classifiers to kNN. In the Vector Space Model, which goes back to Salton [7] and the Smart retrieval system [8], each document d_i is represented as a vector of term weights in an n-dimensional space, d_i = (w_i1, w_i2, ..., w_in), where the weights w_ik are typically computed with a tf-idf scheme [9, 10]. The similarity between two documents is measured by the cosine of the angle between their vectors:

    Sim(d_i, d_j) = (Σ_{k=1..n} w_ik · w_jk) / sqrt((Σ_{k=1..n} w_ik^2) · (Σ_{k=1..n} w_jk^2))    (1)

For categorization with a set of m classes C = {c_i}_{i=1..m}, each class c_i is represented by a centroid vector v(c_i), the average of the training vectors belonging to that class [4, 5]. A query document d_q is then assigned to the class with the most similar centroid:

    ĉ(d_q) = arg max_{c_i ∈ C} Sim(d_q, v(c_i))

Only the m similarities Sim(d_q, v(c_i)), i = 1, ..., m, need to be computed, so classifying a query costs O(m), whereas kNN must compare the query with all N training documents at a cost of O(N). Index structures such as the KD-tree, R-tree, and R*-tree can accelerate nearest-neighbor search, but their performance deteriorates badly in high-dimensional spaces such as those produced by text [6].
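Equation (1) and the centroid rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation; the helper names cosine_sim, centroid, and classify_vsm are our own, and documents are assumed to be dense lists of term weights:

```python
import math

def cosine_sim(d1, d2):
    # Equation (1): cosine of the angle between two term-weight vectors.
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    return dot / norm if norm else 0.0

def centroid(vectors):
    # Class centroid v(c_i): component-wise average of the training vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify_vsm(d_q, centroids):
    # Assign d_q to the class whose centroid is most similar: O(m) per query.
    return max(centroids, key=lambda c: cosine_sim(d_q, centroids[c]))
```

With centroids precomputed per class, e.g. `{"spam": centroid(spam_vecs), "ham": centroid(ham_vecs)}`, each query costs only m cosine computations.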
2.2 k Nearest Neighbor (kNN)

kNN [4] keeps the whole training set, a collection training_examples of pairs (d, c(d)), where c(d) denotes the class label of document d and the labels are drawn from C = {c_i}_{i=1..m}. Given a query document d_q, the algorithm retrieves the k training documents d_1, ..., d_k most similar to d_q and assigns the label that receives the most votes among them:

    ĉ_k(d_q) = arg max_{c_i ∈ C} Σ_{j=1..k} δ(c_i, c(d_j))

where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise. Since every training document must be compared with the query, classification costs O(N) per query. Approximate methods can reduce this cost: Gionis et al. [12] use hashing, and Arya et al. [13] give an optimal algorithm for (1+ε)-approximate nearest-neighbor search; both trade some accuracy for speed.

3 Hybrid of VSM and kNN

VSM is fast (O(m) per query) but often less accurate; kNN is accurate but slow (O(N) per query). Our hybrid algorithm takes kNN's effectiveness as the baseline while approaching VSM's efficiency.
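The voting rule above can be sketched as follows; classify_knn is a hypothetical name, and the similarity function is passed in so any measure (e.g. the cosine of Eq. (1)) can be plugged in:

```python
from collections import Counter

def classify_knn(d_q, training_examples, k, sim):
    # training_examples: list of (vector, label) pairs; sim: similarity function.
    # Rank all N training documents by similarity to d_q -- O(N) per query.
    nearest = sorted(training_examples,
                     key=lambda ex: sim(d_q, ex[0]), reverse=True)[:k]
    # Majority vote among the k nearest neighbors (the delta function above).
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The full sort is O(N log N) for clarity; a heap-based top-k selection would bring the per-query cost down to O(N log k).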
For each class c_i, we define an area B_i around the class centroid v(c_i). Let D denote the training set. During training we compute, for every class, the threshold

    MinSim_i = max_{d ∈ D, c(d) ≠ c_i} Sim(d, v(c_i)),  i = 1, ..., m    (2)

that is, the highest similarity to v(c_i) attained by any training document that does not belong to c_i. The area of class c_i is then

    B_i = {x ∈ R^n | Sim(x, v(c_i)) > MinSim_i},  i = 1, ..., m    (3)

A query document d_q that falls inside an area B_i is more similar to v(c_i) than any out-of-class training document ever was, so it can safely be labeled with the fast VSM rule; queries that fall outside every area are passed to kNN. Computing the m centroids costs O(N); computing the thresholds MinSim_i requires comparing each of the N training documents with each of the m centroids, O(mN). At classification time the area test costs O(m), and only the queries that fail it incur the O(N) cost of kNN, so for t queries (t >> m) the expected cost is O((m+R)t), where R denotes the average extra work of the kNN fall-back; the speed-up over pure kNN is roughly N/(m+R). Table 1 summarizes the comparison.

Table 1. Time complexity of the three algorithms (t queries, t >> m)

    Algorithm   Training   Classification
    VSM         O(N)       O(mt)
    kNN         -          O(Nt)
    Hybrid      O(mN)      O((m+R)t)

4 Experimental Evaluation

We evaluate the three algorithms on two benchmark corpora, Reuters-21578 (1) and 20 Newsgroups (2).

(1) http://www.daviddlewis.com/resources/testcollections/
(2) http://www.ai.mit.edu/people/jrennie/20newsgroups/
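The training and classification steps of the hybrid scheme might look as follows in Python. This is a minimal sketch under the same assumptions as before (documents as dense term-weight lists); train_hybrid, classify_hybrid, and the knn_fallback hook are hypothetical names, not the paper's code:

```python
import math
from collections import defaultdict

def cosine_sim(d1, d2):
    # Eq. (1), repeated here so the sketch is self-contained.
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    return dot / norm if norm else 0.0

def train_hybrid(training_examples):
    # training_examples: list of (vector, label) pairs.
    # Centroids cost O(N); the MinSim_i thresholds of Eq. (2) cost O(mN).
    by_class = defaultdict(list)
    for vec, label in training_examples:
        by_class[label].append(vec)
    centroids = {c: [sum(col) / len(vecs) for col in zip(*vecs)]
                 for c, vecs in by_class.items()}
    min_sim = {c: max(cosine_sim(vec, centroids[c])
                      for vec, label in training_examples if label != c)
               for c in centroids}
    return centroids, min_sim

def classify_hybrid(d_q, centroids, min_sim, knn_fallback):
    # O(m) area test of Eq. (3): d_q lies inside B_i iff it is more similar
    # to v(c_i) than any out-of-class training document ever was.
    sims = {c: cosine_sim(d_q, v) for c, v in centroids.items()}
    inside = [c for c in centroids if sims[c] > min_sim[c]]
    if inside:
        return max(inside, key=sims.get)  # fast VSM decision
    return knn_fallback(d_q)              # hard case: defer to kNN, O(N)
```

If a query falls inside several overlapping areas, the sketch breaks the tie with the most similar centroid, which reduces to the plain VSM rule over those classes.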
Performance is measured with precision, recall, and the F-measure. For each category, let a be the number of documents correctly assigned to it, b the number incorrectly assigned, and c the number incorrectly rejected. Then

    r = a / (a + c), if a + c > 0; otherwise r = 1    (4)
    p = a / (a + b), if a + b > 0; otherwise p = 1    (5)
    F_β(r, p) = (β^2 + 1) · p · r / (β^2 · p + r)    (6)

We use β = 1, which weights precision and recall equally. Scores are aggregated in two ways: macro-averaging (the mean of per-category scores) and micro-averaging (scores computed from the global contingency counts).

The 20 Newsgroups corpus (20NG) was collected by Ken Lang [3] and contains 19,997 Usenet articles drawn from 20 newsgroups; after removal of duplicates, 18,828 articles remain. Among the groups are five closely related comp.* groups: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x. Documents were indexed with the Smart system with stemming and ltc (TF-IDF) term weighting.

Tables 2 and 3 compare VSM, kNN, and the hybrid algorithm on the two corpora; for the kNN component, k was set to 10, 30, and 50.

Table 2. Categorization performance on Reuters-21578

                          VSM      kNN      Hybrid
    micro-avg. precision  0.6395   0.8788   0.8934
    micro-avg. F1         0.5336   0.8229   0.8264
    macro-avg. precision  0.5828   0.809    0.8266
    macro-avg. F1         0.5039   0.7988   0.8082

Table 3. Categorization performance on 20 Newsgroups

                          VSM      kNN      Hybrid
    precision             0.4866   0.8390   0.8377
    recall                0.4555   0.8405   0.8392
    micro-avg. F1         0.4887   0.8390   0.8379
    macro-avg. F1         0.4300   0.8388   0.8376

On both corpora the hybrid algorithm matches kNN closely and clearly outperforms VSM, while requiring far fewer similarity computations per query.
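Equations (4)-(6) and the two averaging schemes can be captured directly; prf and average_scores are hypothetical helper names:

```python
def prf(a, b, c, beta=1.0):
    # Eqs. (4)-(6). Per category: a = correctly assigned, b = incorrectly
    # assigned, c = incorrectly rejected documents.
    r = a / (a + c) if (a + c) > 0 else 1.0
    p = a / (a + b) if (a + b) > 0 else 1.0
    denom = beta ** 2 * p + r
    f = (beta ** 2 + 1) * p * r / denom if denom > 0 else 0.0
    return p, r, f

def average_scores(counts, beta=1.0):
    # counts: one (a, b, c) triple per category.
    # Macro-average: mean of the per-category F scores.
    macro_f = sum(prf(a, b, c, beta)[2] for a, b, c in counts) / len(counts)
    # Micro-average: F score of the summed (global) contingency counts.
    totals = [sum(col) for col in zip(*counts)]
    micro_f = prf(*totals, beta=beta)[2]
    return macro_f, micro_f
```

Note that macro-averaging gives every category equal weight, so rare categories influence it as much as frequent ones, while micro-averaging is dominated by the frequent categories.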
5 Conclusion

We have proposed a hybrid of VSM and kNN for text categorization. By classifying the easy queries with the fast VSM rule and reserving kNN for the hard ones, the algorithm achieves categorization performance competitive with (or better than) kNN at a fraction of its computational cost, making it well suited to applications, such as email filtering, that demand both effectiveness and efficiency.

References
[1] Mark Craven, D. DiPasquo, D. Freitag, et al. Learning to extract symbolic knowledge from the World Wide Web. In Proc. of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Wisconsin, 1998, pp. 509-516.
[2] David D. Lewis, K. A. Knowles. Threading electronic mail: A preliminary study. Information Processing and Management, 33(2), 1997, pp. 209-217.
[3] Ken Lang. NewsWeeder: Learning to filter netnews. In Proc. of the International Conference on Machine Learning (ICML), California, 1995, pp. 331-339.
[4] Yang Yi-ming. An evaluation of statistical approaches to text categorization. Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997.
[5] Susan Dumais, J. Platt, D. Heckerman, M. Sahami. Inductive learning algorithms and representations for text categorization. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM-98), Nov. 1998, pp. 148-155.
[6] Roger Weber, H.-J. Schek, S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. of the 24th International Conference on Very Large Data Bases (VLDB-98), 1998, pp. 194-205.
[7] Gerald Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11), 1975, pp. 613-620.
[8] Gerald Salton. Automatic Information Organization and Retrieval. Addison-Wesley, Reading, PA, 1968.
[9] Gerald Salton, C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 1988, pp. 513-523.
[10] Amit Singhal. AT&T at TREC-6. In The Sixth Text REtrieval Conference (TREC-6), NIST SP 500-240, 1998, pp. 215-225.
[11] Lu Yuchang, Lu Mingyu, Li Fan, Zhou Lizhu. Analysis and construction of word weighting function in VSM. Journal of Computer Research & Development (in Chinese), 2002, 39(10), pp. 1205-1210.
[12] Aristides Gionis, P. Indyk, R. Motwani. Similarity search in high dimensions via hashing. In Proc. of the 25th International Conference on Very Large Data Bases (VLDB-99), 1999, pp. 518-529.
[13] Sunil Arya, D. Mount, N. Netanyahu, R. Silverman, A. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6), 1998, pp. 891-923.