An Effective and Efficient Algorithm for Text Categorization

Shi Zhi-wei 1,2, Liu Tao 2, Wu Gong-yi 2
(1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300074; 2. Department of Information Science, Nankai University, Tianjin 300071)
E-mail: shi.zw@mail.nankai.edu.cn

Abstract: In recent years, spam has become a severe and disruptive problem. One effective solution is a content-based email filter built on text categorization (TC). Classical TC methods often stress effectiveness over efficiency, so they cannot fully serve email filtering, which requires both. In this paper we discuss two popular algorithms for text categorization: the Vector Space Model (VSM) and k-Nearest Neighbor (kNN). The former is simple and fast, but its precision is often unsatisfactory. The latter spends much more time determining the class label of a query document, but often achieves better categorization performance. We propose a new algorithm, Hybrid of VSM and kNN, that combines the strengths of the two. We also evaluate its effectiveness experimentally. The results show that the new algorithm achieves performance competitive with (or even better than) the well-known kNN algorithm at much lower computational cost.

Key words: text categorization; VSM; kNN

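Text categorization has been studied for Web pages [1], electronic mail [2], and net news [3]. Each document is represented as a vector in an n-dimensional term space,

$$d_i = (w_{i1}, w_{i2}, \ldots, w_{in}),$$

where the weights $w_{ik}$ are tf-idf term weights [9], [10]. The similarity of two documents is the cosine of the angle between their vectors:

$$\mathrm{Sim}(d_i, d_j) = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{\sqrt{\left(\sum_{k=1}^{n} w_{ik}^2\right)\left(\sum_{k=1}^{n} w_{jk}^2\right)}} \qquad (1)$$

Let $C = \{c_i\}_{i=1}^{m}$ be the set of classes and $v(c_i)$ the centroid vector of class $c_i$. The Vector Space Model (VSM), introduced by Salton [7] and implemented in the Smart system [8], computes $\mathrm{Sim}(d_q, v(c_i))$ for $i = 1, \ldots, m$ and assigns the query document $d_q$ to the class with the most similar centroid:

$$\hat{c}(d_q) = \arg\max_{c_i \in C} \mathrm{Sim}(d_q, v(c_i))$$

With N training documents, kNN compares a query against every training document at a cost of O(N), whereas VSM compares it against only the m centroids at a cost of O(m). Index structures such as the KD-tree, R-tree, and R*-tree have been proposed to speed up nearest neighbor search [6]. The algorithm proposed in this paper is called Hybrid of VSM and kNN.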
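The cosine similarity of Eq. (1) and the centroid decision rule of VSM can be illustrated with a minimal sketch. The function names (`cosine_sim`, `centroids`, `vsm_classify`) are hypothetical, and the centroid is assumed here to be the plain mean of a class's document vectors, which the surviving text does not confirm.

```python
import math
from collections import defaultdict


def cosine_sim(u, v):
    """Cosine similarity between two equal-length weight vectors (Eq. 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def centroids(training_examples):
    """Return {label: centroid}; the centroid is assumed to be the class mean."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in training_examples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def vsm_classify(d_q, cents):
    """Assign d_q to the class whose centroid is most similar to it."""
    return max(cents, key=lambda c: cosine_sim(d_q, cents[c]))
```

Classification touches only the m centroids, which is where the O(m) cost per query comes from.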

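For kNN, let $C = \{c_i\}_{i=1}^{m}$ be the set of classes and $c(d)$ the class label of document $d$. Each training document is stored as a pair $(d, c(d))$ in training_examples. Given a query document $d_q$, let $d_1, \ldots, d_k$ be the k documents in training_examples most similar to $d_q$. The query is assigned the class that occurs most often among these neighbors:

$$\hat{c}(d_q) = \arg\max_{c_i \in C} \sum_{j=1}^{k} \delta(c_i, c(d_j))$$

where $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ otherwise.

Approximate methods have been proposed to reduce the cost of nearest neighbor search, including the hashing scheme of Gionis et al. [12] and the ε-approximate nearest neighbor algorithm of Arya et al. [13].

The proposed algorithm, Hybrid of VSM and kNN, targets the O(N) cost that kNN pays for each query. kNN serves as the baseline.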
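The kNN voting rule can be sketched in a few lines. This is an illustrative version only: the names `knn_classify` and `cosine_sim` are hypothetical, cosine similarity is assumed as the neighbor ranking measure, and ties between classes go to whichever tied class appears first in the ranked neighbor list, a choice made here rather than taken from the source.

```python
import math
from collections import Counter


def cosine_sim(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_classify(d_q, training_examples, k):
    """Majority vote among the k training documents most similar to d_q.

    training_examples is a list of (vector, label) pairs.
    """
    # Rank every training document by similarity to the query: O(N) comparisons.
    ranked = sorted(
        training_examples,
        key=lambda ex: cosine_sim(d_q, ex[0]),
        reverse=True,
    )
    # Each neighbor contributes delta(c_i, c(d_j)) = 1 to its own class.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The full scan over `training_examples` is the step that makes kNN expensive for large N.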

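The hybrid algorithm associates each class $c_i$ with an Area $B_i$ around its centroid $v(c_i)$. Let $D$ be the training set and $c(d)$ the class of document $d$. For each class, MinSim is the highest similarity between the class centroid and any training document that belongs to a different class:

$$\mathrm{MinSim}_i = \max_{c(d) \neq c_i,\; d \in D} \mathrm{Sim}(d, v(c_i)), \quad i = 1, \ldots, m \qquad (2)$$

The Area of class $c_i$ is the set of points that are more similar to the centroid than every such document:

$$B_i = \{ x \in \mathbb{R}^n \mid \mathrm{Sim}(x, v(c_i)) > \mathrm{MinSim}_i \}, \quad i = 1, \ldots, m \qquad (3)$$

A query document $d_q$ that falls inside an Area $\{B_i\}_{i=1}^{m}$ is classified by VSM; otherwise it is classified by kNN.

For t query documents, with t >> m, the costs compare as follows. VSM needs O(N) time to build the centroids and O(mt) time to classify. kNN needs no training and O(Nt) time to classify. The hybrid needs O(mN) time to compute the Areas and O((m+R)t) time to classify: testing a query against the m Areas costs O(m), and a query handled by kNN costs O(m+R). Relative to kNN, the speedup per query is about N/(m+R).

The experiments use two data sets: Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/) and 20 Newsgroups (http://www.ai.mit.edu/people/jrennie/20newsgroups/).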
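One possible reading of Eqs. (2) and (3) is sketched below. Several points are assumptions rather than the authors' method: the names `min_sims` and `hybrid_classify` are hypothetical, a query inside more than one Area goes to the most similar centroid, and the kNN fallback scans all training examples because the surviving text does not say how the R candidate documents are selected.

```python
import math
from collections import Counter


def cosine_sim(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def min_sims(training_examples, cents):
    """MinSim_i (Eq. 2): best similarity to centroid i among other-class docs."""
    return {
        c: max(
            (cosine_sim(vec, v_c) for vec, label in training_examples if label != c),
            default=-1.0,  # single-class corner case: every point is inside the Area
        )
        for c, v_c in cents.items()
    }


def hybrid_classify(d_q, training_examples, cents, thresholds, k):
    """Return (label, method); method is "vsm" inside an Area, else "knn"."""
    sims = {c: cosine_sim(d_q, v_c) for c, v_c in cents.items()}
    # Eq. 3: d_q lies in B_i when Sim(d_q, v(c_i)) > MinSim_i.
    inside = [c for c in sims if sims[c] > thresholds[c]]
    if inside:
        return max(inside, key=sims.get), "vsm"
    # Assumed fallback: plain kNN over the whole training set.
    ranked = sorted(
        training_examples,
        key=lambda ex: cosine_sim(d_q, ex[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], "knn"
```

In this sketch the thresholds are computed once from the training set, so the per-query Area test costs only m similarity computations.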

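The 20 Newsgroups (20NG) data set was collected by Ken Lang and contains 19997 articles from 20 newsgroups; the version used here has 18828 articles. The experiments use the 4881 articles from five groups: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x. Preprocessing follows the Smart system, with stemming and ltc TF-IDF term weighting.

For a given class, let a be the number of documents correctly assigned to the class, b the number incorrectly assigned to it, and c the number incorrectly rejected from it. Recall r and precision p are

$$r = \frac{a}{a + c} \ \text{if } a + c > 0; \ \text{otherwise } r = 1 \qquad (4)$$

$$p = \frac{a}{a + b} \ \text{if } a + b > 0; \ \text{otherwise } p = 1 \qquad (5)$$

and the F-measure combines them:

$$F_\beta(r, p) = \frac{(\beta^2 + 1)\,p\,r}{\beta^2 p + r} \qquad (6)$$

where β sets the relative weight of p and r; β = 1 gives F1. Results are reported with both macro_average and micro_average.

Results on Reuters-21578:

        VSM      kNN      Hybrid
        0.6395   0.8788   0.8934
  F1    0.5336   0.8229   0.8264
        0.5828   0.809    0.8266
  F1    0.5039   0.7988   0.8082

Results on 20 Newsgroups:

        VSM      kNN      Hybrid
        0.4866   0.8390   0.8377
  F1    0.4555   0.8405   0.8392
        0.4887   0.8390   0.8379
  F1    0.4300   0.8388   0.8376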
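The evaluation measures in Eqs. (4) to (6) can be computed as in the sketch below. The function names and the per-class `(a, b, c)` tuple layout are hypothetical. Macro-averaging is taken to mean the mean of per-class F1 scores, and micro-averaging to mean F1 over pooled counts, which are the usual definitions but are not spelled out in the surviving text.

```python
def recall(a, c):
    """Eq. 4: r = a / (a + c), or 1 when the class has no positive documents."""
    return a / (a + c) if a + c > 0 else 1.0


def precision(a, b):
    """Eq. 5: p = a / (a + b), or 1 when nothing was assigned to the class."""
    return a / (a + b) if a + b > 0 else 1.0


def f_beta(r, p, beta=1.0):
    """Eq. 6: F_beta(r, p) = (beta^2 + 1) p r / (beta^2 p + r)."""
    denom = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denom if denom > 0 else 0.0


def macro_f1(counts):
    """Mean of per-class F1; counts maps label -> (a, b, c)."""
    scores = [f_beta(recall(a, c), precision(a, b)) for a, b, c in counts.values()]
    return sum(scores) / len(scores)


def micro_f1(counts):
    """F1 over counts pooled across all classes."""
    a = sum(v[0] for v in counts.values())
    b = sum(v[1] for v in counts.values())
    c = sum(v[2] for v in counts.values())
    return f_beta(recall(a, c), precision(a, b))
```

Macro-averaging gives every class equal weight, while micro-averaging gives every document equal weight, so the two can differ noticeably when class sizes are unbalanced.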

References

[1] Mark Craven, Dipasquo, A., Freitag, A., et al., Learning to extract symbolic knowledge from the World Wide Web, In Proc. of the Fifteenth National Conf. on Artificial Intelligence (AAAI-98), Wisconsin, 1998, pp. 509-516.
[2] David D. Lewis, Knowles K. A., Threading electronic mail: A preliminary study, Information Processing and Management, 33(2), 1997, pp. 209-217.
[3] Ken Lang, News Weeder: Learning to filter net news, In Int. Conf. on Machine Learning (ICML), California, 1995, pp. 331-339.
[4] Yang Yi-ming, An evaluation of statistical approach to text categorization, Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997.
[5] Susan Dumais, J. Platt, D. Heckerman, and M. Sahami, Inductive learning algorithms and representations for text categorization, In Proc. ACM Conf. Information and Knowledge Management (CIKM98), Nov 1998, pp. 148-155.
[6] Roger Weber, H.-J. Schek, S. Blott, A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, Proceedings of the 24th International Conference on Very Large Data Bases, pp. 194-205.
[7] Gerald Salton, Wong, A., and Yang, C. S., A vector space model for automatic indexing, Comm. ACM 18(11), 1975, pp. 613-620.
[8] Gerald Salton, Automatic information organization and retrieval, Addison-Wesley, Reading PA, 1968.
[9] Gerald Salton and Buckley, C., Term weighting approaches in automatic text retrieval, Information Processing & Management, vol. 24, no. 5, 1988, pp. 513-523.
[10] Amit Singhal, AT&T at TREC-6, In The Sixth Text REtrieval Conf (TREC-6), NIST SP 500-240, 1998, pp. 215-225.
[11] Lu Yuchang, Lu Mingyu, Li Fan & Zhou Lizhu, Analysis and construction of word weighting function in VSM, Journal of Computer Research & Development (in Chinese), 2002, 39(10), pp. 1205-1210.
[12] Aristides Gionis, P. Indyk, and R. Motwani, Similarity Search in High Dimensions via Hashing, The VLDB Journal, 1999, pp. 518-529.
[13] Sunil Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu, An optimal algorithm for approximate nearest neighbor searching in fixed dimensions, Journal of the ACM, 45(6), 1998, pp. 891-923.