Automatic Domain-Specific Term Extraction and Its Application in Text Classification

ACTA ELECTRONICA SINICA, Vol. 35, No. 2, Feb. 2007

LIU Tao, LIU Bing-quan, XU Zhi-ming, WANG Xiao-long
(School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China)

Abstract: A statistical method based on information entropy is proposed for extracting domain-specific terms from domain-comparative corpora. It takes into account the distribution of a candidate word among domains and within a given domain. A normalization step is added to the extraction process to cope with unbalanced corpora. The proposed method characterizes the attributes of domain-specific terms more precisely and more effectively than previous term extraction approaches. The extracted domain-specific terms are used as the feature space in text classification. Experimental results indicate that this achieves better performance than traditional feature selection methods.

Key words: domain-specific term; information entropy; normalization; text classification; feature selection

CLC number: TP391.2    Document code: A    Article ID: 0372-2112 (2007) 02-0328-05

Received: 2005-10-21; revised: 2006-11-23. Supported by grant No. 60673037.

1 Introduction

Domain-specific terms [1] support applications such as domain ontology construction [2,3], automatic abstracting [4], and language modeling [5].

Existing extraction methods include the approaches of [5,6] and those of [2,3]. Jianfeng Gao et al. [5] use clustering, and Henri Avancini et al. [6] use term categorization. A bootstrapping method is described in [7]. Statistical measures include TFIDF [3], KFIDF [8], and DR + DC [2]. KFIDF is a variant of TFIDF, while DR + DC combines domain relevance (DR) with domain consensus (DC).

This paper builds on the authors' earlier work [9].

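2 Domain-Specific Term Extraction

2.1 Notation

$m$: the number of domains
$D_i$ ($1 \le i \le m$): the $i$-th domain
$n_i$ ($1 \le i \le m$): the number of documents in $D_i$
$P(D_i \mid W)$: the probability of domain $D_i$ given word $W$
$d_{ij}$ ($1 \le j \le n_i$): the $j$-th document of $D_i$
$l_{ij}$: the length of $d_{ij}$
$L_i$: the length of $D_i$
$WS_{D_i}$: the set of domain-specific terms of $D_i$
$WS_{rel}$: the set of domain-relevant words
$WS_{irre}$: the set of domain-irrelevant words
$WS$: the set of all words

These sets satisfy $WS_{rel} \cup WS_{irre} = WS$, $WS_{rel} \cap WS_{irre} = \emptyset$, and $WS_{D_i} \subseteq WS_{rel}$ for every domain $D_i$.

2.2 Distribution among domains

Following the definition of entropy in [13], the corpus distribution (CD) of a word $W$ is

$$CD(W) = -\sum_{i=1}^{m} P(D_i \mid W) \log P(D_i \mid W) \tag{1}$$

$CD(W)$ measures how $W$ is spread over the domains: the smaller $CD(W)$, the more $W$ is concentrated in a few domains.

Corpora are often unbalanced across domains, as in the 2003 863 evaluation corpus, and $P(D_i \mid W)$ then favors the larger domains. To compensate, $P(D_i \mid W)$ is normalized by the domain length $L_i$, which gives the normalized corpus distribution (NCD):

$$NCD(W) = -\sum_{i=1}^{m} P'(D_i \mid W) \log P'(D_i \mid W) \tag{2}$$

$$P'(D_i \mid W) = \frac{P(D_i \mid W) / L_i}{\sum_{j=1}^{m} P(D_j \mid W) / L_j} \tag{3}$$

2.3 Distribution within a domain

NCD describes only how $W$ is distributed among domains. Following [2], the normalized domain distribution $NDD(W, D_i)$ describes how $W$ is distributed over the documents of $D_i$:

$$NDD(W, D_i) = -\sum_{j=1}^{n_i} P'(d_{ij} \mid W) \log P'(d_{ij} \mid W) \tag{4}$$

$$P'(d_{ij} \mid W) = \frac{P(d_{ij} \mid W) / l_{ij}}{\sum_{k=1}^{n_i} P(d_{ik} \mid W) / l_{ik}} \tag{5}$$

The larger $NDD(W, D_i)$, the more evenly $W$ is spread over the documents of $D_i$.

2.4 Extraction algorithm

Input: corpus $D$, NCD threshold $\theta_{NCD}$, NDD threshold $\theta_{NDD}$
Output: the domain-specific term sets $WS_{D_i}$

(1) for i = 1 to m do
(2)   for j = 1 to n_i do
(3)     segment d_ij, add each word W to WS, and record l_ij
(4)   end for
(5)   compute L_i
(6) end for
(7) for all W in WS do
(8)   compute NCD(W)
(9)   if NCD(W) < θ_NCD then
(10)    x = argmax_x P'(D_x | W)
(11)    compute NDD(W, D_x)
(12)    if NDD(W, D_x) > θ_NDD then
(13)      add W to WS_{D_x}
(14) end for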
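The steps of Section 2.4 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions that the text does not confirm: $P(D_i \mid W)$ and $P(d_{ij} \mid W)$ are estimated from raw occurrence counts, logarithms are base 2, documents are already tokenized and non-empty, and a word is kept when its NCD is below one threshold and its NDD is above the other. The names `normalized_entropy`, `extract_terms`, `ncd_max`, and `ndd_min` are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(counts, lengths):
    """Entropy of the length-normalized distribution of one word.

    counts[k] is how often the word occurs in unit k (a domain or a document)
    and lengths[k] is that unit's length in tokens. Returns (entropy, p_norm),
    following equations (2)-(3) for domains and (4)-(5) for documents.
    Assumption: probabilities are estimated from counts; log base is 2.
    """
    total = sum(counts)
    if total == 0:
        return 0.0, [0.0] * len(counts)
    p = [c / total for c in counts]                    # P(unit | W)
    scaled = [pk / lk for pk, lk in zip(p, lengths)]   # divide by unit length
    z = sum(scaled)
    p_norm = [s / z for s in scaled]                   # normalized probability
    entropy = -sum(q * math.log2(q) for q in p_norm if q > 0)
    return entropy, p_norm


def extract_terms(corpus, ncd_max, ndd_min):
    """corpus maps a domain name to a list of documents (lists of tokens)."""
    domains = list(corpus)
    doc_counts = {d: [Counter(doc) for doc in corpus[d]] for d in domains}
    doc_lens = {d: [len(doc) for doc in corpus[d]] for d in domains}
    dom_counts = {d: sum(doc_counts[d], Counter()) for d in domains}
    dom_lens = [sum(doc_lens[d]) for d in domains]

    vocab = set().union(*dom_counts.values())
    terms = {d: set() for d in domains}
    for w in vocab:
        ncd, p_norm = normalized_entropy(
            [dom_counts[d][w] for d in domains], dom_lens)
        if ncd < ncd_max:                              # concentrated in few domains
            best = max(range(len(domains)), key=p_norm.__getitem__)
            x = domains[best]
            ndd, _ = normalized_entropy(
                [c[w] for c in doc_counts[x]], doc_lens[x])
            if ndd > ndd_min:                          # assumed: evenly spread in D_x
                terms[x].add(w)
    return terms
```

On a toy corpus with two three-document domains, a word that occurs once in every document of one domain and nowhere else passes both tests, while a word shared equally by both domains fails the NCD test and a word found in only one document fails the NDD test.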

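3 Application in Text Classification

The classifier follows [10] and uses Okapi. Commonly used feature selection methods [11] include TFIDF, expected cross entropy (ECE), the $\chi^2$ statistic (CHI), mutual information (MI), information gain (IG), weight of evidence (WE), and document frequency (DF).

Here the extracted domain-specific terms form the feature space. A term $W$ of domain $D_x$ is ranked by the score $RS(W, D_x)$, in which NCD and NDD are scaled by $\log m$ and $\log n_x$, where $m$ is the number of domains and $n_x$ is the number of documents in $D_x$. The weight $\lambda$ is set to 0.5.

$$RS(W, D_x) = -\lambda \, \frac{NCD(W)}{\log m} + (1 - \lambda) \, \frac{NDD(W, D_x)}{\log n_x} \tag{6}$$

4 Experiments

4.1 Corpus

The experiments use the 2003 and 2004 863 evaluation corpora, which cover 36 categories.

4.2 Term extraction results

On the 2003 corpus the two thresholds are set to 2.5 and 0.5. Tables 1 and 2 list extraction results for domains R and TS (table bodies not recovered).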
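Returning to the ranking score of equation (6) in Section 3, a small Python sketch may make the scaling explicit. It assumes base-2 logarithms (the base must match the one used to compute NCD and NDD) and requires `m > 1` and `n_x > 1`; the function name `ranking_score` and the parameter name `weight` are hypothetical.

```python
import math


def ranking_score(ncd, ndd, m, n_x, weight=0.5):
    """RS(W, D_x) as in equation (6).

    NCD is divided by log m and NDD by log n_x, the largest values each
    entropy can take, so both terms lie in [0, 1] before weighting.
    A low NCD and a high NDD both raise the score.
    """
    return -weight * ncd / math.log2(m) + (1 - weight) * ndd / math.log2(n_x)
```

With the default weight of 0.5, the score ranges from -0.5 (a word spread uniformly over all domains and confined to one document) to 0.5 (a word found in a single domain and spread uniformly over its documents).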

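Table 3 covers domains D, A, and TS with respect to NCD and NDD (table body not recovered).

4.3 Comparison with DR + DC

The NCD + NDD method is compared with DR + DC [2]. The DR + DC thresholds are (0.35, 0.49), and DR + DC terms are weighted with the term weight TW of [12] (formula (7)). Table 4 gives the per-domain counts. For domain B, for example, DR + DC extracts 1776 terms.

Table 4  Terms extracted by DR + DC and by NCD + NDD

Domain              DR + DC    NCD + NDD
B         88830     1776       881
E         41030     621        677
H         38666     638        741
R         18182     444        571
TD        27925     318        162
TS        21792     257        358

4.4 Text classification results

Table 5 compares the feature selection methods. The F1 of NCD + NDD is 4.9 and 3.8 points higher than that of CHI and DR + DC, respectively.

Table 5  Text classification results

Method                            F1
MI          0.419      0.409      0.414
DF          0.556      0.529      0.542
WE          0.564      0.541      0.552
IG          0.559      0.546      0.552
TFIDF       0.596      0.572      0.584
ECE         0.617      0.597      0.607
KFIDF       0.616      0.601      0.608
CHI         0.633      0.602      0.617
DR + DC     0.631      0.626      0.628
NCD + NDD   0.663      0.669      0.666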
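The headers of the first two numeric columns of Table 5 did not survive, but the F1 column can be checked against them. The sketch below assumes those two columns are precision and recall (in either order, since F1 is symmetric) and verifies that each reported F1 equals their harmonic mean to within rounding. The names `f1_score` and `TABLE_5` are hypothetical; the numbers are copied from Table 5.

```python
def f1_score(a, b):
    """Harmonic mean of two rates; 0.0 when both are zero."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


# (first column, second column, reported F1), copied from Table 5
TABLE_5 = {
    "MI": (0.419, 0.409, 0.414),
    "DF": (0.556, 0.529, 0.542),
    "WE": (0.564, 0.541, 0.552),
    "IG": (0.559, 0.546, 0.552),
    "TFIDF": (0.596, 0.572, 0.584),
    "ECE": (0.617, 0.597, 0.607),
    "KFIDF": (0.616, 0.601, 0.608),
    "CHI": (0.633, 0.602, 0.617),
    "DR + DC": (0.631, 0.626, 0.628),
    "NCD + NDD": (0.663, 0.669, 0.666),
}

# Largest gap between the recomputed and the reported F1 over all rows
max_gap = max(abs(f1_score(a, b) - f) for a, b, f in TABLE_5.values())

# Gains of NCD + NDD over CHI and over DR + DC, in F1 points
gain_over_chi = round((TABLE_5["NCD + NDD"][2] - TABLE_5["CHI"][2]) * 100, 1)
gain_over_drdc = round((TABLE_5["NCD + NDD"][2] - TABLE_5["DR + DC"][2]) * 100, 1)
```

Every row agrees to three decimal places, which supports reading the two unlabeled columns as precision and recall, and the two gains come out as the 4.9 and 3.8 points quoted in Section 4.4.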

5 Conclusion

The NCD + NDD method extracts domain-specific terms from domain-comparative corpora. Used as the feature space for text classification, the extracted terms outperform the traditional feature selection methods compared.

References:

[1] Boguraev B, Kennedy C. Applications of term identification technology: domain description and content characterisation [J]. Natural Language Engineering, 1999, 5(1): 17-44.
[2] Velardi P, Missikoff M, et al. Identification of relevant terms to support the construction of domain ontologies [A]. Proceedings of the Workshop on Human Language Technologies and Knowledge Management [C]. France: ACM Press, 2001. 1-8.
[3] Maedche A, Staab S. Ontology learning. Handbook on Ontologies in Information Systems [M]. Heidelberg: Springer-Verlag, 2004. 173-190.
[4] Oakes M P, Paice C. Term extraction for automatic abstracting. Recent Advances in Computational Terminology [M]. Amsterdam/Philadelphia: John Benjamins Publishing Company, 2001. 353-370.
[5] Gao J, Goodman J, et al. The use of clustering techniques for language modeling - application to Asian language [J]. Computational Linguistics and Chinese Language Processing, 2001, 6(1): 27-60.
[6] Avancini H, Lavelli A, et al. Expanding domain-specific lexicons by term categorization [A]. Proceedings of 18th ACM Symposium on Applied Computing [C]. US: ACM Press, 2003. 793-797.
[7] ... Bootstrapping ... [A]. ... [C]. ..., 2003. 67-72.
[8] Xu F, Kurz D, et al. A domain adaptive approach to automatic acquisition of domain relevant terms and their relations with bootstrapping [A]. Proceedings of the 3rd International Conference on Language Resources and Evaluation [C]. Spain: LREC Press, 2002. 224-230.
[9] Liu T, Wang X L, et al. Domain-specific term extraction and its application in text classification [A]. Proceedings of 8th Joint Conference on Information Sciences [C]. USA: World Scientific Press, 2005. 1481-1484.
[10] Wang Q, Wang X L, et al. A study of semi-discrete matrix decomposition for LSI in automated text categorization [A]. Proceeding of 1st International Joint Conference on Natural Language Processing [C]. China: Springer-Verlag, 2004. 606-615.
[11] Yang Y, Pedersen J O. A comparative study on feature selection in text categorization [A]. Proceeding of 14th International Conference on Machine Learning [C]. US: AAAI Press, 1997. 412-420.
[12] Navigli R, Velardi P. Learning domain ontologies from document warehouses and dedicated web sites [J]. Computational Linguistics, 2004, 30(2): 151-179.
[13] Duda R O, et al. Pattern classification (Second Edition) [M]. Beijing: China Machine Press, 2003. 321-322.

About the authors: the first author was born in 1981 (E-mail: tliu@insun.hit.edu.cn); the others were born in 1970 and 1967.