ACTA ELECTRONICA SINICA, Vol. 35, No. 2, Feb. 2007

Automatic Domain-Specific Term Extraction and Its Application in Text Classification

LIU Tao, LIU Bing-quan, XU Zhi-ming, WANG Xiao-long
(School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China)

Abstract: A statistical method based on information entropy is proposed for extracting domain-specific terms from domain comparative corpora. It takes into account the distribution of a candidate word both among domains and within a given domain. A normalization step is added to the extraction process to cope with unbalanced corpora. The proposed method characterizes the attributes of domain-specific terms more precisely and more effectively than previous term extraction approaches. The extracted domain-specific terms are then used as the feature space in text classification. Experimental results indicate that this achieves better performance than traditional feature selection methods.

Key words: domain-specific term; information entropy; normalization; text classification; feature selection

CLC number: TP391.2; Document code: A; Article ID: 0372-2112 (2007) 02-0328-05

Received: 2005-10-21; revised: 2006-11-23. Supported by grant No. 60673037.

1 Introduction

Term identification [1] supports applications such as domain ontology construction [2,3], automatic abstracting [4] and language modeling [5]. Related work on term extraction includes [4], [5,6] and [2,3], among them Jianfeng Gao et al. [5] and Henri Avancini et al. [6]; [7] uses bootstrapping. Statistical measures for term extraction include TFIDF [3], KFIDF [8] and DR + DC [2]. […] KFIDF […] TFIDF […] DR […] DC […] the authors' earlier work [9] […].
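2 Domain-specific term extraction based on information entropy

2.1 Notation

$m$: number of domains
$D_i$ ($1 \le i \le m$): the $i$-th domain
$n_i$ ($1 \le i \le m$): number of documents in $D_i$
$P(D_i \mid W)$: probability of domain $D_i$ given word $W$
$d_{ij}$ ($1 \le j \le n_i$): the $j$-th document of $D_i$
$l_{ij}$: length of $d_{ij}$
$L_i$: length of $D_i$
$WS_{D_i}$: set of domain-specific terms of $D_i$
$WS_{rel}$: set of domain-relevant words
$WS_{irre}$: set of domain-irrelevant words
$WS$: set of all words

These sets satisfy $WS_{rel} \cup WS_{irre} = WS$, $WS_{rel} \cap WS_{irre} = \emptyset$, and $WS_{D_i} \subseteq WS_{rel}$ for each of $WS_{D_1}, WS_{D_2}, \dots, WS_{D_m}$.

2.2 Corpus distribution

Using the entropy measure of [13], the corpus distribution (CD) of a word $W$ is defined as

$$CD(W) = -\sum_{i=1}^{m} P(D_i \mid W) \log P(D_i \mid W) \tag{1}$$

$CD(W)$ is the entropy of the distribution of $W$ over the $m$ domains: a word concentrated in a few domains has a small $CD(W)$, and a word spread evenly over all domains has a large one. When the domain corpora are unbalanced, using $P(D_i \mid W)$ directly (cf. [2]) favours the larger domains and distorts $CD(W)$; an example with domains A and D of the 2003 863 corpus illustrates the effect. Dividing each probability by the domain length before renormalizing gives the normalized corpus distribution (NCD) of $W$:

$$NCD(W) = -\sum_{i=1}^{m} \bar{P}(D_i \mid W) \log \bar{P}(D_i \mid W) \tag{2}$$

$$\bar{P}(D_i \mid W) = \frac{P(D_i \mid W) / L_i}{\sum_{j=1}^{m} P(D_j \mid W) / L_j} \tag{3}$$

where $\bar{P}$ denotes the length-normalized probability.

2.3 Domain distribution

Following [2], the normalized domain distribution $NDD(W, D_i)$ describes how $W$ is distributed over the documents of $D_i$:

$$NDD(W, D_i) = -\sum_{j=1}^{n_i} \bar{P}(d_{ij} \mid W) \log \bar{P}(d_{ij} \mid W) \tag{4}$$

$$\bar{P}(d_{ij} \mid W) = \frac{P(d_{ij} \mid W) / l_{ij}}{\sum_{k=1}^{n_i} P(d_{ik} \mid W) / l_{ik}} \tag{5}$$

$NDD(W, D_i)$ is large when $W$ occurs evenly across the documents of $D_i$ and small when it is confined to a few of them.

2.4 Extraction algorithm

Input: corpus $D$, NCD threshold, NDD threshold
Output: $WS_{D_i}$ ($1 \le i \le m$)

for i = 1 to m do
    for j = 1 to n_i do
        add every word W of d_ij to WS
    end for
    compute L_i
end for
for all W in WS do
    compute NCD(W)
    if NCD(W) < NCD threshold then
        x = arg max_x P(D_x | W)
        compute NDD(W, D_x)
        if NDD(W, D_x) > NDD threshold then
            add W to WS_{D_x}
        end if
    end if
end for

A word is therefore kept only if it is concentrated in few domains (low NCD) and spread evenly over the documents of its dominant domain (high NDD).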
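As a rough illustration of Eqs. (1)-(5) and the extraction loop, the Python sketch below computes NCD and NDD on tokenized documents. Several points are assumptions rather than details taken from the paper: $P(D_i \mid W)$ and $P(d_{ij} \mid W)$ are estimated as relative frequencies of the occurrences of $W$, lengths are counted in tokens, the logarithm is natural, and the NDD test keeps words whose NDD exceeds its threshold (in line with the positive sign of NDD in Eq. (6)). The names `entropy`, `length_normalize`, `extract_terms`, `ncd_max` and `ndd_min` are hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy with natural log; terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def length_normalize(counts, lengths):
    """Eqs. (3)/(5): divide each share of W by the unit length, rescale to sum 1.

    Assumption: P(unit | W) is estimated as counts[k] / sum(counts).
    """
    total = sum(counts)
    rates = [(c / total) / n for c, n in zip(counts, lengths)]
    z = sum(rates)
    return [r / z for r in rates]


def extract_terms(corpora, ncd_max, ndd_min):
    """corpora: {domain: [doc, ...]}; each domain has at least one document
    and each doc is a non-empty list of tokens.

    Returns {domain: set of words} (hypothetical interface, not the paper's).
    """
    domains = list(corpora)
    doc_counts = {d: [Counter(doc) for doc in corpora[d]] for d in domains}
    doc_lens = {d: [len(doc) for doc in corpora[d]] for d in domains}  # l_ij
    dom_counts = {d: sum(doc_counts[d], Counter()) for d in domains}
    dom_lens = [sum(doc_lens[d]) for d in domains]                     # L_i
    vocab = set().union(*dom_counts.values())                          # WS

    terms = {d: set() for d in domains}
    for w in vocab:
        per_domain = [dom_counts[d][w] for d in domains]
        ncd = entropy(length_normalize(per_domain, dom_lens))          # Eq. (2)
        if ncd >= ncd_max:
            continue  # W is spread over too many domains
        # Dominant domain: largest unnormalized share of W (an assumption).
        x = domains[per_domain.index(max(per_domain))]
        per_doc = [c[w] for c in doc_counts[x]]
        ndd = entropy(length_normalize(per_doc, doc_lens[x]))          # Eq. (4)
        if ndd > ndd_min:  # assumed direction: evenly spread inside D_x
            terms[x].add(w)
    return terms


if __name__ == "__main__":
    toy = {
        "sports": [["goal", "the", "team"], ["goal", "the", "match"]],
        "finance": [["stock", "the", "bank"], ["stock", "the", "fund"]],
    }
    print(extract_terms(toy, ncd_max=0.5, ndd_min=0.5))
```

In the toy corpus, "the" is rejected by the NCD test because it is shared by both domains, and "team" is rejected by the NDD test because it occurs in a single document. The thresholds used here are illustrative and would need tuning on real data.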
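3 Application in text classification

The classification setup follows [10], with a K-nearest-neighbour classifier and the Okapi formula […]. Widely used feature selection methods [11] include term frequency-inverse document frequency (TFIDF), expected cross entropy (ECE), the $\chi^2$ statistic (CHI), mutual information (MI), information gain (IG), weight of evidence (WE) and document frequency (DF). […] TFIDF […] DF […].

Here the extracted domain-specific terms form the feature space. Each word $W$ is scored for its domain $D_x$ by $RS(W, D_x)$, where $m$ is the number of domains, $n_x$ is the number of documents in $D_x$, and a weighting coefficient, written $\alpha$ here, balances NCD against NDD and is set to 0.5:

$$RS(W, D_x) = -\alpha \frac{NCD(W)}{\log m} + (1 - \alpha) \frac{NDD(W, D_x)}{\log n_x} \tag{6}$$

Dividing by $\log m$ and $\log n_x$, the maximum possible values of the two entropies, scales both terms to $[0, 1]$.

4 Experiments

4.1 Corpus

The experiments use the 2003 and 2004 863 corpora […] T […] Z […] 36 […] 100 […].

4.2 Term extraction

On the 2003 corpus the two thresholds are set to 2.5 and 0.5 […]. Tables 1 and 2 […] domains R and TS […] $P(D \mid W)$ […] 0.5 […] NCD […].

[Tables 1 and 2: domains R and TS]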
[Table 3: domains D, A and TS]

Table 3 […] NCD and NDD […] 40 […]. […] R […] TS […] 100 […].

4.3 Comparison with DR + DC

NCD + NDD is compared with DR + DC [2]. DR + DC is run with the parameter values (0.35, 0.49) […]. Table 4 lists, for six domains, the number of terms each method extracts: in domain B, DR + DC returns 1776 terms, against 881 for NCD + NDD. […] Eq. (7) [12] […] TW […] DR + DC […] 200 […].

Table 4
Domain   […]     DR + DC   NCD + NDD
B        88830   1776      881
E        41030   621       677
H        38666   638       741
R        18182   444       571
TD       27925   318       162
TS       21792   257       358

4.4 Text classification

Table 5 […] Eq. (7) […] F1. As Table 5 shows, NCD + NDD raises F1 by 4.9 and 3.8 points over CHI and DR + DC, respectively.

Table 5
Method      […]     […]     F1
MI          0.419   0.409   0.414
DF          0.556   0.529   0.542
WE          0.564   0.541   0.552
IG          0.559   0.546   0.552
TFIDF       0.596   0.572   0.584
ECE         0.617   0.597   0.607
KFIDF       0.616   0.601   0.608
CHI         0.633   0.602   0.617
DR + DC     0.631   0.626   0.628
NCD + NDD   0.663   0.669   0.666
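Table 5 lists two values before each F1 score. Assuming these are precision and recall (an assumption; the order does not matter, since F1 is symmetric), F1 is their harmonic mean. The short sketch below recomputes it for every row and measures the gap to the reported figure; the names `f1_score`, `TABLE_5` and `max_f1_gap` are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from Table 5: (method, first value, second value, reported F1).
TABLE_5 = [
    ("MI", 0.419, 0.409, 0.414),
    ("DF", 0.556, 0.529, 0.542),
    ("WE", 0.564, 0.541, 0.552),
    ("IG", 0.559, 0.546, 0.552),
    ("TFIDF", 0.596, 0.572, 0.584),
    ("ECE", 0.617, 0.597, 0.607),
    ("KFIDF", 0.616, 0.601, 0.608),
    ("CHI", 0.633, 0.602, 0.617),
    ("DR + DC", 0.631, 0.626, 0.628),
    ("NCD + NDD", 0.663, 0.669, 0.666),
]


def max_f1_gap(rows):
    """Largest absolute difference between recomputed and reported F1."""
    return max(abs(f1_score(a, b) - f1) for _, a, b, f1 in rows)


if __name__ == "__main__":
    print(max_f1_gap(TABLE_5))
```

For every row the recomputed value agrees with the reported F1 to within rounding at three decimals, which supports reading the two leading columns as precision and recall.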
5 Conclusion

[…] NCD + NDD […].

References

[1] Boguraev B, Kennedy C. Applications of term identification technology: domain description and content characterisation [J]. Natural Language Engineering, 1999, 5(1): 17-44.
[2] Velardi P, Missikoff M, et al. Identification of relevant terms to support the construction of domain ontologies [A]. Proceedings of the Workshop on Human Language Technologies and Knowledge Management [C]. France: ACM Press, 2001. 1-8.
[3] Maedche A, Staab S. Ontology learning. Handbook on Ontologies in Information Systems [M]. Heidelberg: Springer-Verlag, 2004. 173-190.
[4] Oakes M P, Paice C. Term extraction for automatic abstracting. Recent Advances in Computational Terminology [M]. Amsterdam/Philadelphia: John Benjamins Publishing Company, 2001. 353-370.
[5] Gao J, Goodman J, et al. The use of clustering techniques for language modeling-application to Asian language [J]. Computational Linguistics and Chinese Language Processing, 2001, 6(1): 27-60.
[6] Avancini H, Lavelli A, et al. Expanding domain-specific lexicons by term categorization [A]. Proceedings of 18th ACM Symposium on Applied Computing [C]. US: ACM Press, 2003. 793-797.
[7] […]. […] Bootstrapping […] [A]. […] [C]. […], 2003. 67-72.
[8] Xu F, Kurz D, et al. A domain adaptive approach to automatic acquisition of domain relevant terms and their relations with bootstrapping [A]. Proceedings of the 3rd International Conference on Language Resources and Evaluation [C]. Spain: LREC Press, 2002. 224-230.
[9] Liu T, Wang X L, et al. Domain-specific term extraction and its application in text classification [A]. Proceedings of 8th Joint Conference on Information Sciences [C]. USA: World Scientific Press, 2005. 1481-1484.
[10] Wang Q, Wang X L, et al. A study of semi-discrete matrix decomposition for LSI in automated text categorization [A]. Proceeding of 1st International Joint Conference on Natural Language Processing [C]. China: Springer-Verlag, 2004. 606-615.
[11] Yang Y, Pedersen J O. A comparative study on feature selection in text categorization [A]. Proceeding of 14th International Conference on Machine Learning [C]. US: AAAI Press, 1997. 412-420.
[12] Navigli R, Velardi P. Learning domain ontologies from document warehouses and dedicated web sites [J]. Computational Linguistics, 2004, 30(2): 151-179.
[13] Duda R O, et al. Pattern classification (Second Edition) [M]. Beijing: China Machine Press, 2003. 321-322.

About the authors:
LIU Tao, born in 1981 […]. E-mail: tliu@insun.hit.edu.cn
[…], born in 1970 […] Web […].
[…], born in 1967 […].