Journal of Chinese Information Processing, Vol. 22, No. 5, Sep. 2008
Article ID: 1003-0077(2008)05-0030-09
CLC number: TP391    Document code: A

Learning to Identify Chinese Comparative Sentences

HUANG Xiao-jiang, WAN Xiao-jun, YANG Jian-wu, XIAO Jian-guo
(Institute of Computer Science and Technology of Peking University, Beijing 100871, China)

Abstract: Comparison is a common form of expression, and extracting comparative relations between objects is a novel and substantial research problem. Identifying comparative sentences in natural language text is an important step toward extracting such relations. To our knowledge, there has been no prior research on automatically identifying Chinese comparative sentences. This paper first defines the problem of Chinese comparative sentence identification and then proposes using an SVM to classify a Chinese sentence as comparative or non-comparative. Various linguistic and statistical features are explored, such as keywords and sequential patterns. Experimental results demonstrate the effectiveness of sequential patterns: the classifier with sequential patterns significantly outperforms the traditional term-based classifier. We also empirically investigate the important factors that affect classification performance.

Key words: computer application; Chinese information processing; Chinese comparative sentence identification; comparative mining; text classification; sequential pattern

1 [...]

[...] Jindal et al. [1] [...] [2] [...] Zhai et al. [...] Cross-Collection Mixture Model [3, 4]; Sun et al. [5] [...] Luo et al. [6] [...] Web [...]

Received: 2008-04-03; revised: 2008-06-27
Funding: 863 Program (2008AA01Z421); [...] (60703064); [...] (20070001059)
Authors: [...] (1984-), [...]; [...] (1979-), [...]; [...] (1973-), [...] SGML/XML [...]
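[...]; Feldman et al. [7] [...]. [...] [8-10] [...] [11, 12] [...] [13] [...] Web [...] SVM [...]

[...]: Section 2 [...]; Section 3 [...]; Section 4 [...]; [...]

2 [...]

2.1 [...]

[...] Lerner [14] [...], Stassen [15] [...] 1898 [...] [10] [...]: [...]

2.2 [...]

[...] X [...] Y [...] R; X [...] Y [...] R; X [...] / [...] Y [...] R; X [...] Y [...] R; [...] R [...] Y / [...] R [...] X; [...] R [...] Y, X [...] R [...] Y / [...] R [...] X [...] Y [...] R [...] X [...] R [...] X [...] R [...] X, Y, R [...]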
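[...] Y, X [...] [16]: [...]

2.3 [...]

2.3.1 [...]

[...] ([...]) [...] ([...]) [...] ([...]) [...] ([...] / [...] / [...]) [...]

2.3.2 [...]

[...]: [...] 10cm [...]

2.3.3 [...]

[...] [17] [...]: [...]

2.3.4 [...]

[...] [12] [...] X [...] Y ([...] R), X R [...] Y, X [...] / [...] Y ([...] R) [...] [17]: [...] + [...] + [...] [13]

2.3.5 [...]

3 [...]

3.1 [...]

[...]: a) [...]; b) [...]. [...] a function f: S → C, where S is the set of sentences and C is the set of classes. For task a, C = {comparative, non-comparative}; for task b, C = {[...], [...], [...], [...], [...]}. [...] a [...]

3.2 SVM (Support Vector Machine, SVM)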
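Boser et al. [18] [...]. [...] a separating hyperplane $w \cdot x + b = 0$ [...] margin $2/\|w\|$. Given the training set

$$D = \{(x_i, c_i) \mid x_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\}\},$$

$w$ and $b$ must satisfy

$$c_i (x_i \cdot w + b) - 1 \ge 0, \quad \forall i.$$

SVM training seeks the $w$ and $b$ that minimize $\|w\|^2$ under this constraint. For a new input $x$, the SVM decision function is

$$f(x) = \operatorname{sgn}(w \cdot x + b) = \begin{cases} +1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{otherwise.} \end{cases}$$

[...] SVM [...] [19] [...] [20, 21] [...] [22] [...]

3.3 [...]

[...] Section 2.3 [...] ([...]) [...] [12] [...] Appendix A [...] [1] [...] SVM [...]

3.4 [...]

3.4.1 [...]

Let $I = \{i_1, i_2, \dots, i_n\}$ be a set of items; an itemset $X$ is a subset of $I$. A sequence $s$ is an ordered list of itemsets, written $\langle a_1 a_2 \dots a_r \rangle$, where each $a_i$ is an itemset. Given two sequences $s_1 = \langle a_1 a_2 \dots a_r \rangle$ and $s_2 = \langle b_1 b_2 \dots b_m \rangle$, if there exist integers $1 \le j_1 < j_2 < \dots < j_r \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \dots, a_r \subseteq b_{j_r}$, then $s_1$ is a subsequence of $s_2$, and $s_2$ contains $s_1$.

The input data set is $D = \{(s_1, c_1), (s_2, c_2), \dots, (s_n, c_n)\}$, where $s_i$ is a sequence and $c_i \in C$ is its class. A class sequential rule (Class Sequential Rule, CSR) is an implication $X \rightarrow c$, where $X$ is a sequence and $c \in C$. An instance $d = (s_i, c_i)$ in $D$ is covered by the CSR $X \rightarrow c$ if $X$ is a subsequence of $s_i$; if $d$ is covered by the CSR and $c = c_i$, then $d$ satisfies the CSR. The support (Support) of a CSR is the fraction of instances in $D$ that satisfy it, and its confidence (Confidence) is the fraction of the instances in $D$ covered by it that also satisfy it.

3.4.2 [...]

[...] Jindal et al. [...] ([...]) [...] CSR [...] Jindal et al. [...]; [...]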
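The definitions of Section 3.4.1 (subsequence containment, cover, satisfy, support, confidence) can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the helper `seq`, and the toy items such as `than/p` are hypothetical and do not come from the paper.

```python
from typing import FrozenSet, List, Sequence, Tuple

Itemset = FrozenSet[str]
Instance = Tuple[Sequence[Itemset], str]  # (sequence s_i, class c_i)


def is_subsequence(s1: Sequence[Itemset], s2: Sequence[Itemset]) -> bool:
    """True if s1 = <a1 ... ar> is contained in s2 = <b1 ... bm>, i.e. each
    a_k is a subset of b_{j_k} for some indices j_1 < j_2 < ... < j_r."""
    j = 0
    for a in s1:
        # Greedily advance to the earliest remaining itemset that includes a.
        while j < len(s2) and not a <= s2[j]:
            j += 1
        if j == len(s2):
            return False
        j += 1  # the next element of s1 must match strictly later in s2
    return True


def support_confidence(x: Sequence[Itemset], c: str,
                       data: List[Instance]) -> Tuple[float, float]:
    """Support and confidence of the class sequential rule X -> c over D."""
    covered = [ci for si, ci in data if is_subsequence(x, si)]
    satisfied = sum(1 for ci in covered if ci == c)
    support = satisfied / len(data)
    confidence = satisfied / len(covered) if covered else 0.0
    return support, confidence


def seq(*items: str) -> List[Itemset]:
    """Hypothetical helper: build a sequence of single-item itemsets."""
    return [frozenset([item]) for item in items]


# Toy data (assumed for illustration): "C" = comparative, "NC" = non-comparative.
toy_data: List[Instance] = [
    (seq("*/n", "than/p", "*/a"), "C"),
    (seq("*/n", "than/p", "*/v"), "NC"),
    (seq("*/n", "*/v", "*/a"), "NC"),
    (seq("than/p", "*/n", "*/a"), "NC"),
]
rule_x = seq("than/p", "*/a")
support, confidence = support_confidence(rule_x, "C", toy_data)
print(support, confidence)  # 0.25 0.5
```

In this toy set the rule covers the first and fourth instances but only the first has class C, so one of four instances satisfies it (support 0.25) and one of two covered instances satisfies it (confidence 0.5).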
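[...]: [...]/n [...]/a 8848/q [...]/m, [...]/v [...]/p [...]/n [...]/n [...] */a [...]/p → C [...]

[...] Jindal et al. [...] 3 [...] 7 [...]: [...]/n [...]/n [...]/p [...]/t 65/m nm/q [...]/n [...]/n 64/m [...]/q [...]/n [...]/d [...]/a [...] "same as", "as ... as" [...]: [...]

3.4.3 CSR [...]

[...] CSR [...] GSP [23] [...] PrefixSpan [24] [...] CSR [...] PrefixSpan [...] [25] [...] Jindal et al. [...]:

$$\mathrm{sup}(r) > \tau \cdot \min(f_i),$$

where $f_i$ is the frequency of the $i$-th item of rule $r$ and $\tau \in (0, 1)$. [...] $\tau \cdot \min(f_i) < 1/N$ (where $N$ is the number of [...]) [...]:

$$\mathrm{sup}(r) > \max(\tau \cdot \min(f_i),\ s),$$

where $s \ge 1/N$. [...] $\tau = 0.1$, $s = 2/N$, [...] 0.65 [...] Appendix B.

3.4.4 [...] CSR [...]

[...] the rule set $R$ [...] a sentence $s$ [...]. Let $R = \{r_1, r_2, \dots, r_m\}$; the features of $s$ are $f_1, f_2, \dots, f_m$, where

$$f_i = \begin{cases} 1 & \text{if } s \text{ matches } r_i \\ 0 & \text{otherwise} \end{cases}, \quad 1 \le i \le m.$$

[...] SVM [...] ([...]) [...]:

$$C(\mathit{sent}) = \begin{cases} C & \text{if } \exists\, \mathit{seq}\ (\mathit{seq} \in S \wedge C(\mathit{seq}) = C) \\ NC & \text{otherwise} \end{cases}$$

where $S$ is the set of sequences of $\mathit{sent}$, $C$ denotes comparative, and $NC$ denotes non-comparative.

4 [...]

4.1 [...]

[...] Table 2 [...]: [...] 1 297 [...] 458 [...]

4.2 [...]

[...] F [...] Table 3 [...]

1 http://groups.zol.com.cn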
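The support threshold of Section 3.4.3 and the binary rule features of Section 3.4.4 can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code. Several things are assumptions: item frequencies are treated as fractions of the N training sequences, the coefficient is written `tau`, a sentence is simplified to a flat list of `word/POS` tokens, and "matches" is taken to mean that the rule's items occur in the sentence in order. The function names and toy tokens are hypothetical.

```python
from typing import List, Sequence


def support_threshold(item_freqs: Sequence[float], n: int,
                      tau: float = 0.1) -> float:
    """Minimum support for a rule: max(tau * min(f_i), s) with s = 2 / N.

    item_freqs: relative frequencies of the rule's items (assumed fractions).
    n: number of training sequences N.
    """
    s = 2 / n  # floor that keeps the threshold above 1 / N
    return max(tau * min(item_freqs), s)


def contains(sentence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if the pattern's tokens occur in the sentence in the same order."""
    it = iter(sentence)
    # `tok in it` consumes the iterator up to and including the first match,
    # so later tokens can only match strictly later positions.
    return all(tok in it for tok in pattern)


def rule_features(sentence: Sequence[str],
                  rules: Sequence[Sequence[str]]) -> List[int]:
    """Binary feature vector: f_i = 1 if the sentence matches rule r_i."""
    return [1 if contains(sentence, r) else 0 for r in rules]


# Toy rules and sentence (assumed for illustration only).
rules = [["than/p", "*/a"], ["*/v", "than/p"], ["same/a"]]
sentence = ["*/n", "than/p", "*/n", "*/a"]
print(rule_features(sentence, rules))  # [1, 0, 0]

# With N = 1000, a rare item (0.5%) gives tau * min(f_i) = 0.0005 < 2 / N,
# so the floor s = 0.002 applies.
print(support_threshold([0.30, 0.005], n=1000))  # 0.002
```

The resulting 0/1 vector is the kind of input that would be passed to the SVM, one dimension per mined rule; the floor `s` only matters for rules built from very rare items.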
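$$\text{Accuracy} = \frac{tp + tn}{tp + fp + tn + fn}, \quad \text{Precision} = \frac{tp}{tp + fp}, \quad \text{Recall} = \frac{tp}{tp + fn}, \quad F = \frac{2\,tp^2}{2\,tp^2 + tp \cdot fp + tp \cdot fn}$$

Table 3 [...]: tp, fp, fn, tn

[...] 5-fold cross-validation [...] (4 parts for training, 1 for testing) [...]

[...] (WS) [...] (Cn, n [...]) [...] (SS) [...] WS [...] 4 [...] Cn [...] SS [...] 1 [...]

4.3 [...]

4.3.1 [...]

[...] SVM [...] SVMLight (footnote 2) [...] Table 4 [...]: Baseline [...] (Bag-of-words) [...]; KW [...]; WP [...]; KWP [...]; CSR [...]

Table 4 [...]

Method     Accuracy   Precision   Recall   F-measure
Baseline   90.1%      96.7%       64.2%    0.772
KW         89.9%      91.7%       67.5%    0.778
WP         90.5%      98.7%       64.7%    0.781
KWP        91.2%      95.7%       69.9%    0.806
CSR        92.7%      91.4%       79.6%    0.850

[...] 1 [...] Cn [...] WS [...] 5 [...] SS [...] Cn [...] WS [...]

4.3.3 [...]

[...]? [...]? [...] SS [...] 2 [...] KW [...] (WP [...] Baseline, KWP [...] KW) [...] CSR [...] F [...] Baseline [...] CSR [...] 23.9%, F [...] 10.1%, [...] 5.5% [...]

4.3.2 [...]

[...] Section 3.4.2 [...]

2 http://svmlight.joachims.org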
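The evaluation measures of Section 4.2 can be checked numerically with a short Python sketch. It shows that the F formula written in terms of tp, fp and fn equals the harmonic mean of precision and recall. The function name and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not the paper's data.

```python
from typing import Dict


def evaluate(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F from confusion-matrix counts.

    Assumes tp > 0; the degenerate cases (no positives) are not handled here.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F as written in Section 4.2, directly in terms of the counts.
    f = 2 * tp ** 2 / (2 * tp ** 2 + tp * fp + tp * fn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f": f}


# Hypothetical counts, for illustration only.
m = evaluate(tp=80, fp=10, fn=20, tn=90)

# The count-based F agrees with the harmonic mean 2PR / (P + R).
harmonic = 2 * m["precision"] * m["recall"] / (m["precision"] + m["recall"])
print(round(m["accuracy"], 3), round(m["f"], 4), round(harmonic, 4))
```

As a sanity check against Table 4, the second and third columns of the CSR row (0.914 and 0.796) give a harmonic mean of about 0.851, in line with the reported 0.850 once rounding of the inputs is taken into account.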
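[...] 1 [...] 2 [...] 3 [...] F [...] 3 [...] 2 [...] 3 [...]

4.3.4 [...]

[...] 3 [...]:

1. [...] ([...]) [...] X [...] Y [...] R [...] CSR [...]:

a) */u [...]/d → NC
b) */v [...]/d → NC
c) */n [...]/d → NC
d) [...]/d → NC
e) [...]/d */a → C
f) [...]/d */v → NC
g) [...]/d */n → NC

[...] e [...] ([...])

2. X [...] R [...] Y [...] SVM [...] CSR [...]:

a) */a [...]/p → C
b) */v [...]/p → NC
c) [...]/p */a → NC
d) [...]/p */n → NC

[...] a [...]

3. [...] ([...]) [...] CSR [...]:

a) */m [...]/p → NC
b) */r [...]/p → NC
c) */a [...]/p → C
d) */v [...]/p → NC
e) [...]/p */v → NC
f) [...]/p */n → NC

[...] X [...] R [...] Y [...] R [...]

5 [...]

[...] SVM [...]

References

[1] N. JINDAL, B. LIU. Identifying Comparative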
Sentences in Text Documents [C]// Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM: 2006: 244-251.
[2] N. JINDAL, B. LIU. Mining Comparative Sentences and Relations [C]// Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06). 2006.
[3] C. ZHAI, A. VELIVELLI, B. YU. A Cross-Collection Mixture Model for Comparative Text Mining [C]// Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: 2004: 743-748.
[4] P. ZANG, C. ZHAI. CTMS: a comparative text mining system [D]. Champaign: University of Illinois at Urbana-Champaign Computer Science Department, 2004.
[5] J.-T. SUN, X. WANG, D. SHEN, H.-J. ZENG, Z. CHEN. CWS: A Comparative Web Search System [C]// Proceedings of the 15th International Conference on World Wide Web. ACM: 2006: 467-476.
[6] G. LUO, C. TANG, Y.-L. TIAN. Answering relationship queries on the web [C]// Proceedings of the 16th International Conference on World Wide Web. ACM: 2007: 561-570.
[7] R. FELDMAN, M. FRESKO, J. GOLDENBERG, O. NETZER, L. UNGAR. Extracting Product Comparisons from Discussion Boards [C]// Proceedings of the Seventh IEEE International Conference on Data Mining. 2007: 469-474.
[8] [...]. [...] [M]. [...]: [...], 1898.
[9] [...]. [...] [M]. [...]: [...], 1942.
[10] [...]. [...] [M]. [...]: [...], 2007.
[11] [...]. [...] [M]. [...]: [...], 1980.
[12] [...]. [...] [J]. [...], 2005, 25(3): 60-63.
[13] [...]. [...] [M]. [...]: [...], 2004.
[14] J.-Y. LERNER, M. PINKAL. Comparatives and Nested Quantifications [M]. Semantics: Critical Concepts in Linguistics. 2004: 70-87.
[15] L. STASSEN. Comparison and Universal Grammar [M]. Basil Blackwell, 1985.
[16] [...]. [...] [M]. [...]: [...], 1982.
[17] [...]. [...] [C]// [...]. 4. [...]: 2004: 12-21.
[18] B. E. BOSER, I. M. GUYON, V. N. VAPNIK. A Training Algorithm for Optimal Margin Classifiers [C]// Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM: 1992: 144-152.
[19] T. JOACHIMS.
Text Categorization with Support Vector Machines: Learning with Many Relevant Features [C]// Proceedings of ECML-98, 10th European Conference on Machine Learning. Springer: 1998: 137-142.
[20] [...], [...], [...]. [...] SVM [...] [J]. [...], 2004, 18(2): 1-7.
[21] [...], [...]. [...] SVM [...] [J]. [...], 2006, 20(6): 17-24.
[22] [...], [...], [...]. [...] [J]. [...], 2000, 14(3): 37-41.
[23] R. SRIKANT, R. AGRAWAL. Mining Sequential Patterns: Generalizations and Performance Improvements [C]// Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology. Springer-Verlag: 1996: 3-17.
[24] J. PEI, J. HAN, B. MORTAZAVI-ASL, J. WANG, H. PINTO, Q. CHEN, U. DAYAL, M.-C. HSU. Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach [J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16.
[25] B. LIU. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data [M]. Springer, 2006.

Appendix A [...]
Appendix B [...] ([...])

[...]/p [...]/a → C
[...]/a [...]/a → C
[...]/p [...]/a → C
*/q [...]/a → C
[...]/p [...]/a → C
[...]/v */a → C
[...]/p [...]/a → C
[...]/v */n → C
*/n [...]/a → C
*/u [...]/v → C
*/r [...]/n → C
[...]/p [...]/n → C
[...]/p [...]/n → C
*/d [...]/v → C
*/n [...]/a → C
[...]/p */a → C
[...]/p [...]/a → C
[...]/d [...]/p → C
[...]/d [...]/n → C
*/n [...]/p → C
*/nt [...]/v → C
[...]/p [...]/v → C
*/n [...]/v → C
[...]/d [...]/r → C
[...]/p [...]/v → C
[...]/v */nt → C
*/n [...]/p → C
[...]/v */n → C
[...]/p */d → C
[...]/p */a → C
[...]/v */a → C
*/n [...]/v → C
*/n [...]/n → NC
[...]/p */d → C
[...]/p [...]/v → C
[...]/v */n → C
[...]/p [...]/v → C
[...]/p */a → C
[...]/d */a → C
[...]/p [...]/a → C
*/n [...]/d → C
*/r [...]/a → C
[...]/d → C
[...]/d */a → C
*/a [...]/p → C
*/v [...]/p → NC

Note 1: [...] * [...]
Note 2: [...]: n s t v a q r p d c u f e o i j z y nt nr ns nz m w
Note 3: C denotes comparative; NC denotes non-comparative.