The State of the Art and Difficulties in Automatic Chinese Word Segmentation

JOURNAL OF SYSTEM SIMULATION, Vol. 17, No. 1, 2005, p. 138. Article ID: 1004-731X(2005)01-0138-06. CLC number: TP391. Document code: A.

ZHANG Chun-xia 1,2, HAO Tian-yong 1
(1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China; 2. Graduate School of the Chinese Academy of Sciences, Beijing 100039, China)

Abstract: Automatic Chinese word segmentation is a basic research issue for Chinese information processing tasks such as information extraction, information retrieval, machine translation, text classification, automatic text summarization, speech recognition, text-to-speech, and natural language understanding. Although it has been investigated for more than twenty years, it remains a bottleneck for Chinese information processing. We give a detailed analysis of the state of the art in automatic Chinese word segmentation, build a formal model of word segmentation, discuss the factors affecting word segmentation, the two main difficulties in word segmentation, and their resolutions, and finally point out the existing problems, especially those in word segmentation evaluation, as well as the research problems still to be resolved.

Keywords: automatic Chinese word segmentation; formal model; unknown words; word segmentation evaluation

1 (1) [9] (1) (2) 70% [10] ( ) (3) ( ) (4) [9, 11] (2) (3) [1] [6] [2 5, 6-7] [13] [6] [1-2,9,11] [9,11]

Received: 2003-12-23; revised: 2004-03-02. Grant numbers: #60073017, #60273019; #2001CCA03000, #2002DEA30036. (1974 ),,,, (1981 ),,,,

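(3) (4) (1) (2) 1 (5) (3) (4) [20] [12] 1.2 1.1 N Markov (1) (2) (1) (2) (1) m (3) m (a) (2)

The mutual information of x and y is

$MI(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}$

where $p(x, y)$ is the probability that x and y co-occur, and $p(x)$ and $p(y)$ are the probabilities of x and y respectively [21].

(1) $MI(x, y) \gg 0$: x and y are strongly associated.
(2) $MI(x, y) \approx 0$: x and y are only weakly associated.
(3) $MI(x, y) \ll 0$: x and y tend not to occur together.

[13] Hash [14] Hash (b) N I/O 1.56 [15] [19] Sproat [4] N [16] [17] (c) TRIE [22] [8] (1) 15.3 16.6 [18] (2) (BP PNN) 2.89 [11,15,19] (1) (1) (2) (2) (3) [20]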
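The mutual information statistic MI(x, y) defined above can be estimated from raw counts. The following is a minimal sketch, assuming that x and y are single characters, that p(x, y) is estimated from adjacent character pairs, and that p(x) and p(y) are estimated from character frequencies. The function name `mutual_information` and the toy corpus are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def mutual_information(corpus):
    """Estimate MI(x, y) = log2(p(x, y) / (p(x) * p(y))) for adjacent characters.

    corpus: list of unsegmented strings (assumption: one string per sentence).
    Returns a dict mapping each observed adjacent pair (x, y) to its MI value.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        # Adjacent pairs only; pairs never span two sentences.
        pair_counts.update(zip(sentence, sentence[1:]))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    mi = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return mi


# Toy corpus (hypothetical): Latin letters stand in for Chinese characters.
scores = mutual_information(["abab", "cd"])
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

A segmenter built on this statistic would typically keep a pair together when its MI exceeds a threshold and cut between the two characters otherwise; the threshold itself has to be tuned on data and is not specified here.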

(1) 94.3 (2) (3) (1) (2) (3) (4) 1.3 (d) Markov [9] [11] Markov ( ) Masao Utiyama [23] [27] (1) (2) (3) (1) (2) (4) (5) (6) [11] Markov Palmer [2] (e) EM [24] EM EM 80% 20% Kok [28] (f) [25] N-gram [29] M N 1 M N N-gram N N N-gram m N N (1) (2) 1 (g) [26] x 2-2 [20] 83.6 2

$T = \{S_1, S_2, S_3, \ldots, S_m\}$, $m \in N$
$C = \{C_1, C_2, C_3, \ldots, C_n\}$, $n \in N$, $C_i$ ($i = 1, 2, \ldots, n$)
$S = S_1 S_2 S_3 \cdots S_p$ ($p \in N$), $S_i \in T$ ($i = 1, 2, \ldots, p$)

3 ACWSS (Automatic Chinese Word Segmentation System) = (St, Lk, Si, A, Ewsr, Gwsr):
(1) St (Source Texts)
(2) Lk (Linguistic Knowledge)
(3) Si (Statistic Information)
(4) A (Algorithm)
(5) Ewsr (Experimental Word Segmentation Result)
(6) Gwsr (Goal Word Segmentation Result)

ACWSS S 1, S 1 S 2, S 2 S 3, S 3 Mrd (Machine Readable Dictionary) Si
$S = S_1 S_2 \cdots S_{10}$ $S_1'$, $S_2'$
4 S S' S
$S_2' = S_1 S_2 / S_3 S_4 / S_5 S_6 / S_7 S_8 S_9 S_{10}$
$S' = W_1 / W_2 / W_3 / \cdots / W_m /$ / $W_i$ Mrd ($i = 1, 2, \ldots, m$) $W_i$
5 S S' S
$S' = W_1' / W_2' / W_3' / \cdots / W_m' /$ S / S' $W_i'$
9 S k
$S_1' = S_1 \cdots S_{i_1} / S_{i_1+1} \cdots S_{i_2} / S_{i_2+1} \cdots S_{i_j} / S_{i_j+1} \cdots S_{i_k} / S_{i_k+1} \cdots S_m$
$S_2' = S_1 \cdots S_{i_1} / S_{i_1+1} \cdots S_{i_2} / S_{i_2+1} \cdots S_{i_j} / S_{i_j+1} \cdots S_{i_k} / S_{i_k+1} \cdots S_m$
S S 6 S 1 S S 1 ', S 2 ', S n ' S 1 ' S 2 ' S 1 ' S 2 ' SSN(S') S' n i, j= 1 (Si'- Sj') S SSN(T S1' ) SSN(T S2' ) T S 1 ' S 2 ' S S
$S = S_1 S_2 \cdots S_{10}$ $S_1'$, $S_2'$, $S_3'$
$S_1' = S_1 S_2 / S_3 S_4 / S_5 S_6 S_7 / S_8 S_9 S_{10}$
$S_2' = S_1 / S_2 S_3 S_4 / S_5 S_6 S_7 / S_8 S_9 / S_{10}$
$S_3' = S_1 S_2 / S_3 S_4 / S_5 S_6 S_7 / S_8 / S_9 S_{10}$
S S 1 S 2 S 3 S 4 S 8 S 9 S 10 7 S S 1 S 2 S 1 S 2 S 1 /S 2 / S 1 S 2 / S 1 S 2 S 1, S 2, S 1 S 2 8 S S 1 S 2 S 3 S 1 S 2 S 3 S 1 S 2 /S 3 / S 1 /S 2 S 3 /
$S_1' = S_1 S_2 S_3 / S_4 S_5 S_6 / S_7 S_8 S_9 S_{10}$
S 1 S 2 S 3 S 1 S 2 S 3 S 4 S 5 S 6 S S 1 ' S 2 ' S 1 S i1 S i1+1 S i2 S ij+1 S ik S ik+1 S m S / T S $SSN(T_{S_1'}) = SSN(T_{S_2'})$ T 3 [30]
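The six-tuple ACWSS = (St, Lk, Si, A, Ewsr, Gwsr) and the "/"-delimited segmentations above suggest a simple way to compare an experimental result with a goal result. The sketch below is illustrative only. The class and function names are hypothetical, scoring a word as correct only when its exact character span appears in both segmentations is an assumption, and the F-measure uses the form F = (1+β)PR/(βP+R) as it is printed in this paper's evaluation discussion.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Set, Tuple


@dataclass
class ACWSS:
    """The six components, in the order the paper lists them."""
    st: List[str]            # St: source texts
    lk: Any                  # Lk: linguistic knowledge (for example a dictionary)
    si: Any                  # Si: statistic information
    a: Callable[[str], str]  # A: algorithm, raw text -> "/"-delimited text
    ewsr: List[str]          # Ewsr: experimental word segmentation results
    gwsr: List[str]          # Gwsr: goal word segmentation results


def word_spans(segmented: str) -> Set[Tuple[int, int]]:
    """Turn 'ab/cd/efg' into the set of (start, end) character spans of its words."""
    spans = set()
    start = 0
    for word in segmented.split("/"):
        if word:  # skip empty pieces caused by a trailing '/'
            spans.add((start, start + len(word)))
            start += len(word)
    return spans


def f_measure(ewsr: str, gwsr: str, beta: float = 1.0) -> float:
    """F = (1 + beta) * P * R / (beta * P + R), with P and R computed over word spans."""
    experimental = word_spans(ewsr)
    goal = word_spans(gwsr)
    correct = len(experimental & goal)
    if correct == 0:
        return 0.0
    precision = correct / len(experimental)
    recall = correct / len(goal)
    return (1 + beta) * precision * recall / (beta * precision + recall)


# Letters a..j stand in for S_1..S_10; the two strings mirror S_1' and S_2' above.
system = ACWSS(
    st=["abcdefghij"],
    lk=None,
    si=None,
    a=lambda text: text,  # placeholder algorithm (assumption): no segmentation
    ewsr=["ab/cd/efg/hij"],
    gwsr=["a/bcd/efg/hi/j"],
)
print(f_measure(system.ewsr[0], system.gwsr[0]))
```

Span-based matching is stricter than boundary-based matching: a word counts only if both of its boundaries agree with the goal segmentation, which is why the two example segmentations share just one word.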

E-step [30] M-step EM [27] 3.1 [11] [33] (1) (2) (3) [31] 3.2 [32,33] T-Test Markov SVM k-NN [34] [35] EM [32] T- m (1) n (2) n/m [33] T [37] (1) (1) (2) (3) (2) ( (1) ) (1) (2) (3) (2) // (1) [34] Bigram (2) Markov Keh-Jiann Chen [38] [35] SVM k-NN (1) (2) (3) SVM SVM k-NN [36] EM (1) (2) (1) (2) Chiang [39] 4 ME (3)

(a) (b) [40] [33] (c)

10 $ACWSS_1 = (St_1, Lk_1, Si_1, A_1, Ewsr_1, Gwsr_1)$, $ACWSS_2 = (St_2, Lk_2, Si_2, A_2, Ewsr_2, Gwsr_2)$: (1) $St_1 = St_2$; (2) $Gwsr_1 = Gwsr_2$

11 $ACWSS_1 = (St_1, Lk_1, Si_1, A_1, Ewsr_1, Gwsr_1)$, $ACWSS_2 = (St_2, Lk_2, Si_2, A_2, Ewsr_2, Gwsr_2)$: (1) Gwsr 1 Gwsr 2 (2) St 1 St 2

12 F-measure: $F = \frac{(1+\beta)PR}{\beta P + R}$ [3], where P is precision and R is recall. St 1 St 2 Gwsr 1 Gwsr 2 (d)

[1]. 01 (GB12200.1-90) [M]., 1991.
[2] Robert Dale, Hermann Moisl, Harold Somers. Handbook of Natural Language Processing [M]. New York: Marcel Dekker, Inc., 2000.
[3] David D. Palmer. A Trainable Rule-based Algorithm for Word Segmentation [A]. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics [C]. 1997. 321-328.
[4] Richard Sproat, Chilin Shih, William Gale, Nancy Chang. A Stochastic Finite-State Word-Segmentation Algorithm for Chinese [J]. Computational Linguistics, 1996, 22(3): 377-404.
[5]. [J]., 1998, 4: 22-26.
[6],. [M].. 1994.
[7]. [J]., 1992, 29(11): 54-58.
[8]. [J]., 1999, 16(10): 8-9.
[9]. [J]., 2001, 15(2): 1-8.
[10],,. [C].. 2003. 26-38.
[11]. [J]., 2002, 14(5): 544-550.
[12],. [J]., 1999, 36(9): 1143-1147.
[13]. [J]., 2002, 18(3): 44-48.
[14]. [J]., 2002, 38(11): 106-109.
[15]. [J]., 2000, 23(2): 84-87.
[16]. [J]., 1997, 34(7): 542-545.
[17]. [J]., 2000, 14(1): 1-6.
[18]. [J]., 1996, 33(4): 306-311.
[19]. [J]., 1997, 20(4): 66-69.
[20]. [J]., 2001, 15(6): 33-39.
[21] Yubin Dai, Teck Ee Loh, Christopher Khoo. A New Statistical Formula for Chinese Text Segmentation Incorporating Contextual Information [C]. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1999. 82-89.
[22]. [J]. 5, 1998, 17(1): 41-50.
[23] Masao Utiyama, Hitoshi Isahara. A Statistical Model for Domain-Independent Text Segmentation [C]. The 39th Annual Meeting of the Association for Computational Linguistics and 10th Conference of the European Chapter of the Association for Computational Linguistics. 2001. 491-498.

[24]. EM [J]., 2000, 15(1): 28-33.
[25]. [J]., 2001, 15(6): 47-52.
[26]. [J]., 1996, 9(4): 297-303.
[27],,. [J]., 2001, 15(1): 13-18.
[28] Kok Wee Gan. Integrating Word Boundary Identification with Sentence Understanding [C]. The 31st Annual Meeting of the Association for Computational Linguistics. 1993. 301-303.
[29],,. [A]. 9 [C].. 2001. 987-991.
[30].., 2001, 15(5): 9-14.
[31]. [M].. 2002.
[32]. [J]., 2002, 38(11): 125-127.
[33]. [J]., 1997, 34(5): 332-339.
[34]. [J]., 1999, 39(5): 101-103.
[35]. SVM K-NN [J]., 2001, 15(6): 13-18.
[36]. EM [J].. 2001, 15(2): 38-44.
[37],,,,. [J]., 2002, 21(3): 269-272.
[38] Keh-Jiann Chen, Ming-Hong Bai. Unknown Word Detection for Chinese by a Corpus-based Learning Method [J]. International Journal of Computational Linguistics & Chinese Language Processing, 1998, 3(1): 27-44.
[39] Tung-Hui Chiang, Jing-Shin Chang, Ming-Yu Lin, Keh-Yih Su. Statistical Models for Word Segmentation and Unknown Word Resolution [C]. Proceedings of ROCLING-V, ROC Computational Linguistics Conference. 1992. 123-146.
[40]. [J]., 1999, 8(2): 88-91.
[41]. [J]., 1989, 3(1): 1-9.
[42]. [M]., 1997.
[43]. [J]., 1999, 22(2): 141-146.
[44]. N [J]., 1998, 12(1): 35-41.
[45]. [J]., 1997, 6(1): 72-78.
[46],. [J]., 1995, 4(4): 40-46.
[47]. [J]., 1997, 6(1): 101-106.
[48],,. [J]., 2003, 25(5): 28-30.