JOURNAL OF CHINESE INFORMATION PROCESSING, Vol. 18, No. 2

Article ID: 1003-0077(2004)02-0073-07

Research of Han Character Internal Codes Recognition Algorithm in the Multi-lingual Environment

LI Pei-feng 1,2, ZHU Qiao-ming 1, QIAN Pei-de 1
(1. Computer Science and Technology School, Suzhou University, Suzhou, Jiangsu 215006, China; 2. Department of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 210000, China)

Abstract: The Han character internal codes used in computers are generally expected to migrate to ISO/IEC 10646, but several Han character internal codes are in use today, and this situation will persist for a long time. Automatic recognition of Han character internal codes is therefore the key to building a multi-lingual environment. This paper discusses the recognition of Han character internal codes in the multi-lingual environment and presents four recognition algorithms: the Internal Code Bound Recognition Algorithm, the Interpunction Recognition Algorithm, the Han Character Frequency Recognition Algorithm and the Semantic Recognition Algorithm. It also evaluates these algorithms; on the test documents the recognition rate reaches 99.9%.

Key words: computer application; Chinese information processing; multi-lingual environment; Han character internal code; recognition algorithm

CLC number: TP391.1    Document code: A

Received: 2003-10-13
Fund: [...] (01kjb520001)
Author: [...] (1971- ) [...]

1 Introduction

ISO/IEC 10646 [...] [1]: (1) [...]; (2) [...]
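[...] (ANSI Code) [...] 10646 [3] [...] Windows [...] 10646 [...]

2 Han character internal codes

The Han character internal codes in use fall into three families: (1) GB2312, GBK and GB18030; (2) BIG-5, BIG-5E and HKCS; (3) ISO/IEC 10646 and Unicode. For 10646 and Unicode see [4]; both are written 10646 below.

2.1 GB2312/GBK/GB18030

GB2312 is a two-byte code. With GB_h the high (first) byte and GB_l the low (second) byte, its Han characters occupy

$$\mathrm{GB2312} = \{\, GB_h\,GB_l \mid \mathrm{0xB0} \le GB_h \le \mathrm{0xF7},\ \mathrm{0xA1} \le GB_l \le \mathrm{0xFE} \,\}$$

GBK extends GB2312 with reference to GB13000; it is compatible with GB2312 and [...] ISO/IEC 10646-1:1993. Its code range is

$$\mathrm{GBK} = \{\, GB_h\,GB_l \mid \mathrm{0x81} \le GB_h \le \mathrm{0xFE},\ \mathrm{0x40} \le GB_l \le \mathrm{0xFE},\ GB_l \ne \mathrm{0x7F} \,\}$$

GB18030 builds on GBK and adds four-byte codes for some 6580 further Han characters:

$$\mathrm{GB18030} = \mathrm{GB18030}_{k1} \cup \mathrm{GB18030}_{k2} \cup \mathrm{GB18030}_{k3} \cup \mathrm{GBK}$$

$$\mathrm{GB18030}_{k1} = \{\, GB_{h1}\,GB_{l1}\,GB_{h2}\,GB_{l2} \mid GB_{h1} = \mathrm{0x82},\ \mathrm{0x30} \le GB_{l1} \le \mathrm{0x34},\ \mathrm{0x81} \le GB_{h2} \le \mathrm{0xFE},\ \mathrm{0x30} \le GB_{l2} \le \mathrm{0x39} \,\}$$

$$\mathrm{GB18030}_{k2} = \{\, GB_{h1}\,GB_{l1}\,GB_{h2}\,GB_{l2} \mid GB_{h1} = \mathrm{0x81},\ GB_{l1} = \mathrm{0x39},\ \mathrm{0xEF} \le GB_{h2} \le \mathrm{0xFE},\ \mathrm{0x30} \le GB_{l2} \le \mathrm{0x39} \,\}$$

$$\mathrm{GB18030}_{k3} = \{\, GB_{h1}\,GB_{l1}\,GB_{h2}\,GB_{l2} \mid GB_{h1} = \mathrm{0x82},\ GB_{l1} = \mathrm{0x35},\ \mathrm{0x81} \le GB_{h2} \le \mathrm{0x87},\ \mathrm{0x30} \le GB_{l2} \le \mathrm{0x39} \,\}$$

The three codes are nested, GB2312 ⊂ GBK ⊂ GB18030, so GB18030 stands for the whole family below.

2.2 BIG-5/BIG-5+/HKCS

BIG-5 is a two-byte code. With BIG_h the high byte and BIG_l the low byte, its code ranges are:

Frequently used Han characters:
$$\text{BIG-5}_1 = \{\, BIG_h\,BIG_l \mid \mathrm{0xA4} \le BIG_h \le \mathrm{0xC6},\ \mathrm{0x40} \le BIG_l \le \mathrm{0xFE} \,\}$$

Less frequently used Han characters:
$$\text{BIG-5}_2 = \{\, BIG_h\,BIG_l \mid \mathrm{0xC9} \le BIG_h \le \mathrm{0xF9},\ \mathrm{0x40} \le BIG_l \le \mathrm{0xFE} \,\}$$

Symbols:
$$\text{BIG-5}_0 = \{\, BIG_h\,BIG_l \mid \mathrm{0xA1} \le BIG_h \le \mathrm{0xA3},\ \mathrm{0x40} \le BIG_l \le \mathrm{0xFE} \,\}$$

$$\text{BIG-5} = \text{BIG-5}_1 \cup \text{BIG-5}_2 \cup \text{BIG-5}_0$$

BIG-5+ and BIG-5E extend BIG-5; BIG-5+ adds 3954 [...] and 4082 [...]. The added ranges are:

$$\text{BIG-5}_3 = \{\, BIG_h\,BIG_l \mid \mathrm{0x81} \le BIG_h \le \mathrm{0x87},\ \mathrm{0x40} \le BIG_l \le \mathrm{0xFE} \,\}$$

$$\text{BIG-5}_4 = \{\, BIG_h\,BIG_l \mid \mathrm{0x8E} \le BIG_h \le \mathrm{0xA0},\ \mathrm{0x40} \le BIG_l \le \mathrm{0xFE} \,\}$$

$$\text{BIG-5E} = \text{BIG-5} \cup \text{BIG-5}_3 \cup \text{BIG-5}_4$$

In 2001 [...] (-2001) [4], written HKCS here, was [...] with 4818 characters.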
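HKCS occupies the following ranges (the second of them is BIG-5_4, defined with BIG-5E):

$$\text{BIG-5}_5 = \{\, BIG_h\,BIG_l \mid \mathrm{0xFA} \le BIG_h \le \mathrm{0xFE},\ \mathrm{0x40} \le BIG_l \le \mathrm{0xFE} \,\}$$

$$\text{BIG-5}_6 = \{\, BIG_h\,BIG_l \mid \mathrm{0x81} \le BIG_h \le \mathrm{0x8D},\ \mathrm{0x40} \le BIG_l \le \mathrm{0xFE} \,\}$$

$$\text{BIG-5}_7 = \{\, BIG_h\,BIG_l \mid \mathrm{0xC6} \le BIG_h \le \mathrm{0xC8},\ \mathrm{0xA1} \le BIG_l \le \mathrm{0xFE} \,\}$$

$$\mathrm{HKCS} = \text{BIG-5}_4 \cup \text{BIG-5}_5 \cup \text{BIG-5}_6 \cup \text{BIG-5}_7$$

HKCS is used together with BIG-5; the combined code is written

$$\mathrm{BIG\_HKCS} = \text{BIG-5} \cup \mathrm{HKCS}$$

BIG-5 is a subset of both BIG-5E and BIG_HKCS. BIG-5E and BIG_HKCS, by contrast, overlap in code space: lead bytes 0x81 to 0x87 belong to BIG-5_3 in BIG-5E and to BIG-5_6 in BIG_HKCS.

2.3 ISO10646

In 1984 the International Organization for Standardization (ISO) began work on the Universal Multiple-Octet Coded Character Set (UCS) [1], which became ISO/IEC 10646. Plane 0 of group 0 of 10646 is the Basic Multi-lingual Plane (BMP) [1]. The BMP contains the unified Han characters of China, Japan and Korea (CJK). Writing a two-byte code as low byte CJK_l followed by high byte CJK_h:

$$\mathrm{CJK1993} = \{\, CJK_l\,CJK_h \mid \mathrm{0x4E} \le CJK_h \le \mathrm{0x9F},\ \mathrm{0x00} \le CJK_l \le \mathrm{0xFF} \,\}$$

ISO published ISO/IEC 10646-1 in 1993 and a revised edition in 2000. The 2000 edition adds the range

$$\mathrm{CJKK} = \{\, CJK_l\,CJK_h \mid \mathrm{0x34} \le CJK_h \le \mathrm{0x4D},\ \mathrm{0x00} \le CJK_l \le \mathrm{0xFF} \,\}$$

so that the Han characters of ISO/IEC 10646-1:2000 are

$$\mathrm{CJK2000} = \mathrm{CJK1993} \cup \mathrm{CJKK}$$

2.4 [...]

The relationships among GB18030, BIG-5 (BIG-5E, BIG_HKCS) and 10646 are: (1) GB18030, BIG-5 and 10646 [...]; (2) BIG-5 [...] BIG-5E and BIG_HKCS [...] BIG-5E and HKCS [...]; (3) 10646 [...] GB18030, BIG-5E and HKCS [...] ISO [...] Unicode [...] 10646 [...].

3 Recognition algorithms

A text T ([...] ASCII [...]) is treated as a sequence of paragraphs P_i:

$$T = P_0 P_1 \ldots P_i \ldots P_n \quad (n \ge 0)$$

Paragraphs are delimited by the line break, which is 0x0D0A in ASCII and 0x000D000A in 10646 [...].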
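The code ranges of Section 2 are what every recognition algorithm tests against, so it may help to see them as predicates. The following is a minimal sketch rather than the authors' implementation: the function names are hypothetical, and the 10646 test assumes the low-byte-first order used in the CJK set definitions.

```python
def in_gb2312(hi: int, lo: int) -> bool:
    """Han characters of GB2312."""
    return 0xB0 <= hi <= 0xF7 and 0xA1 <= lo <= 0xFE


def in_gbk(hi: int, lo: int) -> bool:
    """Two-byte GBK range; the trail byte 0x7F is excluded."""
    return 0x81 <= hi <= 0xFE and 0x40 <= lo <= 0xFE and lo != 0x7F


def in_gb18030_k(b1: int, b2: int, b3: int, b4: int) -> bool:
    """Four-byte ranges GB18030_k1, GB18030_k2 and GB18030_k3."""
    if not 0x30 <= b4 <= 0x39:
        return False
    k1 = b1 == 0x82 and 0x30 <= b2 <= 0x34 and 0x81 <= b3 <= 0xFE
    k2 = b1 == 0x81 and b2 == 0x39 and 0xEF <= b3 <= 0xFE
    k3 = b1 == 0x82 and b2 == 0x35 and 0x81 <= b3 <= 0x87
    return k1 or k2 or k3


def _trail(lo: int) -> bool:
    """Trail-byte range shared by most BIG-5 sub-ranges."""
    return 0x40 <= lo <= 0xFE


def in_big5(hi: int, lo: int) -> bool:
    """BIG-5 = BIG-5_0 (symbols), BIG-5_1 and BIG-5_2 (Han characters)."""
    return _trail(lo) and (0xA1 <= hi <= 0xC6 or 0xC9 <= hi <= 0xF9)


def in_big5e(hi: int, lo: int) -> bool:
    """BIG-5E = BIG-5 plus BIG-5_3 and BIG-5_4."""
    extra = 0x81 <= hi <= 0x87 or 0x8E <= hi <= 0xA0
    return in_big5(hi, lo) or (_trail(lo) and extra)


def in_hkcs(hi: int, lo: int) -> bool:
    """HKCS = BIG-5_4, BIG-5_5, BIG-5_6 and BIG-5_7."""
    b4 = 0x8E <= hi <= 0xA0 and _trail(lo)
    b5 = 0xFA <= hi <= 0xFE and _trail(lo)
    b6 = 0x81 <= hi <= 0x8D and _trail(lo)
    b7 = 0xC6 <= hi <= 0xC8 and 0xA1 <= lo <= 0xFE
    return b4 or b5 or b6 or b7


def in_big_hkcs(hi: int, lo: int) -> bool:
    """BIG_HKCS = BIG-5 combined with HKCS."""
    return in_big5(hi, lo) or in_hkcs(hi, lo)


def in_cjk2000(lo: int, hi: int) -> bool:
    """CJK2000 = CJK1993 plus CJKK; note the low byte comes first."""
    return 0x34 <= hi <= 0x9F and 0x00 <= lo <= 0xFF
```

For example, the GB2312 bytes of "中" (0xD6 0xD0) satisfy `in_gb2312`, `in_gbk` and also `in_big_hkcs`, which illustrates why range tests alone cannot always separate the codes.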
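Each paragraph P_i (1 ≤ i ≤ n) is a byte sequence

$$P_i = B_0, B_1, \ldots, B_j, \ldots, B_m \quad (0 \le j \le m,\ B_j \text{ is the } j\text{-th byte})$$

The candidate codes are GB18030, BIG-5E, BIG_HKCS and 10646. The first three are ANSI codes [...]. The function ConvToISO(P, ANSICODE) converts a paragraph P from the ANSI code ANSICODE (GB18030, BIG-5E or BIG_HKCS) to 10646.

3.1 Internal Code Bound Recognition Algorithm

The procedure ICSCAN scans a paragraph P_i as follows:

(1) Initialise: j = 0; the counters C_i0 to C_i2, which count the characters of P_i that fall in GB18030, BIG-5 and 10646 respectively, are set to 0.
(2) If j + 3 > m, fewer than 4 bytes of P_i remain; go to (4). Otherwise let S = B_j B_{j+1} B_{j+2} B_{j+3}.
(3) If S ∈ GB18030_k, increase C_i0 by 2, increase j by 4, and go to (2).
(4) If j + 1 > m, fewer than 2 bytes of P_i remain; stop. Otherwise let S = B_j B_{j+1}.
(5) If S ∈ GBK, increase C_i0 by 1; if S ∈ BIG_HKCS, increase C_i1 by 1; if S ∈ CJK2000, increase C_i2 by 1. If none of the three holds, go to (6); otherwise increase j by 2 and go to (2).
(6) If B_j is an ASCII byte and B_{j+1} = 0, S is an ASCII character in 10646; increase C_i2 by 1, increase j by 2, and go to (2).
(7) If B_j is an ASCII byte and B_{j+1} [...] ASCII [...], increase j by 1 (skipping 1 ASCII character) and go to (2); otherwise increase j by 2 (skipping 2 bytes) and go to (2).

Scanning P_i yields C_i0 to C_i2. Normalising them by the length m of P_i gives CP_i0 to CP_i2 (0 ≤ CP_ij ≤ 1, 0 ≤ j ≤ 2). With a threshold between 0 and 1 [...]: if CP_i0 - CP_i1 [...] CP_i2 [...], P_i is GB18030 (GB2312, GBK); if CP_i2 - CP_i0 [...] CP_i1 [...], P_i is 10646; if CP_i1 - CP_i0 [...] CP_i2 [...], P_i is BIG-5E or BIG_HKCS.

CP_ij is computed from P_i alone, but the surrounding paragraphs P_1 ... P_(i-1) and P_(i+1) ... P_n also bear on the code of P_i. Let θ_ij be the degree to which P_i is in code j, and let α (0 < α < 1) be the weight given to the neighbouring paragraphs. Then

$$\theta_{ij} = (1-\alpha)\,CP_{ij} + \frac{\alpha}{2}\left(\theta_{(i-1)j} + \theta_{(i+1)j}\right) \quad (1 \le i \le n,\ 0 \le j \le 2)$$

with the boundary values CP_{0j} = 1 and CP_{(n+1)j} = 1 (0 ≤ j ≤ 2). Expanding the recursion, and keeping the leading contribution of each paragraph, gives

$$\theta_{ij} \approx (1-\alpha)\left[\, CP_{ij} + \sum_{k=1}^{i} \left(\frac{\alpha}{2}\right)^{k} CP_{(i-k)j} + \sum_{k=1}^{n+1-i} \left(\frac{\alpha}{2}\right)^{k} CP_{(i+k)j} \right]$$

so the influence of a paragraph decays geometrically with its distance from P_i. P_i is assigned the code j with the largest θ_ij.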
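3.2 Interpunction Recognition Algorithm

[...] punctuation marks [...]:

- the punctuation codes of GBK, GB18030 and GB2312 form the set S_GB [...];
- the punctuation codes of BIG-5 form the set S_BIG;
- the punctuation codes of 10646 form the set S_ISO;
- S_GB ∩ S_BIG ∩ S_ISO = ∅ [...];
- punctuation distinguishes BIG-5, GB and 10646 from one another, but not BIG-5E from BIG_HKCS.

For a paragraph P_i, a scan similar to ICSCAN counts the punctuation marks of P_i that belong to BIG-5, GB and 10646, giving SP_0 to SP_2, from which the code of P_i is decided.

3.3 Han Character Frequency Recognition Algorithm

[...] the 1000 most frequently used Han characters cover 90.4% of text and the 2500 most frequently used cover 97.97% [2] [...] [2] [...] P_i [...]. Three tables of high-frequency Han characters, ZP1, ZP2 and ZP3 ([...] 90% [...]), are built [...] from Internet [...]: the tables are stored in 10646 [...] 10000 [...]. For a paragraph P_i the procedure ICPSCAN, which is similar to ICSCAN, runs as follows, where w(S) stands for the weight that a table hit on S adds:

(1) Initialise: j = 0; the counters C_i0 to C_i5 are set to 0.
(2) If j + 3 > m, fewer than 4 bytes of P_i remain; go to (4). Otherwise let S = B_j B_{j+1} B_{j+2} B_{j+3}.
(3) If S ∈ GB18030_k, increase C_i0 by 2 ([...] 2 [...]), increase j by 4, and go to (2).
(4) If j + 1 > m, fewer than 2 bytes of P_i remain; stop. Otherwise let S = B_j B_{j+1}.
(5) If S ∈ GBK, convert S to 10646: S_1 = ConvToISO(S, GB18030). If ZP1 contains S_1, C_i0 = C_i0 + w(S_1); otherwise C_i0 = C_i0 + 1.
(6) If S ∈ BIG_HKCS, convert S to 10646: S_1 = ConvToISO(S, BIG_HKCS). If ZP2 contains S_1, C_i1 = C_i1 + w(S_1); otherwise C_i1 = C_i1 + 1.
(7) If S ∈ BIG-5E, convert S to 10646: S_1 = ConvToISO(S, BIG-5E). If ZP3 contains S_1, C_i2 = C_i2 + w(S_1); otherwise C_i2 = C_i2 + 1.
(8) If S ∈ CJK2000: if ZP1 contains S, C_i3 = C_i3 + w(S), otherwise C_i3 = C_i3 + 1; if ZP2 contains S, C_i4 = C_i4 + w(S), otherwise C_i4 = C_i4 + 1; if ZP3 contains S, C_i5 = C_i5 + w(S), otherwise C_i5 = C_i5 + 1.
(9) If none of the conditions in (5), (6), (7) and (8) holds, go to (10); otherwise increase j by 2 and go to (2).
(10) If B_j is an ASCII byte and B_{j+1} = 0, S is an ASCII character in 10646; increase C_i3, C_i4 and C_i5 by 1, increase j by 2, and go to (2).
(11) If B_j is an ASCII byte and B_{j+1} [...] ASCII [...], increase j by 1 (skipping 1 ASCII character) and go to (2); otherwise increase j by 2 (skipping 2 bytes) and go to (2).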
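The idea behind steps (5) to (8), scoring each candidate code by how many high-frequency characters it produces, can be sketched compactly. This is not ICPSCAN itself: instead of stepping through byte pairs it decodes the whole paragraph with Python's built-in codecs, which stand in for ConvToISO. The tiny tables, the weight 5 and the names `freq_score` and `best_code` are illustrative assumptions; Python has no BIG-5E codec, so that candidate is omitted.

```python
def is_han(ch: str) -> bool:
    """True for code points in the CJK2000 range (U+3400..U+9FFF)."""
    return 0x3400 <= ord(ch) <= 0x9FFF


def freq_score(p: bytes, codec: str, table: dict) -> int:
    """Score a paragraph under one candidate code.

    A character found in the high-frequency table adds its weight; any other
    Han character adds 1; undecodable bytes and non-Han characters add 0.
    """
    text = p.decode(codec, errors="replace")
    score = 0
    for ch in text:
        if ch in table:
            score += table[ch]
        elif is_han(ch):
            score += 1
    return score


def best_code(p: bytes, candidates: dict):
    """Return (label of the best-scoring candidate, all scores)."""
    scores = {label: freq_score(p, codec, table)
              for label, (codec, table) in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy stand-ins for the tables ZP1 (simplified usage) and ZP2 (traditional).
ZP1 = {"的": 5, "我": 5, "们": 5}
ZP2 = {"的": 5, "我": 5, "們": 5}

CANDIDATES = {
    "GB18030": ("gb18030", ZP1),
    "BIG_HKCS": ("big5hkscs", ZP2),
    "10646": ("utf-16-le", ZP1),
}
```

A call such as `best_code("我们的中国".encode("gb18030"), CANDIDATES)` picks "GB18030": the bytes decode to unrelated characters under the other two codecs, so no table hits occur there.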
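After ICPSCAN, the counter C_ij with the largest value gives the code of P_i: j = 0 means GB18030; j = 1 means BIG_HKCS; j = 2 means BIG-5E; j = 3, j = 4 and j = 5 mean 10646 text that follows the usage of GB18030, BIG_HKCS and BIG-5E respectively.

3.4 Semantic Recognition Algorithm

[...] words [...]. Let R_i denote the relevance of paragraph P_i [...]. Word dictionaries CZ1, CZ2 and CZ3 are built for GB18030, BIG-5E and BIG_HKCS and are stored in 10646 [...]. The function

int GetRel(Q_i, CZ)

takes a paragraph Q_i in 10646 and a dictionary CZ (one of CZ1, CZ2, CZ3) and returns the relevance of Q_i to CZ:

(1) Let the 10646 paragraph be Q_i = U_0, U_1, ..., U_j, ..., U_m (0 ≤ j ≤ m); set j = 0 and R = 0.
(2) If j ≥ m, return R. Otherwise let S = U_j, ..., U_m.
(3) Look up in CZ the words that begin with U_j U_{j+1}. If a word C is found that matches U_j, ..., U_k with (j + 3) ≤ k ≤ m ([...] 2 [...]), let Len = (k - j + 1) / 2, set R = R + Len and j = k + 1, and go to (2).
(4) Set j = j + 1 and go to (2).

For a paragraph P_i, P_i is first converted to Unicode (no conversion is needed where P_i is taken to be Unicode already), and GetRel is then applied under each hypothesis:

(1) P_i is GB18030: R_i0 = GetRel(Convert(P_i, GB18030), CZ1);
(2) P_i is BIG-5E: R_i1 = GetRel(Convert(P_i, BIG-5E), CZ2);
(3) P_i is BIG_HKCS: R_i2 = GetRel(Convert(P_i, BIG_HKCS), CZ3);
(4) P_i is 10646 with GB18030 usage: R_i3 = GetRel(P_i, CZ1);
(5) P_i is 10646 with BIG-5E usage: R_i4 = GetRel(P_i, CZ2);
(6) P_i is 10646 with BIG_HKCS usage: R_i5 = GetRel(P_i, CZ3).

The largest R_ij (0 ≤ j ≤ 5) determines the code of P_i.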
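The dictionary matching inside GetRel can be sketched as below. The sketch works on decoded Python strings rather than on 10646 byte pairs, so a match length is already a character count and no division by 2 is needed; it prefers the longest dictionary word at each position, which is an assumption, and the function name `get_rel` and the toy dictionaries are hypothetical.

```python
def get_rel(q: str, cz: set) -> int:
    """Relevance of paragraph q to dictionary cz.

    Scans q left to right. Whenever a dictionary word of at least two
    characters starts at the current position, its length is added to the
    score and the scan jumps past it; otherwise the scan moves on by one
    character.
    """
    max_len = max((len(w) for w in cz), default=0)
    r, j = 0, 0
    while j < len(q):
        # Try the longest candidate first, down to two characters.
        for length in range(min(max_len, len(q) - j), 1, -1):
            if q[j:j + length] in cz:
                r += length
                j += length
                break
        else:
            j += 1          # no word starts here
    return r


# Toy stand-ins for CZ1 (simplified usage) and CZ2 (traditional usage).
CZ1 = {"计算机", "信息", "处理"}
CZ2 = {"計算機", "資訊", "處理"}
```

With these toy dictionaries, `get_rel("中文信息处理", CZ1)` counts the words "信息" and "处理", while the same paragraph scores nothing against `CZ2`, so the simplified-usage hypothesis wins.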
4 Evaluation of the algorithms

[...] BIG-5, GB18030 and 10646 [...] BIG-5E and BIG_HKCS [...] 10646 [...] Internet [...]. The results are:

(1) [...]; [...] 30% [...];
(2) [...] 10 [...] 99.9% [...]; [...] 10 [...];
(3) [...];
(4) [...];
(5) [...] 50% [...] 2 [...] 10 [...] 99% [...] 90% [...];
(6) [...] 10 [...].

[...] Windows, Linux and Unix [...] Unicode (10646) [...] Windows [...] Unicode and GBK [...] Windows [...] Unicode and BIG-5 [...] Microsoft Internet Explorer [...] HTML [...] Microsoft Word [...] NJWIN [...] 10646 [...].

5 Conclusion

[...] 10646 [...].

References:
[1] International Organization for Standardization (ISO). Universal Multiple-Octet Coded Character Set (UCS) [S]. International Standard, Ref. No. ISO/IEC 10646-1:1993(E) / 10646-1:2000(E) / 10646-2:2001(E).
[2] [...] [M]. [...], 1997, 5-6.
[3] [...]. ISO/IEC 10646-1 and Unicode [R]. Character Code & Data To Come, 1996.
[4] Unicode, www.unicode.org/versions/Unicode4.0.0/appc.pdf [EB], 2003.