25 3 2009 5 ( ) Journal of Fujian N o rm al U niversity (N atural Science Edition) V o l125 N o13 M ay 2009 : 100025277 (2009) 0320039208 W eb 1, 2 (11, 350108; 21, 361005) : W eb. H TM L, H TM L,,.,,,., W eb. : W eb ; W eb ; : T P3111131 : A W eb Information Extraction Based on Tree Structure REN Zhong- sheng 1, XUE Y ong- sheng 2 (11S chool of M athem atics and Com p u ter S cience, F uj ian N orm al U niversity, F uz hou 350108, Ch ina; 21D ep artm ent of Com p u ter S cience, X iam en U niversity, X iam en 361005, Ch ina) Abstract: It p ropo ses tree structu re based W eb data ex traction algo rithm in view of the inadequacies of the ex isting m ethods. T he tree structu re based algo rithm includes: the algo2 rithm of H TM L tree con struction, the algo rithm of data region m in ing, the algo rithm of data reco rd m in ing, and the algo rithm of reco rd schem a generation. T he algo rithm clean s thew eb pages u sing the po sition info rm ation of page elem en ts, m ines data region by h ierarch ical clu stering, and generates reco rd schem a fin ish ing data item ex traction th rough tree m atch2 ing. Experim en tal resu lts show that ou r algo rithm can imp rove the accu racy and efficiency of W eb data ex traction. Key words: W eb data ex traction; W eb m in ing; info rm ation ex traction H TM L.,, W eb.,, H TM L,,,.,,, W eb. W eb, W eb. H TM L,,,, ( ),. : 2008211225 : ( 50474033) ; (A 0310008) ; (2003H 043) : (1981 ),,,,, :.
40 ( ) 2009 1 W eb 111. 4 : (1) H TM L : H TM L, H TM L. (2) : H TM L,, (3) :,,. (4) : H TM L,,., W eb,. 112 HTML, H TM L,,.,., [1 ]. 113 HTML 11311 H TM L,,.,,,. [ 2 ] W eb W eb,. 4.,,, IBD F,,. [ 3 ] D SE,, ;,,,. [ 4-5 ],,.,, H TM L.,, 1 R egion3 ( www 1joyo1com ).,. 1 H TM L, : (1) n1, n2 Paren t (n 1, n2), n1 n 2. (2) n 1, n2 A ncesto r (n 1, n 2), Paren t (n1, n2) ϖ n3 (Paren t (n 1, n 3) A ncesto r (n3, n2) ). 2 ( A ncesto rset) n1, A n2 cesto rset (n 1) = {n} A ncesto r (n, n1). 3 ( ) H TM L, n 1, n2,,, 1 HTML..,
3 : W eb 41. 11312 H TM L, H TM L, ( ) H TM L,.,,,,, 2.,,.,, H TM L.,,., [ 6 ] Sep, O cq. : : 1 DRM in ing sim ilarity D atar egion A lgo rithm : DRM in ing (sim ilarity) (1) m in = m in (sim ilarity); (2) m ax = m ax (sim ilarity); (3) O cq = 2; (4) fo r (j= m in; j< = m ax; j+ + ) (5) output= new V ecto r (); (6) m intemp = new V ecto r (); (7) m axtemp = new V ecto r (); (8) m axc luster= true; (9) m inc luster= true; (10) fo r (i= 0; i< sim ilarity1l ength; i+ + ) (11) tmp= sim ilarity1get (i); (12) if (tmp> = j) (13) if ( (m intemp 1L ength! = 0) && (m axc luster) ) (14) output1add (m intemp ); (15) if (m axc luster) (16) m axtemp = new V ecto r (); (17) m axtemp1add (tmp); (18) m axc luster = false; (19) m inc luster = true; (20) else (21) if ( (m axtemp 1L ength! = 0) && (m inc luster) ) (22) output1add (m axtemp ); (23) if (m inc luster) (24) m intemp = new V ecto r (); (25) m intemp 1add (tmp); (26) m inc luster = false; (27) m axc luster = true; (28) update output; (29) calculate current curo cq; (30) if (curo cq< O cq) (31) O cq= curo cq; (32) result= output; 2
42 ( ) 2009 (33) D atar egion= result s largest sub2cluster; (34) return D atar egion; sim ilarity m in m ax, [m in, m ax ],, O cq h. h, sim ilarity, h, sim ilarity [ i- 1 ] h, sim ilarity [ i ] h,,, sim ilarity [ i- 1 ] h, sim ilarity [ i ] h,. O cq,,.. DRM in ing. (, Sep ) : S = [ 6, 7, 8, 7, 6, 9, 7, 8, 7, 6, 8, 7, 6, 8, 7, 6, 9, 8 ], m in (S ) = 6, m ax (S ) = 9, h= 6, S : {6, 7, 8, 7, 6, 9, 7, 8, 7, 6, 8, 7, 6, 8, 7, 6, 9, 8}; h= 7, S : {6} {7, 8, 7} {6} {9, 7, 8, 7} {6} {8, 7} {6} {8, 7} {6} {9, 8}; h= 8, S : {6, 7} {8} {7, 6} {9} {7, 8, 7, 6} {8} {7, 6} {8} {7, 6} {9} {8}; h= 9, S : {6, 7, 8, 7, 6} {9} {7, 8, 7, 6, 8, 7, 6, 8, 7, 6} {9} {8}. : Sep (6) = na; Sep (7) = 01694 075 827 761 916 8; Sep (8) = 014950336703414039; Sep (9) = 0138543464816368556. h= 9, {7, 8, 7, 6, 8, 7, 6, 8, 7, 6}.. S {6, 7, 8, 7, 6, 9, 7, 8, 7, 6, 8, 7, 6, 8, 7, 6, 9, 8},, {8, 7, 6, 8, 7, 6, 8, 7, 6} ( 3 ),.,. 4 2. 3 S 114 HTML 11411 4, R eco rdm in ing.. 4 ( ) H TM L, : (1) H TM L, < tag> < gtag>. (2) H TM L,,. (3) H TM L.,,. 11412,,. H TM L,, 2 R eco rdm in ing.
3 : W eb 43 : D atar egion, leafpath : R eco rdset A lgo rithm : R eco rdm in ing (D atar egion, leafpath) (1) m = first node index in D atar egion; (2) M = last node index in D atar egion; (3) T = last node in leafpath [m ], leafpath [m + 1 ], leafpath [M ] p refix; (4) flag= true; (5) t= T; (6) w h ile (flag) (7) ch ildn ode= tπs ch ild node of first level; (8) count= ch ildn ode1size; (9) m ax= 0; (10) fo r (i= 0; i< count; i+ + ) (11) num = the num ber of ch ild node of ch ildn ode [ i]; (12) if num > m ax (13) idx= i; (14) m ax= num ; (15) tmp= t1ch ildn ode [ idx ]; (16) if the num ber of tmpπnodes gthe num ber of tπnodes > facto r (17) t= tmp; (18) else flag= false; (19) R eco rdset= tπs ch ild node of first level; (20) tag= tag nam e appears mo st frequently in R eco rdset; (21) avgsize= the num ber of tπs ch ild nodes gr eco rdset1l ength; (22) fo r each elem ent in R eco rdset (23) if (elem em tπs tagnam e! = tag) (24) (the num ber of elem entπs ch ild nodes< avgsizeg2) (25) delete elem ent from R eco rdset; (26) return R eco rdset;, ( 1-3 ), ( 4-18 ).,,.,,,,.,,,, ( ),. facto r ( 16-18 )., ( 19-25 ).,, H TM L,, ;,,,,,..., 115.,,,. H TM L, ( ),
44 ( ) 2009. [7 ].,,,,, :. 3 R eco rdschem agen H TM L S : schem a, tab le A lgo rithm R eco rdschem agen (S) (1) get the tree w ith the m axim um num ber of nodes as the seed tree; (2) count= 0; (3) w h ile (S is no t emp ty) and (count< m axcount) (4) t= get one tree from S in sequence; (5) m atch (seed, t); (6) if (all item s of t are m atched w ith co rresponding item s of seed) (7) label co rresponding nodes; (8) delete t from S; (9) else (10) tempo rarily delete t; (11) if (S is emp ty) (12) if (S has nodes tempo rarily deleted) (13) add nodes to S; (14) count+ + ; (15) generate reco rd schem a; (16) generate table acco rding to schem a; (17) fo r each tree in S (18) fo r each item in current tree (19) if (item in schem a) (20) insert item into table; (21) else (22) add co lum n to table; (23) insert item into co lum n; (24) update schem a; (25) return table and schem a;,, ( 1 ).,., schem a,, ( 3-25 ).,,, -,,,, ( 10-14 ), m axcoun t.,,,, ( 15-24 )., O (gsg). 5 m atch,.,,.,. (1),.
3 : W eb 45 (2),. H TM L (, ). 1: < table> 2: < tr> 3: < td> < table> 4: < tr> 5: < td> 1< td> 6: < td> 1< td> 7: < gtr> 8: < tr> 9: < td> 1< td> 10: < td> 1< td> 11: < gtr> 12: < gtable> < gtd> 13: < gtr> 14: < tr> 15: < td> < table> 16: < tr> 17: < td> 2< gtd> 18: < td> 2< td> 19: < gtr> 20: < tr> 21: < td> 2< td> 22: < td> 2< td> 23: < gtr> 24: < gtable> < gtd> 25: < gtr 26: < gtable> H TM L H TM L (1, 1, 1, 1) (2, 2, 2, 2).,,,.,. 2,. M 4600, W indow s 2003 Sever, JBuilder2006. [ 8 ] R ecall P recision. 1. 1 T P FN FP R ecall P recision 1 20 20 0 0 100% 100% 2 24 NA NA NA NA NA 3 16 NA NA NA NA NA 4 10 10 0 0 100% 100% 5 30 24 6 0 80% 100% 6 16 16 0 0 100% 100% 7 15 NA NA NA NA NA 8 10 10 0 0 100% 100% 9 40 40 0 0 100% 100% 10 13 13 0 0 100% 100% 1. [ 1 ] 10. R ecall P recision,. NA., 2, 3, 7. 3, 16, 16 4 4,, 4, 4, 4, 4.,,, 4.,.,.
46 ( ) 2009,.., 3, W eb,,,,,,,.,. : [1 ]. H TM L [J ]., 2009, 25 (1): 60-61. [2 ] Sandip D ebnath, P rasenjit M itra, N irm al Pal, et al. A utom atic identification of info rm ative sections of w eb pages [J ]. IEEE transactions on know ledge and data engineering, 2005, 17 (9): 1233-1246. [ 3 ] J iying W ang, F red H L ochovsky. D ata2rich section extraction from H TM L pages [C ] P roceedings of the 3rd International Conference on W eb Info rm ation System s Engineering (W ISEπ02), 2002: 313-322. [4 ] L an Y i, B ing L iu, X iao li L i. E lim inating no isy info rm ation in w eb pages fo r data m ining [C ] P roc N inth A CM S IGKDD Intπl Conf. Know ledge D iscovery and D ata M ining, 2003: 296-305. [5 ] L an Y i, B ing L iu. W eb Page C leaning fo r w eb m ining th rough feature w eigh ting [C ] P roceedings of E igh teenth International Jo int Conference on A rtificial Intelligence. A capulco, M exico, 2003: 9-15. [ 6 ] J i H e, A h2hw ee T an, Chew 2L im T an, et al. O n quantitative evaluation of clustering system s [J ]. Info rm ation R e2 triveal A nd C lustering, 2002: 105-134. [ 7 ]W uu Yang. Identifying syntactic differences betw een tw o p rogram s [J ]. Softw are2p ractice and Experience, 1991, 21 (7): 739-755. [ 8 ]R aghavan V V,W ang G S, Bo llm ann P. A critical investigation of recall and p recision as m easures of retrieval system perfo rm ance [J ]. A CM T rans Info rm ation System s, 1989 (3): 205-229. ( : )