Journal of Jilin University (Engineering and Technology Edition)
Vol. 39 No. 5, Sept. 2009

CLC number: TP311.13    Document code: A    Article ID: 1671-5497(2009)05-1326-05

News content extraction based on block distribution

QIU Jiang-tao 1,2, TANG Chang-jie 2, LI Chuan 2, ZHU Jun 3
(1. School of Economic Information Engineering, Southwestern University of Finance and Economics, Chengdu 610075, China; 2. College of Computer Science, Sichuan University, Chengdu 610065, China; 3. National Center for Birth Defects Monitoring, Chengdu 610041, China)

Abstract: An approach to extracting news content automatically from news web pages is proposed. Unlike existing methods, this approach first determines whether a web page contains news content and then extracts that content without using a DOM-Tree or a template. A new concept, the Block, is introduced, and in a single traversal the approach partitions a web page into main content blocks and noise blocks. Furthermore, the concept of Web Page Block Distribution is introduced and the features of the Block Distribution are investigated. The Block Distribution can effectively determine whether a web page contains news content. Experiments show that the approach is effective in extracting news content.

Key words: computer application; Web content extraction; block distribution; Web mining

Related work on Web pages is reported in [1-2] and [3-5], and DOM-Tree based approaches in [6-7].

Received date: 2008-01-08.
Foundation items: 2006BAI05A01; 60773169; 06036.
Author: QIU Jiang-tao (1972-). E-mail: jiangtaoqiu@google.com
Corresponding author: TANG Chang-jie (1946-). E-mail: tangchangjie@cs.scu.edu.cn
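In contrast to DOM-Tree based methods, the approach proposed here extracts news content without building a DOM-Tree or using a template.

Definition 1 (Block). Let S be the set of substrings of an HTML document that are delimited by a matching tag pair, s = {<TAG>, ..., </TAG>}, and let < be the nesting relation on S, where sj < si means that sj is nested inside si. For si ∈ S and the substrings sj ∈ S with sj < si, the block of si is (si - sj), that is, si with its nested substrings removed; if no sj ∈ S satisfies sj < si, then sj = ∅. A block b of the block set B consists of the tag pair <TAG> </TAG> and its content c. For blocks bi and bj obtained from si, sj ∈ S with si < sj, bi is a sub-block of bj.

Definition 2 (Block-List). Given a block set BSet, a Block-List is a list built from BSet in which every element corresponds to a block b ∈ BSet. Each element of the Block-List is a pair <t, c>, where t is the tag and c is the content.

HTML tags take several forms. Most are paired, <TAG> ... </TAG>, for example <FORM> ... </FORM>; some, such as <br> and <hr>, stand alone.

Property 1. Let G, D and J be three classes of tags, S an HTML document, and BSet the block set obtained from S under (G, D, J). Then for all b1, b2 ∈ BSet with b1 ≠ b2, b1 ∩ b2 = ∅.

The three classes are: tags such as <TABLE>, <TR>, <TD> and <DIV>, which delimit blocks; tags such as <FONT> and <SPAN>, which are neglected; and tags such as <STYLE>, <SCRIPT> and <A>, which are jumped over. When a block b2 is nested inside a block b1, the content of b2 does not belong to the content of b1. For example, in the HTML fragment <DIV> hello <TABLE> 123 </TABLE> world </DIV>, the content of the <DIV> block is "hello world" and the content of the <TABLE> block is "123".

Property 2. Let f be an HTML document and st a stack;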
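while f is traversed, every <TAG> triggers push(st) and every </TAG> triggers pop(st). When eof(f) = TRUE, st.empty() = TRUE.

By Property 2, the blocks of an HTML document can be collected in a single traversal with an auxiliary stack, so no DOM-Tree has to be built. Algorithm 1 gives the procedure.

Algorithm 1 (ExtractBlock)
Input: HTML file f
Output: block list BL
Steps:
    s = build_aid_stack(); BL = build_block_list();
    while (NOT EOF of f) {
        tag = getnextTag();            // read the next tag
        content = getcontent();        // text read before this tag
        block = getTop();              // block on top of the stack
        insert(content, block);        // append the text to that block
        if (isneglect(tag)) continue;  // neglected tag: nothing else to do
        if (isjump(tag))               // jump tag: skip over it
            { jump(); continue; }
        if (isopenTag(tag)) {          // opening tag: start a new block
            block = new Block(tag);
            insert(BL, block);         // record the block in the list
            push(s, tag, block);       // the new block becomes current
        } else                         // closing tag: leave the current block
            pop(s);
    }

Property 3. Let f be an HTML document, t1 the time needed to build the Block-List of f, and t2 the time needed to build the DOM-Tree of f. Then t1 < t2.

Fig. 1 Non-main content page and main content page
Fig. 1(a) and Fig. 1(b) contrast the two kinds of page.

Definition 3 (Block distribution). Let BL be the block list produced by Algorithm 1, o an element of BL, and c the content of o. For every o ∈ BL, the size n of o.c is computed, which gives {n1, ..., nk}. D = (n1, ..., nk) is the block distribution of the page, where every ni ∈ N and ni is the i-th element of D. Fig. 2 plots the block distribution curves of 5 pages.

Fig. 2 Block distribution curves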
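The stack-based traversal of Algorithm 1 and the block distribution of Definition 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it relies on the standard-library `html.parser.HTMLParser`; the tag sets `BLOCK_TAGS` and `JUMP_TAGS` contain only the example tags named in the text; the `root` block that collects text outside any block tag is a hypothetical addition; and the size of a block is taken to be the character length of its content, which the source does not confirm.

```python
from html.parser import HTMLParser

# Assumed tag classes, limited to the examples given in the text.
BLOCK_TAGS = {"table", "tr", "td", "div"}  # open a new block
JUMP_TAGS = {"style", "script", "a"}       # tag and enclosed text are skipped
# Any other tag (for example font, span, br) is neglected:
# its text simply stays in the current block.


class Block:
    """One Block-List element: a tag t and its content c."""

    def __init__(self, tag):
        self.tag = tag
        self.content = []

    def text(self):
        return " ".join(self.content)


class BlockExtractor(HTMLParser):
    """Collects blocks in one pass with an auxiliary stack (no DOM tree)."""

    def __init__(self):
        super().__init__()
        root = Block("root")      # hypothetical holder for top-level text
        self.block_list = [root]
        self.stack = [root]
        self.jump_depth = 0       # > 0 while inside a jump tag

    def handle_starttag(self, tag, attrs):
        if tag in JUMP_TAGS:
            self.jump_depth += 1
        elif tag in BLOCK_TAGS and self.jump_depth == 0:
            block = Block(tag)
            self.block_list.append(block)
            self.stack.append(block)

    def handle_endtag(self, tag):
        if tag in JUMP_TAGS:
            self.jump_depth = max(0, self.jump_depth - 1)
        elif (tag in BLOCK_TAGS and self.jump_depth == 0
              and self.stack[-1].tag == tag):
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.jump_depth == 0:
            # Text always goes to the innermost open block only,
            # so the contents of two blocks never overlap.
            self.stack[-1].content.append(text)


def extract_blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.block_list


def block_distribution(blocks):
    """D = (n1, ..., nk): content size of each block, in document order."""
    return [len(b.text()) for b in blocks]
```

For the fragment `<div>hello<table>123</table>world</div>`, this sketch assigns "hello world" to the `div` block and "123" to the `table` block, in line with the example given for nested blocks. Keeping a depth counter for jump tags, rather than pushing them on the stack, is a design choice of the sketch so that nothing inside a skipped region can open a block.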
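For the measure Dev(D) of a block distribution D: let D1 = (n1, ..., nm) and D2 = (n1 + k, ..., nm) with k > 0, so that D2 differs from D1 only in its first element, which is larger by k; then Dev(D1) > Dev(D2).

Definition 4. Let D = (n1, ..., nk) be a block distribution, where ni is the i-th element of D. A quantity is defined for each index i (i = 1, ..., k - 1), and the corresponding measure of D is the maximum of these k - 1 quantities minus their minimum.

Experiments use three datasets. Dataset1 holds 543 pages collected from 4 sources, of which 220 contain news content and 323 do not. Dataset2 comes from the CCT2006 collection (http://www.cwirf.org) and holds 1200 pages. Dataset3 holds 184 pages. The experiments were run on an Intel P 2.6 G machine with 512 M of memory, and the programs were written in Java.

Experiment 1 compares the time needed to build the Block-List with the time needed to build the DOM-Tree on Dataset2 (Fig. 3). Over the 1200 pages, the reported difference between Block-List and DOM-Tree construction is 30 s, which agrees with Property 3.

Fig. 3 Analysis of time performance

Experiment 2 evaluates classification on Dataset1 and Dataset2 with the Naive Bayes (NB), KNN and ADTree classifiers of Weka; Dataset1 is evaluated with 10-fold cross-validation. Table 1 reports the accuracy of Experiments 2.1 to 2.4.

Table 1 Comparison of classifying performance / %

            2.1     2.2     2.3     2.4
NB          96.9    88.5    79.4    84.6
ADTree      98.3    93.3    85.9    89.5
KNN         95.6    86.3    80.5    81.0

Accuracy = (number of correctly classified pages) / (total number of pages)

In Experiment 2.1, run on Dataset1, ADTree performs best with an accuracy of 98.3%. In Experiment 2.2, NB, ADTree and KNN are trained on Dataset1 and tested on Dataset2 (column 2.2 of Table 1); 81 pages of Dataset2 are misclassified by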
ADTree, falling into three groups of 18, 10 and 53 pages.

Experiments 2.3 and 2.4 likewise train NB, ADTree and KNN on Dataset1 and test them on Dataset2. As Table 1 shows, every classifier reaches a lower accuracy in Experiments 2.3 and 2.4 than in Experiment 2.2.

Experiment 3 extracts news content from the 220 news pages of Dataset1 and from Dataset3, comparing the proposed method (BD) with TSReC [3] and K-Feature Extractor (K-F) [8]. Table 2 reports the results.

Table 2 Comparison of the accuracy of algorithms / %

            Dataset1    Dataset3
TSReC       26.8        98.9
BD          97.7        97.2
K-F         85.3        88.6

On Dataset3, TSReC achieves the highest accuracy, 98.9%, while BD fails on 5 pages and reaches 97.2%. On Dataset1, TSReC extracts only 59 pages correctly, an accuracy of 26.8%, whereas BD again fails on only 5 pages and reaches 97.7%. The accuracy of K-F is lower by 8% to 11%.

References:
[1] Yi L, Liu B. Web page cleaning for Web mining through feature weighting [C] // International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, 2003.
[2] Yin X, Lee W S. Using link analysis to improve layout on mobile devices [C] // The 13th World Wide Web Conference (WWW 2004), New York, US, 2004.
[3] Li Yu, Meng Xiao-feng, Li Qing, et al. Hybrid method for automated news content extraction from the Web [C] // Web Information System and Engineering (WISE 2006), Wuhan, China, 2006.
[4] Geng Hua, Gao Qiang, Pan Jin-gui. Extracting content for news Web pages based on DOM [J]. International Journal of Computer Science and Network Security, 2007, 7(2): 124-129.
[5] Yi L, Liu B, Li X. Eliminating noisy information in Web pages for data mining [C] // ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Washington, DC, USA, 2003.
[6] Wang Qi, Tang Shi-wei, Yang Dong-qing, et al. DOM based automatic extraction of topical information from Web pages [J]. Journal of Computer Research and Development, 2004, 41(10): 1786-1792.
[7] Chen J, Zhou B, Shi J, et al. Function-based object model towards website adaptation [C] // The 10th World Wide Web Conference, Hong Kong, China, 2001.
[8] Lin S H, Ho J M. Discovering informative content blocks from Web documents [C] // ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Edmonton, Canada, 2002.