Automatic extraction of bibliography with machine learning

Takeshi ABEKAWA, Hidetsugu NANBA, Hiroya TAKAMURA, Manabu OKUMURA

Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, abekawa@lr.pi.titech.ac.jp
Faculty of Information Sciences, Hiroshima City University, nanba@its.hiroshima-cu.ac.jp
Precision and Intelligence Laboratory, Tokyo Institute of Technology, {takamura,oku}@pi.titech.ac.jp

Abstract

In this paper, we propose a method for extracting bibliographic information using support vector machines. We use visual and linguistic features to extract the bibliographic information of a paper, and use field order to extract reference information. Our method achieves high-precision extraction.

1
WWW CD-ROM e-print archive (http://arxiv.org/) WWW CiteSeer (Research Index) [3] WWW WWW PRESRI [7] (http://peter.pi.titech.ac.jp:8000/) WWW

1 DB ( ) ( ) 2 3,4 5

2
Papers are collected from the WWW and CD-ROMs as PS or PDF files (Figure 1). PS files are converted to PDF with ps2pdf from Ghostscript (http://www.cs.wisc.edu/~ghost/), and PDF files are converted to XML with pdftohtml (http://pdftohtml.sourceforge.net/).
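pdftohtml PDF PDF ffi fl Introduction References PDF PDF

Figure 1: (labels: WWW or CD-ROM, PS, GS, PDF, pdftohtml, XML)

3
1 1 [2] [6] (HMM) [5] 1 HMM HMM (SVM) SVM SVM

3.1 SVM
An SVM is a binary classifier. The training data consist of pairs $(x_i, y_i)$ $(0 < i < n)$ of a feature vector $x_i$ and a class label $y_i$. The separating hyperplane is

$$w \cdot x + b = 0.$$

The SVM chooses the hyperplane that maximizes the margin ($1/\|w\|$) between the hyperplane and the nearest training examples, the support vectors (SV).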
By mapping each $x$ to $\phi(x)$, the SVM can also learn nonlinear decision boundaries. We use the polynomial kernel of degree $d$:

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^d.$$

3.2
1 1 1 12

Table 1: Tags
TITLE, TITLE_E
AUTHORS, AUTHORS_E
AFFILIATION, AFFILIATION_E
ABSTRACT, ABSTRACT_E
KEYWORD, KEYWORD_E
EMAIL
OTHER

1 E AFFILIATION FAX URL EMAIL 1 AFFILIATION OTHER

<TITLE>...</TITLE>
<AUTHORS>...</AUTHORS>
<ABSTRACT> </ABSTRACT>
<ABSTRACT>...</ABSTRACT>
<ABSTRACT>...</ABSTRACT>
<ABSTRACT>...</ABSTRACT>
<ABSTRACT>...</ABSTRACT>
<ABSTRACT>...</ABSTRACT>
<TITLE_E>Automatic extraction o...</TITLE_E>
<AUTHORS_E>Takeshi ABEKAWA...</AUTHORS_E>
<ABSTRACT_E>Abstract</ABSTRACT_E>
<ABSTRACT_E>In this paper, we...</ABSTRACT_E>
<ABSTRACT_E>We use visual and...</ABSTRACT_E>
<ABSTRACT_E>In this paper, we...</ABSTRACT_E>
<ABSTRACT_E>extracting refere...</ABSTRACT_E>

3.3
pdftohtml ( ) $(0 \le x \le 1)$ $(0 \le x \le 1)$ $(0 \le x \le 1)$ 1 0 1 0
Each value is encoded as a 0/1 vector. With 5 divisions, a value in the 3rd division is encoded as (0,0,1,0,0) and a value in the 5th as (0,0,0,0,1). 1 0
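The 0/1 encoding of Section 3.3 and the polynomial kernel of Section 3.1 can be illustrated together. The following is a minimal sketch, not the authors' implementation: the function names and the equal-width binning in `to_bin` are assumptions made for illustration, since the source only states that values lie in $0 \le x \le 1$ and gives the two example vectors.

```python
from typing import List


def to_bin(x: float, n_bins: int) -> int:
    """Map a normalized value 0 <= x <= 1 to a 1-based bin index.

    Equal-width bins are an assumption; the source does not say how
    the divisions are chosen.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be in [0, 1]")
    return min(int(x * n_bins) + 1, n_bins)


def one_hot(bin_index: int, n_bins: int) -> List[int]:
    """Encode a 1-based bin index as a 0/1 vector of length n_bins."""
    if not 1 <= bin_index <= n_bins:
        raise ValueError("bin_index must be in 1..n_bins")
    return [1 if k == bin_index else 0 for k in range(1, n_bins + 1)]


def poly_kernel(xi: List[int], xj: List[int], d: int = 2) -> int:
    """Polynomial kernel K(xi, xj) = (xi . xj + 1)^d."""
    dot = sum(a * b for a, b in zip(xi, xj))
    return (dot + 1) ** d


third = one_hot(3, 5)   # the 3rd of 5 divisions
fifth = one_hot(5, 5)   # the 5th of 5 divisions
same = poly_kernel(third, third)       # identical vectors
different = poly_kernel(third, fifth)  # orthogonal vectors
```

With 0/1 vectors the dot product counts shared active features, so a degree-2 kernel implicitly also scores pairs of features that co-occur in both lines.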
abstract Keyword 2 12 {0, 1} 2

Table 2: [A-Za-z] [0-9] [ -] [ -;:[]{}&/ ] [, () ] @.,

3.4
3 () WWW 945 1 5
SVM YamCha (http://cl.aist-nara.ac.jp/~taku-ku/software/yamcha/) YamCha YamCha 100
d = 2, -4..+4
4 3 A B A+B
2 F-measure (Recall, Precision, β = 1) 1 1 4

3.5
TITLE ABSTRACT ABSTRACT ABSTRACT ABSTRACT KEYWORD ABSTRACT KEYWORD KEYWORD EMAIL @ A+B SVM
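Section 3.4 evaluates with the F-measure computed from Recall and Precision with β = 1. As a reminder of how that score is obtained, here is a minimal sketch using the standard definition; the function name and the count-based interface are assumptions for illustration, and the example counts are made up, not taken from the experiments.

```python
def f_measure(n_correct: int, n_predicted: int, n_gold: int, beta: float = 1.0) -> float:
    """Standard F-measure from counts.

    n_correct:   items tagged correctly by the system
    n_predicted: items the system tagged with this label
    n_gold:      items carrying this label in the reference data
    """
    if n_correct == 0 or n_predicted == 0 or n_gold == 0:
        return 0.0
    precision = n_correct / n_predicted
    recall = n_correct / n_gold
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only: precision 0.8, recall 0.5.
example = f_measure(n_correct=8, n_predicted=10, n_gold=16)
```

With β = 1 the score is the harmonic mean of Precision and Recall, so a tag cannot score well by being strong on only one of the two.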
Table 3: Data

Source                                                 Papers  English  Japanese  Refs (English)  Refs (Japanese)
Association for Computational Linguistics (ACL2003)        65       65         0             150                0
Computational Linguistics (COLING2002)                    140      140         0             150                0
[…] 2003                                                  150        8       142             223              147
[…] 65 (2003)                                             177        1       176             150              236
[…] 17 (2003)                                             208        5       203             152              244
[…] 146 155                                                98        2        96             150              232
WWW                                                       107       73        34             147               96
Total                                                     945      294       651            1122              955

Table 4: Results for feature sets A, B, and A+B (F: F-measure; Paper: per-paper score)

Tag             Lines  Papers   A: F  A: Paper   B: F  B: Paper  A+B: F  A+B: Paper
TITLE           1,215     945  0.962     0.959  0.900     0.884   0.976       0.972
AUTHORS         1,661     940  0.870     0.817  0.835     0.767   0.931       0.899
AFFILIATION     2,124     882  0.838     0.821  0.876     0.805   0.935       0.906
EMAIL             528     323  0.643     0.538  0.964     0.960   0.969       0.960
ABSTRACT        6,777     598  0.954     0.898  0.974     0.910   0.986       0.959
KEYWORD           103      70  0.483     0.361  0.882     0.863   0.909       0.858
OTHER           1,481     651  0.948     0.902  0.938     0.914   0.968       0.932
TITLE_E           570     455  0.846     0.830  0.928     0.926   0.960       0.962
AUTHORS_E         939     459  0.820     0.747  0.876     0.837   0.925       0.886
AFFILIATION_E     722     426  0.802     0.809  0.853     0.834   0.912       0.892
ABSTRACT_E        773      99  0.806     0.573  0.851     0.719   0.895       0.794
KEYWORD_E          47      37  0.449     0.394  0.790     0.786   0.840       0.786
Total          16,940     945  0.894     0.532  0.920     0.527   0.959       0.692

4
1

1. S. Lawrence, C.L. Giles, K. Bollacker, Digital libraries and autonomous citation indexing, IEEE Computer, vol. 6, no.4, pp. 67-71, 1999.
2. Lawrence, S., Giles, C.L., Bollacker, K.(1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32 (6),67-71.
3. S.Lawrence,C.Giles,K.Bollacker,Digitallibraries andautonomouscitationindexing,ieeecompute r32(6):67-71(1999)

3. PDF

4.1
6 OTHER 1 2
AUTHORS 1
TITLE
SOURCE URL
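1
DATE September 2003 1
PAGE pp.1-8 2138-2152 pp.34 10-18
OTHER to appear PAGE DATE OTHER 1
NONE

<AUTHORS>S. Lawrence, C.L. Giles, K. Bollacker </AUTHORS>, <TITLE>Digital libraries and autonomous citation indexing</TITLE>, <SOURCE> IEEE Computer, vol. 6, no.4 </SOURCE>, <PAGE> pp. 67-71</PAGE>, <DATE> 1999</DATE>.

4.2 HMM
SVM HMM [1, 4] HMM 3 PDF 1 HMM 2

Let $c(q_i \to q_j)$ be the number of transitions from state $q_i$ to state $q_j$ in the training data, and $c(q_i \uparrow \sigma_k)$ the number of times state $q_i$ emits symbol $\sigma_k$. The transition and emission probabilities are estimated as:

$$P(q_i \to q_j) = \frac{c(q_i \to q_j)}{\sum_{q_j \in Q} c(q_i \to q_j)}$$

$$P(q_i \uparrow \sigma_k) = \frac{c(q_i \uparrow \sigma_k)}{\sum_{\sigma_k \in \Sigma} c(q_i \uparrow \sigma_k)}$$

2 Viterbi 2 HMM HMM AUTHORS 1 [4] DATE DATE PAGE HMM DATE PAGE 2

Figure 2: HMM (states: start, AUTHORS, TITLE, SOURCE, OTHER, end)

5 29/2077 = 1.4%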
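The count-based estimates of Section 4.2 and the Viterbi decoding it mentions can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: the observation symbols (`"name"`, `"word"`, `"num"`) are hypothetical token classes, the toy training data are invented, no smoothing is applied, and the `start`/`end` states simply follow the labels of Figure 2.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Tagged = Sequence[Tuple[str, str]]  # (state, symbol) pairs for one reference


def estimate(refs: Sequence[Tagged]):
    """Relative-frequency estimates of P(q_i -> q_j) and P(q_i emits sigma_k)."""
    trans: Dict[str, Counter] = defaultdict(Counter)
    emit: Dict[str, Counter] = defaultdict(Counter)
    for ref in refs:
        prev = "start"
        for state, symbol in ref:
            trans[prev][state] += 1
            emit[state][symbol] += 1
            prev = state
        trans[prev]["end"] += 1
    p_trans = {q: {r: c / sum(cs.values()) for r, c in cs.items()} for q, cs in trans.items()}
    p_emit = {q: {s: c / sum(cs.values()) for s, c in cs.items()} for q, cs in emit.items()}
    return p_trans, p_emit


def _log(p: float) -> float:
    """Log probability; unseen events get -inf because no smoothing is used."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(symbols: Sequence[str], p_trans, p_emit) -> List[str]:
    """Most probable state sequence for the observed symbols."""
    if not symbols:
        return []
    states = list(p_emit)
    # best[t][q] = (log probability of the best path ending in q at t, predecessor)
    best = [{q: (_log(p_trans.get("start", {}).get(q, 0.0))
                 + _log(p_emit[q].get(symbols[0], 0.0)), None) for q in states}]
    for sym in symbols[1:]:
        column = {}
        for q in states:
            score, prev = max(
                (best[-1][r][0] + _log(p_trans.get(r, {}).get(q, 0.0)), r) for r in states
            )
            column[q] = (score + _log(p_emit[q].get(sym, 0.0)), prev)
        best.append(column)
    # Include the transition into the end state when choosing the final state.
    last = max(states, key=lambda q: best[-1][q][0] + _log(p_trans.get(q, {}).get("end", 0.0)))
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


# Invented toy data: two tagged references with hypothetical token classes.
training = [
    [("AUTHORS", "name"), ("TITLE", "word"), ("SOURCE", "word"), ("SOURCE", "num")],
    [("AUTHORS", "name"), ("AUTHORS", "name"), ("TITLE", "word"), ("SOURCE", "num")],
]
p_trans, p_emit = estimate(training)
decoded = viterbi(["name", "word", "num"], p_trans, p_emit)
```

Because the estimates are plain relative frequencies, any field order that never occurs in the training data gets probability zero, which is one way to see why unusual orders are hard for an HMM of this kind.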
Table 6:
AUTHORS DATE TITLE
J. Connan and C.W. Omlin ( 2000 ) Bibliography Extraction with Hidden Markov Models.
AUTHORS DATE TITLE SOURCE

Table 5: Field order (DATE, PAGE )

2
  1670  AUTHORS, TITLE, SOURCE
   138  AUTHORS, TITLE, SOURCE, OTHER
   107  AUTHORS, SOURCE
    40  AUTHORS, TITLE
    38  TITLE, SOURCE
    23  SOURCE
    15  AUTHORS, SOURCE, OTHER
     6  AUTHORS, TITLE, OTHER
     6  TITLE
     2  TITLE, SOURCE, OTHER
     2  SOURCE, OTHER

2
    15  TITLE, AUTHORS, SOURCE
     6  AUTHORS, SOURCE, TITLE
     2  AUTHORS, TITLE, OTHER, SOURCE
     1  TITLE, OTHER, SOURCE
     1  AUTHORS, SOURCE, TITLE, OTHER
     1  SOURCE, TITLE
     1  AUTHORS, OTHER, SOURCE
     1  AUTHORS, OTHER, TITLE, SOURCE
     1  AUTHORS

4.3 SVM
HMM SVM HMM HMM 7 7 SVM d 3 3 SVM1 SVM2 HMM DATE, PAGE SVM3 SVM2 SVM3 SVM1, SVM2 SVM SVM3 1 HMM

4.4
1 6 (ex. AUTHORS)

4.5
3 5 7

4.6
HMM SVM1, SVM2 HMM HMM SVM HMM
Table 7: Results for HMM, SVM1, SVM2, and SVM3 on the two reference sets (955 and 1122 references)

Field     Count    HMM   SVM1   SVM2   SVM3   Count    HMM   SVM1   SVM2   SVM3
AUTHORS     919  0.913  0.897  0.903  0.903    1084  0.907  0.893  0.898  0.981
TITLE       883  0.818  0.818  0.824  0.840    1044  0.785  0.834  0.839  0.941
SOURCE      923  0.756  0.756  0.794  0.805    1100  0.674  0.743  0.830  0.848
DATE        853  0.988  0.942  0.988  0.988    1061  0.957  0.886  0.957  0.957
PAGE        465  0.989  0.945  0.989  0.989     652  0.956  0.868  0.956  0.956
OTHER        64  0.538  0.313  0.313  0.201     106  0.538  0.538  0.769  0.461
Total       955  0.738  0.706  0.732  0.748    1122  0.651  0.700  0.781  0.816

SVM3 HMM SVM AUTHORS, TITLE AUTHORS TITLE AUTHORS HMM 8 15 (-7..+7)

Table 8: AUTHORS, TITLE, SOURCE 43.44 17.74 30.93 14.01

5

References

[1] J. Connan and C.W. Omlin. Bibliography extraction with hidden Markov models. 2000.
[2] Ying Ding, Gobinda Chowdhury, and Schubert Foo. Template mining for the extraction of citation from digital documents. In Proceedings of Second Asian Digital Library Conference, pp. 47-62, 1999.
[3] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, Vol. 32, No. 6, pp. 67-71, 1999.
[4] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Building domain-specific search engines with machine learning techniques. In Proceedings of AAAI-99 Spring Symposium on Intelligent Agents in Cyberspace, 1999.
[5] Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning hidden Markov model structure for information extraction. In AAAI 99 Workshop on Machine Learning for Information Extraction, 1999.
[6] ,,. PDF. 65, pp. 2 229 2 230, 2003.
[7] ,.., Vol. 6, No. 5, pp. 43-62, 1999.