DEWS2008 A7-5

Topic Structure Mining based on Wikipedia and Web Search

Makoto NAKATANI, Taro TEZUKA, Adam JATOWT, and Katsumi TANAKA

Department of Informatics and Mathematical Science, Faculty of Engineering, Kyoto University, Yoshida-honmachi, Sakyo, Kyoto, 606-8501 Japan
Department of Social Informatics, Graduate School of Informatics, Kyoto University, Yoshida-honmachi, Sakyo, Kyoto, 606-8501 Japan
E-mail: {nakatani,tezuka,adam,tanaka}@dl.kuis.kyoto-u.ac.jp

Abstract This paper proposes a method for extracting topic terms and analyzing their characteristics by using the structural features of Wikipedia, the free encyclopedia, together with Web search. Existing methods cannot extract topic terms for a query that consists of multiple terms, and the terms they extract mix general and specialized terms regardless of the user's request. Our method extends existing methods so that topic terms can be extracted for multiple query terms, and it analyzes the generality and specialty of each topic term by using structural features of Wikipedia such as sections and links. It supports the user's understanding of a topic by showing how widely or narrowly each term is used.

Key words Wikipedia, Topic structure, Topic extraction

1. [...] Web [...] Google (1), Yahoo! (2) [...] Web [...]

(1) http://www.google.co.jp/
(2) http://www.yahoo.co.jp/
1 Wikipedia 3 Wikipedia Wikipedia Wikipedia 2 3 Wikipedia 4 1 2 5 6 Web 2. Wikipedia Web [1] Wikipedia [2] Wikipedia DOM Wikipedia Web ipod Wikipedia Web ipod nano cm [3] [4] [5] ipod cm Rada [6] Wikipedia cm Web Wikipedia 1 Web Wikipedia Wikipedia Web Web ipod Web ipod cm [7] [8] Web Web [7] Web A A B A A B A B Web 3 http://ja.wikipedia.org/
4. 1. 1 Wikipedia 2 Wikipedia Wikipedia 3. Wikipedia 3. 1 Wikipedia 1 Wikipedia 2008 1 2 Wikipedia 217 3 3. 2 2 Wikipedia Wikipedia (2) wiki {q_1, q_2} Wikipedia d_q1 q_2 q_2 Wikipedia q_1 80 q_2 q_2 2500 Wikipedia {t_1, t_2, ..., t_n} t Wikipedia d_t 4. 1. 2 Wikipedia 3. 3 {t_1, t_2, ..., t_n} Wikipedia Wikipedia 2 Wikipedia ipod 2 4 PORTER PRADA 4. ipod foobar ipodwizard 4. 1 Web Web Wikipedia [7] Wikipedia Web P 2 A, B Web P p A p A

Table 1  2 x 2 contingency table

            Class B_1   Class B_2   Total
Class A_1   x_11        x_12        a_1
Class A_2   x_21        x_22        a_2
Total       b_1         b_2         N

(4) http://download.wikimedia.org/jawiki/
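In Table 1, the attributes A and B have the classes A_1, A_2 and B_1, B_2. The marginal totals a_1, a_2, b_1, b_2 are the numbers of Web pages in A_1, A_2, B_1, B_2, N is the number of Web pages in the population P, and x_ij (i = 1, 2; j = 1, 2) is the number of Web pages that belong to both A_i and B_j. The chi-squared statistic for A and B is

\[ \chi_0^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(x_{ij} - a_i b_j / N)^2}{a_i b_j / N} \tag{1} \]

With significance level \alpha and the corresponding threshold u (5), A and B are judged to be related when \chi_0^2 > u. For a query {q_1, q_2} and a candidate term t, the values N, a_1, b_1 and x_11 are obtained as document frequencies DF for each of three patterns of P, A and B (Table 2).

Pattern 1:

\[ N = \mathrm{DF}(\mathrm{intitle}(q_1)\ q_2), \quad a_1 = \mathrm{DF}(\mathrm{intitle}(q_1\ q_2)), \quad b_1 = \mathrm{DF}(\mathrm{intitle}(q_1)\ q_2\ t), \quad x_{11} = \mathrm{DF}(\mathrm{intitle}(q_1\ q_2)\ t) \tag{2} \]

\[ \frac{\mathrm{DF}(\mathrm{intitle}(q_1\ q_2)\ t)}{\mathrm{DF}(\mathrm{intitle}(q_1\ q_2))} > \frac{\mathrm{DF}(\mathrm{intitle}(q_1)\ q_2\ t)}{\mathrm{DF}(\mathrm{intitle}(q_1)\ q_2)}, \qquad \chi_0^2 > u \tag{3} \]

Patterns 2 and 3 use the same test as (3) with the following substitutions.

Pattern 2:

\[ \frac{\mathrm{DF}(\mathrm{intitle}(q_1\ q_2)\ t)}{\mathrm{DF}(\mathrm{intitle}(q_1\ q_2))} > \frac{\mathrm{DF}(q_1\ q_2\ t)}{\mathrm{DF}(q_1\ q_2)} \]

\[ N = \mathrm{DF}(q_1\ q_2), \quad a_1 = \mathrm{DF}(\mathrm{intitle}(q_1\ q_2)), \quad b_1 = \mathrm{DF}(q_1\ q_2\ t), \quad x_{11} = \mathrm{DF}(\mathrm{intitle}(q_1\ q_2)\ t) \]

Pattern 3:

\[ \frac{\mathrm{DF}(q_1\ \mathrm{intitle}(q_2)\ t)}{\mathrm{DF}(q_1\ \mathrm{intitle}(q_2))} > \frac{\mathrm{DF}(q_1\ q_2\ t)}{\mathrm{DF}(q_1\ q_2)} \]

\[ N = \mathrm{DF}(q_1\ q_2), \quad a_1 = \mathrm{DF}(q_1\ \mathrm{intitle}(q_2)), \quad b_1 = \mathrm{DF}(q_1\ q_2\ t), \quad x_{11} = \mathrm{DF}(q_1\ \mathrm{intitle}(q_2)\ t) \]

4. 2

4. 2. 1

A Wikipedia article d consists of N_d sections {s_1, s_2, ..., s_{N_d}}, and freq(t, s_i, d) is the frequency of the term t in section s_i of d. For the article d_q of a query q and a topic term t, Noise(t, d) and Signal(t, d) are defined from the distribution p(s_i | t, d) of t over the sections of d:

\[ p(s_i \mid t, d) = \frac{\mathrm{freq}(t, s_i, d)}{\sum_{i=1}^{N_d} \mathrm{freq}(t, s_i, d)} \tag{4} \]

\[ \mathrm{Noise}(t, d) = -\sum_{i=1}^{N_d} p(s_i \mid t, d) \log_2 p(s_i \mid t, d) \tag{5} \]

\[ \mathrm{Signal}(t, d) = \log_2 \sum_{i=1}^{N_d} \mathrm{freq}(t, s_i, d) - \mathrm{Noise}(t, d) \tag{6} \]

(5) u is the value that satisfies

\[ \int_u^{\infty} T_1(x)\, dx = \alpha, \qquad T_1(x) = \frac{1}{\sqrt{2\pi}}\, x^{-\frac{1}{2}} e^{-\frac{x}{2}} \]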
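The test built from Equations (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `chi_squared_2x2` and `is_topic_term` are hypothetical, the four counts are assumed to have already been obtained as document frequencies, and the default threshold 3.841 is the standard critical value of the chi-squared distribution with one degree of freedom at a significance level of 0.05.

```python
def chi_squared_2x2(n, a1, b1, x11):
    """Chi-squared statistic of Equation (1) for a 2 x 2 table.

    The remaining cells and marginal totals follow from N, a_1, b_1 and x_11.
    """
    a = (a1, n - a1)                      # row totals a_1, a_2
    b = (b1, n - b1)                      # column totals b_1, b_2
    x = ((x11, a1 - x11),                 # x_11, x_12
         (b1 - x11, n - a1 - b1 + x11))   # x_21, x_22
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = a[i] * b[j] / n
            chi2 += (x[i][j] - expected) ** 2 / expected
    return chi2


def is_topic_term(n, a1, b1, x11, u=3.841):
    """Apply both parts of condition (3): positive association and chi2 > u."""
    if min(a1, n - a1, b1, n - b1) <= 0:
        return False  # a degenerate table cannot be tested
    positively_associated = x11 / a1 > b1 / n
    return positively_associated and chi_squared_2x2(n, a1, b1, x11) > u


# Hypothetical counts: t appears in 40 of the 50 pages in class A_1,
# but in only 50 of the 100 pages of the population.
print(is_topic_term(n=100, a1=50, b1=50, x11=40))
```

The ratio check matters because the chi-squared statistic alone is symmetric: it is just as large when t is unusually rare in A_1 as when it is unusually frequent.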
Table 2  Patterns of the population P and the attributes A and B

Pattern   P                            A                    B
1         q_1 in title, q_2 in page    q_2 in title         t in page
2         q_1, q_2 in page             q_1, q_2 in title    t in page
3         q_1, q_2 in page             q_2 in title         t in page

4. 2. 2

For a query q with Wikipedia article d_q and topic terms {t_1, t_2, ..., t_n}, D(q) = {d_t1, d_t2, ..., d_tn} denotes the set of Wikipedia articles d_t of the topic terms t.

4. 2. 3

For an article d_t in D(q), RelativeInlink(d_t, d_q) is the number of other topic terms of q whose articles link to d_t:

\[ \mathrm{RelativeInlink}(d_t, d_q) = \left| \{\, t_i \mid t \neq t_i,\ d_{t_i} \rightarrow d_t,\ t_i \in \mathrm{TopicTerms}(q) \,\} \right| \]

Here d_tk -> d_tl means that d_tk contains a link to d_tl, TopicTerms(q) [...] Outlink(d_q) [...], and Inlink(d_t) is the number of links to d_t within Wikipedia. Table 3 shows RelativeInlink(d_t, d_q) and Inlink(d_t) for q = ipod.

Table 3  RelativeInlink(d_t, d_q) and Inlink(d_t) (q = ipod)

t            RelativeInlink(d_t, d_q)   Inlink(d_t)
[...]        27                         639
[...]        22                         1320
[...]        14                         1563
[...]        12                         1380
Macintosh    11                         675
[...]        10                         14753
[...]        7                          5957
ipod nano    6                          64
ipod touch   1                          38

The generality, locality and specialty of a topic term t with respect to q are defined as

\[ \mathrm{Generality}(t, q) = \mathrm{Noise}(t, d_q) \tag{7} \]

\[ \mathrm{Locality}(t, q) = \mathrm{Signal}(t, d_q) \tag{8} \]

\[ \mathrm{Specialty}(t, q) = \frac{\mathrm{RelativeInlink}(d_t, d_q)}{\mathrm{Inlink}(d_t)} \tag{9} \]

Specialty(t, q) takes a value between 0 and 1.
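The three measures in Equations (7) to (9) reduce to short computations once the per-section frequencies and link counts are known. The sketch below assumes that the frequencies freq(t, s_i, d_q) are already available as a list of integers, one per section; the function names are illustrative and are not taken from the paper.

```python
from math import log2


def noise(freqs):
    """Equation (5): entropy of the term's distribution over sections."""
    total = sum(freqs)
    if total == 0:
        return 0.0
    # Sections where the term does not occur contribute nothing to the sum.
    return -sum((f / total) * log2(f / total) for f in freqs if f > 0)


def signal(freqs):
    """Equation (6): log2 of the total frequency minus the noise."""
    total = sum(freqs)
    if total == 0:
        return 0.0
    return log2(total) - noise(freqs)


def specialty(relative_inlink, inlink):
    """Equation (9): share of the inlinks that come from other topic terms."""
    return relative_inlink / inlink if inlink else 0.0


# A term spread evenly over four sections has high Generality (noise) ...
print(noise([2, 2, 2, 2]), signal([2, 2, 2, 2]))
# ... while a term concentrated in one section has high Locality (signal).
print(noise([8, 0, 0, 0]), signal([8, 0, 0, 0]))
# Counts for "ipod nano" as listed in Table 3: 6 relative inlinks of 64.
print(specialty(6, 64))
```

Both example terms occur eight times in total, so the difference between them comes only from how the occurrences are distributed over the sections, which is the property that Generality and Locality are meant to separate.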
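5. [...]

Table 4  [...] q [...] 4. 1 [...] q_1, q_2: ipod, [...] 4. 2: cm, [...], vaio, [...], dvd, [...]

5. 1

[...] 4. 1 [...] Yahoo! [...] Web [...] API (6) [...] q_2 [...] q_1 [...]. For the three patterns of Section 4. 1, the DF values were obtained from Google Web search through the Google SOAP API (7), using the intitle: operator. The significance level \alpha was set to 0.05. [...] Table 4 [...] Wikipedia [...]

[...] 4. 2 [...] Table 4 [...] For a term t and a query q, confidence is defined as

\[ \mathrm{confidence}(t \rightarrow q) = \frac{\mathrm{DF}(t\ q)}{\mathrm{DF}(t)} \tag{10} \]

[...] Yahoo! [...] q [...] Google SOAP API [...] the top 100 Web search results S [...]. Specialty(t, q) [...] 100 [...] confidence(t -> q) [...] S [...] ipod [...] Table 5 [...] Generality [...] Locality [...]

5. 2

5. 2. 1

Generality(t, q) and Locality(t, q) [...] 100 [...] t [...] Noise(t, S) [...] Signal(t, S) [...] Generality(t, q), Locality(t, q) [...] Noise(t, S) [...] Signal(t, S) [...] 4. 1 [...] Web [...] t [...] q [...] Specialty(t, q) [...] t [...] q [...]

[...] ipod cm [...] ipod cm [...] ipod [...]

(6) http://developer.yahoo.co.jp/search/webunit/V1/webunitSearch.html
(7) http://code.google.com/apis/soapsearch/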
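Equation (10) is the ordinary confidence of the association rule t -> q computed from document frequencies. As a rough sketch, assuming that the two hit counts have already been retrieved from a Web search engine (the function name and the example numbers below are hypothetical and are not measurements from the paper):

```python
def confidence(df_t_and_q, df_t):
    """Equation (10): DF(t q) / DF(t), the share of pages containing t
    that also contain q."""
    if df_t == 0:
        return 0.0  # t never occurs, so the rule has no support
    return df_t_and_q / df_t


# Hypothetical hit counts for two candidate terms and one query q.
hit_counts = {
    "term_a": (30, 120),      # (DF(t q), DF(t))
    "term_b": (45, 9000),
}
ranked = sorted(hit_counts, key=lambda t: confidence(*hit_counts[t]), reverse=True)
print(ranked)
```

A term that is frequent on the Web as a whole gets a low confidence even when it co-occurs with q often in absolute terms, which is the same normalizing role that Inlink(d_t) plays in Equation (9).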
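Table 5  Precision and recall for each pattern

Pattern     1        2        3
precision   0.5158   0.4764   0.3868
recall      0.3135   0.175    0.1577

[Figure residue: panels labeled d = ipod and d = [...]; labels Specialty, Locality, Generality; terms ipod, Apple, ipod cm]

Figures 5, 6 and 7 [...] Generality, Locality, Specialty [...] Wikipedia [...] ipod [...] Noise(t, S), Signal(t, S), confidence(t -> q) [...] Figure 5 [...] 20 [...] Generality [...] 20 [...] t [...] Noise(t, S) [...] Figure 7 [...] Wikipedia [...]

5. 2. 2

[...] 4. 2 [...] ipod [...] Generality(t, q), Locality(t, q), Specialty(t, q) [...] top 10 [...] Table 8 [...] Generality [...] Wikipedia [...]

6. [...]

[...] Wikipedia [...] Web [...] Wikipedia [...] 3 [...] 4. 1 [...] 3 [...] 1 [...] Web [...] q_2 [...] q_1 [...] Wikipedia [...] A [...] B [...]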
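Precision and recall of the kind reported in Table 5 compare a set of extracted terms with a set of terms judged to be correct. How the correct set was built in the experiment is not recoverable from the text, so the sketch below only illustrates the standard set-based definitions; the function name and the sample terms are placeholders.

```python
def precision_recall(extracted, relevant):
    """Set-based precision and recall of extracted terms against a gold set."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Placeholder data: four extracted terms, three correct terms, two in common.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r)
```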
Table 6  [...] q_1, q_2 [...]

Pattern 1: ipod, cm: N.E.R.D, [...]; [...]: COACH, GUCCI, PORTER, LOUIS VUITTON, PRADA, Paul Smith, [...]
Pattern 2: [...]

q = ipod, top 10 terms by Generality: ipod nano, ipod mini, ipod shuffle, itunes, mac, macintosh, [...], ipod touch, mac OS X, iPhone
[...]

q = ipod, top 10 terms by Locality: [...], HTML, Linux, [...], FAT32, [...], U2, FireWire, GTK, HFS+
[...]

q = ipod, top 10 terms by Specialty: ipod Classic, [...]
[...] featuring, [...]

Figure 5  Generality [...]
Figure 6  Locality [...]
Figure 7  Specialty [...]

[...] COE [...] 18049041 [...] (B) 18700086 [...] (B) 18700111 [...]

[1] [...]: [...] Wikipedia [...] web [...], [...], 47, 10, pp. 2917-2928 (2006).
[2] [...]: [...] Dom [...] wikipedia [...], [...] 21 [...] (2007).
[3] E. Gabrilovich and S. Markovitch: Computing semantic relatedness using Wikipedia-based explicit semantic analysis, Proceedings of The Twentieth International Joint Conference for Artificial Intelligence, pp. 1606-1611 (2007).
[4] R. Bunescu and M. Pasca: Using encyclopedic knowledge for named entity disambiguation, Proceedings of the European Conference of the Association for Computational Linguistics (2006).
[5] M. Strube and S. P. Ponzetto: WikiRelate! Computing semantic relatedness using Wikipedia, Proceedings of the American Association for Artificial Intelligence (2006).
[6] R. Mihalcea and A. Csomai: Wikify! Linking documents to encyclopedic knowledge, Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pp. 233-242 (2007).
[7] [...]: [...] web [...], [...] 14 [...] (DEWS2003) (2003).
[8] [...]: [...], [...] Letters, 5, 2 (2006).