Shortness Ambiguity TEAM Ungrammaticality

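DEIM Forum 2017 D8-4

Address (both authors): 305-8573, 1-1-1
E-mail: s.nagaki@kde.cs.tsukuba.ac.jp, kitagawa@cs.tsukuba.ac.jp

1.

Entity linking associates expressions in a text with entries of a knowledge base such as Wikipedia. This paper addresses entity linking for tweets posted on Twitter (footnote 1). Entity linking on Twitter is difficult [1] because of three properties of tweets:

Shortness: a tweet is limited to 140 characters.
Ambiguity.
Ungrammaticality.

This paper proposes TEAM (Temporal Entity Association Measure), a relatedness measure between entities computed from Twitter itself, and TAGME+, an entity linking method that incorporates TEAM (Figure 1).

Figure 1: TAGME+.

Footnote 1: https://twitter.com/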

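Figure 2.

Experiments on Twitter data show that using TEAM improves the AUC (Area Under the Curve) of the precision-recall curve by up to 60%.

2.

Entity linking is carried out in three steps, (1), (2), and (3) (Figure 2). Sections 2.1, 2.2, and 2.3 describe these steps in order; see also the survey [2].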

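3. TAGME

This section describes TAGME [3], the method extended in this paper. TAGME annotates short texts with Wikipedia pages. It received a Google Faculty Award in 2010 (footnote 2) and is widely cited (footnote 3). TAGME carries out the three steps (1), (2), and (3) of Section 2 as described in Sections 3.1 to 3.3.

Footnote 2: https://tagme.d4science.org/tagme/tagme_help.html
Footnote 3: 283 citations on Google Scholar as of November 8, 2016.

3.1

Keywords are extracted by matching the text against Wikipedia anchor texts. For a keyword k, the link probability is

  lp(k) = \frac{link(k)}{freq(k)},   (1)

where link(k) is the number of Wikipedia pages in which k occurs as a link anchor and freq(k) is the number of Wikipedia pages in which k occurs. When two keywords k_1 and k_2 overlap, with k_1 contained in k_2, they are handled as follows: if lp(k_1) < lp(k_2), only k_2 is kept; if lp(k_1) \geq lp(k_2), both k_1 and k_2 are kept.

3.2

For each keyword k \in K extracted in Section 3.1, let Pages(k) be the set of Wikipedia pages that k may refer to. TAGME scores each candidate e_k \in Pages(k) using the Wikipedia Link-based Measure (WLM) [4], a relatedness measure between Wikipedia pages that is used by TAGME and by other entity linking methods [2]. The WLM between Wikipedia pages a and b is

  WLM(a, b) = 1 - \frac{\log \max(|A|, |B|) - \log |A \cap B|}{\log |All| - \log \min(|A|, |B|)},   (2)

where A and B are the sets of Wikipedia pages that link to a and b, and All is the set of all Wikipedia pages. Using WLM, the relevance of candidate e_k for keyword k is

  rel_k(e_k) = \sum_{k' \in K \setminus \{k\}} \frac{\sum_{p_{k'} \in Pages(k')} WLM(p_{k'}, e_k) \cdot Pr(p_{k'} \mid k')}{|Pages(k')|},   (3)

where Pr(e_k \mid k) is the probability that k refers to e_k. In Equation (3), every other keyword k' votes for e_k through its own candidates p_{k'}. Among the candidates whose rel_k(e_k) is within the top \epsilon%, TAGME selects the one with the highest Pr(e_k \mid k) as the entity e_k for k.

3.3

For each pair (k, e_k) selected in Section 3.2, the coherence is

  coherence(k, e_k) = \frac{1}{|S| - 1} \sum_{p_{k'} \in S \setminus \{e_k\}} WLM(e_k, p_{k'}),   (4)

where S is the set of entities selected in Section 3.2. The pair (k, e_k) is kept only if

  \frac{1}{2} \left( lp(k) + coherence(k, e_k) \right) > \rho,   (5)

for a threshold \rho.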
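Equations (1), (2), and (5) can be sketched in a few lines of Python. This is a minimal illustration, not the TAGME implementation: the function names, the use of Python sets of page IDs for A and B, and the handling of edge cases (no common in-links, zero denominator, clamping negative values to 0) are assumptions made for the example.

```python
import math


def link_probability(link_count: int, freq_count: int) -> float:
    """Equation (1): lp(k) = link(k) / freq(k)."""
    if freq_count == 0:
        return 0.0  # assumption: a keyword that never occurs gets lp = 0
    return link_count / freq_count


def wlm(in_a: set, in_b: set, n_all: int) -> float:
    """Equation (2), with in_a and in_b the sets of pages linking to a and b.

    Assumptions: 0.0 is returned when no page links to both a and b,
    and negative values are clamped to 0.0.
    """
    common = len(in_a & in_b)
    if common == 0:
        return 0.0
    num = math.log(max(len(in_a), len(in_b))) - math.log(common)
    den = math.log(n_all) - math.log(min(len(in_a), len(in_b)))
    if den == 0.0:
        return 1.0 if num == 0.0 else 0.0
    return max(0.0, 1.0 - num / den)


def keep_annotation(lp: float, coherence: float, rho: float) -> bool:
    """Equation (5): keep (k, e_k) if the mean of lp and coherence exceeds rho."""
    return 0.5 * (lp + coherence) > rho


if __name__ == "__main__":
    a_links = {1, 2, 3, 4}
    b_links = {3, 4, 5}
    print(wlm(a_links, b_links, n_all=100))
    print(keep_annotation(lp=0.3, coherence=0.5, rho=0.3))
```

With these toy sets, two of the in-linking pages are shared, so the WLM value is well above 0; disjoint sets give 0 under the stated assumption.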

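4.

Section 4.1 describes TEAM (Temporal Entity Association Measure), and Section 4.2 describes TAGME+, which incorporates TEAM into TAGME.

4.1 TEAM

TEAM (Temporal Entity Association Measure) measures the relatedness of two entities from their co-occurrence on Twitter within a time window. Let W be the window and a, b two entities. Let A and B be the sets of tweets in W that are linked to a and b, and let All be the set of all tweets in W. Let p(a, b) = |A \cap B| / |W| be the co-occurrence probability of a and b in W, and p(a) = |A| / |W|, p(b) = |B| / |W| the occurrence probabilities of a and b in W. Four variants of TEAM are defined: TEAM-jaccard, TEAM-dice, TEAM-npmi, and TEAM-gsd.

  TEAM_{jac}(a, b, W) = \frac{|A \cap B|}{|A \cup B|}.   (6)

  TEAM_{dice}(a, b, W) = \frac{2 |A \cap B|}{|A| + |B|}.   (7)

  TEAM_{npmi}(a, b, W) = \frac{\ln \frac{p(a, b)}{p(a) p(b)}}{-\ln p(a, b)}.   (8)

  TEAM_{gsd}(a, b, W) = 1 - \frac{\log \max(|A|, |B|) - \log |A \cap B|}{\log |All| - \log \min(|A|, |B|)}.   (9)

When the right-hand side of Equation (9) is below 0, TEAM_{gsd}(a, b, W) = 0.

TEAM-jaccard is based on the Jaccard coefficient between two sets. For two sets X and Y, the Jaccard coefficient is

  Jac(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}.   (10)

From Equation (10), 0 \leq Jac(X, Y) \leq 1. TEAM-jaccard is closer to 1 the more often a and b co-occur in W.

TEAM-dice is based on the Dice coefficient between two sets. For two sets X and Y, the Dice coefficient is

  Dice(X, Y) = \frac{2 |X \cap Y|}{|X| + |Y|}.   (11)

From Equation (11), 0 \leq Dice(X, Y) \leq 1. Like TEAM-jaccard, TEAM-dice is closer to 1 the more often a and b co-occur in W.

TEAM-npmi is based on pointwise mutual information (PMI). For the outcomes x and y of two random variables X and Y, PMI is

  PMI(X = x, Y = y) = \ln \frac{P(X = x, Y = y)}{P(X = x) P(Y = y)}.   (12)

PMI is unbounded: -\infty < PMI(X = x, Y = y) < \infty. TEAM-npmi therefore uses Normalized PMI [5]:

  NPMI(X = x, Y = y) = \frac{\ln \frac{P(X = x, Y = y)}{P(X = x) P(Y = y)}}{-\ln P(X = x, Y = y)}.   (13)

The Jaccard coefficient, the Dice coefficient, and PMI have been used to measure semantic similarity between words [6] [7] [8].

TEAM-gsd is based on the Google Similarity Distance (GSD) [9], which measures the relatedness of two terms from Google search results. For two terms x and y,

  GSD(x, y) = 1 - \frac{\log \max(|X|, |Y|) - \log |X \cap Y|}{\log |All| - \log \min(|X|, |Y|)},   (14)

where X and Y are the sets of pages that Google returns for x and y, and All is the set of all pages indexed by Google. GSD satisfies -\infty < GSD(x, y) \leq 1. Because TEAM_{gsd} = 0 whenever Equation (14) is below 0, it follows that 0 \leq TEAM_{gsd} \leq 1.

The TEAM variants are compared in Section 5.

4.2 TAGME+

TAGME, described in Section 3, uses WLM as its relatedness measure. TAGME+ extends TAGME by using TEAM together with WLM. For a tweet posted at time \tau and a window width w, the window W is determined by \tau and w, and TEAM is computed over W and combined with WLM.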
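The four variants in Equations (6) to (9) differ only in how they normalize the co-occurrence count, which a small sketch makes concrete. The code below is an illustration under assumptions, not the authors' implementation: A and B are represented as Python sets of tweet IDs, |All| and |W| are both passed as the number of tweets in the window (n_window), and the return values for degenerate cases (no co-occurrence, p(a, b) = 1, zero denominator) are choices made for this example.

```python
import math


def team_jaccard(a: set, b: set) -> float:
    """Equation (6)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def team_dice(a: set, b: set) -> float:
    """Equation (7)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def team_npmi(a: set, b: set, n_window: int) -> float:
    """Equation (8). Assumption: 0.0 when a and b never co-occur."""
    co = len(a & b)
    if co == 0 or n_window == 0:
        return 0.0
    p_ab = co / n_window
    if p_ab == 1.0:
        return 1.0  # assumption: limit value when every tweet contains both
    p_a = len(a) / n_window
    p_b = len(b) / n_window
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def team_gsd(a: set, b: set, n_window: int) -> float:
    """Equation (9), clamped to 0 when the expression is negative."""
    co = len(a & b)
    if co == 0:
        return 0.0  # assumption: no co-occurrence means no relatedness
    num = math.log(max(len(a), len(b))) - math.log(co)
    den = math.log(n_window) - math.log(min(len(a), len(b)))
    if den == 0.0:
        return 1.0 if num == 0.0 else 0.0
    return max(0.0, 1.0 - num / den)


if __name__ == "__main__":
    tweets_a = {1, 2, 3}   # IDs of tweets in W linked to entity a
    tweets_b = {2, 3, 4}   # IDs of tweets in W linked to entity b
    for name, value in [
        ("jaccard", team_jaccard(tweets_a, tweets_b)),
        ("dice", team_dice(tweets_a, tweets_b)),
        ("npmi", team_npmi(tweets_a, tweets_b, n_window=10)),
        ("gsd", team_gsd(tweets_a, tweets_b, n_window=10)),
    ]:
        print(name, round(value, 4))
```

Only TEAM-npmi and TEAM-gsd depend on the total number of tweets in the window; TEAM-jaccard and TEAM-dice use the two tweet sets alone.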

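Algorithm 1: TAGME+
  Input: TL, w, D, \rho, \lambda, \epsilon
  Output: EL
   1  for tweet T from TS do
   2    i <- id(T), \tau <- time(T), t <- text(T)
   3    EW <- shaveWindow(\tau, w)
   4    K <- extractKeywords(t)
   5    for k from K do
   6      for e_k from Candidates(k) do
   7        rel_{e_k} <- calcRel(k, e_k, K, EW, \lambda)
   8        Rel_{e_k} <- Rel_{e_k} \cup {(e_k, rel_{e_k})}
   9      end
  10      E_k <- topRel(rel_{e_k}, \epsilon)
  11      e_k <- arg max_{e_k \in E_k} Pr(e_k | k)
  12      S.enqueue((k, e_k))
  13    end
  14    for (k, e_k) from S do
  15      coh_{k,e_k} <- calcCoherence(k, e_k)
  16      score <- \frac{1}{2} (kp(k) + coh(k, e_k))
  17      if score > \rho then
  18        EL <- EL \cup {(i, k, e_k)}
  19        EW <- updateWindow(i, \tau, k, e_k)
  20      end
  21    end
  22  end
  23  return EL

The relatedness used by TAGME+ for a tweet posted at time \tau is

  rel_\tau(a, b) = (1 - \lambda) \cdot WLM(a, b) + \lambda \cdot TEAM(a, b, W),   (15)

where \lambda is a parameter that weights WLM against TEAM. TAGME+ is obtained by replacing WLM in Equations (3) and (4) of TAGME with Equation (15).

Algorithm 1 shows TAGME+. Its inputs are the tweet stream TS, the window width w, and the parameters \rho, \lambda, and \epsilon. Its output is a set of 3-tuples, each consisting of (1) a tweet ID, (2) a keyword, and (3) the entity linked to (2). The window EW holds the entity links obtained so far. In line 3, shaveWindow(\tau, w) removes from EW the entries that fall outside the window of width w at time \tau. Line 4 extracts keywords as in Section 3.1; following the earlier work [10], Keyphraseness, proposed by Mihalcea and Csomai [11], is used. Lines 6 to 13 correspond to Section 3.2, with Equation (15) used in Equation (3). The remaining lines correspond to Section 3.3, with Keyphraseness kp(k) used in Equation (5).

5.

This section evaluates the proposed method with three experiments: (1) a comparison of the TEAM variants defined in Section 4.1 (Section 5.2), (2) the effect of TEAM (Section 5.3), and (3) the effect of the window width w (Section 5.4). Section 5.1 describes the experimental setup, Sections 5.2, 5.3, and 5.4 report the results, and Section 5.5 discusses them.

5.1

The experiments use the dataset of the NEEL (Named Entity rEcognition and Linking) Challenge held at the WWW2016 workshop #Microposts2016 (footnote 4). The dataset consists of tweets from 2011 to 2015 collected from the Twitter Firehose, provided as sets of 100, 3,164, and 6,025 tweets, 9,289 tweets in total. Of these 9,289 tweets, 1,486 could be retrieved through the Twitter REST API (footnote 5) as of December 26, 2016. The entities in the dataset are given as DBpedia (footnote 6) resources; they were mapped to Wikipedia pages with SPARQL 1.1 (footnote 7) queries against DBpedia. The Wikipedia dump of December 1, 2016 (footnote 8) was used. Because the retrieved tweets fall into two periods, 2011 and 2015, they were divided into two datasets, tweets2011 and tweets2015. Table 1 summarizes them.

Footnote 4: http://microposts2016.seas.upenn.edu/index.html
Footnote 5: https://dev.twitter.com/rest/public
Footnote 6: http://wiki.dbpedia.org/
Footnote 7: https://www.w3.org/TR/sparql11-query/
Footnote 8: https://dumps.wikimedia.org/enwiki/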
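The two parameters varied in the experiments, \lambda and w, enter the method through Equation (15) and through the window maintenance in lines 3 and 19 of Algorithm 1. The following sketch shows one way these pieces could fit together. It is an assumption-laden illustration: the class EntityWindow, its method names, the tuple layout, and the requirement that tweets arrive in time order are all hypothetical and are not taken from the paper.

```python
from collections import deque
from datetime import datetime, timedelta


def combined_relatedness(wlm_score: float, team_score: float, lam: float) -> float:
    """Equation (15): (1 - lambda) * WLM + lambda * TEAM."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    return (1.0 - lam) * wlm_score + lam * team_score


class EntityWindow:
    """Hypothetical container playing the role of EW in Algorithm 1.

    Assumes tweets are processed in non-decreasing time order, so the
    oldest entries are always at the left end of the deque.
    """

    def __init__(self, width: timedelta):
        self.width = width
        self.items = deque()  # (tweet_id, time, keyword, entity)

    def shave(self, now: datetime) -> None:
        """Role of shaveWindow: drop entries older than now - width."""
        cutoff = now - self.width
        while self.items and self.items[0][1] < cutoff:
            self.items.popleft()

    def update(self, tweet_id: int, time: datetime, keyword: str, entity: str) -> None:
        """Role of updateWindow: record a newly linked (keyword, entity)."""
        self.items.append((tweet_id, time, keyword, entity))

    def tweets_with(self, entity: str) -> set:
        """IDs of tweets in the window linked to the given entity."""
        return {tid for tid, _, _, ent in self.items if ent == entity}


if __name__ == "__main__":
    t0 = datetime(2011, 7, 16, 12, 0)
    ew = EntityWindow(width=timedelta(hours=6))
    ew.update(1, t0, "apple", "Apple_Inc.")
    ew.update(2, t0 + timedelta(hours=1), "iphone", "IPhone")
    ew.shave(t0 + timedelta(hours=7))   # entry from t0 is now too old
    print(len(ew.items), ew.tweets_with("IPhone"))
    print(combined_relatedness(wlm_score=0.2, team_score=0.8, lam=0.5))
```

Setting lam to 0 reduces the score to WLM alone, and setting it to 1 uses TEAM alone, which matches the two ends of the \lambda range in Equation (15).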

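Table 1: tweets2011 and tweets2015. Row labels other than the first and last did not survive and are left blank; the slashes in the source appear to mark per-tweet units.

                        tweets2011                  tweets2015
  Number of tweets      689                         797
                        636                         70
                        1.97                        0.16
                        9                           5
                        0                           0
                        17.42                       13.03
  Period                2011-07-16 to 2011-08-15    2015-12-15 to 2015-12-16

Figure 3: Comparison of TEAM variants (PR curves). Panels: (a) tweets2011, (b) tweets2015.

Figure 4: Effect of TEAM (PR curves). Panels: (a) tweets2011, (b) tweets2015. Series: TEAM-jaccard, TEAM-gsd.

The proposed method combines WLM and TEAM with the weight \lambda of Equation (15). The baseline is TAGME-kp [10]. As in TAGME [3], \epsilon = 30(%). The threshold \rho of Equation (5) is varied from 0 to 1 in steps of 0.5 to obtain a PR curve.

5.2 TEAM

This experiment compares the TEAM variants defined in Section 4.1, TEAM-jaccard, TEAM-dice, TEAM-npmi, and TEAM-gsd, by their PR curves, with \lambda = 0.5 and w = 6 hours. Figure 3 shows the results for tweets2011 and tweets2015. The two datasets behave differently; as Table 1 shows, tweets2015 differs markedly from tweets2011. Based on the results for tweets2011, the two variants TEAM-jaccard and TEAM-gsd are used in the following experiments.

5.3 TEAM

This experiment examines the effect of TEAM by varying \lambda in Equation (15) from 0 to 1 in steps of 0.1 and comparing PR curves. \lambda = 0 corresponds to using WLM only. The window width is w = 6 hours. Figure 4 shows the results. On tweets2011, the PR-AUC is 0.20 at \lambda = 0, whereas at \lambda = 1 it is 0.32 for TEAM-jaccard and 0.28 for TEAM-gsd, an improvement in PR-AUC of up to 60%. On tweets2011, both TEAM-jaccard and TEAM-gsd reach their highest PR-AUC at \lambda = 1.0. At \lambda = 1.0, Equation (15) uses TEAM alone, without WLM.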
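The 60% figure is consistent with the reported PR-AUC values when read as a relative improvement over the \lambda = 0 setting. The short calculation below makes this explicit. The second value, for TEAM-gsd, is derived here from the same reported numbers and is not stated in the source.

```latex
\frac{0.32 - 0.20}{0.20} = 0.60 \quad (\text{TEAM-jaccard}),
\qquad
\frac{0.28 - 0.20}{0.20} = 0.40 \quad (\text{TEAM-gsd})
```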

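Figure 5: Effect of the window width w. Panels: (a) tweets2011, (b) tweets2015. Series: TEAM-jaccard, TEAM-gsd.

On tweets2015, as in Section 5.2, the results differ from those on tweets2011 (see Table 1).

5.4

This experiment examines the effect of the window width w by varying it over two ranges, from 0 to 1 day and from 0 to 1 month. w = 0 (None) corresponds to using no window, that is, WLM only without TEAM. \lambda = 0.5. Figure 5 shows the results for tweets2011 and tweets2015. On tweets2011, both TEAM-jaccard and TEAM-gsd give the best PR curve at w = 1 day.

5.5

On tweets2011, TEAM is effective: as Section 5.3 shows, performance improves as the weight \lambda of TEAM increases, and w = 1 day is the most suitable window width for Twitter. TAGME+ with TEAM-jaccard performs better than TAGME+ with TEAM-gsd. On tweets2015, TEAM shows little effect, which is attributed to the properties of tweets2015 shown in Table 1.

6.

Tweets pose three difficulties for entity linking: (1) the 140-character limit, (2), and (3). Several entity linking methods targeting Twitter have been proposed. Liu et al. [12] address entity linking for tweets.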

Shen et al. [13] propose KAURI, which links named entities in tweets to a knowledge base by modeling user interest. KAURI targets Twitter. Fang and Chang [14] use spatial and temporal signals for entity linking on microblogs. However, only about 1.6% of tweets carry GPS information [15]; a figure of 25% is also reported in connection with Fang and Chang's method. Hua et al. [16] use social and temporal context, considering two factors, (1) and (2). Hua et al. rely on Twitter and Wikipedia.

7.

This paper proposed TEAM and TAGME+, which incorporates TEAM, for entity linking on Twitter. Future work includes the steps described in Section 2.1, in particular the extraction of keywords k, and the handling of datasets such as tweets2015 (Table 1).

NICT

[1] G. Rizzo, M. van Erp, and R. Troncy. Benchmarking the extraction and disambiguation of named entities on the semantic web. In Proc. 9th International Conference on Language Resources and Evaluation, LREC '14, Reykjavik, Iceland, 2014.
[2] W. Shen, J. Wang, and J. Han. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 2, pp. 443-460, 2015.
[3] P. Ferragina and U. Scaiella. Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software, Vol. 29, No. 1, pp. 70-75, 2012.
[4] I. Witten and D. Milne. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proc. AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, AAAI Press, pp. 25-30, Chicago, USA, 2008.
[5] G. Bouma. Normalized (pointwise) mutual information in collocation extraction. In Proc. the Biennial GSCL Conference, Vol. 156, 2009.
[6] D. Bollegala, Y. Matsuo, and M. Ishizuka. Measuring semantic similarity between words using web search engines. In Proc. the 16th International Conference on World Wide Web, WWW '07, pp. 757-766, New York, USA, 2007.
[7] D. Bollegala, Y. Matsuo, and M. Ishizuka. A web search engine-based approach to measure semantic similarity between words. IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 7, pp. 977-990, 2011.
[8] G. Lu, P. Huang, L. He, C. Cu, and X. Li. A new semantic similarity measuring method based on web search engines. W. Trans. on Comp., Vol. 9, No. 1, pp. 1-10, 2010.
[9] R. L. Cilibrasi and P. M. B. Vitanyi. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3, 2007.
[10] ,,,. 7, 2015.
[11] R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In Proc. the 16th ACM Conference on Information and Knowledge Management, pp. 233-242, 2007.
[12] X. Liu, Y. Li, H. Wu, M. Zhou, F. Wei, and Y. Lu. Entity linking for tweets. In ACL (1), pp. 1304-1311, 2013.
[13] W. Shen, J. Wang, P. Luo, and M. Wang. Linking named entities in tweets with knowledge base via user interest modeling. In Proc. the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 68-76, 2013.
[14] Y. Fang and M. Chang. Entity linking on microblogs with spatial and temporal signals. Transactions of the Association for Computational Linguistics, Vol. 2, pp. 259-272, 2014.
[15] K. Leetaru, S. Wang, G. Cao, A. Padmanabhan, and E. Shook. Mapping the global Twitter heartbeat: The geography of Twitter. First Monday, Vol. 18, No. 5, 2013.
[16] W. Hua, K. Zheng, and X. Zhou. Microblog entity linking with social temporal context. In Proc. the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1761-1775, 2015.