Journal of Chinese Computer Systems, April 2014, Vol. 35, No. 4

Mining Hot Event from Microblog with Hadoop

XIE Si-fa 1, LIN Chen 1,2, SU Xuan 3, JIANG Yi 1
1 (School of Information Science and Technology, Xiamen University, Xiamen 361005, China)
2 (Shenzhen Research Institute of Xiamen University, Shenzhen 518000, China)
3 (Channal trans Network (Shanghai) Co., Shanghai 200000, China)
E-mail: chenlin@xmu.edu.cn

CLC number: TP18  Document code: A  Article ID: 1000-1220(2014)04-0797-05

Abstract: As an emerging social-networking service, microblogging supports near-instant communication and spreads hot social issues rapidly through many channels. However, the sheer volume of data released in a short time fragments information to some extent, and the rapid turnover of posts makes essential issues hard to retrieve. In this paper we propose a distributed, Hadoop-based algorithm for mining hot spots from microblog data, which suits large-scale data mining, and we detect hot issues from the extracted spots to make searching easier for users. We further present a detection algorithm with linear time complexity that identifies the time period in which a hot issue bursts. Experiments on Twitter and Sina Weibo show that the algorithm extracts hot issues from microblogs effectively.

Key words: microblog; Hadoop; distributed; hot event

[The body text of the opening sections did not survive extraction. Surviving terms and citations, in order: [6]; TDT [7], [8], [9]; Twitter [10]; Twitter [11]; [12-15]; K-Means; topic detection and tracking (TDT) [1-3]; TDT, Allan [4]; [5]; Agarwal; the symbols MT, T and C.]

Received 2013-01-25; revised 2013-03-02. Grant numbers: 61102136, 61001013, 2011J05158, JCYJ20120618155655087. Author birth years as listed: 1989, 1982, 1982, 1960.
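WS: for a word $W$, $WS = (W, Fs)$ with $Fs = (f_1, f_2, \ldots, f_n)$, where $f_i$ is the frequency of $W$ in time slot $t_i$. In the Map-Reduce job that builds WS, $W$ is the key and the list TL is the value.

BS: for a word $W$, $BS = (W, Bs)$ with $Bs = BL = (b_1, b_2, \ldots, b_n)$, where $b_i = f_i - \mu - 2\sigma$, and $\mu$ and $\sigma$ are the mean and standard deviation of the frequencies of $W$ over a window of $w$ time slots [16].

Fig 1 Flow chart of hot event detecting
[Figure not recoverable. Surviving labels: MT, WS, BS, WKSC [17], Haar.]

Computing WS. Input: a microblog MT with time t and content c.
Map
    WL <- IKAnalyzer(MT.c)
    for i = 1 to WL.Length do
        Init(TL)
        j <- GetIndex(MT.t)
        TL[j] <- 1
        emit (WL[i], TL)
    end for
Reduce
    Init(TL)
    for i = 1 to the count of value do
        TL += value_i
    end for
    emit (key, TL)

Computing BS. Input: a word W (the key) with its list TL.
Map
    for i = 1 to TL.Length do
        for j = i + 1 to i + w do
            emit ((key, j), (TL[j], TL[i]))
        end for
    end for
Reduce
    Init(BL)
    $\mu \leftarrow \frac{1}{w} \sum_{i=1}^{w} value_i$
    $\sigma^2 \leftarrow \frac{1}{w} \sum_{i=1}^{w} (value_i - \mu)^2$
    BL[j] <- TL[j] - $\mu$ - 2$\sigma$
    emit (key, BL)

Detecting burst intervals. Input: the list BL of a word. Each interval has the fields st, et, Ls and Rs; Cs is the running sum of BL and Cs' its previous value.
    Init(L)    // L: list of intervals found so far
    for i = 1 to BL.Length do
        Cs' <- Cs
        Cs <- Cs + BL[i]
        if BL[i] < 0 then
            continue
        end if
        Temp <- (i, Cs', Cs)
        while true do
            Merge <- null
            for I = L[L.Length] down to L[1] do
                if I.Ls < Temp.Ls then
                    Merge <- I    // nearest earlier interval whose Ls is < Cs'
                    break
                end if
            end for
            if Merge == null || (Merge != null && Merge.Rs > Temp.Rs) then
                L.Add(Temp)
                break
            else
                Temp.st <- Merge.st
                Temp.Ls <- Merge.Ls
                Delete(Merge)
            end if
        end while
    end for

[The text of the experimental section did not survive extraction. Surviving details: Hadoop 1.0.1; Intel(R) Core(TM) i7 CPUs; capacity figures of 1T, 64G and 32G; a Twitter data set dated 2011-01-23 to 2011-02-08 (2.2G); a second data set dated 2009-08 to 2012-05 (3.3G); reduced I/O is mentioned in connection with Hadoop.]

Table 1 Time performance of different processing nodes
Processing nodes    Time (first column)    Time (second column)
1                   416s                   1322s
4                   405s                   1199s
8                   384s                   1069s
(label lost)        3900s                  9700s

Fig 2 Japan's nuclear leak
Fig 3 Shanghai's World Expo
[Discussion of Figs 2-6 and Table 2 not recoverable. Surviving terms: WKSC, rally.]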
Fig 4 Bin Laden is shot dead
Fig 5 Egypt riot
Fig 6 Korea's world athletics championship

Table 2 Burst time of hot event
[Table body not recoverable. Surviving cells are year-month values between 2010-01 and 2011-05 and several month-day values in late January and early February.]

[Concluding text not recoverable. Surviving terms: Hadoop, Twitter, bigram, trigram.]
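The three steps given as pseudocode (building the per-word list TL, turning it into the burst list BL with the $\mu + 2\sigma$ threshold, and extracting burst intervals in linear time) can be illustrated on a single machine as follows. This is a minimal sketch under stated assumptions, not the authors' Hadoop implementation: the function names, the `(slot, words)` input shape, and the treatment of slots with fewer than `w` preceding values are all hypothetical, and the interval step is written as the standard maximal-scoring-subsequence procedure of Ruzzo and Tompa, which the pseudocode appears to resemble.

```python
from collections import defaultdict
from math import sqrt
from typing import NamedTuple


def word_series(posts, n_slots):
    """WS step: posts are (slot_index, [words]) pairs (assumed input shape).

    Plays the role of Map (emit one count per word per post) and
    Reduce (sum the counts per word) in a single pass.
    """
    series = defaultdict(lambda: [0] * n_slots)
    for slot, words in posts:
        for word in words:
            series[word][slot] += 1
    return dict(series)


def burst_series(tl, w):
    """BS step: score[j] = tl[j] - mu - 2*sigma over the preceding w slots.

    Assumption: a slot with fewer than w predecessors uses the ones it has,
    and a slot with no predecessor gets score 0.0.
    """
    bl = []
    for j, count in enumerate(tl):
        history = tl[max(0, j - w):j]
        if not history:
            bl.append(0.0)
            continue
        mu = sum(history) / len(history)
        sigma = sqrt(sum((v - mu) ** 2 for v in history) / len(history))
        bl.append(count - mu - 2 * sigma)
    return bl


class Interval(NamedTuple):
    st: int    # first slot of the interval
    et: int    # last slot of the interval
    ls: float  # cumulative score just before st
    rs: float  # cumulative score through et


def burst_intervals(bl):
    """Return (st, et) pairs of maximal-scoring intervals of bl."""
    found = []
    cs = 0.0
    for i, score in enumerate(bl):
        prev_cs = cs
        cs += score
        if score <= 0:  # assumption: zero scores are skipped as well
            continue
        temp = Interval(i, i, prev_cs, cs)
        while True:
            # Search right to left for the nearest interval with a smaller ls.
            k = next((k for k in range(len(found) - 1, -1, -1)
                      if found[k].ls < temp.ls), None)
            if k is None or found[k].rs >= temp.rs:
                found.append(temp)
                break
            # Otherwise extend temp leftwards and drop the absorbed intervals.
            temp = Interval(found[k].st, temp.et, found[k].ls, temp.rs)
            del found[k:]
    return [(iv.st, iv.et) for iv in found]


if __name__ == "__main__":
    posts = [(s, ["expo"]) for s in range(4)] + [(4, ["expo"])] * 10
    tl = word_series(posts, n_slots=5)["expo"]
    bl = burst_series(tl, w=3)
    print(tl, bl, burst_intervals(bl))
```

Each positive score is pushed once and each stored interval is deleted at most once, which is where the linear bound claimed in the abstract would come from in practice; the right-to-left search can cost more in contrived cases unless the list is indexed as in the original procedure.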
References
[1] Li Hong, Wei Jin-feng. Netnews bursty hot topic detection based on bursty feature[C]. Proceedings of International Conference on E-Business and E-Government, Washington DC, USA: IEEE, 2010: 1437-1440.
[2] Holz F, Teresniak S. Towards automatic detection and tracking of topic change[M]. Computational Linguistic and Intelligent Text, Berlin, Germany: Springer-Verlag, 2010: 327-339.
[3] Jing Qiu, Liao Le-jian, Dong Xiu-jie. Topic detection and tracking for Chinese news web pages[C]. Proceedings of Seventh International Conference on Advanced Language Processing and Web Information Technology, Washington DC, USA: IEEE Computer Society, 2008: 114-120.
[4] Allan J, Papka R, Lavrenko V. On-line new event detection and tracking[C]. SIGIR '98: Proceedings of 21st ACM SIGIR International Conference on Research and Development in Information Retrieval, New York: ACM, 1998: 37-45.
[5] Wu Yong-hui, Wang Xiao-long, Ding Yu-xin, et al. Adaptive online web topic detection method for web news recommendation system[J]. Acta Electronica Sinica, 2010, 38(11): 2620-2624.
[6] Manoj K Agarwal, Krithi Ramamritham, Manish Bhide. Real time discovery of dense clusters in highly dynamic graphs: identifying real world events in highly dynamic environments[C]. Proceedings of the VLDB Endowment, Very Large Data Base Endowment Inc (VLDB), 2012, 5(10): 980-991.
[7] Lin Chen, Lin Chun, Li Jing-xuan, et al. Generating event storyline from microblogs[C]. Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM), 2012: 175-184.
[8] Sasa Petrovic, Miles Osborne, Victor Lavrenko. Streaming first story detection with application to twitter[C]. The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2010: 181-189.
[9] Efron M. Information search and retrieval in microblogs[J]. Journal of the American Society for Information Science and Technology, June 2011, 62(6): 996-1008.
[10] Mathioudakis M, Koudas N. Twittermonitor: trend detection over the twitter stream[C]. Proceedings of the 2010 International Conference on Management of Data (SIGMOD 2010), New York: ACM, 2010: 1155-1158.
[11] Takamura H, Yokono H, Okumura M. Summarizing a document stream[M]. Advances in Information Retrieval, Springer Berlin Heidelberg, 2011: 177-188.
[12] Sakaki T, Okazaki M, Matsuo Y. Earthquake shakes twitter users: real-time event detection by social sensors[C]. Proceedings of the 19th International Conference on World Wide Web (WWW 2010), 2010: 851-860.
[13] Shamma D A, Kennedy L, Churchill E F. Peaks and persistence: modeling the shape of microblog conversations[C]. Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work (CSCW '11), 2011: 355-358.
[14] Li Jin, Zhang Hua, Wu Hao-xiong, et al. BTopicMiner: domain-specific topic mining system for Chinese microblog[J]. Journal of Computer Applications, 2012, 32(8): 2346-2349.
[15] Weng Jian-shu, Bu-Sung Lee. Event detection in twitter[C]. In Proceedings of the Fifth Annual Conference on Weblogs and Social Media (ICWSM 2011), 2011: 401-408.
[16] Yao Jun-jie, Cui Bin, Huang Yu-xin, et al. Bursty event detection from collaborative tags[C]. World Wide Web, 2012, 15: 171-195.
[17] Han Zhong-ming, Chen Ni, Le Jia-jin, et al. An efficient and effective clustering algorithm for time series of hot topic[J]. Chinese Journal of Computers, 2012, 35(11): 2337-2347.