Vol. 25, No. 2, April 2006

User Behavior Analysis for a Large-scale Search Engine

Wang Jimin 1,2 and Peng Bo 1
(1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; 2. Information Center for Resources and Environment Science, CAS, Lanzhou 730000)

Abstract: Tianwang Search Engine is a large-scale search engine system that currently maintains an index of about 240 million web pages and 20 million FTP files. In this paper we analyze the clickthrough data in the click log of Tianwang's WWW search service. The results show that the number of unique URLs selected by users conforms to Heaps' law, and that popularity versus rank for the selected URLs is well fit by a Zipf-like distribution. The frequency with which a URL is selected is correlated with its page size. URL clicks also exhibit a high degree of locality. For a given query, a new and effective algorithm is presented to find related queries. These results are important for improving the effectiveness and efficiency of the search engine system and for research on users' search behavior.

Keywords: search engine, click log, user behavior, characteristic distribution, similar query

1 (WWW, Web) Web http 2004 12 Google 80.58 2004 2 (CNNIC) [1]: 3112:

2005 6 10
1966 Web E-mail: wjm@net.pku.edu.cn
1975 E-mail: pb@net.pku.edu.cn
1) (60435020); (20030001076); (2004036182)
Web [2]: 2; 3; 4 [3]; 5 Email; 6 85 % Web WWW WWW: 2 [8] 1997 10 CERNET Web (IP) URL URL IP URL 2004 10 ( ) 2185 ( ) 20 10 1 [2, 4-7]: 2.2 2.4; Poisson 1; Pareto 216575512021197159121288071: Mon Sep 1 00:00:00 2003 URL ( )? IP (xxx.xxx.xxx.xxx?)?? http://202.106.182.194/search 2003 URL dir/jy/zk/cr/index.html: 34075652 URL URL; 7 URL; URL; URL URL URL URL 2003 9 1 10 31 Ξ 2: () URL URL URL Cookie IP Ξ YQ2CLICKLOG10390210
URL 75 % URL IP 1/4 URL 2 2003 9 10 6 583 486 URL 2 251 115 702 067 IP 347 817 IP 40 1
3 URL
3.1 URL Heaps
Heaps N M
$$M = C \cdot N^{a}$$
C a $0 < a < 1$ [2] 2003 9 1 k URL URL ($1 \le k \le 60$) 10 30 60
3.2 URL
1 URL URL $\ln M = a \ln N + b$: $a = 0.77$, $b = 2.46$, $M = C \cdot N^{0.77}$ URL 225 URL URL Heaps 219 URL 123 URL URL 54.8 % 45.2 % URL URL URL 5: (1) URL URL URL URL 2 URL URL 76.6 % 2 76.6 % 0.5 % $\ln f_r = a \ln r + b$ (2) URL 70 % 1 URL Heaps 2 URL 658 11 $a = -0.76$, $b = 10.9$
$f_r = C \cdot r^{-0.76}$ URL - Zipf Zipf [3] URL
3.3 URL - URL URL 10 URL 4 URL -: (1) URL X 12.3k (2) 10 16.5k 9.2k $\ln x$ 62 %; 20 4 76 %; 30 $Y = \exp(\ln x)$ 83 % (3): 83 % URL 1
$$f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$
Zipf 3 $y = c / x^{1.06}$, $a = 1.06$ $\ln x$ $\ln x$ [4] $\mu = 9.04$, $\sigma = 0.88$ URL 3 URL -
3.4 URL URL 23.1k 45.7k 13.5k $\ln x$: $\mu = 9.5277$, $\sigma = 0.9591$ URL URL 10 URL 34k 19k URL URL URL 3 10 10000 5 5 URL 3 URL
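The log-log fits quoted for Heaps' law ($\ln M = a \ln N + b$) and for the Zipf-like rank-frequency relation ($\ln f_r = a \ln r + b$) amount to an ordinary least-squares line through the log-transformed data. The following is a minimal Python sketch of that step, not the authors' code. The function name `fit_power_law`, the constant C = 1000, and the synthetic data are assumptions for illustration; only the exponent -0.76 comes from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of ln y = a * ln x + b; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    a = sxy / sxx              # slope: the power-law exponent
    b = mean_y - a * mean_x    # intercept: ln C
    return a, b


# Synthetic rank-frequency data following f_r = C * r^(-0.76).
# C = 1000 is a hypothetical constant, not a value from the paper.
ranks = list(range(1, 1001))
freqs = [1000.0 * r ** -0.76 for r in ranks]
a, b = fit_power_law(ranks, freqs)
```

For a Heaps' law fit, `xs` would hold the cumulative number of URL selections N and `ys` the number of distinct URLs M observed at each point. On real click logs the fit is only approximate, so the residuals are worth inspecting before the exponent is trusted.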
URL 3 URL 10 URL URL 1/4; 1/5 1/6; URL 2.6 % URL 6 URL 80 %; 10.4 % 3 URL URL (%) 3.4 42.4 19.4 23.8 6.5 41.9 25.0 80.9 2.7 39.9 4.8 10.4 5 14.9 54.1 2.6 22.5 10.2 44.5 7.4 24.1 2.9 18.2 (%) 4 URL ( ) [9] 6 Web (Web Infomall) [10]
3.5 URL Web 2002 Infomall URL [3] 300 Web Infomall URL
4.1 URL 30 URL 7: 0 60 10 100 3: (1) 5 4 6 7 (2) 1 0 2 (3) 12 URL 24 48 72 96 12 60 84 24 12 8 60 URL URL 10:
(1) 32 % 5 (300);
(2) 50 % 10 (600);
(3) 80 % 30 (1800);
(4) 90 % 50 (3000);
(5) 10.2 21.8 URL 10 URL Weibull:
$$p = F(x \mid a, b) = \int_0^x a b\, t^{b-1} e^{-a t^{b}}\, dt = \left(1 - e^{-a x^{b}}\right) I_{(0,\infty)}(x)$$
7 URL a b $[0.0054, 0.0089]$ $[0.64, 0.71]$ 10 URL
4.2 URL URL 2.3 [4] [5] 7 60 5 000 URL 21 % 8 3 7 10
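The Weibull form used above, $F(x \mid a, b) = 1 - e^{-a x^{b}}$ for $x > 0$, is easy to evaluate and invert directly. The following is a minimal Python sketch. The function names are hypothetical, and the parameter values a = 0.007 and b = 0.68 are illustrative picks from inside the reported ranges [0.0054, 0.0089] and [0.64, 0.71], not fitted values from the paper. The unit of x is also an assumption (the same unit as the intervals in the log).

```python
import math


def weibull_cdf(x, a, b):
    """F(x | a, b) = 1 - exp(-a * x**b) for x > 0, and 0 otherwise."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-a * x ** b)


def weibull_quantile(p, a, b):
    """Inverse of weibull_cdf: the x at which F(x | a, b) equals p, 0 <= p < 1."""
    return (-math.log(1.0 - p) / a) ** (1.0 / b)


# Illustrative parameters chosen inside the ranges reported in the text.
A, B = 0.007, 0.68

# Fraction of intervals expected to fall within each threshold under this model.
thresholds = [300, 600, 1800, 3000]
fractions = [weibull_cdf(x, A, B) for x in thresholds]
```

Comparing `fractions` with the empirical percentages of a log is one quick way to judge whether a fitted (a, b) pair is plausible; `weibull_quantile` answers the reverse question, for example the interval below which half of the clicks fall.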
9 URL H (Hurst) $H = 1 - \beta/2$: $X = \{X_t : t = 0, 1, 2, \ldots\}$ (H) X $\mu = E[X_i]$, $\sigma^2 = E[(X_i - \mu)^2]$ 500 URL
$$r(k) = \frac{E[(X_i - \mu)(X_{i+k} - \mu)]}{\sigma^2} \quad (k = 0, 1, 2, \ldots)$$
URL k $X^{(m)} = (X^{(m)}_1, X^{(m)}_2, \ldots)$ ($m = 1, 2, \ldots$)
$$X^{(m)}_k = \frac{X_{km-m+1} + \cdots + X_{km}}{m} \quad (k \ge 1)$$
$r^{(m)}(k)$ $X^{(m)}$: $r^{(m)}(k) = r(k)$, $k \ge 0$, $m = 1, 2, 3, \ldots$ X; $r^{(m)}(k)$ $r(k)$, $k \ge 0$, $m = 1, 2, 3, \ldots$ X
9 (a): $\log(m)$, $\log(\mathrm{Var}(X^{(m)}) / \mathrm{Var}(X))$; $-0.26$, $b = 0.26$, Hurst $H = 1 - b/2 = 0.87$, Hurst 0.67 ([4]) URL 1000 1500 2000 16000 URL m $1/2 < H < 1$ URL
9 (b) (c) 500 k 1000 2000 URL URL: $r(k)$ 5 k = 0: R/S Higuchi $X^{(m)}$ $\log(\mathrm{Var}(X^{(m)}) / \mathrm{Var}(X))$ Google $\log(m)$ $-\beta$ $(0, 1)$
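The variance-time method referred to above estimates the Hurst parameter from the slope of $\log(\mathrm{Var}(X^{(m)}) / \mathrm{Var}(X))$ against $\log(m)$, using $H = 1 - \beta/2$ where $-\beta$ is the fitted slope. The following is a minimal Python sketch of that procedure under stated assumptions: the function names, the block sizes, and the white-noise test series are illustrative and are not taken from the paper.

```python
import math
import random


def aggregate(xs, m):
    """Aggregated series X^(m): means of consecutive non-overlapping blocks of size m."""
    n_blocks = len(xs) // m
    return [sum(xs[i * m:(i + 1) * m]) / m for i in range(n_blocks)]


def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def hurst_variance_time(xs, block_sizes):
    """Estimate H = 1 - beta/2, where -beta is the slope of the variance-time plot."""
    var_x = variance(xs)
    pts = [(math.log10(m), math.log10(variance(aggregate(xs, m)) / var_x))
           for m in block_sizes]
    mean_u = sum(u for u, _ in pts) / len(pts)
    mean_v = sum(v for _, v in pts) / len(pts)
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in pts)
             / sum((u - mean_u) ** 2 for u, _ in pts))
    return 1.0 + slope / 2.0  # slope = -beta


# Sanity check on uncorrelated noise, for which H should be close to 0.5.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
h_noise = hurst_variance_time(noise, [1, 2, 4, 8, 16, 32, 64])
```

For a click log, `xs` would be the number of clicks on the studied URLs per fixed time slot. A value of H clearly above 1/2 on such a series, as with the 0.87 reported in the text, indicates long-range dependence, while uncorrelated noise stays near 1/2.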
M $p_j$ $q_1$ j URL: (1) (2) URL Ξ [11] Beeferman [11] [11] [12] - URL (Agglomerative) URL URL URL 1000 20 10
: $q_1$
: $q_1$ k
Step 1. $q_1$ URL U;
Step 2. U URL URL $q_i$ W;
Step 3. W $q_i$ - (SVD) $q_i$ URL URL U;
Step 4. URL - M;
Step 5. (1) $q_i$ ($q_1$ 10);
Step 6. k
bbs bbs 1 bbs W $\{q_1, \ldots, q_m\}$ URL bbs U $\{URL_1, \ldots, URL_n\}$ - M $m \times n$ $m_{ij}$ i $q_i$ j $URL_j$ 2
$$\mathrm{Similar}(q_i, q_1) = \sum_{j=1}^{|U|} \frac{\min(m(q_{ij}), m(q_{1j}))}{m(q_1)} \cdot p_j \quad (1)$$
$m(q_1)$ $q_1$ URL 82 % 3 10 (LSI) 3 4 10 Ξ: 87 %
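The step list above builds a query-URL click matrix around a seed query $q_1$ and ranks other queries by how much their clicks overlap with those of $q_1$. The following Python sketch illustrates that idea only; it is not the authors' implementation. It scores each candidate by the summed minimum of click counts over the URLs clicked for $q_1$, normalised by the total clicks $m(q_1)$. The weight $p_j$ is not recoverable from the text, so it is assumed to be 1 here. The function name, the dictionary layout, and the toy click counts are all hypothetical.

```python
def related_queries(clicks, q1, k=10):
    """Rank queries by click overlap with q1.

    clicks maps each query to a dict {url: click count}.
    Returns up to k (query, score) pairs, best first.
    """
    target = clicks[q1]            # URL set U clicked for q1, with counts
    total = sum(target.values())   # m(q1): total clicks recorded for q1
    scores = {}
    for q, urls in clicks.items():
        if q == q1:
            continue
        # Overlap on the URLs of q1; the weight p_j is assumed to be 1.
        overlap = sum(min(urls.get(u, 0), c) for u, c in target.items())
        if overlap > 0:
            scores[q] = overlap / total
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy click log (hypothetical counts), using the seed query "bbs".
toy_clicks = {
    "bbs": {"u1": 5, "u2": 3, "u3": 2},
    "bbs forum": {"u1": 4, "u2": 1},
    "pku bbs": {"u1": 1, "u4": 7},
    "mp3": {"u5": 9},
}
top = related_queries(toy_clicks, "bbs", k=10)
```

Queries that share no clicked URL with the seed never enter the ranking, which mirrors the restriction of the candidate set W to queries reaching a URL in U. A non-uniform $p_j$ could be added by multiplying each `min(...)` term by a per-URL weight.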
6 URL URL Heaps; URL Zipf; URL URL

1 (China Internet Network Information Center, CNNIC). http://www.cnnic.net.cn/
2 Baldi P, Frasconi P, Smyth P. Modeling the Internet and the Web: probabilistic methods and algorithms. England: John Wiley, 2003
3 .. :. 2005
4 Wang Jianyong, Li Xiaoming, Shan Songwei. Web search engine: characteristics of user behaviors and their implication [J]. Science in China (Series F), 2001, 44(5): 351-365
5 Xie Yinglian, O'Hallaron D. Locality in search engine queries and its implications for caching. In: Proc. IEEE Infocom, 2002
6 Silverstein C, Henzinger M, Marais H, et al. Analysis of a very large AltaVista query log. SRC Technical Note 1998-016, 1998
7 Spink A, Wolfram D, Jansen B J, et al. Searching the web: The public and their queries. Journal of the American Society for Information Science, 2001, 53(2): 226-234
8 (Tianwang Search Engine). http://e.pku.edu.cn
9 Cho J. Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data. [Ph.D. dissertation], Stanford University, 2001
10 Web (Chinese Web Infomall). http://www.infomall.cn/
11 Beeferman D, Berger A. Agglomerative clustering of a search engine query log. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, 407-416
12 Wen Ji-Rong, Nie Jian-Yun, Zhang Hong-Jiang. Query clustering using userlogs. In: Proceedings of the 10th World Wide Web Conference. New York: ACM Press, 2001, 162-168