Πανεπιστήμιο Κρήτης, Τμήμα Επιστήμης Υπολογιστών Άνοιξη 6 HΥ63 - Συστήματα Ανάκτησης Πληροφοριών Infomtion Retievl (IR Systems Web Seching I: Histoy nd Bsic Notions, Cwling II: Link Anlysis Techniques III: Web Spm Pge Identifiction Γιάννης Τζίτζικας ιάλεξη : Ημερομηνία : 9 / 3 / 6 Bsed on Z. Gyongyi, H. Gci-Molin, J. Pedesen, Compting Web Spm with Tust Rnk, SIGMOD CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 Κίνητρο Web spm pges use vious techniques to chieve highe-thndeseved nkings in sech engine s esults. Humn expets cn identify spm, but it is too expensive to mnully evlute lge numbe of pges Ανάγκη για αυτόματες ή ημιαυτόματες τεχνικές διαχωρισμού των «καλών» σελίδων από τις «κακές» CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6
Web Spm Ορισμός: διασυνδεδεμένες σελίδες δημιουργημένες για παραπλάνηση των μηχανών αναζήτησης Παραδείγματα ponogphy site pge tht contins thousnds of keywods which e mde invisible (to humns by djusting ccodingly the colo scheme sech engine will include this pge in the esults of quey tht contins some of these keywods cetion of lge numbe of bogus web pges, ll pointing to single tget pge (tht pge will hve high in-degee sech engine will nk high this pge CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 3 Αντιμετώπιση Κλασσικός τρόπος αντιμετώπισης Sech engine compnies typiclly employ stff membes who specilize in the detection of web spm, constntly scnning the web looking fo offendes In cse spm pge is identified, the sech engine stops cwling nd indexing it. Vey expensive nd slow spm detection pocess CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6
Μια ημιαυτόματη προσέγγιση Selection of smll set of seed pges to be evluted by n expet Afte the mnul selection of the eputble seed pges, the link stuctue of the web is exploited to discove othe pges tht e likely to be good. Ζητήματα: How we should implement the seed selection? How we cn discove the good pges? CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 5 Appoximte isoltion of the good set 3 5 6 7 8 Empiicl obsevtion: Good pges seldom point to bd ones. CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 6
Appoximte isoltion of the good set Exceptions 3 5 6 7 8 the cetos of good pges cn sometimes by «ticked» nd dd links to bd pges. Exmples: Unmodeted messge bods whee spmmes post messges tht include links to thei spm pges Honey pots pges tht contin some useful esouce but hve hidden links to thei spm pges (the honey pot ttcts people to point to it CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 7 Assessing Tust: Ocle Function We fomlize the notion of humn checking pge by biny «ocle» function O, ove ll pges p in V. O( p if p is bd if p is good Ocle invoctions e expensive nd time consuming. We do not wnt to cll the ocle function fo ll pges. Ou objective is to be selective, i.e. to sk humn expet to evlute only some of the pges CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 8
Συνάρτηση Εμπιστοσύνης (Tust Function To evlute pges without elying on O, we will estimte the likehood tht given pge p is good. Tust function yields vlues between (bd nd (good Idelly, fo ny p, T(p should give us the pobbility tht p is good Idel Tust Popety (ITP T(p P[ O(p] difficult to chieve even if T is not vey ccute we could exploit it to ode pges by thei likehood of being good CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 9 Συνάρτηση Εμπιστοσύνης (Tust Function Desied Tust Popety (elxtion of ITP T(p < T(q P[ O(p] < P[ O(q] T(p T(q P[ O(p] P[ O(q] Theshold Tust Popety (nothe elxtion of ITP T(p > δ O(q CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6
Υπολογισμός Εμπιστοσύνης: The ignont tust function The ignont tust function T We cn select t ndom seed set S of L pges nd cll the ocle on its elements. Let S+ be the good pges nd S- the bd ones. Since the emining pges e not checked we cn mk them with /. This is the ignont tust function T T ( p O( p / if p S othewise CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 Διάδοση Εμπιστοσύνης (Tust Popgtion We cn exploit the empiicl obsevtion «Good pges seldom point to bd ones», nd ssign scoe to ll pges tht e echble fom pge in S+ in M o fewe steps. Tust Function T M : T M O( p ( p / if if p S p S nd q S othewise + : M q p q-m->p : thee is pth of mximum length M fom q to p The bigge M the futhe we e fom good pges, the less cetin we e tht pge is good CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6
Εξασθένηση Εμπιστοσύνης (Tust Attenution Tust dmpening β β β β 3 β T(β T( T(3β Tust dmpening ssign scoe β (< to pges echble t step ssign the scoe β*β to pges echble t step, nd so on pges with multiple inlinks: mximum scoe o vege scoe CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 3 Εξασθένηση Εμπιστοσύνης (Tust Attenution Tust splitting Tust splitting T( T( / / /3 /3 /3 motivtion: the ce with which people dd links to thei pges in often invesely popotionl to the numbe of links on the pge if pge p hs tust scoe T(p nd it points to out(p pges, ech of them will eceive scoe fction T(p/ out(p fom p the ctul scoe of pge will be the sum of the scoe fctions eceived though its inlinks We could combine tust dmpening nd splitting 3 T(35/6 5/ 5/ CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6
CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 5 Ο Αλγόριθμος TustRnk In TustRnk we will combine tust dmpening nd splitting: in ech itetion, the tust scoe of node is split mong its neighbos nd dmpened by fcto of b We will compute TustRnk scoes using bised PgeRnk lgoithm the ocle-povided scoes eplce the unifom distibution CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 6 Επανάληψη: PgeRnk M p q if q out M p q if q p T, ( ( /, (, ( Tnsition mtix T 3 / / T M Adjcency mtix M The PgeRnk scoe R(p of pge is defined s + ( ( ( ( ( p in q N q out q R p R The equivlent mtix eqution: N N R T R + (
CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 7 Επανάληψη: PgeRnk 3 + ( 3 / / 3 + + ( 3/ 3/ 3 + + + + / ( 3/ / ( / ( 3/ ( / ( 3 N N R T R + ( CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 8 Επανάληψη: Ο Αλγόριθμος PgeRnk function PgeRnk Input T: tnsition mtix, N: numbe of pges, b : decy fcto fo bised PgeRnk, M b : numbe of bised PgeRnk itetions output t* : PgeRnk scoes (3 d /Ν * N // initil scoe fo ll pges is /Ν (5 t* d fo i to M b do // evlutes PgeRnk scoes t* b T t* + ( - b d etun t*
Ο Αλγόριθμος TustRnk function TustRnk Input T: tnsition mtix, N: numbe of pges, L: limit of ocle invoctions, b : decy fcto fo bised PgeRnk, M b : numbe of bised PgeRnk itetions output t* : TustRnk scoes ( s SelectSeed ( // seed-desibility: etuns vecto. // E.g. s(p is the desibility fo pge p ( σ Rnk({,,N}, s // odes in decesing ode of s-vlue ll pges (3 d N // initil scoe fo ll pges is fo i to L do // invokes ocle function on the most desible pges if O(σ(i then d(σ(i ( d : d / d // nomlize sttic distibution scoe (to sum up to (5 t* d fo i to M b do // evlutes TustRnk scoes using bised PgeRnk t* b T t* + ( - b d // note tht d eplces the unifom distibution etun t* CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 9 Ο Αλγόριθμος TustRnk Remks: function TustRnk Input T: tnsition mtix, N: numbe of pges, L: limit of ocle invoctions, Step 5 implements pticul vesion of tust dmpening nd splitting: in b : decy fcto fo bised PgeRnk, M b : numbe of bised PgeRnk itetions ech itetion, the tust scoe of node is split mong its neighbos nd output t* : TustRnk scoes dmpened by fcto of b ( s SelectSeed ( // seed-desibility: etuns vecto. // E.g. s(p is the desibility fo pge p The good seed pges hve no longe scoe of, howeve they still hve ( σ Rnk({,,N}, s // odes in decesing ode of s-vlue ll pges the highest scoes (3 d N // initil scoe fo ll pges is fo i to L do // invokes ocle function on the most desible pges if O(σ(i then d(σ(i ( d : d / d // nomlize sttic distibution scoe (to sum up to (5 t* d fo i to M b do // evlutes TustRnk scoes using bised PgeRnk t* b T t* + ( - b d // note tht d eplces the unifom distibution etun t* CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6
Selecting Seeds ( s SelectSeed ( // seed-desibility: etuns vecto. // E.g. s(p is the desibility fo pge p ( σ Rnk({,,N}, s // odes in decesing ode of s-vlue ll pges Πιθανές Στρατηγικές α Rndom selection β High PgeRnk Επιλέγουμε τις σελίδες με υψηλό PgeRnk σκορ διότι αυτές οι σελίδες συχνά εμφανίζονται στην κορυφή των απαντήσεων γ Invese PgeRnk 3 5 Although 5 hs the highest PgeRnk it is not good seed CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 Selecting Seeds: (γ Invese PgeRnk επειδή η εμπιστοσύνη διαχέεται από τις καλές σελίδες, είναι λογικό να επιλέξουμε εκείνες τις σελίδες από τις οποίες μπορούμε να φτάσουμε σε πολλές άλλες άρα μια ιδέα είναι να επιλέξουμε τις σελίδες με πολλά outlinks Επιλογή των p, p, p5 3 5 6 γενίκευση: επιλέγουμε τις σελίδες που δείχνουν σε πολλές σελίδες οι οποίες με τη σειρά τους δείχνουν σε πολλές σελίδες, κ.ο.κ Επιλογή της p 7 8 Τρόπος: Αφού η σπουδαιότητα μιας σελίδας εξαρτάται από τα outlinks της (και όχι από τα inlinks της, μπορούμε να χρησιμοποιήσουμε την PgeRnk αντιστρέφοντας την φορά των ακμών 3 5 6 7 8 CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6
Expeimentl Evlution Expeimentl Evlution Expeiments on the complete set of pges cwled nd indexed by AltVist (Aug. 3 Reduce computtionl cost: wok t the level of web sites (insted of web pges gouping of the (billions of pges into 3 millions sites websitea points to websiteb if one o moe pges fom websitea point to one o moe pges of websiteb So t most link my stt fom website A nd point to website B Obsevtions /3 of the websites e unefeenced So TustRnk cnnot diffeentite between them becuse they ll hve in(p Howeve they e low scoed nywy (e.g. by PgeRnk so they do not ppe high in nswes CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6
Expeimentl Evlution: Seed Selection ( s SelectSeed ( ( σ Rnk({,,N}, s (3 d N ( fo i to L do if O(σ(i then d(σ(i Seed Set Selection Invese PgeRnk pplied on the gph of websites woked bette thn High PgeRnk (fo the seed selection pocess Pmetes: :.85, itetions: With itetions the eltive odeing stbilized Mnul inspection of the top 5 sites ( S 5 Fom these only 78 wee used s good seeds, i.e. S+ 78 sites CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 5 Expeimentl Evlution: Evlution Smple To test the effectiveness of TustRnk we need Refeence Collection (e.g. something like TREC A smple set X of sites ws selected nd evluted mnully, i.e. the ocle function ws invoked (i.e. peson inspected them nd decided whethe they e spm o not The Smple set X ws not selected t ndom. Recll tht we e minly inteetested in spm pges tht ppe high in nswes The following smple selection method ws followed: Genete list of sites in decesing ode of thei PgeRnk scoes Segment them into buckets so tht the sum of the scoes in ech bcket equls 5% of the totl PgeRnk scoe bcket 86, bcket 665,, bcket 5 millions pges select 5 sites t ndom fom ech bucket ( * 5 CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 6
Expeimentl Evlution: Evlution Smple The esults of the mnul evlution (ocle invoction of the pges in the smple set of sites CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 7 This collection (i.e. X ws used fo evluting TustRnk vesus PgeRnk
Evlution Results: Comping PgeRnk with TustRnk Good sites Reputblewhite, dvetisementgy, webogniztiondk gy CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 9 Evlution Results: Comping PgeRnk with TustRnk Bd sites TustRnk is esonble spm detection tool CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 3
Μέτρα Αξιολόγησης της Συνάρτησης Εμπιστοσύνης (Evlution Metics fo the Tust Function Assume smple set X of web pges fo which we cn invoke both T nd O Web X CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 3 Μέτρα Αξιολόγησης της Συνάρτησης Εμπιστοσύνης: Pecision nd Recll We cn define pecision nd ecll bsed on the theshold tust popety: pec( T, O { p X T ( p > δ nd O( p { q X T ( q > δ } } X ec( T, O { p X T ( p > δ nd O( p { q X O( q } } CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 3
Μέτρα Αξιολόγησης της Συνάρτησης Εμπιστοσύνης: Pecision & Recll: Πειραματική Αξιολόγηση δ: such tht to septe bckets CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 33 Μέτρα Αξιολόγησης της Συνάρτησης Εμπιστοσύνης: Piwise Odeedness We cn genete fom X set P of pis nd we cn compute the fction of the pis fo which T did not mke mistke. CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 3 X Web The following metic cn signl violtion of the odeed tust popety if T ( p T ( q nd O( p < O( q I( T, O, p, q if T ( p T ( q nd O( p > O( q othewise P ( p, q P I( T, O, p, q piod( T, O, P P Piod(T,O,P if T does not mke ny mistke Piod(T,O,P if T mkes lwys mistkes P X x X
Μέτρα Αξιολόγησης της Συνάρτησης Εμπιστοσύνης: Piwise Odeedness: Πειραματική Αξιολόγηση CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 35 Συμπέρασμα Πειραματικής Αξιολόγησης TustRnk cn effectively filte out spm fom significnt fction of the Web, bsed on good seed set of less thn sites CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 36
Refeences Z. Gyongyi, H. Gci-Molin, J. Pedesen, Compting Web Spm with Tust Rnk, SIGMOD CS63 - Infomtion Retievl Systems Ynnis Tzitziks, U. of Cete, Sping 6 37