Προχωρημένες Λειτουργίες Επερώτησης Advanced Query Operations

Πανεπιστήμιο Κήτης, Τμήμα Επιστήμης Υπολογιστών Άνοιξη 2007 HΥ463 - Συστήματα Ανάκτησης Πληοφοιών Information Retrieval (IR) Systems Ποχωημένες Λειτουγίες Επεώτησης Avance Query Operations Γιάννης Τζίτζικας ιάλεξη : 11 Ημεομηνία : 2-5-2007 CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 1 Διάθωση Διάλεξης Κίνητο Ανάδαση Συνάφειας (Relevance Feeback) Αναδιατύπωση Επεωτήσεων (Query Reformulation) Αναβάυνση Όων (Term Reweighting) Επέκταση (Διαστολή) Επεώτησης (Query Expansion), Αναδιατύπωση Επεωτήσεων για το Διανυσματικό Μοντέλο Optimal Query, Rochio Metho, Ie Metho, DeHi Metho Η έννοια του Optimal (or Best) Query Αξιολόγηση Ψευδο-ανάδαση συνάφειας (Pseuo relevance feeback) Επέκταση Επεωτήσεων Αυτόματη Τοπική (Επιτόπια) Ανάλυση (Automatic Local Analysis) Καθολική Ανάλυση Επέκταση Επεώτησης βάσει Θησαυού (Thesaurus-base Query Expansion) Αυτόματη Καθολική Ανάλυση (Automatic Global Analysis) Στατιστικοί Θησαυοί (Statistical Thesaurus) Γενετικοί Αλγόιθμοι CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 2

Κίνητο Έχει παατηηθεί ότι οι χήστες των ΣΑΠ δαπανούν πολύ χόνο αναδιατυπώνοντας την αχική τους επεώτηση ποκειμένου να βουν ικανοποιητικά έγγαφα Πιθανές αιτίες ο χήστης δεν γνωίζει το πειεχόμενο των υποκείμενων εγγάφων το λεξιλόγιο του χήστη μποεί να διαφέει από αυτό της συλλογής η αχική επεώτηση μποεί να είναι πιο γενική ή πιο ειδική από αυτή που θα έπεπε (καταλήγοντας είτε σε πάα πολλά ή σε πολύ λίγα έγγαφα) Η αχική επεώτηση μποεί να θεωηθεί ως η πώτη ποσπάθεια έκφασης της πληοφοιακής ανάγκης του χήστη Ανάγκη για τεχνικές αντιμετώπισης αυτού του ποβλήματος CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 3 Τόποι Αντιμετώπισης (1) Βελτίωση της αχικής επεώτησης (2) Χήση Ποφίλ Χήστη (3) Βελτίωση παάστασης κειμένων (4) Βελτίωση αλγοίθμου (μοντέλου) ανάκτησης Παατηήσεις Τα (2),(3),(4) έχουν πιο μόνιμο αποτέλεσμα (επηεάζουν την απάντηση και των επόμενων επεωτήσεων) Εδώ θα εστιάσουμε στο (1) CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 4

Τεχνικές Βελτίωσης της Αχικής Επεώτησης Κατηγοίες: (α) τεχνικές που απαιτούν είσοδο από τον χήστη (β) τεχνικές που δεν απαιτούν είσοδο (β1) που στηίζονται στα κουφαία έγγαφα που ανακτήθηκαν (β2) που στηίζονται σε όλα τα έγγαφα της συλλογής Ανάδαση Συνάφειας (Relevance Feeback): Η βασική ιδέα Βήματα: 1/ Μετά την παουσίαση των αποτελεσμάτων, επιτέπουμε στο χήστη να κίνει την συνάφεια ενός ή πεισσότεων εγγάφων της απάντησης 2/ Αξιοποιούμε αυτήν την πληοφοία για να αναδιατυπώσουμε την επεώτηση 3/ Κατόπιν δίδουμε στο χήστη την απάντηση της αναδιατυπωμένης επεώτησης 4/ Πήγαινε στο βήμα 1/ CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 6

Αχιτεκτονική για Ανάδαση Συνάφειας Query String Revise Query IR System Rankings ReRanke Documents Query Reformulation Ranke Documents 1. Doc1 2. Doc2 3. Doc3.. 1. Doc2 2. Doc4 3. Doc5.. 1. Doc1 2. Doc2 3. Doc3 Feeback.. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 7 Τμήματα της Αχιτεκτονικής που Εμπλέκονται user nee User Interface Text user feeback query retrieve ocs logical view Query Operations Searching Text Operations logical view Inexing inverte file Inex Text Corpus ranke ocs Ranking CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 8

http://nayana.ece.ucsb.eu/imsearch/imsearch.html q=bike CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 9 Αποτελέσματα Ans(«bike») CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 10

Μακάισμα των Συναφών από τον Χήστη CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 11 Απάντηση Αναδιατυπωμένης Επεώτησης CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 12

Αναδιατύπωση επεώτησης βάσει Ανάδασης Συνάφειας (Relevance Feeback: Query Reformulation) Τόποι αναδιατύπωσης της επεώτησης μετά την ανάδαση: Αναβάυνση των Όων (Term Reweighting): Αύξηση των βαών των όων που εμφανίζονται στα συναφή έγγαφα και μείωση των βαών των όων που εμφανίζονται στα μη-συναφή έγγαφα. Επέκταση επεώτησης (Query Expansion): Ποσθήκη νέων όων στην επεώτηση (π.χ. από γνωστά συναφή έγγαφα) Υπάχουν πολλοί αλγόιθμοι για επαναδιατύπωση επεώτησης CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 13 Αναδιατύπωση επεώτησης στο Διανυσματικό Μοντέλο Η βέλτιστη επεώτηση (Optimal Query) Ας υποθέσουμε ότι γνωίζουμε το σύνολο C r όλων των συναφών (με την πληοφοιακή ανάγκη του χήστη) εγγάφων. Η «καλύτεη επεώτηση» (αυτή που κατατάσσει στην κουφή όλα τα συναφή έγγαφα) θα ήταν: 1 1 qopt = C N C r C r r C Where N is the total number of ocuments. r (θα αναλύσουμε πεισσότεο αυτό το ζήτημα αγότεα) Αφού όμως δεν γνωίζουμε το σύνολο C r,θα λάβουμε υπόψη την αχική επεώτηση και την είσοδο του χήστη. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 14

Αναδιατύπωση επεώτησης στο Διανυσματικό Μοντέλο Αφού όμως δεν γνωίζουμε το σύνολο C r,θα λάβουμε υπόψη την αχική επεώτηση και την είσοδο του χήστη. answer(q): Κόκκινα: ο χήστης έδωσε ανητική ανάδαση Πάσινα: ο χήστης έδωσε θετική ανάδαση Μπλε: ο χήστης δεν έδωσε ανάδαση CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 15 Τεχνικές (Ι) Rocchio Metho (ΙΙ) Ie Metho (ΙΙΙ) DeHi Metho CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 16

(Ι) Stanar Rochio Metho Αφού το σύνολο όλων των συναφών είναι άγνωστο, χησιμοποίησε τα γνωστά συναφή (D r ) και γνωστά μη-συναφή (D n ) έγγαφα (από την απάντηση της αχικής επεώτησης και βάσει της εισόδου από τον χήστη) και επίσης συμπειέλαβε την αχική επεώτηση q. Αναδιατυπωμένη επεώτηση: q m = αq + β D r Dr γ D n Dn α: Tunable weight for initial query. β: Tunable weight for relevant ocuments. γ: Tunable weight for irrelevant ocuments. answer(q): Usually γ < β (the relevant ocs are more important) If γ=0 then we have positive feeback only CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 17 (ΙΙ) Ie Regular Metho Πεισσότεη ανάδαση => μεγαλύτεος βαθμός αναδιατύπωσης. Για αυτό, κατά την IDE Regular μέθοδο δεν κάνουμε κανονικοποίηση (βάσει του ποσού ανάδασης) q m = α q + β D r γ D n α: Tunable weight for initial query. β: Tunable weight for relevant ocuments. γ: Tunable weight for irrelevant ocuments. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 18

CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 19 (III) Ie Dec Hi Metho Τάση για απόιψη μόνο των μη-συναφών εγγάφων που έχουν υψηλό σκο (Bias towars reecting ust the highest ranke of the irrelevant ocuments:) ) ( max relevant non D m q q r + = γ β α α: Tunable weight for initial query. β: Tunable weight for relevant ocuments. γ: Tunable weight for irrelevant ocument. answer(q): CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 20 Σύγκιση μεθόδων (I) (II) (III) Γενικά, τα πειαματικά δεδομένα δεν δίνουν καθαό ποβάδισμα σε κάποια τεχνική. Όλες οι τεχνικές βελτιώνουν την απόδοση ( recall & precision) Συνήθως α=β=γ=1 + = n r D D m q q γ β α ) ( max relevant non D m q q r + = γ β α + = Dn n Dr r m D D q q γ β α

Αξιολόγηση Αποτελεσματικότητας Τεχνικών Ανάδασης Συνάφειας Remarks By construction, reformulate query will rank explicitly-marke relevant ocuments higher an explicitly-marke irrelevant ocuments lower. Metho shoul not get creit for improvement on these ocuments, since it was tol their relevance. In machine learning, this error is calle testing on the training ata. Evaluation shoul focus on generalizing to other un-rate ocuments. Fair Evaluation of Relevance Feeback Remove from the corpus any ocuments for which feeback was provie. Measure recall/precision performance on the remaining resiual collection. Compare to complete corpus, specific recall/precision numbers may ecrease since relevant ocuments were remove. However, relative performance on the resiual collection provies fair ata on the effectiveness of relevance feeback CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 21 Relevance Feeback Evaluation Simulate interactive retrieval consistently outperforms non-interactive retrieval (70% here). CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 22

Relevance Feeback Evaluation: Case Stuy Example of evaluation of interactive information retrieval [Koenemann & Belkin 1996] Goal of stuy: show that relevance feeback improves retrieval effectiveness Details 64 novice searchers (43 female, 21 male, native English) TREC test be (Wall Street Journal subset) Two search topics Automobile Recalls Tobacco Avertising an the Young Relevance ugements from TREC an experimenter System was INQUERY (vector space with some bells an whistles) Subects ha a tutorial session to learn the system Their goal was to keep moifying the query until they have evelope one that gets high precision Reweighting of terms similar to but ifferent from Rocchio CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 23 Creit: Marti Hearst CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 24

Evaluation:Precision vs. RF conition (from Koenemann & Belkin 96) Criterion: p@30 (precision at 30 ocuments) Compare: p@30 for users with relevance feeback p@30 for users without relevance feeback Goal: show that users with relevance feeback o better Results: Subects with relevance feeback ha 17-34% better performance But: Difference in precision numbers not statistically significant. Search times approximately equal Creit: Marti Hearst CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 25 Σιωπηές Υποθέσεις της Ανάδασης Συνάφειας A1: User has sufficient knowlege for initial query. However: User oes not always have sufficient initial knowlege. Examples: Misspellings, Mismatch of searcher s vocabulary vs collection vocabulary. A2: Relevance prototypes are well-behave. Either: All relevant ocuments are similar to a single prototype. Or: There are ifferent prototypes, but they have significant vocabulary overlap. However: There are several relevance prototypes. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 26

Γιατί η ανάδαση συνάφειας δεν χησιμοποιείται ευέως; Οι χήστες συχνά διστάζουν να δώσουν είσοδο Η ανάδαση έχει ως αποτέλεσμα μεγάλες επεωτήσεις των οποίων ο υπολογισμός απαιτεί πεισσότεο χόνο σε σύγκιση με τις συνηθισμένες επεωτήσεις που διατυπώνουν οι χήστες οι οποίες αποτελούνται από 2-3 λέξεις (search engines process lots of queries an allow little time for each one) Μεικές φοές η νέα απάντηση πειέχει έγγαφα τα οποία δεν μποούμε να καταλάβουμε πως ποέκυψαν CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 27 Ανάδαση Συνάφειας στον Παγκόσμιο Ιστό Some search engines offer a similar/relate pages feature (simplest form of relevance feeback) Google (link-base) But some on t because it s har to explain to average user. Excite initially ha true relevance feeback, but abanone it ue to lack of use. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 28

Ψευδοανάδαση Συνάφειας Ψευδοανάδαση Συνάφειας Pseuo Relevance Feeback Χήση μεθόδων ανάδασης αλλά χωίς είσοδο από το χήστη Υπόθεση ότι τα κουφαία m από τα ανακτημένα έγγαφα είναι συναφή (και χήση αυτών για ανάδαση) Μποούμε επίσης να χησιμοποιήσουμε τα τελευταία έγγαφα για ανητική ανάδαση Επιτέπει την επέκταση της επεώτησης με όους που σχετίζονται με τους όους της επεώτησης answer(q): CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 30

Ψευδοανάδαση Query String Revise Query IR System Rankings ReRanke Documents Query Reformulation Ranke Documents 1. Doc1 2. Doc2 3. Doc3.. 1. Doc2 2. Doc4 3. Doc5.. 1. Doc1 Pseuo 2. Doc2 3. Doc3. Feeback. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 31 Αξιολόγηση Ψευδοανάδασης Βέθηκε να βελτιώνει την απόδοση στο διαγωνισμό του TREC (ahoc retrieval task) Δουλεύει ακόμα καλύτεα αν τα κουφαία έγγαφα πέπει να ικανοποιούν και μια boolean έκφαση ποκειμένου να χησιμοποιηθούν για ανάδαση (π.χ. να πειέχουν όλους του όους της επεώτησης) CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 32

Αναλύοντας πεισσότεο την έννοια της βέλτιστης επεώτησης Πηγή: Yannis Tzitzikas an Yannis Theoharis, Naming Functions for the Vector Space Moel, 29th European Conference on Information Retrieval, Rome 2-5 April 2007 The Naming Problem We can view an IR system as a function from set of Queries to set of Answers S: Queries Answers Classically IR systems are goo at solving the equation S(q)=A wrt A, i.e.: S(q) =? The naming problem is the problem of solving the equation wrt q, i.e. : S(?) = A We can istinguish two formulations of the naming problem: For unorere answers (here A is a subset of Ob ) For orere answers (here A is an orere subset of Ob). where Ob is the set of store obects (e.g. ocuments) CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 34

The Naming Problem for Unorere Sets Basic Notions Exact name Upper name Best Upper Name Lower name Best Lower Name Relaxe name S(upperName(A)) A Ob S(lowerName(A)) CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 35 The Naming Problem for Orere Sets Basic Notions Exact name Upper name Best Upper Name Lower name Best Lower Name Relaxe name A = S(exactName(A)) = S(lowerName(A)) = 1, 2, 3 1, 2, 3, 8, 9, 1, 2, 5, 7, S(upperName(A)) = 1, 9, 2, 5, 3, 8, CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 36

Notations A(k) : the orere set comprising the first k elements of A A{k} : the set of elements that appear in A(k) A F : the restriction of A on the set F, i.e. the orere set obtaine if we exclue from A those elements that o not belong to F, Example if A = 1, 2, 3 then A(2)= 1, 2 A{2}= {1, 2} {A}= {1, 2, 3} if F={1,3}, then A F = 1, 3 CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 37 Defining Formally Relaxe/Upper/Lower/Exact Names A query is a relaxe name of an answer A iff : (A is a set) : S(q){m} A = where m >0. (A is an orere set) : S(q)(m) {A()} = A() where m >0. If m== A then q is an exact name If m>= A then q is an upper name (the best upper name if m is the least possible) If m=< A then q is a lower name (the best lower name if m is the greatest possible) So each query can be characterize by a pair (m,). We can now efine an orering over these pairs. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 38

Defining a total orering of queries (m,) > (m, ) iff: (> ) or (= an m<m ) Let Ob =5 an A =3 exact name upper names Let Ob =5 an A =3. The orering of the corresponing pairs are: lower names (1,1) (3,3) (4,3) (5,3) (2,2) (3,2) (4,2) (5,2) (2,1) (3,1) (4,1) (5,1) CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 39 Investigating possible approaches for solving the Naming Problem for Unorere Sets Let A = {1, 2, 3,, n} Possible name queries qa= ½(i,) where (i,) = arg max {ist(, ), A} qb=1/ A Σ { A} qc=1/ A Σ { a A} - 1/ Ob-A Σ { A} Notes qa: minimizes the maximum ist from the elements of A qb: minimizes the average ist from the elements of A qc: is the Rocchio metho (avg A, - avg Ob-A) Important remarks: None is guarantee to be the exact name (they are all relaxe names) However, if an exact name oes not exist, then qa is the best upper name Moreover, the evaluation of qa (i.e. the computation if its answer) is faster than the evaluation of qb, qc. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 40

The Naming Problem for Unorere Sets (III) Metho (a): Let A = {1, 2, 3, 4} o1 1 2 o2 3 4 1st step: Fin the pair of most istant ocuments Here: (1, 4) 2n step: Fin whether other ocuments lie in the area specifie by (1, 4) Here: {o1, o2}!= 0 Thus, qa = ½(1, 4) is an upper name of A Cost 1 st step: O( A 2 ) computations of similarity 2 n step: O( Ob ) or compute the answer of qa an check if S(qa){ A }=A CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 41 The Naming Problem for Orere Sets Let A = 1, 2, 3,, n Let i the max integer for which it hols: sim(1,2) > sim(1,3) > > sim(1,i) relaxe name, always 1 is a upper name, exact name, lower name, if i=n if i=n an S(1)(n)=A if i<n an S(1)(i)=A(i) Computational cost: O(n) computations of similarity to fin i. One query evaluation in orer to ecie whether 1 is lower/exact name. CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 42

Experimental Evaluation Experiments conucte over the experimental web search engine GRoogle http://groogle.cs.uoc.gr:8080/groogle The set A was selecte ranomly, the average times are liste in the Table below. ta: time to fin the query (fast, epens on A, not the size of the b) tb: time to evaluate the query (this is the only expensive task) tc: time to ecie what kin of name it is (fast) CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 43 Concluing Remarks an Further Research The naming problem has several applications (e.g. for relevance feeback or for proviing a flexible interaction scheme between the system an the users) We expresse formally several variations of this problem an relate optimality criteria (to best of our knowlege this analysis is novel) We provie optimal solutions E.g. we provie a metho that certainly returns the best upper name for unorere sets (this is not true for the Rocchio metho) We escribe (polynomial) algorithms for solving these problems Future research Exten the problem statement with an aitional parameter: the maximum number of wors that a name coul have, e.g. fin the best upper name with no more than 3 wors. Inexing structures for efficient computation of names For more see http://www.ics.forth.gr/~tzitzik/publications/2007_tzitzikastheoharisnaming.pf CS463 - Information Retrieval Systems Yannis Tzitzikas, U. of Crete, Spring 2007 44