Finding & Managing Information on the World Wide Web

Finding & Managing Information on the World Wide Web. Instructor: Δημήτριος Κατσαρός (Dimitrios Katsaros), Ph.D., Dept. of Computer, Telecommunications & Networks Engineering, University of Thessaly. Lecture 4: 07/03/2007 1

Retrieval with the Vector space model & Evaluation of retrieval systems 2

Retrieval with the Vector space model 3

In this lecture: Vector space scoring; Efficiency considerations; Nearest neighbors and approximations 4

Documents as vectors At the end of Lecture 6 we said: each doc d can now be viewed as a vector of wf×idf values, one component for each term. So we have a vector space: terms are axes, docs live in this space; even with stemming, we may have 50,000+ dimensions 5

Why turn docs into vectors? First application: Query-by-example Given a doc d, find others like it. Now that d is a vector, find vectors (docs) near it. 6

Intuition [figure: document vectors d1–d5 in a space with term axes t1, t2, t3; angles φ, θ between them] Postulate: Documents that are close together in the vector space talk about the same things. 7

Desiderata for proximity If d1 is near d2, then d2 is near d1. If d1 near d2, and d2 near d3, then d1 is not far from d3. No doc is closer to d than d itself. 8

First cut Idea: Distance between d1 and d2 is the length of the vector d1 − d2 (Euclidean distance). Why is this not a great idea? We still haven't dealt with the issue of length normalization: short documents would be more similar to each other by virtue of length, not topic. However, we can implicitly normalize by looking at angles instead 9

Cosine similarity Distance between vectors d1 and d2 is captured by the cosine of the angle θ between them. Note this is similarity, not distance: no triangle inequality for similarity. [figure: vectors d1, d2 at angle θ, with term axes t1, t2, t3] 10

Cosine similarity A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm $\|x\|_2 = \sqrt{\sum_i x_i^2}$. This maps vectors onto the unit sphere: then $|\vec{d}_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2} = 1$, so longer documents don't get more weight 11

Cosine similarity Cosine of the angle between two vectors; the denominator involves the lengths of the vectors (normalization): $\mathrm{sim}(\vec{d}_j, \vec{d}_k) = \frac{\vec{d}_j \cdot \vec{d}_k}{|\vec{d}_j|\,|\vec{d}_k|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$ 12

Normalized vectors For normalized vectors, the cosine is simply the dot product: $\cos(\vec{d}_j, \vec{d}_k) = \vec{d}_j \cdot \vec{d}_k$ 13

Example Docs: Austen's Sense and Sensibility (SaS), Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH).

tf weights           SaS     PaP     WH
affection            115      58     20
jealous               10       7     11
gossip                 2       0      6

Length-normalized    SaS     PaP     WH
affection          0.996   0.993  0.847
jealous            0.087   0.120  0.466
gossip             0.017   0.000  0.254

cos(SaS, PaP) = 0.996×0.993 + 0.087×0.120 + 0.017×0.000 ≈ 0.999
cos(SaS, WH) = 0.996×0.847 + 0.087×0.466 + 0.017×0.254 ≈ 0.889 14
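
A quick way to check the arithmetic above; this is a minimal sketch (not from the slides), using the tf values from the table and L2 length normalization:

```python
import math

# tf weights from the slide; term order: affection, jealous, gossip
docs = {
    "SaS": [115, 10, 2],
    "PaP": [58, 7, 0],
    "WH":  [20, 11, 6],
}

def normalize(vec):
    """Divide each component by the vector's L2 length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def cosine(u, v):
    """Cosine of two length-normalized vectors: simply their dot product."""
    return sum(a * b for a, b in zip(u, v))

norm = {name: normalize(vec) for name, vec in docs.items()}
print(round(cosine(norm["SaS"], norm["PaP"]), 3))  # ≈ 0.999
print(round(cosine(norm["SaS"], norm["WH"]), 3))   # ≈ 0.889
```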

Cosine similarity exercises Exercise: Rank the following by decreasing cosine similarity. Assume tf-idf weighting: Two docs that have only frequent words (the, a, an, of) in common. Two docs that have no words in common. Two docs that have many rare words in common (wingspan, tailfin). 15

Exercise Euclidean distance between vectors: $|\vec{d}_j - \vec{d}_k| = \sqrt{\sum_{i=1}^{n} (d_{i,j} - d_{i,k})^2}$. Show that, for normalized vectors, Euclidean distance gives the same proximity ordering as the cosine measure 16
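
A one-line hint (not on the slide) for this exercise: for unit-length vectors the squared Euclidean distance expands as

$|\vec{d}_j - \vec{d}_k|^2 = |\vec{d}_j|^2 + |\vec{d}_k|^2 - 2\,\vec{d}_j \cdot \vec{d}_k = 2 - 2\cos(\vec{d}_j, \vec{d}_k),$

so the distance is a monotonically decreasing function of the cosine and both induce the same ordering.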

Queries in the vector space model Central idea: the query as a vector. We regard the query as a short document. We return the documents ranked by the closeness of their vectors to the query, also represented as a vector. Note that $\vec{d}_q$ is very sparse! $\mathrm{sim}(\vec{d}_j, \vec{d}_q) = \frac{\vec{d}_j \cdot \vec{d}_q}{|\vec{d}_j|\,|\vec{d}_q|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$ 17

Summary: What's the point of using vector spaces? A well-formed algebraic space for retrieval. Key: A user's query can be viewed as a (very) short document. Query becomes a vector in the same space as the docs; can measure each doc's proximity to it. Natural measure of scores/ranking; no longer Boolean. Queries are expressed as bags of words. Other similarity measures: see http://www.lans.ece.utexas.edu/~strehl/diss/node52.html for a survey 18

Digression: spamming indices This was all invented before the days when people were in the business of spamming web search engines. Consider: Indexing a sensible passive document collection vs. An active document collection, where people (and indeed, service companies) are shaping documents in order to maximize scores Vector space similarity may not be as useful in this context. 19

Interaction: vectors and phrases Scoring phrases doesn't fit naturally into the vector space world: tangerine trees marmalade skies. Positional indexes don't calculate or store tf.idf information for tangerine trees. Biword indexes treat certain phrases as terms; for these, we can pre-compute tf.idf. Theoretical problem of correlated dimensions. Problem: we cannot expect end-users formulating queries to know what phrases are indexed. We can use a positional index to boost or ensure phrase occurrence 20

Vectors and Boolean queries Vectors and Boolean queries really don't work together very well. In the space of terms, vector proximity selects by spheres: e.g., all docs having cosine similarity ≥ 0.5 to the query. Boolean queries, on the other hand, select by (hyper-)rectangles and their unions/intersections. Round peg, square hole 21

Vectors and wild cards How about the query tan* marm*? Can we view this as a bag of words? Thought: expand each wild-card into the matching set of dictionary terms. Danger: unlike the Boolean case, we now have tfs and idfs to deal with. Net: not a good idea. 22

Vector spaces and other operators Vector space queries are apt for no-syntax, bag-of-words queries. Clean metaphor for similar-document queries. Not a good combination with Boolean, wild-card, positional query operators. But... 23

Query language vs. scoring May allow user a certain query language, say Free text basic queries Phrase, wildcard etc. in Advanced Queries. For scoring (oblivious to user) may use all of the above, e.g. for a free text query Highest-ranked hits have query as a phrase Next, docs that have all query terms near each other Then, docs that have some query terms, or all of them spread out, with tf x idf weights for scoring 24

Exercises How would you augment the inverted index built in lectures 1–3 to support cosine ranking computations? Walk through the steps of serving a query. The math of the vector space model is quite straightforward, but being able to do cosine ranking efficiently at runtime is nontrivial 25

Efficient cosine ranking Find the k docs in the corpus nearest to the query, i.e., the k largest query-doc cosines. Efficient ranking: computing a single cosine efficiently; choosing the k largest cosine values efficiently. Can we do this without computing all n cosines? n = number of documents in collection 26

Efficient cosine ranking What we're doing in effect: solving the k-nearest neighbor problem for a query vector. In general, we do not know how to do this efficiently for high-dimensional spaces. But it is solvable for short queries, and standard indexes are optimized to do this 27

Computing a single cosine For every term i, with each doc j, store the term frequency tf_{i,j}. Some tradeoffs on whether to store the term count, the term weight, or the weight scaled by idf_i. At query time, use an array of accumulators A_j to accumulate the component-wise sum $\mathrm{sim}(\vec{d}_j, \vec{d}_q) = \sum_{i=1}^{m} w_{i,j}\, w_{i,q}$. If you're indexing 5 billion documents (web search), an array of accumulators is infeasible. Ideas? 28
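
A minimal term-at-a-time sketch of the accumulator idea (the postings format and names here are illustrative assumptions, not the slides' data structures): each query term adds w_{i,j}·w_{i,q} to the accumulator of every document in its postings list.

```python
from collections import defaultdict

def score(query_weights, postings):
    """query_weights: {term: w_iq}; postings: {term: [(doc_id, w_ij), ...]}.
    Returns accumulators A_j = sum_i w_ij * w_iq for every doc that matched."""
    acc = defaultdict(float)   # only docs with a non-zero cosine get an accumulator
    for term, w_iq in query_weights.items():
        for doc_id, w_ij in postings.get(term, []):
            acc[doc_id] += w_ij * w_iq
    return acc

# toy usage
postings = {"tangerine": [(1, 0.8), (7, 0.3)], "trees": [(1, 0.5), (5, 0.9)]}
print(score({"tangerine": 1.0, "trees": 1.0}, postings))  # doc 1 accumulates the most
```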

Encoding document frequencies Add tf_{t,d} to postings lists, e.g. as (docID, tf) pairs:
aargh  10: (1,2) (7,3) (83,1) (87,2)
abacus  8: (1,1) (5,1) (13,1) (17,1)
acacia 35: (7,1) (8,2) (40,1) (97,3)
Now almost always stored as a frequency, scaled at runtime. Unary code is quite effective here; the γ code (Lecture 3) is an even better choice. Overall, requires little additional space. Why? 29
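
A small sketch of the γ code mentioned above, following the convention of the IIR book (unary-coded length followed by the binary offset); the exact encoding details are assumed from that source, not from this slide:

```python
def unary(n):
    """Unary code of n: n ones followed by a zero."""
    return "1" * n + "0"

def gamma(n):
    """Elias gamma code of a positive integer: unary(len(offset)) + offset,
    where offset is n written in binary with its leading 1 removed."""
    assert n >= 1
    offset = bin(n)[3:]   # bin(13) == '0b1101' -> offset '101'
    return unary(len(offset)) + offset

for tf in (1, 2, 3, 13):
    print(tf, gamma(tf))   # 1 -> '0', 2 -> '100', 3 -> '101', 13 -> '1110101'
```

Small tf values get very short codes, which is why adding tf_{t,d} to the postings costs little extra space.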

Computing the k largest cosines: selection vs. sorting Typically we want to retrieve the top k docs (in the cosine ranking for the query) not to totally order all docs in the corpus Can we pick off docs with k highest cosines? 30

Use heap for selecting top k Binary tree in which each node's value > the values of its children. Takes 2n operations to construct, then each of the k winners is read off in 2 log n steps. For n = 1M, k = 100, this is about 10% of the cost of sorting. [figure: heap with root 1 and node values .9, .3, .3, .8, .1, .1] 31
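
Python's heapq module offers this selection directly; a sketch with made-up cosine scores (heapq.nlargest builds a small heap instead of sorting all n values):

```python
import heapq

scores = {1: 1.3, 5: 0.9, 7: 0.3, 13: 0.2, 17: 0.8, 83: 0.1, 87: 0.6}  # doc_id -> cosine

top_k = heapq.nlargest(3, scores.items(), key=lambda item: item[1])
print(top_k)  # [(1, 1.3), (5, 0.9), (17, 0.8)]
```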

Bottleneck Still need to first compute cosines from the query to each of n docs: several seconds for n = 1M. Can select from only the non-zero cosines: need only the union-of-postings-lists accumulators (<< 1M); on the query aargh abacus we would only do accumulators 1, 5, 7, 13, 17, 83, 87 (postings below). Better iff this set is < 20% of n.
aargh  10: (1,2) (7,3) (83,1) (87,2)
abacus  8: (1,1) (5,1) (13,1) (17,1)
acacia 35: (7,1) (8,2) (40,1) (97,3) 32

Removing bottlenecks Can further limit to documents with non-zero cosines on rare (high-idf) words. Or enforce conjunctive search (à la Google): non-zero cosines on all words in the query; gets the number of accumulators down to the min of the postings-list sizes. But in general still potentially expensive. Sometimes have to fall back to (expensive) soft-conjunctive search: if no docs match a 4-term query, look for 3-term subsets, etc. 33

Can we avoid all this computation? Yes, but we may occasionally get an answer wrong: a doc not in the top k may creep into the answer. 34

Limiting the accumulators: Best m candidates Preprocess: pre-compute, for each term, its m nearest docs (treat each term as a 1-term query); lots of preprocessing. Result: a preferred list for each term. Search: for a t-term query, take the union of their t preferred lists; call this set S, where |S| ≤ mt. Compute cosines from the query to only the docs in S, and choose the top k. Need to pick m > k to work well empirically. 35

Exercises Fill in the details of the calculation: Which docs go into the preferred list for a term? Devise a small example where this method gives an incorrect ranking. 36

Limiting the accumulators: Frequency/impact ordered postings Idea: we only want to have accumulators for documents for which wf_{t,d} is high enough. We sort postings lists by this quantity. We retrieve terms by idf, and then retrieve only one block of the postings list for each term. We continue to process more blocks of postings until we have enough accumulators; we can continue with the one that ended with the highest wf_{t,d}. The number of accumulators is bounded. Anh et al. 2001 37

Cluster pruning: preprocessing Pick √n docs at random: call these leaders. For each other doc, pre-compute its nearest leader. Docs attached to a leader: its followers. Likely: each leader has ~√n followers. 38

Cluster pruning: query processing Process a query as follows: Given query Q, find its nearest leader L. Seek the k nearest docs from among L's followers. 39
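
A minimal sketch of the two phases above (function and parameter names are illustrative; documents are assumed to be length-normalized vectors so that the dot product is the cosine):

```python
import math
import random

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))  # vectors assumed length-normalized

def preprocess(docs):
    """Pick ~sqrt(n) random leaders and attach every doc to its nearest leader."""
    leaders = random.sample(list(docs), max(1, int(math.sqrt(len(docs)))))
    followers = {lead: [] for lead in leaders}
    for doc_id, vec in docs.items():
        nearest = max(leaders, key=lambda lead: cosine(docs[lead], vec))
        followers[nearest].append(doc_id)
    return leaders, followers

def query(q, docs, leaders, followers, k):
    """Find the nearest leader, then the k best docs among its followers."""
    lead = max(leaders, key=lambda l: cosine(docs[l], q))
    return sorted(followers[lead], key=lambda d: cosine(docs[d], q), reverse=True)[:k]
```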

Visualization [figure: query point, leaders, and their followers in the vector space] 40

Why use random sampling Fast Leaders reflect data distribution 41

General variants Have each follower attached to a=3 (say) nearest leaders. From query, find b=4 (say) nearest leaders and their followers. Can recur on leader/follower construction. 42

Exercises To find the nearest leader in step 1, how many cosine computations do we do? Why did we have √n in the first place? What is the effect of the constants a, b on the previous slide? Devise an example where this is likely to fail, i.e., we miss one of the k nearest docs. Likely under random sampling. 43

Dimensionality reduction What if we could take our vectors and pack them into fewer dimensions (say 50,000 → 100) while preserving distances? (Well, almost.) Speeds up cosine computations. Two methods: random projection; latent semantic indexing. 44

Random projection onto k << m axes Choose a random direction x_1 in the vector space. For i = 2 to k, choose a random direction x_i that is orthogonal to x_1, x_2, ..., x_{i−1}. Project each document vector into the subspace spanned by {x_1, x_2, ..., x_k}. 45

E.g., from 3 to 2 dimensions [figure: docs d1, d2 in (t1, t2, t3) space projected onto new axes x1, x2] x1 is a random direction in (t1, t2, t3) space. x2 is chosen randomly but orthogonal to x1: the dot product of x1 and x2 is zero. 46

Guarantee With high probability, relative distances are (approximately) preserved by projection. Pointer to precise theorem in Resources. 47

Computing the random projection Projecting n vectors from m dimensions down to k dimensions: Start with the m × n matrix of terms × docs, A. Find a random k × m orthogonal projection matrix R. Compute the matrix product W = R · A. The j-th column of W is the vector corresponding to doc j, but now in k << m dimensions. 48
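
A small numpy sketch of W = R · A. One simplification is assumed here: instead of explicitly orthogonalizing the k random directions, it uses a Gaussian random matrix scaled by 1/√k, a common Johnson-Lindenstrauss-style variant that preserves lengths in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5_000, 200, 100                       # terms, docs, reduced dimensions

A = rng.random((m, n))                           # m x n term-document matrix (toy data)
R = rng.standard_normal((k, m)) / np.sqrt(k)     # k x m random projection matrix
W = R @ A                                        # k x n: column j is doc j in k dims

# distances are approximately preserved
print(np.linalg.norm(A[:, 0] - A[:, 1]), np.linalg.norm(W[:, 0] - W[:, 1]))
```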

Cost of computation This takes a total of kmn multiplications. Expensive see Resources for ways to do essentially the same thing, quicker. Question: by projecting from 50,000 dimensions down to 100, are we really going to make each cosine computation faster? Why? 49

Latent semantic indexing (LSI) Another technique for dimension reduction Random projection was data-independent LSI on the other hand is data-dependent Eliminate redundant axes Pull together related axes hopefully car and automobile More on LSI when studying clustering, later in this course. 50

Resources IIR 7; MG Ch. 4.4–4.6; MIR 2.5, 2.7.2; FSNLP 15.4. Anh, V.N., de Kretser, O., and A. Moffat. 2001. Vector-Space Ranking with Effective Early Termination. Proc. 24th Annual International ACM SIGIR Conference, 35-42. Anh, V.N. and A. Moffat. 2006. Pruned query evaluation using pre-computed impacts. SIGIR 2006, 372-379. Random projection theorem: Dasgupta and Gupta. An elementary proof of the Johnson-Lindenstrauss Lemma (1999). Faster random projection: A.M. Frieze, R. Kannan, S. Vempala. Fast Monte-Carlo Algorithms for finding low-rank approximations. IEEE Symposium on Foundations of Computer Science, 1998. 51

Evaluation 52

Next: Results summaries: making our good results usable to a user. How do we know if our results are any good? Evaluating a search engine: benchmarks, precision and recall 53

Results summaries 54

Summaries Having ranked the documents matching a query, we wish to present a results list Most commonly, the document title plus a short summary The title is typically automatically extracted from document metadata What about the summaries? 55

Summaries Two basic kinds: Static Dynamic A static summary of a document is always the same, regardless of the query that hit the doc Dynamic summaries are query-dependent attempt to explain why the document was retrieved for the query at hand 56

Static summaries In typical systems, the static summary is a subset of the document. Simplest heuristic: the first 50 (or so; this can be varied) words of the document, with the summary cached at indexing time. More sophisticated: extract from each document a set of key sentences, using simple NLP heuristics to score each sentence; the summary is made up of top-scoring sentences. Most sophisticated: NLP used to synthesize a summary; seldom used in IR, cf. text summarization work 57

Dynamic summaries Present one or more windows within the document that contain several of the query terms. KWIC snippets: Keyword-in-Context presentation, generated in conjunction with scoring. If the query is found as a phrase, show the/some occurrences of the phrase in the doc; if not, windows within the doc that contain multiple query terms. The summary itself gives the entire content of the window: all terms, not only the query terms. How? 58

Generating dynamic summaries If we have only a positional index, we cannot (easily) reconstruct the context surrounding hits. If we cache the documents at index time, we can run the window through it, cueing to hits found in the positional index. E.g., the positional index says the query is a phrase at position 4378, so we go to this position in the cached document and stream out the content. Most often, cache a fixed-size prefix of the doc. Note: the cached copy can be outdated 59
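
A toy sketch of the windowing idea (it assumes the raw document text is cached; the window size and the score-by-term-count rule are illustrative choices, not the slides' method):

```python
def kwic_snippet(doc_text, query_terms, window=10):
    """Return the window of `window` words containing the most query terms."""
    words = doc_text.split()
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        hits = sum(1 for w in words[start:start + window]
                   if w.lower().strip(".,") in terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return " ".join(words[best_start:best_start + window])

doc = "Tangerine trees and marmalade skies appear in the song, while other trees do not."
print(kwic_snippet(doc, ["tangerine", "trees"]))
```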

Dynamic summaries Producing good dynamic summaries is a tricky optimization problem The real estate for the summary is normally small and fixed Want short item, so show as many KWIC matches as possible, and perhaps other things like title Want snippets to be long enough to be useful Want linguistically well-formed snippets: users prefer snippets that contain complete phrases Want snippets maximally informative about doc But users really like snippets, even if they complicate IR system design 60

Evaluating search engines 61

Measures for a search engine How fast does it index Number of documents/hour (Average document size) How fast does it search Latency as a function of index size Expressiveness of query language Ability to express complex information needs Speed on complex queries 62

Measures for a search engine All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise. The key measure: user happiness. What is this? Speed of response/size of index are factors, but blindingly fast, useless answers won't make a user happy. Need a way of quantifying user happiness 63

Measuring user happiness Issue: who is the user we are trying to make happy? Depends on the setting. Web engine: the user finds what they want and returns to the engine; can measure the rate of return users. Ecommerce site: the user finds what they want and makes a purchase. Is it the end-user, or the ecommerce site, whose happiness we measure? Measure time to purchase, or the fraction of searchers who become buyers? 64

Measuring user happiness Enterprise (company/govt/academic): Care about user productivity How much time do my users save when looking for information? Many other criteria having to do with breadth of access, secure access, etc. 65

Happiness: elusive to measure Commonest proxy: relevance of search results. But how do you measure relevance? We will detail a methodology here, then examine its issues. Relevance measurement requires 3 elements: 1. A benchmark document collection 2. A benchmark suite of queries 3. A binary assessment of either Relevant or Irrelevant for each query-doc pair. Some work on more-than-binary, but not the standard 66

Evaluating an IR system Note: the information need is translated into a query. Relevance is assessed relative to the information need, not the query. E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. Query: wine red white heart attack effective. You evaluate whether the doc addresses the information need, not whether it has those words 67

Standard relevance benchmarks TREC - the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years. Reuters and other benchmark doc collections used. Retrieval tasks specified, sometimes as queries. Human experts mark, for each query and for each doc, Relevant or Irrelevant, or at least for the subset of docs that some system returned for that query 68

Unranked retrieval evaluation: Precision and Recall Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved). Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant).

                 Retrieved   Not Retrieved
Relevant            tp            fn
Not Relevant        fp            tn

Precision P = tp/(tp + fp); Recall R = tp/(tp + fn) 69
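
A minimal set-based sketch of the two measures, matching the unranked definitions above:

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of doc ids."""
    tp = len(retrieved & relevant)             # relevant docs that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7}))  # (0.5, 0.4)
```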

Accuracy Given a query an engine classifies each doc as Relevant or Irrelevant. Accuracy of an engine: the fraction of these classifications that is correct. Why is this not a very useful evaluation measure in IR? 70

Why not just use accuracy? How to build a 99.9999% accurate search engine on a low budget: [figure: a search box that returns 0 matching results found for every query]. People doing information retrieval want to find something and have a certain tolerance for junk. 71

Precision/Recall You can get high recall (but low precision) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved In a good system, precision decreases as either number of docs retrieved or recall increases A fact with strong empirical confirmation 72

Difficulties in using precision/recall Should average over large corpus/query ensembles. Need human relevance assessments; people aren't reliable assessors. Assessments have to be binary; nuanced assessments? Heavily skewed by corpus/authorship; results may not translate from one domain to another 73

A combined measure: F A combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean): $F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$ with $\beta^2 = \frac{1-\alpha}{\alpha}$. People usually use the balanced F_1 measure, i.e., with β = 1 or α = ½. The harmonic mean is a conservative average. See C.J. van Rijsbergen, Information Retrieval 74
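
A small helper for the F measure (a sketch; beta=1 gives the balanced F1):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(round(f_measure(0.5, 0.4), 3))  # 0.444, the balanced F1 for the P/R example above
```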

F1 and other averages [figure: combined measures (minimum, maximum, arithmetic, geometric, harmonic mean) as a function of precision, 0–100, with recall fixed at 70%] 75

Evaluating ranked results Evaluation of ranked results: The system can return any number of results By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve 76

A precision-recall curve [figure: precision (0.0–1.0) on the y-axis against recall (0.0–1.0) on the x-axis] 77

Averaging over queries A precision-recall graph for one query isn't a very sensible thing to look at; you need to average performance over a whole bunch of queries. But there's a technical issue: precision-recall calculations place some points on the graph. How do you determine a value (interpolate) between the points? 78

Interpolated precision Idea: if locally precision increases with increasing recall, then you should get to count that. So you take the max of precisions to the right of each recall value 79

Evaluation Graphs are good, but people want summary measures! Precision at a fixed retrieval level: perhaps most appropriate for web search, since all people want are good matches on the first one or two results pages; but it has an arbitrary parameter k. 11-point interpolated average precision: the standard measure in the TREC competitions; you take the precision at 11 levels of recall varying from 0 to 1 by tenths, using interpolation (the value for recall 0 is always interpolated!), and average them. Evaluates performance at all recall levels 80
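
A sketch of the 11-point computation for a single ranked result list; the input is a list of 0/1 relevance judgments in rank order, and the interpolation rule is the "max precision to the right" idea from the previous slide:

```python
def eleven_point_ap(relevances, total_relevant):
    """Average of interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    points, hits = [], 0
    for rank, rel in enumerate(relevances, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))   # (recall, precision)
    interp = []
    for level in (i / 10 for i in range(11)):
        candidates = [p for r, p in points if r >= level]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / 11

# relevant docs returned at ranks 1, 3, 5; the collection holds 5 relevant docs in total
print(round(eleven_point_ap([1, 0, 1, 0, 1], total_relevant=5), 3))  # ≈ 0.503
```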

Typical (good) 11-point precisions [figure: SabIR/Cornell 8A1 11-point precision from TREC 8 (1999), precision vs. recall] 81

Yet more evaluation measures Mean average precision (MAP): average of the precision values obtained for the top k documents, each time a relevant doc is retrieved. Avoids interpolation and the use of fixed recall levels. MAP for a query collection is the arithmetic average; macro-averaging: each query counts equally. R-precision: if we have a known (though perhaps incomplete) set of relevant documents of size Rel, then calculate the precision of the top Rel docs returned; a perfect system could score 1.0. 82
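
A sketch of average precision for one query and MAP over several queries (relevances are 0/1 judgments in rank order; relevant docs that are never retrieved simply contribute nothing to the numerator):

```python
def average_precision(relevances, total_relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (relevances, total_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

print(round(mean_average_precision([([1, 0, 1, 0, 1], 3), ([0, 1, 0, 0, 0], 2)]), 3))
```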

Variance For a test collection, it is usual that a system does crummily on some information needs (e.g., MAP = 0.1) and excellently on others (e.g., MAP = 0.7) Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query. That is, there are easy information needs and hard ones! 83

Creating Test Collections for IR Evaluation 84

Test Corpora 85

From corpora to test collections Still need Test queries Relevance assessments Test queries Must be germane to docs available Best designed by domain experts Random query terms generally not a good idea Relevance assessments Human judges, time-consuming Are human panels perfect? 86

Unit of Evaluation We can compute precision, recall, F, and ROC curve for different units. Possible units Documents (most common) Facts (used in some TREC evaluations) Entities (e.g., car companies) May produce different results. Why? 87

Kappa measure for inter-judge (dis)agreement Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement. Kappa = [P(A) − P(E)] / [1 − P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what agreement would be by chance. Kappa = 0 for chance agreement, 1 for total agreement. 88

Kappa Measure: Example P(A)? P(E)?

Number of docs   Judge 1       Judge 2
300              Relevant      Relevant
70               Nonrelevant   Nonrelevant
20               Relevant      Nonrelevant
10               Nonrelevant   Relevant 89

Kappa Example P(A) = 370/400 = 0.925. P(nonrelevant) = (10+20+70+70)/800 = 0.2125. P(relevant) = (10+20+300+300)/800 = 0.7875. P(E) = 0.2125² + 0.7875² = 0.665. Kappa = (0.925 − 0.665)/(1 − 0.665) = 0.776. Kappa > 0.8 = good agreement; 0.67 < Kappa < 0.8 → tentative conclusions (Carletta 96); depends on the purpose of the study. For > 2 judges: average pairwise kappas 90
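
A quick sketch reproducing the example; as in the slide's arithmetic, the chance-agreement marginals are pooled over both judges:

```python
def kappa(n_rr, n_nn, n_rn, n_nr):
    """Two-judge kappa from counts: (rel,rel), (non,non), (rel,non), (non,rel)."""
    total = n_rr + n_nn + n_rn + n_nr
    p_agree = (n_rr + n_nn) / total
    p_rel = (2 * n_rr + n_rn + n_nr) / (2 * total)   # pooled P(relevant)
    p_non = (2 * n_nn + n_rn + n_nr) / (2 * total)   # pooled P(nonrelevant)
    p_chance = p_rel ** 2 + p_non ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(300, 70, 20, 10), 3))  # ≈ 0.776
```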

TREC The TREC Ad Hoc task from the first 8 TRECs is the standard IR task: 50 detailed information needs a year; human evaluation of pooled results returned. More recently, other related tracks: Web track, HARD. A TREC query (TREC 5):
<top>
<num> Number: 225
<desc> Description: What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?
</top> 91

Interjudge Agreement: TREC 3 92

Impact of Inter-judge Agreement Impact on absolute performance measure can be significant (0.32 vs 0.39) Little impact on ranking of different systems or relative performance 93

Critique of pure relevance Relevance vs Marginal Relevance A document can be redundant even if it is highly relevant Duplicates The same information from different sources Marginal relevance is a better measure of utility for the user. Using facts/entities as evaluation units more directly measures true relevance. But harder to create evaluation set See Carbonell reference 94

Can we avoid human judgment? Not really; this makes experimental work hard, especially on a large scale. In some very specific settings we can use proxies; example below: approximate vector space retrieval. But once we have test collections, we can reuse them (so long as we don't overtrain too badly) 95

Approximate vector retrieval Given n document vectors and a query, find the k doc vectors closest to the query. Exact retrieval: we know of no better way than to compute cosines from the query to every doc. Approximate retrieval schemes: such as cluster pruning in lecture 6. Given such an approximate retrieval scheme, how do we measure its goodness? 96

Approximate vector retrieval Let G(q) be the ground truth of the actual k closest docs on query q. Let A(q) be the k docs returned by approximate algorithm A on query q. For performance we would measure |A(q) ∩ G(q)|. Is this the right measure? 97

Alternative proposal Focus instead on how A(q) compares to G(q). Goodness can be measured here in cosine proximity to q: we sum up q·d over d ∈ A(q) and compare this to the sum of q·d over d ∈ G(q). This yields a measure of the relative goodness of A vis-à-vis G. Thus A may be 90% as good as the ground-truth G, without finding 90% of the docs in G. For scored retrieval, this may be acceptable: most web engines don't always return the same answers for a given query. 98
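
A sketch of this relative-goodness measure (illustrative names; vectors are assumed length-normalized so the dot product is the cosine):

```python
def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))   # vectors assumed length-normalized

def relative_goodness(q, approx_ids, truth_ids, vectors):
    """Sum of query-doc cosines over A(q), divided by the same sum over G(q)."""
    total = lambda ids: sum(cosine(q, vectors[d]) for d in ids)
    return total(approx_ids) / total(truth_ids)
```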

Resources for this lecture IIR 8 MIR Chapter 3 MG 4.5 Carbonell and Goldstein 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 21. 99