Ανάκληση Πληροφορίας. Information Retrieval. Διδάσκων Δημήτριος Κατσαρός



Σχετικά έγγραφα
Ανάκληση Πληποφοπίαρ. Information Retrieval. Διδάζκων Δημήηριος Καηζαρός

Ευρετηρίαση ΜΕΡΟΣ ΙΙ

Finite Field Problems: Solutions

2 Composition. Invertible Mappings

The Simply Typed Lambda Calculus

Section 8.3 Trigonometric Equations

Phys460.nb Solution for the t-dependent Schrodinger s equation How did we find the solution? (not required)

CHAPTER 25 SOLVING EQUATIONS BY ITERATIVE METHODS

Exercises 10. Find a fundamental matrix of the given system of equations. Also find the fundamental matrix Φ(t) satisfying Φ(0) = I. 1.

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 6/5/2006

Homework 3 Solutions

EE512: Error Control Coding

ΚΥΠΡΙΑΚΟΣ ΣΥΝΔΕΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY 21 ος ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ Δεύτερος Γύρος - 30 Μαρτίου 2011

derivation of the Laplacian from rectangular to spherical coordinates

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 19/5/2007

Example Sheet 3 Solutions

Πρόβλημα 1: Αναζήτηση Ελάχιστης/Μέγιστης Τιμής

3.4 SUM AND DIFFERENCE FORMULAS. NOTE: cos(α+β) cos α + cos β cos(α-β) cos α -cos β

Concrete Mathematics Exercises from 30 September 2016

Physical DB Design. B-Trees Index files can become quite large for large main files Indices on index files are possible.

C.S. 430 Assignment 6, Sample Solutions

Ordinal Arithmetic: Addition, Multiplication, Exponentiation and Limit

Partial Differential Equations in Biology The boundary element method. March 26, 2013

Inverse trigonometric functions & General Solution of Trigonometric Equations

Other Test Constructions: Likelihood Ratio & Bayes Tests

Lecture 2. Soundness and completeness of propositional logic

Matrices and Determinants

Econ 2110: Fall 2008 Suggested Solutions to Problem Set 8 questions or comments to Dan Fetter 1

Math221: HW# 1 solutions

Section 7.6 Double and Half Angle Formulas

Approximation of distance between locations on earth given by latitude and longitude

Chapter 6: Systems of Linear Differential. be continuous functions on the interval

Math 6 SL Probability Distributions Practice Test Mark Scheme

4.6 Autoregressive Moving Average Model ARMA(1,1)

Every set of first-order formulas is equivalent to an independent set

HISTOGRAMS AND PERCENTILES What is the 25 th percentile of a histogram? What is the 50 th percentile for the cigarette histogram?

ΕΛΛΗΝΙΚΗ ΔΗΜΟΚΡΑΤΙΑ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΡΗΤΗΣ. Ψηφιακή Οικονομία. Διάλεξη 7η: Consumer Behavior Mαρίνα Μπιτσάκη Τμήμα Επιστήμης Υπολογιστών

5.4 The Poisson Distribution.

Assalamu `alaikum wr. wb.

HOMEWORK 4 = G. In order to plot the stress versus the stretch we define a normalized stretch:

Instruction Execution Times

Mean bond enthalpy Standard enthalpy of formation Bond N H N N N N H O O O

Information Retrieval

Practice Exam 2. Conceptual Questions. 1. State a Basic identity and then verify it. (a) Identity: Solution: One identity is csc(θ) = 1

Block Ciphers Modes. Ramki Thurimella

Potential Dividers. 46 minutes. 46 marks. Page 1 of 11

The challenges of non-stable predicates

Numerical Analysis FMN011

Nowhere-zero flows Let be a digraph, Abelian group. A Γ-circulation in is a mapping : such that, where, and : tail in X, head in

(C) 2010 Pearson Education, Inc. All rights reserved.

From the finite to the transfinite: Λµ-terms and streams

Advanced Subsidiary Unit 1: Understanding and Written Response

LESSON 14 (ΜΑΘΗΜΑ ΔΕΚΑΤΕΣΣΕΡΑ) REF : 202/057/34-ADV. 18 February 2014

Ανάκληση Πληποφοπίαρ. Information Retrieval. Διδάζκων Δημήηριος Καηζαρός

ΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ. ΘΕΜΑ: «ιερεύνηση της σχέσης µεταξύ φωνηµικής επίγνωσης και ορθογραφικής δεξιότητας σε παιδιά προσχολικής ηλικίας»

Main source: "Discrete-time systems and computer control" by Α. ΣΚΟΔΡΑΣ ΨΗΦΙΑΚΟΣ ΕΛΕΓΧΟΣ ΔΙΑΛΕΞΗ 4 ΔΙΑΦΑΝΕΙΑ 1

ANSWERSHEET (TOPIC = DIFFERENTIAL CALCULUS) COLLECTION #2. h 0 h h 0 h h 0 ( ) g k = g 0 + g 1 + g g 2009 =?

Συντακτικές λειτουργίες

Abstract Storage Devices

6.1. Dirac Equation. Hamiltonian. Dirac Eq.

ΕΙΣΑΓΩΓΗ ΣΤΗ ΣΤΑΤΙΣΤΙΚΗ ΑΝΑΛΥΣΗ

Lecture 2: Dirac notation and a review of linear algebra Read Sakurai chapter 1, Baym chatper 3

Συστήματα Διαχείρισης Βάσεων Δεδομένων

SCHOOL OF MATHEMATICAL SCIENCES G11LMA Linear Mathematics Examination Solutions

the total number of electrons passing through the lamp.

TMA4115 Matematikk 3

2. Let H 1 and H 2 be Hilbert spaces and let T : H 1 H 2 be a bounded linear operator. Prove that [T (H 1 )] = N (T ). (6p)

Jesse Maassen and Mark Lundstrom Purdue University November 25, 2013

Exercises to Statistics of Material Fatigue No. 5

Right Rear Door. Let's now finish the door hinge saga with the right rear door

7 Present PERFECT Simple. 8 Present PERFECT Continuous. 9 Past PERFECT Simple. 10 Past PERFECT Continuous. 11 Future PERFECT Simple

ω ω ω ω ω ω+2 ω ω+2 + ω ω ω ω+2 + ω ω+1 ω ω+2 2 ω ω ω ω ω ω ω ω+1 ω ω2 ω ω2 + ω ω ω2 + ω ω ω ω2 + ω ω+1 ω ω2 + ω ω+1 + ω ω ω ω2 + ω

PARTIAL NOTES for 6.1 Trigonometric Identities

Statistical Inference I Locally most powerful tests

Partial Trace and Partial Transpose

Trigonometric Formula Sheet

Démographie spatiale/spatial Demography

DISTRIBUTED CACHE TABLE: EFFICIENT QUERY-DRIVEN PROCESSING OF MULTI-TERM QUERIES IN P2P NETWORKS

Μηχανική Μάθηση Hypothesis Testing

Πανεπιστήμιο Κρήτης, Τμήμα Επιστήμης Υπολογιστών Άνοιξη HΥ463 - Συστήματα Ανάκτησης Πληροφοριών Information Retrieval (IR) Systems

Srednicki Chapter 55

Εύρεση & ιαχείριση Πληροφορίας στον Παγκόσµιο Ιστό

Εύρεση & ιαχείριση Πληροφορίας στον Παγκόσµιο Ιστό. Εισαγωγή στο µάθηµα. Εισαγωγή στην Ανάκτηση Πληροφορίας. Απαιτήσεις του µαθήµατος

ΕΠΙΧΕΙΡΗΣΙΑΚΗ ΑΛΛΗΛΟΓΡΑΦΙΑ ΚΑΙ ΕΠΙΚΟΙΝΩΝΙΑ ΣΤΗΝ ΑΓΓΛΙΚΗ ΓΛΩΣΣΑ

ST5224: Advanced Statistical Theory II

Problem Set 3: Solutions

Solutions to Exercise Sheet 5

Fourier Series. MATH 211, Calculus II. J. Robert Buchanan. Spring Department of Mathematics

ΕΠΛ660. Ανάκτηση Πληροφοριών και. Μάριος. ικαιάκος και Γιώργος Πάλλης

SOAP API. Table of Contents

Congruence Classes of Invertible Matrices of Order 3 over F 2

Review Test 3. MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

EPL 603 TOPICS IN SOFTWARE ENGINEERING. Lab 5: Component Adaptation Environment (COPE)

Reminders: linear functions

[1] P Q. Fig. 3.1

ΦΥΛΛΟ ΕΡΓΑΣΙΑΣ Α. Διαβάστε τις ειδήσεις και εν συνεχεία σημειώστε. Οπτική γωνία είδησης 1:.

ΠΑΝΕΠΙΣΤΗΜΙΟ ΠΑΤΡΩΝ ΠΟΛΥΤΕΧΝΙΚΗ ΣΧΟΛΗ ΤΜΗΜΑ ΜΗΧΑΝΙΚΩΝ Η/Υ & ΠΛΗΡΟΦΟΡΙΚΗΣ. του Γεράσιμου Τουλιάτου ΑΜ: 697

Πανεπιστήμιο Κρήτης, Τμήμα Επιστήμης Υπολογιστών. Δημήτρης Πλεξουσάκης. Physical DB Design

Εργαστήριο Ανάπτυξης Εφαρμογών Βάσεων Δεδομένων. Εξάμηνο 7 ο

Transcript:

Ανάκληση Πληροφορίας Information Retrieval Διδάσκων Δημήτριος Κατσαρός Διάλεξη 5η: 26/02/2014 1

Phrase queries 2

Phrase queries Want to answer queries such as stanford university as a phrase Thus the sentence I went to university at Stanford is not a match. The concept of phrase queries has proven easily understood by users; about 10% of web queries are phrase queries No longer suffices to store only <term : docs> entries 3

A first attempt: Biword indexes Index every consecutive pair of terms in the text as a phrase For example the text Friends, Romans, Countrymen would generate the biwords friends romans romans countrymen Each of these biwords is now a dictionary term Two-word phrase query-processing is now immediate. 4

Longer phrase queries Longer phrases are processed as we did with wildcards: stanford university palo alto can be broken into the Boolean query on biwords: stanford university AND university palo AND palo alto Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase. Can have false positives! 5

Extended biwords Parse the indexed text and perform part-of-speech-tagging (POST). Bucket the terms into (say) Nouns (N) and articles/prepositions (X). Now deem any string of terms of the form NX*N to be an extended biword. Each such extended biword is now made a term in the dictionary. Example: catcher in the rye N X X N Query processing: parse it into N s and X s Segment query into enhanced biwords Look up index 6

Issues for biword indexes False positives, as noted before Index blowup due to bigger dictionary For extended biword index, parsing longer queries into conjunctions: E.g., the query tangerine trees and marmalade skies is parsed into tangerine trees AND trees and marmalade AND marmalade skies Not standard solution (for all biwords) 7

Solution 2: Positional indexes Store, for each term, entries of the form: <number of docs containing term; doc1: position1, position2 ; doc2: position1, position2 ; etc.> 8

Positional index example <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, > Which of docs 1,2,4,5 could contain to be or not to be? Can compress position values/offsets Nevertheless, this expands postings storage substantially 9

Processing a phrase query Extract inverted index entries for each distinct term: to, be, or, not. Merge their doc:position lists to enumerate all positions with to be or not to be. to: be: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191;... 1:17,19; 4:17,191,291,430,434; 5:14,19,101;... Same general method for proximity searches 10

Proximity queries LIMIT! /3 STATUTE /3 FEDERAL /2 TORT Here, /k means within k words of. Clearly, positional indexes can be used for such queries; biword indexes cannot. Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k? 11

Positional index size You can compress position values/offsets: we ll talk bout that in lecture 5 Nevertheless, a positional index expands postings storage substantially Nevertheless, it is now standardly used because of the power and usefulness of phrase and proximity queries whether used explicitly or implicitly in a ranking retrieval system. 12

Positional index size Need an entry for each occurrence, not just once per document Index size depends on average document size Average web page has <1000 terms SEC filings, books, even some epic poems easily 100,000 terms Consider a term with frequency 0.1% Why? Document size Postings Positional postings 1000 100,000 1 1 1 100 13

Rules of thumb A positional index is 2 4 as large as a nonpositional index Positional index size 35 50% of volume of original text Caveat: all of this holds for English-like languages 14

Combination schemes These two approaches can be profitably combined For particular phrases ( Michael Jackson, Britney Spears ) it is inefficient to keep on merging positional postings lists Even more so for phrases like The Who Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme A typical web query mixture was executed in ¼ of the time of using just a positional index It required 26% more space than having a positional index alone 15

Sec. 3.1 Dictionary data structures for inverted indexes The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list in what data structure? 16

Sec. 3.1 A naïve dictionary An array of struct: char[20] int Postings * 20 bytes 4/8 bytes 4/8 bytes How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time? 17

Sec. 3.1 Dictionary data structures Two main choices: Hash table Tree Some IR systems use hashes, some trees 18

Sec. 3.1 Hashes Each vocabulary term is hashed to an integer (We assume you ve seen hashtables before) Pros: Lookup is faster than for a tree: O(1) Cons: No easy way to find minor variants: judgment/judgement No prefix search [tolerant retrieval] If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything 19

Sec. 3.1 Tree: binary tree a-m Root n-z a-hu hy-m n-sh si-z aardvark huygens sickle zygot 20

Sec. 3.1 Tree: B-tree a-hu hy-m n-z Definition: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4]. 21

Sec. 3.1 Trees Simplest: binary tree More usual: B-trees Trees require a standard ordering of characters and hence strings but we standardly have one Pros: Solves the prefix problem (terms starting with hyp) Cons: Slower: O(log M) [and this requires balanced tree] Rebalancing binary trees is expensive But B-trees mitigate the rebalancing problem 22