Part B. CS-463 Information Retrieval Systems. Yannis Tzitzikas. University of Crete. CS-463,Spring 05 PART (A) PART (C):

CS-463 Information Systems Μοντέλα Ανάκτησης ( Models) Part B Yannis Tzitzikas University of Crete CS-463,Spring 05 Lecture : 4 Date : 3-3- ιάρθρωση ιάλεξης PART (A) Ανάκτηση και Φιλτράρισµα Εισαγωγή στα Μοντέλα Άντλησης Κατηγορίες Μοντέλων Exact vs Best Match Τα κλασσικά µοντέλα ανάκτησης Το Boolean Μοντέλο Στατιστικά Μοντέλα - Βάρυνση Όρων Το ιανυσµατικό Μοντέλο Το Πιθανοκρατικό Μοντέλο PART (B):Εναλλακτικά µοντέλα (Ι)Συνολοθεωρητικάµοντέλα Fuzzy Model Extended Boolean Model (ΙΙ) ΑλγεβρικάΜοντέλα Latent Semantic Indexing Neural Netwok Model PART (C): (ΙΙΙ)ΠιθανοκρατικάΜοντέλα Bayesian Network Model Inference Network Model 56

Κίνητρο Συνολοθεωρητικά Μοντέλα Επέκταση του Boolean model µε µερικό ταίριασµα Μοντέλα: Fuzzy Set Model Extended Boolean Model 57 Information Models Fuzzy Set Model 2

Fuzzy Set Model Βασική Ιδέα: Έγγραφα και επερωτήσεις παριστάνονται σε σύνολα όρων ευρετηρίου Κάθε όρος σχετίζεται µε ένα fuzzy set Κάθε έγγραφο έχει ένα degree of membership σε αυτό το fuzzy set Υπάρχουν αρκετά µοντέλα που θεµελιώνονται έτσι, εδώ θα δούµε τοµοντέλοπουπροτάθηκεαπό Ogawa, Morita, and Kobayashi (99) 59 Fuzzy Set Model: Ηγενικήιδέα Η γενική ιδέα µε ένα παράδειγµα: Έστω επερώτηση q=αυτοκίνητο Έστω έγγραφο d που δεν περιέχει τη λέξη αυτοκίνητο αλλά περιέχει τη λέξη «όχηµα». Ανυπάρχουνπολλάέγγραφαπουπεριέχουνκαιτιςδυολέξεις, τότε, υπάρχει ισχυρή συσχέτιση των δυο αυτών λέξεων, και => άρα το d µπορεί να θεωρηθεί συναφές µε την επερώτηση q. Θεµελίωση της ιδέας µε Fuzzy Theory 60 3

Background: Fuzzy Set Theory (Zadeh 965) Framework for representing classes whose boundaries are not well defined Key idea is to introduce the notion of a degree of membership associated with the elements of a set This degree of membership varies from 0 to and allows modeling the notion of marginal membership Thus, membership is now a gradual notion, contrary to the crispy notion enforced by classic Boolean logic U: universe of discourse A fuzzy subset A of U is characterized by a membership function µ A (u) : U [0,] which associates with each element u of U a number µ A (u) in [0,] Let A and B be two fuzzy subsets of U, and A be the complement of A. Then, µ A (u) = - µ A (u) µ A B (u) = max(µ A (u), µ B (u)) µ A B (u) = min(µ A (u), µ B (u)) 6 Πίνακας Συσχέτισης (correlation matrix) και εγγύτητα όρων k k 2. k t k c c 2 c t k 2 c 2 c 22 c t2 : : : : : : : : k t c n c 2n c tn c(i,l) = n(i,l) ni + nl - n(i,l) where: n(i,l): number of docs which contain both ki and kl ni: number of docs which contain ki nl: number of docs which contain kl Πχ n(i,l)=0 => c(i,l)=0 n(i,l)=3, ni=3, nl=9 => c(i,l)=0,3 n(i,l)=3, ni=3, nl=30 => c(i,l)=0, n(i,l)=3, ni=3, nl=3 => c(i,l)= Έτσι έχουµε ορίσει την εγγύτητα (proximity) µεταξύ των όρων 62 4

Fuzzy Information Οισυντελεστέςσυσχέτισηςµαςεπιτρέπουνναορίσουµετοβαθµό membership ενός εγγράφου dj. Σεκάθεόρο i αντιστοιχούµεένα fuzzy set µεχαρ/κήσυνάρτησηµ i Αντο doc dj περιέχειτονόρο k w πουσχετίζεταιισχυράµετον k i τότε c(i,w) ~ µ i (j) ~, µεάλλαλόγιακαιοόρος ki είναικαλόςγιατο dj Τυπικά: µ i (j) = Σ c(i,w) k w dj = - Π ( - c(i,w)) k w dj ( U A ) c c i = IAi U Ai =Ω c c ( U Ai) =Ω IA i 63 Fuzzy Information Έστω q σε DNF q = c v v ck Σύµφωνα µε τη fuzzy set theory: µ q (j)= max(µ c (j),, µ ck (j)) Παρά ταύτα, εδώ προτείνεται η χρήση αθροίσµατος αντί του του max µ q (j)=. 64 5

Παράδειγµα q = ka (kb kc) vec(q dnf ) = (,,) + (,,0) + (,0,0) = vec(cc) + vec(cc2) + vec(cc3) µ q (dj) = µ cc+cc2+cc3 (dj) = - Π ( -µ cci (dj)) i=..3 Da = - ( -µ a (dj) µ b (dj) µ c (dj)) ( -µ a (dj)µ b (dj) (-µ c (dj))) ( -µ a (dj) (-µ b (dj)) (-µ c (dj))) cc3 cc2 cc Dc Db 65 Fuzzy Model: Σύνοψη K={k,,k t } : σύνολοόλωντωνλέξεωνευρετηρίασης Κάθεέγγραφο d j παριστάνεταιµετοδιάνυσµα d j =(w,j,,w t,j ) όπου: w i,j = ανηλέξη k i εµφανίζεταιστοκείµενο d j (αλλιώς w i,j =0) Μια επερώτηση q είναι µια λογική έκφραση στο Κ, πχ: q = k and ( k2 or not k3)) δηλαδή q = k ( k2 k3)) q DNF = (k k2 k3) (k k2 k3) (k k2 k3) q DNF = (,,) (,,0) (,0,0) R(dj,q) = µ q (dj) =Σµ cc (dj) γιακάθεσυζευκτικήσυνιστώσα cc του q DNF µ ki (dj) = - Π ( - c(ki,kw)) k w dj c(ki,kj) καθορίζεται από την συνεµφάνιση των όρων ki και kj στη συλλογή 66 6

Fuzzy Information Models: Conclusion Έχουν συζητηθεί κυρίως στο χώρο της fuzzy theory εν έχουµε επαρκή αποτελέσµατα πειραµατικής αξιολόγησης για να τα αντιπαραβάλλουµε µε τα προηγούµενα µοντέλα 67 Information Models Extended Boolean Model 7

Κίνητρο Extended Boolean Model Το Boolean model είναι απλό και κοµψό αλλά δεν παρέχει κατάταξη Προσσέγιση Όπως και στο fuzzy model, µπορούµε να πάρουµε κατάταξη χαλαρώνοντας τη χαρακτηριστική συνάρτηση των συνόλων Επέκταση του Boolean model µε βάρυνση όρων και µερικό ταίριασµα Συνδιασµόςχαρακτηριστικώντου Vector model καιιδιοτήτωντης Boolean algebra [Salton, Fox, and Wu, 983] 69 Κίνητρο Έστω q = k x ky. Σύµφωνα µε το Boolean model ένα έγγραφο που περιέχει µόνο ένα απότα k x, k y είναιµη-συναφές, καιµάλιστατόσοµη-συναφές, όσο ένα έγγραφο που δεν περιέχει κανένα από τους 2 όρους. 70 8

Έστωδύοόροι k x, k y Μπορούµε να απεικονίσουµε επερωτήσεις και έγγραφα στο 2D χώρο. Έναέγγραφο d j τοποθετείταιβάσειτων, βαρών w x,j και w y,j. Έστω ότι τα βάρη αυτά είναι κανονικοποιηµένα στο [0,], π.χ. : w x,j = tf x,j idf x w y,j = tf y,j idf y For brevity let x = w x,j and y = w y,j Άρα οι συντεταγµένες του dj είναι οι (x,y) 7 Η γενική ιδεά (0,) (,) d j+ (0,) (,) k y k y d j+ d j (0,0) (,0) k x d j (0,0) (,0) k x Έστω q OR =k x v k y Το σηµείο (0,0) είναι η θέση προς αποφυγή. Άρα µπορούµε να θεωρήσουµε την απόσταση του dj από αυτό το σηµείο ως το βαθµό οµοιότητας Έστω q AND =k x Λ k y Το σηµείο (,) είναι η πιο επιθυµητή θέση. Άρα µπορούµε να θεωρήσουµε το συµπλήρωµα της απόστασης του dj από αυτό το σηµείο ως βαθµό οµοιότητας 72 9

Ηγενικήιδεά (ΙΙ) (0,) (,) d j+ (0,) (,) k y k y d j+ d j (0,0) (,0) k x Let q OR =k x v k y 2 x + y sim( q OR, d) = 2 2 d j (0,0) (,0) k x Let q AND =k x Λ k y 2 ( x) + ( y) sim( q AND, d) = 2 2 ( 2 for normalisation to [0,]) 73 Γενικεύοντας την ιδέα (για >2 όρους) Μπορούµε να γενικεύσουµε το προηγούµενο µοντέλο χρησιµοποιώντας την Ευκλείδεια απόσταση στον t-διάστατο χώρο Αυτό µπορεί να γίνει χρησιµοποιώντας p-norms που γενικεύουν την έννοια της απόστασης, όπου p. ιαζευκτικές επερωτήσεις q OR = k V k2 V.. V km Συζευκτικές επερωτήσεις q AND = k Λ k2 Λ... Λ km sim sim p p p ( x x x p + 2 +... + m qor, d) m = p p ( ( x x p ) +... + ( m) qand, d) m = 74 0

Μερικές ενδιαφέρουσες ιδιότητες Μεταβάλλοντας το p, µπορούµε να κάνουµε το µοντέλο να συµπεριφέρεται όπως το Vector, το Fuzzy, ή ενδιάµεσα σε αυτά τα δυο. Αν p = τότε (Vector like) sim(q OR,dj) = sim(q AND,dj) = x +... + xm m Αν p = τότε (Fuzzy like) sim(q OR,dj) = max (x i ) sim(q AND,dj) = min (x i ) Ερώτηση: Που πήγαν οι όροι της επερώτησης; 75 Πιο γενικές επερωτήσεις Έστω q = (k Λ k2) V k3 Εφαρµόζουµε τους ορισµούς σεβόµενοι τη σειρά, εδώ: sim( q ( x ) ( (, d) = p + ( x 2 2 p 2 ) / p p ) ) + x p 3 p Έστω q = (k V 2 k2) Λ k3 K and k2 should be used as in a vector system but the presence of k3 is required 76

Μερικές Παρατηρήσεις Είναι αρκετά ισχυρό µοντέλο µε ενδιαφέρουσες ιδιότητες Η επιµεριστική ιδιότητα δεν ισχύει: q = (k k2) k3 q2 = (k k3) (k2 k3) sim(q,dj) sim(q2,dj) 77 Ισοµετρικές καµπύλες p p p ( x + y ) L L 2 L 2 2 2 x+ y= ( x + y ) = max( x, y) = 78 2

ιάρθρωση ιάλεξης PART (A) Ανάκτηση και Φιλτράρισµα Εισαγωγή στα Μοντέλα Άντλησης Κατηγορίες Μοντέλων Exact vs Best Match Τα κλασσικά µοντέλα ανάκτησης Το Boolean Μοντέλο Στατιστικά Μοντέλα - Βάρυνση Όρων Το ιανυσµατικό Μοντέλο Το Πιθανοκρατικό Μοντέλο PART (B):Εναλλακτικά µοντέλα (Ι)Συνολοθεωρητικάµοντέλα Fuzzy Model Extended Boolean Model (ΙΙ) Αλγεβρικά Μοντέλα PART (C): Latent Semantic Indexing Neural Netwok Model (ΙΙΙ)ΠιθανοκρατικάΜοντέλα Bayesian Network Model Inference Network Model 79 Information Models Latent Semantic Indexing 3

Κίνητρο Classic IR might lead to poor retrieval due to: relevant documents that do not contain at least one index term are not retrieved A document that shares concepts with another document known to be relevant might be of interest The user information need is more related to concepts and ideas than to index terms We want to capture the concepts instead of the words. Concepts are reflected in the words. However: One term may have multiple meanings (polysemy) Different terms may have the same meaning (synonymy) 8 LSI: Η γενική προσέγγιση LSI approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. The goal is to find effective models to represent the relationship between terms and documents. Hence a set of terms, which is by itself incomplete and unreliable, will be replaced by some set of entities which are more reliable indicants. 82 4

Ηιδέα The key idea is to map documents and queries into a lower dimensional space (i.e., composed of higher level concepts which are in fewer number than the index terms) in this reduced concept space might be superior to retrieval in the space of index terms But how to learn the concepts from data? 83 Παράδειγµα προβολής 2 διαστάσεων σε µία 2.0.5.0 0.5 B........... w.. A 0.5.0.5 2.0 discriminating projection 2.0.5.0 B........ w. A. 0.5 0.5.0.5 2.0 84 5

SVD (Singular Value Decomposition) SVD is applied to derive the latent semantic structure model. What is SVD? A dimensionality reduction technique For more about matrices see: The Matrix Cookbook http://www.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf http://kwon3d.com/theory/jkinem/svd.html http://mathworld.wolfram.com/singularvaluedecomposition.html http://www.cs.ut.ee/~toomas_l/linalg/lin2/node3.html#section0003200000000000000 85 Definitions t: total number of index terms d: total number of documents (Xij): be a term-document matrix with t rows and d columns To each element of this matrix is assigned a weight wij associated with the pair [ki,dj] The weight wij can be based on a tf-idf weighting scheme X d d 2. d d k w w 2 w d k 2 w 2 w 22 w d2 : : : : : : : : k t w t w 2t w dt w i,j [0,] 86 6

Latent Semantic Indexing: Οτρόπος t: total number of index terms d: total number of documents terms documents X = t x d T0 t x m Singular Value Decomposition S m x m D 0 0 m x d m=min(t,d) documents Select first k (<m) singular values terms X^ = T k x k S D k x d t x d t x k 87 t: total number of index terms d: total number of documents terms documents X = t x d T0 t x m Singular Value Decomposition S m x m D 0 0 m x d m=min(t,d) documents Select first k (<m) singular values terms X^ = T k x k S D k x d t x d t x k 88 7

SVD SVD of the term-by-document matrix X: If the singular values of S 0 are ordered by size, we only keep the first k largest values and get a reduced model: Xˆ X = T 0S0D0 ' X = TSD' doesn t exactly match X and it gets closer as more and more singular values are kept This is what we want. We don t want perfect fit since we think some of 0 s in X should be and vice versa. It reflects the major associative patterns in the data, and ignores the smaller, less important influence and noise. 89 LSI Paper example Index terms in italics 90 8

term-document Matrix 9 T 0 92 9

S 0 93 D 0 94 20

SVD with minor terms dropped TS define coordinates for documents in latent space 95 Παρατηρήσεις Η παράµετρος k (<m) πρέπει να είναι: large enough to allow fitting the characteristics of the data small enough to filter out the non-relevant representational details 96 2

Τρόπος Σύγκρισης Όρων και Εγγράφων Τρόπος σύγκρισης 2 όρων: the dot product between two row vectors of reflects the extent to which two terms have a similar pattern of occurrence across the set of document. X terms documents X^ t x d documents Τρόποςσύγκρισηςδύοεγγράφων: dot product between two column vectors of X terms X^ t x d 97 Terms Graphed in Two Dimensions -0.5 0.0 0.5.0.5 system user EPS response time computer survey interface human graph trees minors -2.0 -.5 -.0-0.5 0.0 LSA2.SVD.2dimTrmVectors[,] 22

Documents and Terms -0.5 0.0 0.5.0.5 system c2 c4 c3 user survey response c5time computer interface c human EPS m4 graph m3 trees minors m2 m -2.0 -.5 -.0-0.5 0.0 LSA2.SVD.2dimTrmVectors[,] Change in Text Correlation Correlations between text in raw data c c2 c2 c4 c5 m m2 m3 m4 c.000 c2-0.92.000 c3 0.000 0.000.000 c4 0.000 0.000 0.472.000 c5-0.333 0.577 0.000-0.309.000 m -0.74-0.302-0.23-0.6-0.74.000 m2-0.258-0.447-0.36-0.239-0.258 0.674.000 m3-0.333-0.577-0.408-0.309-0.333 0.522 0.775.000 m4-0.333-0.92-0.408-0.309-0.333-0.74 0.258 0.556.000 Correlations in two-dimensional space c c2 c2 c4 c5 m m2 m3 m4 c.000 c2 0.90.000 c3.000 0.92.000 c4 0.998 0.884 0.998.000 c5 0.842 0.990 0.844 0.809.000 m -0.858-0.568-0.856-0.887-0.445.000 m2-0.853-0.562-0.85-0.883-0.438.000.000 m3-0.852-0.559-0.850-0.88-0.435.000.000.000 m4-0.8-0.497-0.809-0.845-0.368 0.996 0.997 0.997.000 23

Latent Semantic Indexing: Ranking Η επερώτηση q του χρήστη µοντελοποιείται ως ένα ψευδοέγγραφο στον αρχικό πίνακα Χ X d d 2. d d q k w w 2 w d w q k 2 w 2 w 22 w d2 w q2 : : : : : : : : k t w t w 2t w dt w qt 0 Συµπεράσµατα Latent semantic indexing provides an interesting conceptualization of the IR problem It allows reducing the complexity of the underline representational framework which might be explored, for instance, with the purpose of interfacing with the user 02 24

Παρατηρήσεις What is the common and difference between PCA (Principle Component Analysis) and SVD? Both are related to standard eigenvalue-eigenvector, to remove noise or correlation and get the most important info. PCA is on covariance matrix and SVD works on original matrix. 03 ιάρθρωση ιάλεξης PART (A) Ανάκτηση και Φιλτράρισµα Εισαγωγή στα Μοντέλα Άντλησης Κατηγορίες Μοντέλων Exact vs Best Match Τα κλασσικά µοντέλα ανάκτησης Το Boolean Μοντέλο Στατιστικά Μοντέλα - Βάρυνση Όρων Το ιανυσµατικό Μοντέλο Το Πιθανοκρατικό Μοντέλο PART (B):Εναλλακτικά µοντέλα (Ι)Συνολοθεωρητικάµοντέλα Fuzzy Model Extended Boolean Model (ΙΙ) Αλγεβρικά Μοντέλα PART (C): Latent Semantic Indexing Neural Netwok Model (ΙΙΙ)ΠιθανοκρατικάΜοντέλα Bayesian Network Model Inference Network Model 04 25

Information Models Neural Network Model (Μοντέλο Νευρωνικού ικτύου) Neural Network Model ΚλασσικάµοντέλαΑΠ: Έγγραφα και επερωτήσεις ευρετηριάζονται από όρους Η ανάκτηση βασίζεται στο ταίριασµα όρων Κίνητρο: ΕίναιγνωστόότιταΝευρωνικά ίκτυαείναικαλοί pattern matchers 06 26

Human Brain is a Neural Network The human brain is composed of billions of neurons ( million millions of nodes where each node has one thousands edges) Each neuron can be viewed as a small processing unit A neuron is stimulated by input signals and emits output signals in reaction A chain reaction of propagating signals is called a spread activation process As a result of spread activation, the brain might command the body to take physical reactions 07 Neural Networks A neural network is an oversimplified representation of the neuron interconnections in the human brain: nodes are processing units edges are synaptic connections the strength of a propagating signal is modelled by a weight assigned to each edge the state of a node is defined by its activation level depending on its activation level, a node might issue an output signal 08 27

Neural Network for IR Query Terms [From the work by Wilkinson & Hingston, SIGIR 9] Document Documents Terms k d k a k b k a k b k c d j d j+ k c k t d N Μιας κατεύθυνσης ιπλής κατεύθυνσης 09 Neural Network for IR ίκτυο τριών επιπέδων Τα σήµατα διαδίδονται (propagate) στο δίκτυο ο στάδιο διάδοσης: Query terms issue the first signals These signals propagate accross the network to reach the document nodes 2o στάδιοδιάδοσης: Document nodes might themselves generate new signals which affect the document term nodes Document term nodes might respond with new signals of their own, and so on Query Terms k a k b k c Documen t Terms k k a k b k c k t Documen ts d d j d j+ d N 0 28

Μετάδοση σηµάτων Μέγιστητιµήσήµατος = (άρακάνουµεκανονικοποίηση) Οιόροιτηςεπερώτησηςεκπέµπουντοαρχικόσήµαίσοµε Καθορισµός των βαρών στις ακµές: query terms => terms terms => docs Μετάδοση σηµάτων (ΙΙ) Query Terms Document Terms Documents Initial activation level k a k b k c k k a k b k c d d j d j+ Άθροιση σηµάτων. Άρα το activation level στο dj (µετά τον ο γύρο), είναι: w iq = w iq t 2 wiq i= k t d N ij = ij t i= Σηµείωση: τα αρχικά wiq kai wij όπως στο διανυσµατικό µοντέλο (tf-idf) w w w 2 ij t i= w iq w ij The ranking of the classic Vector Space Model! 2 29

Μετάδοση σηµάτων (ΙΙΙ) Query Terms Document Terms Documents Initial activation level k a k b k c k k a k b k c d d j d j+ Άθροιση σηµάτων. Άρα το activation level στο dj (µετά τον ο γύρο), είναι: wiq = w iq t 2 wiq i= k t wij = Η ανάκτηση µπορεί να βελτιωθεί αν επιτρέψουµε στους κόµβους των εγγράφων να εκπέµψουν σήµα (λειτουργία ανάλογη της ανάδρασης συνάφειας) w ij t i= w 2 ij d N t i= w iq w ij A minimum threshold should be enforced to avoid spurious signal generation 3 Μοντέλο Νευρωνικού ικτύου: Επίλογος Model provides an interesting formulation of the IR problem Model has not been tested extensively It is not clear the improvements that the model might provide 4 30