University of Crete, Computer Science Department. HY-561 Web Data Management. Vassilis Christophides

Name:
Student ID:

Final Examination (3 hours)
Date: Friday, 14 July 2006

Exercise 1 (50 points)

You are given the following entity-relationship (ER) model for representing information about patients and their attending doctors.

1. (15 points) Translate the ER model above into an XML schema. Design suitable XML element types (in two files, patients.xsd and doctors.xsd) for each entity, taking into account the integrity constraints (keys, foreign keys) as well as the following:

A) For each Patient:
- SSN (1 occurrence, required)
- Name (1 occurrence, required)
- Insurance Plan (1 or more occurrences, required)
- Address (0 or 1 occurrence, optional)
- Phone Number (1 occurrence, required)

B) For each Doctor:
- ID (1 occurrence, required)
- Name (1 occurrence, required)
- SSN (1 occurrence, required)
- Office visit fee (values from 10 to 115, required)
- Specialty (1 occurrence, required)
- Accepted Insurance plans (0 or more occurrences, optional)

doctors.xsd:

    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="doctors">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="doctor" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:sequence>
                  <xsd:element name="id" type="xsd:string" minOccurs="1" maxOccurs="1" nillable="false"/>
                  <xsd:element name="name" type="xsd:string" minOccurs="1" maxOccurs="1" nillable="false"/>
                  <xsd:element name="ssn" type="xsd:string" minOccurs="1" maxOccurs="1" nillable="false"/>
                  <xsd:element name="visitfee" nillable="false">
                    <xsd:simpleType>
                      <xsd:restriction base="xsd:integer">
                        <xsd:minInclusive value="10"/>
                        <xsd:maxInclusive value="115"/>
                      </xsd:restriction>
                    </xsd:simpleType>
                  </xsd:element>
                  <xsd:element name="specialty" type="xsd:string" minOccurs="1" maxOccurs="1" nillable="false"/>
                  <xsd:element name="acceptableplan" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
                </xsd:sequence>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
        </xsd:complexType>
        <!-- Identity constraints are declared on the enclosing element, after its type. -->
        <xsd:key name="doctorskey">
          <xsd:selector xpath=".//doctor"/>
          <xsd:field xpath="id"/>
        </xsd:key>
      </xsd:element>
    </xsd:schema>

patients.xsd:

    <?xml version="1.0"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                xmlns:doc="./doctors.xsd">
      <xsd:element name="patients">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="patient" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:sequence>
                  <xsd:element name="name" type="xsd:string" nillable="false"/>
                  <xsd:element name="ssn" type="xsd:string" nillable="false" minOccurs="1" maxOccurs="1"/>
                  <xsd:element name="phone" nillable="false" minOccurs="1" maxOccurs="1">
                    <xsd:complexType>
                      <xsd:sequence>
                        <xsd:element name="areacode" type="xsd:string"/>
                        <xsd:element name="phoneno" type="xsd:string"/>
                      </xsd:sequence>
                    </xsd:complexType>
                  </xsd:element>
                  <xsd:element name="address" minOccurs="0" maxOccurs="1">
                    <xsd:complexType>
                      <xsd:sequence>
                        <xsd:element name="streetno" type="xsd:string"/>
                        <xsd:element name="streetname" type="xsd:string"/>
                        <xsd:element name="city" type="xsd:string"/>
                        <xsd:element name="state" type="xsd:string"/>
                        <xsd:element name="zip" type="xsd:string"/>
                      </xsd:sequence>
                    </xsd:complexType>
                  </xsd:element>
                  <!-- Here we translate the relationship has_appointment -->
                  <xsd:element name="id" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
                  <xsd:element name="insuranceplan" type="xsd:string" minOccurs="1" maxOccurs="unbounded"/>
                </xsd:sequence>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
        </xsd:complexType>
        <xsd:key name="patientskey">
          <xsd:selector xpath=".//patient"/>
          <xsd:field xpath="ssn"/>
        </xsd:key>
        <!-- Intended foreign key to doctorskey in doctors.xsd. XML Schema resolves a keyref
             only against keys within the same instance document, so this cross-file
             reference records the intent; a validator cannot enforce it. -->
        <xsd:keyref name="has_appointment" refer="doc:doctorskey">
          <xsd:selector xpath=".//patient/id"/>
          <xsd:field xpath="."/>
        </xsd:keyref>
      </xsd:element>
    </xsd:schema>

2. (35 points) Give the XQuery expressions, together with the type of their result, for the following queries, assuming you are given two relevant XML documents, patient.xml and doctor.xml.

Note: Patients are identified by their social security number (SSN) (for convenience, assume that their names are unique in patient.xml), while doctors are identified by their IDs. Thus, in the following queries:
- Whenever you are asked to return a specific value for patients, give the name together with the requested value for each patient.
- Whenever you are asked to return a specific value for doctors, give the name together with the requested value for each doctor.

a) Give the insurance plans of each patient (the plans must be separated by ","). Sort the result in ascending order of patient name.

    for $p in doc("patient.xml")//patients/patient
    order by $p/name ascending
    return
      <Patient>
        {$p/name}
        <InsurancePlans>{string-join($p/insuranceplan, ", ")}</InsurancePlans>
      </Patient>

    element Patient { element name {xsd:string}, element InsurancePlans {text} }*

b) Give the number of patients for each zip code (zip). Sort the result in descending order of the number of patients.

    for $z in distinct-values(doc("patient.xml")//patients/patient/address/zip)
    let $n := count(doc("patient.xml")//patients/patient[address/zip = $z])
    order by $n descending
    return
      <Patients zip="{$z}">
        <PatientNum>{$n}</PatientNum>
      </Patients>

    element Patients { attribute zip {xdt:untypedAtomic}, element PatientNum {xsd:integer} }*

c) Give the number of patients for each doctor, provided that the patient belongs to an insurance plan that the doctor accepts. Sort the result in descending order of the number of patients.

    for $d in doc("doctor.xml")//doctors/doctor
    let $n := count(
      for $p in doc("patient.xml")//patients/patient
      where exists(
        for $docid in $p/id
        where $docid = $d/id
          and exists(
            for $dplan in $d/acceptableplan
            for $pplan in $p/insuranceplan
            where $dplan = $pplan
            return $dplan)
        return $docid)
      return $p)
    order by $n descending
    return
      <Doctors>
        <DoctorID>{data($d/id)}</DoctorID>
        <PatientNum>{$n}</PatientNum>
      </Doctors>

    element Doctors { element DoctorID {xsd:string}, element PatientNum {xsd:integer} }*
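Queries (a) and (b) can be cross-checked on a small document outside an XQuery engine. The following is a minimal sketch in Python, assuming the lowercase element names of the schema in part 1; the sample data and the helper names `plans_per_patient` and `patients_per_zip` are hypothetical and not part of the exam.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data; element names follow the schema of part 1.
PATIENTS_XML = """<patients>
  <patient><name>Maria</name><ssn>2</ssn>
    <address><zip>71409</zip></address>
    <insuranceplan>PlanA</insuranceplan><insuranceplan>PlanB</insuranceplan></patient>
  <patient><name>Anna</name><ssn>1</ssn>
    <address><zip>71409</zip></address>
    <insuranceplan>PlanC</insuranceplan></patient>
  <patient><name>Nikos</name><ssn>3</ssn>
    <insuranceplan>PlanA</insuranceplan></patient>
</patients>"""


def plans_per_patient(root):
    """Query (a): (name, comma-separated plans), ascending by name."""
    rows = [
        (p.findtext("name"),
         ", ".join(e.text for e in p.findall("insuranceplan")))
        for p in root.findall("patient")
    ]
    return sorted(rows)


def patients_per_zip(root):
    """Query (b): (zip, patient count), descending by count."""
    counts = Counter(z.text for z in root.findall("patient/address/zip"))
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


root = ET.fromstring(PATIENTS_XML)
print(plans_per_patient(root))
print(patients_per_zip(root))
```

As in query (b), a patient without an address contributes to no zip code, because the optional address element is simply absent.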

d) Give the patients who must pay a fee to at least two of the doctors they have visited (i.e., at least two of the doctors in their list do not accept any of the patient's insurance plans). Sort the result in ascending order of patient name.

    for $p in doc("patient.xml")//patients/patient
    where count($p/id) - count(
        for $docid in $p/id
        where exists(
          for $d in doc("doctor.xml")//doctors/doctor
          where $d/id = $docid
            and exists(
              for $dplan in $d/acceptableplan
              where $dplan = $p/insuranceplan
              return $dplan)
          return $d)
        return $docid) >= 2
    order by $p/name ascending
    return <Patients>{$p/name}</Patients>

    element Patients { element name {xsd:string} }*

e) Give the patients who have paid a doctor a visit fee greater than or equal to 100 because the doctor does not accept their insurance plans. Sort the result in ascending order of patient name.

    for $p in doc("patient.xml")//patients/patient
    where exists(
      for $dd in $p/id
      for $d in doc("doctor.xml")//doctors/doctor
      where $d/id = $dd
        and $d/visitfee >= 100
        and empty(
          for $plan in $d/acceptableplan
          where $plan = $p/insuranceplan
          return $plan)
      return $d)
    order by $p/name ascending
    return <Patients>{$p/name}</Patients>

    element Patients { element name {xsd:string} }*

f) Give the patients who have booked at least one appointment with a dermatologist. Sort the result in ascending order of patient name.

    let $d := doc("doctor.xml")//doctors/doctor
    for $p in doc("patient.xml")//patients/patient
    where count($d[id = $p/id and specialty = "Dermatology"]) != 0
    order by $p/name ascending
    return <Patients>{$p/name}</Patients>

    element Patients { element name {xsd:string} }*
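Queries (d) and (e) both rest on the same test: a doctor on the patient's list accepts none of the patient's insurance plans. A minimal sketch of that test in Python follows, assuming that every id on a patient's list refers to an existing doctor; the sample data and the helper names (`texts`, `uncovered_doctors`, `query_d`, `query_e`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data using the element names of the schema in part 1.
DOCTORS = ET.fromstring("""<doctors>
  <doctor><id>d1</id><visitfee>100</visitfee><acceptableplan>PlanA</acceptableplan></doctor>
  <doctor><id>d2</id><visitfee>60</visitfee></doctor>
  <doctor><id>d3</id><visitfee>110</visitfee><acceptableplan>PlanB</acceptableplan></doctor>
</doctors>""")
PATIENTS = ET.fromstring("""<patients>
  <patient><name>Maria</name><id>d1</id><id>d2</id><id>d3</id>
    <insuranceplan>PlanA</insuranceplan></patient>
  <patient><name>Anna</name><id>d1</id>
    <insuranceplan>PlanA</insuranceplan></patient>
</patients>""")


def texts(elem, tag):
    """Set of text values of the children of elem named tag."""
    return {e.text for e in elem.findall(tag)}


def uncovered_doctors(patient):
    """Doctors on the patient's list that accept none of the patient's plans."""
    plans = texts(patient, "insuranceplan")
    visited = texts(patient, "id")
    return [
        d for d in DOCTORS.findall("doctor")
        if d.findtext("id") in visited
        and not (texts(d, "acceptableplan") & plans)
    ]


def query_d():
    """Patients with at least two uncovered doctors, ascending by name."""
    return sorted(
        p.findtext("name") for p in PATIENTS.findall("patient")
        if len(uncovered_doctors(p)) >= 2
    )


def query_e():
    """Patients with an uncovered doctor whose visit fee is >= 100."""
    return sorted(
        p.findtext("name") for p in PATIENTS.findall("patient")
        if any(int(d.findtext("visitfee")) >= 100 for d in uncovered_doctors(p))
    )


print(query_d(), query_e())
```

In this sample, Maria's plan is accepted by d1 only, so d2 and d3 are uncovered, while Anna's only doctor is covered.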

g) Give the patients who have booked an appointment with at least two doctors of the same specialty. Sort the result in descending order of patient name.

    for $p in doc("patient.xml")//patients/patient
    where count(distinct-values(
        for $dd in $p/id
        for $d in doc("doctor.xml")//doctors/doctor
        where $d/id = $dd
        return $d/specialty)) < count($p/id)
    order by $p/name descending
    return <Patients>{$p/name}</Patients>

    element Patients { element name {xsd:string} }*

Exercise 2 (33 points)

Containment and equivalence are two important problems for declarative XML query languages (as they are for relational databases). Suppose we have two XPath expressions Q1 and Q2. The evaluation of Q1 on an XML document D, denoted Q1(D), is a set of nodes of D (unordered). We define containment of two XPath expressions as follows: Q1 ⊆ Q2 if, for every XML document D, Q1(D) ⊆ Q2(D) (i.e., the results of Q1 are always contained in the results of Q2). If Q1 ⊆ Q2 and Q2 ⊆ Q1, then Q1 and Q2 are equivalent expressions.

α) (18 points) You are given the following XPath expressions:

    Q1: /a[*]/b
    Q2: /a//b
    Q3: /a[b][g]
    Q4: /a[g][b]
    Q5: /a[b]/*//g
    Q6: /a[b]//*/g

Which of the following containment relations hold (answer yes or no)?

    i)   Q1 ⊆ Q2?  Yes
    ii)  Q2 ⊆ Q1?  No
    iii) Q3 ⊆ Q4?  Yes
    iv)  Q4 ⊆ Q3?  Yes
    v)   Q5 ⊆ Q6?  Yes
    vi)  Q6 ⊆ Q5?  Yes
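The "No" for Q2 ⊆ Q1 can be confirmed with a single counterexample document. The sketch below translates Q1 and Q2 by hand into Python functions over ElementTree (the functions `q1`, `q2`, `contained` and the sample documents are illustrative assumptions, not part of the exam). Testing on sample documents can only refute a containment, never prove it.

```python
import xml.etree.ElementTree as ET


def q1(root):
    """/a[*]/b : b children of a root a that has at least one element child."""
    if root.tag != "a" or len(root) == 0:
        return []
    return [c for c in root if c.tag == "b"]


def q2(root):
    """/a//b : every b that is a proper descendant of a root a."""
    if root.tag != "a":
        return []
    return [e for e in root.iter("b") if e is not root]


def contained(small, big, docs):
    """True if small(D) is a subset of big(D) on every sample document D."""
    for xml in docs:
        root = ET.fromstring(xml)
        big_ids = {id(e) for e in big(root)}
        if any(id(e) not in big_ids for e in small(root)):
            return False
    return True


# Hypothetical sample documents; the second one is the counterexample.
DOCS = ["<a><b/></a>", "<a><c><b/></c></a>", "<a><b><b/></b><c/></a>", "<a/>"]

print(contained(q1, q2, DOCS))  # no violation found on these samples
print(contained(q2, q1, DOCS))  # refuted by <a><c><b/></c></a>
```

In `<a><c><b/></c></a>` the b element is a descendant of a but not a child, so Q2 returns it and Q1 does not.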

β) (15 points) Focusing on simple XPath expressions such as those of the previous question (i.e., expressions that use only the child and descendant-or-self axes together with simple predicates and the wildcard *), an XPath expression Q is minimal if every equivalent expression is at least as large as Q (counting all node tests, including the wildcard * and those inside predicates).

i) Are the above XPath expressions Q1 ... Q6 minimal? For any that is not, give a smaller equivalent expression.

The only non-minimal expression is Q1. It can be written as /a/b, because any a that has a b child necessarily satisfies the predicate [*].

ii) Does an XPath expression always have a unique minimal form? Justify why, or give a counterexample.

No. Expressions Q5 and Q6 above are two different but equivalent expressions, and both are minimal.

iii) What is the minimal form of the following XPath expression: /a[b/*][b//g]/b/g

Answer: it is equivalent to /a/b/g, because the selected path b/g already satisfies both predicates.

Exercise 3 (30 points)

You are given the following XQuery query, which returns the airports with the most traffic on 26/06/2006 (based on the number of arrivals and departures of each airport):

    let $results := (
      <traffic> {
        for $a in doc('flights.xml')//airport
        let $c := count(
          doc('flights.xml')//flight[
            ((./source/text() eq $a/@airid) or (./destination/text() eq $a/@airid))
            and (date/text() eq '2006-06-26') ] )
        return <result> {$a} <count>{$c}</count> </result>
      } </traffic>)
    return $results/result[xs:integer(./count/text()) eq xs:integer(max($results//count/text()))]

Note: For the algebraic translation of an XQuery query we need operators that represent the XPath navigation expressions (on the child and descendant-or-self axes) as well as the XQuery expressions for flow control (ifthenelse), path matching (match), and node construction (nodeconstructor) in the result.

1. An XPath expression can be represented by a linear structure that expresses all navigation steps of a path and is read bottom-up:
   - children and descendant-or-self are the operations corresponding to / and //.
   - Matching operation:
     - match(text()) returns all text nodes from a given set of nodes,
     - match(count) returns all nodes named count from a given set of nodes.
2. Constants are represented using the constant operation.
3. Functions and operators are represented as trees, whose parameters are the children of the tree node.
4. Each operation has a return value, which it passes to its ancestor nodes in the tree.

α) (10 points) Construct an execution plan for the above query by translating the XQuery expressions into algebraic expressions as follows:

LET/RETURN expressions (a return belongs to a let or a for):

    let $result := <order>abc</order>
    return $result/text()

Notes: we represent the let variable as a label on the left edge; the right edge gives the result; the right edge (corresponding to the initial return) cannot be executed before the left edge.

FOR expressions:

    for $a in doc('flights.xml')//airport
    let $c := 5
    return <result>{$c}</result>

Notes:
- The for variable is a label of the left edge. The left edge evaluates to a sequence of nodes.
- For each mapping of $a to one of the items in the sequence generated by the left side, the right edge is executed.
- The result of the for is the union of all results returned on the right side for each mapping.
- The return belongs to the let statement, so the result returned at each iteration is the right edge of the let node.

ifthenelse expressions:

    let $results := [ ]
    return $results/result[xs:integer(./count/text()) eq 2]

Note: the expression is equivalent to:

    for $r in $results/result
    where xs:integer($r/count/text()) eq 2
    return $r

- A temporary variable, $r, is introduced.
- An ifthenelse function takes a Boolean expression and returns the result of the second branch when the expression is true, and the result of the third branch when the expression is false (in our case, we return the empty sequence).
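The translation of a predicate into a for loop with an ifthenelse node can be mimicked in a few lines of Python. This is only an illustrative sketch: the dictionaries stand in for result elements, and the airport codes and counts are made-up values.

```python
def ifthenelse(condition, then_value, else_value):
    """Return the second argument when condition is true, else the third."""
    return then_value if condition else else_value


def filter_results(results, wanted):
    """Mimics $results/result[xs:integer(./count/text()) eq wanted]."""
    out = []
    for r in results:  # plays the role of the temporary variable $r
        # The else branch is the empty sequence, so nothing is appended.
        out.extend(ifthenelse(r["count"] == wanted, [r], []))
    return out


# Hypothetical result elements.
results = [{"airport": "HER", "count": 2}, {"airport": "ATH", "count": 5}]
print(filter_results(results, 2))
```

The result of the loop is the concatenation of what each iteration returns, which matches the "union of all results" reading of the for node.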

[Figure not reproduced: query plan diagram containing the constant '2006-06-26']

β) (20 points) Taking the number of operations in the execution plan as the cost function, optimize the plan you gave in the previous question.

Note: The goal of this rewriting of the plan is to identify the operations that can be removed (i.e., that are not necessary) or optimized (e.g., some operations can be executed only once, outside a loop, instead of in every iteration of the loop), in order to achieve a more efficient evaluation of the query. In particular, we are interested in recognizing common subexpressions (i.e., expressions that are identical and therefore need not be executed several times at different points of the same query). Constants are a special case of common subexpressions and are evaluated only once, outside all loops.

Query rewriting: We will eliminate common subexpressions from the initial query (expressions which can be evaluated once instead of each time they appear). Depending on the query engine, several other optimization steps might be applied to the expression. The common subexpressions do not depend on the variable of the for expression in which they appear.

1. Identifying common subexpressions:
   - all constants: 'flights.xml', '2006-06-26'
   - doc('flights.xml'): evaluated twice, once to get the flights and once to get the airports (doc('flights.xml')//flight and doc('flights.xml')//airport)
   - doc('flights.xml')//flight: currently executed in each iteration of for number 1
   - $a/@airid: computed for each flight in for number 2
   - max($results//count/text()): computed in each iteration of for number 3

2. Rewriting the query plan:
   - For each common subexpression, we introduce a let statement and a temporary variable. The new let statement computes the value of the expression once for each context in which it was previously executed several times.
   - We use the temporary variable wherever the common subexpression previously appeared.
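The effect of hoisting a common subexpression out of a loop can be seen in a small Python analogue of the traffic query. This is a sketch under stated assumptions: `AIRPORTS`, `FLIGHTS`, and `load_flights` are hypothetical stand-ins for the airport list, the flight list, and the doc('flights.xml')//flight expression, and the `calls` counter records how often that expression is evaluated.

```python
# Hypothetical data: (source, destination, date) triples and airport ids.
FLIGHTS = [
    ("HER", "ATH", "2006-06-26"),
    ("ATH", "SKG", "2006-06-26"),
    ("SKG", "HER", "2006-06-27"),
]
AIRPORTS = ["HER", "ATH", "SKG"]
DAY = "2006-06-26"


def load_flights(calls):
    """Stand-in for doc('flights.xml')//flight; counts its evaluations."""
    calls["doc"] += 1
    return FLIGHTS


def traffic(airport, flights):
    """Number of flights on DAY that depart from or arrive at airport."""
    return sum(1 for s, d, date in flights
               if (s == airport or d == airport) and date == DAY)


def busiest_naive(calls):
    # The flight list is re-evaluated for every airport and the maximum
    # is recomputed for every result, as in the initial plan.
    results = [(a, traffic(a, load_flights(calls))) for a in AIRPORTS]
    return [r for r in results if r[1] == max(c for _, c in results)]


def busiest_hoisted(calls):
    flights = load_flights(calls)        # like: let $_flights := ...
    results = [(a, traffic(a, flights)) for a in AIRPORTS]
    top = max(c for _, c in results)     # like: let $_max := ...
    return [r for r in results if r[1] == top]


naive_calls, hoisted_calls = {"doc": 0}, {"doc": 0}
print(busiest_naive(naive_calls), naive_calls)
print(busiest_hoisted(hoisted_calls), hoisted_calls)
```

Both versions return the same airports; only the number of evaluations of the shared expressions changes, which is exactly what the cost function of this question measures.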
NOTE: For simplification, we will only rewrite the expressions without rewriting the descendant-or-self axis. The figure below shows the common subexpressions (in red) and where the new let statements should be introduced. Figure 3 shows the rewritten query plan.

[Figure not reproduced: query plan with the common subexpressions highlighted]

The figure below shows the rewritten query plan, in which the common subexpressions are optimized.
[Figure not reproduced]

The rewritten query is equivalent to the following XQuery (the new variables are those whose names start with an underscore):

    let $results := (
      <traffic> {
        (: all descendants :)
        let $_doc := doc('flights.xml')//descendant-or-self::*
        let $_flights := $_doc/flight
        for $a in $_doc/airport
        let $_airid := $a/@airid
        let $c := count(
          $_flights[
            ((./source/text() eq $_airid) or (./destination/text() eq $_airid))
            and (date/text() eq '2006-06-26') ] )
        return <result> {$a} <count>{$c}</count> </result>
      } </traffic>)
    let $_max := xs:integer(max($results//count/text()))
    return $results/result[xs:integer(./count/text()) eq $_max]

One way to compute the cost of a query plan is to add 1 for each node in the plan (for uniformity, we also add 1 for the let nodes and for the root of each for). The cost of a for expression is computed as:

    1 (= cost of the root node) + cost(left part) + count(items in left part) * cost(right part)

In the following computation we take into account the number of airports and the number of flights:

    Flights: 41
    Airports: 10

Initial cost (before the rewrite):

    Cost(Q1_initial) = 2 + Cost(for1) + Cost(for3) = 2 + 12,425 + 174 = 12,601
    Cost(for1) = 1 + 4 (= left side) + no_of_airports * cost(right_side)
               = 5 + 10 * (7 + Cost(for2)) = 5 + 10 * (7 + 1,235) = 12,425
    Cost(for2) = 1 + 4 (= left side) + no_of_flights * 30 = 5 + 41 * 30 = 1,235
    Cost(for3) = 1 + 3 (= left side) + no_of_<result>_elements * cost(right_side)
               = 1 + 3 + 10 * 17 = 174

Cost after the query rewrite:

    Cost(Q1_rewrite) = 9 + Cost(for1) + 8 + Cost(for3) = 17 + 10,793 + 114 = 10,924
    Cost(for1) = 1 + 2 + no_of_airports * cost(right_side)
               = 3 + 10 * (6 + Cost(for2) + 5)
               = 3 + 10 * (11 + 1,068) = 3 + 10 * 1,079 = 10,793
    Cost(for2) = 1 + 1 + no_of_flights * cost(right_side) = 2 + 41 * 26 = 1,068
    Cost(for3) = 1 + 3 + no_of_<result>_elements * cost(right_side)
               = 4 + 10 * 11 = 114

As expected, Cost(Q1_rewrite) < Cost(Q1_initial), so the rewrite reduces the cost. This computation counts every operation, even though some Boolean operations might not always be executed. Nevertheless, it gives an impression of the technique of query rewriting.
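The arithmetic above can be rechecked mechanically. The following Python sketch applies the stated formula for a for expression to the per-node costs quoted in the text; the function name `for_cost` and the variable names are illustrative choices, and the constants (4, 30, 7, 17, and so on) are taken as given from the computation above rather than derived from the plan figures.

```python
def for_cost(left, items, right):
    """1 (root node) + cost(left part) + items in left part * cost(right part)."""
    return 1 + left + items * right


AIRPORTS, FLIGHTS = 10, 41

# Initial plan.
for2_initial = for_cost(4, FLIGHTS, 30)
for1_initial = for_cost(4, AIRPORTS, 7 + for2_initial)
for3_initial = for_cost(3, AIRPORTS, 17)
q1_initial = 2 + for1_initial + for3_initial

# Rewritten plan.
for2_rewrite = for_cost(1, FLIGHTS, 26)
for1_rewrite = for_cost(2, AIRPORTS, 6 + for2_rewrite + 5)
for3_rewrite = for_cost(3, AIRPORTS, 11)
q1_rewrite = 9 + for1_rewrite + 8 + for3_rewrite

print(q1_initial, q1_rewrite, q1_initial - q1_rewrite)
```

Under these figures the rewrite saves 1,677 operations, and most of the saving comes from the inner loop over flights, which runs once per airport.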