Department of Computer Engineering and Informatics, University of Patras, School of Engineering

Transcript

Algorithms for the fast estimation of statistical leverage scores
(Greek title: Αλγόριθμοι για την ταχεία εκτίμηση τιμών στατιστικής μόχλευσης)

This thesis is submitted in partial fulfilment of the requirements for the postgraduate diploma (MSc) "Computer Science and Technology" of the Department of Computer Engineering and Informatics of the University of Patras.

Aleksandros Sobczyk

Three-member committee:
Professor Efstratios Gallopoulos (advisor)
Associate Professor Ioannis Caragiannis
Assistant Professor Emmanouil Psarakis

Patras, February 2017

University of Patras, Department of Computer Engineering and Informatics. Aleksandros Sobczyk, 2017. All rights reserved.

Abstract (translated from Greek)

This thesis studies algorithms for the fast estimation of leverage scores in datasets. Statistical leverage scores are a powerful tool for data analysis and statistics. They have been used successfully to detect outliers in datasets and to find important nodes in graphs, and they have recently been applied to randomized linear algebra algorithms. To construct estimators, we analyze several randomized dimensionality reduction techniques in combination with iterative methods for solving linear systems with multiple right-hand sides. Based on these techniques we try to overcome specific limitations of the best approaches to date, and we propose an algorithm that provably returns good estimates of the leverage scores, performs well in large-scale computations in parallel/distributed environments, and handles sparse matrices efficiently. We present our results on synthetic and real datasets, comment on them, and discuss the advantages and disadvantages of the various algorithms.


Abstract

In this thesis we consider algorithms for the fast estimation of leverage scores. Statistical leverage scores are a powerful tool for data analysis and statistics. They have been used successfully for outlier detection in datasets and for locating important nodes in graphs, and more recently they have been applied to numerical linear algebra algorithms. In order to build estimators, we consider dimensionality reduction techniques that use randomization, in combination with iterative methods for solving linear systems with multiple right-hand sides. Based on these techniques we try to overcome certain limitations of the current state-of-the-art algorithms and propose an approach which provably returns good estimates of leverage scores, scales well in parallel/distributed environments and effectively utilizes sparsity. We present results on synthetic and real-world datasets, evaluate the performance of the proposed approach, and discuss its advantages and drawbacks relative to the other approaches considered.


Acknowledgements (translated from Greek)

With the completion of this thesis I want to thank all the people, professors, colleagues, friends and acquaintances, who supported me and helped me complete my MSc. First, I want to thank my family for the financial and emotional support they gave me throughout my studies. I also thank my advisor, Professor Efstratios Gallopoulos, for all the opportunities I was given during our collaboration. I am grateful to him for all the knowledge I gained, as well as for his support and advice throughout this period on academic and various other matters. I want to thank the members of the three-member examination committee of this thesis, Assistant Professor Emmanouil Psarakis and Associate Professor Ioannis Caragiannis. A big thank you goes to Vasilios Kalantzis for our extensive discussions on various topics studied in the thesis and for all his advice and comments. I also thank Professor Petros Drineas for the discussions, observations and comments regarding the content of the thesis, the literature and the execution of the experiments.

I also want to thank Kimon Fountoulakis, Eugenia Kontopoulou and Fred Roosta for our discussions and various comments. I thank Professor Christos Zaroliagis, who gave us access to computational resources for running the experiments. I want to thank Dr. George Kollias for his excellent support and advice on software libraries and general guidelines for running experiments. I especially want to thank the foundation Andreas Mentzelopoulos Scholarships for the University of Patras for funding this MSc with a scholarship. Finally, I want to thank Marianna. Her support was one of the most important factors in bringing this MSc to completion. I am also grateful to all my friends, fellow students and colleagues for their encouragement and support throughout this period.

Aleksandros Sobczyk, Patras 2017.

Acknowledgements

With the completion of this thesis I want to thank all the people, faculty, co-workers, friends and family, who helped me finish my MSc. First of all, I want to thank my family for their love and support during the years of my studies. I thank my advisor, Professor Stratis Gallopoulos, for all the opportunities I was given while we worked together, for all the knowledge I received, and for the support and advice he offered me on academic and various other matters. I want to thank Assistant Professor Emmanouil Z. Psarakis and Associate Professor Ioannis Caragiannis for agreeing to be part of the examination committee and for their overall contribution. I especially acknowledge Vasilios Kalantzis for our thorough discussions and for all the comments and advice he gave me. I thank Professor Petros Drineas for many observations and comments concerning the existing literature and the experiments; Kimon Fountoulakis, Fred Roosta and Eugenia Kontopoulou for our discussions and various comments; and Professor Christos Zaroliagis for granting us access to computational resources to run our experiments.

I thank Dr. George Kollias for excellent support and advice concerning software implementations and general guidelines for experiments. I especially want to thank the Andreas Mentzelopoulos Scholarships for the University of Patras for financially supporting this research with a scholarship. Lastly and most of all, I want to thank Marianna. Her support is one of the most important factors that led to the completion of this MSc. I also thank all my friends, fellow students and officemates for their encouragement and support.

Aleksandros Sobczyk, Patras 2017.

Contents

Acknowledgements (Greek)
Acknowledgements
Introduction (Greek)

1 Introduction
  Applications
  Algorithms
  Contribution
  Some indicative results
  Outline
  Notation

2 Dimensionality reduction
  Johnson-Lindenstrauss transforms
  Subspace embeddings
  Subsampled Randomized Hadamard Transform
  Sparse Embedding Matrix

3 Least squares problems with multiple right hand sides
  Rank deficiency and leverage scores computation
  Solving each system independently
  Complexity
  Block-CG
  Complexity
  Block-seed CG
  Complexity
  Preconditioning
  Constructing preconditioners using randomization
  Jacobi preconditioning with sparse JLTs
  Preconditioning least squares using a Gaussian sketch
  Preconditioned BCG
  Complexity

4 Algorithms
  State of the art 1: $A(SA)^{\dagger}\Pi$
  State of the art 2: $(AA^\top)^q \Pi$
  Diagonal estimator framework
  Matrix functions
  Proposed approach 1: Diagonal estimation based
  Expectation and variance of the estimated values
  Complexity
  Proposed approach 2: Row norm estimation based
  Bounds for the estimated values
  Complexity
  Comparison

5 Experiments
  Approximation accuracy
  $\alpha$, $\tau$, $\gamma$, and the number of iterations
  Performance evaluation
  Real world datasets, parallelization and scaling

6 Concluding Remarks
  Failed attempts and lessons learned
  Future work

A Supplementary proofs
  A.1 Expectation and variance of $S^\top S$ where $S$ is an SRHT
  A.2 Expectation and variance of $S^\top S$ where $S$ is an SEM

B Notes on implementations
  B.1 SJLT
  B.2 SRHT
  B.3 SEM


Introduction (translated from Greek)

This thesis studies algorithms for the fast estimation of leverage scores in datasets. Statistical leverage scores are a powerful tool for data analysis and statistics. They have been used successfully for detecting outliers in datasets [10], [11], [12] and for finding important nodes in graphs [31], while recently they have been applied to randomized numerical linear algebra algorithms [20], [31], [45]. In applications it is often useful to regard the data as points in $\mathbb{R}^d$ and to store them as vectors or as rows of a matrix $A \in \mathbb{R}^{n \times d}$. Leverage scores indicate the influence of each point on the best-fit line of the data. Figure 1 shows a graphical illustration. Points that lie far from the line have high leverage scores.

Definition 1. Let $A \in \mathbb{R}^{n \times d}$ be a matrix with $n > d$, and let $U \in \mathbb{R}^{n \times d}$ be a matrix whose columns form an orthonormal basis for the column space of $A$. The leverage score of the $i$-th row of $A$, $\theta_i$ for $i = 1, \ldots, n$, is defined as

$$\theta_i = \|U_{(i)}\|_2^2, \qquad (1)$$

where $U_{(i)}$ is the $i$-th row of $U$ and $\|\cdot\|_2$ is the Euclidean vector norm. The largest leverage score $\mu = \max_{1 \le i \le n} \theta_i$ is called the coherence of the matrix (matrix coherence).

Figure 1: A line that approximates a set of points with respect to Euclidean distances.²

² The figure is taken from the Wikipedia article "Regression analysis".

The simplest approach for computing leverage scores exactly is to use the QR factorization or the SVD to construct an orthonormal basis for the column space of $A$. Such an approach costs $O(nd^2)$ floating point operations (flops), which may be prohibitive when there are very many points of high dimension. Algorithms for the fast estimation of leverage scores have been proposed in the recent literature [30], [20], [22], [13]. More specifically, these algorithms return approximations $\tilde{\theta}_i$ of the leverage scores $\theta_i$ that satisfy, with high probability, the inequality

$$|\theta_i - \tilde{\theta}_i| \le \epsilon\,\theta_i \qquad (2)$$

at a computational cost of $o(nd^2)$.
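The exact QR-based computation described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a dense matrix with full column rank; the function name `exact_leverage_scores`, the random test matrix and the artificially scaled first row are hypothetical and are not taken from the thesis.

```python
import numpy as np

def exact_leverage_scores(A):
    """Leverage scores of the rows of a tall matrix A with full column rank."""
    # Thin QR: the columns of Q form an orthonormal basis for the column space of A.
    Q, _ = np.linalg.qr(A, mode="reduced")
    # The i-th leverage score is the squared Euclidean norm of the i-th row of Q.
    return np.sum(Q * Q, axis=1)

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
A[0] *= 50.0                      # hypothetical outlier: one row far from the rest

theta = exact_leverage_scores(A)
coherence = theta.max()           # largest leverage score
# Rows whose score exceeds twice the mean score d/n (illustrative threshold).
flagged = np.flatnonzero(theta > 2 * d / n)
```

The QR factorization is the step that costs $O(nd^2)$ operations; the approximate algorithms discussed in the thesis aim to avoid exactly this step.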

Applications

As already discussed, leverage scores have been studied extensively in linear regression statistics for finding outliers in datasets [10], [11]. Data points with high leverage on the best-fit line may be either illegitimate or of great importance. A rough rule states that such points can be identified if their leverage score is at least 2 or 3 times the mean leverage score of the whole dataset, e.g. if $\theta_i > 2d/n$ [43], [12]. More recent literature [8], [41] argues that leverage scores indicate whether a linear model is appropriate for a dataset. For example, non-uniformly distributed leverage scores may indicate that such an approach is not appropriate.

In graph analysis, let $G(V, E)$ be a graph where $V$ is the set of $d$ nodes and $E$ is the set of $n$ edges, each edge having a weight $w$. The edge incidence matrix $B$ of $G$ is defined as the $n \times d$ matrix in which each row represents an edge of $E$ and has exactly 2 non-zero entries, corresponding to the nodes connected by that edge. Defining $W$ as the diagonal $n \times n$ matrix whose entries hold the weights of the corresponding edges, the Laplacian matrix of $G$ is defined as $L = B^\top W B$. The degree of a node is the number of edges connected to it. Important nodes tend to have a high degree, while important edges are those which connect large clusters of nodes. A useful notion that can reveal important edges is that of effective resistances, which are the diagonal entries of the matrix $R = B L^{\dagger} B^\top$. It is easy to show that effective resistances are proportional to the leverage scores of the matrix $W^{1/2} B$ [35].

Leverage scores are also important in randomized numerical linear algebra (RNLA). A strategy often used in this area is to sample rows/columns according to some importance distribution. Using leverage scores can improve the performance of such algorithms compared to uniform sampling [31], [45], [15]. It has in fact been shown that, for the low-rank matrix approximation problem, the sampling can be done deterministically using leverage scores [37].

Dimensionality reduction

The algorithms that achieve inequality (2) use randomized dimensionality reduction techniques. More specifically, the literature contains extensive studies of methods in which the data matrix is multiplied by a matrix with random entries, reducing its dimensionality while approximately preserving specific characteristics such as Euclidean distances, vector lengths and singular values. We mention two basic notions. The first is Johnson-Lindenstrauss transforms (JLT) [26]. These are random matrices that map sets of vectors from $\mathbb{R}^d$ to $\mathbb{R}^r$, where $r < d$, preserving their pairwise inner products. More specifically:

Definition 2. A random matrix $\Pi$ of size $r \times n$ is a Johnson-Lindenstrauss transform with parameters $\epsilon, \delta, f$, or JLT($\epsilon, \delta, f$), if with probability at least $1 - \delta$, for every $f$-element subset $V$ of $\mathbb{R}^n$, the inequality

$$|\langle \Pi v, \Pi w \rangle - \langle v, w \rangle| \le \epsilon \|v\|_2 \|w\|_2$$

holds for all $v, w \in V$.

Setting $w = v$ we conclude that these transforms also approximately preserve the lengths of vectors. The second notion is subspace embedding matrices (SE) [40]. The difference from JLTs is that SEs preserve Euclidean distances over an entire subspace, rather than over a finite set of vectors. This subspace is described by the column space of a matrix $A$.

Definition 3. Given a matrix $A$ of size $n \times d$, an $\epsilon$-SE for the column space of $A$ is a matrix $S$ such that for every $x \in \mathbb{R}^d$

$$(1 - \epsilon) \|Ax\|_2^2 \le \|SAx\|_2^2 \le (1 + \epsilon) \|Ax\|_2^2.$$

Algorithms

Returning to the computation of leverage scores, the most efficient algorithms proposed so far apply JLTs and/or SEs to reduce the dimensionality of the data matrix and then operate on the result, which is a smaller matrix. This reduces the complexity while, with high probability, acceptable approximations of the true solution are returned. In earlier work, Holodnak et al. give results on the perturbation of leverage scores, using the QR factorization as the computational kernel [23]. For large matrices such an approach can be very time-consuming. In [30]³ Malik Magdon-Ismail describes an algorithm for the approximate computation of leverage scores that achieves inequality (2). Subsequently, in [20] Drineas et al. achieve an even lower theoretical complexity by exploiting properties of the Moore-Penrose generalized inverse. Clarkson et al. [13] propose a new dimensionality reduction matrix which is extremely sparse and can therefore be multiplied with the data matrix in very little time. The drawback of this matrix is that the size of the matrix resulting from the multiplication is proportional to the square of the dimension $d$ of $A$ ([33]⁴), and it is therefore more useful for extremely "tall and thin" matrices where $n \gg d^2$. An extensive analysis of such algorithms is given in [22], where Gittens et al. propose an algorithm for the fast approximate computation of the leverage scores of the best rank-$k$ approximation of the matrix. Recently Drineas et al. gave bounds for similar low-rank approximations based on the theory of Krylov subspaces [19]⁵. We briefly state two algorithms proposed in [20], [22].

³ This work has not been published to date.
⁴ Clarkson et al. originally proved that the reduced dimension is of order $O(d^4)$; this was later improved to $\Omega(d^2)$ by Nelson et al.
⁵ This work has not been published to date.

Algorithm 1 [20]
1: Compute the matrix $B = SA$, where $S$ is an SE.
2: Compute the SVD $B = U \Sigma V^\top$.
3: Compute the matrix $C = V \Sigma^{-1} \Pi$, where $\Pi$ is a JLT for $n$ vectors.
4: Return the Euclidean norms of the rows of $AC$.
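The four steps of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [20]: dense Gaussian matrices stand in for both the SE $S$ and the JLT $\Pi$, the sketch sizes `r1` and `r2` are arbitrary illustrative choices, and the function name is hypothetical. The sketch returns squared row norms, since by Definition 1 the leverage scores are squared norms.

```python
import numpy as np

def approx_leverage_scores(A, r1, r2, rng):
    """Sketch of Algorithm 1 with Gaussian sketching matrices (an assumption)."""
    n, d = A.shape
    # Step 1: B = S A, with a Gaussian S (r1 x n) standing in for the SE.
    S = rng.standard_normal((r1, n)) / np.sqrt(r1)
    B = S @ A
    # Step 2: compact SVD of B; only Sigma and V are needed.
    _, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    # Step 3: C = V Sigma^{-1} Pi, with a Gaussian JLT Pi (d x r2).
    Pi = rng.standard_normal((d, r2)) / np.sqrt(r2)
    C = (Vt.T / sigma) @ Pi
    # Step 4: squared row norms of A C estimate the leverage scores.
    AC = A @ C
    return np.sum(AC * AC, axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 20))

# Exact scores via thin QR, for comparison only.
Q, _ = np.linalg.qr(A)
theta = np.sum(Q * Q, axis=1)

# r2 > d here, so the JLT brings no saving; it is kept only to mirror step 3.
theta_hat = approx_leverage_scores(A, r1=400, r2=100, rng=rng)
ratio = theta_hat / theta
```

With a dense Gaussian $S$, forming $SA$ is itself as expensive as a QR factorization, so this sketch illustrates only the structure of the algorithm; the speedup depends on sketches that can be applied quickly.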

Algorithm 2 [22]
1: Compute the matrix $B = A\Pi$, where $\Pi$ is an SE for the best rank-$k$ approximation of $A$.
2: Compute $C = (AA^\top)^q B$, where $q \ge 0$ is an integer.
3: Return the Euclidean norms of the rows of $C$.

The complexity of both algorithms is $o(nd^2)$. Nevertheless, they have certain drawbacks. The main drawback of Algorithm 1 is that the size of $B$ is $O(d \log(d/\delta)/\epsilon^2) \times d$, which in practice can be larger than $A$. As for Algorithm 2, its main weakness is that it returns approximate leverage scores with respect to the best rank-$k$ approximation of $A$, instead of the leverage scores of $A$.

Contribution

This thesis studies the advantages and limitations of the algorithms mentioned above. Combining ideas from the existing literature, we propose an algorithm with the following characteristics:
1. It returns good approximations of the leverage scores of tall and thin matrices of full rank.
2. It performs well in large-scale computations in parallel and distributed environments.
3. It exploits the sparsity of the data matrix.
4. Implementations of the algorithm and experimental results on synthetic and real data are presented, showing that the algorithm works efficiently in practice.

In our approach the key observation is that the leverage scores lie on the diagonal of the so-called hat matrix. Recall that this is the orthogonal projection matrix $H = AA^{\dagger}$, or $H = A(A^\top A)^{-1}A^\top$ if $A$ has full rank. Using the second expression involves solving the matrix equation

$$A^\top A X = A^\top.$$

We observe that this amounts to $n$ linear systems of size $d \times d$, where $d \ll n$. On this basis we propose to use an iterative method to solve these systems, with the ultimate goal of computing the leverage scores. Specifically, we propose the use of a Block Conjugate Gradients method (BCG). To make such an approach practical we adopt preconditioning techniques from the existing literature [4], [32] and use a JLT to reduce the number of right-hand sides from $n$ to $O(\ln(n)/\epsilon^2)$, where $\epsilon$ is a multiplicative error factor in the final result. We show that dimensionality reduction and iterative methods can be combined efficiently, and as a consequence we prove theoretical results concerning both the convergence rate of the iterative methods involved and the accuracy of the approximation of the leverage scores. The thesis studies several variations of the above tools. We briefly describe the proposed approach in Algorithm 3.

Algorithm 3 Proposed approach
1: Compute the matrix $B = A^\top \Pi$, where $\Pi$ is a JLT.
2: Compute the SVD $GA = U \Sigma V^\top$, where $G$ is a random matrix whose entries follow the normal distribution.
3: Solve $(AN)^\top (AN) Y = N^\top B$ using an iterative method, where $N = V \Sigma^{-1}$ is a preconditioner for $A$.
4: Return the Euclidean norms of the rows of $AX$, where $X = NY$.

In brief, our results are the following.

1. Algorithm 3 provably returns approximations $\tilde{\theta}_i$ of the leverage scores $\theta_i$ which satisfy, with high probability, the inequality

$$|\theta_i - \tilde{\theta}_i| \le \epsilon\,\theta_i + f(\epsilon, \tau, d),$$

where $f(\epsilon, \tau, d)$ is a function of the parameters $\epsilon$ (the desired accuracy of the result), $\tau$ (the stopping criterion of the iterative method) and $d$ (the small dimension of $A$). As we show both in the analysis and in the experiments of the thesis, this additive term is negligible in practice.

2. When the preconditioner is constructed as described in steps 2 and 3 of Algorithm 3, it is proved that for $\gamma > 1$ and for every $\alpha \in (0, 1 - 1/\gamma)$, with probability at least $1 - 2e^{-\alpha^2 \gamma d/2}$, at most

$$k \le \frac{\log \tau}{\log(\alpha + 1/\gamma)}$$

iterations are needed for BCG to converge to a solution with relative residual less than or equal to $\tau$.

3. The computational complexity of the algorithm is approximately $O(nd^2)$; however, the most expensive computations can be done in parallel and exploit the sparsity of $A$, which makes the algorithm practical and effective in parallel and distributed computing environments.

Indicative results

Figure 2 presents indicative results from the experiments. We note that for these datasets Algorithm 1 is not practical to use, because to achieve accuracy comparable to that of the proposed algorithm the matrix obtained after the dimensionality reduction is larger than the original matrix $A$.

[Figure 2 plot: panels mesh_deform and rail4284; x-axis: number of MPI processes; y-axis: time (seconds).]

Figure 2: Execution times for the datasets mesh_deform and rail4284 using 1, 2, 4 and 8 MPI processes.

Structure of the thesis

The thesis consists of six chapters.

Chapter 2 reviews theoretical results from the current literature on dimensionality reduction with random matrices.

Chapter 3 studies iterative methods for solving systems with multiple right-hand sides, together with techniques for constructing preconditioners. Variants of the Conjugate Gradients method for linear systems with multiple right-hand sides are examined.

Chapter 4 reviews previous approaches to the problem and compares them with the proposed approach. The first algorithm analyzed is based on a framework for estimating the diagonal of a matrix using randomization. The remaining algorithms are based on dimensionality reduction techniques. The advantages and disadvantages of these algorithms are presented, and a theoretical analysis of the proposed approach is given, covering both the accuracy of the results and the computational complexity.

Chapter 5 presents experimental results on synthetic and real datasets. First, matrices with special structure and properties are chosen in order to confirm the theoretical results in practice. Different values of the algorithm parameters are also tested in order to improve performance. Finally, results are presented on real datasets, from experiments run on a distributed computing system.

Chapter 6 reviews the thesis and discusses topics for future research.


Chapter 1

Introduction

In this thesis we consider algorithms for the fast estimation of leverage scores. Statistical leverage scores are a powerful tool for data analysis and statistics. They have been used successfully for outlier detection in datasets [10], [11], [12] and for locating important nodes in graphs [31], and more recently they have been applied to numerical linear algebra algorithms [20], [31], [45]. In applications, data points are commonly stored as rows of a matrix, and leverage scores gauge the influence of each point on the best-fit line of the dataset. See the visualization in Figure 1.1. Points which are far from the best-fit line have large leverage scores.

Definition 1. Let $A \in \mathbb{R}^{n \times d}$, where $n > d$. Let $U \in \mathbb{R}^{n \times d}$ be a matrix whose columns are an orthonormal basis for the column space of $A$. The leverage score of the $i$-th row (data point) of $A$, say $\theta_i$ for $i = 1, \ldots, n$, is defined as

$$\theta_i \overset{\mathrm{def}}{=} \|U_{(i)}\|_2^2, \qquad (1.1)$$

where $U_{(i)}$ denotes the $i$-th row of $U$ and $\|\cdot\|_2$ denotes the vector 2-norm. The largest leverage score $\mu \overset{\mathrm{def}}{=} \max_{1 \le i \le n} \theta_i$ is called the matrix coherence.
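Two standard consequences of Definition 1 may be worth spelling out. The short derivation below assumes, as in the definition, that $U$ has exactly $d$ orthonormal columns; the $d \times d$ orthogonal matrix $Q$ is introduced here only for illustration.

```latex
% (a) The scores do not depend on the choice of orthonormal basis:
% any other orthonormal basis has the form U' = U Q with Q orthogonal, and
\|U'_{(i)}\|_2^2 = \|U_{(i)} Q\|_2^2
                 = U_{(i)} Q Q^\top U_{(i)}^\top
                 = \|U_{(i)}\|_2^2 .

% (b) The scores sum to d, so the mean leverage score is d/n:
\sum_{i=1}^{n} \theta_i = \|U\|_F^2
                        = \operatorname{trace}(U^\top U)
                        = \operatorname{trace}(I_d) = d,
\qquad 0 \le \theta_i \le 1 .
```

Part (b) is why thresholds for "large" leverage are naturally expressed as multiples of $d/n$.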

Figure 1.1: A line that approximates a set of points with respect to Euclidean distances.²

² Figure from the Wikipedia article "Regression analysis".

A straightforward approach for the exact computation of leverage scores is to compute an orthonormal basis for the column space of $A$ via the SVD or the QR decomposition. Such an approach costs $O(nd^2)$ floating point operations, which might be prohibitively expensive when the dataset consists of a very large number of high-dimensional points. Algorithms have been proposed in the recent literature which return approximations $\tilde{\theta}_i$ of the leverage scores $\theta_i$ that satisfy, with high probability, the inequality

$$|\theta_i - \tilde{\theta}_i| \le \epsilon\,\theta_i \qquad (1.2)$$

with computational complexity $o(nd^2)$.

1.1 Applications

As already discussed, leverage scores have been broadly used for outlier detection in linear regression [10], [11]. Data points with high leverage on the best-fit line might either be illegitimate or of high actual importance. A rule of thumb states that such data points can be identified if their leverage scores are higher than 2 or 3 times the mean leverage of the whole set, e.g. if $\theta_i > 2d/n$; cf. [43], [12]. In the more recent literature [8], [41] it is argued that leverage scores reveal whether a linear model is appropriate for a dataset; e.g. non-uniform leverage scores suggest that it might not be.

In graph analytics, consider a graph $G(V, E)$ where $V$ is the set of $d$ nodes and $E$ is the set of $n$ edges, where each edge is associated with some weight $w$. The edge incidence matrix $B$ of $G$ is defined as an $n \times d$ matrix where each row represents an edge in $E$ and has exactly 2 non-zero values, at the columns which correspond to the nodes connected by that edge. Taking $W$ to be the $n \times n$ diagonal weight matrix, the Laplacian of $G$ is defined as $L = B^\top W B$. The degree of a node is the number of edges connected to it. Important nodes tend to have a high degree, while important edges are usually the ones which connect large communities or clusters. A useful concept that can reveal important edges is that of the so-called effective resistances, which are the diagonal entries of $R = B L^{\dagger} B^\top$. It is easy to see that effective resistances are proportional to the leverage scores of the matrix $W^{1/2} B$; cf. [35].

Leverage scores are also of importance in randomized numerical linear algebra (RNLA). In that area, a common strategy is to sample rows and/or columns based on some type of "importance distribution". The use of leverage scores in sampling can lead to improved performance of RNLA algorithms compared to uniform sampling [31], [45], [15]. It has also been shown that deterministic sampling using leverage scores can be used efficiently for low-rank approximations; cf. [37].

Dimensionality reduction

Algorithms which achieve inequality (1.2) use randomized dimensionality reduction techniques.
More specifically, various methods have been examined in the literature which multiply the data matrix by a matrix with random elements in order to reduce its dimensionality, while certain properties are approximately preserved, including dot products, vector norms and singular values. We refer to two basic concepts. The first is Johnson-Lindenstrauss transforms (JLT) [26]. JLTs are random matrices which transform a set of vectors from $\mathbb{R}^d$ to $\mathbb{R}^r$, where $r < d$, preserving the pairwise dot products up to a multiplicative error term. More specifically:

Definition 2. A random matrix $\Pi$ of size $r \times d$ forms a Johnson-Lindenstrauss transform with parameters $\epsilon, \delta, f$, or JLT($\epsilon, \delta, f$), if with probability at least $1 - \delta$, for any $f$-element set $V \subset \mathbb{R}^d$ and for all $v, w \in V$, it holds that

$$|\langle \Pi v, \Pi w \rangle - \langle v, w \rangle| \le \epsilon \|v\|_2 \|w\|_2.$$

Taking $w = v$ it follows that JLTs also approximately preserve vector 2-norms. The second concept is subspace embedding matrices (SE) [40]. The difference between SEs and JLTs is that SEs preserve Euclidean distances over an entire subspace rather than over a finite set of vectors. This subspace is described by the column space of a matrix $A$.

Definition 3. Given a matrix $A$ of size $n \times d$, an $\epsilon$-SE for the column space of $A$ is a matrix $S$ such that for all $x \in \mathbb{R}^d$

$$\|SAx\|_2^2 = (1 \pm \epsilon) \|Ax\|_2^2.$$

1.2 Algorithms

Returning to the computation of leverage scores, the current state-of-the-art algorithms use JLTs and/or SEs in order to reduce the dimensionality of the data matrix, and then perform computations on the resulting, smaller matrix.

This way the computational complexity is reduced while the values returned are, with high probability, good approximations of the true leverage scores. In previous work, Holodnak et al. study the conditioning of leverage scores and give perturbation results by computing an orthonormal basis for the column space using the QR decomposition [23]. For large matrices, however, this becomes a very expensive task. In [30]³ Malik Magdon-Ismail describes an algorithm for the approximate computation of statistical leverage scores which achieves a computational complexity of $o(nd^2)$. The pioneering idea is to multiply the data matrix $A$ by a matrix with random entries (often called a "sketch") in order to reduce its dimension and ultimately decrease the computational overhead, at the expense of a multiplicative error term in the final result. In [20] Drineas et al. improve the complexity by exploiting properties of the Moore-Penrose generalized inverse, using similar dimensionality reduction techniques. In [13] Clarkson et al. present a new embedding which is extremely sparse; it requires only 1 non-zero element per column. This matrix is very fast to multiply with sparse datasets in comparison to the one in [20]. A drawback is that the dimension of the reduced matrix is $\Omega(d^2)$ (see [33])⁴, and it is therefore more useful on "extremely tall and thin" sparse matrices and in streaming environments. An extensive analysis of such algorithms is given in [22], where Gittens et al. propose an algorithm which quickly computes approximations to the leverage scores of the best rank-$k$ spectral approximation of $A$. Very recently Drineas et al. gave bounds for similar low-rank matrix approximations from Krylov subspaces; cf. [19]⁵. We briefly state two algorithms from [20] and [22].

³ This work is not published to date.
⁴ The original reduced dimension by Clarkson et al. was $O(d^4)$ and was later improved to $\Omega(d^2)$ by Nelson et al.
⁵ This work is not published to date.

32 32 CHAPTER 1. INTRODUCTION Algorithm 4 [20] 1: Compute B = SA where S is a SE for the column space of A. 2: Compute the compact SVD of B = UΣV. 3: Use a JLT Π for n 2 vectors and compute C = V Σ 1 Π. 4: Return the row norms of AC. Algorithm 5 [22] 1: Compute B = AΠ where Π is a SE for the best rank-k approximation of A. 2: Compute C = (AA ) q B, where q 0 is an integer. 3: Return the row norms of C. 1.3 Contribution In this thesis we study the advantages and limitations of the algorithms mentioned earlier. Combining ideas from the current state-of-the-art, our contribution is an algorithm which has the following properties: 1. Successfully returns good approximations of leverage scores and the coherence of a full rank tall and thin matrix. 2. Scales well in parallel/distributed environments. 3. Effectively utilizes sparsity. 4. Performs very well in modern computational environments on synthetic and real world problems In our approach the key observation is that leverage scores can be found in the diagonal of the so called hat matrix. Recall that it is the orthogonal projection matrix H = AA or H = A(A A) 1 A if A is full rank. The second equation involves

the solution of the matrix equation $A^\top A X = A^\top$. Note that it consists of $n$ linear systems of size $d \times d$. To solve such systems, and ultimately compute the leverage scores, we propose to use an iterative method. Specifically, we propose the use of a Block Conjugate Gradients algorithm (BCG). For this approach to be practical we adopt preconditioning techniques from recent literature [4], [32] and use JLTs to reduce the number of right hand sides from $n$ to $O(\ln(n)/\epsilon^2)$, where $\epsilon$ is a small multiplicative error in the final result. We show that dimensionality reduction and iterative methods can be effectively combined and, as a consequence, we prove theoretical results concerning the convergence rate of BCG and bounds on the estimates returned. In this thesis we study several variations of the aforementioned tools. We briefly describe our approach in Algorithm 6.

Algorithm 6 Levis
1: Compute $B = A^\top \Pi^\top$, where $\Pi$ is a JLT.
2: Compute the SVD of $GA = U \Sigma V^\top$, where the elements of $G$ are drawn from $N(0, 1)$.
3: Solve $(AN)^\top (AN) Y = N^\top B$ using an iterative method, where $N = V \Sigma^{-1}$ is a preconditioner for $A$.
4: Return the row norms of $AX$, where $X = NY$.

In brief, our results are described as follows.

1. We propose an algorithm which provably returns values $\tilde\theta_i$ such that the following inequality is satisfied with high probability:
$$\theta_i \le \tilde\theta_i \le (1 + \epsilon)\theta_i + f(\epsilon, \tau, d),$$

where $f(\epsilon, \tau, d)$ is a function of the input parameters $\epsilon$ (the estimation accuracy), $\tau$ (the convergence tolerance of the iterative method) and $d$ (the small dimension of $A$). This additive error term is negligible, as we will show in the chapters to follow.

2. When the preconditioner is constructed as described in steps 2 and 3 of Algorithm 6, it can be proved that, with probability at least $1 - 2e^{-\alpha^2 \gamma d / 2}$, BCG will require at most
$$k \le \frac{\log \tau}{\log\left(\alpha + \sqrt{d/m}\right)}$$
iterations to converge to a solution with relative residual less than or equal to $\tau$.

3. The computational complexity of the algorithm is approximately $O(nd^2)$, but the heavier computations can be executed in parallel and can utilize the sparsity of $A$. This renders the algorithm practical in parallel/distributed environments, where it outperforms the current state of the art in many cases.

1.4 Some indicative results

In Figure 1.2 we present results from our experiments. The graph shows the runtimes of a Python implementation of Algorithm 6 in an MPI environment. We note that Algorithm 4 is not practical for either of these datasets because, in order to achieve accuracy similar to that of Algorithm 6, the matrix resulting from the dimensionality reduction would actually be larger than $A$ itself.
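The four steps of Algorithm 6 can be sketched in a few lines of NumPy. This is only an illustrative single-process sketch, not the MPI implementation used in the experiments. The function name `levis_sketch` and the parameters `r` (rows of the JLT) and `m` (rows of G) are assumptions made here. A dense direct solve stands in for the preconditioned Block CG iteration of step 3, and squared row norms are returned, since the leverage scores are the diagonal entries of the hat matrix.

```python
import numpy as np


def levis_sketch(A, r, m, rng):
    """Approximate the leverage scores of a tall, full-column-rank A (n x d).

    r: rows of the JLT Pi; m: rows of the Gaussian matrix G (needs m >= d).
    """
    n = A.shape[0]
    # Step 1: B = A^T Pi^T, with Pi an r x n Gaussian JLT scaled by 1/sqrt(r).
    Pi = rng.standard_normal((r, n)) / np.sqrt(r)
    B = A.T @ Pi.T                                   # d x r right-hand sides
    # Step 2: SVD of G A, where G has i.i.d. N(0, 1) entries.
    G = rng.standard_normal((m, n))
    _, sigma, Vt = np.linalg.svd(G @ A, full_matrices=False)
    N = Vt.T / sigma                                 # N = V Sigma^{-1}
    # Step 3: solve (AN)^T (AN) Y = N^T B.
    # Assumption: a direct solve replaces the Block CG iteration.
    AN = A @ N
    Y = np.linalg.solve(AN.T @ AN, N.T @ B)
    # Step 4: squared row norms of A X, with X = N Y.
    AX = AN @ Y
    return np.sum(AX * AX, axis=1)
```

On small inputs the output can be compared with the squared row norms of the Q factor of a thin QR decomposition of A, which are the exact leverage scores.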

Figure 1.2: Total runtime (time in seconds against the number of MPI processes) for the mesh_deform and rail4284 datasets using 1, 2, 4 and 8 MPI processes.

1.5 Outline

The rest of this thesis is structured as follows. In Chapter 2 we review all the tools that are used by our algorithm and by the current state of the art. We review theoretical properties of randomized embeddings, variations of the CG and Block CG algorithms and the respective convergence results. We also review various preconditioning techniques. In Chapter 3 we review previous work in detail and compare it theoretically to our approach. The first algorithm is based on a framework for estimating the diagonal entries of a matrix. The rest of the algorithms delve deeper into the randomized embeddings literature and exploit existing results in order to improve the computational complexity. We point out the advantages and drawbacks of each algorithm. We also give a theoretical analysis of our approach, estimation bounds for the values returned

and the total computational complexity. In Chapter 4 we present experimental results on synthetic and real world datasets. First, we choose input matrices with special structure and properties in order to test our theoretical results in practice and to tune various input parameters so as to choose appropriate default values. Second, we run experiments on synthetic datasets to provide evidence that our algorithm performs in practice as intended. Finally, we present our results on sparse real world datasets in a distributed environment. In Chapter 5 we give our concluding remarks and discuss future work.

1.6 Notation

In the chapters to follow we denote by $\theta_i$ and $\mu$ the leverage scores and the coherence, respectively. Capital letters denote matrices, small letters denote vectors and small Greek letters denote constants. The inner product of two vectors $v, u$ is denoted by $\langle v, u \rangle$. For a vector $v$, $\|v\|_p$ denotes the vector $p$-norm ($\|v\|$ is equivalent to $\|v\|_2$). For a matrix $A$, $A_{(i)}$ is the $i$-th row as a row vector and $A^{(i)}$, $a_i$ is the $i$-th column as a column vector, $A^\top$ is the transpose of $A$, $A^\dagger$ is the Moore-Penrose generalized inverse, $\mathrm{trace}(A)$ is the sum of the diagonal entries and $\mathrm{range}(A)$ is the column space. Also, $\|A\|_p = \sup\{\|Ax\|_p, \|x\|_p = 1\}$ is the induced $p$-norm and $\kappa_p(A) = \|A\|_p \|A^\dagger\|_p$ is the condition number of $A$ w.r.t. the $p$-norm ($\kappa(A)$ is equivalent to $\kappa_2(A)$). The singular values of $A$ are denoted by $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$, where $r$ is the rank of $A$. For a positive integer $n$, $[n]$ is the set of all positive integers up to $n$, while $\log(n)$ is the natural logarithm of any real positive $n$. By $N(0, 1)$ we denote the standard normal probability distribution and $P[a] \in [0, 1]$ denotes the probability that event $a$ happens. $E[x]$ denotes the expectation of a random variable $x$ and $Var[x]$ is its variance. We denote by $\odot$ the element-wise multiplication between two vectors and by $\oslash$ the element-wise division.

Chapter 2 Dimensionality reduction

Datasets in modern applications tend to increase in size rapidly. In the era of Big Data, dimensionality reduction is a key ingredient in the analysis of large datasets. Sketching is a very powerful tool in modern algorithms for numerical linear algebra. Many state-of-the-art algorithms use randomized embeddings to speed up computations while returning a good approximation of the true solution with high probability. Some applications include regression, low rank approximations and graph sparsification [15], [31], [45]. In this chapter we review a few of these tools and the properties that will later be used in our analysis.

Taking $A \in \mathbb{R}^{n \times d}$ a tall and thin matrix, there exists a linear map $B$ from $\mathbb{R}^n$ to $\mathbb{R}^d$ such that for every $x, y$ in $\mathrm{range}(A)$ the following holds:
$$\|B(x - y)\|_2^2 = \|x - y\|_2^2.$$
This embedding comes immediately from the thin QR decomposition of $A$. Take $A = QR$, where $Q \in \mathbb{R}^{n \times d}$ has orthonormal columns and $R \in \mathbb{R}^{d \times d}$ is upper triangular. The Euclidean distance between any two vectors in $\mathrm{range}(A)$ gives
$$\|Ax - Ay\|_2^2 = \|A(x - y)\|_2^2 = \|QR(x - y)\|_2^2 = \|R(x - y)\|_2^2,$$

the last equality holding because $Q$ has orthonormal columns. It also holds that
$$Q^\top A = Q^\top Q R \;\Rightarrow\; Q^\top A = R.$$
Thus $Q^\top$ is a linear map that preserves pairwise distances in $\mathrm{range}(A)$, since
$$\|A(x - y)\|_2^2 = \|R(x - y)\|_2^2 = \|Q^\top A(x - y)\|_2^2.$$
The intuition behind this equality is that, since the columns of $A$ span a subspace of dimension no more than $d$, there exists a rotation of this subspace such that its orthonormal basis consists of $d$ canonical vectors. $Q^\top$ performs this rotation. This is not practical, however, both because of the cost of computing $Q$ and because $Q$ is not oblivious: it depends on the input and is not useful for matrices other than $A$, unless they share the same column space.

2.1 Johnson-Lindenstrauss transforms

The Johnson-Lindenstrauss lemma [26] states that $n$ vectors in $\mathbb{R}^d$ can be mapped down to $O(\log n)$ dimensions while the inner products are preserved up to some multiplicative error. We give the following definition for a Johnson-Lindenstrauss transform (JLT).

Definition 4. A random matrix $\Pi$ of size $r \times n$ forms a Johnson-Lindenstrauss transform with parameters $\epsilon, \delta, f$, or JLT($\epsilon, \delta, f$), if with probability at least $1 - \delta$, for any $f$-element subset $V$ of $\mathbb{R}^n$, for all $v, w \in V$ it holds that
$$|\langle \Pi v, \Pi w \rangle - \langle v, w \rangle| \le \epsilon \|v\|_2 \|w\|_2.$$

In [25] a JLT is constructed as an $r \times n$ matrix whose elements are independent standard normal random variables, scaled by $1/\sqrt{r}$. More formally

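Definition 5 (GJLT [25]). Let $V$ be a set of $n$ vectors $v_i \in \mathbb{R}^d$, $i \in [n]$. Let $\Pi$ be an $r \times d$ matrix with i.i.d. entries drawn from $N(0, 1)$ and scaled by $1/\sqrt{r}$, with
$$r \ge \frac{4 \log n}{\epsilon^2/2 - \epsilon^3/3}, \qquad \epsilon \in (0, 1/2).$$
Then, with probability at least $1 - 1/n$, for all pairs $v, w \in V$
$$(1 - \epsilon)\|v - w\|_2^2 \le \|\Pi(v - w)\|_2^2 \le (1 + \epsilon)\|v - w\|_2^2.$$

Since then many improvements and refinements have been proposed; cf. [2], [16], [28], [34]. See [45], [7] for a detailed review. Achlioptas in [2] presents two distributions of matrices which form JLTs.

Definition 6 ([2]). Take $V$ a set of $n$ vectors $v_i \in \mathbb{R}^d$, $i \in [n]$, and parameters $\epsilon, \delta \in (0, 1)$. Let $\Pi$ be an $r \times d$ matrix with
$$r \ge \frac{4 \log n + 2 \log(1/\delta)}{\epsilon^2/2 - \epsilon^3/3}.$$
Each element $\pi$ of $\Pi$ is drawn independently from one of the following two probability distributions and rescaled by $1/\sqrt{r}$:
$$\pi = \begin{cases} +1 & \text{w.p. } 1/2 \\ -1 & \text{w.p. } 1/2 \end{cases} \quad \text{(RJLT)} \qquad\qquad \pi = \sqrt{3} \times \begin{cases} +1 & \text{w.p. } 1/6 \\ -1 & \text{w.p. } 1/6 \\ \phantom{+}0 & \text{w.p. } 2/3 \end{cases} \quad \text{(SJLT)}$$
Then, with probability at least $1 - \delta$, for all $i \in [n]$
$$(1 - \epsilon)\|v_i\|_2^2 \le \|\Pi v_i\|_2^2 \le (1 + \epsilon)\|v_i\|_2^2.$$

For the RJLT it is easy to prove that
$$E[\Pi^\top \Pi] = I_d \tag{2.1}$$
and
$$Var[\Pi^\top \Pi] = (d/r - 1) I_d. \tag{2.2}$$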
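The three constructions above can be sketched as follows, assuming NumPy. All function names are illustrative choices made here. `jlt_rows` evaluates the bound on r from Definition 5, and the sparse variant multiplies its entries by $\sqrt{3}$, as in the construction of [2], so that they have unit variance before the $1/\sqrt{r}$ rescaling. `max_distortion` is a hypothetical helper that measures how far the squared pairwise distances move.

```python
import numpy as np


def jlt_rows(n, eps):
    """Smallest integer r with r >= 4 log(n) / (eps^2/2 - eps^3/3)."""
    return int(np.ceil(4.0 * np.log(n) / (eps**2 / 2.0 - eps**3 / 3.0)))


def gaussian_jlt(r, d, rng):
    """GJLT: i.i.d. N(0, 1) entries scaled by 1/sqrt(r)."""
    return rng.standard_normal((r, d)) / np.sqrt(r)


def rademacher_jlt(r, d, rng):
    """RJLT: entries +1 or -1 with probability 1/2 each, scaled by 1/sqrt(r)."""
    return rng.choice([-1.0, 1.0], size=(r, d)) / np.sqrt(r)


def sparse_jlt(r, d, rng):
    """SJLT: sqrt(3) times (+1, 0, -1 w.p. 1/6, 2/3, 1/6), scaled by 1/sqrt(r)."""
    signs = rng.choice([1.0, 0.0, -1.0], size=(r, d), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / r) * signs


def max_distortion(Pi, V):
    """Largest relative change of a squared pairwise distance among the rows of V."""
    worst = 0.0
    for i in range(len(V)):
        for j in range(i + 1, len(V)):
            diff = V[i] - V[j]
            ratio = np.sum((Pi @ diff) ** 2) / np.sum(diff**2)
            worst = max(worst, abs(ratio - 1.0))
    return worst
```

Calling `max_distortion(gaussian_jlt(jlt_rows(len(V), eps), V.shape[1], rng), V)` for a point set `V` stored row-wise gives an empirical counterpart of the $\epsilon$ in Definition 5.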

2.2 Subspace embeddings

JLTs preserve the pairwise dot products of a set with a finite number of elements. Subspace embeddings, on the other hand, preserve pairwise distances between vectors from a whole subspace of $\mathbb{R}^n$.

Definition 7 (Subspace embedding). Given a matrix $A$ of size $n \times d$, an $\epsilon$-subspace embedding for the column space of $A$ is a matrix $S$ such that for all $x \in \mathbb{R}^d$
$$\|SAx\|_2^2 = (1 \pm \epsilon)\|Ax\|_2^2.$$

One can assume without loss of generality that $A$ has orthonormal columns. Calling such a matrix $U$, the definition above simplifies to
$$\|I_d - U^\top S^\top S U\|_2 \le \epsilon, \tag{2.3}$$
where $I_d$ is the $d \times d$ identity matrix and $U$ is a matrix with orthonormal columns (i.e. an orthonormal basis for $\mathrm{range}(A)$). For more details see Chapter 2 in [45]. Of much interest are the Oblivious Subspace Embeddings (OSE), which are random matrices drawn from a distribution such that with high probability they form an $\epsilon$-subspace embedding, independently of the input.

Definition 8 (OSE). Given input parameters $n, d, \epsilon, \delta$, let $\Pi$ be a distribution on $r \times n$ matrices $S$, where $r$ is a function of $n, d, \epsilon, \delta$. $\Pi$ is called an $(\epsilon, \delta)$-oblivious subspace embedding if $S$, drawn from $\Pi$, is an $\epsilon$-subspace embedding for any fixed $n \times d$ matrix $A$ with probability at least $1 - \delta$.

Henceforth, for brevity, the term subspace embedding will denote an $(\epsilon, \delta)$-OSE in $\ell_2$. In general $\epsilon$ is desired to be small. Ideally, $S$ should be chosen such that $\|I - U^\top S^\top S U\|$ is minimized. Subspace embeddings can be constructed through JLTs, as first proposed in [40].
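Condition (2.3) is easy to check numerically on small inputs. The sketch below, assuming NumPy, computes $\|I_d - U^\top S^\top S U\|_2$ for a given S and A. The helper name `embedding_distortion` and the dense Gaussian choice of S in the usage lines are assumptions for illustration only.

```python
import numpy as np


def embedding_distortion(S, A):
    """Spectral norm of I_d - U^T S^T S U, with U an orthonormal basis of range(A)."""
    U, _ = np.linalg.qr(A)            # thin QR: U is n x d with orthonormal columns
    SU = S @ U
    d = U.shape[1]
    return np.linalg.norm(np.eye(d) - SU.T @ SU, ord=2)


# Usage: a dense Gaussian sketch with r rows, scaled by 1/sqrt(r).
rng = np.random.default_rng(0)
n, d, r = 2000, 5, 1000
A = rng.standard_normal((n, d))
S = rng.standard_normal((r, n)) / np.sqrt(r)
eps = embedding_distortion(S, A)      # S is an eps-subspace embedding for range(A)
```

The returned value is the smallest $\epsilon$ for which this particular S satisfies (2.3) for this particular A. It says nothing about other inputs, which is where the oblivious guarantee of Definition 8 comes in.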

Many analyses of subspace embeddings also aim to keep $S$ sparse. This is desirable for two reasons: first, because $Sx$ can be computed very fast and, second, because it reduces the memory overhead. It is worth noting, however, that it is possible to apply $Sx$ fast without $S$ being sparse (e.g. the Fast JLT in [3]).

Subsampled Randomized Hadamard Transform

The following theorem defines the Subsampled Randomized Hadamard Transform (see [3], [20], [42], [45]).

Theorem 1. (Subsampled Randomized Hadamard Transform) Let $S = \sqrt{n/r}\, P H_n D$, where $D$ is an $n \times n$ diagonal matrix with i.i.d. diagonal entries $D_{(i,i)} = \pm 1$, each with probability 1/2. $H_n$ is a Walsh-Hadamard matrix of size $n$¹, i.e. $H_n(i, j) = \frac{1}{\sqrt{n}} (-1)^{\langle i-1, j-1 \rangle}$, where $\langle i-1, j-1 \rangle$ is the dot product of the $m$-bit vectors $i-1, j-1$ expressed in binary ($m = \log_2 n$). The $r \times n$ matrix $P$ samples $r$ coordinates of an $n$-dimensional vector uniformly at random, where
$$r \ge \frac{d \log(2d/\delta)}{\epsilon^2}.$$
Then with probability at least $1 - (\delta + n/e^{d})$, for any fixed $n \times d$ matrix $U$ with orthonormal columns,
$$\|I_d - U^\top S^\top S U\|_2 \le \epsilon.$$
Moreover, $Sx$ can be computed in $O(n \log r)$ for any $x \in \mathbb{R}^n$. The expectation and variance of $S^\top S$ are given by
$$E[S^\top S] = I_n, \qquad Var[S^\top S] = \left(\frac{n}{r} - 1\right) I_n. \tag{2.4}$$

¹Note that $n$ has to be equal to 1, 2 or $4k$, where $k \in \mathbb{N}$.
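A small dense construction of the matrix S of Theorem 1 can be sketched as follows, assuming NumPy. It forms $H_n$ explicitly with the Sylvester recursion, so it needs $O(n^2)$ memory and does not achieve the fast $O(n \log r)$ application mentioned above; it is only meant to make the three factors concrete. The function names are illustrative, n must be a power of two here, and sampling without replacement is an assumption, since the theorem only states that P samples uniformly at random.

```python
import numpy as np


def hadamard(n):
    """Normalized Walsh-Hadamard matrix H_n via the Sylvester recursion."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def srht(n, r, rng):
    """Dense r x n matrix S = sqrt(n/r) P H_n D."""
    signs = rng.choice([-1.0, 1.0], size=n)        # diagonal of D
    rows = rng.choice(n, size=r, replace=False)    # coordinates kept by P
    # Selecting rows applies P; multiplying column j by signs[j] applies D.
    return np.sqrt(n / r) * hadamard(n)[rows, :] * signs
```

Because the rows of $H_n D$ are orthonormal and P keeps r distinct rows, this S satisfies $S S^\top = (n/r) I_r$ exactly, which is a convenient sanity check.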

Sparse Embedding Matrix

While $O(n \log r)$ is much faster than the standard $O(nr)$ that is required for a matrix-vector multiplication, in [45] it has been shown that there exist subspace embeddings such that $Sx$ can be computed in $O(n)$. The number of required rows was improved in [34]. The key idea is that the vectors we want to embed are not an arbitrary set of vectors in $\mathbb{R}^n$ but rather a specific set of vectors coming from the column space of $A$.

Theorem 2. (Sparse Embedding Matrix) Let $h : [n] \to [r]$ be a random map such that, for all $i \in [n]$, $h(i) = k$ for $k \in [r]$ distributed uniformly. Let $S = \Phi D$, where $\Phi \in \{0, 1\}^{r \times n}$ is a binary matrix with $\Phi_{h(i), i} = 1$ for all $i \in [n]$ and all other entries equal to 0. $D$ is an $n \times n$ diagonal matrix whose diagonal elements are $+1$ or $-1$ with probability 1/2, chosen independently. If
$$r \ge \frac{d^2 + d}{\delta (2\epsilon - \epsilon^2)^2}$$
then for any fixed $n \times d$ matrix $A$, $S$ is a subspace embedding for the column space of $A$ with probability at least $1 - \delta$, while $SA$ can be computed in $O(nnz(A))$ time. The expectation and variance of $S^\top S$ are given by
$$E[S^\top S] = I_n + \frac{1}{r}(ee^\top - I_n), \qquad Var[S^\top S] = \frac{rn + r - 1}{r^2} I_n \approx \frac{n}{r} I_n. \tag{2.5}$$

In Table 2.1 we sum up the properties of the SRHT and the Sparse Embedding Matrix.

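Table 2.1: Subspace embeddings for a fixed matrix $A \in \mathbb{R}^{n \times d}$ with $n \gg d$ and input parameters $\epsilon, \delta$.

type | $\min\{r\}$ | $E[S^\top S]$ | $C = Var[S^\top S]$ | $SA$ time
SRHT | $\Omega\left((\log d)(\sqrt{d} + \sqrt{\log n})^2 / \epsilon^2\right)$ | $I_n$ | $\left(\frac{n}{r} - 1\right) I_n$ | $O(nd \log n)$
SEM | $O(d^2/(\delta\epsilon^2))$ | $I_n + \frac{1}{r}(ee^\top - I_n)$ | $\frac{rn + r - 1}{r^2} I_n$ | $O(nnz(A))$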
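The product SA for the sparse embedding matrix of Theorem 2 can be formed without ever building S. The sketch below assumes NumPy and a dense A for simplicity; the function name `sparse_embed` is an assumption made here. Each row of A is multiplied by its random sign and added to the row of the result selected by the hash h, so every entry of A is touched once. With A stored in a sparse format the same accumulation would cost $O(nnz(A))$.

```python
import numpy as np


def sparse_embed(A, r, rng):
    """Return (S A, h, signs) for S = Phi D, without forming S explicitly."""
    n = A.shape[0]
    h = rng.integers(0, r, size=n)              # hash h: [n] -> [r], uniform
    signs = rng.choice([-1.0, 1.0], size=n)     # diagonal of D
    SA = np.zeros((r, A.shape[1]))
    # Unbuffered accumulation: row h(i) of SA receives signs[i] * A_(i),
    # also when several rows of A hash to the same bucket.
    np.add.at(SA, h, signs[:, None] * A)
    return SA, h, signs
```

The hash and the signs are returned so that the explicit matrix S, which has exactly one non-zero per column, can be rebuilt for testing.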


Chapter 3 Least squares problems with multiple right hand sides

Assuming a tall and skinny matrix $A \in \mathbb{R}^{n \times d}$ of full column rank and a vector $b \in \mathbb{R}^n$, let $x$ denote the unconstrained least squares solution
$$x = \arg\min_x \|Ax - b\|_2. \tag{3.1}$$
For the case of least squares problems with multiple right hand sides we are interested in solving
$$X = \{[x_1, x_2, \dots, x_r] \mid x_i = \arg\min_{x_i} \|b_i - A x_i\|_2,\ i \in [r]\}. \tag{3.2}$$
The solution is given by $X = A^\dagger B$ or, equivalently, $X = (A^\top A)^{-1} A^\top B$, since $A$ is not rank deficient.

3.1 Rank deficiency and leverage scores computation

At this point we want to make a few comments concerning rank deficiency of the input data. We point out that in [20] the authors also assume that the input is a full column rank matrix. They note, however, that in theory there exists a straightforward approach to handle rank deficiency, but they leave as future work the examination of numerical rank deficiency, which is a common phenomenon in real world applications. Our approach is not designed to handle numerical rank deficiency, and we provide experimental results to illustrate this. In [23] the authors give bounds for the relative accuracy of individual leverage scores computed using the QR decomposition. They point out that there exist applications where, in practice, a truncated SVD should be used to handle numerical rank deficiency. Our algorithm is not designed to be used on ill-conditioned datasets, but it is possible to detect such cases and terminate, so that another algorithm can be used instead.

3.2 Solving each system independently

A naive approach is to use Conjugate Gradients to solve each system separately, i.e. for all $i \in [r]$ solve $A^\top A x_i = (A_{(i)})^\top$. We give an algorithmic description in Algorithms 7 and 8.

Complexity

In each iteration the dominant complexity factor is a matrix-vector multiplication with $A$. This takes $O(nd)$ computations. A well known convergence result for the CG method states that the following inequality holds for the error term $e_k = x - x_k$ after iteration $k$:
$$\|x_k - x\|_M \le 2 \left( \frac{\sqrt{\kappa(M)} - 1}{\sqrt{\kappa(M)} + 1} \right)^k \|x_0 - x\|_M. \tag{3.3}$$
Given a tolerance $\tau$ such that $\|x_k - x\|_M \le \tau \|x_0 - x\|_M$, then

$$k \ge \frac{1}{2} \sqrt{\kappa(M)} \log\frac{1}{\tau} \tag{3.4}$$
iterations suffice in exact arithmetic. Replacing $M$ with $A^\top A$ we get
$$k \ge \frac{1}{2} \sqrt{\kappa(A^\top A)} \log\frac{1}{\tau} = \frac{1}{2} \kappa(A) \log\frac{1}{\tau}. \tag{3.5}$$

Algorithm 7 CGLS $(A, b, x_0, \tau)$
Input: $A, b, x_0, \tau$
Output: $x_i$
1: Set $r_0 = A^\top b - A^\top A x_0$
2: Set $p_0 = r_0$
3: Set $i = 0$
4: repeat
5:   $\alpha_i = \|r_i\|_2^2 / \|p_i\|_{A^\top A}^2$
6:   $x_{i+1} = x_i + p_i \alpha_i$
7:   $r_{i+1} = r_i - \alpha_i A^\top A p_i$
8:   $\beta_i = \|r_{i+1}\|_2^2 / \|r_i\|_2^2$
9:   $p_{i+1} = r_{i+1} + \beta_i p_i$
10:  $i = i + 1$
11: until $\|r_i\|_2 / \|r_0\|_2 \le \tau$
12: return $x_i$

3.3 Block-CG

In [36] an analysis is given for the Block Conjugate Gradients (BCG) algorithm and its variations. Instead of solving each linear system separately, a block Krylov subspace

is formed instead. We denote by $K_m$ a block Krylov subspace of order $m$, i.e. $K_m = \{Z, MZ, M^2 Z, \dots, M^{m-1} Z\}$. In each iteration $i$ a new block-solution $X_i$ is chosen to minimize $\mathrm{tr}[(X_i - X)^\top M (X_i - X)]$ over all $X_i$ such that $X_i - X_0 \in K_i$, where $X$ is the true solution of $MX = Z$. We give an algorithmic description in Algorithm 9 for the case where $M = A^\top A$.

Algorithm 8 MRHS-CGLS $(A, Z, X_0, \tau)$
Input: $A$, $Z = \{z^{(1)}, z^{(2)}, \dots, z^{(r)}\}$, $X_0 = \{x_0^{(1)}, x_0^{(2)}, \dots, x_0^{(r)}\}$, $\tau$
Output: $X$
1: for $i = 1, \dots, r$ do
2:   Set $x^{(i)} = \mathrm{CGLS}(A, z^{(i)}, x_0^{(i)}, \tau)$
3: end for
4: return $X = \{x^{(1)}, x^{(2)}, \dots, x^{(r)}\}$

Complexity

The number of iterations required for such a method is described by Lemma 1; cf. [36]. PBCG has the following costs per iteration, where MM stands for matrix-matrix multiplication. We mark with (*) those which contribute the most to the overall computational cost.

3 × $(2d - 1)r^2$ flops, MM between direction/residual blocks
2 × $O(r^3)$ flops, computing the pseudoinverse of $r \times r$ matrices
(*) 2 × $(2d - 1)dr$ flops, MM between $N$ and direction blocks
(*) 2 × $(2d - 1)nr$ flops, MM between $A$ and direction blocks
3 × $(2r - 1)dr$ flops, MM between direction block and $\alpha$, $\beta$
3 × $dr$ flops, updating solution, residual and direction blocks

Lemma 1. After iteration $k$ of Algorithm 9 the error of the $i$-th right hand side is
