Performance improvement of iterative solver using bit-compression for a sparse matrix


E6-

Kenji Ono, RIKEN AICS, 7-1-26 Minatojima-minami-cho, Chuo-ku, Kobe, Japan
E-mail: keno@riken.jp
Copyright © by JSFM

A novel bit-representation/compression technique is proposed to enhance the performance of iterative methods for a large sparse matrix. The technique is applied to the implementation of iterative kernels with Dirichlet and Neumann boundary conditions. Its first advantage is that it reduces memory traffic from main memory and makes effective use of SIMD units with cache. Second, the proposed implementation can replace if-branch statements with mask operations using the bit expression, which promotes code optimization at both compile time and run time. The Red-Black SOR and BiCGstab algorithms are employed to investigate the proposed implementation. Consequently, the proposed approach runs faster than a naïve implementation on both Intel and Fujitsu Sparc architectures.

[Introduction: body text not recovered. Surviving terms: Poisson equation, SIMD, Roofline model (1), Operational Intensity.]

[Formulation: body text not recovered. Surviving content follows.]

The pressure Poisson equation derived from the Navier-Stokes equations, with pressure p, velocity u, and source term φ:

$$\nabla \cdot (\nabla p) = \mathrm{div}\left(\mathbf{u} / \Delta t\right) \equiv \varphi \qquad (1)$$

Neumann and Dirichlet boundary conditions (7) are expressed with a Heaviside-type mask function:

$$H = \begin{cases} 0 & \text{(Boundary Condition)} \\ 1 & \text{(Fluid)} \end{cases}$$

[Remaining equations, up to (6): the masked pressure gradient, the discrete Poisson equation with the Neumann mask, the Dirichlet treatment at the east face using p_{i+1} and its own mask (Fig. 1), and the resulting linear system Ax = b. The terms are not legible in this transcript. The surrounding text notes that the Heaviside mask replaces the if-branch for the Neumann condition and so suits SIMD execution.]
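The abstract's claim that if-branches can be replaced by mask operations can be illustrated with a small C sketch. It assumes a homogeneous (zero-gradient) Neumann face and a mask that is 0 on a boundary face and 1 on a fluid face. The helper names `flag_bit`, `face_gradient_branch`, and `face_gradient_masked`, and the bit positions used, are hypothetical and do not come from the paper.

```c
#include <assert.h>

/* Hypothetical helper: read the single flag bit at position s (0 or 1). */
static inline int flag_bit(unsigned int word, int s)
{
    assert(s >= 0 && s < 32);
    return (int)((word >> s) & 1u);
}

/* Branching form: the gradient across a face is zero on a Neumann face. */
static double face_gradient_branch(double p_i, double p_nb, double h,
                                   int is_fluid_face)
{
    if (is_fluid_face) {
        return (p_nb - p_i) / h;
    }
    return 0.0;
}

/* Masked form: the same result without a branch.
 * H is 1.0 on a fluid face and 0.0 on a boundary face (assumed convention). */
static double face_gradient_masked(double p_i, double p_nb, double h,
                                   unsigned int word, int s)
{
    double H = (double)flag_bit(word, s);
    return H * (p_nb - p_i) / h;
}
```

Because the masked form has no data-dependent branch, a compiler is free to vectorize a loop over cells. One caveat of this sketch: if the neighbor value is NaN or infinite, multiplying by a zero mask yields NaN rather than 0, so the two forms agree only for finite inputs.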

Fig. 1: Neumann and Dirichlet boundary conditions for cell i in two dimensions. A Neumann condition is applied at the west cell face, which is solidly shaded. A Dirichlet condition is employed at the east cell face, where the boundary value is given by the pressure p_{i+1}.

[Body text not recovered; it describes how the diagonal (Diag), non-diagonal (dag x), Neumann, and Dirichlet flags are stored as bits.]

Encoding:

    inline int onbit(int idx, const int s) { return ( idx | (0x1 << s) ); }

Decoding:

    #define BIT_SHIFT(a,b) ( ((a) >> (b)) & 0x1 )

Red-Black SOR

[Body text not recovered; it refers to the RB-SOR method and to the Fortran coefficient array pn(i,j,k,n).]

[Naive code]

    do color=0,1
    do k=1,kx
    do j=1,jx
    do i=1+mod(k+j+color,2), ix, 2
      ! coefficients are loaded from the 7-component array pn
      c_w = pn(i,j,k,1)
      c_e = pn(i,j,k,2)
      c_s = pn(i,j,k,3)
      c_n = pn(i,j,k,4)
      c_b = pn(i,j,k,5)
      c_t = pn(i,j,k,6)
      dd  = pn(i,j,k,7)
      pp  = p(i,j,k)
      ss  = c_e * p(i+1,j,k) + c_w * p(i-1,j,k) &
          + c_n * p(i,j+1,k) + c_s * p(i,j-1,k) &
          + c_t * p(i,j,k+1) + c_b * p(i,j,k-1)
      dp  = ( (ss + b(i,j,k)) / dd - pp ) * omg
      p(i,j,k) = pp + dp
      ! the Active bit masks inactive cells out of the residual
      res = res + dble(dp*dp) * dble( ibits(bp(i,j,k), Active, 1) )
    end do
    end do
    end do
    end do

[Bit-reps code]

    do color=0,1
    do k=1,kx
    do j=1,jx
    do i=1+mod(k+j+color,2), ix, 2
      ! coefficients are decoded from the bit array bp instead of pn
      idx = bp(i,j,k)
      c_e = real( ibits(idx, _dag_e, 1) )
      c_w = real( ibits(idx, _dag_w, 1) )
      c_n = real( ibits(idx, _dag_, 1) )
      c_s = real( ibits(idx, _dag_s, 1) )
      c_t = real( ibits(idx, _dag_t, 1) )
      c_b = real( ibits(idx, _dag_b, 1) )
      ! the diagonal coefficient is stored in three bits
      d0  = real( ibits(idx, _Diag+0, 1) )
      d1  = real( ibits(idx, _Diag+1, 1) )
      d2  = real( ibits(idx, _Diag+2, 1) )
      dd  = d2*4.0 + d1*2.0 + d0
      pp  = p(i,j,k)
      ss  = c_e * p(i+1,j,k) + c_w * p(i-1,j,k) &
          + c_n * p(i,j+1,k) + c_s * p(i,j-1,k) &
          + c_t * p(i,j,k+1) + c_b * p(i,j,k-1)
      dp  = ( (ss + b(i,j,k)) / dd - pp ) * omg
      p(i,j,k) = pp + dp
      res = res + dble(dp*dp) * dble( ibits(idx, Active, 1) )
    end do
    end do
    end do
    end do

[Body text not recovered; it compares the arrays touched by the two kernels (p, b, bp, and pn for the naive code; p, b, and bp for the bit-reps code), the neighboring p values loaded at i-1, i, i+1, j-1, j+1, k-1, and k+1, and the resulting Operational Intensity in Flop/Byte, summarized in a table.]
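The encode and decode steps and the bit-reps update above can be mirrored in C as a minimal sketch. The bit layout here (six non-diagonal flags in bits 0 to 5, a three-bit diagonal count in bits 6 to 8, an active flag in bit 9) and every name in the block are assumptions made for illustration; the paper's actual layout is the one in its bit-representation figure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this sketch. */
enum { NDAG_SHIFT = 0, DIAG_SHIFT = 6, ACTIVE_SHIFT = 9 };

/* Set the bit at position s. */
static inline uint32_t set_bit(uint32_t word, int s)
{
    assert(s >= 0 && s < 32);
    return word | (UINT32_C(1) << s);
}

/* Extract len bits starting at position s (like Fortran's ibits). */
static inline uint32_t get_bits(uint32_t word, int s, int len)
{
    assert(s >= 0 && len > 0 && len < 32 && s + len <= 32);
    return (word >> s) & ((UINT32_C(1) << len) - 1u);
}

/* Pack six neighbor flags, the diagonal count (0..6), and the active flag. */
static uint32_t encode_cell(const int ndag[6], int diag, int active)
{
    assert(diag >= 0 && diag <= 6);
    uint32_t w = 0;
    for (int f = 0; f < 6; ++f) {
        if (ndag[f]) {
            w = set_bit(w, NDAG_SHIFT + f);
        }
    }
    w |= (uint32_t)diag << DIAG_SHIFT;
    if (active) {
        w = set_bit(w, ACTIVE_SHIFT);
    }
    return w;
}

/* One SOR update of a single cell, with coefficients decoded from w.
 * nb[] holds the six neighbor values in the same order as the flags. */
static double sor_update(uint32_t w, double p, const double nb[6],
                         double b, double omg)
{
    double ss = 0.0;
    for (int f = 0; f < 6; ++f) {
        ss += (double)get_bits(w, NDAG_SHIFT + f, 1) * nb[f];
    }
    double dd = (double)get_bits(w, DIAG_SHIFT, 3);
    assert(dd > 0.0); /* a cell with a zero diagonal cannot be updated */
    return p + ((ss + b) / dd - p) * omg;
}
```

For an interior cell, `encode_cell` would be called once during setup with all six flags set and a diagonal count of 6, after which the solver loop only reads the packed word. The design point this sketch tries to show is that one 32-bit word per cell replaces seven floating-point coefficients per cell.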

Fig. 2: Bit representation. Several bits required for the bit-representation are encoded into this array. This example includes diagonal (Diag), non-diagonal (dag x), Neumann boundary (x), Dirichlet boundary (D x), cell state (State), and activeness (Active) of a cell. Other bits are used for more complicated processes.

[Figure: bit-field layout. Surviving field labels: _Diag (~6), _dag_e, _dag_w, _dag_s, _dag, dag_t, _dag_b, W, E, S, B, T, _D_W, _D_S, _D_E, _D_T, _D_B, _D_, State, Active.]

Tab. 1: Specification of evaluation machines. TRIAD scores are measured by the STREAM benchmark (11).

Columns: Architecture | Clock (GHz) | CPU | Peak | Cache (MB) | Memory (GB) | Theoretical BW (GB/s) | TRIAD (GB/s)

Rows as they survive (most digits are missing):

    Xeon X6        .66 6 7.7 6 6
    Xeon E-67      .6 8 66. 6 9
    Xeon E-68      . 8 96. 6 9
    Sparc VIIIfx   . 8 8. 6 6 6 6
    Sparc IXfx     .8 6 6. 8

Tab. 2: Comparison of characteristics for two types of implementation.

Columns: Naïve | Bit-Reps.
Rows: Memory Requirement (unit) | Load & Store | Arithmetic | F/B
[Cell values are not legible.]

[Body text not recovered. Surviving terms: a test problem with Dirichlet/Neumann conditions; measurement with the Performance Monitor library (PMlib) (8), (10) and PAPI; results on Fujitsu Venus VIIIfx and IXfx and on Intel; SIMD execution of ibits(a, b, x); F/B values for Westmere (X6), Sparc VIIIfx, and Sparc IXfx.]
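The F/B (Flop/Byte) figures and the TRIAD bandwidths above are the inputs to a Roofline estimate (1): attainable performance is the smaller of the peak floating-point rate and the product of memory bandwidth and Operational Intensity. The C sketch below shows that calculation under stated assumptions; the function names are hypothetical, and any numbers passed to them are placeholders, not values from the paper's tables.

```c
#include <assert.h>

/* Operational Intensity in Flop/Byte: floating-point operations per cell
 * divided by the bytes moved per cell. */
static double operational_intensity(double flops, double words,
                                    double bytes_per_word)
{
    assert(words > 0.0 && bytes_per_word > 0.0);
    return flops / (words * bytes_per_word);
}

/* Roofline bound: min(peak, bandwidth * intensity), in GFLOPS.
 * peak_gflops is the machine peak, bw_gbs a measured bandwidth such as
 * a STREAM TRIAD score, oi the Operational Intensity in Flop/Byte. */
static double roofline_gflops(double peak_gflops, double bw_gbs, double oi)
{
    assert(peak_gflops > 0.0 && bw_gbs > 0.0 && oi >= 0.0);
    double mem_bound = bw_gbs * oi;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

In this model a kernel that moves fewer bytes per cell for the same arithmetic has a higher Operational Intensity, so its memory-bound ceiling rises in proportion. That is the mechanism by which dropping the coefficient array would be expected to help a bandwidth-limited kernel.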

[Body text not recovered. Surviving terms: FFV-C (9), PFlops, percentage of peak, Attainable Performance.]

Fig. 3: Performance analysis of Roofline model for naïve and bit-reps implementations.
[Plot: attainable performance versus Operational Intensity (Flops/Byte). Series: Westmere, Westmere bit-reps, Sparc VIIIfx, Sparc VIIIfx bit-reps, Sparc IXfx, Sparc IXfx bit-reps.]

Fig. 4: Measured performance of FFV-C code on the K computer with 82,944 nodes. Each node has 8 cores.
[Plot: GFLOPS versus Number of Processes. Series: Ideal, FFV-C.]

[Conclusion: body text not recovered. Surviving terms: Poisson, Intel, Sparc, SIMD.]

References

(1) Williams, S., Waterman, A. and Patterson, D.: Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM, Vol. 52, No. 4 (2009) 65-76
(2) Yokokawa, M.: Vector-Parallel Processing of the Successive Overrelaxation Method. Japan Atomic Energy Research Institute, JAERI-M Report No. 88-7 (1988), in Japanese
(3) Willcock, J. and Lumsdaine, A.: Accelerating sparse matrix computations via data compression. Proc. 20th Annual ICS '06 (2006) 307-316
(4) Tang, W. T., et al.: Accelerating Sparse Matrix-vector Multiplication on GPUs Using Bit-representation-optimized Schemes. Proc. of SC13 (2013)
(5) Van der Vorst, H. A.: Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems. SIAM J. Sci. and Stat. Comput., Vol. 13, No. 2 (1992) 631-644
(6) Ono, K. and Kawashima, Y.: Multicolor SOR Method with Consecutive Memory Access Implementation in a Shared and Distributed Memory Parallel Environment. Lecture Notes in Computational Science and Engineering, Vol. 74 (2010)
(7) Ono, K., Chiba, S., Inoue, S., and Minami, K.: Performance Improvement of Iterative Methods using a Bit-Representation Technique for Coefficient Matrices. Vecpar 2012 (2012)
(8) Ono, K., Kawashima, Y. and Kawanabe, T.: Data Centric Framework for Large-scale High-performance Parallel Computation. Procedia Computer Science, Vol. 29 (2014) 2336-2350
(9) http://avr-aics-riken.github.io/ffvc_package/
(10) http://avr-aics-riken.github.io/pmlib/
(11) http://www.cs.virginia.edu/stream

Fig. 5: Comparison of serial performance of each machine. The problem size is varied over a range.
[Panels: (a) Intel Xeon X6, (b) Intel Xeon E-67, (c) Fujitsu Sparc Venus VIIIfx, (d) Fujitsu Sparc Venus IXfx.]

Fig. 6: Comparison of thread parallel performance of each machine. The problem size is chosen so that the data resides in main memory.
[Panels: (a) Intel Xeon X6, (b) Intel Xeon E-67, (c) Fujitsu Sparc Venus VIIIfx, (d) Fujitsu Sparc Venus IXfx.]