Implementation and performance evaluation of iterative solver for multiple linear systems that have a common coefficient matrix

Seigo Imamura, Graduate School of System Informatics, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe, Japan, E-mail: almisofte@gmail.com
Kenji Ono, RIKEN AICS, 7-1-26 Minatojima-minami-cho, Chuo-ku, Kobe, Japan, E-mail: keno@riken.jp
Mitsuo Yokokawa, Graduate School of System Informatics, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe, Japan, E-mail: yokokawa@port.kobe-u.ac.jp

Capacity computing is a promising scenario for improving performance on upcoming exascale supercomputers. Ensemble computing is one instance of it and gives rise to multiple linear systems that share a common coefficient matrix. By solving these systems at the same time, we reduce the cost of loading the coefficient matrix and improve the performance of several iterative solvers. The maximum performance on a SPARC64 IXfx node of the FX10 was 6.8 times higher than that of a naïve implementation. Finally, to deal with the different convergence processes of the linear systems, we propose control methods that skip the calculation of already converged vectors and provide further acceleration.

Copyright © 2015 by JSFM

1. CPU CPU 1 Capacity Computing (1) Capacity Computing CFD CFD Poisson CPU Byte/Flop B/F B/F Roofline model (2) Operational intensity Flop/Byte Operational intensity Compressed Row Storage CRS (3) Bit (4),(5) B/F

2. CFD

    $$\nabla \cdot (\nabla p) = \mathrm{div}\left(\frac{u}{\Delta t}\right) \equiv \phi \qquad (1)$$

p u ϕ (1) Cartesian 2 7 7 (6) Bi-CGstab (7) Bi-CGstab SOR 3 4 Bi-CGstab

3. List 1 shows the SOR kernel in Fortran with the Inner loop ordering, and List 2 the naïve Outer loop ordering. In Lists 1 and 2, p, b, and bp hold the x, b, and A of Ax = b, ac is the active-cell flag, i, j, k are the grid indices, and l is the index of the right-hand-side (RHS) vector.

List 1: Pseudo code of SOR using Inner loop.

    do k = 1, kx
    do j = 1, jx
    do i = 1, ix
      ! Coefficients are loaded once per cell and reused for all lx systems
      ndag_e = bp(1,i,j,k)   ! e
      ndag_w = bp(2,i,j,k)   ! w
      ndag_n = bp(3,i,j,k)   ! n
      ndag_s = bp(4,i,j,k)   ! s
      ndag_t = bp(5,i,j,k)   ! t
      ndag_b = bp(6,i,j,k)   ! b
      dd     = bp(7,i,j,k)   ! diagonal
      ac     = bp(8,i,j,k)   ! active
      dx     = 1.0 / dd      ! reciprocal of the diagonal, as in List 3

      do l = 1, lx
        pp = p(l,i,j,k)
        bb = b(l,i,j,k)

        ss = ndag_e * p(l,i+1,j  ,k  ) &
           + ndag_w * p(l,i-1,j  ,k  ) &
           + ndag_n * p(l,i  ,j+1,k  ) &
           + ndag_s * p(l,i  ,j-1,k  ) &
           + ndag_t * p(l,i  ,j  ,k+1) &
           + ndag_b * p(l,i  ,j  ,k-1)
        dp = ((ss - bb) * dx - pp) * omg
        pn = pp + dp * ac
        p(l,i,j,k) = pn

        de = dble(bb - (ss - pn * dd))
        res(l) = res(l) + de * de * ac
      end do
    end do
    end do
    end do

List 2: Pseudo code of SOR using Outer loop.

    do l = 1, lx
    do k = 1, kx
    do j = 1, jx
    do i = 1, ix
      ! Coefficients are reloaded for every right-hand side
      ndag_e = bp(1,i,j,k)   ! e
      ndag_w = bp(2,i,j,k)   ! w
      ndag_n = bp(3,i,j,k)   ! n
      ndag_s = bp(4,i,j,k)   ! s
      ndag_t = bp(5,i,j,k)   ! t
      ndag_b = bp(6,i,j,k)   ! b
      dd     = bp(7,i,j,k)   ! diagonal
      ac     = bp(8,i,j,k)   ! active

      pp = p(i,j,k,l)
      bb = b(i,j,k,l)

      ss = ndag_e * p(i+1,j  ,k  ,l) &
         + ndag_w * p(i-1,j  ,k  ,l) &
         + ndag_n * p(i  ,j+1,k  ,l) &
         + ndag_s * p(i  ,j-1,k  ,l) &
         + ndag_t * p(i  ,j  ,k+1,l) &
         + ndag_b * p(i  ,j  ,k-1,l)
      dp = ((ss - bb) / dd - pp) * omg
      pn = pp + dp * ac
      p(i,j,k,l) = pn

      de = dble(bb - (ss - pn * dd))
      res(l) = res(l) + de * de * ac
    end do
    end do
    end do
    end do

The operational intensity (Flop/Byte) is estimated for SPARC64 IXfx (footnote 1). The Outer loop arrays are declared dimension(ix, jx, kx, lx) and the Inner loop arrays dimension(lx, ix, jx, kx); the grid size is 128^3 (ix, jx, kx = 128). Per grid cell, the Outer loop performs 30 flops and the Inner loop 8 + 23 lx flops.

Bytes Outer loop bp 1 8 p i+1, i, i-1, j+1, j-1, k+1, k-1 b, res p 5 1 Inner loop bp Outer loop p l, i+1, i, i-1, j+1, j-1, k+1, k-1 lx 4 k+1, k-1 2 b, res p res lx

According to Tab. 1, with 128 RHS vectors the operational intensity of the Inner loop is about 2.2 times that of the Outer loop.

Inner loop k loop OpenMP Outer loop k loop Outer loop Red-Black SOR (8) R-B SOR Inner loop List 3 R-B SOR SOR i loop R-B SOR i loop Operational intensity Tab. 1

Footnote 1: SPARC64 IXfx 1flop 8flops

Tab. 1: Characteristics for two types of implementations (grid size = 128^3).

                    Outer    Inner (RHS = lx < 4)    Inner (RHS = lx > 4)    Inner (RHS = 128)
    Load & Store    4 + 2    4 + 2 lx                6 + 2 lx                6 + 2 × 128
    Arithmetic      30       8 + 23 lx               8 + 23 lx               8 + 23 × 128
    F/B             1.25     1.29 to 2.08            2.8                     2.81

List 3: Pseudo code of Red-Black SOR using Inner loop.

    do color = 0, 1
    do k = 1, kx
    do j = 1, jx
    do i = 1 + mod(k + j + color, 2), ix, 2   ! cells of one color only
      ndag_e = bp(1,i,j,k)   ! e
      ndag_w = bp(2,i,j,k)   ! w
      ndag_n = bp(3,i,j,k)   ! n
      ndag_s = bp(4,i,j,k)   ! s
      ndag_t = bp(5,i,j,k)   ! t
      ndag_b = bp(6,i,j,k)   ! b
      dd     = bp(7,i,j,k)   ! diagonal
      ac     = bp(8,i,j,k)   ! active

      dx = 1.0 / dd

      do l = 1, lx
        pp = p(l,i,j,k)
        bb = b(l,i,j,k)

        ss = ndag_e * p(l,i+1,j  ,k  ) &
           + ndag_w * p(l,i-1,j  ,k  ) &
           + ndag_n * p(l,i  ,j+1,k  ) &
           + ndag_s * p(l,i  ,j-1,k  ) &
           + ndag_t * p(l,i  ,j  ,k+1) &
           + ndag_b * p(l,i  ,j  ,k-1)
        dp = ((ss - bb) * dx - pp) * omg
        pn = pp + dp * ac
        p(l,i,j,k) = pn

        de = dble(bb - (ss - pn * dd))
        res(l) = res(l) + de * de * ac
      end do

    end do
    end do
    end do
    end do

3.1 Outer loop Inner loop Tab. 2 SPARC64 IXfx FX10 Tab. 3 (1) 3 2 ϕ = Dirichlet/Neumann Performance Monitor library PMlib (9) PMlib

    -Kfast,parallel,ocl,preex,array_private,auto -Kopenmp,simd=2,uxsimd

Tab. 2: Specification of a computing node of FX10.

    CPU                       SPARC64 IXfx
    Clock                     1.65 GHz
    Core                      16
    Ideal peak performance    211.2 GFLOPS
    Cache                     12 MB
    Memory                    32 GB
    Bandwidth                 85 GB/s

Inner loop SOR R-B SOR 64 Outer loop R-B SOR 128 SOR Outer loop.6 Inner loop 4.2 7 Operational intensity Inner loop SOR Outer loop

Fig. 1: Comparison of sequential performance with a problem size 128^3. (a) SOR method, (b) R-B SOR method. Horizontal axis: number of RHS vectors (l). Series: Outer, Inner.

Tab. 3: Investigated parameters.

    Solver                        SOR method, Red-Black SOR method
    Grid size                     32^3, 64^3, 128^3, 256^3
    Number of threads             1 to 16
    Number of pressure vectors    1, 2, 4, 8, 16, 32, 64, 128

SOR R-B SOR Fig. 1 Outer loop SOR R-B SOR Figs. 2, 3 SOR Outer Inner loop R-B SOR Outer loop Fig. 3(a) Inner loop Fig. 3(b) SOR Inner loop SOR 128 66.3 31.3% Outer loop SOR 9.7 6.8
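The F/B row of Tab. 1 can be checked with a few lines of arithmetic. The sketch below is in Python rather than the paper's Fortran so that it runs without a compiler. It rests on assumptions that are not stated in the surviving text: the "Load & Store" entries are taken to be counts of 4-byte words (this choice reproduces the tabulated 1.25 and 2.81), the smaller word count is applied for lx up to and including 4, and the function names are hypothetical.

```python
# Assumption: array elements are single precision, i.e. 4 bytes per word.
BYTES_PER_WORD = 4


def outer_intensity():
    """Flop/Byte of the Outer loop (List 2) per grid cell and per RHS."""
    flops = 30          # "Arithmetic" entry of Tab. 1 for the Outer loop
    words = 4 + 2       # "Load & Store" entry of Tab. 1 for the Outer loop
    return flops / (BYTES_PER_WORD * words)


def inner_intensity(lx, small_limit=4):
    """Flop/Byte of the Inner loop (List 1) per grid cell for lx RHS vectors.

    small_limit is an assumed threshold: Tab. 1 lists 4 + 2 lx words for small
    lx and 6 + 2 lx words for large lx, without stating the case lx = 4.
    """
    flops = 8 + 23 * lx
    fixed_words = 4 if lx <= small_limit else 6
    words = fixed_words + 2 * lx
    return flops / (BYTES_PER_WORD * words)


if __name__ == "__main__":
    print("Outer loop      :", outer_intensity())
    for lx in (1, 4, 16, 128):
        print("Inner loop, lx =", lx, ":", inner_intensity(lx))
```

Under these assumptions the Outer loop stays at 1.25 Flop/Byte regardless of lx, because both its flops and its traffic scale with the number of systems, while the Inner loop approaches 23/8 as lx grows because the coefficient loads are amortized over all right-hand sides.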

Fig. 2: Threading performance of SOR method with a problem size 128^3. (a) Outer loop, (b) Inner loop. Series: RHS = 1, 2, 4, 8, 16, 32, 64, 128.

Fig. 3: Threading performance of R-B SOR method with a problem size 128^3. (a) Outer loop, (b) Inner loop. Series: RHS = 1, 2, 4, 8, 16, 32, 64, 128.
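The Inner loop ordering of List 1 can be tried out on a small grid with the following Python sketch. It is an illustration under stated assumptions, not the authors' Fortran: the nested-list layout p[k][j][i][l], the helper names sor_sweep, make_grid and demo, the unit 7-point coefficients with diagonal 6, the zero Dirichlet halo, and the test right-hand sides are all invented for the example. The reciprocal is taken of the diagonal dd, as in List 3.

```python
def sor_sweep(p, b, coef, omg):
    """One lexicographic SOR sweep for all right-hand sides at once.

    p, b : nested lists indexed [k][j][i][l] with a one-cell halo
    coef : nested list indexed [k][j][i] of (e, w, n, s, t, bt, dd, ac)
    Returns the per-RHS sum of squared residuals accumulated in the sweep.
    """
    nk, nj, ni = len(p) - 2, len(p[0]) - 2, len(p[0][0]) - 2
    lx = len(p[0][0][0])
    res = [0.0] * lx
    for k in range(1, nk + 1):
        for j in range(1, nj + 1):
            for i in range(1, ni + 1):
                # Coefficients are read once per cell and reused for every RHS.
                e, w, n, s, t, bt, dd, ac = coef[k][j][i]
                dx = 1.0 / dd
                for l in range(lx):
                    pp = p[k][j][i][l]
                    bb = b[k][j][i][l]
                    ss = (e * p[k][j][i + 1][l] + w * p[k][j][i - 1][l]
                          + n * p[k][j + 1][i][l] + s * p[k][j - 1][i][l]
                          + t * p[k + 1][j][i][l] + bt * p[k - 1][j][i][l])
                    pn = pp + ((ss - bb) * dx - pp) * omg * ac
                    p[k][j][i][l] = pn
                    de = bb - (ss - pn * dd)
                    res[l] += de * de * ac
    return res


def make_grid(n, lx, fill=0.0):
    """Allocate an (n+2)^3 grid (interior plus halo) with lx values per cell."""
    m = n + 2
    return [[[[fill] * lx for _ in range(m)] for _ in range(m)] for _ in range(m)]


def demo(n=4, lx=3, sweeps=100, omg=1.2):
    """Solve lx systems that share one 7-point coefficient set (assumed data)."""
    m = n + 2
    stencil = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 6.0, 1.0)
    coef = [[[stencil] * m for _ in range(m)] for _ in range(m)]
    p = make_grid(n, lx)
    b = make_grid(n, lx)
    for k in range(1, n + 1):
        for j in range(1, n + 1):
            for i in range(1, n + 1):
                for l in range(lx):
                    # RHS l is (l + 1) times a common base pattern.
                    b[k][j][i][l] = float((l + 1) * (i + 2 * j + 3 * k))
    res = [0.0] * lx
    for _ in range(sweeps):
        res = sor_sweep(p, b, coef, omg)
    return p, b, res
```

Calling demo() runs 100 sweeps on a 4^3 grid with three right-hand sides. Because the three right-hand sides are multiples of one pattern and the coefficients are shared, the three solutions come out as the same multiples of each other, which gives a quick consistency check on the batched update.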

4. Bi-CGstab Bi-CGstab Outer loop if l Inner loop if Inner loop if l if if 2 1 Fig. 4(a) l 1 Fig. 4(b) if Algorithm 1 Bi-CGstab x k, b, r, r k, p k, q Fig. 5 X, B 1

Algorithm 1 BiCGstab algorithm
 1: Start with an initial guess $x_0$
 2: Compute $r_0 = b - A x_0$
 3: Choose an arbitrary vector $r^*$ such that $\rho_0 = (r^* \cdot r_0) \neq 0$, e.g., $r^* = r_0$
 4: $p_0 = r_0$
 5: $k = 0$
 6: repeat
 7:   $q = A p_k$
 8:   $\alpha = \rho_k / (r^* \cdot q)$
 9:   $s = r_k - \alpha q$
10:   $t = A s$
11:   $\omega = (t \cdot s) / (t \cdot t)$
12:   $x_{k+1} = x_k + \alpha p_k + \omega s$
13:   $r_{k+1} = s - \omega t$
14:   $\rho_{k+1} = (r^* \cdot r_{k+1})$
15:   $\beta = (\alpha / \omega)(\rho_{k+1} / \rho_k)$
16:   $p_{k+1} = r_{k+1} + \beta (p_k - \omega q)$
17:   $\rho_k \leftarrow \rho_{k+1}$
18:   $k$++
19: until $\|r_{k+1}\|_2 / \|b\|_2 < \epsilon$

Fig. 4: Two types of control methods. (a) Exchanging vectors. (b) Reshaping arrays.

Fig. 5: Rotating memory space of arrays.
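Algorithm 1 applied to several right-hand sides, with converged systems skipped, can be sketched in Python as below. This is a minimal illustration under assumptions, not the authors' implementation: a per-system `active` flag is one simple way to skip converged vectors and is only presumed to resemble the "if" control named in this section (the exchanging and reshaping variants of Fig. 4 are not shown). A dense list-of-lists matrix stands in for the 7-band stencil matrix, right-hand sides are assumed nonzero, the function names are hypothetical, and the early exit on a vanishing $s$ is a common safeguard that does not appear in Algorithm 1.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product; stands in for the stencil operator."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * c for a, c in zip(u, v))


def bicgstab_multi(A, B, eps=1e-10, max_iter=200):
    """Bi-CGstab (Algorithm 1) for the systems A x_l = B[l] sharing A.

    Returns (X, iters); iters[l] is the iteration at which system l stopped.
    """
    lx, n = len(B), len(A)
    X = [[0.0] * n for _ in range(lx)]       # initial guess x_0 = 0
    R = [list(bv) for bv in B]               # r_0 = b - A x_0 = b
    Rs = [list(bv) for bv in B]              # shadow vector r* = r_0
    P = [list(bv) for bv in B]               # p_0 = r_0
    rho = [dot(Rs[l], R[l]) for l in range(lx)]
    bnorm = [math.sqrt(dot(bv, bv)) for bv in B]
    active = [True] * lx                     # False once a system has converged
    iters = [0] * lx
    for k in range(1, max_iter + 1):
        if not any(active):
            break
        for l in range(lx):
            if not active[l]:
                continue                     # skip already converged vectors
            iters[l] = k
            q = matvec(A, P[l])
            alpha = rho[l] / dot(Rs[l], q)
            s = [r - alpha * qi for r, qi in zip(R[l], q)]
            if math.sqrt(dot(s, s)) / bnorm[l] < eps:
                # Safeguard (not in Algorithm 1): s is already negligible.
                X[l] = [x + alpha * p for x, p in zip(X[l], P[l])]
                R[l] = s
                active[l] = False
                continue
            t = matvec(A, s)
            omega = dot(t, s) / dot(t, t)
            X[l] = [x + alpha * p + omega * si
                    for x, p, si in zip(X[l], P[l], s)]
            R[l] = [si - omega * ti for si, ti in zip(s, t)]
            if math.sqrt(dot(R[l], R[l])) / bnorm[l] < eps:
                active[l] = False
                continue
            rho_new = dot(Rs[l], R[l])
            beta = (alpha / omega) * (rho_new / rho[l])
            P[l] = [r + beta * (p - omega * qi)
                    for r, p, qi in zip(R[l], P[l], q)]
            rho[l] = rho_new
    return X, iters
```

The sketch loops over systems inside each iteration for readability. In a kernel organized like List 1, the same flag would instead be tested inside the innermost l loop, which is where the cost of the branch and the benefit of skipping work trade off against each other.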

4.1 SOR Bi-CGstab 1 Average time Fig. 6 Outer loop List 2 l loop if Inner loop(no) List 1 Inner loop(if) List 1 l loop if Inner loop(exchanging) Inner loop(reshaping) Outer loop Average time Inner loop Average time Inner loop Outer loop

Fig. 6: Comparison of average time with a problem size 128^3. Vertical axis: average time [sec]. Horizontal axis: the number of RHS vectors (l). Series: Outer loop, Inner loop(no), Inner loop(if), Inner loop(exchanging), Inner loop(reshaping).

Fig. 7 l Fig. 1(a) if

Fig. 7: Comparison of the performance of control methods with a problem size 128^3. Horizontal axis: the number of RHS vectors (l). Series: No, If, Exchanging, Reshaping.

5. SPARC64 IXfx 6.8 31.3%

JSPS 2628687

(1) ,, (28) pp.1 7
(2) Williams, S. W., Waterman, A., Patterson, D. A., Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures, Commun. ACM, Vol. 52 (2009) pp. 65-76.
(3) Willcock, J. and Lumsdaine, A., Accelerating sparse matrix computations via data compression, Proc. 20th Annual ICS '06 (2006) pp. 307-316.
(4) Tang, W. T., et al., Accelerating Sparse Matrix-vector Multiplication on GPUs Using Bit-representation-optimized Schemes, Proc. of SC13 26 (2013) pp. 1-12.
(5) Ono, K., Chiba, S., Inoue, S., Minami, K., Low Byte/Flop Implementation of Iterative Solver for Sparse Matrices Derived from Stencil Computations, High Performance Computing for Computational Science VECPAR 2014 (2015) Vol. 8969 pp. 192-205.
(6) Imamura, S., Ono, K., Yokokawa, M., Performance Evaluation of Iterative Solver for Multiple Vectors Associated with a Large-Scale Sparse Matrix, Proc. 27th International Conference on Parallel Computational Fluid Dynamics (2015) pp. 124-125. (to appear)
(7) Van der Vorst, H. A., Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems, SIAM J. Sci. and Stat. Comput. Vol. 13 No. 2 (1992) pp. 631-644.
(8) Yokokawa, M., Vector-Parallel Processing of the Successive Overrelaxation Method, Japan Atomic Energy Research Institute JAERI-M Report No. 88-17 (1988) (in Japanese)
(9) http://avr-aics-riken.github.io/pmlib/