Implementation and Evaluation of a Quadruple-Precision BiCGStab Solver on GPUs

Daichi Mukunoki (University of Tsukuba, mukunoki@hpcs.cs.tsukuba.ac.jp)
Daisuke Takahashi (University of Tsukuba, daisuke@cs.tsukuba.ac.jp)

Abstract: We implement a quadruple-precision BiCGStab solver on a GPU using Double-Double (DD) arithmetic. On an NVIDIA Tesla M2050, the execution time per iteration of the quadruple-precision solver is only 1.0-2.2 times that of the double-precision solver, and the higher precision reduces the number of iterations to convergence for all test matrices.

(c) 2012 Information Processing Society of Japan

1. Introduction

IEEE 754-2008 [1] standardizes a 128-bit binary floating-point format (binary128), but current processors provide no hardware support for it. At the same time, Krylov subspace methods such as the CG (Conjugate Gradient) method are known to benefit from higher-precision arithmetic, which can reduce the number of iterations to convergence [2].

We previously implemented and evaluated extended-precision BLAS (Basic Linear Algebra Subprograms) routines on GPUs [3]. Quadruple precision can be emulated in software by DD (Double-Double) arithmetic, which represents one quadruple-precision value by a pair of doubles: a DD operation costs many double-precision floating-point operations, but the data volume only doubles. Because GPUs have a low Byte/Flop ratio, memory-bound operations such as the Level-2 GEMV - and the SpMV and Level-1 routines that dominate Krylov solvers such as CG - can absorb much of the extra arithmetic cost of DD.

In this paper we implement a quadruple-precision BiCGStab solver on an NVIDIA Tesla M2050 GPU and evaluate it against a double-precision solver.
p_0 = r_0 = r~_0 = b - Ax_0
ρ_0 = (r~_0, r_0)
for k = 0, 1, 2, ... do
    v = Ap_k
    α = ρ_k / (r~_0, v)
    s = r_k - αv
    t = As
    ω = (t, s) / (t, t)
    x_{k+1} = x_k + αp_k + ωs
    r_{k+1} = s - ωt
    if ||r_{k+1}|| / ||b|| < ε then break end if
    ρ_{k-1} = ρ_k;  ρ_k = (r~_0, r_{k+1})
    β = (ρ_k / ρ_{k-1}) (α / ω)
    p_{k+1} = r_{k+1} + β(p_k - ωv)
end for

Fig. 1  The BiCGStab method.

2. Implementation of a Quadruple-Precision BiCGStab Solver on GPUs

We target NVIDIA Fermi-generation GPUs and implement the quadruple-precision BiCGStab solver with CUDA.

2.1 Kernels of the BiCGStab Method

Fig. 1 shows the BiCGStab algorithm. One iteration consists of the following kernels (invocations per iteration in parentheses): SpMV y = Ax (2), DOT r = (x, y) (4), AXPY y = αx + y (3), AXPYZ z = αx + y (2), XPAY y = x + αy (1), and NRM2 (1). The SpMV and all Level-1 kernels run on the GPU; only scalar operations remain on the CPU. The double-precision solver used for comparison is built from CUBLAS [4] routines where available (e.g. DNRM2), with our own kernels for AXPYZ, XPAY, and SpMV; NVIDIA's cusparse [5] provides a reference double-precision SpMV. The performance of our double-precision kernels relative to the NVIDIA libraries is reported in Sec. 3.1.

__global__ void SpMV_CRS_kernel
    (int m, double alpha, double *a_val, int *a_ptr,
     int *a_idx, double *x, double beta, double *y)
{
    int tx = threadIdx.x;
    int tid = blockDim.x * blockIdx.x + tx;
    int rowid = tid / NUM_T;
    int lane = tid % NUM_T;
    int maxrow = MAX_BLK * NUM_TH / NUM_T;
    __shared__ double vals[NUM_TH];
    while (rowid < m) {
        vals[tx] = 0.0;
        for (int i = a_ptr[rowid] + lane; i < a_ptr[rowid+1]; i += NUM_T)
            vals[tx] = a_val[i] * x[a_idx[i]] + vals[tx];
        for (int i = NUM_T/2; i > 0; i = i >> 1) {
            if (lane < i) vals[tx] += vals[tx + i];
        }
        if (lane == 0)
            y[rowid] = alpha * vals[tx] + beta * y[rowid];
        if (m > maxrow) __syncthreads();
        rowid += gridDim.x * blockDim.x / NUM_T;
    }
}

Fig. 2  Double-precision CRS SpMV kernel.

2.2 SpMV

SpMV on GPUs has been studied in, e.g., [6]. We store matrices in the CRS (Compressed Row Storage) format and, following Bell and Garland [7], assign multiple threads to each row of y = Ax (Fig. 2).
Table 1  Test matrices (dimension N, nonzeros nnz, density) and BiCGStab iteration counts in double and quadruple precision (ratio = quad/double; "-" marks names not shown here).

Matrix            N        nnz       density[%]  double  quad  ratio
-                 1489752  10319760  0.0005      279     266   0.95
-                 36441    565761    0.0426      338     332   0.98
-                 17758    126150    0.0400      2448    2288  0.93
-                 155331   11283503  0.0468      3203    2125  0.66
-                 14734    95053     0.0438      605     545   0.90
-                 38434    206156    0.0140      4423    3059  0.69
-                 221119   7666057   0.0157      4225    2164  0.51
crankseg_2        63838    14148858  0.3472      7835    3702  0.47
-                 2395     17319     0.3019      822     743   0.90
-                 11341    98523     0.0766      3916    2474  0.63
adder_trans_01    1814     14579     0.4431      299     205   0.69
circuit_2         4510     21199     0.1042      741     469   0.63
-                 116835   766396    0.0056      2539    1607  0.63
-                 8081     13036     0.0200      229     161   0.70
-                 2048     10114     0.2411      2495    1774  0.71
TSOPF_RS_b9_c6    7224     54082     0.1036      1319    488   0.37

Each row of y is handled by a group of NUM_T threads (NUM_T = 8 in our evaluation): the partial sums of a row are accumulated in shared memory and combined in a log2(NUM_T)-step reduction, after which lane 0 writes the result. Like the cusparse SpMV, the kernel computes y = αAx + βy, so the two are directly interchangeable in the BiCGStab solver. A thread block holds NUM_TH = 128 threads and the grid uses up to MAX_BLK = 65535 blocks; when the matrix has more rows than the grid covers, each thread group loops over further rows.

2.3 Quadruple-Precision Arithmetic

Quadruple precision is realized by Double-Double (DD) arithmetic, which goes back to Dekker [8] and is implemented in Bailey's QD library [9]: one quadruple-precision value is represented as the unevaluated sum of two doubles, so one DD operation costs many double-precision operations while the data volume only doubles [3]. As in [3], DD vectors are stored in SoA (Structure of Arrays) layout, i.e. as two separate double arrays, rather than in AoS (Array of Structures) layout; the quadruple-precision SpMV and Level-1 kernels then access memory in a coalesced fashion. CUBLAS DNRM2 is replaced by our own DD NRM2 kernel. Scalars such as α and β are computed in DD on the CPU using the QD library [9].

3. Evaluation

The test matrices (Table 1) were selected from 208 candidates in The University of Florida Sparse Matrix Collection [10].
Fig. 3  Speedup of our double-precision SpMV, DOT, and AXPY kernels over the NVIDIA libraries (Tesla M2050).

The solvers were run with ε = 1e-12, b = (1, ..., 1)^T, x_0 = (0, ..., 0)^T, and an upper limit of 10,000 iterations per matrix.

Evaluation environment:
    GPU:       NVIDIA Tesla M2050 (Fermi architecture)
    CPU:       Intel Xeon E5630 (2.53 GHz, 4 cores) x 2
    OS:        CentOS 6.3
    CUDA:      5.0 (Driver Version: 304.54)
    Compilers: gcc 4.4.6 (-O3), nvcc 5.0 (-O3 -arch sm_20)

All BiCGStab kernels except the scalar operations run on the GPU. Sec. 3.1 first compares our double-precision kernels with the NVIDIA libraries; Sec. 3.2 then compares the quadruple- and double-precision BiCGStab solvers.

3.1 Comparison with the NVIDIA Libraries

Fig. 3 compares our double-precision kernels with NVIDIA's CUDA 5.0 libraries: SpMV against cusparse and DOT and AXPY against CUBLAS. Across the 16 test matrices our SpMV is up to about 1.4x faster than the cusparse SpMV, while DOT and AXPY perform comparably to CUBLAS; AXPYZ and XPAY, which have no CUBLAS counterpart, were checked against equivalent sequences of CUBLAS AXPY calls.

3.2 Quadruple- vs. Double-Precision BiCGStab

Both solvers were run with ε = 1e-12, b = (1, ..., 1)^T, x_0 = (0, ..., 0)^T, and a limit of 10,000 iterations. As Table 1 shows, quadruple precision reduced the iteration count for every matrix, down to 0.37x the double-precision count for TSOPF_RS_b9_c6. The execution time per iteration of the quadruple-precision solver is 1.0-2.2x (roughly 1.5x on average over the 16 matrices) that of the double-precision solver.
Fig. 4  Execution-time ratio (quadruple/double) of BiCGStab on Tesla M2050: total time and time per iteration.

Fig. 4 shows, per matrix, the quad/double ratio of the total solver time and of the time per iteration. While the per-iteration ratio is 1.0-2.2, the reduced iteration counts pull the total-time ratio below it; for matrices whose iteration count roughly halves, such as crankseg_2, the quadruple-precision solver approaches, and in some cases beats, the double-precision solver in total time.

4. Discussion

4.1 Why Quadruple Precision Is This Cheap

A DD operation such as a + b costs about 20 double-precision floating-point operations [3], so a naive flop count predicts that the quadruple-precision BiCGStab solver should be roughly 20x slower per iteration; the measured ratio is only 1.0-2.2. Fig. 5 breaks down the execution time of one iteration: SpMV dominates, followed by DOT. Fig. 6 shows the per-kernel quad/double time ratios: SpMV 1.4-2.5, DOT 1.3-2.1, and AXPY 1.0-3.2. The GPU-side reasons are examined below.

4.1.1 Byte/Flop Considerations

DD doubles the data volume (two doubles per value) while multiplying the arithmetic count by about 20; which factor governs the run time depends on the machine balance. The Tesla M2050 has a double-precision peak of 515.2 GFlops against a memory bandwidth of 144 GB/s, i.e. about 0.3 Byte/Flop. Counting one DD operation as 20 Flops, the DD peak is 515.2/20 = 25.76 GDDFlops, so the effective balance becomes 144/25.76 = 5.6 Byte/DDFlop. A kernel is memory-bound whenever the bytes it must move per (DD)Flop exceed this machine balance; the Byte/Flop requirement of SpMV is considered next.
Fig. 5  Breakdown of BiCGStab execution time per iteration (SpMV, DOT, DNRM2, Others) in double and quadruple precision on Tesla M2050 (example: TSOPF_RS_b9_c6).

Fig. 6  Per-kernel execution-time ratio (quadruple/double) of SpMV, DOT, and AXPY within BiCGStab on Tesla M2050.

Double-precision SpMV requires about 4.0 Byte/Flop, far above the roughly 0.3 Byte/Flop the Tesla M2050 can supply, so it is strongly memory-bound; in quadruple precision it requires about 8.1 Byte/DDFlop against the 5.6 Byte/DDFlop available, so it remains dominated by memory traffic. Since the memory traffic grows only about 2x (not 20x), this matches the measured 1.0-2.2x per-iteration ratio of the quadruple-precision BiCGStab solver.

4.1.2 Memory Traffic of SpMV

In SpMV, moreover, the CRS index arrays are 32-bit integers in both precisions, so switching to quadruple precision increases the total memory traffic by less than 2x; this is one reason the SpMV ratio stays in the 1.4-2.5 range.

4.1.3 Problem-Size Dependence of AXPY

The AXPY ratio varies the most (1.0-3.2) and changes sharply with the vector length: it differs markedly between the matrices with N = 36,441 and N = 38,434, i.e. around N = 37,000, suggesting that the kernel's behavior crosses a hardware-related boundary near that size.

4.2 Comparison with Preconditioning

Raising the precision is not the only way to improve convergence; preconditioning is the usual approach. We therefore compare with ILU(0), an incomplete LU factorization that allows no fill-in, run on the CPU.
Table 2  BiCGStab iteration counts in Lis with and without ILU(0) preconditioning, in double and quadruple precision ("-": did not converge; matrix rows correspond to Table 1).

Matrix            double  quad   ILU(0)+double  ILU(0)+quad
-                 253     265    110            110
-                 328     356    22             22
-                 2956    2182   170            1345
-                 2376    2000   564            336
-                 584     546    113            112
-                 5129    2969   -              -
-                 3988    1940   511            312
crankseg_2        6801    3757   639            523
-                 895     799    981            2785
-                 3456    2456   -              -
adder_trans_01    351     188    -              -
circuit_2         740     523    60             61
-                 2458    1530   813            266
-                 233     166    31             27
-                 2463    1793   -              3912
TSOPF_RS_b9_c6    1361    483    -              -

Table 2 shows iteration counts measured with Lis (a Library of Iterative Solvers for linear systems) [11], Version 1.2.65, running BiCGStab with and without ILU(0) on the CPU. The counts differ from the GPU results in Table 1 because of, e.g., the use of FMA (Fused Multiply-Add) instructions and different summation orders on the CPU and the GPU. ILU(0) often cuts the iteration count dramatically, but for several matrices the preconditioned solver fails to converge ("-") or even needs more iterations than the unpreconditioned one, whereas quadruple precision reduced the iteration count for every matrix. Quadruple precision is thus a robust complement to preconditioning, and the two can also be combined in a quadruple-precision ILU(0)-preconditioned BiCGStab solver.

5. Related Work

Hasegawa [2], [12] discusses the use of quadruple-precision arithmetic in Krylov subspace methods. Quadruple-precision arithmetic has been accelerated with SSE2 [13], and the DD Level-1 routines of Lis have been implemented with AVX [14]. Saito et al. developed the quadruple-precision arithmetic toolbox QuPAT on Scilab [15] and analyzed the GCR method with mixed-precision arithmetic [16]. Multiple-precision Krylov solvers such as BiCG and GPBiCG have also been evaluated with MPFR/GMP and BNCpack on CPUs and GPUs [17].

6. Conclusion

We implemented a quadruple-precision BiCGStab solver on a GPU using DD arithmetic. On a Tesla M2050 the time per iteration in quadruple precision is only 1.0-2.2x that of double precision: the dominant kernels are memory-bound, and DD only doubles the data volume while the GPU's low Byte/Flop ratio hides the roughly 20x arithmetic cost. Quadruple precision reduced the iteration count for all test matrices and is therefore a practical option on GPUs, both on its own and combined with preconditioning.
Acknowledgment: This work was supported in part by JST CREST.

References

[1] IEEE Computer Society: IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008, pp. 1-58 (2008).
[2] Hasegawa, H.: Utilizing the quadruple-precision floating-point arithmetic operation for the Krylov Subspace Methods, Proc. SIAM Conference on Applied Linear Algebra (LA03) (2003).
[3] Mukunoki, D. and Takahashi, D.: Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs, Proc. 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW 2012), The 13th Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-12), pp. 1378-1386 (2012).
[4] NVIDIA Corporation: CUBLAS Library (included in CUDA Toolkit), https://developer.nvidia.com/cublas.
[5] NVIDIA Corporation: cusparse Library (included in CUDA Toolkit), https://developer.nvidia.com/cusparse.
[6] ... GPU ... CRS ..., IPSJ SIG Technical Report, Vol. 2012-HPC-135, No. 31, pp. 1-6 (2012) (in Japanese).
[7] Bell, N. and Garland, M.: Efficient sparse matrix-vector multiplication on CUDA, NVIDIA Technical Report, No. NVR-2008-004 (2008).
[8] Dekker, T. J.: A Floating-Point Technique for Extending the Available Precision, Numerische Mathematik, Vol. 18, pp. 224-242 (1971).
[9] Bailey, D. H.: QD (C++/Fortran-90 double-double and quad-double package), http://crd.lbl.gov/~dhbailey/mpdist/.
[10] Davis, T. and Hu, Y.: The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/.
[11] Lis: a Library of Iterative Solvers for linear systems, http://www.ssisc.org/lis/.
[12] ..., Vol. 13, pp. 713-716 (2008) (in Japanese).
[13] ... quadruple precision ... SSE2 ..., Vol. 1, No. 1, pp. 73-84 (2008) (in Japanese).
[14] ... AVX ..., IPSJ SIG Technical Report, Vol. 2012-HPC-135, No. 16, pp. 1-6 (2012) (in Japanese).
[15] Saito, T., Ishiwata, E. and Hasegawa, H.: Development of Quadruple Precision Arithmetic Toolbox QuPAT on Scilab, Proc. International Conference on Computer Science and Applications 2010 (ICCSA 2010), Springer-Verlag, pp. 60-70 (2010).
[16] Saito, T., Ishiwata, E. and Hasegawa, H.: Analysis of the GCR method with mixed precision arithmetic using QuPAT, Journal of Computational Science, Vol. 3, No. 3, pp. 87-91 (2012).
[17] ... Krylov ..., IPSJ SIG Technical Report, Vol. 2012-HPC-133, No. 25, pp. 1-6 (2012) (in Japanese).