DGEMM Using Double-Float Arithmetic on Maxwell GPUs

Daichi Mukunoki (1,a)   Toshiyuki Imamura (1,b)

Abstract: The Maxwell-generation GPUs that NVIDIA released in 2014, the GM107 and GM204, have a double- to single-precision peak performance ratio of only 1:32. On a GM204-based GeForce GTX 980, we implement the BLAS routine DGEMM using double-float (DF) arithmetic, which represents one nearly double-precision value by a pair of single-precision values, and compare it with a DGEMM using native double-precision arithmetic. The DF-based DGEMM achieves roughly twice the performance of the native version.

1. Introduction

IEEE 754-2008 [1] defines the binary32 (single-precision) and binary64 (double-precision) floating-point formats. On NVIDIA GPUs, the ratio of double- to single-precision peak performance has fallen with each generation: 1:2 on the 2010 Fermi architecture and 1:3 on the 2012 Kepler architecture, while the 2014 Maxwell-architecture GM107 and GM204 GPUs are limited to 1:32. The GM107 and GM204 thus offer far lower relative double-precision performance than even the consumer versions of Fermi and Kepler, whose ratios were limited to 1:12 and 1:24 (*1). A similar situation arose on earlier accelerators: the Cell Broadband Engine (Cell/B.E.) used in the PLAYSTATION 3 had a ratio of 1:14.

Dekker [2] showed how to represent a single high-precision value by an unevaluated sum of native-precision values: combining two binary64 values yields quadruple-precision double-double (DD) arithmetic, and combining two binary32 values yields roughly double-precision double-float (DF) arithmetic. On GPUs that provide a fused multiply-add (FMA) instruction, which computes a*b+c in a single instruction, a DF multiply-add can be performed in 16 floating-point instructions. In this paper we implement and evaluate a DF-based DGEMM on the GM204 Maxwell GPU.

1  RIKEN
a) daichi.mukunoki@riken.jp
b) imamura.toshiyuki@riken.jp
*1 Among HPC-oriented GPUs, the Tesla K10, based on the Kepler GK104, also has a double- to single-precision ratio of 1:24.
Fig. 1: The double-float (DF) format compared with binary64. A DF value a = a_hi + a_lo (with |a_lo| <= 0.5 ulp(a_hi)) consists of two binary32 values, each with a 1-bit sign, 8-bit exponent, and 23-bit significand; the pair provides an 8-bit exponent range and 46 stored significand bits. Binary64 has a 1-bit sign, an 11-bit exponent, and a 52-bit significand.

The target GPU is the GeForce GTX 980 [3]. We implement the BLAS matrix-multiplication routine GEMM, C = alpha*AB + beta*C, using DF arithmetic. Because GEMM performs O(N^3) operations on O(N^2) data, the cost of converting between binary64 and DF is comparatively small. DF operators on GPUs have been implemented before [4][5], as have DD-based quadruple-precision GEMM routines [6][7].

2. DF Arithmetic

2.1 DF Format
DF represents one value a by two binary32 values a_hi and a_lo such that a = a_hi + a_lo with |a_lo| <= 0.5 ulp(a_hi) (Fig. 1). Since a binary32 significand has 24 bits including the hidden bit, DF provides an effective significand of 24 x 2 = 48 bits, 5 bits fewer than binary64's 53 bits, while the exponent range remains that of binary32 (8 bits). We define the DF type as a struct of two floats, cudfreal; at 8 bytes it has the same size as binary64. On the GPU, arrays of DF values are stored in Array-of-Structures layout.

2.2 DF Addition and Multiplication
Figures 2 and 3 show the DF addition and multiplication, which are based on the double-double arithmetic of the QD library [8] and the fast ("sloppy") variants discussed in [9]. The multiplication uses the FMA instruction to recover the exact rounding error of the high-part product. In total, a DF multiply-add a*b+c requires 16 floating-point instructions.

2.3 Conversion between Binary64 and DF
A binary64 value is converted to DF with Dekker's SPLIT algorithm [2], shown in Fig. 4. With the constant 536870913 = 2^29 + 1, computing t = (2^29 + 1) * d and h = t - (t - d) extracts the upper 24 bits of d's 53-bit significand, which fit exactly in binary32; the remainder l = d - h carries the lower 29 bits, of which only 24 survive the cast to binary32. The conversion therefore discards the bottom 5 bits, consistent with DF's 48-bit significand.
__device__ __forceinline__ cudfreal DFadd(const cudfreal a, const cudfreal b) {
    cudfreal t, c;
    float v;
    t.x = __fadd_rn(a.x, b.x);
    v   = __fsub_rn(t.x, a.x);
    t.y = __fadd_rn(__fsub_rn(a.x, __fsub_rn(t.x, v)), __fsub_rn(b.x, v));
    t.y = __fadd_rn(t.y, __fadd_rn(a.y, b.y));
    c.x = __fadd_rn(t.x, t.y);
    c.y = __fsub_rn(t.y, __fsub_rn(c.x, t.x));
    return c;
}

Fig. 2: DF addition.

__device__ __forceinline__ cudfreal DFmul(const cudfreal a, const cudfreal b) {
    cudfreal c;
    float t, v, e;
    v   = __fmul_rn(a.x, b.y);
    t   = __fmaf_rn(a.y, b.x, v);
    c.x = __fmaf_rn(a.x, b.x, t);
    e   = __fmaf_rn(a.x, b.x, -c.x);
    c.y = __fadd_rn(e, t);
    return c;
}

Fig. 3: DF multiplication.

__device__ __forceinline__ cudfreal double_to_cudfreal(const double d) {
    double t = __dmul_rn(536870913.0, d);
    double h = __dsub_rn(t, __dsub_rn(t, d));
    double l = __dsub_rn(d, h);
    return make_cudfreal((float)h, (float)l);
}

Fig. 4: Binary64-to-DF conversion.

3. GEMM

We implement two DF-based DGEMM routines in addition to an ordinary binary64 DGEMM. DGEMM-DF(DF) takes input and output matrices already in DF format and computes in DF. DGEMM-D(DF) takes binary64 input and output, as the standard DGEMM does, but internally converts the matrices to DF with double_to_cudfreal (Fig. 4) and computes in DF; it is therefore interface-compatible with the ordinary DGEMM. Since the conversion touches only O(N^2) data while the multiplication performs O(N^3) operations, its cost is asymptotically negligible.

Our kernels follow the blocking strategy of existing CUDA GEMM implementations [10][11][12]. For C = AB with A of size M x K, B of size K x N, and C of size M x N, the computation is 2MNK Flops. Blocking A into BM x BK and B into BK x BN sub-matrices, which are staged in shared memory and multiplied into a BM x BN sub-matrix of C, reduces the number of words read from global memory to MNK(1/BM + 1/BN). With M = N = K and BM = BN, a 2N^3-Flop GEMM reads 2N^3/BN words; that is, blocking shrinks the traffic by a factor of BN.

3.1 Peak Performance of the GeForce GTX 980
The GeForce GTX 980 has 16 SMMs, each with 128 single-precision cores and 4 double-precision units, running at 1216 MHz. Counting an FMA (a*b+c) as 2 Flops, the single-precision peak is 16 x 128 x 1.216 x 2 = 4980.736 GFlops and the double-precision peak is 16 x 4 x 1.216 x 2 = 155.648 GFlops. A DF multiply-add takes 16 single-precision instructions; counting each DF multiplication and addition as 1 Flop, the DF peak is 16 x 128 x 1.216 x 2 / 16 = 311.296 GFlops, twice the binary64 peak.

3.2 Blocking Factor of the Binary64 DGEMM
A multiply-add a*b+c amounts to 2 Flops and, without blocking, reads three 64-bit words from global memory, i.e., (3 x 8) / 2 = 12 Bytes/Flop.
Fig. 5: Blocking in the DF DGEMM kernel (BK = 16, BN = 48). Sub-matrices of A (M x K) and B (K x N) are staged in shared memory and multiplied into a sub-matrix of C held in registers.

The effective memory bandwidth of the GeForce GTX 980 is about 180 GB/s. Sustaining the 156-GFlops binary64 peak without blocking would require 156 x 12 = 1872 GB/s, so the blocking factor must satisfy BN >= 1872 / 180 = 10.4, i.e., BN >= 11. Since a global-memory transaction is 128 bytes, or sixteen 64-bit words, BN and BK are taken as multiples of 16, and we fix BK = 16. Each thread block has 16 x 16 = 256 threads, and each thread computes (BN/16) x (BN/16) elements of C. The register file bounds BN from above: an SMM has 96 KB of shared memory (at most 48 KB per thread block) and 65536 32-bit registers, so with 256 threads per block and 4 resident blocks per SMM, each thread may use 65536 / 256 / 4 = 64 registers. Within this budget the binary64 DGEMM uses BN = 64, with each thread computing a 4 x 4 block of C.

3.3 Blocking Factor of the DF DGEMM
DGEMM-DF(DF) and DGEMM-D(DF) share the same DF GEMM kernel. A DF multiply-add likewise streams three 8-byte values per 2 Flops, i.e., 12 Bytes/Flop, so the 311-GFlops DF peak would require 311 x 12 = 3732 GB/s; hence BN >= 3732 / 180, i.e., BN >= 21. We use BN = 48 with BK = 16 (Fig. 5): a 16 x 16 thread block in which each thread computes a 3 x 3 block of C.

4. Evaluation

4.1 Environment
The evaluation environment is a GeForce GTX 980 hosted by an Intel Core i7-4790 CPU (3.60 GHz, 4 cores) with 16 GB of memory, running CentOS 7.0 (kernel 3.10.0-123.8.1.el7.x86_64).
We used CUDA version 6.5 (*2) with nvcc V6.5.16 and gcc 4.8.2; the kernels were compiled with -arch=sm_52 -O3, where sm_52 targets the GM204. The routines compute C = alpha*AB + beta*C for square N x N matrices with N from 128 to 8192 in steps of 128; alpha, beta, and the matrix elements are random values generated with rand(). Performance is reported in GFlops; for the DF routines, each DF multiplication and addition is counted as 1 Flop, so the Flop count matches that of the binary64 DGEMM. For comparison we measured the SGEMM and DGEMM of CUBLAS 6.5 [13] and MAGMA 1.5.0 [14]; note that MAGMA 1.5.0 is tuned for Kepler, not Maxwell.

*2 The CUDA 6.5 release that supports the GeForce GTX 970/980 is version 6.5.19.

Fig. 6: SGEMM performance on the GeForce GTX 980 (GFlops vs. N; CUBLAS 6.5 and MAGMA 1.5.0).

Fig. 7: DGEMM performance on the GeForce GTX 980 (GFlops vs. N; CUBLAS 6.5, MAGMA 1.5.0, and this work's DGEMM, DGEMM-D(DF), and DGEMM-DF(DF)).

4.2 Results
Figure 6 shows SGEMM performance, Fig. 7 shows DGEMM performance, and Table 1 summarizes the peak and best achieved performance of each routine. CUBLAS SGEMM reaches 89% of the single-precision peak. Our binary64 DGEMM matches the CUBLAS and MAGMA DGEMMs at about 89% of the double-precision peak. DGEMM-DF(DF) attains 89.0% of the DF peak and roughly twice the performance of the binary64 DGEMM, while DGEMM-D(DF) reaches only 71.6%; the difference between the two is the overhead of converting between binary64 and DF. Board power measured with the NVIDIA Management Library (NVML) was similar for the DF-based DGEMM and CUBLAS SGEMM.

5. Conclusion

On the Maxwell-architecture GPUs, whose double- to single-precision performance ratio is 1:32, we implemented the BLAS routine DGEMM using DF arithmetic, whose peak is 1/16 of the single-precision peak. On the GeForce GTX 980, the DF-based DGEMM achieved about twice the performance of the DGEMM using native binary64 arithmetic. The same approach should carry over to other GEMM-like kernels, and applies to other hardware on which double precision is much slower than single precision, such as earlier consumer CUDA GPUs and the Cell/B.E.
Table 1: Peak and achieved GEMM performance on the GeForce GTX 980 (DF routines are measured against the DF peak of Section 3.1).

Routine            Peak           Achieved                 Efficiency
SGEMM (CUBLAS)     4981 GFlops    4448 GFlops (N=3712)     89.3%
SGEMM (MAGMA)      4981 GFlops    2699 GFlops (N=3840)     54.2%
DGEMM (CUBLAS)     155.6 GFlops   139.4 GFlops (N=2944)    89.6%
DGEMM (MAGMA)      155.6 GFlops   138.2 GFlops (N=2944)    88.8%
DGEMM (this work)  155.6 GFlops   138.0 GFlops (N=2944)    88.7%
DGEMM-D(DF)        311.3 GFlops   222.8 GFlops (N=2304)    71.6%
DGEMM-DF(DF)       311.3 GFlops   277.3 GFlops (N=1536)    89.0%

Acknowledgments: This work was supported by JSPS (Grant Number 251290).

References
[1] IEEE Computer Society: IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008, pp. 1-58 (2008).
[2] Dekker, T.: A Floating-Point Technique for Extending the Available Precision, Numerische Mathematik, Vol. 18, pp. 224-242 (1971).
[3] NVIDIA Corporation: Whitepaper: NVIDIA GeForce GTX 980 Featuring Maxwell, http://international.download.nvidia.com/geforcecom/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF (2014).
[4] Graça, G. D. and Defour, D.: Implementation of float-float operators on graphics hardware, Proc. 7th Conference on Real Numbers and Computers (RNC7), pp. 23-32 (2006).
[5] Thall, A.: Extended-Precision Floating-Point Numbers for GPU Computation, ACM SIGGRAPH 2006 Research Posters (2006).
[6] Nakata, M., Takao, Y., Noda, S. and Himeno, R.: A Fast Implementation of Matrix-matrix Product in Double-double Precision on NVIDIA C2050 and Application to Semidefinite Programming, Proc. 3rd International Conference on Networking and Computing (ICNC 2012), pp. 68-75 (2012).
[7] (In Japanese) Implementation and evaluation of triple- and quadruple-precision floating-point arithmetic on GPUs, Vol. 6, No. 41, pp. 66-77 (2013).
[8] Hida, Y., Li, X. S. and Bailey, D. H.: QD (C++/Fortran-90 double-double and quad-double package), http://crd.lbl.gov/~dhbailey/mpdist/.
[9] Nagai, T., Yoshida, H., Kuroda, H. and Kanada, Y.: Fast Quadruple Precision Arithmetic Library on Parallel Computer SR11000/J2, Proc. 8th International Conference on Computational Science, Part I (ICCS '08), pp. 446-455 (2008).
[10] Nath, R., Tomov, S. and Dongarra, J.: An Improved MAGMA GEMM for Fermi GPUs, University of Tennessee Computer Science Technical Report, No. UT-CS-10-655 (2010).
[11] Tan, G., Li, L., Triechle, S., Phillips, E., Bao, Y. and Sun, N.: Fast Implementation of DGEMM on Fermi GPU, Proc. 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), No. 35, pp. 1-11 (2011).
[12] Lai, J. and Seznec, A.: Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs, Proc. 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '13), pp. 1-10 (2013).
[13] NVIDIA Corporation: CUBLAS Library (included in CUDA Toolkit), https://developer.nvidia.com/cublas.
[14] University of Tennessee: Matrix Algebra on GPU and Multicore Architectures (MAGMA), http://icl.cs.utk.edu/magma/.