Double-Double arithmetic on the FX10 with SIMD and FMA: an IEEE 754 based a.hi / a.lo representation following Knuth [3] and Dekker [4], applied to SpMV (sparse matrix and vector product) [5].

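Bailey's Double-Double (DD) arithmetic [1] represents one high-precision value a as the sum of two IEEE 754 double-precision numbers, a = a.hi + a.lo, and builds its addition and multiplication on the algorithms of Knuth [3] and Dekker [4]. A DD value has a 52 × 2 = 104-bit significand and the 11-bit exponent of double precision, whereas the 128-bit IEEE 754 quadruple-precision format has a 112-bit significand and a 15-bit exponent.

DD multiplication needs far fewer operations when a fused multiply-add (FMA) instruction is available, and FMA has already been applied to DD arithmetic in [5]. The K computer [6] and the FX10 provide FMA together with 128-bit SIMD (Single Instruction Multiple Data) instructions. This paper applies FMA and SIMD to DD vector operations and to SpMV (sparse matrix and vector product) on the FX10.

a) em49@ns.kogakuin.ac.jp
b) fujii@cc.kogakuin.ac.jp
c) teru@cc.kogakuin.ac.jp

© 2015 Information Processing Society of Japan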

DD addition:

    DD_ADD(a, b, c) { // a = b + c
      sh = b.hi + c.hi;
      th = sh - b.hi;
      tl = sh - th;
      th = c.hi - th;
      tl = b.hi - tl;
      eh = tl + th;
      eh = eh + b.lo;
      eh = eh + c.lo;
      a.hi = sh + eh;
      a.lo = a.hi - sh;
      a.lo = eh - a.lo;
    }

DD multiplication without FMA (nofma):

    DD_MUL(a, b, c) { // a = b * c
      sp = 134217729.; // 2^27 + 1, splitting constant
      p1 = b.hi * c.hi;
      tq = sp * b.hi;
      bh = tq - (tq - b.hi);
      bl = b.hi - bh;
      tq = sp * c.hi;
      ch = tq - (tq - c.hi);
      cl = c.hi - ch;
      p2 = bh * ch - p1;
      p2 = bh * cl + p2;
      p2 = bl * ch + p2;
      p2 = bl * cl + p2;
      p2 = b.hi * c.lo + p2;
      p2 = b.lo * c.hi + p2;
      a.hi = p1 + p2;
      a.lo = p2 - (a.hi - p1);
    }

DD multiplication with FMA:

    DD_MUL(a, b, c) { // a = b * c
      p1 = b.hi * c.hi;
      p2 = b.hi * c.hi - p1; // one FMA
      p2 = b.hi * c.lo + p2; // one FMA
      p2 = b.lo * c.hi + p2; // one FMA
      a.hi = p1 + p2;
      a.lo = p2 - (a.hi - p1);
    }

The processor of the FX10, the SPARC64 IXfx, has 256 floating-point registers and two floating-point units, FLA and FLB, both of which execute FMA instructions [7]. Its HPC-ACE extension provides 128-bit SIMD [9]. Without FMA, DD multiplication takes 25 steps. With FMA, the splitting of b.hi and c.hi is no longer needed and the multiplication takes 7 steps, so the step count falls by a factor of 25/7.
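As an illustration of how the DD_ADD and DD_MUL listings fit together, the following is a minimal scalar C sketch, not the authors' implementation. The names dd_t, dd_add, two_prod and dd_mul are hypothetical. It assumes IEEE 754 double arithmetic without extended intermediate precision, and it selects the FMA path only when the standard macro FP_FAST_FMA reports a hardware FMA, falling back to the splitting with 2^27 + 1 otherwise.

```c
#include <math.h>   /* fma(), FP_FAST_FMA */

/* Hypothetical type name: a DD value is the unevaluated sum hi + lo. */
typedef struct { double hi, lo; } dd_t;

/* a = b + c, using the same 11 operations as DD_ADD. */
static dd_t dd_add(dd_t b, dd_t c)
{
    double sh = b.hi + c.hi;
    double th = sh - b.hi;
    double tl = sh - th;
    th = c.hi - th;
    tl = b.hi - tl;
    double eh = tl + th;          /* rounding error of b.hi + c.hi */
    eh += b.lo;
    eh += c.lo;
    dd_t a;
    a.hi = sh + eh;
    a.lo = eh - (a.hi - sh);
    return a;
}

/* Returns p = fl(x * y) and stores in *e the error such that p + *e == x * y
   (barring overflow and underflow). */
static double two_prod(double x, double y, double *e)
{
    double p = x * y;
#ifdef FP_FAST_FMA
    *e = fma(x, y, -p);           /* one FMA replaces the whole splitting */
#else
    const double sp = 134217729.0; /* 2^27 + 1 */
    double tq = sp * x;
    double xh = tq - (tq - x);
    double xl = x - xh;
    tq = sp * y;
    double yh = tq - (tq - y);
    double yl = y - yh;
    double t = xh * yh - p;
    t = xh * yl + t;
    t = xl * yh + t;
    t = xl * yl + t;
    *e = t;
#endif
    return p;
}

/* a = b * c, following DD_MUL. */
static dd_t dd_mul(dd_t b, dd_t c)
{
    double p2;
    double p1 = two_prod(b.hi, c.hi, &p2);
    p2 = b.hi * c.lo + p2;
    p2 = b.lo * c.hi + p2;
    dd_t a;
    a.hi = p1 + p2;
    a.lo = p2 - (a.hi - p1);
    return a;
}
```

Adding 2^-60 to 1.0 with dd_add keeps the small term in the lo part instead of rounding it away, which is the property the DD representation relies on.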

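A DD vector can be stored as AoS (Array of Structure) or as SoA (Structure of Array). In AoS, the hi and lo parts of each element are adjacent in memory (A.hi, A.lo, B.hi, B.lo), so a 128-bit SIMD load returns both parts of a single element and a shuffle is needed before hi parts and lo parts can be processed in separate registers. In SoA, the hi parts and the lo parts are stored in separate arrays and no shuffle is needed. SoA is the layout used in Lis [10]. AoS is considered here in connection with the fast dd library on the FX10 [8].

4. Experiments

The experiments use the K computer and the FX10. A node of K has a SPARC64 VIIIfx at 2.0 GHz (8 cores, 32 KiB L1 cache per core, 6 MiB L2 cache) and DDR3 SDRAM with a bandwidth of 64 GB/s. A node of the FX10 has a SPARC64 IXfx at 1.848 GHz (16 cores, 32 KiB L1 cache per core, 12 MiB L2 cache) and DDR3 SDRAM with a bandwidth of 85 GB/s. The code is written in C and parallelized with OpenMP.

Table 1  Experimental environment

                          K                   FX10
    Processor             SPARC64 VIIIfx      SPARC64 IXfx
    Frequency             2.0 GHz             1.848 GHz
    Number of Core        8                   16
    Number of Register    256                 256
    L1 Cache per core     32 KB               32 KB
    L2 Cache              6 MB                12 MB
    Memory                DDR3 SDRAM          DDR3 SDRAM
    Memory Size           16 GB               32 GB
    Memory Bandwidth      64 GB/s             85 GB/s
    Compiler              Fujitsu Compiler (fccpx)
    Vectorization         HPC-ACE
    Options               -Kopenmp -Kprefetch_conditional -Kdalign -Knoeval -O
    Options (nofma)       -Kopenmp -Kprefetch_conditional -Kdalign -Knoeval -O -no-fma

Table 2  Vector operations (x, y, z are vectors; α and val are scalars)

    Name     Operation        step nofma (fma)
    axpy     y = αx + y       36 (18)
    axpyz    z = αx + y       36 (18)
    xpay     y = x + αy       36 (18)
    scale    x = αx           25 (7)
    dot      val = x · y      36 (18)
    nrm2     val = ||x||_2

Six implementations are compared. The prefix lis denotes the implementation in Lis, lisbased denotes an implementation based on Lis, and fastdd denotes fast dd. The suffix scalar or simd indicates whether SIMD is used, and nofma or fma indicates whether FMA is used. For scale, which consists of a single DD multiplication, FMA reduces the step count from 25 to 7.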
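The difference between the two layouts can be made concrete with a small C sketch. The type and function names (dd_aos_t, dd_soa_t, dd_aos_to_soa, dd_soa_negate) are hypothetical and are not taken from Lis or fast dd. The sketch only shows where hi and lo parts sit in memory, and why an SoA loop reads each array with unit stride.

```c
#include <stddef.h>

/* AoS: the hi and lo parts of one element are adjacent in memory. */
typedef struct { double hi, lo; } dd_aos_t;

/* SoA: all hi parts in one array, all lo parts in another.
   The storage is owned by the caller. */
typedef struct { double *hi; double *lo; size_t n; } dd_soa_t;

/* Copy dst->n elements from an AoS array into SoA storage. */
static void dd_aos_to_soa(const dd_aos_t *src, dd_soa_t *dst)
{
    for (size_t i = 0; i < dst->n; ++i) {
        dst->hi[i] = src[i].hi;
        dst->lo[i] = src[i].lo;
    }
}

/* Negate a DD vector in place. Each statement walks one array with unit
   stride, so two consecutive hi (or lo) values can fill one 128-bit
   register without a shuffle. */
static void dd_soa_negate(dd_soa_t *x)
{
    for (size_t i = 0; i < x->n; ++i) {
        x->hi[i] = -x->hi[i];
        x->lo[i] = -x->lo[i];
    }
}
```

With AoS, the same negation would load {hi, lo} of one element per 128-bit register, which is harmless for negation but forces shuffles in DD_ADD and DD_MUL, where hi and lo parts follow different instruction sequences.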

Columns of Tables 3 and 4: (1) lis scalar nofma, (2) lis scalar fma, (3) lisbased simd nofma, (4) lisbased simd fma, (5) fastdd scalar, (6) fastdd simd.

Table 3  L ( 6MiB)  Time [ms] (speed up ratio)

             (1)         (2)          (3)          (4)          (5)          (6)
    scale    .5 (.)      .4 (.)       .9 (.6)      .9 (5.98)    .96 (.5)     .4 (.48)
    axpy     .48 (.)     . (.5)       .8 (.75)     . (4.6)      .9 (.4)      .4 (.4)
    xpay     .48 (.)     . (.5)       .8 (.7)      . (4.64)     . (.4)       .4 (.)
    axpyz    .48 (.)     . (.5)       .7 (.77)     . (4.7)      . (.4)       .4 (.)
    dot      .95 (.)     .66 (.44)    .7 (.59)     . (4.9)      .7 (.44)     .86 (.)
    nrm2     .84 (.)     .65 (.8)     . (.7)       .8 (4.6)     .6 (.4)      .8 (.4)

Table 4  L ( GB)  Time [s] (speed up ratio)

             (1)         (2)          (3)          (4)          (5)          (6)
    scale    .8 (.)      .4 (.7)      . (.64)      . (.6)       .5 (.5)      .5 (.49)
    axpy     . (.)       .7 (.5)      .4 (.74)     . (.79)      .4 (.4)      .9 (.)
    xpay     . (.)       .7 (.5)      .4 (.74)     . (.7)       . (.4)       .9 (.)
    axpyz    . (.)       .7 (.5)      .4 (.7)      . (.69)      . (.4)       .9 (.)
    dot      .5 (.)      . (.45)      .6 (.6)      . (4.7)      .4 (.44)     . (.)
    nrm2     . (.)       . (.4)       .5 (.7)      .4 (.7)      . (.4)       . (.4)

With both FMA and SIMD, scale reaches a speed-up ratio of 5.98 over lis scalar nofma in Table 3.

Performance as a function of the vector size is computed as

$\text{performance [GFlops]} = \text{VectorSize} / \text{time} \times 10^{-9}$

[Figure: performance [GFlops] versus Vector Size for scale (x = αx). Series: lisbased - simd - fma, lisbased - simd - nofma, lis - scalar - fma, lis - scalar - nofma, fastdd - simd, fastdd - scalar.]

The performance of scale, axpy, xpay, axpyz, dot and nrm2 depends on whether the vectors fit in the L2 cache of the FX10.

SpMV with FMA and SIMD on the FX10 uses the CRS (Compressed Row Storage) format [12]. CRS stores a matrix A with nnz nonzero elements and row rows in three arrays, value, index and pointer. The arrays value and index hold the nonzero elements and their column indices and each have nnz entries. The array pointer holds the position in value and index at which each row starts, and has row + 1 entries.
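The three CRS arrays can be illustrated with a short C sketch that builds them from a small dense matrix. The struct crs_t and the functions crs_from_dense and crs_free are hypothetical names chosen for this example; only the array names value, index and pointer and the counts nnz and row follow the description of CRS.

```c
#include <stdlib.h>

/* Hypothetical container for a CRS matrix. */
typedef struct {
    int row;         /* number of rows */
    int nnz;         /* number of nonzero elements */
    double *value;   /* nnz nonzero values, row by row */
    int *index;      /* nnz column indices, one per value */
    int *pointer;    /* row + 1 offsets; row i spans pointer[i]..pointer[i+1]-1 */
} crs_t;

/* Build a CRS matrix from a row-major dense n x m array.
   Returns 0 on success and -1 if an allocation fails. */
static int crs_from_dense(const double *dense, int n, int m, crs_t *A)
{
    int nnz = 0;
    for (int k = 0; k < n * m; ++k)
        if (dense[k] != 0.0)
            ++nnz;

    size_t cap = nnz > 0 ? (size_t)nnz : 1;   /* avoid a zero-size malloc */
    A->row = n;
    A->nnz = nnz;
    A->value = malloc(cap * sizeof *A->value);
    A->index = malloc(cap * sizeof *A->index);
    A->pointer = malloc(((size_t)n + 1) * sizeof *A->pointer);
    if (!A->value || !A->index || !A->pointer) {
        free(A->value); free(A->index); free(A->pointer);
        return -1;
    }

    int k = 0;
    for (int i = 0; i < n; ++i) {
        A->pointer[i] = k;                    /* first nonzero of row i */
        for (int j = 0; j < m; ++j) {
            double v = dense[i * m + j];
            if (v != 0.0) {
                A->value[k] = v;
                A->index[k] = j;
                ++k;
            }
        }
    }
    A->pointer[n] = k;                        /* equals nnz */
    return 0;
}

static void crs_free(crs_t *A)
{
    free(A->value); free(A->index); free(A->pointer);
}
```

An empty row shows up as two equal consecutive entries in pointer, which is why pointer needs row + 1 entries rather than row.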

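SIMD SpMV:

    DD_SpMV(A, x, y) { // y = A * x
      for (i = 0; i < A.row; ++i) {
        js = A.ptr[i];
        je = A.ptr[i+1];
        vy = _mm_setzero_pd();
        for (j = js; j < je; j += 2) {
          va = _mm_load_pd(&A.val[j]);
          vx = _mm_set_pd(x[A.index[j+1]], x[A.index[j]]);
          DD_MUL(tmp, va, vx);
          DD_ADD(vy, vy, tmp);
        }
        y[i] = reduction(vy);
        fraction_padding();
      }
    }

In SpMV, value is read contiguously while x is read indirectly through index. With 128-bit SIMD, two consecutive elements of value are loaded at once, and the two corresponding elements of x are packed into one register with a SET operation.

In DD-SpMV the matrix A is double precision and the vectors x and y are DD: y_DD = A_D x_DD. Because A has only a hi part (A.hi), the terms of the FMA version of DD_MUL that involve the lo part of A are not needed. After the inner loop, reduction() sums the two SIMD lanes of vy into y[i], and the fraction processing handles the element that is left over when a row has an odd number of nonzero elements.

The test matrices are taken from The Univ. of Florida Sparse Matrix Collection [13]. In addition, band matrices are generated in which A[i][j] = value if |j - i| lies within the band and A[i][j] = 0 otherwise, so that nnz/row can be varied. Performance is computed as

$\text{performance [GFlops]} = \text{nnz} / \text{time} \times 10^{-9}$

[Figure: DD-SpMV performance on the band matrices. performance [GFlops] versus nnz/row. Series: CRS, CRS - u, CRS - u4, CRS - u6.]

The instruction mix of DD-SpMV consists of floating-point steps executed on FLA/FLB together with Load, Store and Branch instructions. Besides the floating-point steps, the cost of fraction processing and reduction() is paid once per row, so it weighs more heavily when nnz/row is small. The variants CRS - u, CRS - u4 and CRS - u6 are compared with plain CRS.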
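For reference, a scalar version of y_DD = A_D x_DD over CRS can be sketched in C as follows. This is not the SIMD kernel of the listing: it processes one nonzero at a time, so it needs neither reduction() nor fraction processing. The names dd_t, dd_add, dd_mul_d and dd_spmv_crs are hypothetical, IEEE 754 double arithmetic without extended intermediate precision is assumed, and the FMA path is taken only when the standard macro FP_FAST_FMA is defined.

```c
#include <math.h>   /* fma(), FP_FAST_FMA */

typedef struct { double hi, lo; } dd_t;   /* value = hi + lo */

/* a = b + c, following DD_ADD. */
static dd_t dd_add(dd_t b, dd_t c)
{
    double sh = b.hi + c.hi;
    double th = sh - b.hi;
    double tl = sh - th;
    th = c.hi - th;
    tl = b.hi - tl;
    double eh = tl + th;
    eh += b.lo;
    eh += c.lo;
    dd_t a;
    a.hi = sh + eh;
    a.lo = eh - (a.hi - sh);
    return a;
}

/* r = a * x with a in double and x in DD: the a.lo terms of DD_MUL vanish. */
static dd_t dd_mul_d(double a, dd_t x)
{
    double p1 = a * x.hi;
    double p2;
#ifdef FP_FAST_FMA
    p2 = fma(a, x.hi, -p1);               /* exact error of a * x.hi */
#else
    const double sp = 134217729.0;        /* 2^27 + 1 */
    double tq = sp * a;
    double ah = tq - (tq - a);
    double al = a - ah;
    tq = sp * x.hi;
    double xh = tq - (tq - x.hi);
    double xl = x.hi - xh;
    p2 = ah * xh - p1;
    p2 = ah * xl + p2;
    p2 = al * xh + p2;
    p2 = al * xl + p2;
#endif
    p2 = a * x.lo + p2;
    dd_t r;
    r.hi = p1 + p2;
    r.lo = p2 - (r.hi - p1);
    return r;
}

/* y = A * x for a CRS matrix given by ptr (row + 1 entries), index and val. */
static void dd_spmv_crs(int row, const int *ptr, const int *index,
                        const double *val, const dd_t *x, dd_t *y)
{
    for (int i = 0; i < row; ++i) {
        dd_t acc = {0.0, 0.0};
        for (int j = ptr[i]; j < ptr[i + 1]; ++j)
            acc = dd_add(acc, dd_mul_d(val[j], x[index[j]]));
        y[i] = acc;
    }
}
```

A SIMD version along the lines of the listing would run two such accumulations side by side in the two lanes of a 128-bit register and combine them at the end of each row.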

[Figure: DD-SpMV relative performance. relative performance versus Matrix Number (sorting by relative performance of CRS - u6). Series: CRS, CRS - u, CRS - u4, CRS - u6.]

[Figure: DD-SpMV performance. performance [GFlops] versus Matrix Number (sorting by performance of CRS). Series: CRS, CRS - u, CRS - u4, CRS - u6.]

For the matrices of the collection, the gain of CRS - u6 over CRS depends on nnz/row.

5. Conclusion

This paper accelerated DD vector operations and DD-SpMV on the FX10 using FMA and SIMD. The effect on DD-SpMV depends on nnz/row.

Acknowledgments: AICS (HPC), JSPS.

References

[1] Bailey, D. H., High-Precision Floating-Point Arithmetic in Scientific Computation, Computing in Science and Engineering, pp. 54-61 (2005).
[2] Hasegawa, H., Utilizing the Quadruple-Precision Floating-Point Arithmetic Operation for the Krylov Subspace Methods, The 8th SIAM Conference on Applied Linear Algebra (2003).
[3] Knuth, D. E., The Art of Computer Programming:

Seminumerical Algorithms, Vol. 2, Addison-Wesley (1969).
[4] Dekker, T., A floating-point technique for extending the available precision, Numerische Mathematik, Vol. 18, pp. 224-242 (1971).
[5] ,,,, AVX BCRS, (ACS), Vol. 7, No. 4, pp.5- (2014).
[6] FUJITSU, Super Computer K, http://jp.fujitsu.com/about/tech/k/.
[7] FUJITSU, SPARC64 IXfx Extensions, http://img.jp.fujitsu.com/downloads/jp/jhpc/sparc64ixfxextensionsj.pdf.
[8] FUJITSU, 4 ().
[9] FUJITSU, C++ PRIMEHPC FX10, pp.48-6 ().
[10] SSI, Lis, http://www.ssisc.org/lis/index.ja.html.
[11] ,.,, pp.9-4 ().
[12] R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, pp. 57-65 (1994).
[13] The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/.