Performance improvement of iterative solver using bit-compression for a sparse matrix

E6-

Kenji Ono, RIKEN AICS, 7-1-26 Minatojima-minami-cho, Chuo-ku, Kobe, Japan, E-mail: keno@riken.jp

Copyright © by JSFM

A novel bit-representation/compression technique is proposed to enhance the performance of iterative methods for large sparse matrices. The technique is applied to the implementation of iterative kernels with Dirichlet and Neumann boundary conditions. Its first advantage is that it reduces memory traffic from main memory and makes effective use of SIMD units and cache. Second, the proposed implementation replaces if-branch statements with mask operations on the bit expression, which promotes code optimization at both compile time and run time. The Red-Black SOR and BiCGstab algorithms are used to investigate the proposed implementation. The proposed approach runs faster than a naïve implementation on both Intel and Fujitsu Sparc architectures.

The pressure Poisson equation derived from the Navier-Stokes equations is

    \nabla \cdot (\nabla p) = \mathrm{div}\left( u / \Delta t \right) \equiv \phi,    (1)

where p is the pressure, u the velocity, and \phi the source term. Neumann and Dirichlet boundary conditions are expressed with a Heaviside-type mask function

    H = \begin{cases} 0 & \text{(boundary condition)} \\ 1 & \text{(fluid)} \end{cases}    (2)

With the Neumann mask, the discrete equation at a cell reads

    \sum_n (p H)_n = h^2 \phi + \Bigl( \sum_n H_n \Bigr) p,    (4)

where h is the grid spacing, the sum runs over the neighboring cells n, and H_n is the mask on the face toward neighbor n. Because the Neumann condition is carried by the Heaviside mask, no if-branch is needed and the loop is suitable for SIMD execution. The Dirichlet condition is handled in the same way with a second mask H^D (Fig. 1). The masked discrete equations form the linear system Ax = b.
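The replacement of an if-branch by a mask can be illustrated with a minimal C sketch. It is not the paper's code: the function names, the `is_wall` flag array, and the per-face array layout are assumptions made for this example only. Both functions sum the face differences p_n - p of one cell; the first skips boundary faces with a branch, the second multiplies each term by H_n (0.0 on a boundary face, 1.0 on a fluid face).

```c
#include <assert.h>
#include <stddef.h>

/* Branching form: skip the contribution of every face that carries a
   Neumann (zero-gradient) condition. is_wall[] is a hypothetical flag array. */
double flux_sum_branch(const double *pn, const int *is_wall, double p, size_t nfaces)
{
    assert(pn != NULL && is_wall != NULL);
    double s = 0.0;
    for (size_t n = 0; n < nfaces; ++n) {
        if (!is_wall[n]) {
            s += pn[n] - p;
        }
    }
    return s;
}

/* Masked form: H[n] is 0.0 on a boundary face and 1.0 on a fluid face,
   so the loop body contains no branch and is easier to vectorize. */
double flux_sum_masked(const double *pn, const double *H, double p, size_t nfaces)
{
    assert(pn != NULL && H != NULL);
    double s = 0.0;
    for (size_t n = 0; n < nfaces; ++n) {
        s += H[n] * (pn[n] - p);
    }
    return s;
}
```

For a cell with one Neumann face, both forms return the same sum; the masked form trades one extra multiplication per face for a branch-free loop body.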

Fig. 1: Neumann and Dirichlet boundary conditions for cell i in two dimensions. A Neumann condition is applied at the west cell face, which is solidly shaded. A Dirichlet condition is employed at the east cell face, where the boundary value is given by the pressure p_{i+1}.

The coefficients and boundary flags of each cell are packed into the bits of one integer (Fig. 2): the diagonal (Diag), the non-diagonal coefficients (dag x), and the Neumann and Dirichlet (D x) boundary flags.

Encoding;

    inline int onbit (int idx, const int s) { return ( idx | (0x1<<s) ); }

Decoding;

    #define BIT_SHIFT(a,b) ( (a >> b) & 0x1 )

Red-Black SOR

In the naïve Red-Black SOR (RB-SOR) kernel, written in Fortran, the coefficients are held in the array pn(i,j,k,n).

[Naive code]

    do color=0,1
    do k=1,kx
    do j=1,jx
    do i=1+mod(k+j+color,2), ix, 2
      c_w = pn(i,j,k,1)
      c_e = pn(i,j,k,2)
      c_s = pn(i,j,k,3)
      c_n = pn(i,j,k,4)
      c_b = pn(i,j,k,5)
      c_t = pn(i,j,k,6)
      dd  = pn(i,j,k,7)
      pp  = p(i,j,k)
      ss  = c_e * p(i+1,j,k  ) + c_w * p(i-1,j,k  ) &
          + c_n * p(i,j+1,k  ) + c_s * p(i,j-1,k  ) &
          + c_t * p(i,j,k+1) + c_b * p(i,j,k-1)
      dp  = ( (ss + b(i,j,k) ) / dd - pp ) * omg
      p(i,j,k) = pp + dp
      res = res + dble(dp*dp) * dble( ibits(bp(i,j,k), Active, 1) )
    end do
    end do
    end do
    end do

[Bit-reps code]

    do color=0,1
    do k=1,kx
    do j=1,jx
    do i=1+mod(k+j+color,2), ix, 2
      idx = bp(i,j,k)
      c_e = real( ibits(idx, _dag_e, 1) )
      c_w = real( ibits(idx, _dag_w, 1) )
      c_n = real( ibits(idx, _dag_, 1) )
      c_s = real( ibits(idx, _dag_s, 1) )
      c_t = real( ibits(idx, _dag_t, 1) )
      c_b = real( ibits(idx, _dag_b, 1) )
      d0  = real( ibits(idx, _Diag+0, 1) )
      d1  = real( ibits(idx, _Diag+1, 1) )
      d2  = real( ibits(idx, _Diag+2, 1) )
      dd  = d2*4.0 + d1*2.0 + d0
      pp  = p(i,j,k)
      ss  = c_e * p(i+1,j,k  ) + c_w * p(i-1,j,k  ) &
          + c_n * p(i,j+1,k  ) + c_s * p(i,j-1,k  ) &
          + c_t * p(i,j,k+1) + c_b * p(i,j,k-1)
      dp  = ( (ss + b(i,j,k) ) / dd - pp ) * omg
      p(i,j,k) = pp + dp
      res = res + dble(dp*dp) * dble( ibits(idx, Active, 1) )
    end do
    end do
    end do
    end do

The naïve code reads the arrays p, b, bp, and pn, whereas the bit-reps code reads only p, b, and bp; the activeness of a cell is taken from ibits(bp(i,j,k), Active, 1) in both. The operational intensity (Flop/Byte) of each kernel follows from these loads and stores, counting the accesses to p at i-1, i, i+1, j-1, j+1, k-1, and k+1. The resulting values are summarized in Table 2.
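The bit-reps kernel can be sketched in C on top of the encode/decode helpers quoted in the text. This is a minimal sketch under stated assumptions, not the FFV-C implementation: the bit positions in the `enum`, the ghost-layer index helper `at`, the function `rbsor_bitreps`, and the demo function are all hypothetical and chosen for illustration, and the real layout is the one in Fig. 2. The diagonal is read here as a 3-bit field in one shift-and-mask, which is equivalent to combining three single-bit reads.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bit positions, chosen for this sketch only. */
enum {
    DIAG = 0,                       /* bits 0-2: diagonal value 0..6 */
    NDAG_W = 3, NDAG_E = 4, NDAG_S = 5,
    NDAG_N = 6, NDAG_B = 7, NDAG_T = 8,
    ACTIVE = 31
};

/* Encoding: set bit s of idx. */
static inline unsigned onbit(unsigned idx, int s) { return idx | (0x1u << s); }

/* Decoding: extract bit b of a as 0 or 1. */
#define BIT_SHIFT(a, b) (((a) >> (b)) & 0x1u)

/* Linear index into an array with one ghost layer on every side. */
static inline size_t at(int i, int j, int k, int nx, int ny)
{
    return ((size_t)k * (size_t)(ny + 2) + (size_t)j) * (size_t)(nx + 2) + (size_t)i;
}

/* One red-black SOR sweep with bit-decoded coefficients. Returns the sum of
   squared updates over active cells. Assumes a nonzero diagonal in every cell. */
double rbsor_bitreps(double *p, const double *b, const unsigned *bp,
                     int nx, int ny, int nz, double omg)
{
    double res = 0.0;
    for (int color = 0; color < 2; ++color)
        for (int k = 1; k <= nz; ++k)
            for (int j = 1; j <= ny; ++j)
                for (int i = 1 + (k + j + color) % 2; i <= nx; i += 2) {
                    const size_t m = at(i, j, k, nx, ny);
                    const unsigned idx = bp[m];
                    const double c_w = (double)BIT_SHIFT(idx, NDAG_W);
                    const double c_e = (double)BIT_SHIFT(idx, NDAG_E);
                    const double c_s = (double)BIT_SHIFT(idx, NDAG_S);
                    const double c_n = (double)BIT_SHIFT(idx, NDAG_N);
                    const double c_b = (double)BIT_SHIFT(idx, NDAG_B);
                    const double c_t = (double)BIT_SHIFT(idx, NDAG_T);
                    const double dd = (double)((idx >> DIAG) & 0x7u);
                    const double pp = p[m];
                    const double ss = c_e * p[at(i + 1, j, k, nx, ny)]
                                    + c_w * p[at(i - 1, j, k, nx, ny)]
                                    + c_n * p[at(i, j + 1, k, nx, ny)]
                                    + c_s * p[at(i, j - 1, k, nx, ny)]
                                    + c_t * p[at(i, j, k + 1, nx, ny)]
                                    + c_b * p[at(i, j, k - 1, nx, ny)];
                    const double dp = ((ss + b[m]) / dd - pp) * omg;
                    p[m] = pp + dp;
                    res += dp * dp * (double)BIT_SHIFT(idx, ACTIVE);
                }
    return res;
}

/* Demo: 3x3x3 cells, all six coefficients 1, diagonal 6, p = 0, b = 6, omg = 1.
   Returns the value of the center cell after one sweep. */
double demo_center_after_one_sweep(void)
{
    enum { N = 3, TOTAL = (N + 2) * (N + 2) * (N + 2) };
    double *p = calloc(TOTAL, sizeof *p);
    double *b = calloc(TOTAL, sizeof *b);
    unsigned *bp = calloc(TOTAL, sizeof *bp);
    assert(p && b && bp);
    unsigned cell = 6u << DIAG;
    for (int s = NDAG_W; s <= NDAG_T; ++s) cell = onbit(cell, s);
    cell = onbit(cell, ACTIVE);
    for (int n = 0; n < TOTAL; ++n) { b[n] = 6.0; bp[n] = cell; }
    (void)rbsor_bitreps(p, b, bp, N, N, N, 1.0);
    const double center = p[at(2, 2, 2, N, N)];
    free(p); free(b); free(bp);
    return center;
}
```

In the demo, the first color pass sets every visited cell to (0 + 6)/6 = 1, and the center cell, visited in the second pass, sees six updated neighbors and becomes (6 + 6)/6 = 2. Only one integer per cell is loaded for all coefficients, which is the source of the reduced memory traffic.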

Fig. 2: Bit representation. Several bits required for the bit-representation are encoded into this array. This example includes the diagonal (Diag), non-diagonal (dag x), Neumann boundary (x), Dirichlet boundary (D x), cell state (State), and activeness (Active) of a cell. Other bits are used for more complicated processes. Fields shown in the figure: _Diag (~6), _dag_e, _dag_w, _dag_s, _dag, dag_t, _dag_b, W, E, S, B, T, _D_W, _D_S, _D_E, _D_T, _D_B, _D_, State, Active.

Tab. 1: Specification of evaluation machines. TRIAD scores are measured by the STREAM benchmark (11).
Columns: Architecture, Clock (GHz), CPU, Peak, Cache (MB), Memory (GB), Theoretical BW (GB/s), TRIAD (GB/s).
Rows: Xeon X6, Xeon E-67, Xeon E-68, Sparc VIIIfx, Sparc IXfx.

Tab. 2: Comparison of characteristics for the two types of implementation.
Columns: Naïve, Bit-Reps.
Rows: Memory Requirement (unit), Load & Store, Arithmetic, F/B.

Performance is measured with the Performance Monitor library (PMlib) (8), (10) and PAPI. The kernels are evaluated on the Intel Xeon processors and on the Fujitsu Venus VIIIfx and IXfx processors, with attention to the F/B ratio of each implementation and to the SIMD execution of ibits(a, b, 1).
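The relation between the F/B ratio and the Roofline bound can be written out as a short C sketch. All names and numbers here are illustrative assumptions, not values from Table 1 or Table 2: the sketch counts one access per array per cell (p, b, bp plus seven coefficient planes of pn for the naïve kernel, and p, b, bp for the bit-reps kernel), ignores cache reuse and the store of p, and takes the per-cell flop count as a parameter.

```c
#include <assert.h>

/* Bytes streamed per cell when each of n_arrays is touched once.
   A simplified count: cache reuse and write traffic are ignored. */
double bytes_per_cell(int n_arrays, int bytes_per_element)
{
    assert(n_arrays > 0 && bytes_per_element > 0);
    return (double)n_arrays * (double)bytes_per_element;
}

/* Operational intensity in Flop/Byte for one cell update. */
double operational_intensity(double flops, double bytes)
{
    assert(bytes > 0.0);
    return flops / bytes;
}

/* Roofline bound: the smaller of the compute peak and
   memory bandwidth times operational intensity. */
double attainable_gflops(double peak_gflops, double bw_gbs, double oi)
{
    const double mem_bound = bw_gbs * oi;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

As a usage example with made-up numbers, a kernel of 20 flops per cell that streams ten 4-byte arrays has an intensity of 20/40 = 0.5 Flop/Byte, and on a hypothetical machine with a 100 GFLOPS peak and 40 GB/s bandwidth it is bounded at 20 GFLOPS. Streaming three arrays instead of ten raises the intensity and moves the memory-bound limit up by the same factor.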

Fig. 3: Performance analysis using the Roofline model for the naïve and bit-reps implementations. Axes: Attainable Performance versus Operational Intensity (Flops/Byte). Series: Westmere, Sparc VIIIfx, and Sparc IXfx, each with a naïve and a bit-reps point.

Fig. 4: Measured performance of the FFV-C code (9) on the K computer with 82,944 nodes. Each node has 8 cores. Axes: GFLOPS versus Number of Processes. Series: Ideal, FFV-C.

References

(1) Williams, S., Waterman, A. and Patterson, D.: Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM, Vol. 52 No. 4 (2009) 65-76
(2) Yokokawa, M.: Vector-Parallel Processing of the Successive Overrelaxation Method. Japan Atomic Energy Research Institute JAERI-M Report No. 88-017 (1988) in Japanese
(3) Willcock, J. and Lumsdaine, A.: Accelerating Sparse Matrix Computations via Data Compression. Proc. 20th Annual ICS '06 (2006) 307-316
(4) Tang, W. T., et al.: Accelerating Sparse Matrix-vector Multiplication on GPUs Using Bit-representation-optimized Schemes. Proc. of SC13, 26 (2013)
(5) Van der Vorst, H. A.: Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems. SIAM J. Sci. and Stat. Comput. Vol. 13 No. 2 (1992) 631-644
(6) Ono, K. and Kawashima, Y.: Multicolor SOR Method with Consecutive Memory Access Implementation in a Shared and Distributed Memory Parallel Environment. Lecture Notes in Computational Science and Engineering, Vol. 74 (2010)
(7) Ono, K., Chiba, S., Inoue, S., and Minami, K.: Performance Improvement of Iterative Methods using a Bit-Representation Technique for Coefficient Matrices. VECPAR 2014
(8) Ono, K., Kawashima, Y. and Kawanabe, T.: Data Centric Framework for Large-scale High-performance Parallel Computation. Procedia Computer Science, Vol. 29 (2014) 2336-2350
(9) http://avr-aics-riken.github.io/ffvc_package/
(10) http://avr-aics-riken.github.io/pmlib/
(11) http://www.cs.virginia.edu/stream

Fig. 5: Comparison of serial performance of each machine. The problem size is varied. Panels: (a) Intel Xeon X6, (b) Intel Xeon E-67, (c) Fujitsu Sparc Venus VIIIfx, (d) Fujitsu Sparc Venus IXfx.

Fig. 6: Comparison of thread-parallel performance of each machine. The problem size is chosen so that the data resides in main memory. Panels: (a) Intel Xeon X6, (b) Intel Xeon E-67, (c) Fujitsu Sparc Venus VIIIfx, (d) Fujitsu Sparc Venus IXfx.