Efficient Implementation of Sparse Linear Algebra Operations on InfiniBand Cluster

Akira Nishida (Department of Computer Science, the University of Tokyo / CREST, JST)

Construction of a scalable and low-cost parallel computing environment is indispensable for the efficient solution of linear systems with large sparse matrices. In this study, we build a PC cluster based on PCI Express and InfiniBand technology for the scalable implementation of parallel iterative linear solvers, and evaluate its performance. The potential of commodity interconnect technologies, and the problems that remain to be solved, are also discussed. The results of our evaluation show that the communication bandwidth of the intranode interconnects is the most critical factor for the scalable implementation of parallel sparse matrix computations.

1. Introduction

PC clusters built from commodity components have become a practical, low-cost platform for large scale scientific computing 3),7). Our target is the iterative solution of linear systems with large sparse matrices on such clusters. Fig. 1 shows the algorithm of the conjugate gradient (CG) method; its kernels are the sparse matrix-vector product, inner products, and vector updates, the vector operations summarized in Fig. 2.

  1. Choose x_0
  2. p_0 = r_0 = b - Ax_0, k = 0
  3. α_k = (r_k, p_k)/(p_k, Ap_k)
  4. x_{k+1} = x_k + α_k p_k
  5. r_{k+1} = r_k - α_k Ap_k
  6. β_k = (r_{k+1}, r_{k+1})/(r_k, r_k)
  7. p_{k+1} = r_{k+1} + β_k p_k
     Increment k and goto 3 if not convergent

Fig. 1 Algorithm of Conjugate Gradient Method.
Fig. 2 Vector operations for iterative solvers.
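To make the operation mix of Fig. 1 and Fig. 2 concrete, the following is a minimal serial CG sketch in C for a matrix stored in CSR format. It is an illustration written for this report, not code used in the measurements; the CSR layout, the helper names dot and spmv, and the convergence test are our own choices.

/* Minimal serial CG solver for a symmetric positive definite matrix in
 * CSR format (row_ptr, col_idx, val).  Each iteration performs one
 * sparse matrix-vector product, two inner products, and three vector
 * updates, exactly the operation mix of Fig. 1 and Fig. 2. */
#include <math.h>
#include <stdlib.h>

static double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

static void spmv(const int *row_ptr, const int *col_idx, const double *val,
                 const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            s += val[j] * x[col_idx[j]];
        y[i] = s;
    }
}

/* Solves A x = b; x holds the initial guess on entry, the solution on exit. */
void cg(const int *row_ptr, const int *col_idx, const double *val,
        const double *b, double *x, int n, double tol, int maxit)
{
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
           *q = malloc(n * sizeof *q);

    spmv(row_ptr, col_idx, val, x, q, n);          /* q = A x_0          */
    for (int i = 0; i < n; i++) r[i] = p[i] = b[i] - q[i];
    double rr = dot(r, r, n);

    for (int k = 0; k < maxit && sqrt(rr) > tol; k++) {
        spmv(row_ptr, col_idx, val, p, q, n);      /* q = A p_k          */
        /* alpha_k = (r_k,r_k)/(p_k,Ap_k), equal to (r_k,p_k)/(p_k,Ap_k)
         * of step 3 in Fig. 1 since p_0 = r_0.                          */
        double alpha = rr / dot(p, q, n);
        for (int i = 0; i < n; i++) x[i] += alpha * p[i];
        for (int i = 0; i < n; i++) r[i] -= alpha * q[i];
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;                 /* beta_k of Fig. 1    */
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(q);
}

Per iteration this performs one sparse matrix-vector product, two inner products, and three axpy-style vector updates, which is why memory and interconnect bandwidth dominate the performance of such solvers.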
2. Cluster Hardware

High performance interconnects such as Myrinet and Quadrics have long been used for PC clusters, and InfiniBand, an open industry standard promoted by Intel and other vendors, has recently become available as a commodity alternative. On the host side, however, the conventional PCI bus, with a peak bandwidth of roughly 4Gbps, has been a bottleneck for such interconnects.

PCI Express. PCI Express is the successor of the PCI bus, standardized by Intel, NEC, and other vendors, with products appearing in 2004. While PCI-X provides a peak bandwidth of about 1GB/s, PCI Express provides 2.5Gb/s per lane and up to 32 lanes, for an aggregate bandwidth of up to 16GB/s (Fig. 3). It also replaces AGP (Accelerated Graphic Port) as the graphics interface.

Fig. 3 Comparing PCI-X with PCI Express.

InfiniBand. We attach InfiniBand host channel adapters (HCAs) to PCI Express. The InfiniHost HCAs from Mellanox Technologies have two 4x InfiniBand ports, each port consisting of four 2.5Gbps lanes, and connect to the host through an 8-lane PCI Express link (Fig. 4).

Fig. 4 InfiniHost HCA MHEA28-XT for PCI Express bus from Mellanox Technologies, Inc.

Opteron. The AMD Opteron is a 64-bit processor with an on-chip memory controller, so that multiprocessor Opteron systems form a cc-NUMA architecture 6). The operating system is SuSE Linux 9.1 with the 2.6.4 kernel; the 2.6 Linux kernel supports the NUMA structure of Opteron systems. We use two types of Mellanox Technologies PCI Express dual-port InfiniHost HCAs: the MHEL-CF128-T, with 128MB of on-board DDR memory, and the memory-free MHEA28-XT. The nodes are connected through the MTS2400, a 24-port InfiniBand switch. For comparison, we also measured InfiniBand performance on PCI Express based Intel 64-bit Xeon and AMD Athlon64 machines. Since InfiniBand uses 8B/10B encoding, 80% of the signaling rate is available for data, which corresponds to 4GB/s of bidirectional bandwidth for a dual-port 4x HCA.
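As a quick check on the 4GB/s figure, the peak data rate of a dual-port 4x HCA follows directly from the lane rate and the 8B/10B encoding overhead quoted above (a simple worked calculation, not a measured value):

\[
2~\text{ports} \times 4~\text{lanes} \times 2.5\,\mathrm{Gb/s} \times 0.8 = 16\,\mathrm{Gb/s} = 2\,\mathrm{GB/s~per~direction},
\qquad
2 \times 2\,\mathrm{GB/s} = 4\,\mathrm{GB/s~bidirectional}.
\]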
The Intel Xeon machine is a Dell PowerEdge SC1420 with the Intel E7520 chipset, and the Athlon64 machine is based on an Asus A8N-SLI Deluxe board with the NVIDIA nForce4 SLI chipset, both equipped with the MHEL-CF128-T. The measured bandwidth was 97MB/s on the 1.8GHz and 2.2GHz Athlon64 systems and 8MB/s on the 2.8GHz Xeon system (one port), with an MPI latency of 4us, compared with the 3.7us reported by Mellanox.

For the Opteron cluster we chose the NVIDIA nForce Professional 2200, a PCI Express chipset for the Opteron. In March 2005 we built a cluster of eight Rioworks 2-way HDAM Express motherboards based on the nForce Professional, each with two 2.0GHz Opteron 246 processors (1MB L2 cache), for a total of 16 processors. Each node has four 512MB PC3200 DDR SDRAM (ECC Registered) modules.

Fig. 5 Opteron Cluster connected with InfiniBand switch MTS2400.

3. MPI Performance

MPI implementations for InfiniBand include MVAPICH 4), LAM MPI 2), and MPICH/SCore 8). MVAPICH supports multirail configurations that use several ports or HCAs at the same time 5); we use MVAPICH 0.9.5 with the two 4x ports of the PCI Express InfiniHost HCA. For comparison over GbE with one port, we use MPICH 1.2.7 with the ch_p4 device. The GbE interface is a Realtek RTL8169, a 32-bit, 66MHz PCI device, attached to the PCI-X bus.

As a further reference, we measured MPI performance on an SGI Altix 3700 shared memory system. The Altix is based on the NUMAflex architecture and Intel Itanium2 processors; each node has two processors sharing a 6.4GB/s bus, and the nodes are connected by 3.2GB/s interconnect links. Each 1.3GHz Itanium2 has a 32KB L1 cache (16KB of it data cache), a 256KB L2 cache, and a 3MB L3 cache, with 1GB of memory per processor. Of the 32 processors of our Altix 3700, we used 16.

Figures 6 to 8 show the MPI latency and bidirectional bandwidth of the MHEL-CF128-T and the MHEA28-XT. With GbE the MPI latency is 28.71us and the bandwidth 58MB/s, while with the InfiniHost HCA the MPI latency is 3.99us and the bandwidth 297MB/s (MHEL-CF128-T). A difference between the MHEA28-XT and the MHEL-CF128-T appears for messages of around 4MB and larger. In these measurements, nodes 0-3 are equipped with the MHEL-CF128-T and nodes 4-7 with the MHEA28-XT. For PCI-X based HCAs, a latency of 6us and a multirail bandwidth of 1877MB/s have been reported 5), and PCI Express improves on these figures. The MVAPICH group at Ohio State University*1 reports 274MB/s on a 3.4GHz EM64T Xeon, a difference of about 7% from our results. Similar behavior was observed on the Athlon64 and Opteron systems. The corresponding latency and bandwidth on the Altix 3700 are shown in Figs. 9 and 10.

*1 MVAPICH (MPI over InfiniBand) project: http://nowlab.cis.ohio-state.edu/projects/mpi-iba/
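The latency and bandwidth values above come from MPI-level microbenchmarks. The sketch below shows the usual ping-pong scheme behind such measurements; it is an illustration written for this report, not the benchmark code actually used, and the message sizes and iteration count are arbitrary.

/* Minimal MPI ping-pong microbenchmark (illustrative sketch).
 * Rank 0 and rank 1 exchange messages of increasing size and
 * report half round-trip latency and the resulting bandwidth.
 * Compile with mpicc and run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (long size = 1; size <= (1L << 22); size *= 2) {   /* 1B .. 4MB */
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / iters / 2.0;       /* one-way time */
        if (rank == 0)
            printf("%10ld bytes  %10.2f us  %10.2f MB/s\n",
                   size, t * 1e6, size / t / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}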
Fig. 6 MPI Latency of InfiniHost HCA (MHEL-CF128-T) with different number of communication ports.
Fig. 7 MPI Bidirectional Bandwidth of InfiniHost HCA (MHEL-CF128-T) with different number of communication ports.
Fig. 8 MPI Bidirectional Bandwidth of InfiniHost HCA (MHEA28-XT) with different number of communication ports.
Fig. 9 MPI Latency on SGI Altix 3700.
Fig. 10 MPI Bidirectional Bandwidth on SGI Altix 3700.

3.1 STREAM Benchmark

To evaluate the memory performance of the Opteron cluster and the SGI Altix 3700, we ran the STREAM benchmark 1), which consists of the four vector kernels summarized in Table 1. On the cluster we used the MPI version, stream_mpi.f. Fig. 11 shows the STREAM results with two MPI processes per node, for an array size of 4,000,000 elements, about 91.6MB of data in total; the measured bandwidth is about 2.5GB/s per node. The OpenMP STREAM benchmark results on the SGI Altix 3700 are shown in Fig. 12.

Table 1 STREAM benchmark types.
  Benchmark   Operation               Bytes per iteration
  Copy        a[i] = b[i]             16
  Scale       a[i] = q * b[i]         16
  Add         a[i] = b[i] + c[i]      24
  Triad       a[i] = b[i] + q * c[i]  24
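For reference, the Triad kernel in Table 1 is essentially the following loop. This is a minimal serial sketch, not the stream_mpi.f code used in the measurements; the array size matches the 4,000,000 elements quoted above, and the timing and verification logic of the real benchmark is reduced to a single measurement.

/* Minimal sketch of the STREAM Triad kernel (cf. Table 1).
 * Each iteration reads b[i] and c[i] and writes a[i]:
 * 24 bytes of memory traffic per iteration with 8-byte doubles. */
#include <stdio.h>
#include <time.h>

#define N 4000000                      /* array size used on the cluster */

static double a[N], b[N], c[N];

int main(void)
{
    const double q = 3.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];        /* Triad */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("Triad: %.1f MB/s (a[0] = %.1f)\n", 24.0 * N / sec / 1e6, a[0]);
    return 0;
}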
4. NAS Parallel Benchmark

To evaluate parallel sparse matrix computations on the cluster, we use the CG benchmark, class C, of the MPI version of the NAS Parallel Benchmarks 3.2. The CG benchmark applies the conjugate gradient method to an unstructured sparse matrix, so its performance depends on irregular memory access and on communication. Fig. 13 shows the speedup of NPB CG on the Altix 3700.

Fig. 11 Speedup of STREAM benchmark results with two processes per node.
Fig. 12 STREAM benchmark performance in MB/s with array size 8,000,000 on SGI Altix with memory affinity.
Fig. 13 Speedup of NPB CG on SGI Altix 3700.
Fig. 14 Speedup of NPB CG with one port of GbE.

Figs. 14 to 16 show the NPB CG results on the Opteron cluster with one port of GbE, with one port of the InfiniHost HCA, and with two ports of the InfiniHost HCA, respectively. With GbE the performance scales poorly as the number of processes increases, whereas with the InfiniHost HCA the CG benchmark (class C) scales considerably better. The comparison among these configurations illustrates how strongly the CG performance depends on the available communication bandwidth.
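The dependence on communication bandwidth follows from the two communication steps of a distributed CG iteration: a global reduction for each inner product and an exchange of vector entries for the sparse matrix-vector product. The sketch below illustrates this pattern with a simple row-block partitioning and MPI_Allgatherv; it is our own illustration, not the NPB CG code, which uses its own partitioning and reduction scheme.

/* Communication pattern of a distributed CG iteration (illustrative).
 * dot():  one MPI_Allreduce per inner product (latency sensitive).
 * spmv(): remote vector entries must be fetched for the sparse
 *         matrix-vector product (bandwidth sensitive).  Here every
 *         process simply gathers the whole vector; a real solver would
 *         exchange only the needed halo entries. */
#include <mpi.h>
#include <stdlib.h>

/* Global inner product of two distributed vectors of local length n. */
double dot(const double *x, const double *y, int n)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; i++)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

/* y_local = A * x for a row-block distributed CSR matrix whose column
 * indices refer to the global vector.  counts/displs hold the block
 * sizes and offsets of all processes. */
void spmv(const int *row_ptr, const int *col_idx, const double *val,
          double *x_local, double *y_local, int n_local, int n_global,
          int *counts, int *displs)
{
    double *x_full = malloc(n_global * sizeof(double));
    MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                   x_full, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
    for (int i = 0; i < n_local; i++) {
        double s = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            s += val[j] * x_full[col_idx[j]];
        y_local[i] = s;
    }
    free(x_full);
}

In this simplified form each process receives the full vector in every iteration, so the traffic per iteration is proportional to the problem size, while the reductions add one short collective per inner product.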
Fig. 15 Speedup of NPB CG with one port of InfiniHost HCA.
Fig. 16 Speedup of NPB CG with two ports of InfiniHost HCA.

5. Discussion

Our measurements indicate that the communication bandwidth of the intranode interconnects is the most critical factor for the scalable implementation of parallel sparse matrix computations on commodity clusters.

6. Concluding Remarks

We built a PC cluster based on PCI Express and InfiniBand, evaluated its communication and memory performance, and compared it with the SGI Altix 3700 shared memory system using the STREAM and NPB CG benchmarks.

Acknowledgments. We thank Mellanox Technologies, Inc. for their support with the InfiniBand products used in this study. This work was supported in part by CREST, JST, and by grant 1616225.

References
1) STREAM: Sustainable Memory Bandwidth in High Performance Computers, http://www.cs.virginia.edu/stream/.
2) G. Burns, R. Daoud, and J. Vaigl, LAM: An Open Cluster Environment for MPI, in Proceedings of Supercomputing Symposium, 1994, pp. 379-386.
3) J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann, 2003.
4) J. Liu, B. Chandrasekaran, W. Jiang, J. Wu, S. P. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. K. Panda, Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics, in Proceedings of the International Conference on Supercomputing '03, 2003.
5) J. Liu, A. Vishnu, and D. K. Panda, Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation, in Proceedings of the International Conference on Supercomputing '04, 2004.
6) A. Nishida and Y. Oyanagi, Performance Evaluation of Low Level Multithreaded BLAS Kernels on Intel Processor based cc-NUMA Systems, LNCS 2858 (2003), pp. 5-51.
7) G. F. Pfister, In Search of Clusters, Prentice-Hall, second ed., 1998.
8) S. Sumimoto, A. Naruse, K. Kumon, K. Hosoe, and T. Shimizu, PM/InfiniBand-FJ: a high performance communication facility using InfiniBand for large scale PC clusters.