Fast and Efficient Tsunami Propagation Simulation with FPGA and GPGPU

Hideo Tanida,†1 Akira Fukui,†1 Hiroaki Yoshida†2,†3,*1 and Masahiro Fujita†2,†3

†1 Dept. of Electrical Engineering and Information Systems, The University of Tokyo
†2 VLSI Design and Education Center, The University of Tokyo
†3 CREST, Japan Science and Technology Agency
*1 Presently with Fujitsu Laboratories of America, Inc.

Custom accelerators implemented on FPGAs and on GPUs are both regarded as ways to achieve high performance and efficiency at relatively low cost. This paper discusses the acceleration of a tsunami-propagation simulation based on the finite difference method, using an FPGA and a GPU. Experimental results show that optimizations that take the memory hierarchy into account are effective on both the FPGA and the GPU. Both the FPGA-assisted and the GPU-assisted executions show higher energy efficiency than execution on a general-purpose processor alone.

© 2012 Information Processing Society of Japan

1. Introduction

This paper considers two kinds of accelerators: the FPGA (Field Programmable Gate Array) and GPGPU (General Purpose computing on Graphics Processing Unit). FPGA designs are associated here with descriptions in C/C++ and at the register-transfer level (RTL). GPGPU has already been applied to other compute-intensive workloads, such as the registration of medical images 2).

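The rest of this paper is organized as follows. Section 2 describes TUNAMI-N1, Sections 3 and 4 describe its implementations on the FPGA and on the GPU, Section 5 evaluates them, and Section 6 concludes.

2. TUNAMI-N1

TUNAMI-N1 (Tohoku University's Numerical Analysis Model for Investigation of Near-field tsunamis, No. 1) 1) is the tsunami simulation program accelerated in this work.

2.1 Long Wave Theory

TUNAMI-N1 is based on the long wave theory, whose governing equations are (1) to (3):

\[ \frac{\partial \eta}{\partial t} + \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} = 0 \tag{1} \]

\[ \frac{\partial M}{\partial t} + g d \frac{\partial \eta}{\partial x} = 0 \tag{2} \]

\[ \frac{\partial N}{\partial t} + g d \frac{\partial \eta}{\partial y} = 0 \tag{3} \]

Here \eta is the water level, M and N are the discharge fluxes in the x and y directions, g is the gravitational acceleration, and d is the water depth.

2.2 TUNAMI-N1

TUNAMI-N1 holds the water depth in a two-dimensional array H[IF][JF], where IF and JF are the grid dimensions. The three quantities Z (water level), M and N (discharge fluxes) are held in the arrays Z[IF][JF], M[IF][JF] and N[IF][JF]. Their values at time step t+1 are computed from the values at time step t, and this update is repeated for T time steps.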

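2.3 Computation on FPGA and GPU

This section lists the computation that is executed on the FPGA and on the GPU. Below, t is the time step and i and j are the grid indices. In the interior of the grid, the water level is updated by equation (4):

\[ Z_{t+1}[j][i] = Z_t[j][i] - R \, ( M_t[j][i] - M_t[j][i-1] + N_t[j][i] - N_t[j-1][i] ) \tag{4} \]

On the four open boundaries (j = 0 with 0 < i; j = JF - 1 with 0 < i; i = 0 with 0 < j; i = IF - 1 with 0 < j), the water level is given by equations (5) to (8) instead. The signs and radicals of these four equations are not legible in the source; the forms below assume a square root over G H[j][i] and write each lost sign as \pm. Equations (5) and (7) share one form and equations (6) and (8) share the other:

\[ Z_{t+1}[j][i] = \frac{1}{2} \left( Z_t[j][i] \pm \frac{1}{\sqrt{G \, H[j][i]}} \left( \pm N_t[j][i] + \frac{M_t[j][i] - M_t[j][i-1]}{500} \right) \right) \tag{5, 7} \]

\[ Z_{t+1}[j][i] = \frac{1}{2} \left( Z_t[j][i] \pm \frac{1}{\sqrt{G \, H[j][i]}} \left( \pm M_t[j][i] + \frac{N_t[j][i] - N_t[j-1][i]}{500} \right) \right) \tag{6, 8} \]

The discharge fluxes are then updated from the new water level:

\[ M_{t+1}[j][i] = M_t[j][i] - G R \, ( H[j][i] + H[j][i+1] ) \, ( Z_{t+1}[j][i+1] - Z_{t+1}[j][i] ) / 2 \tag{9} \]

\[ N_{t+1}[j][i] = N_t[j][i] - G R \, ( H[j+1][i] + H[j][i] ) \, ( Z_{t+1}[j+1][i] - Z_{t+1}[j][i] ) / 2 \tag{10} \]

In TUNAMI-N1, Z_{t+1} is therefore computed from Z_t, M_t and N_t, and M_{t+1} and N_{t+1} are computed from Z_{t+1}.

2.4 Flow of TUNAMI-N1

Figure 1 shows the flow of TUNAMI-N1.

3. TUNAMI-N1 on FPGA

The FPGA platform has a Virtex-6 SX475T FPGA and 24 GB of SDRAM on the FPGA board, and is connected through PCI Express to a host CPU (Intel Xeon X5650 @ 2.67 GHz). The design is described with MaxCompiler from Maxeler Technologies. MaxCompiler takes a description written in Java and generates VHDL, and the VHDL is implemented on the FPGA with the Xilinx tools. The host program is written in C. Figure 2 shows the processing on the CPU (host) and on the FPGA.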
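Returning to the update equations of Section 2.3, the interior updates (4), (9) and (10) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `step` and the helper `flat` are hypothetical, R is assumed to be the ratio of time step to grid spacing, G is assumed to be the gravitational acceleration (9.8), and the open-boundary updates (5) to (8) are left out, so boundary cells of Z simply keep their previous values.

```python
def step(Z, M, N, H, R, G=9.8):
    """Advance Z, M and N by one time step with equations (4), (9), (10).

    Arrays are lists of rows indexed [j][i], as in the equations.
    Assumption: boundary cells of Z are left unchanged, because the
    open-boundary updates (5) to (8) are not part of this sketch.
    """
    JF, IF = len(Z), len(Z[0])
    Z1 = [row[:] for row in Z]
    M1 = [row[:] for row in M]
    N1 = [row[:] for row in N]

    # Equation (4): continuity, interior points only.
    for j in range(1, JF - 1):
        for i in range(1, IF - 1):
            Z1[j][i] = Z[j][i] - R * (M[j][i] - M[j][i - 1]
                                      + N[j][i] - N[j - 1][i])

    # Equation (9): x-direction flux from the new water level.
    for j in range(JF):
        for i in range(IF - 1):
            M1[j][i] = (M[j][i] - G * R * (H[j][i] + H[j][i + 1])
                        * (Z1[j][i + 1] - Z1[j][i]) / 2)

    # Equation (10): y-direction flux from the new water level.
    for j in range(JF - 1):
        for i in range(IF):
            N1[j][i] = (N[j][i] - G * R * (H[j + 1][i] + H[j][i])
                        * (Z1[j + 1][i] - Z1[j][i]) / 2)

    return Z1, M1, N1


def flat(JF, IF, value=0.0):
    """Hypothetical helper: a JF x IF array filled with one value."""
    return [[value] * IF for _ in range(JF)]


# A single hump of water on a 5 x 5 grid of uniform depth 10.
Z = flat(5, 5)
Z[2][2] = 1.0
M, N, H = flat(5, 5), flat(5, 5), flat(5, 5, 10.0)

Z1, M1, N1 = step(Z, M, N, H, R=0.1)
Z2, M2, N2 = step(Z1, M1, N1, H, R=0.1)
print(M1[2][1], M1[2][2], Z2[2][2])
```

In the first step the water level is unchanged, because all fluxes start at zero, while the fluxes next to the hump point away from it. In the second step those fluxes lower the hump. Each grid point reads only its immediate neighbours, which is what makes the computation suitable for both a pipelined FPGA datapath and a GPU thread per grid point.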

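Figure 2: Processing with the FPGA (arrays H, Z, M and N)

3.1 Data Storage

As shown in Figure 2, the arrays Z, M, N and H are prepared on the CPU and passed to the FPGA through MaxCompiler. The data occupy 1040*668*4*4 Byte, which is too large to hold in the BRAM (SRAM) inside the FPGA. They are therefore placed in the 24 GByte SDRAM on the FPGA board, and the BRAM is used for the data being processed.

3.2 Optimization on FPGA

Figures 3 and 4 illustrate the optimization on the FPGA. Three time steps are processed in one pass: stage 1 computes time step t, stage 2 computes time step t+1, and stage 3 computes time step t+2.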
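The storage figure of 1040*668*4*4 Byte quoted in Section 3.1 can be checked with a short calculation. The sketch below assumes that one factor of 4 is the number of arrays (Z, M, N, H) and the other is the size of a single-precision value in bytes; the function name `footprint_bytes` is hypothetical.

```python
GRID_X, GRID_Y = 1040, 668   # grid size used in the paper
NUM_ARRAYS = 4               # Z, M, N and H (assumed meaning of one "4")
BYTES_PER_VALUE = 4          # single precision (assumed meaning of the other "4")
SDRAM_BYTES = 24 * 2**30     # 24 GByte of SDRAM on the FPGA board


def footprint_bytes(nx, ny, arrays=NUM_ARRAYS, width=BYTES_PER_VALUE):
    """Total size in bytes of all arrays for an nx x ny grid."""
    return nx * ny * arrays * width


total = footprint_bytes(GRID_X, GRID_Y)
print(total, "bytes =", round(total / 2**20, 1), "MiB")
print("fits in SDRAM:", total <= SDRAM_BYTES)
```

The result is about 10.6 MiB, a small fraction of the on-board SDRAM.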

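Figure 3 and Figure 4: Optimized processing on the FPGA using BRAM

4. TUNAMI-N1 on GPU

TUNAMI-N1 has previously been parallelized with GPGPU 4). The GPU used here is an NVIDIA Tesla C2075, and the host CPU is an Intel Xeon X5650 @ 2.67 GHz. Figure 5 shows the structure of the Tesla C2075. It has 14 Streaming Multiprocessors (SMs), and each SM has 32 Streaming Processors (SPs). Within the memory hierarchy, 1024 byte is available per SP, the L1 cache belongs to each SM, the L2 cache is shared among the SMs, and the global memory is 6 GByte.

Figure 5: GPU (Tesla C2075)

4.1 Implementation of TUNAMI-N1 on GPU

The GPU implementation follows 4). Figure 6 shows the processing on the CPU (host) and on the GPU (device).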

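Figure 6: Processing with GPGPU

The water level Z and the discharge fluxes M and N are computed on the GPU. The two-dimensional grid of 1040 × 668 points is divided into blocks of 16 × 16 points as shown in Figure 7, which gives 65 × 42 blocks, and the blocks are assigned to the SMs.

Figure 7: Assignment of blocks to SMs

4.2 Optimization on GPU

Consider an execution of 7200 time steps. Each grid point requires 17 floating-point operations per time step, so the total number of operations is

1040 × 668 × 17 × 7200 = 85,033,728,000.

The peak performance of the GPU is 1,050 [GFLOPS], so the ideal execution time is

85,033,728,000 / 1,050,000,000,000 = 0.081 [sec.].

The measured execution time is 2.8 [sec.], which is 34.6 times the ideal.

The gap can be examined with CGMA (Compute to Global Memory Access), the ratio of floating-point operations to global memory accesses. One grid point needs 11 global memory accesses for its 17 operations, so CGMA is 17/11 = 1.55. The memory bandwidth of the GPU is 144 [GB/s] and each value occupies 4 bytes, so the attainable performance is (144/4) × 1.55 = 55.8 [GFLOPS]. At this rate the 7200 time steps take

85,033,728,000 / 55,800,000,000 = 1.52 [sec.],

which is much closer to the measured 2.8 [sec.]. The execution is therefore limited by global memory accesses rather than by arithmetic.

To reduce global memory accesses, the data needed by a block are kept in memory inside the SM. A block of 16 × 16 points also needs the values of its neighbours, so a region of 18 × 18 points is loaded for each block (Figure 8). The threads of a block are synchronized with __syncthreads() during the 7200 time steps (Figure 9).

5. Evaluation

Three executions are compared: CPU only, FPGA-assisted, and GPU-assisted. The CPU version is TUNAMI-N1, originally written in Fortran, ported to C. The FPGA version is the one described in Section 3.2 and the GPU version is the one described in Section 4.2.

5.1 Execution Time

Table 1 lists the execution times on the CPU, FPGA and GPU for 7200 time steps (2 hours of simulated time) and for 86400 time steps (24 hours of simulated time).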
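The estimates of Sections 4.1 and 4.2 can be reproduced with a short script. It is a sketch of the arithmetic only, under two assumptions that are labeled in the code: the bandwidth figure of 144 is read as GB/s, and CGMA is rounded to two decimals before use, as the quoted value 1.55 suggests. All function names are hypothetical.

```python
import math

NX, NY = 1040, 668          # grid size
BLOCK = 16                  # block edge, in grid points
FLOPS_PER_POINT = 17        # floating-point operations per point per step
ACCESSES_PER_POINT = 11     # global memory accesses per point per step
STEPS = 7200
PEAK_GFLOPS = 1050.0        # peak performance of the GPU
BANDWIDTH_GB_S = 144.0      # assumption: the quoted 144 is in GB/s
BYTES_PER_VALUE = 4
MEASURED_SEC = 2.8


def block_grid(nx, ny, block):
    """Number of blocks needed to cover the grid in each direction."""
    return math.ceil(nx / block), math.ceil(ny / block)


def total_flops(nx, ny, flops_per_point, steps):
    return nx * ny * flops_per_point * steps


def seconds_at(flops, gflops):
    """Execution time if the given rate in GFLOPS is sustained."""
    return flops / (gflops * 1e9)


blocks = block_grid(NX, NY, BLOCK)
flops = total_flops(NX, NY, FLOPS_PER_POINT, STEPS)
ideal = seconds_at(flops, PEAK_GFLOPS)

# Assumption: CGMA is rounded to two decimals, matching the text.
cgma = round(FLOPS_PER_POINT / ACCESSES_PER_POINT, 2)
attainable = BANDWIDTH_GB_S / BYTES_PER_VALUE * cgma   # GFLOPS
bound = seconds_at(flops, attainable)

print("blocks:", blocks)
print("operations:", flops)
print("compute-bound time: %.3f s" % ideal)
print("bandwidth-bound rate: %.1f GFLOPS" % attainable)
print("bandwidth-bound time: %.2f s" % bound)
print("measured / compute-bound: %.1f" % (MEASURED_SEC / ideal))
```

The same functions can be reused to see how the bound moves when the number of global memory accesses per point is reduced, which is the effect that keeping an 18 × 18 region inside the SM aims for.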

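Table 1: Execution time (sec.) by number of time steps, with the speedup over the CPU in parentheses

        7200            7200            86400
CPU     78.7 (x1)       80.1 (x1)       943 (x1)
FPGA    1.85 (x42.5)    6.22 (x12.9)    26.57 (x35.5)
GPU     2.05 (x38.4)    4.71 (x17.0)    31.1 (x30.32)

Table 2: Power and energy consumption

        Power (W)   Energy (J)
CPU     24          1888.8
FPGA    42          77.7
GPU     129         264.45

Relative to the GPU implementation of Section 4.1, a difference of 34% is reported for the GPU.

5.2 Power and Energy Consumption

Table 2 lists the power and energy consumption of the CPU, FPGA and GPU executions. Both the FPGA and the GPU draw more power than the CPU, but because they finish much sooner, both consume far less energy than the CPU-only execution. The energy consumed with the FPGA is about 1/3 of that consumed with the GPU. Techniques such as bit-width optimization 3) may improve the FPGA implementation further.

6. Conclusion

This paper accelerated a tsunami-propagation simulation with an FPGA and with a GPU. Optimizations that take the memory hierarchy into account, using the BRAM on the FPGA and the memory inside the SMs on the GPU, were effective on both.

References

1) Fumihiko Imamura, Ahmet Cevdet Yalciner, and Gulizar Ozyurt, TSUNAMI MODELLING MANUAL, available from <http://www.tsunami.civil.tohoku.ac.jp/hokusai3/j/projects/manual-ver-3.1.pdf>, accessed 2012-02-13.
2) Fumihiko Ino, Jun Gomita, Yasuhiro Kawasaki, and Kenichi Hagihara, A GPGPU approach for accelerating 2-D/3-D rigid registration of medical images, in Proc. Parallel and Distributed Processing and Applications (ISPA), vol. 4330, pp. 939-950, 2006.
3) Dong-U Lee, Altaf Abdul Gaffar, Ray C. C. Cheung, Oskar Mencer, Wayne Luk, and George A. Constantinides, Accuracy-guaranteed bit-width optimization, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 10, pp. 1990-2000, Oct. 2006.
4) Harsh Gidra, Israrul Haque, Nitin P. Kumar, Sargurunathan M., M. S. Gaur, Vijay Laxmi, M. Zwolinski, and Virendra Singh, Parallelizing TUNAMI-N1 using GPGPU, in Proc. IEEE International Conference on High Performance Computing and Communications (HPCC), pp. 845-850, Sep. 2011.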
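The speedups of Table 1 and the energy values of Table 2 can be recomputed from the raw times and power figures. The sketch below assumes that the energy in Table 2 is the power multiplied by the first-column time of Table 1, a pairing that the published numbers are consistent with; the names `speedups` and `energy_joules` are hypothetical.

```python
# Execution times in seconds, in the column order of Table 1.
TIMES = {
    "CPU":  (78.7, 80.1, 943.0),
    "FPGA": (1.85, 6.22, 26.57),
    "GPU":  (2.05, 4.71, 31.1),
}

# Power in watts, from Table 2.
POWER_W = {"CPU": 24, "FPGA": 42, "GPU": 129}


def speedups(times, baseline="CPU"):
    """Speedup of every platform over the baseline, per column."""
    base = times[baseline]
    return {name: tuple(b / t for b, t in zip(base, row))
            for name, row in times.items()}


def energy_joules(power, times, column=0):
    """Energy = power x time. Assumption: Table 2 uses the first column."""
    return {name: power[name] * times[name][column] for name in power}


speed = speedups(TIMES)
energy = energy_joules(POWER_W, TIMES)

for name in TIMES:
    ratios = ", ".join("x%.1f" % s for s in speed[name])
    print("%-4s speedup: %s  energy: %.2f J" % (name, ratios, energy[name]))
print("FPGA / GPU energy: %.2f" % (energy["FPGA"] / energy["GPU"]))
```

The last line gives a ratio close to 0.3, in line with the statement that the FPGA needs about 1/3 of the energy of the GPU.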