IPSJ Transactions on Advanced Computing Systems, Vol. 1, No. 2, pp. 231-239 (Aug. 2008)

High Performance 3-D FFT in CUDA Environment

Akira Nukada,†1,†2  Yasuhiko Ogata,†1,†2  Toshio Endo†1,†2  and  Satoshi Matsuoka†1,†2,†3

The CUDA environment, which is supported by the latest NVIDIA GPUs, allows data sharing between threads using shared memory and also provides more flexible memory accesses. We propose a high performance 3-D FFT algorithm for the CUDA environment. Using GeForce 8 series GPUs, we achieved a performance of up to 79.5 GFLOPS for 3-D FFTs, which is 3.1 to 3.3 times the performance of the CUFFT 1.1 library.

1. Introduction

Graphics Processing Units (GPUs) were originally developed as dedicated processors for 3-D graphics. Recent GPUs are programmable and provide floating-point performance and memory bandwidth far beyond those of CPUs, and research on using them for general computation, known as General-Purpose computation on Graphics Processing Units (GPGPU) 1), has become very active. Programming environments for GPUs include NVIDIA's Cg 2), Microsoft's High Level Shader Language (HLSL), and the C-like stream language BrookGPU 3). NVIDIA's CUDA (Compute Unified Device Architecture) 4) allows GPU programs to be written in C/C++, and has already been applied to problems such as N-body simulation 5) and matrix multiplication 6).

The fast Fourier transform (FFT) 7) is one of the most important computational kernels, and 3-D FFTs in particular appear in many scientific applications. In this paper we propose a high-performance 3-D FFT algorithm for the CUDA environment. An FFT performs O(N log N) floating-point operations on O(N) data, so on GPUs, as on CPUs, its performance is limited by memory bandwidth rather than by arithmetic throughput. FFT implementations on earlier generations of GPUs have been reported 8),9). Since current CUDA-capable GPUs carry 512 MB or more of device memory, we target single-precision complex 3-D FFTs of size 256^3.

†1 Tokyo Institute of Technology
†2 Japan Science and Technology Agency, Core Research for Evolutional Science and Technology (CREST)
†3 National Institute of Informatics
2. NVIDIA CUDA

CUDA-capable GPUs are built from multiprocessors (MPs). In the G80 processor used in the GeForce 8800 GTX, each MP contains 8 Streaming Processors (SPs), 16 KB of shared memory, and 8,192 registers; the 8800 GTX has 16 MPs, i.e., 128 SPs in total. The SPs within an MP operate in SIMD (Single Instruction, Multiple Data) fashion. Each MP can hold up to 768 resident threads, which are scheduled in groups of 32 threads called warps, so up to 24 warps can be resident per MP; one warp of 32 threads is executed on the 8 SPs of an MP over 4 clock cycles. CUDA GPUs are attached to the host through PCI-Express.

Table 1 lists the specifications of the CUDA-capable GeForce 8 series GPUs used in this work (as of January 2008). The 8800 GTS listed here is the 512 MB model based on the G92 processor (8800 GTS 512); G92 is fabricated in a 65 nm process, whereas G80 uses a 90 nm process. We use these three GPUs, the 8800 GT, 8800 GTS, and 8800 GTX, for the evaluation. The GFLOPS column in Table 1 gives the peak single-precision performance of the SPs that is usable for FFT computation. For comparison, an AMD Phenom 9500 (2.2 GHz, 4 cores) has a peak single-precision performance of 70.4 GFLOPS (35.2 GFLOPS in double precision), so the 8800 GTS is about 6 times faster in peak terms. Efficient FFT implementations for vector processors and GPUs are often based on the multirow FFT algorithm 10),11), which computes many FFTs of the same length simultaneously; our kernels are also built on multirow FFTs.

Table 1  Specifications of NVIDIA GeForce 8 series GPUs.

  Model       MP   SP #   SP clock    GFLOPS   Memory    Interface   Mem. clock   Bandwidth
  8800 GT     14   112    1.500 GHz   336      512 MB    256-bit     1,800 MHz    57.6 GB/s
  8800 GTS    16   128    1.625 GHz   416      512 MB    256-bit     1,940 MHz    62.0 GB/s
  8800 GTX    16   128    1.350 GHz   345      768 MB    384-bit     1,800 MHz    86.4 GB/s

Table 2  Number of streams and achieved memory bandwidth.

                  Bandwidth (GB/s)
  # of streams    8800 GT   8800 GTX
     1            48.1      71.7
     2            47.6      71.9
     4            47.4      71.8
     8            36.4      65.4
    16            27.8      43.7
    32            25.7      37.5
    64            20.2      32.3
   128            22.6      35.4
   256            18.4      30.7

Since the performance of a 3-D FFT is bound by memory bandwidth, we first measured the bandwidth that GPU kernels can actually achieve. Table 2 shows the bandwidth obtained when copying 128 MB of data (256^3 float2 elements) in device memory while varying the number of independent streams accessed by the copy kernel. The number of thread blocks is three per MP (42 on the 8800 GT and 48 on the 8800 GTX), and each block has 64 threads. The data is accessed as float2 elements (pairs of float values), and each thread walks through STREAMS separate regions of the array, as in the following kernel, which is compiled for each value of STREAMS:
    __global__ void cuda_memcpy(float *s, float *d)
    {
        int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
        int step  = __mul24(gridDim.x, blockDim.x);
        int n = 256 * 256 * (256 / STREAMS);   /* float2 elements per stream */
        float2 *src1 = (float2 *)s, *dst1 = (float2 *)d;
        float2 *src2 = src1 + n,    *dst2 = dst1 + n;
        float2 *src3 = src2 + n,    *dst3 = dst2 + n;
        ...
        for (int i = index; i < n; i += step) {
            dst1[i] = src1[i];
            dst2[i] = src2[i];
            ...
            dstSTREAMS[i] = srcSTREAMS[i];
        }
    }

As Table 2 shows, the achieved bandwidth decreases by more than 10 GB/s as the number of streams grows, and with 256 streams it falls to less than half of the single-stream value, because the accesses of the kernel are spread over a larger number of separate memory regions. An FFT kernel should therefore keep the number of regions it accesses concurrently as small as possible.

2.1 Coalesced Memory Access

When the 16 threads of a half-warp access consecutive addresses of 4-byte, 8-byte, or 16-byte words, the accesses are coalesced into a single 64-byte, 128-byte, or 256-byte memory transaction, respectively. Since the GPU's GDDR memory delivers its full bandwidth only for such large contiguous transactions, coalesced memory access is essential for performance: accesses of a half-warp that do not satisfy the coalescing conditions are issued as separate transactions, and the effective bandwidth drops sharply. CUDA kernels therefore have to be organized so that their global memory accesses are coalesced.
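To make the coalescing rule concrete, the following two kernels are a minimal illustration of our own (they are not taken from the implementation; the kernel names and the stride of 16 are arbitrary choices). The first copy is fully coalesced, while the second forces each half-warp to touch 16 separate memory segments and therefore achieves only a small fraction of the bandwidth of Table 2.

    #define STRIDE 16

    __global__ void copy_coalesced(const float *src, float *dst, int n)
    {
        /* Consecutive threads of a half-warp read consecutive 4-byte words,
           so the 16 loads are combined into one 64-byte transaction. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    __global__ void copy_strided(const float *src, float *dst, int n)
    {
        /* Consecutive threads read words STRIDE elements apart, so the loads
           of a half-warp cannot be coalesced and are issued separately. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = (i * STRIDE) % n;
        if (i < n) dst[i] = src[j];
    }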
3. 3-D FFT in the CUDA Environment

A 3-D FFT is computed by applying 1-D FFTs along each of the three dimensions X, Y, and Z. On CPUs this is usually done one dimension at a time, but the memory access pattern of each pass differs greatly: the X-direction transforms access contiguous data, whereas the Y- and Z-direction transforms access data with large strides. On a CUDA GPU there is the additional constraint of coalescing: a half-warp of 16 threads should access 16 consecutive 32-bit float values, i.e., 4 bytes x 16 = 64 bytes, per transaction. The resources available for building such kernels are the registers of each thread and the 16 KB of shared memory per MP.

3.1 Algorithm

For the X direction, the 256 points of each 1-D FFT are contiguous in memory, so coalesced access can be obtained in essentially the same way as on a CPU. For the Y and Z directions, the points of each 1-D FFT are separated by large strides, and the 16 KB shared memory per MP is too small to stage whole 256-point transforms for all threads of a block; we therefore compute these transforms with multirow FFTs that keep their data in registers, assigning one FFT to each thread. If a thread computed an entire 256-point FFT, it would need more than 512 registers (two floats per point plus workspace), i.e., a register allocation of up to 1,024 registers per thread, so at most 8 threads could be resident on an MP, far too few to satisfy the coalescing conditions. We therefore decompose the 256-point FFT into two passes of 16-point FFTs (256 = 16 x 16). A 16-point FFT needs only 51 to 52 registers per thread, which allows 128 threads to be resident per MP. Measured on the 8800 GT and 8800 GTX, a multirow FFT that computes complete 256-point FFTs per thread achieves only about 10 GB/s of effective bandwidth, whereas the multirow 16-point FFT achieves about 38 GB/s.

For the Y- and Z-direction transforms we therefore use 16-point multirow FFTs, and the 3-D FFT of size 256^3 is computed in five steps: (1) the first 16-point stage of the 256-point FFT along Z, (2) the second 16-point stage along Z, (3) the first 16-point stage along Y, (4) the second 16-point stage along Y, and (5) the 256-point FFT along X. Splitting the Y and Z indices as 256 = 16 x 16, the input array is regarded as a five-dimensional array V(256,16,16,16,16), and a work array WORK of the same shape is used. In Fortran-style notation the five steps are:

    COMPLEX V(256,16,16,16,16), WORK(256,16,16,16,16)
    for Z1,Y2,Y1,X:  WORK(X,*,Y1,Y2,Z1) = FFT256_1( V(X,Y1,Y2,Z1,*) )     ! step 1
    for Y2,Y1,Z2,X:  V(X,Z2,*,Y1,Y2)    = FFT256_2( WORK(X,Z2,Y1,Y2,*) )  ! step 2
    for Y1,Z1,Z2,X:  WORK(X,*,Z2,Z1,Y1) = FFT256_1( V(X,Z2,Z1,Y1,*) )     ! step 3
    for Y2,Y1,Z2,X:  V(X,Y2,*,Z2,Z1)    = FFT256_2( WORK(X,Y2,Z2,Z1,*) )  ! step 4
    for Z1,Z2,Y1,Y2: V(*,Y2,Y1,Z2,Z1)   = FFT256( V(*,Y2,Y1,Z2,Z1) )      ! step 5

Here FFT256() denotes a complete 256-point FFT, while FFT256_1() and FFT256_2() denote the first and second 16-point stages of a 256-point FFT, including the required twiddle-factor multiplications. The notation "for Z1,Y2,Y1,X" means that the statement is executed for all combinations of the four indices Z1, Y2, Y1, and X, and an asterisk marks the dimension along which the points of one transform are taken. Since X is always among the loop indices and the X dimension is contiguous in memory, consecutive values of X can be mapped to consecutive threads, which satisfies the coalesced memory access conditions. The 256-point FFT of step 5 is itself computed as two passes of 16-point FFTs. Note also that steps 1 and 3 use the same kernel, as do steps 2 and 4, and that the result is returned in the layout V(X,Y2,Y1,Z2,Z1) rather than in the input layout V(X,Y1,Y2,Z1,Z2): the two 16-valued digits of the Y and Z indices are swapped, in the manner of the self-sorting Stockham framework 11).

The memory accesses of the Y- and Z-direction 16-point FFTs follow one of the four patterns defined in Table 3, depending on which of the four 16-element dimensions is transformed.

Table 3  Definition of memory access patterns.

  A  (256,*,16,16,16)
  B  (256,16,*,16,16)
  C  (256,16,16,*,16)
  D  (256,16,16,16,*)
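As a concrete, purely illustrative view of these patterns (our own sketch, not the implementation), the fragment below shows how one thread could address the 16 elements of a Y/Z-direction FFT, assuming V(256,16,16,16,16) is stored in column-major (Fortran) order as float2 values; the helper name idx5, the kernel name, and the launch configuration <<<4096, 256>>> are assumptions of this example.

    /* Column-major linear index into V(256,16,16,16,16). */
    __device__ int idx5(int x, int i1, int i2, int i3, int i4)
    {
        return x + 256 * (i1 + 16 * (i2 + 16 * (i3 + 16 * i4)));
    }

    __global__ void load_pattern_A(const float2 *V, float2 *W)
    {
        /* One thread handles the 16 elements of one 16-point FFT. */
        int x  = threadIdx.x;              /* 0..255, contiguous across threads */
        int i2 = blockIdx.x % 16;
        int i3 = (blockIdx.x / 16) % 16;
        int i4 = blockIdx.x / 256;

        /* Pattern A, (256,*,16,16,16): the 16 inputs of this thread are 256
           float2 elements apart, but across threads the x index is consecutive,
           so each of the 16 loads of a half-warp is coalesced. */
        for (int k = 0; k < 16; k++)
            W[idx5(x, k, i2, i3, i4)] = V[idx5(x, k, i2, i3, i4)];

        /* Pattern D, (256,16,16,16,*), would instead address
           V[idx5(x, i2, i3, i4, k)], a stride of 256*16*16*16 elements. */
    }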
Tables 4 and 5 show the memory bandwidth achieved on the 8800 GT and the 8800 GTX for every combination of input and output access pattern, measured with the same configuration as in Section 2 (42 and 48 thread blocks, respectively, with 64 threads per block).

Table 4  Achieved memory bandwidth (GB/s) with each memory access pattern for input and output on the 8800 GT (42 thread blocks, 64 threads per block).

                     Output
  Input       A      B      C      D
    A       47.4   47.9   46.8   47.1
    B       48.2   48.3   46.8   47.1
    C       47.3   47.1   34.4   33.3
    D       45.6   45.2   32.6   27.8

Table 5  Achieved memory bandwidth (GB/s) with each memory access pattern for input and output on the 8800 GTX (48 thread blocks, 64 threads per block).

                     Output
  Input       A      B      C      D
    A       71.5   71.5   67.7   66.8
    B       71.3   71.3   67.6   67.0
    C       68.7   68.5   51.3   50.4
    D       67.5   66.7   50.0   43.7

The bandwidth drops sharply only when both the input and the output use pattern C or D; as long as at least one of the two is pattern A or B, the bandwidth remains close to the best case. In the five-step algorithm of Section 3.1, steps 1 and 3 read with pattern D and write with pattern A, and steps 2 and 4 read with pattern D and write with pattern B, so every step stays in the fast region of Tables 4 and 5.

3.2 Thread Configuration

Steps 1-4 and step 5 use different kernel organizations. In steps 1-4, each thread computes one 16-point FFT and keeps all of its data in registers; a thread block has 64 threads, so one block computes 64 FFTs at a time on the 8 SPs of an MP, keeping all SPs busy, and consecutive threads process consecutive X indices so that the global memory accesses are coalesced. In step 5 the transform is along the contiguous X dimension, so the threads of a block cooperate on the 256-point FFTs: the transform is again decomposed into 16-point FFTs, and the intermediate data is exchanged between threads through the shared memory so that the global loads and stores remain coalesced.

4. Performance Evaluation

We evaluated the proposed 3-D FFT on the GeForce 8800 GT, 8800 GTS, and 8800 GTX. Table 6 shows the configuration of the evaluation system. With the AMD 790FX chipset, the 8800 GT and 8800 GTS operate as PCI-Express 2.0 x16 devices, whereas the 8800 GTX supports only PCI-Express 1.1 x16.

Table 6  Configuration of the system used for the performance evaluation.

  CPU       AMD Phenom 9500, 2.2 GHz, quad-core
  Chipset   AMD 790FX
  RAM       DDR2-800 SDRAM, 1 GB x 4
  OS        Fedora Core 8 (Linux 2.6.23)
  Driver    NVIDIA Linux driver 169.04
  Software  CUDA SDK 1.1, CUDA Toolkit 1.1, GCC 4.1.2

Throughout the evaluation, all transforms are single-precision complex FFTs, and the performance of a 3-D FFT of size N^3 is reported in GFLOPS assuming an operation count of 15 N^3 log2 N floating-point operations.
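The conversion from measured time to GFLOPS under this convention can be written as a small host-side helper (our own illustration, not part of the measurement code); for N = 256 and the 25.3 ms measured on the 8800 GTX (Table 7) it evaluates to about 79.6 GFLOPS, matching the reported 79.5 GFLOPS up to rounding of the time.

    #include <math.h>
    #include <stdio.h>

    /* Operation-count convention used in the evaluation: a 3-D FFT of size
       N^3 is counted as 15 * N^3 * log2(N) floating-point operations. */
    static double fft3d_gflops(int n, double time_ms)
    {
        double flops = 15.0 * (double)n * n * n * log2((double)n);
        return flops / (time_ms * 1.0e6);   /* ms -> GFLOPS */
    }

    int main(void)
    {
        /* Example: N = 256 at 25.3 ms (8800 GTX) gives roughly 79.6 GFLOPS. */
        printf("%.1f GFLOPS\n", fft3d_gflops(256, 25.3));
        return 0;
    }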
We compare our implementation with the CUFFT library included in NVIDIA CUDA Toolkit 1.1. Table 7 shows the elapsed time and performance of the 3-D FFT of size 256^3. On all three GeForce 8 series GPUs, our implementation achieves 3.1 to 3.3 times the performance of CUFFT3D.

Table 7  Performance of the 3-D FFT of size 256^3.

                 Our implementation        CUFFT3D
  Model          Time (ms)   GFLOPS        GFLOPS
  8800 GT        34.4        58.5          18.6
  8800 GTS       31.2        64.6          20.6
  8800 GTX       25.3        79.5          23.4

Table 8 breaks the execution down into the five steps of Section 3.1. Steps 1 and 3 use the same kernel, as do steps 2 and 4, so they are reported together. Every step reads and writes the entire 128 MB array once, so the table also gives the effective memory bandwidth of each step.

Table 8  Elapsed time and achieved bandwidth in each step of our implementation.

                Step 1&3          Step 2&4          Step 5            Total
  Model         Time(ms)  GB/s    Time(ms)  GB/s    Time(ms)  GB/s    Time(ms)  GB/s
  8800 GT       6.89      38.9    6.78      39.5    7.24      37.0    34.57     38.8
  8800 GTS      6.31      42.5    6.29      42.7    6.00      44.7    31.20     43.0
  8800 GTX      4.52      59.3    4.84      55.3    6.92      38.7    25.64     52.3

On the 8800 GT and 8800 GTS, the bandwidth achieved in steps 1-4 approaches the bandwidth measured for the corresponding access patterns in Tables 4 and 5, so these steps are essentially bound by memory bandwidth; on the 8800 GTX, whose memory bandwidth is the highest, the SPs begin to influence the time of steps 1-4 as well. Step 5, which computes the 256-point FFTs along X, is clearly SP-bound: on the 8800 GTX it reaches only 38.7 GB/s, far below the available memory bandwidth, and because the 8800 GTS runs its SPs at a higher clock than the 8800 GTX, step 5 is actually faster on the 8800 GTS.

Step 5 by itself is equivalent to computing 65,536 independent 256-point 1-D FFTs. Table 9 compares step 5 of our implementation with CUFFT's 1-D FFT (CUFFT1D) on the same problem; our kernel is about 1.9 times faster. On the 8800 GTX, the 97.0 GFLOPS of step 5 corresponds to roughly 30% of the peak SP performance listed in Table 1; the remaining issue slots are spent on instructions other than floating-point arithmetic, such as index computation and data movement.

Table 9  Performance of 65,536 sets of 256-point 1-D FFTs.

                 Our implementation        CUFFT1D
  Model          Time (ms)   GFLOPS        Time (ms)   GFLOPS
  8800 GT        7.24        92.7          13.7        49.0
  8800 GTS       6.00        111.8         11.4        58.9
  8800 GTX       6.92        97.0          13.2        50.8
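The CUFFT1D column of Table 9 corresponds to a batched 1-D transform. A minimal host-side sketch of how such a batched transform can be issued with CUFFT is shown below (our own illustration, not the paper's measurement code; the function name and the assumption of an in-place transform on a device buffer "data" are ours).

    #include <cuda_runtime.h>
    #include <cufft.h>

    /* Sketch: 65,536 batched 256-point complex 1-D FFTs with CUFFT,
       corresponding to the CUFFT1D column of Table 9. "data" is assumed to
       be a device buffer holding 65,536 * 256 cufftComplex values. */
    void run_cufft1d_batch(cufftComplex *data)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, 256, CUFFT_C2C, 65536);     /* 256 points, batch = 65,536 */
        cufftExecC2C(plan, data, data, CUFFT_FORWARD); /* in-place forward transform */
        cudaThreadSynchronize();   /* CUDA 1.1-era sync; cudaDeviceSynchronize today */
        cufftDestroy(plan);
    }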
Table 10 shows the performance for 3-D FFTs of size 64^3, 128^3, and 256^3. For the smaller sizes the performance of our implementation is lower than for 256^3, but it is still several times higher than that of CUFFT3D.

Table 10  Performance (GFLOPS) of 3-D FFTs of size 64^3, 128^3, and 256^3.

                   CUFFT3D                   Our implementation
  Model         64^3    128^3   256^3      64^3    128^3   256^3
  8800 GT       5.33    11.5    18.6       37.0    49.1    58.5
  8800 GTS      3.80    12.3    20.6       42.2    55.6    64.6
  8800 GTX      5.11    15.3    23.4       46.5    67.5    79.5

Table 11 shows the performance of the 256^3 3-D FFT when the data transfers between host and device over PCI-Express are included. The transfers take about as long as the FFT itself, so the total performance falls to 17.9-24.2 GFLOPS. The 8800 GTX, which is limited to PCI-Express 1.1, has the lowest transfer bandwidth and therefore the lowest total performance, even though its on-device FFT is the fastest of the three GPUs. In real applications these transfers need not be paid for every transform: if the computations before and after the FFT are also performed on the GPU, the data can remain in device memory and the transfer cost is amortized.

Table 11  Performance of the 3-D FFT of size 256^3 including the data transfer time between host and device.

                           Host-to-Device     3-D FFT on Device    Device-to-Host     Total
  Model      PCI-Express   Time(ms)  GB/s     Time(ms)  GFLOPS     Time(ms)  GB/s     Time(ms)  GFLOPS
  8800 GT    2.0 x16       26.3      5.11     34.4      58.5       26.9      4.99      87.6     23.0
  8800 GTS   2.0 x16       25.9      5.18     31.2      64.6       26.2      5.12      83.3     24.2
  8800 GTX   1.1 x16       47.5      2.82     25.3      79.5       39.4      3.40     112       17.9

Finally, Table 12 shows the performance of single-precision 256^3 3-D FFTs on CPUs using the FFTW library 12), version 3.2alpha2, parallelized with OpenMP and using SSE on all cores of each system. The CPUs achieve around 9 to 11 GFLOPS. Even when the host-device transfer time is included (Table 11), the CUDA GPUs are roughly twice as fast as these CPUs, and roughly 5 to 8 times as fast when the data already resides in device memory.

Table 12  Performance of the 3-D FFT of size 256^3 in single precision using FFTW 3.2alpha2 on CPUs.

  Processor                  Clock       Sockets/Cores   Compiler                Time (ms)   GFLOPS
  AMD Phenom 9500            2.20 GHz    1/4             gcc 4.1.2               195         10.3
  Intel Core 2 Quad Q6700    2.66 GHz    1/4             gcc 4.1.2               188         10.7
  AMD Opteron 880            2.60 GHz    8/16            Intel Compiler 9.1.45   220         9.15
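For reference, the breakdown of Table 11 can be measured with a host-side harness of the following shape (our own sketch, not the paper's code): each phase is bracketed by CUDA events, and fft3d_256_device() is a hypothetical placeholder standing in for the proposed device FFT.

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Hypothetical placeholder for the 3-D FFT running entirely on the device. */
    static void fft3d_256_device(float2 *d_data) { (void)d_data; }

    static float section_ms(cudaEvent_t s, cudaEvent_t e)
    {
        float ms; cudaEventElapsedTime(&ms, s, e); return ms;
    }

    int main(void)
    {
        const size_t bytes = 256UL * 256 * 256 * sizeof(float2);   /* 128 MB */
        float2 *h_data, *d_data;
        cudaMallocHost((void **)&h_data, bytes);   /* pinned memory for fast PCIe */
        cudaMalloc((void **)&d_data, bytes);

        cudaEvent_t ev[4];
        for (int i = 0; i < 4; i++) cudaEventCreate(&ev[i]);

        cudaEventRecord(ev[0], 0);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  /* H2D */
        cudaEventRecord(ev[1], 0);
        fft3d_256_device(d_data);                                   /* 3-D FFT */
        cudaEventRecord(ev[2], 0);
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  /* D2H */
        cudaEventRecord(ev[3], 0);
        cudaEventSynchronize(ev[3]);

        printf("H2D %.1f ms, FFT %.1f ms, D2H %.1f ms\n",
               section_ms(ev[0], ev[1]), section_ms(ev[1], ev[2]),
               section_ms(ev[2], ev[3]));

        cudaFree(d_data);
        cudaFreeHost(h_data);
        return 0;
    }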
5. Conclusion

Recent GPUs offer far higher floating-point performance and memory bandwidth than CPUs, and the CUDA environment makes this power accessible from C/C++. We have proposed a high-performance 3-D FFT implementation for CUDA-capable GPUs: the Y- and Z-direction transforms are computed with 16-point multirow FFTs whose global memory accesses are fully coalesced, and the X-direction transform is computed with 256-point FFTs that exchange data through shared memory. On GeForce 8 series GPUs, the implementation achieves up to 79.5 GFLOPS for single-precision 3-D FFTs of size 256^3, which is 3.1 to 3.3 times the performance of the NVIDIA CUFFT library.

Acknowledgments  This work was supported in part by the JST CREST project "ULP-HPC" and by the Microsoft Technical Computing Initiative project "HPC-GPGPU: Large-Scale Commodity Accelerated Clusters and its Application to Advanced Structural Proteomics".

References
1) General-Purpose Computation Using Graphics Hardware. http://www.gpgpu.org/
2) Mark, W.R., Glanville, R.S., Akeley, K. and Kilgard, M.J.: Cg: A System for Programming Graphics Hardware in a C-like Language, ACM Trans. Graphics (Proc. SIGGRAPH 2003), Vol.22, No.3, pp.896-907 (2003).
3) Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M. and Hanrahan, P.: Brook for GPUs: Stream Computing on Graphics Hardware, ACM Trans. Graphics (Proc. SIGGRAPH 2004), Vol.23, No.3, pp.777-786 (2004).
4) NVIDIA CUDA Compute Unified Device Architecture. http://developer.nvidia.com/object/cuda.html
5) Stock, M.J. and Gharakhani, A.: Toward Efficient GPU-accelerated N-body Simulations, 46th AIAA Aerospace Sciences Meeting and Exhibit, AIAA 2008-608 (2008).
6) Larsen, E.S. and McAllister, D.: Fast Matrix Multiplies Using Graphics Hardware, Proc. 2001 ACM/IEEE Conference on Supercomputing (CD-ROM), ACM Press (2001).
7) Cooley, J.W. and Tukey, J.W.: An Algorithm for the Machine Calculation of Complex Fourier Series, Math. Comput., Vol.19, pp.297-301 (1965).
8) Moreland, K. and Angel, E.: The FFT on a GPU, Proc. SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003, pp.112-119 (2003).
9) Govindaraju, N.K., Larsen, S., Gray, J. and Manocha, D.: A Memory Model for Scientific Algorithms on Graphics Processors, Proc. 2006 ACM/IEEE Conference on Supercomputing (CD-ROM), IEEE (2006).
10) Swarztrauber, P.N.: FFT Algorithms for Vector Computers, Parallel Computing, Vol.1, pp.45-63 (1984).
11) Van Loan, C.: Computational Frameworks for the Fast Fourier Transform, SIAM Press, Philadelphia, PA (1992).
12) Frigo, M. and Johnson, S.G.: The Design and Implementation of FFTW3, Proc. IEEE, Vol.93, No.2, pp.216-231 (2005). Special issue on Program Generation, Optimization, and Platform Adaptation.

(Received January 29, 2008)
(Accepted May 1, 2008)