Parallelizing the Number Partitioning Problem for GPUs

Tomoya Sato 1    Noriyuki Fujimoto 1

Abstract: The number partitioning problem is an NP-hard problem that can be applied to public-key cryptography and so on. For the problem, Pedroso et al. presented an algorithm based on breadth-first search and beam search of a search tree. In this paper, we propose a method to parallelize Pedroso et al.'s algorithm for GPUs based on the CUDA architecture. The experimental results on an NVIDIA GeForce GTX 580 GPU and a 2.67GHz Intel Core 2 Duo CPU E7300 show that the CUDA C program based on the proposed method is up to 323 times faster than the serial Python program published by Pedroso et al. Another experiment shows that the proposed CUDA C program is up to 12.2 times faster than our serial C program ported from the Python program.

1. Introduction

The number partitioning problem is an NP-hard combinatorial optimization problem [2] with applications such as public-key cryptography [9]. For the problem, Pedroso et al. presented an algorithm based on breadth-first search and beam search of a search tree [9]. In this paper, we propose a method to parallelize Pedroso et al.'s algorithm for GPUs based on the CUDA architecture [1], [3], [6], [7], [11].

The rest of this paper is organized as follows. Section 2 reviews the number partitioning problem and Pedroso et al.'s algorithm. Section 3 proposes our parallelization method. Section 4 reports experimental results. Section 5 concludes the paper.

2. Preliminaries

2.1 The Number Partitioning Problem

Given a set A = {a_1, a_2, ..., a_N} of N positive integers, the number partitioning problem asks for a subset P of the index set {1, ..., N} that minimizes the discrepancy

    E_A(P) = | Σ_{i∈P} a_i − Σ_{i∉P} a_i |,

that is, the absolute difference between the sum of the elements selected by P and the sum of the remaining elements of A.

1 Graduate School of Science, Osaka Prefecture University

© 2012 Information Processing Society of Japan
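To make the definition concrete, the discrepancy of a given partition can be computed directly from P; the helper `discrepancy` below is written purely for illustration and is not part of Pedroso et al.'s code.

```python
def discrepancy(A, P):
    """Discrepancy E_A(P) = |sum_{i in P} a_i - sum_{i not in P} a_i|.

    A is a list of positive integers; P is a set of 0-based indices.
    (Illustrative helper, not part of Pedroso et al.'s code.)
    """
    in_p = sum(A[i] for i in P)
    return abs(in_p - (sum(A) - in_p))

A = [8, 7, 6, 5, 4]
print(discrepancy(A, {0, 3}))  # {8,5} vs {7,6,4}: |13 - 17| = 4
print(discrepancy(A, {0, 1}))  # {8,7} vs {6,5,4}: |15 - 15| = 0, a perfect partition
```

The second call exhibits a perfect partition of {8, 7, 6, 5, 4}, the instance used as a running example later in the paper.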
A partition P with E_A(P) = 0 (possible only when the sum of A is even) or E_A(P) = 1 (when the sum is odd) is called a perfect partition.

2.2 The Karmarkar-Karp Differencing Method

Karmarkar and Karp proposed a heuristic for the problem known as the differencing method (the KK method) [5]:

KK(A)
( 1 ) create n nodes v_1, ..., v_n, one per element a_i of A, with key(v_i) = a_i
( 2 ) T = {}  // the set of differencing edges
( 3 ) while two or more nodes remain
( 4 )   let u and v be the two nodes with the largest keys
( 5 )   add edge (u, v) to T
( 6 )   remove node v
( 7 )   key(u) = key(u) − key(v)
( 8 ) return the key of the last remaining node and T

The key returned by KK is the discrepancy of the partition it finds; the partition itself is recovered by 2-coloring the tree T, placing the two endpoints of every edge on opposite sides.

2.3 The Complete Differencing Method

The complete differencing method extends the KK method into an exact algorithm. At each step, instead of always replacing the two largest numbers with their difference, the search branches into two children: one replaces the two largest numbers with their difference, the other with their sum. Figure 1 shows the resulting search tree for the instance {8, 7, 6, 5, 4}. The search is pruned as follows: (1) when four or fewer numbers remain, the KK method gives an optimal partition of them, so the subtree need not be expanded; (2) the search terminates as soon as a partition with discrepancy 0 or 1 is found, since it is perfect; (3) when the largest number is at least the sum of the rest, only the difference branch needs to be explored.

2.4 Pedroso et al.'s Algorithm

Pedroso et al. combined breadth-first search of the differencing tree with beam search, which keeps only the most promising α nodes at each level (for beam search, see Pinedo [10]), and used the KK method to bound the search:

BFS(A, α, T)
( 1 ) E* = KK(A)
( 2 ) B = {A}  // the current level of the search tree
( 3 ) while B ≠ {} and CPU time used < T
( 4 )   C = {}  // nodes surviving to the next level
( 5 )   for each β in B
( 6 )     b = the largest number in β
( 7 )     r = the sum of the numbers in β \ {b}
( 8 )     if b − r = 0 or b − r = 1
( 9 )       return b − r  // the partition {b}, β \ {b} is perfect
( 10 )    if b ≥ r  // {b}, β \ {b} is the best partition reachable from β
( 11 )      if b − r < E*  // improved solution
( 12 )        E* = b − r
( 13 )    else
( 14 )      C = C ∪ {β}
( 15 )  B = {}
( 16 )  C = the best α nodes of C
( 17 )  for each β in C
( 18 )    b_1 = the largest number in β
( 19 )    b_2 = the second largest number in β
( 20 )    X = {b_1 − b_2} ∪ β \ {b_1, b_2}
( 21 )    Y = {b_1 + b_2} ∪ β \ {b_1, b_2}
( 22 )    B = B ∪ {X, Y}
( 23 )    E = KK(Y)
( 24 )    if E < E*  // improved solution
( 25 )      E* = E
( 26 ) return E*

Lines 16 to 21 select the surviving nodes and expand each of them into its two children, and line 23 applies the KK method to tighten the bound E*.
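The KK method and the beam search reviewed above can be sketched in Python (Pedroso et al.'s reference implementation is also in Python, but the following is an independent illustration, with the CPU time limit T omitted; in particular, the criterion used to rank the α surviving nodes at line 16 is an assumption of this sketch):

```python
import heapq

def kk(A):
    """Karmarkar-Karp differencing method (Section 2.2).

    Repeatedly replaces the two largest numbers with their difference;
    the last remaining number is the discrepancy of the KK partition.
    """
    heap = [-a for a in A]              # max-heap simulated with negated keys
    heapq.heapify(heap)
    while len(heap) >= 2:
        u = -heapq.heappop(heap)        # largest key
        v = -heapq.heappop(heap)        # second largest key
        heapq.heappush(heap, -(u - v))  # key(u) <- key(u) - key(v)
    return -heap[0]

def bfs(A, alpha):
    """Beam search over the differencing tree (Section 2.4), without T.

    Nodes are stored as descending sorted tuples.  How the best alpha
    nodes are chosen at each level is an assumption of this sketch:
    we keep the nodes whose largest element is smallest.
    """
    best = kk(A)                        # line 1: initial bound E*
    B = [tuple(sorted(A, reverse=True))]
    while B:
        C = []
        for beta in B:                  # lines 5-14: test each node
            b, rest = beta[0], beta[1:]
            r = sum(rest)
            if b - r in (0, 1):
                return b - r            # {b} vs. the rest is perfect
            if b >= r:                  # only differencing remains
                best = min(best, b - r)
            else:
                C.append(beta)
        C.sort(key=lambda beta: beta[0])    # line 16 (assumed ranking)
        B = []
        for beta in C[:alpha]:          # lines 17-25: expand survivors
            b1, b2, rest = beta[0], beta[1], beta[2:]
            X = tuple(sorted((b1 - b2,) + rest, reverse=True))
            Y = tuple(sorted((b1 + b2,) + rest, reverse=True))
            B += [X, Y]
            best = min(best, kk(Y))     # line 23: tighten the bound
    return best

print(kk([8, 7, 6, 5, 4]))       # 2
print(bfs([8, 7, 6, 5, 4], 10))  # 0 (perfect partition {8,7} vs {6,5,4})
```

On the running example {8, 7, 6, 5, 4}, KK alone stops at discrepancy 2, while the beam search reaches the perfect partition.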
Fig. 1 The complete differencing method for {8,7,6,5,4}.

3. The Proposed Parallelization Method

3.1 Overview

We implemented the algorithm with CUDA and the Thrust library [4]. The proposed method executes lines 17 to 25 of the algorithm in Section 2.4 on the GPU. The parallelization consists of two parts: the expansion of the nodes of each level, and the KK method applied to the expanded nodes.

3.2 Parallelizing the KK Method

The parallel KK method keeps the numbers of a node as an array sorted in ascending order in shared memory. Let A[0], A[1], ..., A[n−1] be this array. One differencing step replaces the two largest numbers A[n−1] and A[n−2] with their difference d = A[n−1] − A[n−2] and inserts d at the position q that keeps the array sorted.

The insertion position q is found in parallel (Fig. 2): each thread i examines one element, and every thread i with d < A[i] executes atomicMin on q with its index i, so that after all threads finish, q is the smallest index whose element exceeds d. The insertion is also done in parallel (Fig. 3): the elements are shifted one position to the right, i.e., A[j + 1] = A[j] for q ≤ j < n − 2, with one thread per element, and then d is stored in A[q]. Repeating this step until a single number remains yields the KK discrepancy.

How many threads should be assigned to one node for the KK method is discussed in Section 3.3; Fig. 4 plots the total execution time of the KK method against the number of threads per node.
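The per-step update of the parallel KK method can be checked with a sequential sketch that computes the same result as the data-parallel search and shift (the CUDA kernel itself uses one thread per element and atomicMin, which this Python sketch only imitates):

```python
def differencing_step(A):
    """One differencing step on an ascending sorted array (Section 3.2).

    Removes the two largest numbers A[n-1] and A[n-2] and inserts their
    difference d at the position q that keeps the array sorted.  In the
    CUDA kernel, q is found with one thread per element and atomicMin,
    and the shift is done with one thread per element; here both are
    done sequentially.
    """
    n = len(A)
    d = A[n - 1] - A[n - 2]
    # q = smallest index with d < A[q]; in the kernel, every thread i
    # with d < A[i] executes atomicMin(&q, i)
    q = n - 2                      # default: d goes after all survivors
    for i in range(n - 2):
        if d < A[i]:
            q = i
            break
    # shift A[q..n-3] one place to the right, then store d at position q
    return A[:q] + [d] + A[q:n - 2]

A = [4, 5, 6, 7, 8]
A = differencing_step(A)   # d = 8 - 7 = 1, inserted at q = 0
print(A)                   # [1, 4, 5, 6]
```

Iterating this step until one number remains reproduces the KK discrepancy computed by the heap-based formulation.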
Fig. 4 The number of threads per node and the total execution time of the KK differencing method.
Fig. 2 Parallel search of the position at which d should be inserted.
Fig. 3 Parallel insertion of d.

3.3 Parallelizing the Node Expansion

Lines 6 to 12 of the algorithm in Section 2.4 are executed on the CPU, while the node expansion in lines 18 to 22 and the KK method in line 23 are executed on the GPU. Each node of the current level is assigned to a fixed group of threads, which expands the node into its two children X and Y and runs the parallel KK method of Section 3.2 on the child Y.

The group size and the block size are chosen as follows. On the GeForce GTX 580, a streaming multiprocessor can hold at most 1536 resident threads and at most 8 resident thread blocks, and has 48KB of shared memory. We assign 8 threads to each node and pack 16 nodes into one thread block, so that a block has 16 × 8 = 128 threads and the sorted arrays of its 16 nodes are kept in the shared memory available to the block. Figure 4 compares the total execution time of the KK method for different numbers of threads per node, and Fig. 5 shows the speedup ratio of 16 threads per node to 8 threads per node; since 16 threads per node gave no consistent speedup over 8 threads per node, we adopted 8 threads per node.
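The block configuration can be checked with a small calculation. The per-multiprocessor limits below are the published compute-capability-2.0 limits of the GTX 580; the derived per-node shared-memory budget is a back-of-the-envelope figure of this sketch, not a measurement from the paper.

```python
# Resident-thread and shared-memory limits of a GeForce GTX 580
# (compute capability 2.0); the per-node budget derived below is an
# illustrative calculation, not a figure reported in the paper.
THREADS_PER_NODE = 8
NODES_PER_BLOCK = 16
MAX_RESIDENT_THREADS = 1536        # per streaming multiprocessor
MAX_RESIDENT_BLOCKS = 8            # per streaming multiprocessor
SHARED_PER_SM = 48 * 1024          # bytes of shared memory per SM

threads_per_block = THREADS_PER_NODE * NODES_PER_BLOCK          # 128
blocks_per_sm = min(MAX_RESIDENT_BLOCKS,
                    MAX_RESIDENT_THREADS // threads_per_block)  # 8
shared_per_block = SHARED_PER_SM // blocks_per_sm               # 6144 bytes
shared_per_node = shared_per_block // NODES_PER_BLOCK           # 384 bytes
print(threads_per_block, blocks_per_sm, shared_per_block, shared_per_node)
```

With 128 threads per block, the block-count limit of 8 (not the thread-count limit) caps occupancy, which is why the shared memory of a multiprocessor is divided among at most 8 resident blocks.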
Fig. 5 Speedup ratio of 16 threads per node to 8 threads per node for the KK differencing method.
Fig. 6 Speedup ratio of our GPU program to the corresponding Python program.

4. Experiments

The experimental environment is as follows: the CPU is a 2.67GHz Intel Core 2 Duo E7300, the GPU is an NVIDIA GeForce GTX 580, and the OS is Windows 7 Professional 64bit; the programs were built with CUDA 4.2 and Visual Studio 2008 Professional. As CPU programs for comparison, we used the Python program published by Pedroso et al. on the web [8], run with Python 2.7, and our serial C program ported from it.

The benchmark instances are those distributed with Pedroso et al.'s implementation: sets of N numbers with N ranging from 15 to 105, where each number has 10, 12, or 14 decimal digits, including both easy instances and hard instances. The beam width α was set to 10, 100, 1000, 10000, and 100000.

Figure 6 shows the speedup ratio of our GPU program over the Python program; the GPU program is from 60 to 323 times faster. Figure 7 shows the speedup ratio of the GPU program over the C program; the GPU program is up to 12.2 times faster, although for some instances the C program on the CPU outperforms the GPU program. Figure 8 shows, for each α, the ratio of the time spent transferring data between the CPU and the GPU to the whole execution time of the GPU program.

Fig. 7 Speedup ratio of our GPU program to the corresponding C program.

5. Conclusion

In this paper, we proposed a method to parallelize Pedroso et al.'s algorithm for the number partitioning problem on GPUs based on the CUDA architecture. The proposed CUDA C program is up to 323 times faster than Pedroso et al.'s serial Python program and up to 12.2 times faster than our serial C program ported from it.

The current implementation handles numbers that fit in 64-bit integers. When the largest number a_max is small, the problem can be solved on the CPU by pseudo-polynomial dynamic programming in O(N a_max) time, so the GPU implementation is most valuable for instances with large numbers; extending it to multiple-precision numbers such as 1024-bit integers is future work.
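Benchmark instances of this shape, i.e., N uniform random integers with a fixed number of decimal digits, can be generated as below. This is a generic sketch and not necessarily Pedroso et al.'s exact generator, whose details are not reproduced here.

```python
import random

def random_instance(n, digits, seed=None):
    """n random positive integers, each with exactly `digits` decimal digits.

    Generic generator for illustration; not Pedroso et al.'s exact one.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return [rng.randint(lo, hi) for _ in range(n)]

A = random_instance(15, 10, seed=0)   # e.g. a 15-number, 10-digit instance
```

Note that with d-digit numbers and small N, a perfect partition is unlikely to exist (the number of partitions 2^N is far below the range 10^d of attainable sums), which is what makes such instances hard.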
Fig. 8 Ratio of data transfer time between the CPU and the GPU to the whole execution time of our GPU program.

References
[1] (authors and title in Japanese): an introductory book on CUDA programming (in Japanese) (2009).
[2] M. R. Garey and D. S. Johnson: Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman (1979).
[3] M. Garland and D. B. Kirk: Understanding Throughput-Oriented Architectures, Comm. ACM, 53(11), 58-66 (2010).
[4] J. Hoberock and N. Bell: Thrust (online), http://code.google.com/p/thrust/ (accessed 2012.01.11).
[5] N. Karmarkar and R. M. Karp: The differencing method of set partitioning, Technical Report UCB/CSD 82/113, University of California, Berkeley, Computer Science Division (1982).
[6] D. B. Kirk and W. W. Hwu: Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann (2010).
[7] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym: NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, 28(2), 39-55 (2008).
[8] J. P. Pedroso: Tree Search for Number Partitioning: An Implementation in the Python language, Internet repository, version 1.0 (online), http://www.dcc.fc.up.pt/~jpp/partition (accessed 2011.07.05).
[9] J. P. Pedroso and M. Kubo: Heuristics and exact methods for number partitioning, European Journal of Operational Research, 202, 73-81 (2010).
[10] M. Pinedo: Scheduling: Theory, Algorithms, and Systems, Prentice-Hall, Englewood Cliffs (1995).
[11] J. Sanders and E. Kandrot: CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional (2010).