GPU GPU GPU GPU. GPU (Graphics Processing Unit) GPU GPU GPU AGPU [11] AGPU. GPGPU (general-purpose GPU) GPU GPU AGPU GPU

GPU 1,a) 2,b) GPU GPU Merrill Radix [14] MSD Radix Splitter-Based

a) koike@nii.ac.jp
b) sada@mist.i.u-tokyo.ac.jp

1. GPU (Graphics Processing Unit) GPU GPU GPGPU (general-purpose GPU) GPU GPU NVIDIA GPGPU CUDA [15] CUDA GPU GPU RAM (Random Access Machine) RAM RAM RAM PRAM [6] PRAM GPU GPU [13] GPU GPU GPU GPU AGPU [11] AGPU AGPU GPU GPU GPU [2], [7], [8], [10], [12], [14], [16], [17], [18], [19], [21]. Merrill Radix [14] LSD Radix AGPU 2 MSD Radix MSD Radix Splitter-Based

[Figure: AGPU; labels p, w]

Merrill Radix MSD Merrill Radix 2 AGPU 3 Merrill Radix AGPU 4 MSD Radix 5 6

2. AGPU AGPU [11] GPU AGPU GPU AGPU AGPU AGPU GPU

2.1 AGPU 1 AGPU GPU) CPU) p w k 2 p = k 1 GPU C 2 GPU 2 1 1

[Figure: label "Wait due to global memory access"]

1 2 ) AGPUp,,, C, w),c,w

2.2 AGPU I/O 1 I/O I/O

2.3 2.1 GPU I/O I/O 1 C GPU 1 C AGPUp,,, C) m c := c/m CUDA AGPU C C

3. Radix Merrill Radix [14] AGPU

3.1 Merrill Radix [14] LSD Radix [3] r = 2^d k = p/) 4 r = 4 3 3 r_1, r_2, r_3, r_4 r_1 < r_2 < r_3 < r_4

[Figure 3: labels P1 P2 P3 P4 and r_1 r_2 r_3 r_4]

3 r_1 r_1 r_2, r_3, r_4 3

1) Bottom-level Reduction
2) Top-level Scan
3) Bottom-level Scan/Scatter

3 16 Bottom-level Reduction 3 16 Harris GPU Reduction Cascading [9] Top-level Scan Bottom-level Reduction Prefix Scan GPU Prefix Scan Tree-Based [20] Bottom-level Scan/Scatter multi-scan multi-scan Dotsenko prefix scan Matrix-Based [4] Matrix-Based a a )a AGPU [22] multi-scan Dotsenko prefix scan

3.2 Prefix Scan Scatter Scatter Prefix Scan r = a I/O Prefix Scan w w/ log r n p 1

4. MSD Radix 3 Merrill Radix [14] MSD Radix MSD Radix MSD Radix GPU Radix 5 Splitter-Based

4.1 Merrill Radix [14] Merrill Radix MSD LSD Radix Sort 2 4
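The three phases listed in Section 3.1 (bottom-level reduction, top-level scan, bottom-level scan/scatter) can be illustrated with a sequential sketch of one LSD radix pass. This is only a minimal CPU-side illustration, not the GPU implementation of [14]: the function names `radix_pass` and `lsd_radix_sort` and the parameters `bits`, `tile_size`, and `key_bits` are hypothetical, each "tile" stands in for the block of keys handled by one group of threads, and keys are assumed to be non-negative integers below `2**key_bits`.

```python
from itertools import accumulate


def radix_pass(keys, shift, bits, tile_size):
    """One stable pass on the digit (key >> shift) & (2**bits - 1)."""
    radix = 1 << bits
    mask = radix - 1
    tiles = [keys[i:i + tile_size] for i in range(0, len(keys), tile_size)]

    # 1) Bottom-level reduction: every tile counts its keys per digit value.
    counts = [[0] * radix for _ in tiles]
    for t, tile in enumerate(tiles):
        for key in tile:
            counts[t][(key >> shift) & mask] += 1

    # 2) Top-level scan: exclusive prefix sum over the counts in
    #    digit-major order, giving each (digit, tile) its output offset.
    flat = [counts[t][d] for d in range(radix) for t in range(len(tiles))]
    offsets = [0] + list(accumulate(flat))[:-1]

    # 3) Bottom-level scan/scatter: every tile walks its keys in order and
    #    writes each one to the next free slot of its (digit, tile) range.
    out = [0] * len(keys)
    for t, tile in enumerate(tiles):
        cursor = [offsets[d * len(tiles) + t] for d in range(radix)]
        for key in tile:
            d = (key >> shift) & mask
            out[cursor[d]] = key
            cursor[d] += 1
    return out


def lsd_radix_sort(keys, key_bits=32, bits=4, tile_size=8):
    """Sort by applying radix_pass from the least significant digit upward."""
    for shift in range(0, key_bits, bits):
        keys = radix_pass(keys, shift, bits, tile_size)
    return keys
```

Because offsets are assigned in digit-major, tile-minor order and each tile scatters its keys in input order, every pass is stable, which is what makes the least-significant-digit-first ordering of passes correct.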

1 w I/O ) nw nw LSD Radix O O r + log O log r p log r r r

[Figure 4: labels P_1 P_2 P_3 P_4, P_21 P_22, P_31 P_32 P_33]

[Figure 5: labels P_1 P_2 P_3, steps 1) and 2), r_1 r_2 r_3 r_4]

3 3 5 1) 2) 2) 5 1) 2) LSD Radix sort Merrill LSD Radix

5. Splitter-Based 4 r 1 Aggarwal [1] I/O Distribution Aggarwal Distribution

5.1 Aggarwal Distribution Aggarwal [1] Distribution 1 2 1) 2) Radix 2) MSD Radix 1) Aggarwal Distribution I/O I/O 1 1 1 ) I/O ) I/O, ) Aggarwal Distribution n I/O 6 S/4 S S i 4i/S 2
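The general idea behind a splitter-based distribution sort in the style of Aggarwal and Vitter [1] (sort small chunks, sample them at a fixed stride, choose splitters from the sorted sample, distribute the keys into buckets, recurse) can be sketched as follows. This is a hedged illustration only: the names `choose_splitters` and `distribution_sort` and the parameters `num_buckets`, `chunk_size`, and `stride` are hypothetical, and their default values are not the parameters used in this paper's analysis.

```python
import bisect


def choose_splitters(keys, num_buckets, chunk_size, stride):
    """Sample every `stride`-th key of each sorted chunk, then take
    evenly spaced elements of the sorted sample as splitters."""
    sample = []
    for i in range(0, len(keys), chunk_size):
        chunk = sorted(keys[i:i + chunk_size])
        sample.extend(chunk[stride - 1::stride])
    sample.sort()
    if not sample:
        return []
    step = len(sample) / num_buckets
    picked = [sample[int(j * step)] for j in range(1, num_buckets)]
    return sorted(set(picked))  # drop duplicate splitters


def distribution_sort(keys, num_buckets=4, chunk_size=16, stride=4):
    """Recursively distribute keys into buckets delimited by splitters."""
    if len(keys) <= chunk_size:
        return sorted(keys)  # base case: the input fits in one chunk
    splitters = choose_splitters(keys, num_buckets, chunk_size, stride)
    buckets = [[] for _ in range(len(splitters) + 1)]
    for key in keys:
        # Bucket j receives keys with splitters[j-1] < key <= splitters[j].
        buckets[bisect.bisect_left(splitters, key)].append(key)
    if max(len(b) for b in buckets) == len(keys):
        return sorted(keys)  # no progress, e.g. all keys are equal
    out = []
    for bucket in buckets:
        out.extend(distribution_sort(bucket, num_buckets, chunk_size, stride))
    return out
```

Sampling from sorted chunks rather than from the raw input is what bounds how unevenly the splitters can divide the keys, which is the property the rank argument relies on.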

[5] S I/O index i rank ) i S/4 rank i) 4i S S 2 4 = i S rank i) < i S + S 4 < S i + 1 4 5 4 S S = ) O log 4 5 S log = O = 4 log 5 O log I/O O I/O O log [1]

5.2 I/O, ) I/O AGPUp,, ) [11] AGPU 1 AGPUp,,, C) 1 S/4 4/S =4 S S i 4i [5] S 1 I/O OS) S I/O S S = / 4 5 4 S = O1) O) log 4S/5 4S I/O 1 I/O O/) O/) ) I/O O log 2

6. Merrill Radix AGPU 2 MSD Radix Splitter-Based MSD Radix Merrill Radix Splitter-Based GPU Merrill Radix [14]

S/4 U 1 U 2 4/S 2 S : 4/S 1) / U / U : ) 2 Splitter-Based 6 I/O [11] Distribution Splitter-Based S = ) I/O Ω log Ω p log - ) O log O S + log log O S) ) p S ) O log O p log log O1)

[1] Aggarwal, A. and Vitter, J. S.: The input/output complexity of sorting and related problems, Commun. ACM, Vol. 31, No. 9, pp. 1116-1127 (online), DOI: 10.1145/48529.48535 (1988).
[2] Capannini, G., Silvestri, F., Baraglia, R. and Nardini, F.: Sorting using bitonic network with CUDA, Proceedings of the 7th Workshop on LSDS-IR (2009).
[3] Cormen, T. H., Leiserson, C. E., Rivest, R. L. and Stein, C.: Introduction to Algorithms, Third Edition, The MIT Press, 3rd edition (2009).
[4] Dotsenko, Y., Govindaraju, N. K., Sloan, P.-P., Boyd, C. and Manferdelli, J.: Fast scan algorithms on graphics processors, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, New York, NY, USA, ACM, pp. 205-213 (online), DOI: 10.1145/1375527.1375559 (2008).
[5] Floyd, R.: Permuting Information in Idealized Two-Level Storage, Complexity of Computer Computations (Miller, R., Thatcher, J. and Bohlinger, J., eds.), The IBM Research Symposia Series, Springer US, pp. 105-109 (1972).
[6] Fortune, S. and Wyllie, J.: Parallelism in random access machines, Proceedings of the tenth annual ACM symposium on Theory of computing, STOC '78, New York, NY, USA, ACM, pp. 114-118 (online), DOI: 10.1145/800133.804339 (1978).
[7] Govindaraju, N., Gray, J., Kumar, R. and Manocha, D.: GPUTeraSort: high performance graphics co-processor sorting for large database management, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, SIGMOD '06, New York, NY, USA, ACM, pp. 325-336 (online), DOI: 10.1145/1142473.1142511 (2006).
[8] Greß, A. and Zachmann, G.: GPU-ABiSort: optimal parallel sorting on stream architectures, Proceedings of the 20th international conference on Parallel and distributed processing, IPDPS '06, Washington, DC, USA, IEEE Computer Society, pp. 45-45 (online), available from http://dl.acm.org/citation.cfm?id=1898953.1898980 (2006).
[9] Harris, M.: Optimizing Parallel Reduction in CUDA (2008).
[10] Khorasani, E., Paulovicks, B. D., Sheinin, V. and Yeo, H.: Parallel implementation of external sort and join operations on a multi-core network-optimized system on a chip, Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I, ICA3PP '11, Berlin, Heidelberg, Springer-Verlag, pp. 318-325 (online), available from http://dl.acm.org/citation.cfm?id=2075416.2075446 (2011).
[11] Koike, A. and Sadakane, K.: A Novel Computational Model for GPUs with Application to I/O Optimal Sorting Algorithms, 2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops, pp. 614-623 (online), DOI: 10.1109/IPDPSW.2014.72 (2014).
[12] Kolonias, V., Voyiatzis, A. G., Goulas, G. and Housos, E.: Design and implementation of an efficient integer count sort in CUDA GPUs, Concurr. Comput.: Pract. Exper., Vol. 23, No. 18, pp. 2365-2381 (online), DOI: 10.1002/cpe.1776 (2011).
[13] Kothapalli, K., Mukherjee, R., Rehman, M., Patidar, S., Narayanan, P. and Srinathan, K.: A performance prediction model for the CUDA GPGPU platform, High Performance Computing (HiPC), 2009 International Conference on, pp. 463-472 (online), DOI: 10.1109/HIPC.2009.5433179 (2009).
[14] Merrill, D. and Grimshaw, A.: High Performance and Scalable Radix Sorting: A case study of implementing dynamic parallelism for GPU computing, Parallel Processing Letters, Vol. 21, No. 02, pp. 245-272 (online), DOI: 10.1142/S0129626411000187 (2011).
[15] NVIDIA Corporation: NVIDIA CUDA C Programming Guide version 4.2 (2012).
[16] Peters, H., Schulz-Hildebrandt, O. and Luttenberger, N.: Fast in-place sorting with CUDA based on bitonic sort, Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I, PPAM '09, Berlin, Heidelberg, Springer-Verlag, pp. 403-410 (online), available from http://dl.acm.org/citation.cfm?id=1882792.1882841 (2010).
[17] Peters, H., Schulz-Hildebrandt, O. and Luttenberger, N.: A Novel Sorting Algorithm for Many-core Architectures Based on Adaptive Bitonic Sort, Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS '12, Washington, DC, USA, IEEE Computer Society, pp. 227-237 (online), DOI: 10.1109/IPDPS.2012.30 (2012).
[18] Satish, N., Harris, M. and Garland, M.: Designing efficient sorting algorithms for manycore GPUs, Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, Washington, DC, USA, IEEE Computer Society, pp. 1-10 (online), DOI: 10.1109/IPDPS.2009.5161005 (2009).
[19] Satish, N., Kim, C., Chhugani, J., Nguyen, A. D., Lee, V. W., Kim, D. and Dubey, P.: Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort, Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD '10, New York, NY, USA, ACM, pp. 351-362 (online), DOI: 10.1145/1807167.1807207 (2010).
[20] Sengupta, S., Harris, M. and Garland, M.: Efficient parallel scan algorithms for GPUs, Technical Report NVR-2008-003, NVIDIA (2008).
[21] Ye, X., Fan, D., Lin, W., Yuan, N. and Ienne, P.: High performance comparison-based sorting algorithm on many-core GPUs, Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp. 1-10 (online), DOI: 10.1109/IPDPS.2010.5470445 (2010).
[22] , : AGPU, COMP DS-1-13 (2013).