A Parallel Skeleton Library in C++ with Optimization Mechanism

Skeletal parallel programming enables programmers to build a parallel program from ready-made components called skeletons (parallel primitives) for which efficient implementations are known to exist, which makes both parallel program development and the parallelization process easier. Parallel programs written in terms of skeletons are, however, not always efficient, because intermediate data structures that do not appear in the final result may be produced and passed between skeletons. To overcome this problem and make skeletal parallel programming more practical, this paper proposes a new parallel skeleton library in C++. The system has an optimization mechanism that uses fusion transformation to turn successive calls of parallel skeletons into a single function call. This paper describes the implementation of the skeleton library and reports the effects of the optimization.

A Parallel Skeleton Library in C++ with Optimization Mechanism.
Yoshiki Akashi, Graduate School of Electro-Communications, The University of Electro-Communications.
Kiminori Matsuzaki, Kazuhiko Kakehi, Graduate School of Information Science and Technology, The University of Tokyo.
Hideya Iwasaki, Department of Computer Science, The University of Electro-Communications.
Zhenjiang Hu, Graduate School of Information Science and Technology, The University of Tokyo.
PRESTO 21, Japan Science and Technology Agency.
Vol.22, No.4 (2005), pp.78-83.
2005 2 18.

1

[6]

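Vol. 16 No. 5 Sep. 1999

2 BMF

The notation in this section follows BMF [3].

2.1

Function composition is written $f \circ g$:

\[ (f \circ g)\ x = f\ (g\ x) \]

A binary operator $\oplus$ can be sectioned:

\[ a \oplus b = (a\,\oplus)\ b = (\oplus\,b)\ a = (\oplus)\ a\ b \]

$[\,]$ is the empty list, $[a]$ is the singleton list with element $a$, and $x \mathbin{+\!\!+} y$ is the concatenation of lists $x$ and $y$. For example, $[1] \mathbin{+\!\!+} [2] \mathbin{+\!\!+} [3] = [1, 2, 3]$. $[a] \mathbin{+\!\!+} x$ is also written $a : x$.

2.2

The BMF skeletons used here are the four skeletons map, reduce, scan, and zip.

map applies $f$ to every element:

\[ \mathit{map}\ f\ [x_1, x_2, \ldots, x_n] = [f\ x_1, f\ x_2, \ldots, f\ x_n] \]

reduce combines all elements with $\oplus$:

\[ \mathit{reduce}\ (\oplus)\ [x_1, x_2, \ldots, x_n] = x_1 \oplus x_2 \oplus \cdots \oplus x_n \]

scan keeps the intermediate results of reduce, starting from $e$:

\[ \mathit{scan}\ (\oplus)\ [x_1, x_2, \ldots, x_n] = [e,\ e \oplus x_1,\ \ldots,\ e \oplus x_1 \oplus \cdots \oplus x_n] \]

zip pairs the corresponding elements of two lists of the same length:

\[ \mathit{zip}\ [x_1, x_2, \ldots, x_n]\ [y_1, y_2, \ldots, y_n] = [(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)] \]

In addition to these four, Hu et al. [7][9] proposed the accumulate skeleton.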
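The four basic skeletons above can be read as ordinary sequential list functions. The following is a minimal sketch of that reading, assuming `std::vector` as the list type. The names `map_seq`, `reduce_seq`, `scan_seq`, and `zip_seq` are hypothetical and are not part of the library described in this paper. `scan_seq` takes `e` as an explicit argument, which is also an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sequential reference versions only; nothing here is parallel.

// map f [x1, ..., xn] = [f x1, ..., f xn]
template <typename A, typename F>
auto map_seq(F f, const std::vector<A>& xs)
    -> std::vector<decltype(f(xs[0]))> {
    std::vector<decltype(f(xs[0]))> ys;
    ys.reserve(xs.size());
    for (const A& x : xs) ys.push_back(f(x));
    return ys;
}

// reduce (+) [x1, ..., xn] = x1 + x2 + ... + xn (non-empty input assumed)
template <typename A, typename Op>
A reduce_seq(Op op, const std::vector<A>& xs) {
    assert(!xs.empty());
    A acc = xs[0];
    for (std::size_t i = 1; i < xs.size(); ++i) acc = op(acc, xs[i]);
    return acc;
}

// scan (+) [x1, ..., xn] = [e, e + x1, ..., e + x1 + ... + xn]
template <typename A, typename Op>
std::vector<A> scan_seq(Op op, A e, const std::vector<A>& xs) {
    std::vector<A> ys;
    ys.reserve(xs.size() + 1);
    ys.push_back(e);
    for (const A& x : xs) ys.push_back(op(ys.back(), x));
    return ys;
}

// zip [x1, ..., xn] [y1, ..., yn] = [(x1, y1), ..., (xn, yn)]
template <typename A, typename B>
std::vector<std::pair<A, B>> zip_seq(const std::vector<A>& xs,
                                     const std::vector<B>& ys) {
    assert(xs.size() == ys.size());
    std::vector<std::pair<A, B>> zs;
    zs.reserve(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i) zs.emplace_back(xs[i], ys[i]);
    return zs;
}
```

Note that the result of `scan_seq` has one more element than its input, exactly as in the defining equation.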

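accumulate is defined with functions $g$, $p$, $q$ and binary operators $\oplus$, $\otimes$:

\[ \mathit{accumulate}\ [\,]\ e = g\ e \]
\[ \mathit{accumulate}\ (a : x)\ e = p\ (a, e) \oplus \mathit{accumulate}\ x\ (e \otimes q\ a) \]

It is written $\mathit{accumulate} = [[g, (p, \oplus), (q, \otimes)]]$.

3

The skeleton library is implemented in C++ on top of MPICH.

3.1 dist_array

A dist_array is constructed from an array and its size:

    dist_array<int> *as = new dist_array<int>(array, size);

3.2

The skeletons of Section 2.2 are provided as member functions of dist_array. For map, the following three declarations are available, where A is the element type of the dist_array:

    template<typename B>
    dist_array<B>* map(B (*f)(const A&)) const;
    template<typename B>
    void map(void (*f)(B*, const A*), dist_array<B> *bs) const;
    void map_ow(A (*f)(const A&));

1. The first map returns its result as a new dist_array<B>.
2. The second map stores its result in the dist_array<B> passed as bs.
3. map_ow overwrites the elements of the dist_array itself, so f must map A to A.

For example, f is applied in place to every element of as by:

    as->map_ow(f);

With n elements, p processors, and an O(1) function f, each processor applies f to its n/p elements, so map takes O(n/p) time.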
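The O(n/p) cost of map can be made concrete with a single-process stand-in. The sketch below is not the library's dist_array: `block_array`, `largest_block`, `to_vector`, and `twice` are hypothetical names, the contiguous block distribution is an assumption, and the blocks are visited in a plain loop instead of by separate MPI processes. Only the `map_ow` signature is taken from the declaration above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for dist_array: n elements are split into p
// contiguous blocks, one per simulated processor (assumed distribution).
template <typename A>
class block_array {
public:
    block_array(const A* array, std::size_t n, std::size_t p) : blocks_(p) {
        assert(p > 0);
        const std::size_t base = n / p, extra = n % p;
        std::size_t pos = 0;
        for (std::size_t r = 0; r < p; ++r) {
            // The first `extra` blocks get one additional element.
            const std::size_t len = base + (r < extra ? 1 : 0);
            blocks_[r].assign(array + pos, array + pos + len);
            pos += len;
        }
    }

    // Same shape as the library's map_ow: f overwrites each element.
    // Each block is independent, so real processors could run in parallel.
    void map_ow(A (*f)(const A&)) {
        for (auto& block : blocks_)
            for (auto& x : block) x = f(x);
    }

    // Work done by the busiest simulated processor: ceil(n / p) elements.
    std::size_t largest_block() const {
        std::size_t m = 0;
        for (const auto& b : blocks_) m = std::max(m, b.size());
        return m;
    }

    // Gather all blocks back into one vector, in the original order.
    std::vector<A> to_vector() const {
        std::vector<A> out;
        for (const auto& b : blocks_) out.insert(out.end(), b.begin(), b.end());
        return out;
    }

private:
    std::vector<std::vector<A>> blocks_;
};

// Example element function with the signature map_ow expects.
inline int twice(const int& x) { return 2 * x; }
```

With 10 elements on 4 simulated processors the blocks have sizes 3, 3, 2, and 2, so no processor applies f more than 3 times.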

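reduce first reduces the elements held by each processor to one value in O(n/p) time and then combines these values across the processors in O(log p) time, so reduce takes O(n/p + log p) time.

scan consists of an O(n/p) local step, an O(log p) step across the processors, and another O(n/p) local step, so scan takes O(n/p + log p) time.

zip takes two dist_arrays and produces a dist_array of C++ pairs. Like map, zip takes O(n/p) time.

3.3

As an example, consider the variance var of as = [a_1, a_2, ..., a_n]:

\[ \mathit{var} = \sum_{i=1}^{n} (a_i - \mathit{ave})^2 / n, \qquad \mathit{ave} = \sum_{i=1}^{n} a_i / n \]

Figure 2 (a) shows the BMF program and Figure 2 (b) the corresponding program written with the library, where as is a dist_array of n elements.

    var as = sqsum / n
      where sum   = reduce (+) as
            ave   = sum / n
            sqsum = reduce (+) (map square (map (- ave) as))

    (a) BMF

    sum = as->reduce(add);
    ave = sum / n;
    as->map_ow(sub_ave);
    as->map_ow(square);
    sq_sum = as->reduce(add);
    var = sq_sum / n;

    (b)

    Figure 2

    ...
    for(int i = 0; i < number; i++){
      ave_a[i] = a[i].reduce(add) / size;
      ave += ave_a[i];
    }
    ...
    for(int i = 0; i < number; i++){
      a[i].map_ow(sub_ave);
      a[i].map_ow(square);
    }
    for(int i = 0; i < number; i++)
      st += a[i].reduce(add);
    ...

    Figure 3

The program of Figure 3 handles number arrays of size elements each.

4

For example, map f (map g x) calls map twice, whereas the equivalent map (f ∘ g) x calls map only once.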
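The effect of replacing map f (map g x) by map (f ∘ g) x can be checked on the variance example of Figure 2 (b). The sketch below is sequential and uses `std::transform` and `std::accumulate` in place of map_ow and reduce. The function names `variance_two_maps` and `variance_one_map` are hypothetical, and lambdas that capture `ave` replace the sub_ave function of the figure, whose definition is not shown in the source.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Two passes over the data, mirroring map_ow(sub_ave); map_ow(square).
inline double variance_two_maps(std::vector<double> as) {
    assert(!as.empty());
    const double n = static_cast<double>(as.size());
    const double ave = std::accumulate(as.begin(), as.end(), 0.0) / n;
    std::transform(as.begin(), as.end(), as.begin(),
                   [ave](double a) { return a - ave; });   // sub_ave
    std::transform(as.begin(), as.end(), as.begin(),
                   [](double a) { return a * a; });        // square
    return std::accumulate(as.begin(), as.end(), 0.0) / n;
}

// One pass with the composed function square . sub_ave.
inline double variance_one_map(std::vector<double> as) {
    assert(!as.empty());
    const double n = static_cast<double>(as.size());
    const double ave = std::accumulate(as.begin(), as.end(), 0.0) / n;
    std::transform(as.begin(), as.end(), as.begin(),
                   [ave](double a) { double d = a - ave; return d * d; });
    return std::accumulate(as.begin(), as.end(), 0.0) / n;
}
```

Both functions compute the same value; the second traverses the array once less, which is the saving that map fusion aims at.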

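4.1

The fusion follows Hu et al. [8], where accumulate is fused by means of cataj and buildj.

Definition (cataj). For $\oplus$, $p$, and $e$, cataj is defined by

\[ \mathit{cataj}\ [\,] = e \]
\[ \mathit{cataj}\ (a : x) = p\ a \oplus \mathit{cataj}\ x \]

and is written $([\oplus, p, e])$.

Definition (buildj).

\[ \mathit{buildj}\ \mathit{gen} = \mathit{gen}\ (\mathbin{+\!\!+})\ [\cdot]\ [\,] \]

cataj replaces the append operator $\mathbin{+\!\!+}$ by $\oplus$, the singleton constructor $[\cdot]$ by $p$, and $[\,]$ by $e$; reduce is a cataj. buildj supplies these three constructors to gen.

The skeletons are expressed with cataj and buildj as follows, where id is the identity function:

\[ \mathit{map}\ f = \mathit{buildj}\ (\lambda c\ s\ e.\ ([c,\ s \circ f,\ e])) \]
\[ \mathit{reduce}\ (\oplus) = ([\oplus,\ \mathit{id},\ e]) \]
\[ \mathit{scan}\ (\oplus)\ x = \mathit{buildj}\ (\lambda c\ s\ e.\ [[s,\ (\lambda(a, e).\ s\ e,\ c),\ (\mathit{id}, \oplus)]])\ x\ e \]

CataJ-BuildJ rule:

\[ ([c, s, e]) \circ \mathit{buildj}\ \mathit{gen} = \mathit{gen}\ c\ s\ e \]

For example, a map followed by a reduce fuses into a single cataj:

\[ \mathit{reduce}\ (\oplus) \circ \mathit{map}\ f = ([\oplus, \mathit{id}, e]) \circ \mathit{buildj}\ (\lambda c\ s\ e.\ ([c,\ s \circ f,\ e])) \]
\[ = (\lambda c\ s\ e.\ ([c,\ s \circ f,\ e]))\ (\oplus)\ \mathit{id}\ e \]
\[ = ([\oplus, f, e]) \]

BuildJ(CataJ-BuildJ) rule, which applies to compositions such as map f ∘ map g:

\[ \mathit{buildj}\ (\lambda c\ s\ e.\ ([\varphi_1, \varphi_2, \varphi_3])) \circ \mathit{buildj}\ \mathit{gen} = \mathit{buildj}\ (\lambda c\ s\ e.\ \mathit{gen}\ \varphi_1\ \varphi_2\ \varphi_3) \]

Applied to map f ∘ map g, it gives

\[ \mathit{map}\ f \circ \mathit{map}\ g = \mathit{buildj}\ (\lambda c\ s\ e.\ ([c,\ s \circ f \circ g,\ e])) \]

BuildJ(Acc-BuildJ) rule, where fst returns the first component of a tuple:

\[ \mathit{buildj}\ (\lambda c\ s\ e.\ [[g, (p, \oplus), (q, \otimes)]])\ (\mathit{buildj}\ \mathit{gen}\ x)\ e = \mathit{fst}\ (\mathit{buildj}\ (\lambda c\ s\ e.\ \mathit{gen}\ (\odot)\ f\ d)\ x\ e) \]

where

\[ (u \odot v)\ e = \mathbf{let}\ (r_1, s_1, t_1) = u\ e;\ (r_2, s_2, t_2) = v\ (e \otimes t_1)\ \mathbf{in}\ (s_1 \oplus r_2,\ s_1 \oplus s_2,\ t_1 \otimes t_2) \]
\[ f\ a\ e = (p\ (a, e) \oplus g\ (e \otimes q\ a),\ p\ (a, e),\ q\ a) \]
\[ d\ e = (g\ e,\ ,\ ) \]

4.2

The optimization is implemented with OpenC++ [5]. Skeleton calls are represented in cataj and buildj form on the OpenC++ parse tree, and the fusion rules are applied there, so that two maps map f and map g become map (f ∘ g).

For the program of Section 3.3, the map_ow and reduce calls are represented as follows:

    [[ 1 as -> sum cataj [[add]] nil nil] ;]
    [[ave = [sum / size]] ;]
    [[ 3 as -> as buildj cataj nil [[sub_ave]] nil] ;]
    [[ 3 as -> as buildj cataj nil [[square]] nil] ;]
    [[ 1 as -> sq_sum cataj [[add]] nil nil] ;]
    [[var = [sq_sum / size]] ;]

Applying BuildJ(CataJ-BuildJ) and then CataJ-BuildJ gives:

    [[ 1 as -> sum cataj [[add]] nil nil] ;]
    [[ave = [sum / size]] ;]
    [[ 1 as -> sq_sum cataj [[add]] [[sub_ave] [square]] nil] ; ]
    [[var = [sq_sum / size]] ;]

The following code is generated from it:

    sum = as->reduce(add);
    ave = sum / size;
    sq_sum = as->cataj(_sym11086_2, add);
    var = sq_sum / size;

Here _sym11086_2 is the composition of sub_ave and square. The two map calls and one reduce call have become a single cataj call (2(n/p)).

5

The programs of Section 3.3 were measured as skeleton programs, as optimized skeleton programs, and as C++ + MPI programs, on 1 to 10 PCs in the environment of Table 1.

    Table 1
    CPU       Pentium4 2.4GHz
    Memory    512MB
    Network   1Gbps
    OS        Linux 2.4.20
    Compiler  g++ 2.96
    MPICH     mpich 1.2.6

Parameters: 1000, 10, 100, 100.

Figure 4: 29.7% in total, of which 16.0% is attributed to BuildJ(CataJ-BuildJ) and 13.7% to CataJ-BuildJ; 8.20 with 10 processors.

Figure 5: 7.8%.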
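The fused call in Section 4.2 can be illustrated sequentially. The sketch below implements the cataj equations of Section 4.1 as a right fold and compares the unfused pipeline (two maps, then a reduce) with a single cataj using the composed function. `cataj_seq`, `sq_sum_unfused`, and `sq_sum_fused` are hypothetical names. The argument order (function first, then operator) imitates the generated call `as->cataj(_sym11086_2, add)`, while the explicit `e` argument and the use of lambdas are assumptions of this sketch, not the library's interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// cataj [] = e;  cataj (a : x) = p a (+) cataj x
// Implemented as a right fold so that it matches the equations literally.
template <typename A, typename R, typename P, typename Op>
R cataj_seq(P p, Op op, R e, const std::vector<A>& xs) {
    R acc = e;
    for (auto it = xs.rbegin(); it != xs.rend(); ++it)
        acc = op(p(*it), acc);
    return acc;
}

// Unfused: map sub_ave, map square, reduce add.
// Two intermediate vectors are built and then thrown away.
inline double sq_sum_unfused(const std::vector<double>& as, double ave) {
    std::vector<double> diffs, squares;
    for (double a : as) diffs.push_back(a - ave);
    for (double d : diffs) squares.push_back(d * d);
    double sum = 0.0;
    for (double s : squares) sum += s;
    return sum;
}

// Fused: one cataj with p = square . sub_ave and (+) = add,
// the form obtained from the CataJ-BuildJ rule. No intermediate vector.
inline double sq_sum_fused(const std::vector<double>& as, double ave) {
    auto sub_ave_then_square = [ave](double a) {
        const double d = a - ave;
        return d * d;
    };
    auto add = [](double x, double y) { return x + y; };
    return cataj_seq(sub_ave_then_square, add, 0.0, as);
}
```

Because `cataj_seq` folds from the right, it also preserves element order for operators that are not commutative, such as string concatenation.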

[Figure 4: Execution Time (sec) against Number of Processors (1 to 10); series: skeleton, optimized skeleton, C++ + MPI]

[Figure 5: Execution Time (sec) against Number of Processors (1 to 10); series: skeleton, optimized skeleton, C++ + MPI]

C++ + MPI: 15%.

6

P3L [2] provides skeletons such as map, reduce, scan, and pipe. P3L is based on C.

Skil [4] is also based on C.

HPC++ [10] provides map, reduce, and scan (5% to 7%). Unlike P3L and Skil, which are based on C, HPC++ is based on C++.

7

The library is a C++ skeleton library based on BMF that provides map, zip, reduce, and scan.

Further topics: tree skeletons [12], Tree Contraction [1][11], and zip.

[1] Abrahamson, K., Dadoun, N., Kirkpatrick, D., and Przytycka, T.: A Simple Parallel Tree Contraction Algorithm, Journal of Algorithms, Vol. 10, No. 2 (1989), pp. 287-302.
[2] Bacci, B., Danelutto, M., Orlando, S., Pelagatti, S., and Vanneschi, M.: P3L: A Structured High Level Programming Language and its Structured Support, Concurrency: Practice and Experience, Vol. 7, No. 3 (1995), pp. 225-255.
[3] Bird, R.: An Introduction to the Theory of Lists, Proc. NATO Advanced Study Institute on Logic of Programming and Calculi of Discrete Design, Springer-Verlag, 1987, pp. 5-42.
[4] Botorog, G. and Kuchen, H.: Skil: An Imperative Language with Algorithmic Skeletons for Efficient Distributed Programming, Proc. 5th International Symposium on High Performance Distributed Computing (HPDC-5), IEEE Computer Society Press, 1996, pp. 243-252.
[5] Chiba, S.: OpenC++. http://opencxx.sourceforge.net/.
[6] Cole, M.: Algorithmic Skeletons: A Structured Approach to the Management of Parallel Computation, Research Monographs in Parallel and Distributed Computing, Pitman, 1989.
[7] Hu, Z., Iwasaki, H., and Takeichi, M.: Diffusion: Calculating Efficient Parallel Programs, Proc. 1999 ACM SIGPLAN International Workshop on Partial Evaluation and Semantics-Based Program Manipulation (PEPM 99), 1999, pp. 85-94.
[8] Hu, Z., Iwasaki, H., and Takeichi, M.: An Accumulative Parallel Skeleton for All, Proc. 2002 European Symposium on Programming (ESOP 2002), Lecture Notes in Computer Science 2305, Springer-Verlag, 2002, pp. 83-97.
[9] Iwasaki, H. and Hu, Z.: A New Parallel Skeleton for General Accumulative Computations, International Journal of Parallel Programming, Vol. 32, No. 5 (2004), pp. 389-414.
[10] Johnson, E. and Gannon, D.: HPC++: Experiments with the Parallel Standard Template Library, Proc. 11th International Conference on Supercomputing, ACM Press, 1997, pp. 124-131.
[11] Miller, G. and Reif, J.: Parallel Tree Contraction and its Application, Proc. 26th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press, 1985, pp. 478-489.
[12] Skillicorn, D.: Parallel Implementation of Tree Skeletons, Journal of Parallel and Distributed Computing, Vol. 39, No. 2 (1996), pp. 115-125.