Performance of OSCAR Multigrain Parallelizing Compiler on SMPs

MOTOKI OBATA, KAZUHISA ISHIZAKA, JUN SHIRAKO and HIRONORI KASAHARA
Waseda University / Advanced Parallelizing Compiler Project

This paper describes the OSCAR multigrain parallelizing compiler, developed in the Japanese Millennium Project IT2 "Advanced Parallelizing Compiler", and its performance on SMP machines. The compiler realizes multigrain parallelization for targets ranging from chip multiprocessors to high-end servers: in addition to loop parallelism, it hierarchically exploits coarse grain task parallelism among loops, subroutines and basic blocks, and near fine grain parallelism among statements inside a basic block. It also globally optimizes cache usage over different loops, or coarse grain tasks, based on a data localization technique that reduces memory access overhead. The performance of the OSCAR compiler for SPEC95fp is evaluated on several SMPs. For example, it gives a 10.6 times speedup for MGRID on a 16-processor IBM Regatta-H, an 8.5 times speedup for HYDRO2D on an 8-processor IBM RS6000 SP 604e High Node against sequential processing, and a 6.0 times speedup for SWIM using 4 processors on a Sun Fire V880 server.

1. Introduction

Automatic parallelizing compilers for shared memory multiprocessors have mostly exploited loop parallelism such as Do-all and Do-across, and research compilers such as Polaris 1), SUIF 2), OSCAR 3)-5), NANOS 6) and PROMIS 7) have been exploring parallelism beyond the loop level. The OSCAR compiler has been further developed in the Advanced Parallelizing Compiler (APC) project 8) of the Japanese Millennium Project IT2, which started in 2000 with the support of NEDO. This paper evaluates the multigrain parallelization of the OSCAR compiler on a 16-processor IBM pSeries 690 Regatta-H, an 8-processor IBM RS6000 SP 604e High Node and an 8-processor Sun Fire V880. For these SMP machines, the OSCAR compiler is used as a source-to-source parallelizer that generates OpenMP Fortran, which is then compiled by the native OpenMP compiler of each machine.
The OSCAR compiler takes a sequential Fortran or OpenMP Fortran program as input. The Front End translates it into an intermediate language. The Middle Path performs multigrain parallelization, namely the combination of coarse grain parallelism among loops, subroutines and basic blocks, loop parallelism, and near fine grain parallelism among statements, together with static scheduling, parallel code generation and the data localization scheme. Back Ends generate OSCAR machine code, VPP Fortran, OpenMP Fortran, Fortran + STAMPI (MPI-2) and UltraSPARC machine code from the same intermediate language.

Fig. 1: Organization of the OSCAR compiler (Front End; Middle Path with multigrain parallelization, i.e. coarse grain, loop and near fine grain parallelization, static scheduling and data localization; OSCAR, VPP Fortran, OpenMP, STAMPI (MPI-2) and UltraSPARC Back Ends).

Through these back ends, the same parallelization core serves single chip multiprocessors with static scheduling 9), SMP machines through the OpenMP API 10), and message passing machines through MPI. In this paper the OpenMP Back End is used, so that the generated OpenMP Fortran can be compiled by the native compiler of each SMP machine 10)-12).

2. Coarse Grain Task Parallel Processing

2.1 Generation of Macro Tasks

The OSCAR compiler decomposes a source program into three kinds of coarse grain tasks, or macro tasks (MTs): blocks of pseudo assignment statements (BPAs), i.e. basic blocks, repetition blocks (RBs), i.e. loops, and subroutine blocks (SBs). Sequential RBs and SBs are hierarchically decomposed into macro tasks again, so that coarse grain parallelism inside them can also be exploited.

2.2 Macro Flow Graph and Macro Task Graph

After the macro tasks are generated, the compiler analyzes data dependence and control flow among them and represents the result as a macro flow graph (MFG, Fig. 2(a)), in which nodes are macro tasks and edges are data dependences and control flow from conditional branches. From the MFG, the compiler computes the earliest executable condition of each macro task, namely the condition under which the macro task may start its execution earliest, taking both data dependences and control dependences into account. The conditions are represented as a macro task graph (MTG, Fig. 2(b)), whose edges are data dependences and extended control dependences, combined by AND/OR relationships.

Fig. 2: (a) Macro Flow Graph (MFG): data dependency, control flow and conditional branches. (b) Macro Task Graph (MTG): data dependency, extended control dependency, conditional branches, AND/OR combinations of dependencies, and the original control flow.

2.3 Scheduling of Macro Tasks

Macro tasks on an MTG are assigned to processors or thread groups by static scheduling at compile time when the MTG has only data dependence edges, and by dynamic scheduling at run time when it has control dependences such as conditional branches. For dynamic scheduling, the compiler embeds a scheduling routine into the generated code, in either a centralized or a distributed form 13).
2.4 Data Localization

To pass shared data between coarse grain tasks through cache, the compiler applies a data localization scheme 11),12) consisting of (1) loop aligned decomposition (LAD), which splits data-dependent loops into aligned partial loops, (2) generation of data localization groups (DLGs), each of which collects the partial loops that access the same part of the shared arrays, and (3) scheduling that executes the partial loops inside a DLG consecutively on the same processor, so that the shared data stay cache-resident. Figure 3 shows an example in which data-dependent loops (a) are decomposed into aligned partial loops grouped into four data localization groups, DLG0 to DLG3 (b).

Fig. 3: (a) Before loop decomposition. (b) After loop aligned decomposition into the data localization groups DLG0-DLG3.

2.5 Hierarchical Assignment of Macro Tasks to Thread Groups

Macro tasks are assigned to thread groups hierarchically. In the example of Fig. 4, the first layer MTG consists of MT1, a parallelizable DOALL loop (RB), MT2, a subroutine block (SB), and MT3, a loop (RB). MT2 and MT3 are decomposed into the second layer macro task graphs MTG_2, with MT2_1 to MT2_6, and MTG_3, with MT3_1 to MT3_4, which are executed by second layer thread groups generated inside the first layer thread groups.

Fig. 4: Hierarchical macro task graph: first layer macro tasks MT1 (DOALL), MT2 (SB) and MT3 (RB); MT2 and MT3 contain the second layer MTGs MTG_2 (MT2_1 to MT2_6) and MTG_3 (MT3_1 to MT3_4).

2.6 Generation of OpenMP Code

The compiler expresses the hierarchical multigrain parallel processing as OpenMP code using a one-time fork-join scheme. As shown in Fig. 5 for 8 threads, a single PARALLEL SECTIONS construct forks all threads only once at the beginning of the program; each SECTION contains the whole program to be executed by one thread, so no additional fork-join overhead is paid for the nested levels of the hierarchy. In the schedule of Fig. 5, the first layer macro tasks MT1 to MT3 are assigned statically, and the DOALL loop MT1 is decomposed into partial loops executed in parallel by the threads; the second layer macro tasks MT2_1, MT2_2, ... are assigned by centralized dynamic scheduling, in which one thread acts as the centralized scheduler; the second layer macro tasks MT3_1, MT3_2, ... are assigned by distributed dynamic scheduling among the second layer thread groups group0_0 and group0_1; and the third layer macro tasks such as MT3_1a and MT3_1b are again scheduled statically.

Fig. 5: Image of the OpenMP code generated for 8 threads (one SECTION per thread; first layer MT1, MT2, MT3: static; MT2_1, MT2_2, ...: centralized dynamic; MT3_1, MT3_2, ...: distributed dynamic; third layer MT3_1a, MT3_1b, ...: static).
3. Performance Evaluation

The OSCAR compiler was evaluated with SPEC95fp on three SMP machines: an IBM Regatta-H with 16 processors, an IBM RS6000 SP 604e High Node with 8 processors and a Sun Fire V880. Seven of the SPEC95fp programs, namely TOMCATV, SWIM, SU2COR, HYDRO2D, MGRID, APPLU and TURB3D, were used. The OpenMP Fortran generated by the OSCAR compiler was compiled by the native compiler of each SMP machine, and speedups are measured against sequential execution.

3.1 IBM Regatta-H

The IBM Regatta-H used here is an SMP with 8 Power4 chips, or 16 processors. IBM XL Fortran for AIX Version 7.1 was used. The compile options were "-O5 -qhot -qarch=pwr4" for sequential execution, "-O5 -qsmp=auto -qhot -qarch=pwr4" for automatic parallelization, and "-O5 -qsmp=noauto -qhot -qarch=pwr4" for the OpenMP code generated by OSCAR; for SU2COR and TURB3D, -O4 was used instead of -O5. Figure 6 shows the speedups obtained by OSCAR and by XL Fortran on the Regatta-H.

Fig. 6: Speedups by OSCAR and XL Fortran on Regatta-H (16 processors).

The maximum speedups by OSCAR were 10.6 times for MGRID with 16 processors, 14.0 times for TOMCATV with 15 processors, 10.3 times for SWIM with 16 processors, 3.3 times for SU2COR, 8.6 times for HYDRO2D with 13 processors, 6.9 times for APPLU and 2.2 times for TURB3D with 7 processors. The corresponding maximum speedups by XL Fortran's automatic parallelization were 5.7 times for MGRID, 2.8 times for TOMCATV, 4.1 times for SWIM, 2.9 times for SU2COR, 5.4 times for HYDRO2D, 6.7 times for APPLU and 2.2 times for TURB3D. The OSCAR compiler thus clearly exceeded XL Fortran for TOMCATV, SWIM, HYDRO2D and MGRID, while the two compilers were comparable for SU2COR, APPLU and TURB3D.
Most of the execution time of SU2COR is spent in loop DO400 of subroutine LOOPS, and that of APPLU in subroutines BUTS and BLTS, where only limited parallelism could be exploited by either compiler; for TURB3D, both compilers obtained the same speedup.

3.2 IBM RS6000 SP 604e High Node

The IBM RS6000 SP 604e High Node is an SMP with 8 PowerPC 604e processors. IBM XL Fortran for AIX Version 7.1 was used, with "-O5 -qhot -qarch=ppc -qtune=auto -qcache=auto" for sequential execution, "-O5 -qsmp=auto -qhot -qarch=ppc -qtune=auto -qcache=auto" for automatic parallelization, and "-O3 -qsmp=noauto -qhot -qarch=ppc -qtune=auto -qcache=auto" for the OpenMP code generated by OSCAR. Figure 7 shows the speedups with 8 processors.

Fig. 7: Speedups by OSCAR and XL Fortran on RS6000 SP 604e High Node (8 processors).

With 8 processors, OSCAR gave a speedup of 8.5 times for HYDRO2D against 3.7 times by XL Fortran, 5.9 times against 3.5 times for TOMCATV, 8.6 times against 1.8 times for SWIM, 4.3 times against 1.7 times for SU2COR, 6.3 times against 2.1 times for MGRID, 1.7 times against 1.2 times for APPLU, and the same 3.3 times for TURB3D. As on the Regatta-H, the OSCAR compiler exceeded XL Fortran's automatic parallelization for most of the programs.

3.3 Sun Fire V880

The Sun Fire V880 is an SMP with 8 UltraSPARC III processors. Forte 6 Update 2 was used, with "-fast" for sequential execution, "-fast -parallel -reduction -stackvar" for automatic parallelization, and "-fast -mp=openmp -explicitpar -stackvar" for the OpenMP code generated by OSCAR. Figure 8 shows the speedups with 4 processors, and Table 1 shows the L2 cache miss rates measured for Forte and OSCAR.

Fig. 8: Speedups by OSCAR and Forte on Sun Fire V880 (4 processors).

Table 1: L2 cache miss rates on the V880 (%)

  program   Forte   OSCAR
  tomcatv    20.9     3.8
  swim       11.3     2.2
  hydro2d     4.5     1.3
  mgrid       3.9     3.0
  applu       7.9     7.2
  turb3d      0.6     5.9

With 4 processors, OSCAR gave a speedup of 4.0 times for TOMCATV against 1.9 times by Forte, 6.0 times against 1.5 times for SWIM, 3.1 times against 1.4 times for HYDRO2D, 3.2 times against 1.1 times for MGRID, 1.5 times against 1.0 times for APPLU, and 2.3 times against 2.1 times for TURB3D.
The 6.0 times speedup of SWIM with only 4 processors on the V880 is superlinear, and Table 1 explains it: data localization reduced the L2 cache miss rate of SWIM from 11.3% with Forte to 2.2% with OSCAR, and that of TOMCATV from 20.9% to 3.8%. OSCAR also reduced the miss rates for HYDRO2D, MGRID and APPLU, while for TURB3D Forte's miss rate was lower.

4. Conclusions

This paper has described the performance of the OSCAR multigrain parallelizing compiler, developed in the Advanced Parallelizing Compiler project of the Japanese Millennium Project IT2, on three SMP machines. For seven SPEC95fp programs on the 16-processor IBM pSeries 690 Regatta-H, the 8-processor IBM RS6000 SP 604e High Node and the 8-processor Sun Fire V880, the OSCAR compiler exceeded the automatic parallelization of the native compilers, for example 10.6 times speedup for MGRID on the Regatta-H against 5.7 times by XL Fortran, 8.5 times for HYDRO2D on the RS6000 against 3.7 times by XL Fortran, and 4.0 times for TOMCATV on the V880 against 1.9 times by Forte 6 Update 2. These results show that the multigrain parallelization and the one-time fork-join OpenMP code generation of the OSCAR compiler work effectively on both IBM and Sun SMP machines.

Acknowledgments: A part of this research has been supported by the NEDO Millennium Project IT2 "Advanced Parallelizing Compiler" (http://www.apc.waseda.ac.jp) and by Waseda University Grant No. 2000A-54.

References
[1] Eigenmann, R., Hoeflinger, J. and Padua, D.: On the Automatic Parallelization of the Perfect Benchmarks, IEEE Trans. on Parallel and Distributed Systems, Vol. 9, No. 1 (1998).
[2] Hall, M. W., Anderson, J. M., Amarasinghe, S. P., Murphy, B. R., Liao, S.-W., Bugnion, E. and Lam, M. S.: Maximizing Multiprocessor Performance with the SUIF Compiler, IEEE Computer (1996).
[3] Kasahara, H. et al.: The OSCAR multigrain compiler, Trans. of IPSJ, Vol. 35, No. 4, pp. 513-521 (1994) (in Japanese).
[4] Honda, H. et al.: A parallelism detection scheme among coarse grain tasks of Fortran programs, Trans. of IEICE, Vol. J73-D-I, No. 2, pp. 951-960 (1990) (in Japanese).
[5] Kasahara, H., Honda, H. and Narita, S.: Parallel Processing of Near Fine Grain Tasks Using Static Scheduling on OSCAR, Proc. IEEE ACM Supercomputing '90 (1990).
[6] Ayguade, E., Martorell, X., Labarta, J., Gonzalez, M. and Navarro, N.: Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study, Proc. ICPP '99 (1999).
[7] Brownhill, C. J., Nicolau, A., Novack, S. and Polychronopoulos, C. D.: Achieving Multi-level Parallelization, Proc. of ISHPC '97 (1997).
[8] Advanced Parallelizing Compiler Project: http://www.apc.waseda.ac.jp/.
[9] Kimura, K. and Kasahara, H.: Near Fine Grain Parallel Processing Using Static Scheduling on Single Chip Multiprocessors, Proc. IWIA '99, IEEE Press (1999).
[10] Kasahara, H. et al.: Coarse grain task parallel processing on SMP machines, Trans. of IPSJ, Vol. 42, No. 4 (2001) (in Japanese).
[11] Yoshida, A., Yagi, S. and Kasahara, H.: A Data Localization Scheme for Coarse Grain Task Parallel Processing on Shared Memory Multiprocessors, Proc. of IEEE International Workshop on Advanced Compiler Technology for High Performance and Embedded Systems, pp. 1-8 (2001).
[12] Ishizaka, K., Obata, M. and Kasahara, H.: Coarse Grain Task Parallel Processing with Cache Optimization on Shared Memory Multiprocessor, Proc. 14th Workshop on Languages and Compilers for Parallel Computing (LCPC 2001) (2001).
[13] Obata, M. et al.: Hierarchical scheduling schemes for coarse grain task parallel processing, IPSJ SIG Notes, ARC2002-148-4 (2002) (in Japanese).