Vol. 46 No. SIG 7(ACS 10) May 2005

Orthogonalization Library with a Numerical Computation Policy Interface

Ken Naono, Mitsuyoshi Igai and Hiroyuki Kidachi

Central Research Laboratory, Hitachi, Ltd.
Hitachi ULSI Systems Co., Ltd.

We propose an orthogonalization library that controls the balance between the orthogonality of the computed vectors and the computation time, so that the computation runs as fast as possible while meeting a required accuracy. The library uses a race-based method: it executes classical Gram-Schmidt, modified Gram-Schmidt, and DGKS-type Gram-Schmidt at the same time and takes the fastest result that satisfies the required accuracy. Experiments with three different types of vectors on a PC (Pentium 4, 3.2 GHz) show that the method meets the required accuracy even in cases where a single orthogonalization algorithm fails to reach an orthogonality of $10^{-8}$, and that it is about 4.8 times faster in the best case.

1.

(Keywords and citations surviving from this section: PC, GRID; 1)-6); N = 1000, 2000, 3000; 7), 8), 9), 10).)
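2.

2.1

Let $v_j$ ($j = 1, \ldots, JMAX$) be the JMAX vectors to be orthogonalized, and let the $i$-th element of $v_j$ be stored in V(i, j), $i = 1, \ldots, N$, where N is the vector length. Listing 1 shows the orthogonalization loop.

Listing 1
      do M = 1, JMAX
         call single_orthogonal(V(1, M), V, M)
      enddo

The routine single_orthogonal in Listing 1 is implemented with one of the following three algorithms:
(1) Classical Gram-Schmidt (CGS)
(2) Modified Gram-Schmidt (MGS)
(3) Daniel-Gragg-Kaufman-Stewart (DGKS) 11)

2.1.1 CGS

Listing 2 shows CGS. Both loop nests in Listing 2 are matrix-vector products, so CGS can be computed with BLAS2 operations.

Listing 2
      subroutine single_orthogonal(w, V, M)
!     CGS: form all inner products hr(j) first, then subtract them
      do j = 1, M-1
         s = 0.0d0
         do i = 1, N
            s = s + V(i, j)*w(i)
         enddo
         hr(j) = s
      enddo
      do i = 1, N
         s = 0.0d0
         do j = 1, M-1
            s = s + V(i, j)*hr(j)
         enddo
         w(i) = w(i) - s
      enddo

2.1.2 MGS

Listing 3 shows MGS. In MGS each inner product s is applied immediately by w(i) = w(i) - s*V(i, j), so the computation is limited to BLAS1 operations.

Listing 3
      subroutine single_orthogonal(w, V, M)
!     MGS: update w after every single projection
      do j = 1, M-1
         s = 0.0d0
         do i = 1, N
            s = s + V(i, j)*w(i)
         enddo
         if (s .ne. 0.0d0) then
            do i = 1, N
               w(i) = w(i) - s*V(i, j)
            enddo
         endif
      enddo

2.1.3 Daniel-Gragg-Kaufman-Stewart (DGKS)

DGKS first applies CGS. If condition (1) holds after that pass, CGS is applied once more:

$$\|w\|_2 < \eta \, \|V^{T} w\|_2 \qquad (1)$$

Here $\|V^{T} w\|_2$ is the 2-norm of the coefficients hr(i). Following 12), $\eta = 1/\sqrt{2}$ is used.

2.2

The relation between MGS and CGS is discussed in 12). ARPACK 13) uses DGKS.

3.

This section evaluates the 3 algorithms on a PC.

3.1

Three types of test vectors $v_j$ ($j = 1, \ldots, JMAX$) are used. Each element $v_j(i)$ is built from a sequence x(k) and a cosine term. Operators between adjacent symbols did not survive in this copy and are left as juxtaposition:

Sample 1: $v_j(i) = x(k)\; j + \cos(i\, j) + i\; 0.01$, with a denominator $N + 1$ whose position is not recoverable (compare Sample 3)
Sample 2: $v_j(i) = x(k) + i\; j\; 0.01$
Sample 3: $v_j(i) = x(k) + \cos\!\left(\frac{i\, j}{N + 1}\right)$

where $i = 1, \ldots, N$ and $k = i + (j - 1) N$. JMAX is set to 20 and 100.

The orthogonality error Ortho is defined by

$$\mathrm{Ortho} = \| V^{T} V - I \|_F \qquad (2)$$

where $I$ is the identity matrix, $\|\cdot\|_F$ is the Frobenius norm, and $V$ is the matrix whose $j$-th column is $v_j$.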
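The three orthogonalization steps and the Ortho measure of Eq. (2) can be sketched in a few lines. The following is a minimal pure-Python sketch, not the authors' Fortran library. The function names, the explicit normalization of each vector, the default value of eta, and the reading of test (1) as comparing the norm of w after the first CGS pass with eta times the norm of the coefficients hr are all assumptions made here for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(w):
    norm = math.sqrt(dot(w, w))
    return [x / norm for x in w]


def cgs_step(w, basis):
    """Classical Gram-Schmidt: all coefficients are taken from the original w."""
    hr = [dot(v, w) for v in basis]
    return [wi - sum(h * v[i] for h, v in zip(hr, basis))
            for i, wi in enumerate(w)]


def mgs_step(w, basis):
    """Modified Gram-Schmidt: w is updated after each projection."""
    w = list(w)
    for v in basis:
        s = dot(v, w)
        w = [wi - s * vi for wi, vi in zip(w, v)]
    return w


def dgks_step(w, basis, eta=1.0 / math.sqrt(2.0)):
    """CGS, repeated once when ||w_new|| < eta * ||hr|| (assumed reading of (1))."""
    hr = [dot(v, w) for v in basis]
    w_new = cgs_step(w, basis)
    if math.sqrt(dot(w_new, w_new)) < eta * math.sqrt(dot(hr, hr)):
        w_new = cgs_step(w_new, basis)
    return w_new


def orthogonalize(vectors, step):
    """Orthogonalize the vectors one by one; normalization is an assumption."""
    basis = []
    for w in vectors:
        basis.append(normalize(step(w, basis)))
    return basis


def ortho_error(basis):
    """Frobenius norm of V^T V - I for the list of basis vectors."""
    total = 0.0
    for a, va in enumerate(basis):
        for b, vb in enumerate(basis):
            g = dot(va, vb) - (1.0 if a == b else 0.0)
            total += g * g
    return math.sqrt(total)


# Small illustrative input (not one of the paper's samples).
vectors = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
for step in (cgs_step, mgs_step, dgks_step):
    print(step.__name__, ortho_error(orthogonalize(vectors, step)))
```

On such a small, well-conditioned input all three variants give an error near machine precision; the differences reported in the paper appear only for long, nearly dependent vectors.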
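Fig. 1 Execution time and orthogonality error of CGS, MGS and DGKS for Sample 1.

3.2

The vector length N ranges from 10000 to 100000. The measurement environment is as follows.

Machine: Hitachi FLORA370DG
CPU: Pentium 4 HT 3.2 GHz (L1 cache 8 KB, L2 cache 512 KB)
Memory: 512 MB DDR-SDRAM
Compiler: Intel FORTRAN Compiler Version 7.1
Compiler options: -O3 -prefetch -unroll -align -lowercase -nodps -fpp2 -tpp7 -xw -vec_report3 -opt_report

3.2.1 Sample 1

Fig. 1 shows the results of CGS, MGS and DGKS for Sample 1. In Fig. 1 (1) and (3), MGS is faster than CGS and DGKS for N = 10000 to 30000. Above N = 30000 the order is CGS, MGS, DGKS, and DGKS takes about 2 times as long as CGS. Fig. 1 (2) and (4) concern the orthogonality error: the levels $10^{-8}$ (for CGS) and $10^{-13}$ (for MGS and DGKS) are discussed, and for JMAX = 20 the behavior of CGS with respect to N is noted.

3.2.2 Sample 2

Fig. 2 shows the results of CGS, MGS and DGKS for Sample 2. In Fig. 2 (1) and (3), as for Sample 1, MGS is the fastest for N = 10000 to 30000 and CGS is the fastest above N = 30000. Fig. 2 (2) and (4) compare the orthogonality error of CGS, which varies with N, with that of MGS and DGKS.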
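Fig. 2 Execution time and orthogonality error of CGS, MGS and DGKS for Sample 2.

Fig. 3 Execution time and orthogonality error of CGS, MGS and DGKS for Sample 3.

3.2.3 Sample 3

Fig. 3 shows the results of CGS, MGS and DGKS for Sample 3. In Fig. 3 (1) and (3), as for Samples 1 and 2, MGS is the fastest for N = 10000 to 30000 and CGS is the fastest above N = 30000. Fig. 3 (2) and (4) show the orthogonality errors of CGS, MGS and DGKS; the CGS error depends on N.

Across the 3 samples, MGS is the fastest at N = 10000 and 20000, while at N = 40000 and 80000 the execution time is ordered CGS < MGS < DGKS, so CGS is the fastest. The orthogonality error is ordered CGS > MGS > DGKS, so DGKS is the most accurate.

4.

4.1

The numerical computation policy Policy(ε) requires the orthogonality error Ortho of the result to satisfy Ortho ≤ ε, where ε is the accuracy required by the user.

4.2

Policy(ε) is implemented with the race-based method 10). Fig. 4 shows the processing flow with the 3 algorithms CGS, MGS and DGKS:

(1) The required accuracy ε is given.
(2) The 3 algorithms are processed:
    (a) CGS, MGS and DGKS are executed;
    (b) the 3 results are collected;
    (c) Ortho ≤ ε is checked for each result.
(3) The fastest result that satisfies Ortho ≤ ε is returned.

Fig. 4 Processing flow of orthogonalization library with numerical policy Policy(ε) using race-based method.

4.3

The library is evaluated with JMAX = 100 for the 4 vector lengths N = 10000, 20000, 40000 and 80000, the 3 policies ε = $10^{-8}$, $10^{-10}$ and $10^{-12}$, and the 3 samples, in the environment of Section 3.2. Tables 1 to 3 show the results. Column 5 is the execution time of the selected algorithm. Columns 6 to 8 give, for CGS, MGS and DGKS, the execution time relative to the selected algorithm. NG marks an algorithm that does not satisfy ε, with its Ortho in parentheses.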
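The selection rule of the race-based method can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the library's actual interface: the names select_by_policy and race are hypothetical, the candidates are run one after another instead of at the same time, and the example data are the Sample 2, N = 80000 measurements reported in Table 2.

```python
import time


def select_by_policy(results, eps):
    """Return the fastest (name, seconds, ortho) entry with ortho <= eps, or None."""
    passing = [r for r in results if r[2] <= eps]
    return min(passing, key=lambda r: r[1]) if passing else None


def race(algorithms, vectors, eps, ortho_error):
    """Run every candidate, time it, measure Ortho, then apply the policy.

    algorithms: dict mapping a name to a callable(vectors) -> basis (hypothetical).
    ortho_error: callable(basis) -> float, e.g. the Frobenius norm of V^T V - I.
    """
    results = []
    for name, run in algorithms.items():
        start = time.perf_counter()
        basis = run(vectors)
        elapsed = time.perf_counter() - start
        results.append((name, elapsed, ortho_error(basis)))
    return select_by_policy(results, eps), results


# (name, seconds, Ortho) for Sample 2, N = 80000, JMAX = 100, from Table 2.
table2_n80000 = [
    ("CGS", 4.0741, 3.22e-09),
    ("MGS", 5.3763, 3.22e-12),
    ("DGKS", 8.0574, 8.00e-14),
]
for eps in (1e-8, 1e-10, 1e-12):
    print(eps, select_by_policy(table2_n80000, eps)[0])
```

With these measurements each of the three policies selects a different algorithm, which is the case the policy interface is designed for: a looser ε lets a faster but less accurate algorithm win.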
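Table 1 Accuracy and execution time of selected algorithm and of other algorithms for Sample 1 data (JMAX = 100).

                                                     Time relative to selected, or NG(Ortho)
N      ε         Selected  Ortho     Time (sec)  CGS           MGS           DGKS
10000  1.00E-08  MGS       7.70E-11  0.2247      NG(7.43E-06)  -             4.331
10000  1.00E-10  MGS       7.70E-11  0.2247      NG(7.43E-06)  -             4.331
10000  1.00E-12  DGKS      2.63E-14  0.9731      NG(7.43E-06)  NG(7.70E-11)  -
20000  1.00E-08  MGS       2.17E-10  0.5695      NG(4.49E-05)  -             3.644
20000  1.00E-10  DGKS      3.53E-14  2.0752      NG(4.49E-05)  NG(2.17E-10)  -
20000  1.00E-12  DGKS      3.53E-14  2.0752      NG(4.49E-05)  NG(2.17E-10)  -
40000  1.00E-08  MGS       6.34E-10  2.3134      NG(1.86E-04)  -             1.809
40000  1.00E-10  DGKS      5.94E-14  4.1852      NG(1.86E-04)  NG(6.34E-10)  -
40000  1.00E-12  DGKS      5.94E-14  4.1852      NG(1.86E-04)  NG(6.34E-10)  -
80000  1.00E-08  MGS       1.67E-09  5.3774      NG(1.08E-03)  -             1.552
80000  1.00E-10  DGKS      7.45E-14  8.3461      NG(1.08E-03)  NG(1.67E-09)  -
80000  1.00E-12  DGKS      7.45E-14  8.3461      NG(1.08E-03)  NG(1.67E-09)  -

Table 2 Accuracy and execution time of selected algorithm and of other algorithms for Sample 2 data (JMAX = 100).

                                                     Time relative to selected, or NG(Ortho)
N      ε         Selected  Ortho     Time (sec)  CGS           MGS           DGKS
10000  1.00E-08  MGS       8.08E-14  0.2344      2.130         -             4.188
10000  1.00E-10  MGS       8.08E-14  0.2344      2.130         -             4.188
10000  1.00E-12  MGS       8.08E-14  0.2344      2.130         -             4.188
20000  1.00E-08  MGS       2.29E-13  0.5801      1.782         -             3.533
20000  1.00E-10  MGS       2.29E-13  0.5801      1.782         -             3.533
20000  1.00E-12  MGS       2.29E-13  0.5801      1.782         -             3.533
40000  1.00E-08  CGS       8.05E-10  2.0653      -             1.117         1.985
40000  1.00E-10  MGS       6.67E-13  2.3077      NG(8.05E-10)  -             1.776
40000  1.00E-12  MGS       6.67E-13  2.3077      NG(8.05E-10)  -             1.776
80000  1.00E-08  CGS       3.22E-09  4.0741      -             1.320         1.978
80000  1.00E-10  MGS       3.22E-12  5.3763      NG(3.22E-09)  -             1.499
80000  1.00E-12  DGKS      8.00E-14  8.0574      NG(3.22E-09)  NG(3.22E-12)  -

Table 3 Accuracy and execution time of selected algorithm and of other algorithms for Sample 3 data (JMAX = 100).

                                                     Time relative to selected, or NG(Ortho)
N      ε         Selected  Ortho     Time (sec)  CGS           MGS           DGKS
10000  1.00E-08  MGS       4.61E-14  0.2044      2.446         -             4.817
10000  1.00E-10  MGS       4.61E-14  0.2044      2.446         -             4.817
10000  1.00E-12  MGS       4.61E-14  0.2044      2.446         -             4.817
20000  1.00E-08  MGS       7.34E-14  0.5163      2.051         -             4.023
20000  1.00E-10  MGS       7.34E-14  0.5163      2.051         -             4.023
20000  1.00E-12  MGS       7.34E-14  0.5163      2.051         -             4.023
40000  1.00E-08  CGS       2.61E-12  2.0995      -             1.064         1.973
40000  1.00E-10  CGS       2.61E-12  2.0995      -             1.064         1.973
40000  1.00E-12  MGS       9.72E-14  2.2348      NG(2.61E-12)  -             1.853
80000  1.00E-08  CGS       5.99E-12  4.1279      -             1.271         1.979
80000  1.00E-10  CGS       5.99E-12  4.1279      -             1.271         1.979
80000  1.00E-12  MGS       1.30E-13  5.2472      NG(5.99E-12)  -             1.557

Tables 1 to 3 show that the selected algorithm depends on the sample, the vector length and the policy. For Sample 1, CGS never satisfies any of the 3 policies, and MGS or DGKS is selected; at N = 10000 the selected MGS is about 4.3 times faster than DGKS. For Samples 2 and 3 at N = 10000 and 20000, MGS is selected for every policy and is up to 4.8 times faster than DGKS while keeping an orthogonality error of the order of $10^{-14}$. For Sample 2 at N = 80000, each of the 3 policies selects a different algorithm: CGS, MGS and DGKS. For Sample 3, DGKS is never selected, and at N = 40000 and 80000 CGS satisfies $10^{-10}$ and is about 2.0 times faster than DGKS.

5.

Related work includes research on automatic tuning of numerical libraries 2)-4), 8), the SANS (Self-Adapting Numerical Software) effort of Dongarra et al. 5), which targets GRID environments, and ATLAS 6).

6.

6.1

We evaluated the 3 algorithms CGS, MGS and DGKS on a PC (Pentium 4 3.2 GHz HT) with 3 samples. At N = 10000 and 20000, MGS is the fastest. At N = 40000 and 80000, the execution time is ordered CGS < MGS < DGKS, so CGS is the fastest, while the orthogonality error is ordered CGS > MGS > DGKS, so DGKS is the most accurate. Given a required accuracy ε, the proposed library selects, among CGS, MGS and DGKS, the fastest algorithm that satisfies Ortho ≤ ε. In the experiments, MGS or CGS was selected instead of DGKS in many cases for ε = $10^{-10}$ and $10^{-12}$, and the selected algorithm was up to 4.8 times faster than DGKS.

6.2

The evaluation in this paper uses a PC with 1 CPU.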
References

1) 2001-HPC-87 (SWoPP2001), pp.25-30 (2001).
2) Kuroda, H., Katagiri, T. and Kanada, Y.: Knowledge Discovery in Auto-tuning Parallel Numerical Library, Progress in Discovery Science, Final Report of the Japanese Discovery Science Project, LNCS 2281, pp.628-639, Springer (2002).
3) ABC-LIB. http://www.abc-lib.org/
4) Katagiri, T., Kise, K., Honda, H. and Yuba, T.: FIBER: A General Framework for Auto-Tuning Software, 5th International Symposium on High Performance Computing (ISHPC-V), LNCS 2858, pp.146-159, Springer (2003).
5) Dongarra, J.J. and Eijkhout, V.: Self-adapting numerical software for next generation applications, International Journal of High Performance Computing Applications, Vol.17, No.2, pp.125-131 (Summer 2003).
6) ATLAS project. http://www.netlib.org/atlas/index.html
7) 2002-HPC-91 (SWoPP2002), pp.49-54 (2002).
8) SACSIS (Symposium on Advanced Computing Systems and Infrastructures) 2003, pp.145-152 (2003).
9) GRID, Vol.45, No.SIG6 (ACS6), pp.105-112 (2004).
10) 2004-HPC-99 (SWoPP2004), pp.97-102 (2004).
11) Daniel, J., Gragg, W.B., Kaufman, L. and Stewart, G.W.: Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization, Mathematics of Computation, Vol.30, pp.772-795 (1976).
12) Lehoucq, R.: Analysis and Implementation of an Implicitly Restarted Arnoldi Iteration, Ph.D. thesis, Rice University, Houston, TX (1995).
13) ARPACK Homepage. http://www.caam.rice.edu/software/arpack/

(Received September 28, 2004)
(Accepted January 17, 2005)