Vol. 3 No. 3 189–198 (Sep. 2010)

A Fast Finite Element Electromagnetic Analysis on Multi-core Processor System

Takeshi Mifune,†1 Yu Hirotani,†1 Takeshi Iwashita,†1 Toshio Murayama†2 and Hideki Ohtani†2

This paper presents a fast finite element (FE) electromagnetic analysis on multi-core processor systems. A geometric multigrid method is used to solve the linear equations arising from the FE formulation. A special ordering technique, based on the concept of block multicolor ordering, is proposed for parallelizing the Arnold, Falk and Winther (AFW) smoother. Numerical tests show good parallel performance of the presented multigrid solver when the number of colors and the block size are adjusted appropriately.

†1 Kyoto University
†2 Sony Corporation

1.
(Body text not recovered. Surviving terms and citations: PC; electric field E 1); 2),3); 4)–6); 3); 5); curl; 7)–10); the Arnold, Falk and Winther (AFW) smoother 7).)

© 2010 Information Processing Society of Japan
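(Section 1, continued. Body text not recovered. Surviving terms and citations: AFW smoother; 11)–14); OpenMP.)

2.
2.1
The electric field $E$ satisfies

\[ \nabla \times [\nu (\nabla \times E)] - \omega^2 \left( \epsilon + \frac{\sigma}{i\omega} \right) E = -i\omega J_0 \tag{1} \]

where $i$ is the imaginary unit, $\omega$ the angular frequency, $\sigma$ the conductivity, $\epsilon$ the permittivity, $\nu$ the reluctivity and $J_0$ the source current density 1).

2.2
(IC: Incomplete Cholesky.) Let $G_m$ ($1 \le m \le M$) denote the grids. The linear system on $G_m$ is

\[ A_m x_m = b_m \tag{2} \]
\[ A_m \in \mathbb{C}^{N_m \times N_m} \tag{3} \]
\[ x_m \in \mathbb{C}^{N_m} \tag{4} \]
\[ b_m \in \mathbb{C}^{N_m} \tag{5} \]

where $N_m$ is the number of unknowns on $G_m$. The finest grid is $m = 1$, and $N_1 > N_2 > \cdots > N_M$.

The V-cycle 5) on $G_m$ with approximate solution $v_m$ is written MGV($A_m$, $b_m$, $v_m$):

(1) If $m = M$, solve $A_M x_M = b_M$ and set $v_M$ to the solution. If $m \neq M$, perform (2)–(6).
(2) $S_{pre}(A_m, b_m, v_m)$
(3) $b_{m+1} \leftarrow R_m (b_m - A_m v_m)$, $v_{m+1} \leftarrow 0$
(4) MGV($A_{m+1}$, $b_{m+1}$, $v_{m+1}$)
(5) $v_m \leftarrow v_m + P_m v_{m+1}$
(6) $S_{post}(A_m, b_m, v_m)$

Step (1) is carried out on the coarsest grid $G_M$, where $N_M$ is small. In steps (2) and (6), the smoothers $S_{pre}$ and $S_{post}$ update $v_m$. In steps (3) and (5), $R_m \in \mathbb{C}^{N_{m+1} \times N_m}$ is the restriction from $G_m$ to $G_{m+1}$ and $P_m \in \mathbb{C}^{N_m \times N_{m+1}}$ is the prolongation from $G_{m+1}$ to $G_m$ 6),15). Starting from the matrix on $G_1$, the coarse-grid matrices $A_m$ ($m > 1$) are generated by 5)

\[ A_{m+1} = R_m A_m P_m \tag{6} \]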
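The V-cycle steps (1)–(6) and the coarse-grid construction of Eq. (6) can be sketched in Python as follows. This is only an illustration under stated assumptions: a 1D Poisson matrix stands in for the FE matrix, linear interpolation is used for $P_m$ with $R_m = P_m^T$, and a Gauss-Seidel sweep stands in for the AFW smoother. The function names (`gauss_seidel`, `prolongation`, `build_hierarchy`, `mgv`) are hypothetical and not taken from the paper.

```python
import numpy as np

def gauss_seidel(A, b, v, sweeps=1):
    # Stand-in smoother (assumption): plain Gauss-Seidel, NOT the AFW smoother.
    for _ in range(sweeps):
        for i in range(len(b)):
            v[i] += (b[i] - A[i] @ v) / A[i, i]
    return v

def prolongation(nc):
    # Linear interpolation from nc coarse points to 2*nc + 1 fine points (assumption).
    P = np.zeros((2 * nc + 1, nc))
    for j in range(nc):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    return P

def build_hierarchy(n, levels):
    # Finest matrix: 1D Poisson stencil (stand-in for the FE matrix A_1).
    A = [2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)]
    R, P = [], []
    for _ in range(levels - 1):
        nc = (A[-1].shape[0] - 1) // 2
        P.append(prolongation(nc))
        R.append(P[-1].T)                      # R_m = P_m^T (assumption)
        A.append(R[-1] @ A[-1] @ P[-1])        # Eq. (6): A_{m+1} = R_m A_m P_m
    return A, R, P

def mgv(m, A, R, P, b, v):
    """One V-cycle MGV(A_m, b_m, v_m); m is 0-based and A[-1] is the coarsest level."""
    if m == len(A) - 1:
        return np.linalg.solve(A[m], b)                      # step (1)
    v = gauss_seidel(A[m], b, v)                             # step (2): S_pre
    b_c = R[m] @ (b - A[m] @ v)                              # step (3)
    v_c = mgv(m + 1, A, R, P, b_c, np.zeros_like(b_c))       # step (4)
    v = v + P[m] @ v_c                                       # step (5)
    return gauss_seidel(A[m], b, v)                          # step (6): S_post
```

Calling `mgv(0, A, R, P, b, v)` repeatedly acts as a stationary iteration; in the paper the V-cycle is used inside a Krylov solver instead.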
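One V-cycle MGV($A_1$, $b_1$, $v_1$) is used as the preconditioner of the COCR (Conjugate Orthogonal Conjugate Residual) method 16). The AFW smoother is used for $S_{pre}$ and $S_{post}$.

2.3
The AFW smoother 7),10) is used for $S_{pre}$ and $S_{post}$ (cf. the smoother of Hiptmair 8)). On grid $G_m$, let $\Psi_m$ be the set of nodes and $\Omega_m$ the set of edges carrying the unknowns. For each node $i \in \Psi_m$, the patch $\Omega_{m,i} \subset \Omega_m$ is the set of edges connected to node $i$ (Fig. 1). The AFW smoother visits the nodes $i \in \Psi_m$ one by one and updates the unknowns on $\Omega_{m,i}$ together.

Fig. 1 Example of node-based patch $\Omega_{m,i}$.

3.
(Body text not recovered. Surviving terms: OpenMP; COCR; generation of $A_m$ ($m > 1$) by Eq. (6); application of $R_m$ and $P_m$ in MGV; $S_{pre}$ and $S_{post}$ with the AFW smoother; the solve on $G_M$.)

3.1
(Body text not recovered. Surviving terms and citations: parallelization of the AFW smoother; ordering 3),11).)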
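The patch-wise update of the AFW smoother in Section 2.3 can be illustrated with a short Python sketch. It assumes that the patches are given as plain index lists and that each local system is solved exactly; the function name `patch_smoother_sweep`, the dense matrix storage and the small test matrix in the usage note are hypothetical choices for illustration, not the paper's Fortran implementation.

```python
import numpy as np

def patch_smoother_sweep(A, b, v, patches):
    """One multiplicative sweep over all patches.

    For each patch (a list of unknown indices, standing in for Omega_{m,i}),
    the residual restricted to the patch is computed from the current v and
    the local system A[patch, patch] * delta = r[patch] is solved exactly.
    """
    for idx in patches:
        idx = np.asarray(idx)
        r_loc = b[idx] - A[idx, :] @ v                       # local residual
        v[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r_loc)  # local correction
    return v
```

For example, with a 6 x 6 tridiagonal matrix and overlapping patches `[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]`, repeated sweeps drive `v` toward the solution of `A v = b`. The patches are processed one after another, which is why an ordering is needed before the loop can be parallelized.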
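Fig. 2 $\bar{\Omega}_{m,i}$ (= $\bar{\Omega}^T_{m,i}$) when using the brick elements.

For the AFW smoother, let $[\,\cdot\,]_i$ denote the $i$-th entry of a vector and $[\,\cdot\,]_{ij}$ the $(i, j)$ entry of a matrix, and define

\[ \bar{\Omega}_{m,i} = \{ j \in \Omega_m : \exists k \in \Omega_{m,i},\ [A_m]_{kj} \neq 0 \} \tag{7} \]
\[ \bar{\Omega}^T_{m,i} = \{ j \in \Omega_m : \exists k \in \Omega_{m,i},\ [A_m]_{jk} \neq 0 \} \tag{8} \]

When the nonzero pattern of $A_m$ is symmetric, $\bar{\Omega}_{m,i} = \bar{\Omega}^T_{m,i}$ (Fig. 2).

The AFW smoother can be implemented in two ways (the grid index $m$ is omitted):

(a) For each patch $\Omega_i$, update $[v]_j$ ($j \in \Omega_i$) using $[b]_j$ ($j \in \Omega_i$) and $[v]_j$ ($j \in \bar{\Omega}_i$).
(b) Keep the residual $r = b - Av$ and, for each $i$:
    (1) update $[v]_j$ ($j \in \Omega_i$) using $[r]_j$ ($j \in \Omega_i$);
    (2) update $[r]_j$ ($j \in \bar{\Omega}^T_i$) using the change in $[v]_j$ ($j \in \Omega_i$).

Conditions under which two patches $i_1$ and $i_2$ can be processed in parallel are given for (a) in terms of $\Omega_{i_1}$, $\Omega_{i_2}$, $\bar{\Omega}_{i_1}$ and $\bar{\Omega}_{i_2}$, and for (b) additionally in terms of $\bar{\Omega}^T_{i_1}$ and $\bar{\Omega}^T_{i_2}$ (formulas not recovered).

Let the nodes be indexed by $(i, j, k)$, $i, j, k = 0, 1, \ldots$, in the $x$, $y$ and $z$ directions. The color of node $(i, j, k)$ is

\[ \mathrm{color}(i, j, k) = \mathrm{mod}(i, l_x) + l_x\, \mathrm{mod}(j, l_y) + l_x l_y\, \mathrm{mod}(k, l_z) \tag{9} \]

where $l_x$, $l_y$ and $l_z$ are the numbers of colors in the $x$, $y$ and $z$ directions, so that $l_x l_y l_z$ colors are used in total.

Fig. 3 Example of the multicolor ordering ($l_x = l_y = l_z = 2$).

Patches of the same color can be processed in parallel in the AFW smoother if $l_x, l_y, l_z \ge 2$ for (a) and $l_x, l_y, l_z \ge 3$ for (b).

Storing $A_m$ in the CRS (Compressed Row Storage) format 2) suits (a), whereas the CCS (Compressed Column Storage) format 2) suits (b).

In MGV($A_1$, $b_1$, $v_1$), $v_1$ is initially zero, and step (3) sets $v_{m+1} \leftarrow 0$. The pre-smoother $S_{pre}$ in step (2) therefore always starts from $v_m = 0$, and form (b) is used for $S_{pre}$: the residual $b_m - A_m v_m$ required in step (3) is then available from (b).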
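Equation (9) can be illustrated with a few lines of Python. The helper names `color` and `group_by_color` are hypothetical, and the grid size used in the usage note is an arbitrary small example; colors are numbered from 0, as the formula yields.

```python
def color(i, j, k, lx, ly, lz):
    """Color of node (i, j, k) according to Eq. (9); colors start at 0."""
    return i % lx + lx * (j % ly) + lx * ly * (k % lz)

def group_by_color(nx, ny, nz, lx, ly, lz):
    """Group the (nx+1)*(ny+1)*(nz+1) nodes of a grid with nx, ny, nz divisions by color."""
    groups = [[] for _ in range(lx * ly * lz)]
    for k in range(nz + 1):
        for j in range(ny + 1):
            for i in range(nx + 1):
                groups[color(i, j, k, lx, ly, lz)].append((i, j, k))
    return groups
```

With `group_by_color(3, 3, 3, 2, 2, 2)` the 64 nodes fall into 8 groups. A parallel smoother would loop over the groups sequentially and distribute the nodes inside one group over the threads.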
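Form (a) is used for $S_{post}$. Since both (a) and (b) are used, the multicolor ordering of Eq. (9) requires $l_x, l_y, l_z \ge 3$.

3.2
The block multicolor ordering 12),13) is applied to the AFW smoother. The nodes are grouped into blocks of $m_x$, $m_y$ and $m_z$ nodes in the $x$, $y$ and $z$ directions, and the color of node $(i, j, k)$ is

\[ \mathrm{color}(i, j, k) = \mathrm{mod}(\lfloor i/m_x \rfloor, l_x) + l_x\, \mathrm{mod}(\lfloor j/m_y \rfloor, l_y) + l_x l_y\, \mathrm{mod}(\lfloor k/m_z \rfloor, l_z) \tag{10} \]

Equation (10) with $m_x = m_y = m_z = 1$ reduces to Eq. (9).

Fig. 4 Example of the block multicolor ordering ($l_x = l_y = l_z = m_x = m_y = m_z = 2$).

As in 12),13), when the block size is 2 or more, form (b) can be used with $l_x, l_y, l_z \ge 2$. The degree of parallelism of Eq. (10) is

\[ p = \frac{n_x + 1}{l_x m_x} \cdot \frac{n_y + 1}{l_y m_y} \cdot \frac{n_z + 1}{l_z m_z} \tag{11} \]

where $n_x$, $n_y$ and $n_z$ are the numbers of divisions in the $x$, $y$ and $z$ directions.

4.
The numerical tests use the PCs listed in Table 1. The program is written in Fortran 95/2003 and compiled with Intel Visual Fortran 11.0 with the options /O2 /Qopenmp /Qparallel. The convergence tolerance is $10^{-10}$. In the AFW smoother, $l_x = l_y = l_z = l$ and $m_x = m_y = m_z = m$.

Table 1 Computers.

          PC1                                  PC2
OS        Windows XP Professional x64          Windows XP Professional x64
CPU       Intel Core 2 Extreme QX9770 3.2 GHz  Intel Core i7 950 3.06 GHz
Cores     4                                    4
Cache     L2: 6 MB x 2                         L2: 256 KB x 4, L3: 8 MB
Memory    8 GB                                 12 GB

4.1
The test models are shown in Figs. 5 and 6. (Body text not recovered. Surviving terms: test model 2; 1/8 of the model is analyzed; PML (Perfectly Matched Layer) 1); test model 1 is divided into 48 elements in each of the $x$, $y$ and $z$ directions.)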
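The block multicolor rule of Eq. (10) can be sketched in Python as below. The function name `block_color` and the tuple arguments `l = (lx, ly, lz)` and `m = (mx, my, mz)` are hypothetical conventions chosen for this illustration; colors are numbered from 0.

```python
def block_color(i, j, k, l, m):
    """Color of node (i, j, k) according to Eq. (10).

    l = (lx, ly, lz): number of colors per direction.
    m = (mx, my, mz): block size (nodes per block) per direction.
    """
    lx, ly, lz = l
    mx, my, mz = m
    # Integer division maps a node index to its block index in each direction;
    # the block index is then colored exactly as in Eq. (9).
    return (i // mx) % lx + lx * ((j // my) % ly) + lx * ly * ((k // mz) % lz)
```

With `l = m = (2, 2, 2)`, nodes 0 and 1 in a direction share a block and hence a color, node 2 starts the next block, and node 4 returns to the first color, matching the pattern of Fig. 4. Setting `m = (1, 1, 1)` reproduces Eq. (9).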
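Fig. 5 Test model 1.

Fig. 6 Test model 2 [mm].

Fig. 7 Comparison between the FEM and analytical solution.

Test model 1 has 345,744 unknowns. Fig. 7 compares the FEM result for $E_z$ with the analytical solution. Test model 2 (Fig. 6) uses a PML and is divided into 128 elements in each of the $x$, $y$ and $z$ directions, giving 6,390,144 unknowns.

4.2
The multigrid preconditioner is compared with the IC preconditioner, both combined with COCR, for test models 1 and 2 (Table 2). (Remaining body text not recovered. Surviving terms and citations: AFW smoother with $m$, $l$; IC applied to the $E$ formulation and the $E$-$\phi$ formulation 1),17); Section 2.2.)

Table 2 Performance comparison of the preconditioners.

Test model   Multigrid      IC
1            24 (13.5)      2855 (574.4)
2            15 (187.4)     18728 (45732.0)

Number of iterations, with the solution time (s) in parentheses. Test model 1 is run on PC1 and test model 2 on PC2.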
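The reported numbers of unknowns can be checked against the edge count of a structured brick mesh. The sketch below assumes that every edge of an $n_x \times n_y \times n_z$ brick mesh carries exactly one unknown; this is an assumption made here for illustration (the text only reports the totals), and the function name `num_edges` is hypothetical.

```python
def num_edges(nx, ny, nz):
    """Number of edges of a structured brick mesh with nx, ny, nz divisions.

    Edges parallel to x: nx * (ny + 1) * (nz + 1), and similarly for y and z.
    """
    return (nx * (ny + 1) * (nz + 1)
            + (nx + 1) * ny * (nz + 1)
            + (nx + 1) * (ny + 1) * nz)

print(num_edges(48, 48, 48))     # 345744
print(num_edges(128, 128, 128))  # 6390144
```

Both values agree with the totals quoted for test models 1 and 2.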
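4.3
Tables 3 and 4 show the effect of the ordering of the AFW smoother in sequential computing for test models 1 and 2, run on PC1 and PC2, respectively, with block sizes $m = 1$ and $m > 1$.

Table 3 Number of iterations and solution time in sequential computing (Test model 1).

l^3 \ m   1           2           3           5           10          15
2^3       42 (28.9)   22 (14.9)   21 (13.3)   22 (13.1)   23 (13.3)   23 (12.9)
3^3       30 (23.7)   22 (15.2)   23 (14.4)   23 (14.3)   23 (13.0)   23 (12.9)
4^3       27 (23.3)   22 (15.2)   20 (13.3)   22 (13.1)   23 (12.8)   23 (12.8)
5^3       24 (21.3)   22 (14.5)   23 (13.7)   22 (12.5)   24 (13.1)   23 (12.8)
6^3       25 (20.0)   21 (12.8)   22 (12.7)   23 (12.9)   24 (13.1)   23 (12.8)

Number of iterations, with the solution time (s) on PC1 in parentheses. Reference values (as in Table 2): 24 iterations, 13.5 s.

Table 4 Number of iterations and solution time in sequential computing (Test model 2).

l^3 \ m   1            2            3            5            10           15
2^3       15 (242.5)   16 (221.6)   15 (196.9)   16 (193.0)   16 (183.7)   16 (180.2)
3^3       19 (313.8)   15 (216.7)   16 (209.4)   15 (185.7)   17 (191.7)   17 (187.6)
4^3       20 (340.7)   16 (231.4)   15 (203.0)   16 (196.1)   17 (192.5)   15 (172.4)
5^3       17 (307.9)   17 (244.8)   15 (204.9)   15 (187.8)   16 (184.8)   16 (180.5)
6^3       17 (313.5)   17 (245.1)   17 (223.7)   15 (186.8)   16 (183.5)   16 (179.7)

Number of iterations, with the solution time (s) on PC2 in parentheses. Reference values (as in Table 2): 15 iterations, 187.4 s.

4.4
Tables 5 and 6 show the parallel results: test model 1 on PC1 with four threads and test model 2 on PC2 (Core i7) with eight threads. As noted in Section 3.1, the combination $l = 2$, $m = 1$ cannot be used for the parallel AFW smoother. With $p$ the degree of parallelism of Eq. (11) on $G_1$ and $N_{thread}$ the number of threads, the following condition is considered:

\[ p > 10 N_{thread} \tag{12} \]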
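The degree of parallelism $p$ of Eq. (11) and condition (12) can be evaluated with a short Python sketch. It assumes that the quotients in Eq. (11) are used without rounding (the extracted formula shows no rounding brackets, so this is an assumption), and the function names `parallelism` and `satisfies_condition_12` are hypothetical.

```python
def parallelism(n, l, m):
    """Degree of parallelism p of Eq. (11).

    n = (nx, ny, nz): numbers of divisions; l: colors per direction;
    m: block sizes per direction. Quotients are not rounded (assumption).
    """
    p = 1.0
    for n_d, l_d, m_d in zip(n, l, m):
        p *= (n_d + 1) / (l_d * m_d)
    return p

def satisfies_condition_12(n, l, m, n_threads):
    """Condition (12): p > 10 * N_thread."""
    return parallelism(n, l, m) > 10 * n_threads
```

For test model 1 (48 divisions per direction) and four threads, `satisfies_condition_12((48, 48, 48), (2, 2, 2), (5, 5, 5), 4)` holds, whereas the setting with `l = 6` and `m = 15` in every direction does not, since then $p < 1$.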
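Table 5 Number of iterations and solution time in parallel computing (Four threads, test model 1).

l^3 \ m   1           2          3          5          10          15
2^3       -           22 (8.1)   21 (7.5)   22 (7.1)   23 (7.2)    23 (7.8)
3^3       30 (14.7)   22 (8.5)   23 (8.3)   23 (7.4)   23 (8.1)    23 (10.2)
4^3       27 (13.5)   22 (8.0)   20 (7.4)   22 (7.2)   23 (8.8)    23 (11.4)
5^3       24 (10.1)   22 (7.9)   23 (7.3)   22 (7.1)   24 (11.5)   23 (11.3)
6^3       25 (9.2)    21 (6.9)   22 (6.9)   23 (8.5)   24 (11.6)   23 (12.8)

Number of iterations, with the solution time (s) on PC1 in parentheses. Reference values in sequential computing: 24 iterations, 13.5 s.

Table 6 Number of iterations and solution time in parallel computing (Eight threads, test model 2).

l^3 \ m   1            2           3           5           10          15
2^3       -            16 (68.1)   15 (60.5)   16 (61.1)   16 (60.3)   16 (63.2)
3^3       19 (96.6)    15 (65.5)   16 (64.4)   15 (58.9)   17 (64.4)   17 (69.0)
4^3       20 (106.7)   16 (70.0)   15 (61.9)   16 (62.5)   17 (67.7)   15 (68.9)
5^3       17 (96.8)    17 (74.1)   15 (62.0)   15 (60.2)   16 (68.0)   16 (78.6)
6^3       17 (98.9)    17 (74.5)   17 (69.0)   15 (60.6)   16 (70.8)   16 (90.6)

Number of iterations, with the solution time (s) on PC2 in parentheses. Reference values in sequential computing: 15 iterations, 187.4 s.

With the AFW smoother settings $l = 2$, $m = 5$ for test model 1 and $l = 2$, $m = 10$ for test model 2, the speed-up over the sequential reference is 1.9 and 3.1, respectively.

4.5
For test model 2 with $l = 2$ and $m = 10$, Table 7 compares the results with hyper-threading ON and OFF. The speed-up is 3.1 with hyper-threading ON and 2.8 with it OFF.

Table 7 Effect of hyper-threading technology (Test model 2).

Hyper-threading   Threads   Iterations (time)
ON                8         16 (60.3)
OFF               4         16 (68.1)

Solution time (s) on PC2 in parentheses; $l^3 = 2^3$, $m = 10$.

4.6
Test model 1 is solved on PC1 and PC2 (Fig. 8) with 48, 64 and 96 divisions in each of the $x$, $y$ and $z$ directions; the three meshes have 345,744, 811,200 and 2,709,792 unknowns. (Remaining body text not recovered. Surviving terms: eight threads on PC2; Section 4.4; condition (12) on $G_1$; four threads.)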
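The speed-up figures quoted in Sections 4.4 and 4.5 follow from the tabulated times. The sketch below simply divides the sequential reference time by the parallel time; the function name `speed_up` is hypothetical, and the input values are copied from Tables 5 to 7.

```python
def speed_up(t_sequential, t_parallel):
    """Ratio of the sequential reference time to the parallel solution time."""
    return t_sequential / t_parallel

# Test model 1 on PC1, four threads, l = 2, m = 5 (Table 5).
print(round(speed_up(13.5, 7.1), 1))    # 1.9
# Test model 2 on PC2, eight threads, hyper-threading ON, l = 2, m = 10 (Tables 6 and 7).
print(round(speed_up(187.4, 60.3), 1))  # 3.1
# Test model 2 on PC2, four threads, hyper-threading OFF (Table 7).
print(round(speed_up(187.4, 68.1), 1))  # 2.8
```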
(Section 4.6, continued. Body text not recovered. Surviving terms and citations: 14); comparison of PC2 and PC1; L2 and L3 caches; CPU.)

Fig. 8 Speed-up ratio (Test model 1).

5.
(Body text not recovered. Surviving terms: AFW smoother; PC.)

References
1) Jin, J.: The Finite Element Method in Electromagnetics, 2nd edition, John Wiley and Sons (2002).
2) Barrett, R., et al.: Templates (1996).
3) Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edition, SIAM (2003).
4) Hackbusch, W.: Multigrid Methods and Applications, Springer-Verlag (1985).
5) Briggs, W.L.: A Multigrid Tutorial, 2nd edition, SIAM (2000).
6) Zhu, Y. and Cangellaris, A.C.: Multigrid Finite Element Methods for Electromagnetic Field Modeling, IEEE Press (2006).
7) Arnold, D., Falk, R. and Winther, R.: Multigrid in H(div) and H(curl), Numer. Math., Vol.85, pp.197–218 (2000).
8) Hiptmair, R.: Multigrid Method for Maxwell's Equations, SIAM J. Numer. Anal., Vol.36, pp.204–225 (1998).
9) Weiss, B., Biro, O., Caldera, P., Hollaus, K., Paoli, G. and Preis, K.: A Multigrid Solver for Time Harmonic Three-Dimensional Electromagnetic Wave Problems, IEEE Trans. Magn., Vol.42, No.4, pp.639–642 (2006).
10) (Authors and title not recovered; keyword: PC), B, Vol.127, No.8, pp.911–917 (2007).
11) Doi, S. and Washio, T.: Ordering Strategies and Related Techniques to Overcome the Trade-off between Parallelism and Convergence in Incomplete Factorizations, Parallel Comput., Vol.25, pp.1995–2014 (1999).
12) Iwashita, T. and Shimasaki, M.: Block Red-Black Ordering Method for Parallel Processing of ICCG Solver, Lecture Notes in Computer Science, Vol.2327, pp.175–189, Springer (2002).
13) Iwashita, T. and Shimasaki, M.: Algebraic Block Red-Black Ordering Method for Parallelized ICCG Solver with Fast Convergence and Low Communication Costs,
IEEE Trans. Magn., Vol.39, No.3, pp.1713–1716 (2003).
14) (Authors and title not recovered; keyword: ICCG), Vol.2009-HPC-121, No.11 (2009).
15) Watanabe, K., Igarashi, H. and Honma, T.: Convergence of Multigrid Method for Edge-Based Finite-Element Method, IEEE Trans. Magn., Vol.39, No.3 (2003).
16) Sogabe, T. and Zhang, S.: A COCR Method for Solving Complex Symmetric Linear Systems, J. Comput. Appl. Math., Vol.199, No.2, pp.297–303 (2007).
17) Mifune, T., Takahashi, Y. and Iwashita, T.: Folded Preconditioner: A New Class of Preconditioners for Krylov Subspace Methods to Solve Redundancy-Reduced Linear Systems of Equations, IEEE Trans. Magn., Vol.45, No.5, pp.2068–2075 (2009).

(Received January 26, 2010)
(Accepted May 4, 2010)