[[wg:dynamo:Performance_results|Back to performance benchmark lists]] \\

===== Modules for libraries =====

module swap mvapich2 impi/4.1.0.030

===== Compile options =====

F90OPTFLAGS = -O3 -r8 -cpp -openmp -xhost

===== Notes =====

At least 3 MPI processes are required. \\
Each MPI process needs at least 4 radial levels. \\
Elapsed time is measured by inserting MPI_wtime() calls in parody.f90. \\

===== Definition of columns =====

^ Name ^ Description ^
| # of Cores | Number of CPU cores used |
| # of Processes | Number of MPI processes |
| # of Threads | Number of threads per process |
| $l_{max}$ | Truncation level for spherical harmonics |
| $(N_{r},N_{\theta},N_{\phi})$ | Number of grid points in spherical coordinates |
| Elapsed | Elapsed (wall clock) time for one time step |
| Nonlinear | Elapsed (wall clock) time for evaluation of the nonlinear terms |
| Solver | Elapsed (wall clock) time for the linear solver (including communications) |
| Efficiency | Parallel efficiency |
| SUs | Service units (core hours) for $10^{4}$ time steps |

===== Four-Processor Result =====

^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^ Elapsed ^ Nonlinear ^ Solver ^ SUs ^
| 47 | ( 73,72,144) | 0.269091 | 0.257912 | 0.00424973 | 0.747475 |

===== Strong Scaling Results =====

^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^
| 255 | (512,384,768) |

^ # Cores ^ # Processes ^ # Threads ^ Elapsed ^ Nonlinear ^ Solver ^ Efficiency ^ SUs ^
| 16 | 4 | 4 | 12.54290 | 11.87211 | 0.287119 | 1.000000 | 557.462 |
| 32 | 4 | 8 | 6.805739 | 6.191985 | 0.254288 | 0.921494 | 604.955 |
| 32 | 8 | 4 | 6.363801 | 6.005817 | 0.163213 | 0.985488 | 565.671 |
| 64 | 8 | 8 | 3.432315 | 3.104396 | 0.144843 | 0.913589 | 610.189 |
| 64 | 16 | 4 | 3.209374 | 2.992346 | 0.116619 | 0.977052 | 570.555 |
| 128 | 16 | 8 | 1.754511 | 1.551827 | 0.110802 | 0.893618 | 623.826 |
| 128 | 32 | 4 | 1.685379 | 1.503336 | 0.127453 | 0.930273 | 599.246 |
| 128 | 64 | 2 | 1.836404 | 1.561923 | 0.226236 | 0.853768 | 652.944 |
| 128 | 128 | 1 | 2.535049 | 1.993672 | 0.481132 | 0.618474 | 901.351 |
| 256 | 32 | 8 | 0.951109 | 0.779863 | 0.122069 | 0.824229 | 676.344 |
| 256 | 64 | 4 | 0.997783 | 0.755018 | 0.191324 | 0.785673 | 709.535 |
| 256 | 128 | 2 | 1.193223 | 0.779913 | 0.380951 | 0.656986 | 848.514 |
| 512 | 64 | 8 | 0.6191725 | 0.393483 | 0.194016 | 0.633048 | 880.601 |
| 512 | 128 | 4 | 0.736604 | 0.380441 | 0.333829 | 0.532125 | 1047.61 |
| 1024 | 128 | 8 | 0.564325 | 0.199389 | 0.342527 | 0.347287 | 1605.19 |
| 2048 | 128 | 16 | 0.575268 | 0.217482 | 0.336122 | 0.170340 | 3272.64 |

{{wg:dynamo:Performance_results:IPGP:strong_scale_parody.png?480}}\\
Elapsed (wall clock) time for the strong scaling. The numbers indicate the number of OpenMP threads. Ideal scaling is plotted as a dotted line.

{{wg:dynamo:Performance_results:IPGP:strong_efficiency_parody.png?480}}\\
Parallel efficiency for the strong scaling. The numbers indicate the number of OpenMP threads.

===== Weak Scaling Results =====

^ # Cores ^ # Processes ^ # Threads ^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^ Elapsed ^ Nonlinear ^ Solver ^ SUs ^
| 4 | 4 | 1 | 15 | (512,24,48) | 0.03682030 | 0.03474719 | 0.00099214 | 1.63646 |
| 16 | 4 | 4 | 31 | (512,48,96) | 0.05962245 | 0.05093837 | 0.00317335 | 2.64989 |
| 16 | 8 | 2 | 31 | (512,48,96) | 0.05066221 | 0.04486143 | 0.00221673 | 2.25165 |
| 64 | 4 | 16 | 63 | (512,96,192) | 0.2261694 | 0.1915030 | 0.01298851 | 40.2079 |
| 64 | 8 | 8 | 63 | (512,96,192) | 0.0879787 | 0.0693352 | 0.00814595 | 15.6407 |
| 64 | 16 | 4 | 63 | (512,96,192) | 0.0764195 | 0.0639300 | 0.00679417 | 13.5857 |
| 64 | 32 | 2 | 63 | (512,96,192) | 0.0811522 | 0.0682890 | 0.00898229 | 14.4271 |
| 64 | 64 | 1 | 63 | (512,96,192) | 0.0872758 | 0.0677864 | 0.01562347 | 15.5157 |
| 256 | 32 | 8 | 127 | (512,192,384) | 0.151465 | 0.109548 | 0.03015987 | 107.708 |
| 256 | 64 | 4 | 127 | (512,192,384) | 0.158937 | 0.105684 | 0.04617718 | 113.022 |
| 256 | 128 | 2 | 127 | (512,192,384) | 0.194962 | 0.103681 | 0.08474707 | 138.640 |
| 1024 | 128 | 8 | 255 | (512,384,768) | 0.564325 | 0.199389 | 0.342527 | 1605.19 |

{{wg:dynamo:Performance_results:IPGP:parody_weak_sph.png?480}}\\
Elapsed (wall clock) time for the weak scaling in the horizontal resolution. The numbers indicate the number of OpenMP threads. Ideal scaling for the Legendre transform ($O(N_{core}^{1/2})$) is plotted as a dotted line.

^ # Cores ^ # Processes ^ # Threads ^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^ Elapsed ^ Nonlinear ^ Solver ^ SUs ^
| 64 | 8 | 8 | 255 | (32,384,768) | 0.23631219 | 0.196183 | 0.0222231 | 42.0111 |
| 128 | 16 | 8 | 255 | (64,384,768) | 0.2575840 | 0.196314 | 0.043000 | 91.5854 |
| 256 | 32 | 8 | 255 | (128,384,768) | 0.2998254 | 0.196222 | 0.084590 | 213.209 |
| 512 | 64 | 8 | 255 | (256,384,768) | 0.3866510 | 0.197542 | 0.169289 | 549.904 |
| 1024 | 128 | 8 | 255 | (512,384,768) | 0.564325 | 0.199389 | 0.342527 | 1605.19 |
| 2048 | 128 | 16 | 255 | (1024,384,768) | 0.82399 | 0.427062 | 0.368039 | 4687.59 |
| 2048 | 256 | 8 | 255 | (1024,384,768) | 0.903354 | 0.203544 | 0.675003 | 5139.08 |

{{wg:dynamo:Performance_results:IPGP:parody_weak_r.png?480}}\\
Elapsed (wall clock) time for the weak scaling in the radial resolution. The results with 8 OpenMP threads are shown. Ideal scaling for the Legendre transform ($O(N_{core}^{1/2})$) is plotted as a dotted line.

[[wg:dynamo:Performance_results:IPGP:files|files]]
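
The Efficiency and SUs columns in the strong scaling table appear to follow directly from the Elapsed column. The page does not state the formulas, so the sketch below is an inference from the tabulated values: it assumes SUs are core hours for $10^{4}$ steps, and that efficiency is measured relative to the 16-core run (the row whose efficiency is 1.000000). The function names are hypothetical, and only the strong scaling rows listed in the snippet were checked against it.

```python
# Sketch: derive SUs and parallel efficiency from elapsed time per step.
# Assumption (inferred from the table, not stated on the page):
#   SUs        = elapsed [s] * cores * 10^4 steps / 3600 [s/hour]
#   efficiency = (T_ref * cores_ref) / (T * cores), with the 16-core run as reference


def service_units(elapsed_s, cores, steps=10_000):
    """Core hours needed for `steps` time steps."""
    return elapsed_s * cores * steps / 3600.0


def parallel_efficiency(elapsed_s, cores, ref_elapsed_s, ref_cores):
    """Strong-scaling efficiency relative to a reference run."""
    return (ref_elapsed_s * ref_cores) / (elapsed_s * cores)


# (cores, elapsed seconds per step) taken from the strong scaling table
rows = [(16, 12.54290), (32, 6.805739), (128, 1.685379), (2048, 0.575268)]
ref_cores, ref_elapsed = rows[0]

for cores, elapsed in rows:
    eff = parallel_efficiency(elapsed, cores, ref_elapsed, ref_cores)
    sus = service_units(elapsed, cores)
    print(f"{cores:5d} cores: efficiency = {eff:.6f}, SUs = {sus:.3f}")
```

Under these assumptions, efficiency is simply the ratio of the reference run's SUs to the current run's SUs, so either column can be recovered from the other.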