[[wg:dynamo:Performance_results|Back to performance benchmark lists]] \\

===== modules for libraries =====

module swap mvapich2 impi/4.1.0.030

(NetCDF4.3.2 is compiled locally, excluding HDF5 features)

===== compile options =====

-openmp -O3 -xAVX -align array32byte

===== Restrictions =====

The number of zonal grid points must be a power of 2.

===== Definition of columns =====

^ Name ^ Definition ^
| # of Cores | Number of CPU cores used |
| # of Processes | Number of MPI processes |
| # of Threads | Number of threads per process |
| $N_{c}$ | Truncation level for the Chebyshev polynomials |
| $l_{max}$ | Truncation level for the spherical harmonics |
| $(N_{r},N_{\theta},N_{\phi})$ | Number of grid points in spherical coordinates |
| Elapsed | Elapsed (wall clock) time for one time step |
| Nonlinear | Elapsed (wall clock) time for evaluation of the nonlinear terms |
| Solver | Elapsed (wall clock) time for the linear solver (including communications) |
| Efficiency | Parallel efficiency |
| SUs | Service units for $10^{4}$ time steps (core hours) |

===== Single Processor Result =====

^ $N_{c}$ ^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^ Elapsed ^ Nonlinear ^ Solver ^
| 47 | 48 | (48,64,128) | 0.445464 | 0.233546 | 0.183094 |

===== Strong Scaling Results =====

In this test, the spatial resolution is fixed and the parallelization is varied. Under ideal scaling, the elapsed time is inversely proportional to the number of cores.
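As a minimal sketch of the ideal-scaling statement above, the following Python snippet computes the ideal elapsed time and a parallel efficiency relative to a reference run. The formula (reference time times reference cores, divided by measured time times cores) is an assumption: the page does not state how the Efficiency column was computed, although this formula reproduces the first few Efficiency values of the first strong-scaling table. The function names are hypothetical and are not part of the benchmarked code.

```python
def parallel_efficiency(t_ref, cores_ref, t, cores):
    """Efficiency of a run relative to a reference run (1.0 means ideal scaling).

    Assumed definition: (t_ref * cores_ref) / (t * cores).
    """
    return (t_ref * cores_ref) / (t * cores)


def ideal_elapsed(t_ref, cores_ref, cores):
    """Elapsed time expected if time were inversely proportional to the core count."""
    return t_ref * cores_ref / cores


if __name__ == "__main__":
    # Reference run: 1 core, elapsed time per step from the first strong-scaling table.
    t_ref, cores_ref = 2.12702, 1
    # (cores, measured elapsed time per step) for the first few rows of that table.
    runs = [(1, 2.12702), (2, 1.53197), (4, 1.41881), (8, 1.29347)]
    for cores, elapsed in runs:
        eff = parallel_efficiency(t_ref, cores_ref, elapsed, cores)
        ideal = ideal_elapsed(t_ref, cores_ref, cores)
        print(f"{cores:3d} cores: ideal {ideal:.5f} s, "
              f"measured {elapsed:.5f} s, efficiency {eff:.4f}")
```

An efficiency well below 1 means that adding cores reduced the elapsed time much less than proportionally.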

^ $N_{c}$ ^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^
| 48 | 85 | (48,128,256) |

^ # of Cores ^ # of Processes ^ # of Threads ^ Elapsed ^ Nonlinear ^ Solver ^ Efficiency ^ SUs ^
| 1 | 1 | 1 | 2.12702 | 1.13331 | 0.837597 | 1.0 | 15.8769 |
| 2 | 2 | 1 | 1.53197 | 0.771775 | 0.666242 | 0.694213 | 8.32400 |
| 4 | 2 | 2 | 1.41881 | 0.677849 | 0.657223 | 0.374789 | 4.57244 |
| 8 | 2 | 4 | 1.29347 | 0.644326 | 0.586741 | 0.205553 | 2.57067 |
| 16 | 2 | 8 | 1.20088 | 0.577998 | 0.562982 | 0.110701 | 2.57067 |
| 32 | 2 | 16 | 4.45394 | 1.40103 | 2.96418 | 0.0149237 | 2.57067 |
| 4 | 4 | 1 | 1.52251 | 0.61794 | 0.838045 | 0.349261 | 8.32400 |
| 8 | 4 | 2 | 1.308 | 0.573778 | 0.669746 | 0.20327 | 4.57244 |
| 16 | 4 | 4 | 1.19526 | 0.551808 | 0.592631 | 0.111222 | 2.57067 |
| 32 | 4 | 8 | 1.11331 | 0.503031 | 0.560936 | 0.0597045 | 2.57067 |
| 64 | 4 | 16 | 1.11331 | 1.77167 | 3.03814 | 0.00680244 | 2.57067 |
| 8 | 8 | 1 | 4.8857 | 0.681461 | 0.913425 | 0.160511 | 8.32400 |
| 16 | 8 | 2 | 1.46565 | 0.661892 | 0.743852 | 0.0907029 | 4.57244 |
| 32 | 8 | 4 | 1.14545 | 0.505259 | 0.596385 | 0.0580289 | 2.57067 |
| 64 | 8 | 8 | 1.06662 | 0.465445 | 0.562054 | 0.031159 | 2.57067 |
| 16 | 16 | 1 | 3.08775 | 1.58078 | 1.37636 | 0.0430535 | 8.32400 |
| 32 | 16 | 2 | 1.45392 | 0.658729 | 0.74527 | 0.0457173 | 4.57244 |
| 64 | 16 | 4 | 1.13421 | 0.497914 | 0.593822 | 0.029302 | 2.57067 |
| 128 | 16 | 8 | 1.06704 | 0.467572 | 0.561543 | 0.0155732 | 2.57067 |

^ $N_{c}$ ^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^
| 97 | 170 | (145,256,512) |

^ # of Cores ^ # of Processes ^ # of Threads ^ Elapsed ^ Nonlinear ^ Solver ^ Efficiency ^ SUs ^
| 8 | 1 | 8 | 11.3106 | 4.97222 | 5.77018 | 1.0 | 814.222 |
| 16 | 2 | 8 | 9.19364 | 3.65295 | 5.16199 | 1.05368 | 386.372 |
| 32 | 4 | 8 | 8.79569 | 3.32431 | 5.15045 | 1.01463 | 200.621 |
| 64 | 8 | 8 | 8.66073 | 3.21434 | 5.16300 | 0.904766 | 112.491 |
| 128 | 16 | 8 | 8.25554 | 2.83906 | 5.17103 | 0.528483 | 96.2924 |
| 256 | 32 | 8 | 8.50060 | 3.08209 | 5.17536 | 0.528483 | 96.2924 |

{{wg:dynamo:Performance_results:Dennou:SPmodel_elapsed.png?480}}\\
Elapsed (wall clock) time for the strong scaling. The numbers in the plot indicate the number of OpenMP threads. The dotted line shows ideal scaling.

{{wg:dynamo:Performance_results:Dennou:SPmodel_efficiency.png?480}}\\
Parallel efficiency for the strong scaling. The numbers in the plot indicate the number of OpenMP threads.

===== Weak Scaling Results =====

=== Weak Scaling in horizontal direction ===

In this benchmark, the radial resolution is fixed and the horizontal resolution is increased together with the number of cores.

^ # of Cores ^ # of Processes ^ # of Threads ^ $N_{c}$ ^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^ Nonlinear ^ Solver ^ SUs ^
| 4 | 1 | 4 | 64 | 42 | (64,64,128) | 0.307596 | 0.296679 | 13.6710 |
| 16 | 2 | 8 | 64 | 85 | (64,128,256) | 0.773182 | 1.00419 | 34.3636 |
| 64 | 8 | 8 | 64 | 170 | (64,256,512) | 3.21434 | 5.16300 | 142.860 |
| 256 | 16 | 8 | 64 | 341 | (64,512,1024) | 39.123 | 37.1453 | 1738.80 |

{{wg:dynamo:Performance_results:Dennou:spmodel_weak_sph.png?480}}\\
Elapsed (wall clock) time for the weak scaling in the horizontal directions. The numbers in the plot indicate the number of OpenMP threads.

===== Note =====

Times are averages over 100 time steps. \\
Only the spherical harmonic transform and the evaluation of the nonlinear terms are parallelized. \\

[[wg:dynamo:Performance_results|Back to performance benchmark lists]] \\
[[wg:dynamo:Performance_results:Dennou:files|files]]