module swap mvapich2 impi/4.1.0.030
(NetCDF4.3.2 is compiled locally excluding HDF5 features)
-openmp -O3 -xAVX -align array32byte
The number of zonal grid points must be a power of 2
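
The power-of-2 constraint on the zonal grid count can be checked before a run is set up. The following is a minimal sketch; the helper name `is_power_of_two` is hypothetical and not part of the benchmarked code.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive integer power of 2 (hypothetical helper)."""
    # A power of 2 has exactly one bit set, so n & (n - 1) clears it to zero.
    return n > 0 and (n & (n - 1)) == 0


# Zonal grid counts N_phi that appear in the tables on this page.
for n_phi in (128, 256, 512, 1024):
    assert is_power_of_two(n_phi)
```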
Name | Description |
---|---|
# of Cores | Number of used CPU cores |
# of Processes | Number of MPI processes |
# of Threads | Number of threads for each process |
$N_{c}$ | Truncation level for Chebyshev polynomials |
$l_{max}$ | Truncation level for spherical harmonics |
$(N_{r},N_{\theta},N_{\phi})$ | Number of grid points in spherical coordinates |
Elapsed | Elapsed (wall clock) time for one time step |
Nonlinear | Elapsed (wall clock) time for evaluation of the nonlinear terms |
Solver | Elapsed (wall clock) time for the linear solver (including communications) |
Efficiency | Parallel efficiency |
SUs | Service units for $10^{4}$ time steps (core hours) |
$N_{c}$ | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Elapsed | Nonlinear | Solver |
---|---|---|---|---|---|
47 | 48 | (48,64,128) | 0.445464 | 0.233546 | 0.183094 |
In the present test, the spatial resolution is fixed and the parallelization is varied. Under ideal scaling, the elapsed time is inversely proportional to the number of cores.
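
The relation between ideal scaling and parallel efficiency can be sketched as follows. The function names are hypothetical, and the formula (reference time times reference cores, divided by measured time times cores) is an assumption that happens to reproduce the tabulated efficiency of the $l_{max}=85$ case to about three decimals. The two elapsed times are taken from the 1-core and 2-core rows of that table.

```python
def ideal_elapsed(t_ref: float, cores_ref: int, cores: int) -> float:
    """Elapsed time under ideal scaling: inversely proportional to the core count."""
    return t_ref * cores_ref / cores


def parallel_efficiency(t_ref: float, cores_ref: int, t: float, cores: int) -> float:
    """Ratio of the ideal elapsed time to the measured elapsed time."""
    return ideal_elapsed(t_ref, cores_ref, cores) / t


# Reference run: 1 core, 2.12702 per step. Measured run: 2 cores, 1.53197 per step.
t_ref, t_2 = 2.12702, 1.53197
eff_2 = parallel_efficiency(t_ref, 1, t_2, 2)
print(f"ideal elapsed on 2 cores: {ideal_elapsed(t_ref, 1, 2):.5f}")
print(f"efficiency on 2 cores:    {eff_2:.3f}")
```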
$N_{c}$ | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ |
---|---|---|
48 | 85 | (48,128,256) |
# of Cores | # of Processes | # of Threads | Elapsed | Nonlinear | Solver | Efficiency | SUs |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 2.12702 | 1.13331 | 0.837597 | 1.0 | 15.8769 |
2 | 2 | 1 | 1.53197 | 0.771775 | 0.666242 | 0.694213 | 8.32400 |
4 | 2 | 2 | 1.41881 | 0.677849 | 0.657223 | 0.374789 | 4.57244 |
8 | 2 | 4 | 1.29347 | 0.644326 | 0.586741 | 0.205553 | 2.57067 |
16 | 2 | 8 | 1.20088 | 0.577998 | 0.562982 | 0.110701 | 2.57067 |
32 | 2 | 16 | 4.45394 | 1.40103 | 2.96418 | 0.0149237 | 2.57067 |
4 | 4 | 1 | 1.52251 | 0.61794 | 0.838045 | 0.349261 | 8.32400 |
8 | 4 | 2 | 1.308 | 0.573778 | 0.669746 | 0.20327 | 4.57244 |
16 | 4 | 4 | 1.19526 | 0.551808 | 0.592631 | 0.111222 | 2.57067 |
32 | 4 | 8 | 1.11331 | 0.503031 | 0.560936 | 0.0597045 | 2.57067 |
64 | 4 | 16 | 4.8857 | 1.77167 | 3.03814 | 0.00680244 | 2.57067 |
8 | 8 | 1 | 1.6564 | 0.681461 | 0.913425 | 0.160511 | 8.32400 |
16 | 8 | 2 | 1.46565 | 0.661892 | 0.743852 | 0.0907029 | 4.57244 |
32 | 8 | 4 | 1.14545 | 0.505259 | 0.596385 | 0.0580289 | 2.57067 |
64 | 8 | 8 | 1.06662 | 0.465445 | 0.562054 | 0.031159 | 2.57067 |
16 | 16 | 1 | 3.08775 | 1.58078 | 1.37636 | 0.0430535 | 8.32400 |
32 | 16 | 2 | 1.45392 | 0.658729 | 0.74527 | 0.0457173 | 4.57244 |
64 | 16 | 4 | 1.13421 | 0.497914 | 0.593822 | 0.029302 | 2.57067 |
128 | 16 | 8 | 1.06704 | 0.467572 | 0.561543 | 0.0155732 | 2.57067 |
$N_{c}$ | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ |
---|---|---|
97 | 170 | (145,256,512) |
# of Cores | # of Processes | # of Threads | Elapsed | Nonlinear | Solver | Efficiency | SUs |
---|---|---|---|---|---|---|---|
8 | 1 | 8 | 11.3106 | 4.97222 | 5.77018 | 1.0 | 814.222 |
16 | 2 | 8 | 9.19364 | 3.65295 | 5.16199 | 1.05368 | 386.372 |
32 | 4 | 8 | 8.79569 | 3.32431 | 5.15045 | 1.01463 | 200.621 |
64 | 8 | 8 | 8.66073 | 3.21434 | 5.16300 | 0.904766 | 112.491 |
128 | 16 | 8 | 8.25554 | 2.83906 | 5.17103 | 0.528483 | 96.2924 |
256 | 32 | 8 | 8.50060 | 3.08209 | 5.17536 | 0.528483 | 96.2924 |
Elapsed (wall clock) time for the strong scaling. The numbers indicate the number of OpenMP threads. The dotted line shows ideal scaling.
Parallel efficiency for the strong scaling. The numbers indicate the number of OpenMP threads.
In the present benchmark, the radial resolution is fixed, and the horizontal resolution is increased together with the number of cores.
# of Cores | # of Processes | # of Threads | $N_{c}$ | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Elapsed | Nonlinear | Solver | SUs |
---|---|---|---|---|---|---|---|---|---|
4 | 1 | 4 | 64 | 42 | (64,64,128) | | 0.307596 | 0.296679 | 13.6710 |
16 | 2 | 8 | 64 | 85 | (64,128,256) | | 0.773182 | 1.00419 | 34.3636 |
64 | 8 | 8 | 64 | 170 | (64,256,512) | | 3.21434 | 5.16300 | 142.860 |
256 | 32 | 8 | 64 | 341 | (64,512,1024) | | 39.123 | 37.1453 | 1738.80 |
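
The weak-scaling setup can be checked by counting grid points per core: if the horizontal resolution grows in step with the core count, that number stays constant. The sketch below uses the core counts and grid sizes from the table above; the function name `points_per_core` is hypothetical.

```python
# (cores, (N_r, N_theta, N_phi)) for each row of the weak-scaling table.
cases = [
    (4, (64, 64, 128)),
    (16, (64, 128, 256)),
    (64, (64, 256, 512)),
    (256, (64, 512, 1024)),
]


def points_per_core(cores: int, grid: tuple[int, int, int]) -> int:
    """Total number of grid points divided by the number of cores."""
    n_r, n_theta, n_phi = grid
    return n_r * n_theta * n_phi // cores


loads = [points_per_core(cores, grid) for cores, grid in cases]
print(loads)  # [131072, 131072, 131072, 131072]
```

Each fourfold increase in cores is matched by a doubling of both $N_{\theta}$ and $N_{\phi}$, so the work per core is the same in every row.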
Elapsed (wall clock) time for the weak scaling in the horizontal directions. The numbers indicate the number of OpenMP threads.
Times are averaged over 100 time steps.
Only the spherical harmonic transform and the evaluation of the nonlinear terms are parallelized.
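
When only part of each time step is parallelized, the achievable speedup is bounded no matter how many cores are added. One common way to reason about this is Amdahl's law, which the benchmark itself does not state. The sketch below assumes a parallelizable fraction `p` chosen purely for illustration; it is not measured from the data on this page.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup on n cores when a fraction p of the work is parallel."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


p = 0.5  # hypothetical parallel fraction, for illustration only
for n in (1, 2, 4, 8, 16, 64):
    # The speedup approaches, but never reaches, the limit 1 / (1 - p).
    print(n, round(amdahl_speedup(p, n), 3))
```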