
Modules for libraries

module swap mvapich2 impi/4.1.0.030

(NetCDF 4.3.2 is compiled locally without HDF5 features)

Compile options

-openmp -O3 -xAVX -align array32byte
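As a sketch, these Intel Fortran flags might enter a build file as follows; the `mpiifort` wrapper and the source-file pattern are assumptions, not taken from the benchmark's actual build system:

```makefile
# Hypothetical Makefile fragment -- compiler wrapper and file pattern are assumptions.
FC      = mpiifort
FCFLAGS = -openmp -O3 -xAVX -align array32byte

%.o: %.f90
	$(FC) $(FCFLAGS) -c $< -o $@
```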

Restrictions

The number of zonal grid points must be a power of 2.
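The power-of-2 restriction can be checked with a standard bit trick; this is a small illustrative snippet, not part of the benchmark code:

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two iff exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0

# The zonal grid sizes (N_phi) used in the tables all satisfy the restriction:
print([is_power_of_two(n) for n in (128, 256, 512, 1024)])
# -> [True, True, True, True]
```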

Definition of columns

| Name | Description |
|---|---|
| # of Cores | Number of CPU cores used |
| # of Processes | Number of MPI processes |
| # of Threads | Number of OpenMP threads per process |
| $N_{c}$ | Truncation level for the Chebyshev polynomials |
| $l_{max}$ | Truncation level for the spherical harmonics |
| $(N_{r},N_{\theta},N_{\phi})$ | Number of grid points in spherical coordinates |
| Elapsed | Elapsed (wall-clock) time for one time step |
| Nonlinear | Elapsed (wall-clock) time for the evaluation of the nonlinear terms |
| Solver | Elapsed (wall-clock) time for the linear solver (including communications) |
| Efficiency | Parallel efficiency |
| SUs | Service units for $10^{4}$ time steps (core hours) |
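The Efficiency column follows the usual strong-scaling definition $E(N) = T(1) / (N \cdot T(N))$. A minimal sketch (function names are mine; the SU helper assumes a plain core-hours charge, which does not necessarily reproduce the site's actual charging policy or the tabulated SU values):

```python
def parallel_efficiency(t1: float, tn: float, n: int) -> float:
    """Strong-scaling parallel efficiency E(N) = T(1) / (N * T(N))."""
    return t1 / (n * tn)

def core_hours(cores: int, seconds_per_step: float, steps: int = 10**4) -> float:
    """Core hours for `steps` time steps, assuming a plain core-hours charge."""
    return cores * seconds_per_step * steps / 3600.0

# Elapsed times from the first strong-scaling table (1 core vs. 4 processes x 1 thread):
print(round(parallel_efficiency(2.12702, 1.52251, 4), 6))  # close to the tabulated 0.349261
```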

Single Processor Result

| $N_{c}$ | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Elapsed | Nonlinear | Solver |
|---|---|---|---|---|---|
| 47 | 48 | (48,64,128) | 0.445464 | 0.233546 | 0.183094 |

Strong Scaling Results

In this test, the spatial resolution is fixed while the parallelization is varied. Under ideal scaling, the elapsed time is inversely proportional to the number of cores.

$N_{c} = 48$, $l_{max} = 85$, $(N_{r},N_{\theta},N_{\phi}) = (48,128,256)$

| # of Cores | # of Processes | # of Threads | Elapsed | Nonlinear | Solver | Efficiency | SUs |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 2.12702 | 1.13331 | 0.837597 | 1.0 | 15.8769 |
| 2 | 2 | 1 | 1.53197 | 0.771775 | 0.666242 | 0.694213 | 8.32400 |
| 4 | 2 | 2 | 1.41881 | 0.677849 | 0.657223 | 0.374789 | 4.57244 |
| 8 | 2 | 4 | 1.29347 | 0.644326 | 0.586741 | 0.205553 | 2.57067 |
| 16 | 2 | 8 | 1.20088 | 0.577998 | 0.562982 | 0.110701 | 2.57067 |
| 32 | 2 | 16 | 4.45394 | 1.40103 | 2.96418 | 0.0149237 | 2.57067 |
| 4 | 4 | 1 | 1.52251 | 0.61794 | 0.838045 | 0.349261 | 8.32400 |
| 8 | 4 | 2 | 1.308 | 0.573778 | 0.669746 | 0.20327 | 4.57244 |
| 16 | 4 | 4 | 1.19526 | 0.551808 | 0.592631 | 0.111222 | 2.57067 |
| 32 | 4 | 8 | 1.11331 | 0.503031 | 0.560936 | 0.0597045 | 2.57067 |
| 64 | 4 | 16 | 4.8857 | 1.77167 | 3.03814 | 0.00680244 | 2.57067 |
| 8 | 8 | 1 | 1.65644 | 0.681461 | 0.913425 | 0.160511 | 8.32400 |
| 16 | 8 | 2 | 1.46565 | 0.661892 | 0.743852 | 0.0907029 | 4.57244 |
| 32 | 8 | 4 | 1.14545 | 0.505259 | 0.596385 | 0.0580289 | 2.57067 |
| 64 | 8 | 8 | 1.06662 | 0.465445 | 0.562054 | 0.031159 | 2.57067 |
| 16 | 16 | 1 | 3.08775 | 1.58078 | 1.37636 | 0.0430535 | 8.32400 |
| 32 | 16 | 2 | 1.45392 | 0.658729 | 0.74527 | 0.0457173 | 4.57244 |
| 64 | 16 | 4 | 1.13421 | 0.497914 | 0.593822 | 0.029302 | 2.57067 |
| 128 | 16 | 8 | 1.06704 | 0.467572 | 0.561543 | 0.0155732 | 2.57067 |
$N_{c} = 97$, $l_{max} = 170$, $(N_{r},N_{\theta},N_{\phi}) = (145,256,512)$

| # of Cores | # of Processes | # of Threads | Elapsed | Nonlinear | Solver | Efficiency | SUs |
|---|---|---|---|---|---|---|---|
| 8 | 1 | 8 | 11.3106 | 4.97222 | 5.77018 | 1.0 | 814.222 |
| 16 | 2 | 8 | 9.19364 | 3.65295 | 5.16199 | 1.05368 | 386.372 |
| 32 | 4 | 8 | 8.79569 | 3.32431 | 5.15045 | 1.01463 | 200.621 |
| 64 | 8 | 8 | 8.66073 | 3.21434 | 5.16300 | 0.904766 | 112.491 |
| 128 | 16 | 8 | 8.25554 | 2.83906 | 5.17103 | 0.528483 | 96.2924 |
| 256 | 32 | 8 | 8.50060 | 3.08209 | 5.17536 | 0.528483 | 96.2924 |


Figure: Elapsed (wall-clock) time for the strong scaling. The labels give the number of OpenMP threads per process; the dotted line shows ideal scaling.


Figure: Parallel efficiency for the strong scaling. The labels give the number of OpenMP threads per process.

Weak Scaling Results

Weak Scaling in horizontal direction

In this benchmark, the radial resolution is fixed while the horizontal resolution grows with the number of cores.

| # of Cores | # of Processes | # of Threads | $N_{c}$ | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Nonlinear | Solver | SUs |
|---|---|---|---|---|---|---|---|---|
| 4 | 1 | 4 | 64 | 42 | (64,64,128) | 0.307596 | 0.296679 | 13.6710 |
| 16 | 2 | 8 | 64 | 85 | (64,128,256) | 0.773182 | 1.00419 | 34.3636 |
| 64 | 8 | 8 | 64 | 170 | (64,256,512) | 3.21434 | 5.16300 | 142.860 |
| 256 | 16 | 8 | 64 | 341 | (64,512,1024) | 39.123 | 37.1453 | 1738.80 |
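The resolution sequence in the table above follows a simple pattern: each row quadruples the core count and doubles both horizontal grid dimensions, while $N_{r}$ stays fixed at 64. A small illustrative snippet (the function name is mine):

```python
def horizontal_resolutions(levels: int, nr: int = 64):
    """Grids for horizontal weak scaling: N_theta and N_phi double at each level."""
    return [(nr, 64 * 2**k, 128 * 2**k) for k in range(levels)]

print(horizontal_resolutions(4))
# -> [(64, 64, 128), (64, 128, 256), (64, 256, 512), (64, 512, 1024)]
```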


Figure: Elapsed (wall-clock) time for the weak scaling in the horizontal directions. The labels give the number of OpenMP threads per process.

Note

Times are averaged over 100 time steps.
Only the spherical harmonic transform and the evaluation of the nonlinear terms are parallelized.
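Because only part of each time step is parallelized, Amdahl's law bounds the achievable speedup, which is consistent with the efficiency falling off in the strong-scaling tables. A minimal sketch (the fraction p = 0.9 is an arbitrary illustration, not measured from this code):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup on n cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% of the work parallelized, 16 cores give at most ~6.4x:
print(amdahl_speedup(0.9, 16))
```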
