modules for libraries

module swap mvapich2 impi/4.1.0.030

compile options

F90OPTFLAGS = -O3 -r8 -cpp -openmp -xhost

Notes

At least 3 MPI processes are required.
Each MPI process needs at least 4 radial levels.
Elapsed time is measured by inserting MPI_Wtime() calls in parody.f90.
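
The two constraints above bound the usable process count for a given radial resolution. The following is a minimal sketch of that check, not code from parody.f90: the function name `check_decomposition` is hypothetical, and it assumes the radial levels are split evenly across MPI processes. Under that assumption, $N_{r}=512$ allows at most 128 processes, which matches the largest process count used for that grid in the tables.

```python
MIN_PROCESSES = 3           # "At least 3 MPI processes are required"
MIN_LEVELS_PER_PROCESS = 4  # minimum radial levels per MPI process


def check_decomposition(n_r, n_processes, n_threads=1):
    """Return the total core count for a valid layout, else raise ValueError.

    Assumes (hypothetically) an even split of the n_r radial levels
    over the MPI processes; the real code may distribute them differently.
    """
    if n_processes < MIN_PROCESSES:
        raise ValueError("need at least %d MPI processes" % MIN_PROCESSES)
    if n_r // n_processes < MIN_LEVELS_PER_PROCESS:
        raise ValueError(
            "only %d radial levels per process, need %d"
            % (n_r // n_processes, MIN_LEVELS_PER_PROCESS)
        )
    # Hybrid MPI/OpenMP run: cores = processes x threads per process.
    return n_processes * n_threads


if __name__ == "__main__":
    print(check_decomposition(512, 128, 8))  # 128 processes x 8 threads
```

Scaling beyond the process limit is then only possible by adding OpenMP threads per process, as in the 2048-core run with 128 processes and 16 threads.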

Definition of columns

| Name | Definition |
|---|---|
| # of Cores | Number of CPU cores used |
| # of Processes | Number of MPI processes |
| # of Threads | Number of OpenMP threads per process |
| $l_{max}$ | Truncation level for the spherical harmonics |
| $(N_{r},N_{\theta},N_{\phi})$ | Number of grid points in spherical coordinates |
| Elapsed | Elapsed (wall clock) time for one time step |
| Nonlinear | Elapsed (wall clock) time for evaluating the nonlinear terms |
| Solver | Elapsed (wall clock) time for the linear solver (including communications) |
| Efficiency | Parallel efficiency |
| SUs | Service units for $10^{4}$ time steps (core hours) |
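
The Efficiency and SUs columns can be recomputed from the elapsed times. The sketch below uses formulas inferred from the column definitions rather than stated in the source: elapsed times are taken to be in seconds, SUs are core hours for $10^{4}$ steps, and efficiency is measured relative to a reference run (here the 16-core strong-scaling run). The function names are hypothetical. With these assumptions the 16-core and 32-core strong-scaling rows are reproduced.

```python
def service_units(elapsed, n_cores, n_steps=10_000):
    """Core hours consumed by n_steps time steps.

    elapsed is the wall clock time per step, assumed to be in seconds.
    """
    return elapsed * n_steps * n_cores / 3600.0


def parallel_efficiency(elapsed, n_cores, ref_elapsed, ref_cores):
    """Strong-scaling efficiency relative to a reference run.

    Equals 1.0 when the time per step drops in proportion to the core count.
    """
    return (ref_elapsed * ref_cores) / (elapsed * n_cores)


if __name__ == "__main__":
    # Reference: 16 cores, 12.54290 per step (strong-scaling table).
    print(round(service_units(12.54290, 16), 3))
    # 32 cores (4 processes x 8 threads), 6.805739 per step.
    print(round(parallel_efficiency(6.805739, 32, 12.54290, 16), 6))
```

Because the SU count is proportional to elapsed time multiplied by cores, it rises exactly as the parallel efficiency falls, so the two columns carry the same information in different units.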

Four processors Result

| $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Elapsed | Nonlinear | Solver | SUs |
|---|---|---|---|---|---|
| 47 | (73,72,144) | 0.269091 | 0.257912 | 0.00424973 | 0.747475 |

Strong Scaling Results

| $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ |
|---|---|
| 255 | (512,384,768) |

| # of Cores | # of Processes | # of Threads | Elapsed | Nonlinear | Solver | Efficiency | SUs |
|---|---|---|---|---|---|---|---|
| 16 | 4 | 4 | 12.54290 | 11.87211 | 0.287119 | 1.000000 | 557.462 |
| 32 | 4 | 8 | 6.805739 | 6.191985 | 0.254288 | 0.921494 | 604.955 |
| 32 | 8 | 4 | 6.363801 | 6.005817 | 0.163213 | 0.985488 | 565.671 |
| 64 | 8 | 8 | 3.432315 | 3.104396 | 0.144843 | 0.913589 | 610.189 |
| 64 | 16 | 4 | 3.209374 | 2.992346 | 0.116619 | 0.977052 | 570.555 |
| 128 | 16 | 8 | 1.754511 | 1.551827 | 0.110802 | 0.893618 | 623.826 |
| 128 | 32 | 4 | 1.685379 | 1.503336 | 0.127453 | 0.930273 | 599.246 |
| 128 | 64 | 2 | 1.836404 | 1.561923 | 0.226236 | 0.853768 | 652.944 |
| 128 | 128 | 1 | 2.535049 | 1.993672 | 0.481132 | 0.618474 | 901.351 |
| 256 | 32 | 8 | 0.951109 | 0.779863 | 0.122069 | 0.824229 | 676.344 |
| 256 | 64 | 4 | 0.997783 | 0.755018 | 0.191324 | 0.785673 | 709.535 |
| 256 | 128 | 2 | 1.193223 | 0.779913 | 0.380951 | 0.656986 | 848.514 |
| 512 | 64 | 8 | 0.6191725 | 0.393483 | 0.194016 | 0.633048 | 880.601 |
| 512 | 128 | 4 | 0.736604 | 0.380441 | 0.333829 | 0.532125 | 1047.61 |
| 1024 | 128 | 8 | 0.564325 | 0.199389 | 0.342527 | 0.347287 | 1605.19 |
| 2048 | 128 | 16 | 0.575268 | 0.217482 | 0.336122 | 0.170340 | 3272.64 |


Elapsed (wall clock) time for the strong scaling. The plotted numbers indicate the number of OpenMP threads. The dotted line shows ideal scaling.


Parallel efficiency for the strong scaling. The plotted numbers indicate the number of OpenMP threads.

Weak Scaling Results

| # of Cores | # of Processes | # of Threads | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Elapsed | Nonlinear | Solver | SUs |
|---|---|---|---|---|---|---|---|---|
| 4 | 4 | 1 | 15 | (512,24,48) | 0.03682030 | 0.03474719 | 0.00099214 | 1.63646 |
| 16 | 4 | 4 | 31 | (512,48,96) | 0.05962245 | 0.05093837 | 0.00317335 | 2.64989 |
| 16 | 8 | 2 | 31 | (512,48,96) | 0.05066221 | 0.04486143 | 0.00221673 | 2.25165 |
| 64 | 4 | 16 | 63 | (512,96,192) | 0.2261694 | 0.1915030 | 0.01298851 | 40.2079 |
| 64 | 8 | 8 | 63 | (512,96,192) | 0.0879787 | 0.0693352 | 0.00814595 | 15.6407 |
| 64 | 16 | 4 | 63 | (512,96,192) | 0.0764195 | 0.0639300 | 0.00679417 | 13.5857 |
| 64 | 32 | 2 | 63 | (512,96,192) | 0.0811522 | 0.0682890 | 0.00898229 | 14.4271 |
| 64 | 64 | 1 | 63 | (512,96,192) | 0.0872758 | 0.0677864 | 0.01562347 | 15.5157 |
| 256 | 32 | 8 | 127 | (512,192,384) | 0.151465 | 0.109548 | 0.03015987 | 107.708 |
| 256 | 64 | 4 | 127 | (512,192,384) | 0.158937 | 0.105684 | 0.04617718 | 113.022 |
| 256 | 128 | 2 | 127 | (512,192,384) | 0.194962 | 0.103681 | 0.08474707 | 138.640 |
| 1024 | 128 | 8 | 255 | (512,384,768) | 0.564325 | 0.199389 | 0.342527 | 1605.19 |


Elapsed (wall clock) time for the weak scaling in the horizontal resolution. The plotted numbers indicate the number of OpenMP threads. The dotted line shows the ideal scaling for the Legendre transform, $O(N_{core}^{1/2})$.
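
The ideal $O(N_{core}^{1/2})$ curve can be sketched by scaling a reference run with the square root of the core ratio. Anchoring the curve at the 4-core run is an assumption made here for illustration; the original plot may use a different reference point, and the function name `ideal_sqrt_scaling` is hypothetical.

```python
import math


def ideal_sqrt_scaling(ref_elapsed, ref_cores, n_cores):
    """Elapsed time expected if time per step grows as N_core^(1/2)."""
    return ref_elapsed * math.sqrt(n_cores / ref_cores)


if __name__ == "__main__":
    # Assumed reference: 4 cores, 0.03682030 per step (weak-scaling table).
    ref_elapsed, ref_cores = 0.03682030, 4
    for n_cores in (4, 16, 64, 256, 1024):
        ideal = ideal_sqrt_scaling(ref_elapsed, ref_cores, n_cores)
        print(n_cores, round(ideal, 4))
```

Comparing these values with the Elapsed column shows how far each measured run lies from the ideal line.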

| # of Cores | # of Processes | # of Threads | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Elapsed | Nonlinear | Solver | SUs |
|---|---|---|---|---|---|---|---|---|
| 64 | 8 | 8 | 255 | (32,384,768) | 0.23631219 | 0.196183 | 0.0222231 | 42.0111 |
| 128 | 16 | 8 | 255 | (64,384,768) | 0.2575840 | 0.196314 | 0.043000 | 91.5854 |
| 256 | 32 | 8 | 255 | (128,384,768) | 0.2998254 | 0.196222 | 0.084590 | 213.209 |
| 512 | 64 | 8 | 255 | (256,384,768) | 0.3866510 | 0.197542 | 0.169289 | 549.904 |
| 1024 | 128 | 8 | 255 | (512,384,768) | 0.564325 | 0.199389 | 0.342527 | 1605.19 |
| 2048 | 128 | 16 | 255 | (1024,384,768) | 0.82399 | 0.427062 | 0.368039 | 4687.59 |
| 2048 | 256 | 8 | 255 | (1024,384,768) | 0.903354 | 0.203544 | 0.675003 | 5139.08 |


Elapsed (wall clock) time for the weak scaling in the radial resolution. Results with 8 OpenMP threads are shown. The dotted line shows the ideal scaling for the Legendre transform, $O(N_{core}^{1/2})$.
