Back to performance benchmark lists

Modules for libraries

module swap mvapich2 impi/4.1.0.030
module load phdf5 netcdf fftw3

Compile options

F90OPTFLAGS = -cpp -c -O3 -ip -ipo -xhost

Notes

The nonlinear terms (spherical transform) are evaluated twice per time step.
Elapsed time is measured by inserting MPI_Wtime() calls in main.f90.
Reported times are averages over 100 time steps.

Definition of columns

Name: Description
# of Cores: Number of CPU cores used
# of Processes: Number of MPI processes
# of PE for $r$: Number of MPI processes in the radial direction
$l_{max}$: Truncation level of the spherical harmonics
$(N_{r},N_{\theta},N_{\phi})$: Number of grid points in spherical coordinates
Elapsed: Elapsed (wall clock) time in seconds for one time step
Nonlinear: Elapsed (wall clock) time for evaluation of the nonlinear terms
Solver: Elapsed (wall clock) time for the linear solver (including communications)
Efficiency: Parallel efficiency
SUs: Service units for $10^{4}$ time steps (core hours)
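
The Efficiency and SUs columns can be reproduced from the Elapsed column. The following is a minimal Python sketch of that arithmetic, not code taken from the benchmark itself. The function names are hypothetical, and the 16-core node size used for billing is an assumption inferred from the tables: the single-core SUs value equals 16 times the single-core core hours, which suggests that whole nodes are charged.

```python
def parallel_efficiency(t_ref, cores_ref, t, cores):
    """Strong-scaling efficiency relative to a reference run.

    t_ref, t: elapsed seconds per time step; cores_ref, cores: core counts.
    """
    return (t_ref * cores_ref) / (t * cores)


def service_units(elapsed_per_step, cores, steps=10_000, cores_per_node=16):
    """Core hours for `steps` time steps.

    ASSUMPTION: jobs are billed in whole nodes of `cores_per_node` cores.
    This is inferred from the tables, not stated in the source.
    """
    nodes = -(-cores // cores_per_node)  # ceiling division
    billed_cores = nodes * cores_per_node
    return elapsed_per_step * steps * billed_cores / 3600.0


if __name__ == "__main__":
    # l_max = 48 case: 1 core as reference, 2 cores as the test run.
    print(parallel_efficiency(0.774053, 1, 0.385259, 2))
    # l_max = 256 case on 64 cores.
    print(service_units(15.1231, 64))
```

For core counts that are multiples of 16, the billing assumption has no effect and the result is simply elapsed time times cores times $10^{4}$ steps, divided by 3600.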

Single Processor Result

$l_{max}$ $(N_{r},N_{\theta},N_{\phi})$ Elapsed Nonlinear Solver SUs
48 (73,72,144) 0.774053 0.647997 0.109058 34.4023

Strong Scaling Results

$l_{max}$ $(N_{r},N_{\theta},N_{\phi})$
48 (73,72,144)
# Cores # Processes # of PE for $r$ Elapsed Nonlinear Solver Efficiency SUs
1 1 1 0.774053 0.647997 0.109058 1.00 34.4023
2 2 2 0.385259 0.320167 0.0551415 1.00459 17.1226
4 4 2 0.209187 0.174623 0.0281849 0.925072 9.2972
8 8 4 0.119906 0.100789 0.0145561 0.806939 5.32914
16 16 4 0.0788227 0.0672625 0.00788815 0.613761 3.50323
$l_{max}$ $(N_{r},N_{\theta},N_{\phi})$
256 (512,384,768)
# Cores # Processes # of PE for $r$ Elapsed Nonlinear Solver Efficiency SUs
64 64 8 15.1231 14.6107 0.341912 1.00 2688.55
128 128 8 7.7779 7.46396 0.174534 0.972183 2765.47
256 256 16 4.08026 3.85867 0.0918966 0.926600 2901.52
512 512 16 2.22781 2.05299 0.048329 0.848538 3168.45
1024 1024 32 1.3467 1.17748 0.0416313 0.70186 3830.60
2048 2048 32 0.852113 0.689238 0.0302918 0.554617 4847.58
4096 4096 32 0.374556 0.249321 0.0101704 0.630876 4261.62


Elapsed (wall clock) time for the strong scaling in the $l_{max} = 256$, $(N_{r}, N_{\theta}, N_{\phi}) = (512,384,768)$ case. The numbers indicate the number of subdomains in the radial direction. Ideal scaling is plotted as a dotted line.


Parallel efficiency for the strong scaling in the $l_{max} = 256$, $(N_{r}, N_{\theta}, N_{\phi}) = (512,384,768)$ case. The numbers indicate the number of subdomains in the radial direction.
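
The ideal strong-scaling line can be tabulated from the 64-core reference run. The sketch below assumes that the ideal line is the reference time scaled inversely with the core count. The function name is hypothetical, and the measured values are copied from the $l_{max} = 256$ table.

```python
def ideal_strong_scaling(t_ref, cores_ref, core_counts):
    """Ideal elapsed time per step, assuming time scales as 1 / cores."""
    return {n: t_ref * cores_ref / n for n in core_counts}


# Measured elapsed seconds per time step, l_max = 256, (512,384,768).
measured = {
    64: 15.1231,
    128: 7.7779,
    256: 4.08026,
    512: 2.22781,
    1024: 1.3467,
    2048: 0.852113,
    4096: 0.374556,
}

ideal = ideal_strong_scaling(measured[64], 64, measured)

if __name__ == "__main__":
    for cores, t in measured.items():
        # The ratio ideal / measured is the parallel efficiency.
        print(cores, t, ideal[cores], ideal[cores] / t)
```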

Weak Scaling Results

# Cores # Processes # of PE for $r$ $l_{max}$ $(N_{r},N_{\theta},N_{\phi})$ Elapsed Nonlinear Solver SUs
64 64 32 32 (512,48,96) 0.0775688 0.0565 0.00624115 13.79
256 256 32 64 (512,96,192) 0.142538 0.106911 0.00610947 101.361
1024 1024 32 128 (512,192,384) 0.203346 0.137757 0.00715795 578.407
4096 4096 32 256 (512,384,768) 0.374556 0.249321 0.0101704 4261.62


Elapsed (wall clock) time for the weak scaling in the horizontal resolution. The number of processes in the radial direction is fixed at 32. Ideal scaling for the Legendre transform ($O(l_{max}^{3})$) is plotted as a dotted line.
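
One possible reading of the ideal line is that the Legendre transform work grows as $l_{max}^{3}$ and is divided evenly among the cores. The sketch below tabulates that estimate against the measured times. It rests on this assumption and may not match how the plotted line was actually generated; the function name is hypothetical.

```python
def ideal_legendre_time(t_ref, lmax_ref, cores_ref, lmax, cores):
    """Ideal elapsed time, ASSUMING work ~ l_max**3 shared evenly by cores."""
    return t_ref * (lmax / lmax_ref) ** 3 * (cores_ref / cores)


# (cores, l_max, measured elapsed seconds per step) from the table above.
rows = [
    (64, 32, 0.0775688),
    (256, 64, 0.142538),
    (1024, 128, 0.203346),
    (4096, 256, 0.374556),
]

cores_ref, lmax_ref, t_ref = rows[0]
ideal_times = [
    ideal_legendre_time(t_ref, lmax_ref, cores_ref, lmax, cores)
    for cores, lmax, _ in rows
]

if __name__ == "__main__":
    for (cores, lmax, t), t_ideal in zip(rows, ideal_times):
        print(cores, lmax, t, t_ideal)
```

Under this assumption the core count grows by a factor of 4 each time $l_{max}$ doubles, so the ideal time per step doubles from one row to the next.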

# Cores # Processes # of PE for $r$ $l_{max}$ $(N_{r},N_{\theta},N_{\phi})$ Elapsed Nonlinear Solver SUs
256 256 2 255 (32,384,768) 0.221005 0.206752 0.0071895 157.159
512 512 4 255 (64,384,768) 0.231582 0.211148 0.0067766 329.361
1024 1024 8 255 (128,384,768) 0.268191 0.231129 0.0070692 762.854
2048 2048 16 255 (256,384,768) 0.45460 0.374649 0.0152046 2586.17
4096 4096 32 255 (512,384,768) 0.374556 0.249321 0.0101704 4261.62


Elapsed (wall clock) time for the weak scaling in the radial resolution. The number of processes for the spherical harmonics is fixed at 128. Ideal scaling is plotted as a dotted line.