
Modules for libraries

module swap mvapich2 impi/4.1.0.030

(NetCDF 4.3.2 is compiled locally without HDF5 features)

Compile options

-openmp -O3 -xAVX -align array32byte
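As a sketch, these Intel Fortran flags might enter a build file as follows; the `mpiifort` wrapper and the source-file pattern are assumptions, not taken from the benchmark's actual build system:

```makefile
# Hypothetical Makefile fragment -- compiler wrapper and file pattern are assumptions.
FC      = mpiifort
FCFLAGS = -openmp -O3 -xAVX -align array32byte

%.o: %.f90
	$(FC) $(FCFLAGS) -c $< -o $@
```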

Restrictions

The number of zonal grid points must be a power of 2.
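The power-of-2 restriction can be checked with a standard bit trick; this is a small illustrative snippet, not part of the benchmark code:

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two iff exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0

# The zonal grid sizes (N_phi) used in the tables all satisfy the restriction:
print([is_power_of_two(n) for n in (128, 256, 512, 1024)])
# -> [True, True, True, True]
```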

Definition of columns

| Name | Description |
|---|---|
| # of Cores | Number of CPU cores used |
| # of Processes | Number of MPI processes |
| # of Threads | Number of OpenMP threads per process |
| $N_{c}$ | Truncation level for the Chebyshev polynomials |
| $l_{max}$ | Truncation level for the spherical harmonics |
| $(N_{r},N_{\theta},N_{\phi})$ | Number of grid points in spherical coordinates |
| Elapsed | Elapsed (wall-clock) time for one time step |
| Nonlinear | Elapsed (wall-clock) time for the evaluation of the nonlinear terms |
| Solver | Elapsed (wall-clock) time for the linear solver (including communications) |
| Efficiency | Parallel efficiency |
| SUs | Service units for $10^{4}$ time steps (core hours) |
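The Efficiency column follows the usual strong-scaling definition $E(N) = T(1) / (N \cdot T(N))$. A minimal sketch (function names are mine; the SU helper assumes a plain core-hours charge, which does not necessarily reproduce the site's actual charging policy or the tabulated SU values):

```python
def parallel_efficiency(t1: float, tn: float, n: int) -> float:
    """Strong-scaling parallel efficiency E(N) = T(1) / (N * T(N))."""
    return t1 / (n * tn)

def core_hours(cores: int, seconds_per_step: float, steps: int = 10**4) -> float:
    """Core hours for `steps` time steps, assuming a plain core-hours charge."""
    return cores * seconds_per_step * steps / 3600.0

# Elapsed times from the first strong-scaling table (1 core vs. 4 processes x 1 thread):
print(round(parallel_efficiency(2.12702, 1.52251, 4), 6))  # close to the tabulated 0.349261
```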

Single Processor Result

| $N_{c}$ | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Elapsed | Nonlinear | Solver |
|---|---|---|---|---|---|
| 47 | 48 | (48,64,128) | 0.445464 | 0.233546 | 0.183094 |

Strong Scaling Results

In this test, the spatial resolution is fixed while the parallelization is varied. Under ideal scaling, the elapsed time is inversely proportional to the number of cores.

$N_{c} = 48$, $l_{max} = 85$, $(N_{r},N_{\theta},N_{\phi}) = (48,128,256)$

| # of Cores | # of Processes | # of Threads | Elapsed | Nonlinear | Solver | Efficiency | SUs |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 2.12702 | 1.13331 | 0.837597 | 1.0 | 15.8769 |
| 2 | 2 | 1 | 1.53197 | 0.771775 | 0.666242 | 0.694213 | 8.32400 |
| 4 | 2 | 2 | 1.41881 | 0.677849 | 0.657223 | 0.374789 | 4.57244 |
| 8 | 2 | 4 | 1.29347 | 0.644326 | 0.586741 | 0.205553 | 2.57067 |
| 16 | 2 | 8 | 1.20088 | 0.577998 | 0.562982 | 0.110701 | 2.57067 |
| 32 | 2 | 16 | 4.45394 | 1.40103 | 2.96418 | 0.0149237 | 2.57067 |
| 4 | 4 | 1 | 1.52251 | 0.61794 | 0.838045 | 0.349261 | 8.32400 |
| 8 | 4 | 2 | 1.308 | 0.573778 | 0.669746 | 0.20327 | 4.57244 |
| 16 | 4 | 4 | 1.19526 | 0.551808 | 0.592631 | 0.111222 | 2.57067 |
| 32 | 4 | 8 | 1.11331 | 0.503031 | 0.560936 | 0.0597045 | 2.57067 |
| 64 | 4 | 16 | 4.8857 | 1.77167 | 3.03814 | 0.00680244 | 2.57067 |
| 8 | 8 | 1 | 1.65644 | 0.681461 | 0.913425 | 0.160511 | 8.32400 |
| 16 | 8 | 2 | 1.46565 | 0.661892 | 0.743852 | 0.0907029 | 4.57244 |
| 32 | 8 | 4 | 1.14545 | 0.505259 | 0.596385 | 0.0580289 | 2.57067 |
| 64 | 8 | 8 | 1.06662 | 0.465445 | 0.562054 | 0.031159 | 2.57067 |
| 16 | 16 | 1 | 3.08775 | 1.58078 | 1.37636 | 0.0430535 | 8.32400 |
| 32 | 16 | 2 | 1.45392 | 0.658729 | 0.74527 | 0.0457173 | 4.57244 |
| 64 | 16 | 4 | 1.13421 | 0.497914 | 0.593822 | 0.029302 | 2.57067 |
| 128 | 16 | 8 | 1.06704 | 0.467572 | 0.561543 | 0.0155732 | 2.57067 |
$N_{c} = 97$, $l_{max} = 170$, $(N_{r},N_{\theta},N_{\phi}) = (145,256,512)$

| # of Cores | # of Processes | # of Threads | Elapsed | Nonlinear | Solver | Efficiency | SUs |
|---|---|---|---|---|---|---|---|
| 8 | 1 | 8 | 11.3106 | 4.97222 | 5.77018 | 1.0 | 814.222 |
| 16 | 2 | 8 | 9.19364 | 3.65295 | 5.16199 | 1.05368 | 386.372 |
| 32 | 4 | 8 | 8.79569 | 3.32431 | 5.15045 | 1.01463 | 200.621 |
| 64 | 8 | 8 | 8.66073 | 3.21434 | 5.16300 | 0.904766 | 112.491 |
| 128 | 16 | 8 | 8.25554 | 2.83906 | 5.17103 | 0.528483 | 96.2924 |
| 256 | 32 | 8 | 8.50060 | 3.08209 | 5.17536 | 0.528483 | 96.2924 |


Figure: Elapsed (wall-clock) time for the strong scaling. The labels give the number of OpenMP threads per process; the dotted line shows ideal scaling.


Figure: Parallel efficiency for the strong scaling. The labels give the number of OpenMP threads per process.

Weak Scaling Results

Weak Scaling in horizontal direction

In this benchmark, the radial resolution is fixed while the horizontal resolution grows with the number of cores.

| # of Cores | # of Processes | # of Threads | $N_{c}$ | $l_{max}$ | $(N_{r},N_{\theta},N_{\phi})$ | Nonlinear | Solver | SUs |
|---|---|---|---|---|---|---|---|---|
| 4 | 1 | 4 | 64 | 42 | (64,64,128) | 0.307596 | 0.296679 | 13.6710 |
| 16 | 2 | 8 | 64 | 85 | (64,128,256) | 0.773182 | 1.00419 | 34.3636 |
| 64 | 8 | 8 | 64 | 170 | (64,256,512) | 3.21434 | 5.16300 | 142.860 |
| 256 | 16 | 8 | 64 | 341 | (64,512,1024) | 39.123 | 37.1453 | 1738.80 |
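The resolution sequence in the table above follows a simple pattern: each row quadruples the core count and doubles both horizontal grid dimensions, while $N_{r}$ stays fixed at 64. A small illustrative snippet (the function name is mine):

```python
def horizontal_resolutions(levels: int, nr: int = 64):
    """Grids for horizontal weak scaling: N_theta and N_phi double at each level."""
    return [(nr, 64 * 2**k, 128 * 2**k) for k in range(levels)]

print(horizontal_resolutions(4))
# -> [(64, 64, 128), (64, 128, 256), (64, 256, 512), (64, 512, 1024)]
```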


Figure: Elapsed (wall-clock) time for the weak scaling in the horizontal directions. The labels give the number of OpenMP threads per process.

Note

Times are averaged over 100 time steps.
Only the spherical harmonic transform and the evaluation of the nonlinear terms are parallelized.
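Because only part of each time step is parallelized, Amdahl's law bounds the achievable speedup, which is consistent with the efficiency falling off in the strong-scaling tables. A minimal sketch (the fraction p = 0.9 is an arbitrary illustration, not measured from this code):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup on n cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% of the work parallelized, 16 cores give at most ~6.4x:
print(amdahl_speedup(0.9, 16))
```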
