wg:dynamo:performance

modules for libraries

module swap mvapich2 impi/4.1.0.030

compile options

F90OPTFLAGS = -O3 -r8 -cpp -openmp -xhost

Notes

At least 3 MPI processes are required.
Each MPI process needs at least 4 radial levels.
Elapsed time is measured by inserting MPI_wtime() calls in parody.f90.
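
The two decomposition constraints in these notes can be checked before a job is submitted. The following is a minimal sketch, assuming the radial levels are split evenly across MPI processes (the source does not say how a remainder is handled). The function names are hypothetical and are not part of the benchmarked code.

```python
MIN_PROCESSES = 3           # from the notes: at least 3 MPI processes
MIN_LEVELS_PER_PROCESS = 4  # from the notes: at least 4 radial levels each


def max_mpi_processes(n_r):
    """Largest process count that still leaves 4 radial levels per process.

    Assumes the n_r radial levels are divided evenly (an assumption).
    """
    return n_r // MIN_LEVELS_PER_PROCESS


def is_valid_decomposition(n_r, n_processes):
    """Check both constraints from the notes for a given run."""
    if n_processes < MIN_PROCESSES:
        return False
    return n_r // n_processes >= MIN_LEVELS_PER_PROCESS
```

Under this assumption, $N_{r} = 512$ allows at most 128 MPI processes, which is also the largest process count used for that resolution in the scaling tables.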

Definition of columns

Name	Definition
# Cores	Number of CPU cores used
# Processes	Number of MPI processes
# Threads	Number of threads per process
$l_{max}$	Truncation level for spherical harmonics
$(N_{r},N_{\theta},N_{\phi})$	Number of grid points in spherical coordinates
Elapsed	Elapsed (wall clock) time for one time step
Nonlinear	Elapsed (wall clock) time for evaluation of the nonlinear terms
Solver	Elapsed (wall clock) time for the linear solver (including communications)
Efficiency	Parallel efficiency
SUs	Service units (core hours) for $10^{4}$ time steps
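
The Efficiency and SUs columns can be related to the Elapsed column. The sketch below is an inference from the tabulated values, not a formula stated in the source: it assumes SUs are elapsed seconds per step times $10^{4}$ steps times the core count, converted to hours, and that efficiency is measured relative to the 16-core strong scaling run. The function names are hypothetical.

```python
STEPS = 10**4            # SUs are quoted for 10^4 time steps
SECONDS_PER_HOUR = 3600


def service_units(elapsed, n_cores, steps=STEPS):
    """Core hours for `steps` time steps at `elapsed` seconds per step."""
    return elapsed * steps * n_cores / SECONDS_PER_HOUR


def parallel_efficiency(elapsed, n_cores, ref_elapsed, ref_cores):
    """Strong scaling efficiency relative to a reference run."""
    return (ref_elapsed * ref_cores) / (elapsed * n_cores)


# Strong scaling rows: 16 cores (reference) and 32 cores with 8 threads.
print(round(service_units(12.54290, 16), 3))                       # 557.462
print(round(parallel_efficiency(6.805739, 32, 12.54290, 16), 6))   # 0.921494
```

These two values agree with the corresponding strong scaling rows. Not every SUs entry follows from the listed core count in this way (the 4-core rows do not), so the formulas should be read as a consistency check rather than the definition used to build the tables.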

Four-Processor Result

$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$	Elapsed	Nonlinear	Solver	SUs
47	( 73,72,144)	0.269091	0.257912	0.00424973	0.747475

Strong Scaling Results

$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$
255	(512,384,768)

# Cores	# Processes	# Threads	Elapsed	Nonlinear	Solver	Efficiency	SUs
16	4	4	12.54290	11.87211	0.287119	1.000000	557.462
32	4	8	6.805739	6.191985	0.254288	0.921494	604.955
32	8	4	6.363801	6.005817	0.163213	0.985488	565.671
64	8	8	3.432315	3.104396	0.144843	0.913589	610.189
64	16	4	3.209374	2.992346	0.116619	0.977052	570.555
128	16	8	1.754511	1.551827	0.110802	0.893618	623.826
128	32	4	1.685379	1.503336	0.127453	0.930273	599.246
128	64	2	1.836404	1.561923	0.226236	0.853768	652.944
128	128	1	2.535049	1.993672	0.481132	0.618474	901.351
256	32	8	0.951109	0.779863	0.122069	0.824229	676.344
256	64	4	0.997783	0.755018	0.191324	0.785673	709.535
256	128	2	1.193223	0.779913	0.380951	0.656986	848.514
512	64	8	0.6191725	0.393483	0.194016	0.633048	880.601
512	128	4	0.736604	0.380441	0.333829	0.532125	1047.61
1024	128	8	0.564325	0.199389	0.342527	0.347287	1605.19
2048	128	16	0.575268	0.217482	0.336122	0.170340	3272.64

Elapsed (wall clock) time for the strong scaling. The numbers indicate the number of OpenMP threads. Ideal scaling is plotted as a dotted line.

Parallel efficiency for the strong scaling. The numbers indicate the number of OpenMP threads.

Weak Scaling Results

# Cores	# Processes	# Threads	$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$	Elapsed	Nonlinear	Solver	SUs
4	4	1	15	(512,24,48)	0.03682030	0.03474719	0.00099214	1.63646
16	4	4	31	(512,48,96)	0.05962245	0.05093837	0.00317335	2.64989
16	8	2	31	(512,48,96)	0.05066221	0.04486143	0.00221673	2.25165
64	4	16	63	(512,96,192)	0.2261694	0.1915030	0.01298851	40.2079
64	8	8	63	(512,96,192)	0.0879787	0.0693352	0.00814595	15.6407
64	16	4	63	(512,96,192)	0.0764195	0.0639300	0.00679417	13.5857
64	32	2	63	(512,96,192)	0.0811522	0.0682890	0.00898229	14.4271
64	64	1	63	(512,96,192)	0.0872758	0.0677864	0.01562347	15.5157
256	32	8	127	(512,192,384)	0.151465	0.109548	0.03015987	107.708
256	64	4	127	(512,192,384)	0.158937	0.105684	0.04617718	113.022
256	128	2	127	(512,192,384)	0.194962	0.103681	0.08474707	138.640
1024	128	8	255	(512,384,768)	0.564325	0.199389	0.342527	1605.19

Elapsed (wall clock) time for the weak scaling in the horizontal resolution. The numbers indicate the number of OpenMP threads. Ideal scaling for the Legendre transform ($O(N_{core}^{1/2})$) is plotted as a dotted line.
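
The $N_{core}^{1/2}$ reference curve can be reproduced numerically. In this series $l_{max}+1$ doubles each time the core count quadruples, so $l_{max}+1$ grows as $N_{core}^{1/2}$; if the Legendre transform cost per radial level grows as $l_{max}^{3}$ while the core count grows as $l_{max}^{2}$, the time per step grows as $N_{core}^{1/2}$. The sketch below assumes the curve is anchored at the 4-core run, which the source does not state, and the function name is hypothetical.

```python
import math


def ideal_weak_elapsed(n_cores, ref_elapsed, ref_cores):
    """Ideal O(N_core^(1/2)) elapsed time, anchored at a reference run."""
    return ref_elapsed * math.sqrt(n_cores / ref_cores)


# Anchor at the 4-core row of the table (an assumption about the plot).
REF_CORES, REF_ELAPSED = 4, 0.03682030

for n_cores in (4, 16, 64, 256, 1024):
    ideal = ideal_weak_elapsed(n_cores, REF_ELAPSED, REF_CORES)
    print(f"{n_cores:5d} cores: ideal {ideal:.6f} s per step")
```

With this anchor the ideal value at 1024 cores is about 0.589 s per step, close to the measured 0.564 s.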

# Cores	# Processes	# Threads	$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$	Elapsed	Nonlinear	Solver	SUs
64	8	8	255	(32,384,768)	0.23631219	0.196183	0.0222231	42.0111
128	16	8	255	(64,384,768)	0.2575840	0.196314	0.043000	91.5854
256	32	8	255	(128,384,768)	0.2998254	0.196222	0.084590	213.209
512	64	8	255	(256,384,768)	0.3866510	0.197542	0.169289	549.904
1024	128	8	255	(512,384,768)	0.564325	0.199389	0.342527	1605.19
2048	128	16	255	(1024,384,768)	0.82399	0.427062	0.368039	4687.59
2048	256	8	255	(1024,384,768)	0.903354	0.203544	0.675003	5139.08

Elapsed (wall clock) time for the weak scaling in the radial resolution. Only the results with 8 OpenMP threads are shown. Ideal scaling for the Legendre transform ($O(N_{core}^{1/2})$) is plotted as a dotted line.
