wg:dynamo:performance

modules for libraries

module swap mvapich2 impi/4.1.0.030
module load phdf5 netcdf fftw3

compile options

F90OPTFLAGS = -cpp -c -O3 -ip -ipo -xhost

Notes

Nonlinear terms (spherical transforms) are evaluated twice per time step.
Elapsed time is measured by inserting MPI_Wtime() calls in main.f90.
Reported times are averages over 100 time steps.
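
The averaging described in these notes can be sketched as follows. This is a Python illustration only, not the code in main.f90: time.perf_counter() stands in for MPI_Wtime(), and `step` is a hypothetical placeholder for one time step of the solver.

```python
import time


def average_step_time(step, n_steps=100):
    """Return the mean wall clock time per call of `step` over n_steps calls."""
    start = time.perf_counter()  # stand-in for MPI_Wtime() before the loop
    for _ in range(n_steps):
        step()
    end = time.perf_counter()    # stand-in for MPI_Wtime() after the loop
    return (end - start) / n_steps


if __name__ == "__main__":
    # Hypothetical workload standing in for one time step.
    print(average_step_time(lambda: sum(i * i for i in range(1000))))
```

Timing the whole loop once and dividing by the step count keeps the timer overhead out of the per-step figure; whether main.f90 times the loop as a whole or each step separately is not stated here.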

Definition of columns

Name	Definition
# of Cores	Number of used CPU cores
# of Processes	Number of MPI processes
# of PE for $r$	Number of MPI processes in the radial direction
$l_{max}$	Truncation level for spherical harmonics
$(N_{r},N_{\theta},N_{\phi})$	Number of grid points in spherical coordinates
Elapsed	Elapsed (wall clock) time for one time step
Nonlinear	Elapsed (wall clock) time for evaluation of nonlinear terms
Solver	Elapsed (wall clock) time for the linear solver (including communications)
Efficiency	Parallel efficiency
SUs	Service units for $10^{4}$ time steps (core hours)
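
The Efficiency and SUs columns can be reproduced from the Elapsed column with the sketch below. The formulas are inferred from the tabulated values rather than stated on this page, so treat them as assumptions: times are taken to be in seconds, efficiency is measured against the smallest run in each table, and SUs are charged per whole node of 16 cores (the single-core SUs value only matches under that assumption). The function names are hypothetical.

```python
def parallel_efficiency(t_ref, cores_ref, t, cores):
    """Strong-scaling efficiency relative to a reference run."""
    return (t_ref * cores_ref) / (t * cores)


def service_units(elapsed, cores, n_steps=10_000, cores_per_node=16):
    """Core hours for n_steps time steps, charged per whole node.

    cores_per_node=16 is an assumption inferred from the tables.
    """
    nodes = -(-cores // cores_per_node)      # ceiling division
    charged_cores = nodes * cores_per_node
    return elapsed * n_steps * charged_cores / 3600.0


if __name__ == "__main__":
    # Values taken from the strong scaling tables on this page.
    print(parallel_efficiency(15.1231, 64, 7.7779, 128))
    print(service_units(0.774053, 1))
    print(service_units(15.1231, 64))
```

With these assumptions the printed values agree with the tabulated 0.972183, 34.4023 and 2688.55 to within rounding.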

Single Processor Result

$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$	Elapsed	Nonlinear	Solver	SUs
48	(73,72,144)	0.774053	0.647997	0.109058	34.4023

Strong Scaling Results

$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$
48	(73,72,144)

# of Cores	# of Processes	# of PE for $r$	Elapsed	Nonlinear	Solver	Efficiency	SUs
1	1	1	0.774053	0.647997	0.109058	1.00	34.4023
2	2	2	0.385259	0.320167	0.0551415	1.00459	17.1226
4	4	2	0.209187	0.174623	0.0281849	0.925072	9.2972
8	8	4	0.119906	0.100789	0.0145561	0.806939	5.32914
16	16	4	0.0788227	0.0672625	0.00788815	0.613761	3.50323

$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$
256	(512,384,768)

# of Cores	# of Processes	# of PE for $r$	Elapsed	Nonlinear	Solver	Efficiency	SUs
64	64	8	15.1231	14.6107	0.341912	1.00	2688.55
128	128	8	7.7779	7.46396	0.174534	0.972183	2765.47
256	256	16	4.08026	3.85867	0.0918966	0.926600	2901.52
512	512	16	2.22781	2.05299	0.048329	0.848538	3168.45
1024	1024	32	1.3467	1.17748	0.0416313	0.70186	3830.60
2048	2048	32	0.852113	0.689238	0.0302918	0.554617	4847.58
4096	4096	32	0.374556	0.249321	0.0101704	0.501042	4261.62

Elapsed (wall clock) time for strong scaling in the $l_{max} = 256$, $(N_{r}, N_{\theta}, N_{\phi}) = (512,384,768)$ case. The numbers show the number of subdomains in the radial direction. Ideal scaling is plotted as a dotted line.

Parallel efficiency for strong scaling in the $l_{max} = 256$, $(N_{r}, N_{\theta}, N_{\phi}) = (512,384,768)$ case. The numbers show the number of subdomains in the radial direction.

Weak Scaling Results

# of Cores	# of Processes	# of PE for $r$	$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$	Elapsed	Nonlinear	Solver	SUs
64	64	32	32	(512,48,96)	0.0775688	0.0565	0.00624115	13.79
256	64	32	64	(512,96,192)	0.142538	0.106911	0.00610947	101.361
1024	1024	32	128	(512,192,384)	0.203346	0.137757	0.00715795	578.407
4096	4096	32	256	(512,384,768)	0.374556	0.249321	0.0101704	4261.62

Elapsed (wall clock) time for weak scaling in the horizontal resolution. The number of processes in the radial direction is fixed at 32. Ideal scaling for the Legendre transform ($O(l_{max}^{3})$) is plotted as dotted lines.
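
One possible reading of the $O(l_{max}^{3})$ reference line is sketched below: the work grows as $l_{max}^{3}$ while the core count grows by a factor of four per doubling of $l_{max}$, so the ideal time per step grows linearly with $l_{max}$. Whether the dotted line in the figure was built exactly this way is an assumption, and the function name is hypothetical.

```python
def ideal_weak_scaling_time(t_ref, lmax_ref, cores_ref, lmax, cores):
    """Ideal elapsed time if work scales as lmax**3 and is split evenly over cores."""
    work_ratio = (lmax / lmax_ref) ** 3
    core_ratio = cores / cores_ref
    return t_ref * work_ratio / core_ratio


if __name__ == "__main__":
    # (cores, lmax, measured elapsed time) from the weak scaling table above.
    rows = [(64, 32, 0.0775688), (256, 64, 0.142538),
            (1024, 128, 0.203346), (4096, 256, 0.374556)]
    ref_cores, ref_lmax, ref_time = rows[0]
    for cores, lmax, measured in rows:
        ideal = ideal_weak_scaling_time(ref_time, ref_lmax, ref_cores, lmax, cores)
        print(cores, lmax, measured, ideal)
```

Under this reading the measured times stay below the $O(l_{max}^{3})$ estimate at every resolution in the table.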

# of Cores	# of Processes	# of PE for $r$	$l_{max}$	$(N_{r},N_{\theta},N_{\phi})$	Elapsed	Nonlinear	Solver	SUs
256	256	2	255	(32,384,768)	0.221005	0.206752	0.0071895	157.159
512	512	4	255	(64,384,768)	0.231582	0.211148	0.0067766	329.361
1024	1024	8	255	(128,384,768)	0.268191	0.231129	0.0070692	762.854
2048	2048	16	255	(256,384,768)	0.45460	0.374649	0.0152046	2586.17
4096	4096	32	255	(512,384,768)	0.374556	0.249321	0.0101704	4261.62

Elapsed (wall clock) time for weak scaling in the radial resolution. The number of processes for the spherical harmonics is fixed at 128. Ideal scaling is plotted as dotted lines.
