Computational Infrastructure for Geodynamics Wiki

Modules

shtns: gcc/4.7.1 mkl/13.0.2.146

xshells: intel/13.0.2.146 impi/4.1.0.030

FFTs use the FFTW wrappers provided by MKL.

Flags

shtns: ./configure --enable-mkl ; make

xshells: mpiicpc -mt_mpi -O3 -march=native -xHost -complex-limited-range -ipo -prec-div -prec-sqrt -DXS_MKL -DXS_VEC=0 -DXS_MPI -fopenmp -Wunknown-pragmas -lshtns -mkl -lrt -lm -o xsbig_hyb

Single Node, Strong Scaling

Run xsbig for 200 iterations, setting the number of threads with the OMP_NUM_THREADS environment variable. All times are in seconds.
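A minimal bash sketch of the sweep, assuming the `xsbig_hyb` binary from the compile line above and that the 200-iteration run length is set in the code's parameter file rather than on the command line. The function name `sweep_threads` and the log-file names are hypothetical.

```bash
# Hypothetical helper: run a command once per OpenMP thread count.
# Usage: sweep_threads "1 2 4 8 16 32 64" ./xsbig_hyb
sweep_threads() {
  local threads=$1; shift
  local t start
  for t in $threads; do
    start=$SECONDS
    # OMP_NUM_THREADS is set only for this one invocation
    OMP_NUM_THREADS=$t "$@" > "xsbig_omp${t}.log"
    # SECONDS has 1 s resolution, so this is only a rough wall time
    printf 'threads=%s wall=%ss\n' "$t" $(( SECONDS - start ))
  done
}
```

For example, `sweep_threads "1 2 4 8 16 32 64" ./xsbig_hyb` covers the thread counts in the table. Setting `OMP_NUM_THREADS` as a per-command prefix keeps it from leaking into later commands.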

Cores, Processes, Threads			Problem Description		Timing (seconds)				Metrics
C	P	T	$ l_{max} $	$ (N_r,N_{\theta},N_{\phi}) $	Total	Solver	Nonlinear	Comm	Efficiency	SUs per $10^4$ Iters	Hours per $10^4$ Iters
1	1	1	47	(73,72,144)				0
2	1	2	47	(73,72,144)				0
4	1	4	47	(73,72,144)				0
8	1	8	47	(73,72,144)				0
16	1	16	47	(73,72,144)				0
16	1	32	47	(73,72,144)				0
16	1	64	47	(73,72,144)				0

Elapsed (wall-clock) time for single-node strong scaling. The numeric labels give the number of OpenMP threads.

Parallel efficiency for single-node strong scaling. The numeric labels give the number of OpenMP threads.
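The Metrics columns can be derived from the Total time. The sketch below assumes that 1 SU is one core-hour, that Total is the wall time of a 200-iteration run, and that efficiency is measured against a baseline run as $(C_0 T_0)/(C\,T)$ (the 1-core run here, or the 16-core run in the multi-node table). The `metrics` helper is hypothetical.

```bash
# Hypothetical helper: metrics CORES TOTAL_S BASE_CORES BASE_TOTAL_S [ITERS]
# Assumes 1 SU = 1 core-hour and TOTAL_S = wall time of an ITERS-iteration run.
metrics() {
  awk -v c="$1" -v t="$2" -v c0="$3" -v t0="$4" -v n="${5:-200}" 'BEGIN {
    eff   = (c0 * t0) / (c * t)       # parallel efficiency relative to the baseline run
    hours = t * (1e4 / n) / 3600      # wall-clock hours per 10^4 iterations
    sus   = hours * c                 # core-hours (SUs) per 10^4 iterations
    printf "efficiency=%.3f hours=%.3f SUs=%.2f\n", eff, hours, sus
  }'
}
```

For the single-node table, pass the 1-core Total as the baseline. Rows with more threads than cores (T = 32, 64) still count C = 16 cores.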

Multiple Nodes, Strong Scaling

Note: xshells decomposes the domain by radial shells, so it can use at most $N_r$ cores.
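To see why the radial decomposition caps the worker count, here is a sketch of a contiguous block split of $N_r$ shells over $P$ workers. This generic block distribution is an assumption for illustration, not necessarily the one xshells uses; the `radial_split` helper is hypothetical.

```bash
# Hypothetical helper: radial_split NR P
# Prints the contiguous range of radial shells owned by each of P workers.
radial_split() {
  local nr=$1 p=$2 r lo hi
  if (( p < 1 || p > nr )); then
    echo "error: need 1 <= P <= N_r (got P=$p, N_r=$nr)" >&2
    return 1
  fi
  for (( r = 0; r < p; r++ )); do
    lo=$(( r * nr / p ))            # first shell of worker r
    hi=$(( (r + 1) * nr / p - 1 ))  # last shell of worker r
    echo "rank $r: shells $lo-$hi"
  done
}
```

`radial_split 512 256` gives every rank two shells, while `radial_split 512 1024` fails because some ranks would own no shell at all.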

Cores, Processes, Threads			Problem Description		Timing (seconds)				Metrics
C	P	T	$ l_{max} $	$ (N_r,N_{\theta},N_{\phi}) $	Total	Solver	Nonlinear	Comm	Efficiency	SUs per $10^4$ Iters	Hours per $10^4$ Iters
16	1	16	255	(512,384,768)
16	2	8	255	(512,384,768)
16	4	4	255	(512,384,768)
16	8	2	255	(512,384,768)
16	16	1	255	(512,384,768)
32	2	16	255	(512,384,768)
32	4	8	255	(512,384,768)
32	8	4	255	(512,384,768)
32	16	2	255	(512,384,768)
32	32	1	255	(512,384,768)
64	4	16	255	(512,384,768)
64	8	8	255	(512,384,768)
64	16	4	255	(512,384,768)
64	32	2	255	(512,384,768)
64	64	1	255	(512,384,768)
128	8	16	255	(512,384,768)
128	16	8	255	(512,384,768)
128	32	4	255	(512,384,768)
128	64	2	255	(512,384,768)
128	128	1	255	(512,384,768)
256	16	16	255	(512,384,768)
256	32	8	255	(512,384,768)
256	64	4	255	(512,384,768)
256	128	2	255	(512,384,768)
256	256	1	255	(512,384,768)
512	32	16	255	(512,384,768)
512	64	8	255	(512,384,768)
512	128	4	255	(512,384,768)
512	256	2	255	(512,384,768)
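Each core count in this table is split as $C = P \times T$ with $T \in \{16, 8, 4, 2, 1\}$. A small hypothetical generator for that run matrix, assuming 16 cores per node as the single-node runs suggest:

```bash
# Hypothetical helper: list every (C, P, T) split with P * T = C.
# Usage: hybrid_splits CORES [THREADS...]; default thread counts are 16 8 4 2 1.
hybrid_splits() {
  local c=$1; shift
  local threads=("$@")
  local t
  if (( ${#threads[@]} == 0 )); then threads=(16 8 4 2 1); fi
  for t in "${threads[@]}"; do
    if (( c % t == 0 )); then
      # C, P (MPI processes), T (OpenMP threads per process), tab-separated like the table
      printf '%d\t%d\t%d\n' "$c" $(( c / t )) "$t"
    fi
  done
}
```

Note that `hybrid_splits 512` also prints the pure-MPI split 512/512/1, which the table above does not include.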

Multiple Nodes, Weak Scaling

Cores, Processes, Threads			Problem Description		Timing (seconds)
C	P	T	$ l_{max} $	$ (N_r,N_{\theta},N_{\phi}) $	Total	Solver	Nonlinear	Comm
16	16	1	31	(512,48,96)
16	8	2	31	(512,48,96)
16	4	4	31	(512,48,96)
16	2	8	31	(512,48,96)
16	1	16	31	(512,48,96)
32	32	1	44	(512,68,136)
32	16	2	44	(512,68,136)
32	8	4	44	(512,68,136)
32	4	8	44	(512,68,136)
32	2	16	44	(512,68,136)
64	64	1	63	(512,96,192)
64	32	2	63	(512,96,192)
64	16	4	63	(512,96,192)
64	8	8	63	(512,96,192)
64	4	16	63	(512,96,192)
256	256	1	127	(512,192,384)
256	128	2	127	(512,192,384)
256	64	4	127	(512,192,384)
256	32	8	127	(512,192,384)
256	16	16	127	(512,192,384)

Elapsed (wall-clock) time for weak scaling in horizontal resolution. The numeric labels give the number of OpenMP threads. Dotted lines show the ideal $O(N_{core}^{1/2})$ scaling of the Legendre transform.
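The $O(N_{core}^{1/2})$ reference follows from a standard operation count, assuming a direct Legendre transform costing $O(N_\theta l_{max}^2)$ per shell and ignoring FFT and communication costs. In this series $N_r = 512$ is fixed, $N_\theta \approx 3(l_{max}+1)/2$, $N_\phi = 2N_\theta$, and the horizontal grid grows with the core count (for example, $192 \cdot 384 / (48 \cdot 96) = 16 = 256/16$):

```latex
\begin{aligned}
N_{core} &\propto N_\theta N_\phi \propto l_{max}^2, \\
W_{Leg} &\propto N_r \, N_\theta \, l_{max}^2 \propto N_r \, l_{max}^3, \\
t_{wall} &\propto \frac{W_{Leg}}{N_{core}} \propto N_r \, l_{max} \propto N_{core}^{1/2}.
\end{aligned}
```

So even with perfect load balance, the wall time per step is expected to grow as the square root of the core count in this series.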

Multiple Nodes, Radial Weak Scaling

Cores, Processes, Threads			Problem Description		Timing (seconds)
C	P	T	$ l_{max} $	$ (N_r,N_{\theta},N_{\phi}) $	Total	Solver	Nonlinear	Comm
128	128	1	255	(256,384,768)
128	64	2	255	(256,384,768)
128	32	4	255	(256,384,768)
128	16	8	255	(256,384,768)
128	8	16	255	(256,384,768)
256	256	1	255	(512,384,768)
256	128	2	255	(512,384,768)
256	64	4	255	(512,384,768)
256	32	8	255	(512,384,768)
256	16	16	255	(512,384,768)
512	512	1	255	(1024,384,768)
512	256	2	255	(1024,384,768)
512	128	4	255	(1024,384,768)
512	64	8	255	(1024,384,768)
512	32	16	255	(1024,384,768)
1024	1024	1	255	(2048,384,768)
1024	512	2	255	(2048,384,768)
1024	256	4	255	(2048,384,768)
1024	128	8	255	(2048,384,768)
1024	64	16	255	(2048,384,768)
2048	2048	1	255	(4096,384,768)
2048	1024	2	255	(4096,384,768)
2048	512	4	255	(4096,384,768)
2048	256	8	255	(4096,384,768)
2048	128	16	255	(4096,384,768)
4096	2048	2	255	(8192,384,768)
4096	1024	4	255	(8192,384,768)
4096	512	8	255	(8192,384,768)
4096	256	16	255	(8192,384,768)

Elapsed (wall-clock) time for weak scaling in radial resolution. The numeric labels give the number of OpenMP threads. The dotted line shows $O(N_{core}^{1/2})$ scaling.
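In the radial series the horizontal resolution is fixed ($l_{max} = 255$, $N_\theta = 384$, $N_\phi = 768$) and $N_r$ doubles with the core count. Assuming per-shell work proportional to $N_\theta l_{max}^2$, the work per core stays constant:

```latex
\frac{N_r}{N_{core}} = \frac{256}{128} = \frac{512}{256} = \cdots = \frac{8192}{4096} = 2,
\qquad
\frac{W}{N_{core}} \propto \frac{N_r}{N_{core}} \, N_\theta \, l_{max}^2 = \text{const}.
```

Under this idealized count the wall time would stay flat, so any growth with $N_{core}$ must come from costs that do not shrink with the per-core shell count, such as communication.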
