wg:dynamo:performance_results:xshells

Back to performance benchmark lists

Modules

shtns: gcc/4.7.1 mkl/13.0.2.146

xshells: intel/13.0.2.146 impi/4.1.0.030

Using MKL FFTW wrappers

Flags

shtns: ./configure --enable-mkl ; make

xshells: mpiicpc -mt_mpi -O3 -march=native -xHost -complex-limited-range -ipo -prec-div -prec-sqrt -DXS_MKL -DXS_VEC=0 -DXS_MPI -fopenmp -Wunknown-pragmas -lshtns -mkl -lrt -lm -o xsbig_hyb

Single Node, Strong Scaling

Run xsbig for 200 iterations, controlling the number of OpenMP threads with the OMP_NUM_THREADS environment variable. All times are in seconds.
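The thread sweep described above could be scripted roughly as follows. This is a minimal sketch, not the script used for these benchmarks: the helper names (`env_for_threads`, `time_run`, `sweep`) are hypothetical, and it assumes the executable takes its iteration count from its own input file, since this page does not say how the 200 iterations are configured.

```python
import os
import subprocess
import time

# Thread counts (column T) used in the single-node table on this page.
THREAD_COUNTS = [1, 2, 4, 8, 16, 32, 64]


def env_for_threads(threads):
    """Return a copy of the current environment with OMP_NUM_THREADS set."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads)
    return env


def time_run(command, threads):
    """Run `command` once with the given thread count; return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, env=env_for_threads(threads), check=True)
    return time.perf_counter() - start


def sweep(command, thread_counts=THREAD_COUNTS):
    """Time `command` for each thread count; returns {threads: seconds}."""
    return {t: time_run(command, t) for t in thread_counts}
```

A call such as `sweep(["./xsbig"])` would then return one wall-clock time per thread count; the path to the executable is an assumption and should be adjusted to the local build.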

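Columns: C = cores, P = MPI processes, T = OpenMP threads; problem description ($l_{max}$ and grid $(N_r,N_{\theta},N_{\phi})$); timing in seconds (Total, Solver, Nonlinear, Comm); metrics (Efficiency, SUs per $10^4$ iterations, hours per $10^4$ iterations).

C P T $l_{max}$ $(N_r,N_{\theta},N_{\phi})$ Total Solver Nonlinear Comm Efficiency SUs per $10^4$ iters Hours per $10^4$ iters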
1 1 1 47 (73,72,144) 0
2 1 2 47 (73,72,144) 0
4 1 4 47 (73,72,144) 0
8 1 8 47 (73,72,144) 0
16 1 16 47 (73,72,144) 0
16 1 32 47 (73,72,144) 0
16 1 64 47 (73,72,144) 0
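The two cost metrics in the table header can be extrapolated from the total time of a 200-iteration run. The sketch below assumes that one SU corresponds to one core-hour, which this page does not state; the function names are hypothetical and the 72 s figure in the usage note is an invented example, not a measurement.

```python
def hours_per_iters(total_seconds, iters_run=200, iters_target=10_000):
    """Extrapolate wall-clock hours for `iters_target` iterations
    from the total time of a run of `iters_run` iterations."""
    return total_seconds * (iters_target / iters_run) / 3600.0


def sus_per_iters(total_seconds, cores, iters_run=200, iters_target=10_000):
    """Service units for `iters_target` iterations.

    Assumption: 1 SU = 1 core-hour (not stated on this page).
    """
    return cores * hours_per_iters(total_seconds, iters_run, iters_target)
```

For instance, an invented total of 72 s on 16 cores gives `hours_per_iters(72.0)` equal to 1.0 hour and `sus_per_iters(72.0, 16)` equal to 16 SUs per $10^4$ iterations.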


Elapsed (wall-clock) time for the strong scaling. The numbers in the plot give the number of OpenMP threads.


Parallel efficiency for the strong scaling. The numbers in the plot give the number of OpenMP threads.
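This page does not define how parallel efficiency is computed. A minimal sketch, assuming the conventional strong-scaling definition (reference time times reference cores, divided by measured time times cores), is given below. The function names are hypothetical and the timings in the usage note are invented for illustration.

```python
def speedup(t_ref, t_n):
    """Speedup of a run taking t_n seconds relative to a reference run."""
    return t_ref / t_n


def parallel_efficiency(t_ref, t_n, cores, ref_cores=1):
    """Strong-scaling efficiency relative to a reference run.

    Assumes the conventional definition:
        E = (t_ref * ref_cores) / (t_n * cores)
    so that E == 1.0 means ideal scaling.
    """
    return (t_ref * ref_cores) / (t_n * cores)
```

With invented timings of 100 s on 1 core and 10 s on 16 cores, `parallel_efficiency(100.0, 10.0, 16)` gives 0.625. In the single-node table, the rows with 32 and 64 threads still list 16 cores, so under this definition they would be normalized by 16 rather than by the thread count.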

Multiple Nodes, Strong Scaling

Note: xshells decomposes the domain by radial shell, so it scales only up to $N_r$ cores.
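The process and thread combinations for a given core count can be enumerated as in the sketch below. It assumes that the radial decomposition limits the number of MPI processes to $N_r$ (one shell per process at most) and that thread counts are powers of two up to 16, as in the tables on this page; the function name and both limits are illustrative assumptions, not rules taken from xshells.

```python
def decompositions(cores, n_r, max_threads=16):
    """List (processes, threads) pairs with processes * threads == cores.

    Assumptions (illustrative): threads are powers of two up to
    `max_threads`, and at most one MPI process per radial shell,
    i.e. processes <= n_r.
    """
    pairs = []
    threads = 1
    while threads <= max_threads:
        if cores % threads == 0:
            processes = cores // threads
            if processes <= n_r:
                pairs.append((processes, threads))
        threads *= 2
    return pairs
```

For 512 cores and $N_r = 512$ this returns the pairs from (512, 1) down to (32, 16); the table below lists only the pairs with 2 to 16 threads. For 1024 cores and $N_r = 512$ the pure-MPI pair (1024, 1) is excluded because it would exceed the number of radial shells.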

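Columns: C = cores, P = MPI processes, T = OpenMP threads; problem description ($l_{max}$ and grid $(N_r,N_{\theta},N_{\phi})$); timing in seconds (Total, Solver, Nonlinear, Comm); metrics (Efficiency, SUs per $10^4$ iterations, hours per $10^4$ iterations).

C P T $l_{max}$ $(N_r,N_{\theta},N_{\phi})$ Total Solver Nonlinear Comm Efficiency SUs per $10^4$ iters Hours per $10^4$ iters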
16 1 16 255 (512,384,768)
16 2 8 255 (512,384,768)
16 4 4 255 (512,384,768)
16 8 2 255 (512,384,768)
16 16 1 255 (512,384,768)
32 2 16 255 (512,384,768)
32 4 8 255 (512,384,768)
32 8 4 255 (512,384,768)
32 16 2 255 (512,384,768)
32 32 1 255 (512,384,768)
64 4 16 255 (512,384,768)
64 8 8 255 (512,384,768)
64 16 4 255 (512,384,768)
64 32 2 255 (512,384,768)
64 64 1 255 (512,384,768)
128 8 16 255 (512,384,768)
128 16 8 255 (512,384,768)
128 32 4 255 (512,384,768)
128 64 2 255 (512,384,768)
128 128 1 255 (512,384,768)
256 16 16 255 (512,384,768)
256 32 8 255 (512,384,768)
256 64 4 255 (512,384,768)
256 128 2 255 (512,384,768)
256 256 1 255 (512,384,768)
512 32 16 255 (512,384,768)
512 64 8 255 (512,384,768)
512 128 4 255 (512,384,768)
512 256 2 255 (512,384,768)

Multiple Nodes, Weak Scaling

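Columns: C = cores, P = MPI processes, T = OpenMP threads; problem description ($l_{max}$ and grid $(N_r,N_{\theta},N_{\phi})$); timing in seconds (Total, Solver, Nonlinear, Comm).

C P T $l_{max}$ $(N_r,N_{\theta},N_{\phi})$ Total Solver Nonlinear Comm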
16 16 1 31 (512,48,96)
16 8 2 31 (512,48,96)
16 4 4 31 (512,48,96)
16 2 8 31 (512,48,96)
16 1 16 31 (512,48,96)
32 32 1 44 (512,68,136)
32 16 2 44 (512,68,136)
32 8 4 44 (512,68,136)
32 4 8 44 (512,68,136)
32 2 16 44 (512,68,136)
64 64 1 63 (512,96,192)
64 32 2 63 (512,96,192)
64 16 4 63 (512,96,192)
64 8 8 63 (512,96,192)
64 4 16 63 (512,96,192)
256 256 1 127 (512,192,384)
256 128 2 127 (512,192,384)
256 64 4 127 (512,192,384)
256 32 8 127 (512,192,384)
256 16 16 127 (512,192,384)


Elapsed (wall-clock) time for the weak scaling in horizontal resolution. The numbers in the plot give the number of OpenMP threads. The ideal scaling for the Legendre transform ($O(N_{core}^{1/2})$) is plotted with dotted lines.
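The resolutions in the table above follow a pattern that can be reproduced numerically: $l_{max}+1$ grows as $N_{core}^{1/2}$ from 32 at 16 cores, $N_{\theta}$ is about $\tfrac{3}{2}(l_{max}+1)$ rounded up to an even number, and $N_{\phi} = 2N_{\theta}$. The sketch below encodes that pattern as inferred from the table, not as a rule taken from xshells or SHTns, and the function names are hypothetical. The $N_{core}^{1/2}$ reference curve is presumably what results from Legendre-transform work of $O(l_{max}^3)$ per shell when $l_{max}^2$ grows in proportion to the core count.

```python
import math


def weak_scaling_resolution(cores, base_cores=16, base_lmax=31):
    """Reproduce (l_max, N_theta, N_phi) for the horizontal weak scaling.

    Pattern inferred from the table on this page (an assumption):
    l_max + 1 scales with sqrt(cores), N_theta is 3/2 * (l_max + 1)
    rounded up to an even number, and N_phi = 2 * N_theta.
    """
    lmax = round((base_lmax + 1) * math.sqrt(cores / base_cores)) - 1
    n_theta = 2 * math.ceil(3 * (lmax + 1) / 4)
    n_phi = 2 * n_theta
    return lmax, n_theta, n_phi


def ideal_legendre_time(cores, t_base, base_cores=16):
    """Reference curve t_base * sqrt(cores / base_cores) for the dotted line."""
    return t_base * math.sqrt(cores / base_cores)
```

Calling `weak_scaling_resolution` for 16, 32, 64 and 256 cores returns the four resolutions listed in the table, starting with (31, 48, 96).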

Multiple Nodes, Radial Weak Scaling

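Columns: C = cores, P = MPI processes, T = OpenMP threads; problem description ($l_{max}$ and grid $(N_r,N_{\theta},N_{\phi})$); timing in seconds (Total, Solver, Nonlinear, Comm).

C P T $l_{max}$ $(N_r,N_{\theta},N_{\phi})$ Total Solver Nonlinear Comm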
128 128 1 255 (256,384,768)
128 64 2 255 (256,384,768)
128 32 4 255 (256,384,768)
128 16 8 255 (256,384,768)
128 8 16 255 (256,384,768)
256 256 1 255 (512,384,768)
256 128 2 255 (512,384,768)
256 64 4 255 (512,384,768)
256 32 8 255 (512,384,768)
256 16 16 255 (512,384,768)
512 512 1 255 (1024,384,768)
512 256 2 255 (1024,384,768)
512 128 4 255 (1024,384,768)
512 64 8 255 (1024,384,768)
512 32 16 255 (1024,384,768)
1024 1024 1 255 (2048,384,768)
1024 512 2 255 (2048,384,768)
1024 256 4 255 (2048,384,768)
1024 128 8 255 (2048,384,768)
1024 64 16 255 (2048,384,768)
2048 2048 1 255 (4096,384,768)
2048 1024 2 255 (4096,384,768)
2048 512 4 255 (4096,384,768)
2048 256 8 255 (4096,384,768)
2048 128 16 255 (4096,384,768)
4096 2048 2 255 (8192,384,768)
4096 1024 4 255 (8192,384,768)
4096 512 8 255 (8192,384,768)
4096 256 16 255 (8192,384,768)


Elapsed (wall-clock) time for the weak scaling in radial resolution. The numbers in the plot give the number of OpenMP threads. $O(N_{core}^{1/2})$ scaling is plotted with a dotted line.
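In the radial weak-scaling table, $N_r$ is always twice the core count, so each MPI process holds $2T$ radial shells when it runs $T$ threads. The sketch below reconstructs a table row from the core and thread counts under that observed pattern; the function name and the `shells_per_core` parameter are hypothetical.

```python
def radial_weak_scaling_row(cores, threads, shells_per_core=2):
    """Build one row of the radial weak-scaling setup.

    Pattern read off the table on this page (an assumption):
    N_r = shells_per_core * cores, with processes = cores / threads.
    """
    if cores % threads != 0:
        raise ValueError("threads must divide cores")
    processes = cores // threads
    n_r = shells_per_core * cores
    return {
        "cores": cores,
        "processes": processes,
        "threads": threads,
        "n_r": n_r,
        "shells_per_process": n_r // processes,
    }
```

For example, `radial_weak_scaling_row(128, 16)` gives 8 processes, $N_r = 256$ and 32 shells per process, matching the row `128 8 16 255 (256,384,768)`.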

wg/dynamo/performance_results/xshells.txt · Last modified: 2018/11/28 21:55 (external edit)