shtns: gcc/4.7.1 mkl/13.0.2.146
xshells: intel/13.0.2.146 impi/4.1.0.030
Using MKL FFTW wrappers
shtns: ./configure --enable-mkl ; make
xshells: mpiicpc -mt_mpi -O3 -march=native -xHost -complex-limited-range -ipo -prec-div -prec-sqrt -DXS_MKL -DXS_VEC=0 -DXS_MPI -fopenmp -Wunknown-pragmas -lshtns -mkl -lrt -lm -o xsbig_hyb
Each run executes xsbig for 200 iterations, with the number of OpenMP threads set through the OMP_NUM_THREADS environment variable. All times are in seconds.
Cores (C) | Processes (P) | Threads (T) | $l_{max}$ | $(N_r,N_{\theta},N_{\phi})$ | Total (s) | Solver (s) | Nonlinear (s) | Comm (s) | Efficiency | SUs per $10^4$ iters | Hours per $10^4$ iters |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 47 | (73,72,144) | 0 | ||||||
2 | 1 | 2 | 47 | (73,72,144) | 0 | ||||||
4 | 1 | 4 | 47 | (73,72,144) | 0 | ||||||
8 | 1 | 8 | 47 | (73,72,144) | 0 | ||||||
16 | 1 | 16 | 47 | (73,72,144) | 0 | ||||||
16 | 1 | 32 | 47 | (73,72,144) | 0 | ||||||
16 | 1 | 64 | 47 | (73,72,144) | 0 |
Figure: Elapsed (wall clock) time for the strong scaling. The point labels give the number of OpenMP threads.
Figure: Parallel efficiency for the strong scaling. The point labels give the number of OpenMP threads.
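
The Efficiency, SUs and Hours columns can be derived from the total time of a 200-iteration run. The following is a minimal sketch of that arithmetic, not the script used for this page. It assumes that one SU equals one core-hour and that efficiency is measured against a chosen reference run; the function names and the timings in the usage note are hypothetical.

```python
def hours_per_1e4_iters(total_seconds, iters_run=200):
    """Extrapolate wall-clock hours for 10^4 iterations from a short run."""
    return total_seconds * (1.0e4 / iters_run) / 3600.0


def sus_per_1e4_iters(total_seconds, cores, iters_run=200):
    """Service units for 10^4 iterations.

    Assumption: 1 SU = 1 core-hour (the page does not define the SU).
    """
    return cores * hours_per_1e4_iters(total_seconds, iters_run)


def parallel_efficiency(t_ref, cores_ref, t, cores):
    """Strong-scaling efficiency relative to a reference run.

    A value of 1.0 means the core-seconds spent are unchanged
    compared with the reference run.
    """
    return (t_ref * cores_ref) / (t * cores)
```

For example, a hypothetical 72 s run of 200 iterations on 16 cores extrapolates to 1 hour and 16 SUs per $10^4$ iterations, and a run that takes 12.5 s on 16 cores against 100 s on 1 core has an efficiency of 0.5.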
Note: The code decomposes the domain by radial shell, so it scales only up to $N_r$ cores.
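
To illustrate why a radial-shell decomposition caps the useful process count at $N_r$, here is a minimal sketch of a block distribution of shells over processes. The actual distribution used by xshells may differ; `shells_per_process` is a hypothetical helper written for this illustration.

```python
def shells_per_process(n_r, n_procs):
    """Distribute n_r radial shells over n_procs processes as evenly as possible.

    Hypothetical block distribution: the first (n_r % n_procs) ranks
    receive one extra shell.
    """
    base, extra = divmod(n_r, n_procs)
    return [base + 1 if rank < extra else base for rank in range(n_procs)]
```

With $N_r = 73$ and 4 processes this gives 19, 18, 18 and 18 shells. Once the number of processes exceeds $N_r$, the surplus ranks receive zero shells and add no speedup.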
Cores (C) | Processes (P) | Threads (T) | $l_{max}$ | $(N_r,N_{\theta},N_{\phi})$ | Total (s) | Solver (s) | Nonlinear (s) | Comm (s) | Efficiency | SUs per $10^4$ iters | Hours per $10^4$ iters |
---|---|---|---|---|---|---|---|---|---|---|---|
16 | 1 | 16 | 255 | (512,384,768) | |||||||
16 | 2 | 8 | 255 | (512,384,768) | |||||||
16 | 4 | 4 | 255 | (512,384,768) | |||||||
16 | 8 | 2 | 255 | (512,384,768) | |||||||
16 | 16 | 1 | 255 | (512,384,768) | |||||||
32 | 2 | 16 | 255 | (512,384,768) | |||||||
32 | 4 | 8 | 255 | (512,384,768) | |||||||
32 | 8 | 4 | 255 | (512,384,768) | |||||||
32 | 16 | 2 | 255 | (512,384,768) | |||||||
32 | 32 | 1 | 255 | (512,384,768) | |||||||
64 | 4 | 16 | 255 | (512,384,768) | |||||||
64 | 8 | 8 | 255 | (512,384,768) | |||||||
64 | 16 | 4 | 255 | (512,384,768) | |||||||
64 | 32 | 2 | 255 | (512,384,768) | |||||||
64 | 64 | 1 | 255 | (512,384,768) | |||||||
128 | 8 | 16 | 255 | (512,384,768) | |||||||
128 | 16 | 8 | 255 | (512,384,768) | |||||||
128 | 32 | 4 | 255 | (512,384,768) | |||||||
128 | 64 | 2 | 255 | (512,384,768) | |||||||
128 | 128 | 1 | 255 | (512,384,768) | |||||||
256 | 16 | 16 | 255 | (512,384,768) | |||||||
256 | 32 | 8 | 255 | (512,384,768) | |||||||
256 | 64 | 4 | 255 | (512,384,768) | |||||||
256 | 128 | 2 | 255 | (512,384,768) | |||||||
256 | 256 | 1 | 255 | (512,384,768) | |||||||
512 | 32 | 16 | 255 | (512,384,768) | |||||||
512 | 64 | 8 | 255 | (512,384,768) | |||||||
512 | 128 | 4 | 255 | (512,384,768) | |||||||
512 | 256 | 2 | 255 | (512,384,768) |
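
The rows above pair each core count with hybrid layouts that satisfy $P \times T = C$. A minimal sketch of how such candidate layouts could be enumerated is given below. The thread counts 1 to 16 are taken from the table; the optional limit of $P \le N_r$ and the helper name `hybrid_layouts` are assumptions, and the sketch does not claim to reproduce exactly which rows were run.

```python
def hybrid_layouts(cores, thread_counts=(1, 2, 4, 8, 16), n_r=None):
    """List candidate (processes, threads) layouts with procs * threads == cores.

    If n_r is given, layouts with more MPI processes than radial shells
    are skipped (assumption: one process needs at least one shell).
    """
    layouts = []
    for threads in thread_counts:
        if cores % threads:
            continue  # thread count does not divide the core count
        procs = cores // threads
        if n_r is not None and procs > n_r:
            continue
        layouts.append({
            "procs": procs,
            "threads": threads,
            # Environment setting that controls the OpenMP thread count.
            "env": {"OMP_NUM_THREADS": str(threads)},
        })
    return layouts
```

For 16 cores this yields the five layouts 16x1, 8x2, 4x4, 2x8 and 1x16 (processes x threads), each with the matching `OMP_NUM_THREADS` value.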
Cores (C) | Processes (P) | Threads (T) | $l_{max}$ | $(N_r,N_{\theta},N_{\phi})$ | Total (s) | Solver (s) | Nonlinear (s) | Comm (s) |
---|---|---|---|---|---|---|---|---|
16 | 16 | 1 | 31 | (512,48,96) | ||||
16 | 8 | 2 | 31 | (512,48,96) | ||||
16 | 4 | 4 | 31 | (512,48,96) | ||||
16 | 2 | 8 | 31 | (512,48,96) | ||||
16 | 1 | 16 | 31 | (512,48,96) | ||||
32 | 32 | 1 | 44 | (512,68,136) | ||||
32 | 16 | 2 | 44 | (512,68,136) | ||||
32 | 8 | 4 | 44 | (512,68,136) | ||||
32 | 4 | 8 | 44 | (512,68,136) | ||||
32 | 2 | 16 | 44 | (512,68,136) | ||||
64 | 64 | 1 | 63 | (512,96,192) | ||||
64 | 32 | 2 | 63 | (512,96,192) | ||||
64 | 16 | 4 | 63 | (512,96,192) | ||||
64 | 8 | 8 | 63 | (512,96,192) | ||||
64 | 4 | 16 | 63 | (512,96,192) | ||||
256 | 256 | 1 | 127 | (512,192,384) | ||||
256 | 128 | 2 | 127 | (512,192,384) | ||||
256 | 64 | 4 | 127 | (512,192,384) | ||||
256 | 32 | 8 | 127 | (512,192,384) | ||||
256 | 16 | 16 | 127 | (512,192,384) |
Figure: Elapsed (wall clock) time for the weak scaling in the horizontal resolution. The point labels give the number of OpenMP threads. The ideal scaling for the Legendre transform, $O(N_{core}^{1/2})$, is plotted as dotted lines.
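
The $O(N_{core}^{1/2})$ reference can be motivated as follows. In the table above, $(l_{max}+1)^2$ grows in proportion to the core count while $N_r$ stays fixed at 512. Assuming a Legendre transform cost of $O(l_{max}^3)$ per shell, the work per core then grows as $l_{max}^3 / N_{core} \propto N_{core}^{1/2}$. The sketch below reproduces the $l_{max}$ values of the table and the resulting ideal time ratio; the function name and the choice of the 16-core run as the baseline are assumptions made for illustration.

```python
import math


def horizontal_weak_scaling(cores, base_cores=16, base_lmax=31):
    """Return (l_max, ideal time ratio) for a weak-scaling run.

    l_max is chosen so that (l_max + 1)^2 stays proportional to the core count.
    The ideal ratio assumes O(l_max^3) Legendre work per shell, spread over
    `cores` cores, relative to the baseline run.
    """
    growth = math.sqrt(cores / base_cores)
    lmax = int((base_lmax + 1) * growth) - 1
    ideal_ratio = growth  # (l_max^3 / cores) scales as cores^(1/2)
    return lmax, ideal_ratio
```

Calling it for 16, 32, 64 and 256 cores gives $l_{max}$ = 31, 44, 63 and 127, matching the table, with ideal time ratios of 1, about 1.41, 2 and 4.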
Cores (C) | Processes (P) | Threads (T) | $l_{max}$ | $(N_r,N_{\theta},N_{\phi})$ | Total (s) | Solver (s) | Nonlinear (s) | Comm (s) |
---|---|---|---|---|---|---|---|---|
128 | 128 | 1 | 255 | (256,384,768) | ||||
128 | 64 | 2 | 255 | (256,384,768) | ||||
128 | 32 | 4 | 255 | (256,384,768) | ||||
128 | 16 | 8 | 255 | (256,384,768) | ||||
128 | 8 | 16 | 255 | (256,384,768) | ||||
256 | 256 | 1 | 255 | (512,384,768) | ||||
256 | 128 | 2 | 255 | (512,384,768) | ||||
256 | 64 | 4 | 255 | (512,384,768) | ||||
256 | 32 | 8 | 255 | (512,384,768) | ||||
256 | 16 | 16 | 255 | (512,384,768) | ||||
512 | 512 | 1 | 255 | (1024,384,768) | ||||
512 | 256 | 2 | 255 | (1024,384,768) | ||||
512 | 128 | 4 | 255 | (1024,384,768) | ||||
512 | 64 | 8 | 255 | (1024,384,768) | ||||
512 | 32 | 16 | 255 | (1024,384,768) | ||||
1024 | 1024 | 1 | 255 | (2048,384,768) | ||||
1024 | 512 | 2 | 255 | (2048,384,768) | ||||
1024 | 256 | 4 | 255 | (2048,384,768) | ||||
1024 | 128 | 8 | 255 | (2048,384,768) | ||||
1024 | 64 | 16 | 255 | (2048,384,768) | ||||
2048 | 2048 | 1 | 255 | (4096,384,768) | ||||
2048 | 1024 | 2 | 255 | (4096,384,768) | ||||
2048 | 512 | 4 | 255 | (4096,384,768) | ||||
2048 | 256 | 8 | 255 | (4096,384,768) | ||||
2048 | 128 | 16 | 255 | (4096,384,768) | ||||
4096 | 2048 | 2 | 255 | (8192,384,768) | ||||
4096 | 1024 | 4 | 255 | (8192,384,768) | ||||
4096 | 512 | 8 | 255 | (8192,384,768) | ||||
4096 | 256 | 16 | 255 | (8192,384,768) |
Figure: Elapsed (wall clock) time for the weak scaling in the radial resolution. The point labels give the number of OpenMP threads. An $O(N_{core}^{1/2})$ scaling is plotted as a dotted line.