Back to performance benchmark lists
Compile options

`F90OPTFLAGS = -O3 -warn all -g -xhost -openmp`
Definition of columns

Name | Description |
--- | --- |
# of Cores | Number of CPU cores used |
# of Processes | Number of MPI processes |
# of Threads | Number of OpenMP threads per process |
$N_{r}$ | Number of nodes in the radial direction |
$N_{sph}$ | Number of nodes on a sphere |
Elapsed time | Elapsed (wall clock) time for one time step |
Solver time | Elapsed (wall clock) time for the linear solver (including communications) |
Comm. time | Elapsed (wall clock) time for data communication |
Efficiency | Parallel efficiency |
SUs | Service units (core hours) for $10^{4}$ time steps |
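The two derived columns can be reproduced from the raw timings. The following is a minimal sketch in Python; the formulas are inferred by checking against the tabulated values (e.g. the 32-core rows of the first strong scaling table), not taken from the benchmark source itself:

```python
# Formulas inferred from the tabulated values, not from the benchmark code.

def service_units(cores: int, elapsed_s: float, steps: int = 10_000) -> float:
    """Core hours consumed to run `steps` time steps at `elapsed_s` seconds each."""
    return cores * elapsed_s * steps / 3600.0

def parallel_efficiency(cores: int, elapsed_s: float,
                        ref_cores: int, ref_elapsed_s: float) -> float:
    """Speedup over the reference run, normalized by the ratio of core counts."""
    return (ref_cores * ref_elapsed_s) / (cores * elapsed_s)

# Reference run for the N_r = 65 strong scaling table: 32 cores, 16 processes,
# 2 threads, elapsed 139.417 s (the row listed with Efficiency = 1.00).
print(service_units(32, 140.320))                         # ~12472.9 SUs
print(parallel_efficiency(64, 62.3732, 32, 139.417))      # ~1.1176
```

Efficiencies above 1.0 in the tables therefore indicate superlinear speedup relative to the chosen 32-core (or, for the larger problem, 512-core) reference run.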
Strong Scaling Results
$N_{r}$ | $N_{sph}$ |
--- | --- |
65 | 31106 |

# of Cores | # of Processes | # of Threads | Elapsed time | Solver time | Comm. time | Efficiency | SUs |
--- | --- | --- | --- | --- | --- | --- | --- |
32 | 32 | 1 | 140.320 | 136.035 | 5.63721 | 0.993565 | 12472.9 |
64 | 64 | 1 | 62.3732 | 60.2832 | 5.83084 | 1.1176 | 11088.6 |
256 | 256 | 1 | 25.7511 | 22.0492 | 2.21876 | 0.676753 | 18311.9 |
32 | 16 | 2 | 139.417 | 134.658 | 2.90400 | 1.00 | 12392.6 |
64 | 32 | 2 | 61.0636 | 58.9420 | 2.62792 | 1.14157 | 10855.8 |
128 | 64 | 2 | 26.2981 | 25.2837 | 1.96978 | 1.32535 | 9350.44 |
256 | 128 | 2 | 18.6194 | 17.9958 | 8.38487 | 0.935966 | 13240.5 |
512 | 256 | 2 | 18.5565 | 18.1882 | 13.8237 | 0.469569 | 26391.5 |
32 | 8 | 4 | 141.713 | 134.061 | 8.63714 | 0.983798 | 12596.7 |
64 | 16 | 4 | 61.3219 | 58.9329 | 2.28762 | 1.13676 | 10901.7 |
128 | 32 | 4 | 26.7351 | 25.6872 | 1.89901 | 1.30369 | 9505.81 |
256 | 64 | 4 | 12.0594 | 11.5573 | 1.29915 | 1.44511 | 8575.57 |
512 | 128 | 4 | 6.50411 | 6.19388 | 0.930524 | 1.3397 | 9250.29 |
1024 | 256 | 4 | 3.61803 | 3.40549 | 1.27324 | 1.20419 | 10291.3 |
256 | 32 | 8 | 12.6620 | 12.1131 | 1.49653 | 1.37633 | 9004.09 |
512 | 64 | 8 | 6.31781 | 6.05579 | 1.326885 | 1.37921 | 8985.33 |
1024 | 128 | 8 | 4.00455 | 3.83623 | 1.61223 | 1.08796 | 11390.7 |
2048 | 256 | 8 | 2.24999 | 2.17732 | 1.06132 | 0.968178 | 12799.9 |
$N_{r}$ | $N_{sph}$ |
--- | --- |
129 | 124418 |

# of Cores | # of Processes | # of Threads | Elapsed time | Solver time | Comm. time | Efficiency | SUs |
--- | --- | --- | --- | --- | --- | --- | --- |
512 | 512 | 1 | 188.6315 | 185.9239 | 79.4327 | 0.703179 | 268276 |
512 | 256 | 2 | 155.7737 | 150.9981 | 48.0654 | 0.851503 | 221545 |
1024 | 512 | 2 | 116.1871 | 114.8309 | 68.0962 | 0.570811 | 330488 |
512 | 128 | 4 | 132.6418 | 131.1039 | 13.0843 | 1.00 | 188646 |
1024 | 256 | 4 | 62.7093 | 60.6450 | 14.2009 | 1.05759 | 178373 |
2048 | 512 | 4 | 29.1312 | 28.4619 | 7.7951 | 1.13831 | 165724 |
1024 | 128 | 8 | 60.5976 | 59.1307 | 7.7176 | 1.09445 | 172367 |
2048 | 256 | 8 | 31.4802 | 30.6008 | 8.3909 | 1.05337 | 179087 |
4096 | 512 | 8 | 17.7632 | 17.4155 | 7.1339 | 0.933403 | 202106 |
2048 | 128 | 16 | 129.3082 | 127.3400 | 78.7965 | 0.256445 | 735620 |
4096 | 256 | 16 | 84.6423 | 83.7060 | 60.9705 | 0.195886 | 963041 |
Elapsed (wall clock) time for the strong scaling. The number of OpenMP threads is indicated by the labels.

Parallel efficiency for the strong scaling. The number of OpenMP threads is indicated by the labels.
Weak Scaling Results
# of Cores | # of Processes | # of Threads | $N_{r}$ | $N_{sph}$ | Elapsed time | Solver time | Comm. time | Iterations for $d{\bf A}/dt$ | SUs |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
8 | 4 | 2 | 17 | 1946 | 1.27525 | 0.995820 | 0.051852 | 180.75 | 28.3389 |
16 | 4 | 4 | 33 | 1946 | 2.13157 | 1.89065 | 0.104110 | 229.25 | 94.7364 |
32 | 8 | 4 | 17 | 7778 | 2.91725 | 2.64922 | 0.176285 | 309.0 | 259.311 |
64 | 8 | 8 | 33 | 7778 | 3.29256 | 3.03684 | 0.356617 | 358.5 | 585.344 |
128 | 32 | 4 | 65 | 7778 | 4.02430 | 3.77557 | 0.680323 | 404.25 | 1430.86 |
256 | 32 | 8 | 33 | 31106 | 5.94088 | 5.67232 | 0.916604 | 661.7 | 4224.63 |
512 | 64 | 8 | 65 | 31106 | 6.31781 | 6.05579 | 1.32689 | 683.5 | 8985.33 |
1024 | 256 | 4 | 129 | 31106 | 9.44298 | 9.15932 | 2.59881 | 814.0 | 26860 |
2048 | 256 | 8 | 65 | 124418 | 15.0371 | 14.7224 | 4.4994 | 1354.1 | 85544.4 |
4096 | 512 | 8 | 129 | 124418 | 17.7632 | 17.4155 | 7.1339 | 1355.8 | 202106 |
Elapsed time for the weak scaling (red line). The best result among runs with the same number of cores is chosen for plotting. The average iteration count for $d{\bf A}/dt$ is also plotted (black line).