[[wg:dynamo:Performance_results|Back to performance benchmark lists]] \\

===== Modules for libraries =====

module swap mvapich2 impi/4.1.0.030

(NetCDF 4.3.2 is compiled locally without HDF5 features.)

===== Compile options =====

-openmp -O3 -xAVX  -align array32byte

===== Restrictions =====

The number of zonal grid points ($N_{\phi}$) must be a power of 2.
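
The restriction can be checked before a run is configured. The following is a minimal sketch; the function name ''is_power_of_two'' is hypothetical and is not part of the benchmarked code.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive integer power of 2."""
    # A power of 2 has exactly one bit set, so n & (n - 1) clears it to 0.
    return n > 0 and (n & (n - 1)) == 0


# Zonal grid sizes used on this page all satisfy the restriction.
for n_phi in (128, 256, 512, 1024):
    assert is_power_of_two(n_phi)

# A size such as 384 would be rejected.
print(is_power_of_two(384))
```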

===== Definition of columns =====

^  Name  ^  Description  ^
|  # of Cores  |  Number of used CPU cores  |
|  # of Processes  |  Number of MPI processes  |
|  # of Threads  |  Number of threads for each process  |
|  $N_{c}$  |  Truncation level for Chebyshev polynomials  |
|  $l_{max}$  |  Truncation level for spherical harmonics  |
|  $(N_{r},N_{\theta},N_{\phi})$  |  Number of grid points in spherical coordinates  |
|  Elapsed  |  Elapsed (wall clock) time for one time step  |
|  Nonlinear  |  Elapsed (wall clock) time for evaluation of the nonlinear terms  |
|  Solver  |  Elapsed (wall clock) time for the linear solver (including communications)  |
|  Efficiency  |  Parallel efficiency  |
|  SUs  |  Service units (core hours) for $10^{4}$ time steps  |
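
The page does not state the formula behind the Efficiency column. The sketch below assumes the usual strong-scaling definition, (reference time × reference cores) / (time × cores). Under that assumption it approximately reproduces the 2-core and 4-core entries of the first strong scaling table (0.694213 and 0.374789). The function name is hypothetical.

```python
def parallel_efficiency(t_ref: float, n_ref: int, t: float, n: int) -> float:
    """Strong-scaling efficiency relative to a reference run.

    t_ref, n_ref: elapsed time and core count of the reference run
    t, n:         elapsed time and core count of the run being rated
    """
    return (t_ref * n_ref) / (t * n)


# Elapsed times taken from the (48,128,256) strong scaling table.
t_one_core = 2.12702
eff_2 = parallel_efficiency(t_one_core, 1, 1.53197, 2)
eff_4 = parallel_efficiency(t_one_core, 1, 1.41881, 4)
print(f"2 cores: {eff_2:.4f}, 4 cores: {eff_4:.4f}")
```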


===== Single Processor Result =====

^  $N_{c}$  ^  $l_{max}$  ^  $(N_{r},N_{\theta},N_{\phi})$  ^  Elapsed  ^  Nonlinear  ^  Solver  ^
|  47  |  48  |  (48,64,128)  |  0.445464  |  0.233546  |  0.183094  |

===== Strong Scaling Results =====
In this test, the spatial resolution is fixed and the parallelization is varied.
With ideal scaling, the elapsed time is inversely proportional to the number of cores.
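
As an illustration of the ideal-scaling statement, the following sketch compares a measured time with the time that ideal scaling would give. It uses the 1-core run and the 64-core run (8 processes, 8 threads) from the first table below; the function names are hypothetical.

```python
def ideal_elapsed(t_ref: float, n_ref: int, n: int) -> float:
    """Elapsed time on n cores if time scaled as 1 / (number of cores)."""
    return t_ref * n_ref / n


def speedup(t_ref: float, t: float) -> float:
    """Measured speedup relative to the reference run."""
    return t_ref / t


t_one_core = 2.12702   # 1 core, 1 process, 1 thread
t_64_cores = 1.06662   # 64 cores, 8 processes, 8 threads

ideal_64 = ideal_elapsed(t_one_core, 1, 64)
measured_speedup = speedup(t_one_core, t_64_cores)
print(f"ideal time on 64 cores: {ideal_64:.5f} s")
print(f"measured speedup: {measured_speedup:.2f} (ideal would be 64)")
```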

^  $N_{c}$  ^  $l_{max}$  ^  $(N_{r},N_{\theta},N_{\phi})$  ^
|  48  |  85  |  (48,128,256)  |

^  # of Cores  ^  # of Processes  ^  # of Threads  ^  Elapsed  ^  Nonlinear  ^  Solver  ^  Efficiency  ^  SUs  ^
|  1  |  1  |  1  |  2.12702  |  1.13331  |  0.837597  |  1.0  |  15.8769  |
|  2  |  2  |  1  |  1.53197  |  0.771775  |  0.666242  |  0.694213  |  8.32400  |
|  4  |  2  |  2  |  1.41881  |  0.677849  |  0.657223  |  0.374789  |  4.57244  |
|  8  |  2  |  4  |  1.29347  |  0.644326  |  0.586741  |  0.205553  |  2.57067  |
|  16  |  2  |  8  |  1.20088  |  0.577998  |  0.562982  |  0.110701  |  2.57067  |
|  32  |  2  |  16  |  4.45394  |  1.40103  |  2.96418  |  0.0149237  |  2.57067  |
|  4  |  4  |  1  |  1.52251  |  0.61794  |  0.838045  |  0.349261  |  8.32400  |
|  8  |  4  |  2  |  1.308  |  0.573778  |  0.669746  |  0.20327  |  4.57244  |
|  16  |  4  |  4  |  1.19526  |  0.551808  |  0.592631  |  0.111222  |  2.57067  |
|  32  |  4  |  8  |  1.11331  |  0.503031  |  0.560936  |  0.0597045  |  2.57067  |
|  64  |  4  |  16  |  4.8857  |  1.77167  |  3.03814  |  0.00680244  |  2.57067  |
|  8  |  8  |  1  |  1.65644  |  0.681461  |  0.913425  |  0.160511  |  8.32400  |
|  16  |  8  |  2  |  1.46565  |  0.661892  |  0.743852  |  0.0907029  |  4.57244  |
|  32  |  8  |  4  |  1.14545  |  0.505259  |  0.596385  |  0.0580289  |  2.57067  |
|  64  |  8  |  8  |  1.06662  |  0.465445  |  0.562054  |  0.031159  |  2.57067  |
|  16  |  16  |  1  |  3.08775  |  1.58078  |  1.37636  |  0.0430535  |  8.32400  |
|  32  |  16  |  2  |  1.45392  |  0.658729  |  0.74527  |  0.0457173  |  4.57244  |
|  64  |  16  |  4  |  1.13421  |  0.497914  |  0.593822  |  0.029302  |  2.57067  |
|  128  |  16  |  8  |  1.06704  |  0.467572  |  0.561543  |  0.0155732  |  2.57067  |

^  $N_{c}$  ^  $l_{max}$  ^  $(N_{r},N_{\theta},N_{\phi})$  ^
|  97  |  170  |  (145,256,512)  |

^  # of Cores  ^  # of Processes  ^  # of Threads  ^  Elapsed  ^  Nonlinear  ^  Solver  ^  Efficiency  ^  SUs  ^
|  8  |  1  |  8  |  11.3106  |  4.97222  |  5.77018  |  1.0   |  814.222  |
|  16  |  2  |  8  |  9.19364  |  3.65295  |  5.16199  |  1.05368   |  386.372  |
|  32  |  4  |  8  |  8.79569  |  3.32431  |  5.15045  |  1.01463   |  200.621  |
|  64  |  8  |  8  |  8.66073  |  3.21434  |  5.16300  |  0.904766   |  112.491  |
|  128  |  16  |  8  |  8.25554  |  2.83906  |  5.17103  |  0.528483  |  96.2924  |
|  256  |  32  |  8  |  8.50060  |  3.08209  |  5.17536  |  0.528483  |  96.2924  |

{{wg:dynamo:Performance_results:Dennou:SPmodel_elapsed.png?480}}\\
Elapsed (wall clock) time for the strong scaling. The numbers indicate the number of OpenMP threads. Ideal scaling is plotted as a dotted line.

{{wg:dynamo:Performance_results:Dennou:SPmodel_efficiency.png?480}}\\
Parallel efficiency for the strong scaling. The numbers indicate the number of OpenMP threads.

===== Weak Scaling Results =====

=== Weak Scaling in horizontal direction ===
In this benchmark, the radial resolution is fixed and the horizontal resolution is increased together with the number of cores.

^  # of Cores  ^  # of Processes  ^  # of Threads  ^  $N_{c}$  ^  $l_{max}$  ^  $(N_{r},N_{\theta},N_{\phi})$  ^  Elapsed  ^  Nonlinear  ^  Solver  ^  SUs  ^
|  4  |  1  |  4  |  64  |  42  |  (64,64,128)  |    |  0.307596  |  0.296679  |  13.6710  |
|  16  |  2  |  8  |  64  |  85  |  (64,128,256)  |    |  0.773182  |  1.00419  |  34.3636  |
|  64  |  8  |  8  |  64  |  170  |  (64,256,512)  |    |  3.21434  |  5.16300  |  142.860  |
|  256  |  16  |  8  |  64  |  341  |  (64,512,1024)  |    |  39.123  |  37.1453  |  1738.80  |
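
The horizontal resolutions in this table follow a regular pattern: $N_{\theta} = N_{\phi}/2$ and $l_{max} = \lfloor (N_{\phi}-1)/3 \rfloor$, with $N_{\phi}$ doubling at each step. The sketch below regenerates the sequence under that assumption. The relation is inferred from the tabulated values, not taken from the model source, and the function name is hypothetical.

```python
def horizontal_grid(n_phi: int) -> tuple[int, int, int]:
    """Return (l_max, n_theta, n_phi) for a given zonal grid size.

    Assumes the pattern seen in the weak scaling table:
    n_theta = n_phi / 2 and l_max = floor((n_phi - 1) / 3).
    """
    # The zonal grid size has to be a power of 2 (see Restrictions).
    if n_phi <= 0 or n_phi & (n_phi - 1):
        raise ValueError("n_phi must be a power of 2")
    n_theta = n_phi // 2
    l_max = (n_phi - 1) // 3
    return l_max, n_theta, n_phi


# Each step doubles the zonal (and meridional) grid size.
grids = [horizontal_grid(n_phi) for n_phi in (128, 256, 512, 1024)]
for l_max, n_theta, n_phi in grids:
    print(f"l_max = {l_max:4d}, (N_theta, N_phi) = ({n_theta}, {n_phi})")
```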

{{wg:dynamo:Performance_results:Dennou:spmodel_weak_sph.png?480}}\\
Elapsed (wall clock) time for the weak scaling in the horizontal directions. The numbers indicate the number of OpenMP threads.

===== Note =====
Times are averaged over 100 time steps. \\
Only the spherical harmonic transform and the evaluation of the nonlinear terms are parallelized. \\

[[wg:dynamo:Performance_results|Back to performance benchmark lists]] \\
[[wg:dynamo:Performance_results:Dennou:files|files]]