[[wg:dynamo:Performance_results|Back to performance benchmark lists]] \\

===== Compile options =====

F90OPTFLAGS = -O3 -warn all -g -xhost -openmp

===== Notes =====

Nonlinear terms are calculated twice in each time step. \\
Every process holds the full matrix for all spherical harmonic degrees. \\
LU decomposition is performed on the full matrix. \\
Time integration is done by a solver for banded matrices. \\

===== Definition of columns =====

^ Name ^ Description ^
| # of Cores | Number of CPU cores used |
| # of Processes | Number of MPI processes |
| # of Threads | Number of threads per process |
| $l_{max}$ | Truncation level for spherical harmonics |
| $(N_{r},N_{\theta},N_{\phi})$ | Number of grid points in spherical coordinates |
| Elapsed | Elapsed (wall clock) time for one time step |
| Nonlinear | Elapsed (wall clock) time for nonlinear terms (including communications) |
| Solver | Elapsed (wall clock) time for linear calculation |
| Comm. | Elapsed (wall clock) time for data communication |
| Init. | Elapsed (wall clock) time for initialization |
| Efficiency | Parallel efficiency |
| SUs | Service units for $10^{4}$ time steps (core hours) |

===== Single Processor Result =====

^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^ Elapsed ^ Nonlinear ^ Solver ^ Comm. ^ Init. ^ SUs ^
| 47 | (73,72,144) | 0.678760 | 0.488277 | 0.190479 | 0.029903 | 2.4152 | 30.1671 |

===== Strong Scaling Results =====

^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^
| 255 | (256,384,768) |

^ # of Cores ^ # of Processes ^ # of Threads ^ Elapsed ^ Nonlinear ^ Solver ^ Comm. ^ Init. ^ Efficiency ^ SUs ^
| 64 | 8 | 8 | 6.40703 | 5.89877 | 0.508254 | 1.07863 | 3554.5 | 1 | 1139.03 |
| 128 | 16 | 8 | 3.54131 | 3.2940 | 0.247309 | 0.890222 | 3552.51 | 0.904612 | 1259.13 |
| 256 | 32 | 8 | 1.86101 | 1.7352 | 0.125808 | 0.475738 | 3550.63 | 0.860692 | 1323.39 |
| 1024 | 64 | 8 | 1.04298 | 0.977307 | 0.0656672 | 0.361399 | 3552.23 | 0.383937 | 2966.7 |

{{wg:dynamo:Performance_results:TITECH:TITECH_Elapsed.png?480}}\\
Elapsed (wall clock) time for the strong scaling. Ideal scaling is shown by the dotted line.

{{wg:dynamo:Performance_results:TITECH:TITECH_efficiency.png?480}}\\
Parallel efficiency for the strong scaling.

===== Weak Scaling Results =====

^ # of Cores ^ # of Processes ^ # of Threads ^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^ Elapsed ^ Nonlinear ^ Solver ^ Comm. ^ Init. ^ SUs ^
| 2 | 1 | 2 | 31 | (256,48,96) | 0.551937 | 0.391925 | 0.160007 | 0.0214999 | 68.4944 | 15.3479 |
| 8 | 1 | 8 | 63 | (256,96,192) | 1.12191 | 0.939082 | 0.182825 | 0.0476302 | 92.4576 | 67.197 |
| 32 | 4 | 8 | 127 | (256,192,384) | 1.81109 | 1.62482 | 0.186259 | 0.27049 | 271.017 | 360.352 |
| 128 | 16 | 8 | 255 | (256,384,768) | 2.62543 | 2.43587 | 0.189552 | 0.624057 | 969.922 | 1490.03 |
| 512 | 64 | 8 | 511 | (256,768,1536) | 4.65903 | 4.45921 | 0.199815 | 2.01773 | 3545.45 | 8470.44 |

{{wg:dynamo:Performance_results:TITECH:TITECH_weak_sph.png?480}}\\
Elapsed time for the weak scaling in the horizontal resolution. Elapsed time for each time step is plotted in black, and initialization time is plotted in red. Scaling of $O(N_{core}^{1/2})$ is shown by the dotted lines. \\

^ # of Cores ^ # of Processes ^ # of Threads ^ $l_{max}$ ^ $(N_{r},N_{\theta},N_{\phi})$ ^ Elapsed ^ Nonlinear ^ Solver ^ Comm. ^ Init. ^ SUs ^
| 96 | 12 | 8 | 383 | (64,576,1152) | 2.76226 | 2.59632 | 0.165937 | 0.509965 | 43.1944 | 736.604 |
| 192 | 24 | 8 | 383 | (128,576,1152) | 2.81134 | 2.63732 | 0.174011 | 0.613411 | 484.779 | 1499.38 |
| 384 | 48 | 8 | 383 | (256,576,1152) | 3.63671 | 3.44881 | 0.187897 | 1.27930 | 6423.87 | 3879.16 |

{{wg:dynamo:Performance_results:TITECH:TITECH_weak_r.png?480}}\\
Elapsed time for the weak scaling in the radial resolution. The results with 4 OpenMP threads are shown.\\

[[wg:dynamo:Performance_results|Back to performance benchmark lists]] \\
[[wg:dynamo:Performance_results:TITECH:files|files]]
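
The Efficiency and SUs columns of the strong scaling table can be recomputed from the Elapsed column. The sketch below is illustrative only: the two formulas are assumptions inferred from the tabulated numbers (efficiency relative to the 64-core run, and SUs as core hours for $10^{4}$ steps excluding initialization), not definitions given on this page, and the function names are hypothetical.

```python
# Sketch: recompute the derived columns of the strong scaling table.
# Assumptions (inferred from the tabulated numbers, not stated on this page):
#   efficiency = (T_ref * N_ref) / (T * N), with the 64-core run as reference
#   SUs        = T * 10^4 steps * N / 3600 seconds per hour

STEPS = 10_000
SECONDS_PER_HOUR = 3600.0


def parallel_efficiency(elapsed, cores, ref_elapsed, ref_cores):
    """Strong scaling efficiency relative to a reference run."""
    return (ref_elapsed * ref_cores) / (elapsed * cores)


def service_units(elapsed, cores, steps=STEPS):
    """Core hours needed for `steps` time steps at `elapsed` seconds per step."""
    return elapsed * steps * cores / SECONDS_PER_HOUR


# (cores, elapsed seconds per step) taken from the first three table rows
runs = [(64, 6.40703), (128, 3.54131), (256, 1.86101)]
ref_cores, ref_elapsed = runs[0]

for cores, elapsed in runs:
    eff = parallel_efficiency(elapsed, cores, ref_elapsed, ref_cores)
    su = service_units(elapsed, cores)
    print(f"{cores:5d} cores: efficiency = {eff:.4f}, SUs = {su:.2f}")
```

Under these assumptions the computed values agree with the tabulated Efficiency and SUs for the 64-, 128- and 256-core runs to the printed precision.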