Objective: Develop an efficient parallel implementation of the OVERFLOW code on existing Ames test beds, and demonstrate how algorithmic choices affect the efficiency and robustness of a parallel implementation. (OVERFLOW is a compressible Navier-Stokes code for steady and unsteady flows that is heavily used on the Cray C90.)
Approach: The platform used for this study was the Connection Machine CM-5. Parallel efficiency is affected by many algorithmic choices; two are considered here: the treatment of the systems of linear equations (generally narrow banded) that arise from the standard approximate-factorization algorithm, and the treatment of the turbulence model.
For steady-state computations a variant of the diagonal form of the standard approximate-factorization algorithm is often sufficient. The computational kernel of the implicit part of this algorithm is the solution of scalar tridiagonal systems of equations. In some steady-state or unsteady problems the scalar tridiagonal algorithm is not sufficiently robust or accurate; in these cases a scalar pentadiagonal algorithm or a block tridiagonal algorithm is appropriate. Various implementations of these algorithms were explored, including a static partitioning, in which a given banded system spans many processors, and a transpose strategy, in which global data movements are performed to put each banded system on a single processor.
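To make the discussion concrete, the following is a minimal sketch of the scalar tridiagonal kernel (the classical Thomas algorithm, written in generic Fortran 90; this is an illustration with invented variable names, not the OVERFLOW routine). The forward-elimination and back-substitution loops are sequential recurrences along each grid line, which is why the partitioning of the banded systems across processors matters so much.

! Sketch of the scalar tridiagonal (Thomas) kernel that forms the
! implicit part of the diagonal algorithm.  No pivoting is done, as is
! usual for the diagonally dominant systems that arise here.
! a = sub-diagonal, b = diagonal, c = super-diagonal, d = right-hand
! side; the solution overwrites d.
subroutine thomas(n, a, b, c, d)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a(n), c(n)
  real,    intent(inout) :: b(n), d(n)
  integer :: i
  real    :: m
  ! Forward elimination: each step depends on the previous one, so the
  ! loop is a sequential recurrence along the grid line.
  do i = 2, n
     m    = a(i) / b(i-1)
     b(i) = b(i) - m * c(i-1)
     d(i) = d(i) - m * d(i-1)
  end do
  ! Back substitution, also a sequential recurrence.
  d(n) = d(n) / b(n)
  do i = n - 1, 1, -1
     d(i) = (d(i) - c(i) * d(i+1)) / b(i)
  end do
end subroutine thomas

Under the static partitioning, each such recurrence spans many processors and would typically have to be pipelined or replaced by a parallel variant (cyclic reduction, for example); under the transpose strategy, global data movement places each banded system on a single processor, so the recurrence runs at full serial speed, at the cost of performing the transposes.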
The standard turbulence model in OVERFLOW is the Baldwin-Lomax model. This model does not parallelize well: using it increases total execution time by about 30%. A more recent model, the Spalart-Allmaras one-equation model, has the twin advantages of better physics and an easier parallel implementation. The use of this model was explored.
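The usual difficulty with parallelizing the Baldwin-Lomax model is that its outer-layer length scale requires the maximum of the function F(y) = y |omega| [1 - exp(-y+/A+)] along every wall-normal grid line, a per-line reduction rather than a pointwise operation, whereas the Spalart-Allmaras source terms are local to each grid point. The sketch below illustrates that reduction; the routine name, array names, and shapes are invented for the example and are not taken from OVERFLOW.

! Illustrative sketch only: locate the maximum of the Baldwin-Lomax
! function F along each wall-normal grid line (index k).  This per-line
! search couples all points on a line, unlike the pointwise
! Spalart-Allmaras source terms.
subroutine bl_outer_scales(ni, nk, f, fmax, kmax)
  implicit none
  integer, intent(in)  :: ni, nk
  real,    intent(in)  :: f(ni, nk)   ! F evaluated at each grid point
  real,    intent(out) :: fmax(ni)    ! maximum of F on each line
  integer, intent(out) :: kmax(ni)    ! wall-normal index of the maximum
  integer :: i
  do i = 1, ni
     kmax(i) = maxloc(f(i, :), dim=1)  ! reduction along the line
     fmax(i) = f(i, kmax(i))
  end do
end subroutine bl_outer_scales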
Accomplishment: For several test cases involving scalar pentadiagonal and block tridiagonal solvers, the "transpose" code ran 20-40% faster than the non-transpose code, even though the "transpose" code spent 25-40% of its time performing the transposes. For other cases, involving scalar tridiagonal solvers, the transpose strategy was less efficient than the static partitioning; for those cases, 32 nodes of the CM-5 deliver roughly half the speed of one Cray C90 processor. The Spalart-Allmaras one-equation turbulence model was introduced into the code and was run on a one-zone wing test case and on a 6-zone wing/body problem. For the one-zone wing, the code with the Spalart-Allmaras model was about 12% faster than the code with the Baldwin-Lomax model.
Significance: This work directly applies to milestone 2.3.1. When the time comes to use OVERFLOW as part of a multidisciplinary optimization, the efficiency of the code will be crucial; two ways of making the code more efficient on parallel computers were carefully explored here.
Status/Plans: Work on milestone 2.3.1 is complete. Future plans include work on Optimization. For that task, the code will be translated from Connection Machine Fortran (CMF) to High Performance Fortran (HPF) and tested on a variety of platforms, including the IBM SP-2 once an HPF compiler is available (projected for the first quarter of 1995). HPF is similar to CMF, so the language transition should be smooth. The difficult part will be dealing with numerical software libraries, which exist on the CM-5 but are not yet available on the SP-2.
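As a rough illustration of why the language transition should be smooth, both CMF and HPF express data layout through compiler directives that ordinary Fortran compilers treat as comments. The fixed-form fragment below is a hedged sketch: the array q, its shape, and its distribution are invented for the example, and the directive spellings are the generic CMF and HPF forms rather than lines taken from OVERFLOW.

      program layout_example
c     Sketch: a CM Fortran LAYOUT directive and the roughly
c     equivalent HPF DISTRIBUTE directive.  Both look like comments
c     to a compiler that does not recognize them, so the program
c     itself remains ordinary Fortran.
      real q(5, 64, 64)
c     CM Fortran: keep the first axis in-processor, spread the others.
CMF$  LAYOUT q(:SERIAL, :NEWS, :NEWS)
c     High Performance Fortran: the analogous block distribution.
!HPF$ DISTRIBUTE q(*, BLOCK, BLOCK)
      q = 0.0
      print *, 'q initialized; total = ', sum(q)
      end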
Point of Contact: