NASA
High Performance Computing
and Communications Program
Computational AeroSciences Project
Parallelization of Two Computational Fluid Dynamics (CFD) Codes
Objective: Develop parallel, scalable versions of two three-dimensional (3D) Computational Fluid Dynamics (CFD) codes: Thin-Layer Navier-Stokes in Three Dimensions (TLNS3D) and Computational Fluids Lab 3-Dimensional (CFL3D).
Approach: Both codes are multi-block and thus structured to readily incorporate coarse-grain parallelism, wherein one or more blocks of the global grid are assigned to each processor of a workstation cluster or parallel computer. More than one block may reside on a processor, but coarse granularity implies that a block cannot be split among multiple processors.
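As an illustration of this coarse-grain mapping, the sketch below uses a simple greedy heuristic (largest remaining block goes to the least-loaded processor) to assign whole blocks to processors. The block sizes, processor count, and heuristic are assumptions for illustration only; they are not the actual distribution logic of TLNS3D or CFL3D.

/* Hypothetical sketch: greedy coarse-grain mapping of whole grid blocks
 * to processors.  Blocks stay whole (coarse granularity); a processor
 * may hold several blocks. */
#include <stdio.h>
#include <stdlib.h>

#define NPROCS 3                          /* assumed processor count   */

/* Compare two block sizes for a descending-order sort. */
static int cmp_desc(const void *a, const void *b)
{
    long sa = *(const long *)a, sb = *(const long *)b;
    return (sb > sa) - (sb < sa);
}

int main(void)
{
    /* Illustrative block sizes (grid points per block). */
    long blocks[] = { 400000, 250000, 180000, 90000, 50000, 30000 };
    const int nblocks = (int)(sizeof blocks / sizeof blocks[0]);
    long load[NPROCS] = { 0 };            /* grid points per processor */

    /* Place the largest remaining block on the least-loaded processor. */
    qsort(blocks, nblocks, sizeof blocks[0], cmp_desc);
    for (int b = 0; b < nblocks; b++) {
        int target = 0;
        for (int p = 1; p < NPROCS; p++)
            if (load[p] < load[target]) target = p;
        load[target] += blocks[b];
        printf("block of %ld points -> processor %d\n", blocks[b], target);
    }
    for (int p = 0; p < NPROCS; p++)
        printf("processor %d total: %ld grid points\n", p, load[p]);
    return 0;
}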
Both codes are written to run in either a serial or a distributed environment. The user chooses between two underlying message-passing libraries: the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM). The MPI library is better suited to distributed-memory parallel supercomputers (MPPs) because it is generally faster; the MPI implementations used to date include IBM's MPI-F on the SP2 and MPICH on workstation clusters. PVM is widely accepted and easily installed and maintained in heterogeneous environments. The primary considerations of the parallelization task were: a) minimal code changes to the original sequential version, and b) the capability to generate sequential, distributed, and parallel versions from one code using simple compiler directives.
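The one-source build strategy can be pictured with a minimal C sketch: a single file compiles to either a serial or an MPI executable depending on a preprocessor flag. The flag name USE_MPI and the build command are hypothetical and only illustrate the idea; the actual solver code is omitted, and only standard MPI calls are used.

/* Minimal sketch of one-source serial/parallel builds, assuming a
 * hypothetical USE_MPI compile flag (e.g. cc -DUSE_MPI solver.c -lmpi). */
#include <stdio.h>

#ifdef USE_MPI
#include <mpi.h>
#endif

int main(int argc, char **argv)
{
    int rank = 0, nprocs = 1;             /* serial defaults           */

#ifdef USE_MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
#endif

    /* Each process would advance its locally assigned blocks here;
     * rank 0 alone handles every block in the serial build.           */
    printf("process %d of %d: advancing local blocks\n", rank, nprocs);

#ifdef USE_MPI
    MPI_Finalize();
#endif
    return 0;
}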
Accomplishments: To utilize the available high-performance resources, parallel versions of both codes were developed. The parallel implementations retain almost all of the original features, yet yield significant speedups in a high-performance environment. Near-linear speedups are attainable when the largest blocks are partitioned to allow good load (and data) balancing. Table I shows comparative speeds for the two CFD codes on three High Speed Civil Transport (HSCT) problems executed on various machines. The distributed versions display near-linear speedup with the number of processors on the IBM SP2, the Intel Paragon, and a heterogeneous cluster of workstations (SGI, Sun, etc.).
Significance: Coarse-grain parallelism eases the burden of simulating compressible flows whose CFD models require multiple blocks and millions of grid points. This work demonstrates that near-linear speedups are attainable when the largest blocks can be partitioned to allow good load (and data) balancing.
Status/Plans: Future work includes a study of the feasibility of incorporating fine-grain parallelism, wherein one block may be shared among two or more processors, into CFL3D. Such a procedure splits the largest block of a problem with non-uniform block sizes across two or more processors, enabling better load balancing. A judicious combination of fine- and coarse-grain parallelism will be required to tackle the complex fluid-flow problems that will engage scientists and researchers in the years ahead.
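A minimal sketch of the planned fine-grain idea, assuming the largest block is cut along one index direction into nearly even slabs so that no single processor carries the whole block. The extents and split count below are illustrative, and the ghost-plane exchange needed at the cut faces is not shown.

/* Hypothetical sketch: split the largest block's k-index range evenly
 * among 'nsplit' processors. */
#include <stdio.h>

int main(void)
{
    const int kmax   = 193;   /* assumed k-extent of the largest block */
    const int nsplit = 4;     /* processors sharing this block         */

    for (int p = 0; p < nsplit; p++) {
        /* Distribute kmax planes as evenly as possible. */
        int base  = kmax / nsplit;
        int extra = kmax % nsplit;
        int klo   = p * base + (p < extra ? p : extra);
        int khi   = klo + base + (p < extra ? 1 : 0) - 1;
        printf("processor %d owns k = %d .. %d (%d planes)\n",
               p, klo, khi, khi - klo + 1);
    }
    return 0;
}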
Contact:
Dr. Chen-Huei Liu
NASA Langley Research Center
Mail Stop 128
c.liu@larc.nasa.gov
(757) 864-2154
Table I. CFD Code Speed-Up and Comparative Timing
Three test cases were selected to compare the two CFD codes:

Test Case 1: RefH wing-body-nacelle/pylon/tail; 1 million grid points; 26 blocks
Test Case 2: RefH grid; 730,000 grid points; 66 blocks (c.g.)
Test Case 3: HSCT high-lift landing configuration (wing/body/nacelle/diverter, engine power-off); 4.7 million grid points; 66 blocks

For Test Cases 1 and 2, the serial versions of the CFD codes were run on the Cray YMP, whereas the parallel versions were run on the IBM SP2 with 16 nodes. Because of the problem size, the serial versions for Test Case 3 were run on the Cray C90, whereas the parallel versions were run on the IBM SP2 with 37 nodes. The following table shows the speed-up (serial time / parallel time).

Test Case   Serial/Parallel       CFL3D Speed-up   TLNS3D Speed-up
1           Cray YMP / SP2 (16)   1.85             3.95
2           Cray YMP / SP2 (16)   3.72             3.27
3           Cray YMP / SP2 (37)   11.02 *          5.71 *

* Cray YMP speed-up values for Test Case 3 are interpolated from the Cray C90 serial results.

As a detailed example, the following table shows the wall-clock time in seconds per grid iteration for Test Case 2. The first column shows the machine, with the SP2 listed at various numbers of nodes.

Computer      CFL3D time   TLNS3D time
Cray YMP      24.223       64.899
Cray C90      5.188        -
SP2 (2)       46.596       -
SP2 (4)       24.244       50.773
SP2 (8)       13.516       29.531
SP2 (16)      6.508        19.842
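The speed-up entries in Table I follow directly from these wall-clock times (speed-up = serial time / parallel time); the short check below, using the Test Case 2 rows above, reproduces the tabulated values of 3.72 and 3.27.

/* Reproduce the Test Case 2 speed-ups in Table I from the wall-clock
 * times (seconds per grid iteration) listed above. */
#include <stdio.h>

int main(void)
{
    const double cfl3d_ymp  = 24.223, cfl3d_sp2_16  = 6.508;
    const double tlns3d_ymp = 64.899, tlns3d_sp2_16 = 19.842;

    printf("CFL3D  speed-up: %.2f\n", cfl3d_ymp  / cfl3d_sp2_16);   /* 3.72 */
    printf("TLNS3D speed-up: %.2f\n", tlns3d_ymp / tlns3d_sp2_16);  /* 3.27 */
    return 0;
}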