NASA
High Performance Computing
and Communications Program

Computational AeroSciences Project

Parallelization of Two Computational Fluid Dynamics (CFD) Codes

Objective: Develop parallel, scalable versions of two three-dimensional (3D) Computational Fluid Dynamics (CFD) codes. The codes are Thin-Layer Navier-Stokes in Three Dimensions (TLNS3D) and Computational Fluids Laboratory 3-Dimensional (CFL3D).

Approach: Both codes are multi-block and thus structured to readily incorporate coarse-grain parallelism, wherein one or more blocks of the global grid are assigned to each processor of a workstation cluster or parallel computer. More than one block may reside on a processor, but coarse granularity implies that a block cannot be split among multiple processors.
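The block-to-processor mapping can be illustrated with a simple largest-first greedy heuristic, sketched below in C. This is a minimal sketch, not code from either solver; the block sizes, processor count, and the greedy rule are illustrative assumptions only.

    /* Illustrative sketch (not TLNS3D or CFL3D source): coarse-grain
     * assignment of whole grid blocks to processors.  Each block goes to
     * the currently least-loaded processor; a processor may hold several
     * blocks, but no block is ever split. */
    #include <stdio.h>

    #define NBLOCKS 6
    #define NPROCS  3

    int main(void)
    {
        /* Hypothetical block sizes in grid points, sorted largest first. */
        long block_size[NBLOCKS] = { 120000, 95000, 80000, 60000, 45000, 30000 };
        long load[NPROCS] = { 0 };

        for (int b = 0; b < NBLOCKS; b++) {
            int p_min = 0;
            for (int p = 1; p < NPROCS; p++)
                if (load[p] < load[p_min])
                    p_min = p;
            load[p_min] += block_size[b];      /* the whole block stays on one processor */
            printf("block %d -> processor %d\n", b, p_min);
        }

        for (int p = 0; p < NPROCS; p++)
            printf("processor %d load = %ld grid points\n", p, load[p]);
        return 0;
    }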

Both codes are written to run in either a serial or a distributed environment. The user chooses between two underlying message-passing libraries: the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM). The MPI library is better suited to distributed-memory parallel supercomputers (MPPs) because it is generally faster; the MPI implementations used to date include IBM's MPI-F on the SP2 and MPICH on workstation clusters. PVM is widely accepted and easily installed and maintained in heterogeneous environments. The primary considerations of the parallelization task were: a) minimal changes to the original sequential code, and b) the capability to generate sequential, distributed, and parallel versions from one source using simple compiler directives.
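The single-source, directive-based build can be sketched as follows (in C for illustration only; the macro name USE_MPI and this skeleton routine are hypothetical, not the actual TLNS3D/CFL3D build flags):

    /* Sketch only: one source builds as either a serial or an MPI-based
     * distributed executable, selected by a compile-time directive. */
    #include <stdio.h>
    #ifdef USE_MPI
    #include <mpi.h>
    #endif

    int main(int argc, char **argv)
    {
        int rank = 0, nprocs = 1;

    #ifdef USE_MPI
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    #endif

        /* ... each process advances the solution on the blocks it owns ... */
        printf("process %d of %d\n", rank, nprocs);

    #ifdef USE_MPI
        MPI_Finalize();
    #endif
        return 0;
    }

Compiling with an MPI compiler wrapper and -DUSE_MPI yields the distributed executable; omitting the flag yields the sequential build from the same source. A PVM build would be selected with an analogous directive.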

Accomplishments: A parallel version of both codes was developed to utilize the available high-performance computing resources. The parallel implementations retain almost all of the original features, yet yield significant speedups in a high-performance environment. Linear speedups are approachable if the biggest blocks are partitioned to allow good load (and data) balancing. Table I shows the comparative speed-ups and timings of the two CFD codes for three High Speed Civil Transport (HSCT) problems executed on various machines. The distributed versions display near-linear speedup with the number of processors on the IBM SP2, the Intel Paragon, and a heterogeneous cluster of workstations (SGI, Sun, etc.).

Significance: The use of coarse-grain parallelism eases the burden of simulating compressible flows whose CFD models require multiple blocks and millions of grid points. This work demonstrates that linear speedups are approachable if the biggest blocks can be partitioned to allow good load (and data) balancing.
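The load-balance caveat can be made concrete with a small, hypothetical calculation: under coarse granularity the most heavily loaded processor sets the pace, so the achievable speedup can never exceed the total work divided by the work in the largest unsplit block.

    /* Hypothetical illustration of the coarse-grain speedup limit: one
     * dominant block caps the speedup at (total work) / (largest block),
     * no matter how many processors are available. */
    #include <stdio.h>

    int main(void)
    {
        long block[] = { 400000, 150000, 150000, 150000, 150000 };  /* grid points */
        int  nblocks = sizeof block / sizeof block[0];
        long total = 0, biggest = 0;

        for (int b = 0; b < nblocks; b++) {
            total += block[b];
            if (block[b] > biggest)
                biggest = block[b];
        }
        printf("coarse-grain speedup limit = %.2f\n", (double)total / (double)biggest);
        return 0;
    }

With these illustrative sizes the limit is 2.5 regardless of processor count, which is why re-partitioning the biggest blocks matters.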

Status/Plans: Future work will include a study of the feasibility of incorporating fine-grain parallelism into CFL3D, wherein one block may be shared among two or more processors. Such a procedure would split the biggest block of a problem with non-uniform block sizes among two or more processors, enabling better load balancing. A judicious combination of fine- and coarse-grain parallelism will be required to tackle the complex fluid-flow problems that will engage scientists and researchers in the years ahead.
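The fine-grain idea can be sketched as a one-dimensional split of the largest block along a single grid direction (illustrative sizes only, not CFL3D code; a real solver would also carry ghost planes at the new internal boundaries):

    /* Sketch: split one block's planes along a grid direction into
     * near-equal pieces, one per processor sharing the block. */
    #include <stdio.h>

    int main(void)
    {
        int kmax   = 193;   /* grid planes in the split direction (illustrative) */
        int nsplit = 3;     /* processors sharing the block */

        for (int p = 0; p < nsplit; p++) {
            int lo = p * kmax / nsplit;
            int hi = (p + 1) * kmax / nsplit - 1;
            printf("processor %d owns planes %d..%d (%d planes)\n",
                   p, lo, hi, hi - lo + 1);
        }
        return 0;
    }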

Contact:

Dr. Chen-Huei Liu

NASA Langley Research Center

Mail Stop 128

c.liu@larc.nasa.gov

(757) 864-2154

Table I. CFD Code Speed-Up and Comparative Timing

 

Three test cases were selected to compare the two CFD codes:
    Test Case 1: RefH wing-body-nacelle/pylon/tail; 1 million grid points; 26 blocks
    Test Case 2: RefH grid; 730,000 grid points; 66 blocks (c.g.)
    Test Case 3: HSCT high-lift landing configuration (wing/body/nacelle/diverter, engine power-off); 4.7 million grid points; 66 blocks
 
For Test Cases 1 and 2, the serial versions of the CFD codes were run on the Cray YMP, and the parallel versions on the IBM SP2 with 16 nodes. Because of the problem size, the serial runs for Test Case 3 were made on the Cray C90, and the parallel versions on the IBM SP2 with 37 nodes. The following table shows the speed-up (serial time divided by parallel time).
Test Case    Serial / Parallel            CFL3D Speed-up    TLNS3D Speed-up
    1        Cray YMP / SP2 (16 nodes)         1.85               3.95
    2        Cray YMP / SP2 (16 nodes)         3.72               3.27
    3        Cray YMP / SP2 (37 nodes)        11.02*              5.71*

* Cray YMP speed-up values for Test Case 3 are interpolated from Cray C90 serial results.
As a detailed example, the following table shows the wall-clock time in seconds per grid iteration for Test Case 2. The first column lists the machine; the SP2 is shown with various numbers of nodes.
Test Case 2: wall-clock time per grid iteration (seconds)

Computer          CFL3D time    TLNS3D time
Cray YMP            24.223        64.899
Cray C90             5.188           -
SP2 (2 nodes)       46.596           -
SP2 (4 nodes)       24.244        50.773
SP2 (8 nodes)       13.516        29.531
SP2 (16 nodes)       6.508        19.842
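As a cross-check, the Test Case 2 speed-ups in the first table above follow directly from these timings (serial Cray YMP time divided by the 16-node SP2 time); a minimal calculation:

    /* Reproduce the Test Case 2 speed-up entries of Table I from the timings above. */
    #include <stdio.h>

    int main(void)
    {
        printf("CFL3D  speed-up = %.2f\n", 24.223 / 6.508);   /* ~3.72 */
        printf("TLNS3D speed-up = %.2f\n", 64.899 / 19.842);  /* ~3.27 */
        return 0;
    }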