Testbeds

System Performance Evaluation Project

Objective

The System Performance Evaluation project works with the large-scale science simulation codes produced by the nine ESS Grand Challenge Investigator teams and with the ESS testbed computer systems. Our objective is to understand the behavior of these codes on the massively parallel testbed computer systems and, to a lesser extent, to understand their behavior on other parallel systems such as the CESDIS Beowulf systems. We expect to work with about 10 to 15 different science codes in total. Our interest is in using measurement tools to understand how these large science codes stress the parallel system and how the parallel system responds to these stresses, as illustrated by the graphic below.

In particular, we wish to find ways to:

Quantify the stresses produced by the science codes on the testbed hardware and software.
Quantify the performance responses produced by the system.
Determine the causes of the observed responses in the codes and systems.
Use the results to improve codes and systems.
Develop new performance evaluation and prediction methods and tools as needed.

The results of this work will be published in various journals and conference proceedings.

Approach

Our approach is to use the science codes as they are submitted by the science teams to meet performance milestones. We use various measurement tools to understand the static structure of each code and its dynamic behavior when executed with a typical data set (also provided by the science team). Typically, a code is instrumented to collect the desired statistics and timings and then run on the testbed system using various numbers of processing nodes. The results are analyzed, and if more data is required, the instrumentation is modified and the code rerun.

The insights gained from this research on a particular code often lead to understandings about how to improve the performance of the code. These insights are fed back to the science team to aid them in further development of the code. Results may also be useful to SGI/Cray in improving their hardware and software systems, so results are often forwarded to the inhouse SGI/Cray team and the inhouse computational scientists.

Measurements of Interest

Part of the research effort is to determine what aspects of science code structure and behavior have the greatest effect on performance. To this end, we are measuring some of the following elements in each code:

Flop/s counts and rates.
Timings and execution counts of interesting code segments.
Data flows between code segments.
MPI/shmem/PVM message passing and synchronization profiles.
I/O activity profiles.
Cache use issues.
Storage allocation sizes and use profiles.
Scaling with problem size and number of processors.
Load balance.

Tools Used

These studies use a variety of tools for instrumenting and measuring various characteristics of the science codes and their behavior. The primary tool to date has been a software system called Godiva (GODdard Instrumentation Visualizer and Analyzer), developed by this project. We also use the SGI/Cray Apprentice and PAT software tools on the CRAY T3E and are investigating other tools from universities and national laboratories that might prove of use, such as PABLO from the University of Illinois and AIMS from NASA Ames Research Center.

Accomplishments

The year was primarily spent on development of the Godiva software system and on preparatory work to understand the research issues involved. The Godiva software now runs on the CRAY T3E and T3D, the CESDIS Beowulf cluster machines, and Sun workstations.

Initially, we are making a pass through all of the science codes, as submitted to meet the 10 Gigaflop/s milestone, in order to understand the various code designs and the instrumentation and measurement issues involved. Four codes have received serious study to date, with a more cursory look at two others. Issues studied have included code size (codes range up to 50,000 lines in size, making many forms of instrumentation difficult), language used (codes so far have included FORTRAN 77, C, and FORTRAN 90), cache use in key loops, parallel communication and synchronization (using the MPI, PVM, and shmem libraries), and flop/s rates in selected code segments. In several cases, the Godiva software was extended to allow new forms of measurement and display.

Several studies of aspects of the NAS Parallel Benchmarks have also been made in order to develop methodology and to understand research issues in these well-known benchmark codes.

No major studies have been completed, but these preliminary studies have resulted in some useful insights and results that have been fed back to the science teams and the SGI/Cray inhouse team. Of particular note:

As a result of study of the CRAY T3E cache behavior, and in collaboration with Spencer Swift of SGI/Cray, we found how to speed up the key computation loop in the TERRA code of John Baumgardner (Los Alamos National Laboratory) from 70 Megaflop/s to 210 Megaflop/s, leading to an overall 20 percent decrease in the execution time of the code.
As a result of study of the MPI profile, we suggested an improved parallel transpose algorithm to Russ Dahlburg (Naval Research Laboratory) for use in his CRUNCH3D code.
As a result of studying the timings of key segments in the treecode of Kevin Olson (Drexel University/Goddard Space Flight Center), we suggested minor changes that resulted in an 82 percent speedup of the code.

A flow of these small scale results will be continuously fed to the science teams as the evaluation project proceeds, with larger, more general insights and understandings packaged as journal and conference papers.

Godiva Software Instrumentation Tool

The Godiva software system, developed as part of this project, has proven to be a useful new tool for the study of large science codes. Using Godiva, a wide variety of aspects of a code may be instrumented so that the dynamic behavior may be observed as the program executes. Of particular importance to date have been the ability to study cache behavior on the CRAY T3E; computation (flop/s) rates in selected code segments, parallel communication and synchronization profiles using MPI, PVM, or shmem library calls; and load balance among processors.

The approach to code instrumentation used in Godiva is as follows. First, selected parts of the code are annotated to study whatever characteristics are of interest. These annotations use a syntax specified in the Godiva Users Manual. Annotations appear as comments to a FORTRAN or C compiler. The annotated code is fed through the Godiva preprocessor, which generates FORTRAN or C source code with calls to the Godiva run-time library inserted at appropriate points. The generated source program is then compiled and linked with the Godiva run-time library. Execution of the program generates a trace file on each processor. The trace file contains statistics collected on-the-fly during execution. Tracing overhead is generally quite low for typical statistics of interest (less than 5 percent additional execution time on most runs, but dependent on the user's choice of data to be gathered). After execution is complete, a Godiva postprocessor is used to generate tables, graphs, and histograms from the trace files produced by the processing nodes.

Currently Godiva supports about 30 different annotation types in the source program. These annotations may be used to generate about 20 different forms of output tables and graphs. For more information about the CESDIS evaluation project and the Godiva software system, visit this set of presentation slides. Included are samples of Godiva annotations and output tables and graphs.

Godiva has been developed as a personal research tool, not intended for general distribution, but it has been made available to other ESS team researchers as appropriate. Because it is a personal research tool, it undergoes frequent change to meet the demands and new directions of the evaluation project.

Significance

By understanding and quantifying the stresses produced on parallel computer systems by large science codes as well as the performance responses of the computer systems, this research is intended to lead to improvements in both the codes and the computer hardware and software systems. The methods used for measurement also are expected to lead to improvements in the techniques of computer benchmarking and performance analysis.

Status/Plans

During FY98, more intensive study of the performance of the large-scale science codes from the Grand Challenge teams will be performed. Some extensions to the Godiva system are anticipated. In addition, a new approach to the generation of benchmarks for parallel computer systems will be investigated.

Point of Contact

Terrence W. Pratt
Center of Excellence in Space Data and Information Sciences (CESDIS)
Goddard Space Flight Center
pratt@cesdis.gsfc.nasa.gov
301-286-0880

Table of Contents | Section Contents -- Testbeds