Data Analysis and Knowledge Discovery in Geophysical Databases

Objective: As we move toward an era where massive scientific data sets are stored on computer systems, the opportunities for exploiting these data for scientific investigation are enormous. To capitalize on these opportunities, new data analysis techniques and data query facilities are required. The goal of this project is to develop the means for extracting characteristic patterns and features from massive data sets and to provide the query capability for content-based, intelligent access to the underlying data.

The objective of this project is to develop the enabling technology for scientific data management systems, together with the know-how required for their most effective deployment in geophysical data analysis and scientific exploration. Toward this goal, we have designed and are implementing a system prototype, and we will execute several application experiments to validate and tune the prototype and to demonstrate its use in support of scientific research in geophysics.

Approach: An important part of the project is the development of an interface for the expression of: (1) complex queries, (2) feature templates used for feature extraction from the database, (3) "typical" temporal and spatial behavioral patterns of the pre-defined features of interest, and (4) monitoring daemons for observing and tracking unexpected object behavior. That is, the language will allow the scientist to express a complex set of constraints over the spatial and temporal database, as well as to specify the desired action if a constraint is violated.
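To make the idea concrete, the following sketch (in Python rather than the project's query language, with all names and the threshold values invented for illustration) shows how a constraint over time-ordered records might be paired with an action to take on each violation:

```python
# Hypothetical illustration of a constraint-plus-action query: scan
# time-ordered records and fire a user-supplied action on each record
# that violates the constraint. Names and thresholds are assumptions,
# not the project's actual language.

def monitor(records, constraint, on_violation):
    """Return the records that violate the constraint, firing the action on each."""
    violations = []
    for rec in records:
        if not constraint(rec):
            on_violation(rec)
            violations.append(rec)
    return violations

# Example: flag grid points whose sea-level pressure (hPa) drops below a
# threshold inside a latitude band -- a crude stand-in for a
# cyclone-candidate constraint.
records = [
    {"time": 0, "lat": 45.0, "lon": 10.0, "slp": 1012.0},
    {"time": 1, "lat": 47.5, "lon": 12.5, "slp": 968.0},
    {"time": 2, "lat": 20.0, "lon": 30.0, "slp": 970.0},
]

# Constraint: no pressure below 980 hPa within the 30-60 degree band.
no_deep_low_in_band = lambda r: not (30.0 <= r["lat"] <= 60.0) or r["slp"] >= 980.0

flagged = monitor(records, no_deep_low_in_band,
                  lambda r: print("violation:", r))
# Only the time-1 record lies in the band with pressure below 980 hPa.
```

In the envisioned system, the constraint and the action would be stated declaratively rather than as callbacks, and evaluated by the database engine itself.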

The basis of such a language, and of its implementation, is a data model that encompasses the types of data that scientists deal with (multidimensional arrays, 2D and 3D shapes that evolve over time, etc.) and the operations on these data types. For the massive datasets that earth scientists must deal with today, the software execution environment for this language must also facilitate parallel execution, so that computationally demanding tasks can be completed in a timely fashion.
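As an illustration of one such data type, the sketch below (assumed names and representation, not the project's actual model) represents a 2D region that evolves over discrete time steps as a set of grid cells, with two representative operations on it:

```python
# Illustrative sketch of a time-evolving 2D shape as a data type, with
# area and centroid as example operations. The class and its methods
# are assumptions for exposition, not the project's actual data model.

class EvolvingShape:
    """A 2D region sampled at discrete time steps as a set of grid cells."""

    def __init__(self):
        self.snapshots = {}          # time step -> set of (i, j) grid cells

    def add_snapshot(self, t, cells):
        self.snapshots[t] = set(cells)

    def area(self, t, cell_area=1.0):
        """Area of the region at time t (cell count times per-cell area)."""
        return len(self.snapshots.get(t, ())) * cell_area

    def centroid(self, t):
        """Grid-coordinate centroid of the region at time t."""
        cells = self.snapshots[t]
        n = len(cells)
        return (sum(i for i, _ in cells) / n,
                sum(j for _, j in cells) / n)

# A feature that grows from three cells to four between two time steps.
blob = EvolvingShape()
blob.add_snapshot(0, [(0, 0), (0, 1), (1, 0)])
blob.add_snapshot(1, [(0, 0), (0, 1), (1, 0), (1, 1)])
```

Queries over such objects (e.g., "find shapes whose area grows monotonically over a week") then compose these primitive operations.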

Accomplishments: We have developed a data model for scientific databases and for the complex queries required in earth science applications. We have also built an execution engine that can efficiently execute complex queries on parallel platforms. Initial experiments have been run on a variety of machines, including the IBM SP1 and SP2 and the Intel Paragon, demonstrating the efficacy of the approach. Several application studies have begun. For example, we have used our system for extracting phenomenological features such as cyclone tracks and blocking conditions from Atmospheric Global Climate Model (AGCM) output.
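As a minimal stand-in for one step of such feature extraction (an assumed simplification, not the project's actual method), cyclone detection typically begins by locating local sea-level-pressure minima in a gridded field before linking minima across time steps into tracks:

```python
# Find local minima in a 2D sea-level-pressure grid: positions strictly
# below all of their (up to 8) neighbors. This is a common first step of
# cyclone-track extraction; the actual project method may differ.

def local_minima(grid):
    """Return (row, col) positions strictly below every neighbor."""
    rows, cols = len(grid), len(grid[0])
    minima = []
    for i in range(rows):
        for j in range(cols):
            neighbors = [grid[i + di][j + dj]
                         for di in (-1, 0, 1) for dj in (-1, 0, 1)
                         if (di, dj) != (0, 0)
                         and 0 <= i + di < rows and 0 <= j + dj < cols]
            if all(grid[i][j] < v for v in neighbors):
                minima.append((i, j))
    return minima

# A toy 3x3 pressure field (hPa) with one deep low in the center.
slp = [
    [1012, 1010, 1011],
    [1009,  985, 1008],
    [1011, 1010, 1012],
]
print(local_minima(slp))   # the 985 hPa low at (1, 1)
```

At full scale this scan runs over each model time step, and the resulting minima are chained into tracks by nearest-neighbor matching between consecutive steps.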

Significance: The nature of exploratory data analysis for scientific hypothesis corroboration or phenomenon detection is basically an iterative, successive-refinement process. The scientist initially applies a coarse model to the data, and then uses the outcome of this first experiment to refine his/her model and methods; the process is then repeated until the hypothesis is dropped or refined into one that is fully corroborated by the collected data. For such investigations to be practical, the scientist must have at hand a powerful system that supports (1) the easy formulation of powerful queries and discriminant decision rules against the database, (2) a natural representation of the relationships of the scientific domain of interest (e.g., in the natural domains of space and time, but possibly in the frequency domain as well), and (3) efficient execution of these queries without requiring the scientist to become cognizant of the storage structures and processing strategies involved.

Once methods for detecting patterns of interest have been established, the system can search for these patterns as new data is added from sensors and satellites, and through a trigger-based activation mechanism alert interested scientists. Furthermore, the system can automatically record as metadata which datasets, algorithms, and parameters were used in each experiment, so that the database becomes the companion logbook of each scientist. As scientific theories are revised and improved, the system will help scientists revise results obtained under old assumptions. Thus, accelerated growth and sharing of scientific knowledge is expected as a result of the research proposed here.
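The trigger-based activation mechanism can be sketched as a simple registry of (predicate, action) pairs evaluated whenever a new dataset arrives; the names and the alert predicate below are assumptions for illustration only:

```python
# Sketch of a trigger-based activation mechanism: registered pattern
# detectors run on each arriving dataset, and any match alerts its
# subscribers. Class and field names are invented for this example.

class TriggerRegistry:
    def __init__(self):
        self.triggers = []           # (name, predicate, callback) tuples

    def register(self, name, predicate, callback):
        self.triggers.append((name, predicate, callback))

    def on_new_data(self, dataset):
        """Evaluate every trigger against the arriving dataset."""
        fired = []
        for name, predicate, callback in self.triggers:
            if predicate(dataset):
                callback(name, dataset)
                fired.append(name)
        return fired

registry = TriggerRegistry()
# Alert when any pressure reading in a new dataset falls below 960 hPa.
registry.register("deep_low",
                  lambda d: min(d["slp"]) < 960.0,
                  lambda name, d: print(f"alert {name}: {d['id']}"))

fired = registry.on_new_data({"id": "orbit-124", "slp": [1005.0, 955.0]})
```

In a database setting, the predicates would be the stored feature templates themselves, and the callbacks would notify subscribed scientists rather than print.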

Status/Plans: We have implemented a parallel query processing and scientific data processing system in a portable parallel programming environment. Experiments have been run on a variety of machines, including the IBM SP1 and SP2 and the Intel Paragon, demonstrating the efficacy of the approach. We have now obtained the datasets and have begun to use the query processing system for several scientific applications. One is to systematically compare the output of many global climate models in terms of the phenomenological features produced (storms, blocking conditions, etc.) and their statistics (e.g., density of genesis by spatial region and time of year). We are conducting this particular investigation both independently and in cooperation with the AMIP project at Lawrence Livermore. This is an example of a study that has not previously been feasible and that is enabled by the software environment being constructed under this project.

We are also developing a benchmark representative of data analysis and data mining applications. We will use this benchmark in the coming year to compare various supercomputers, workstation farms, and parallel I/O systems in particular. Our execution environment is portable across a wide variety of systems and forms a convenient basis for such a study.


Principal Investigator Progress Metric(s)


Point of Contact:



curator: Larry Picha (lpicha@cesdis.gsfc.nasa.gov)