Objective: To design and implement a prototype database that facilitates retrieval of subsets of large scientific data sets. The data subsets are anticipated to be used as inputs to other codes, such as visualization or multidisciplinary design optimization (MDO) codes.
Approach: Currently many scientists store and retrieve data sets as files. When scientists are interested in a subset of the data, they read in the entire data set and strip out the portion of interest. This is a viable option when working with files of tens of megabytes, but it is not practical when data sets are two or three orders of magnitude larger. Many Langley CFD scientists indicate that they will soon be producing gigabyte data sets.
Our approach is to provide database support that retrieves from disk only those pages that contain the desired data. The scientist queries the database with a specified region and is returned only the desired data. Typical CFD data sets are two or three dimensional, so we provide a multi-attribute indexing technique.
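As a concrete illustration, the sketch below shows the kind of region query interface we have in mind, written in C++. The type and function names (Point3, Region3, regionQuery) are illustrative assumptions, not the prototype's actual interface, and the brute-force scan merely stands in for the index lookup; in the prototype only the disk pages overlapping the region are read.

    // Illustrative sketch only: hypothetical names, not the prototype's API.
    #include <vector>

    struct Point3 {          // one tuple of a 3-D CFD data set
        double coord[3];     // e.g. x, y, z (or x, y, t)
        double value;        // field value stored with the tuple
    };

    struct Region3 {         // axis-aligned query region
        double lo[3], hi[3];
    };

    // A region query returns only the tuples whose coordinates fall inside
    // the requested region.
    std::vector<Point3> regionQuery(const std::vector<Point3>& data,
                                    const Region3& r)
    {
        std::vector<Point3> result;
        for (const Point3& p : data) {
            bool inside = true;
            for (int d = 0; d < 3; ++d)
                if (p.coord[d] < r.lo[d] || p.coord[d] > r.hi[d])
                    inside = false;
            if (inside)
                result.push_back(p);
        }
        return result;
    }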
Accomplishment: Several multi-attribute indexing techniques have been proposed in the database literature, but no conclusive study comparing them had been done. A substantial portion of our time was spent conducting a comparison study of the three most accepted multi-attribute indexing techniques: gridfiles, R-trees, and R*-trees. Our initial system used Exodus, an academic object-oriented database system from the University of Wisconsin, as the database engine. Implementation work by student Adrian Filipi-Martin showed the gridfile to be the best for our purposes.
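The gridfile's appeal for our workload is that a region query maps directly to a rectangle of directory cells. The sketch below gives a minimal two-dimensional rendering of that idea; the structure and names are assumptions made for illustration and do not reflect our implementation's actual layout.

    // Minimal gridfile sketch (assumed structure): one linear scale of split
    // points per dimension and a directory mapping each cell to a disk bucket.
    #include <algorithm>
    #include <cstddef>
    #include <set>
    #include <vector>

    struct GridFile2D {
        std::vector<double> scaleX, scaleY;  // split points along each axis
        std::vector<std::vector<int>> dir;   // dir[i][j] = bucket (page) id; sized
                                             // (scaleX.size()+1) x (scaleY.size()+1)

        // Index of the cell containing coordinate value v along one axis.
        static std::size_t cell(const std::vector<double>& scale, double v) {
            return std::upper_bound(scale.begin(), scale.end(), v) - scale.begin();
        }

        // A region query touches only the buckets whose cells overlap the query
        // rectangle; those are the only pages that must be read from disk.
        std::set<int> bucketsForRegion(double x0, double x1,
                                       double y0, double y1) const {
            std::set<int> buckets;
            for (std::size_t i = cell(scaleX, x0); i <= cell(scaleX, x1); ++i)
                for (std::size_t j = cell(scaleY, y0); j <= cell(scaleY, y1); ++j)
                    buckets.insert(dir[i][j]);
            return buckets;
        }
    };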
The overhead of the Exodus system, plus our lack of control over logging, necessitated the development of our own database system. Our system currently consists of a catalog, which keeps track of the metadata associated with each data set, and a gridfile holding the data for each data set. It is significantly faster for both loading and querying than our earlier Exodus-based system.
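The sketch below outlines this two-part design; the field names are assumptions for illustration rather than the actual catalog schema.

    // Assumed catalog layout, for illustration only: each entry records the
    // metadata for one data set, whose tuples live in their own gridfile on disk.
    #include <map>
    #include <string>
    #include <vector>

    struct DataSetInfo {
        std::string name;           // data set identifier
        int dimensions;             // e.g. 3 for (x, y, t)
        std::vector<double> lower;  // bounding box of the data
        std::vector<double> upper;
        std::string gridfilePath;   // file holding this data set's gridfile
    };

    // The catalog: data set name -> metadata.  Loading a data set adds an entry;
    // a query consults the catalog first, then opens only the gridfile it names.
    using Catalog = std::map<std::string, DataSetInfo>;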
Our prototype currently supports loading and querying a single three-dimensional block of a block-structured CFD grid. We have been interacting with Mark Sanetrik, who has supplied us with a data set. The coordinate system of the data is an O-grid in which the third coordinate is time.
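As a usage illustration (reusing the hypothetical Region3, Point3, and regionQuery types from the earlier sketch; the prototype's actual query interface may differ), extracting a spatial window over a range of time steps from a single block might look like this:

    // Illustrative only; "block" is assumed to hold the block's tuples as a
    // std::vector<Point3>.
    Region3 window;
    window.lo[0] = 0.0;  window.hi[0] = 1.5;   // x range of interest
    window.lo[1] = -0.5; window.hi[1] = 0.5;   // y range of interest
    window.lo[2] = 10.0; window.hi[2] = 20.0;  // time range of interest

    std::vector<Point3> subset = regionQuery(block, window);
    // "subset" can be handed to a visualization or MDO code without ever
    // reading the rest of the data set.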
In collaboration with David Nicol, we have begun addressing unstructured CFD grids by developing a bulk loading algorithm. We have developed and implemented a rectilinear partitioning algorithm (see Figure 1) and compared its computation time with that of a recently proposed dynamic programming algorithm. As shown in Figure 2, our algorithm is three to four orders of magnitude faster for small data sets. The algorithm has not yet been incorporated into our prototype.
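To show what a rectilinear partition is, the sketch below uses a simple equal-count quantile heuristic in two dimensions: one set of cuts per axis, so every point falls into exactly one rectangle, and each rectangle's points can be bulk-loaded into their own bucket. This is for illustration only; it is neither the algorithm of Figure 1 nor the dynamic programming algorithm it was compared against.

    // Illustrative quantile heuristic for a rectilinear partition of 2-D points.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Pt { double x, y; };

    // Choose nCuts cut positions along one coordinate so that each slab between
    // consecutive cuts holds roughly the same number of points.
    // Assumes coords is non-empty.
    std::vector<double> quantileCuts(std::vector<double> coords, std::size_t nCuts)
    {
        std::sort(coords.begin(), coords.end());
        std::vector<double> cuts;
        for (std::size_t k = 1; k <= nCuts; ++k)
            cuts.push_back(coords[k * (coords.size() - 1) / (nCuts + 1)]);
        return cuts;
    }

    // A rectilinear partition is one set of cuts per axis; the cuts induce a grid
    // of rectangles, and each rectangle's points can be bulk-loaded together.
    struct RectilinearPartition {
        std::vector<double> xCuts, yCuts;
    };

    RectilinearPartition makePartition(const std::vector<Pt>& pts,
                                       std::size_t nxCuts, std::size_t nyCuts)
    {
        std::vector<double> xs, ys;
        for (const Pt& p : pts) { xs.push_back(p.x); ys.push_back(p.y); }
        return { quantileCuts(xs, nxCuts), quantileCuts(ys, nyCuts) };
    }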
Significance: Our approach can speed up user retrieval of subsets by many orders of magnitude, depending on data set size. Because the database is file based, the system can run on any UNIX platform supporting g++. The utility of this software will increase when the workstation-based mass storage servers are installed at ISD, since our system can decrease the amount of data shipped from the servers to the clients. In addition, our system would be a logical component of the software for the MDO project, where application codes need subsets of data produced by other application codes.
Status/Plans: We will continue development of the prototype by first incorporating our bulk load algorithm and comparing it with the conventional load technique of inserting one tuple at a time. Incorporating the bulk load algorithm will allow the prototype to store the data portion of unstructured grids. Work remains to determine the best way to store the grid structure itself. Extension of the prototype to multiple-block structured grids and other formats depends on further funding.
Point of Contact: