Approach: Design and implement compiler and runtime system software that combines many small I/O requests into comparatively few, large physical I/O operations. In this way, the overhead of each physical I/O operation is amortized over many logical (user-program-level) I/O requests.
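As a rough illustration of the amortization argument, the sketch below compares total I/O time with and without request combining under a simple cost model (total time = number of calls times a fixed per-call overhead, plus bytes moved divided by bandwidth). The overhead, bandwidth, and request sizes are illustrative assumptions, not measurements from the project:

    /* Rough cost model: total time = calls * per-call overhead + bytes / bandwidth.
     * Combining 1000 small writes into one large write amortizes the overhead. */
    #include <stdio.h>

    int main(void)
    {
        const double overhead_s = 1e-3;     /* assumed 1 ms fixed cost per I/O call */
        const double bandwidth  = 50e6;     /* assumed 50 MB/s aggregate bandwidth  */
        const long   n_small    = 1000;     /* 1000 logical requests...             */
        const long   small_sz   = 4096;     /*   ...of 4 KB each                    */
        const double bytes      = (double)n_small * small_sz;

        double t_separate = n_small * overhead_s + bytes / bandwidth; /* one call per request */
        double t_combined = 1       * overhead_s + bytes / bandwidth; /* one combined call    */

        printf("separate: %.3f s, combined: %.3f s\n", t_separate, t_combined);
        return 0;
    }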
Accomplishments: This is a new project that began late in FY96. During this period a preliminary implementation of collective I/O software was developed on Caltech's Intel Paragon, built on the Parallel File System (PFS) portion of the Paragon OS. PFS achieves I/O parallelism through data striping and access modes. Only a preliminary evaluation of this implementation has been carried out, using a simple benchmark program. It was found that in some cases collective I/O resulted in an order-of-magnitude increase in I/O performance.
However, there is an interplay between collective I/O and the PFS operations, so the I/O cost of the collective I/O routines depends on the file striping characteristics. Here, I/O cost means the actual time for reading or writing data in parallel. Collective I/O improves performance by minimizing the number of I/O calls made by the processors to the I/O servers; each I/O call is split into one or more disk calls, which are then serviced by individual servers in parallel. Let R be the request size of an I/O call, D the number of disks, and S the stripe size. If R >= D*S, the collective and non-collective routines require the same number of disk calls, and the two versions give comparable performance. For example, on our Intel Paragon, with D = 64 and S = 64 KB, if R >= 4 MB the collective and non-collective I/O take similar time to access data. But if 4 processors each request R = 64 KB, the combined collective request has size 4 * 64 KB = 256 KB; in this case collective I/O performs better because the request can be serviced by 4 servers simultaneously.
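The relationship among R, D, and S can be made concrete with a small sketch. The helper below estimates how many I/O servers a single request touches, assuming contiguous, stripe-aligned accesses; the function name and the driver reproduce the example above but are illustrative only, not part of the PFS interface:

    /* Estimate how many I/O servers a contiguous, stripe-aligned request touches. */
    #include <stdio.h>

    static long servers_touched(long request_bytes, long stripe_bytes, long num_disks)
    {
        long stripes = (request_bytes + stripe_bytes - 1) / stripe_bytes; /* ceil(R/S) */
        return stripes < num_disks ? stripes : num_disks;                 /* at most D */
    }

    int main(void)
    {
        const long D = 64, S = 64 * 1024;   /* 64 disks, 64 KB stripe size   */
        const long R = 64 * 1024;           /* each processor requests 64 KB */
        const int  P = 4;                   /* 4 processors                  */

        /* Non-collective: P separate calls, each serviced by a single server. */
        printf("non-collective: %d calls, %ld server(s) per call\n",
               P, servers_touched(R, S, D));

        /* Collective: one combined call of size P*R, serviced by P servers in parallel. */
        printf("collective:     1 call, %ld server(s) in parallel\n",
               servers_touched((long)P * R, S, D));

        /* Once the combined request reaches D*S (4 MB here), both versions
         * touch all D servers and their performance converges. */
        return 0;
    }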
Significance: This research promises to lead to system software on future high-performance systems that will perform parallel I/O much more efficiently than at present while allowing users to continue programming I/O operations in a manner natural to their applications. Although some knowledge of the underlying hardware and software configuration will be necessary to obtain the biggest gains, the collective I/O approach will shield users from most of the low-level details.
Status/Plans: The design of the collective I/O software will be refined by carrying out many experiments with real applications and different parameter values. In addition, features will be added to support strided accesses to data; strided accesses are a very common pattern in scientific and engineering applications.
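As an illustration of why strided access matters, the sketch below shows one processor reading a single column of a row-major matrix from a shared file; each row contributes one small, non-contiguous request, which is exactly the kind of pattern collective I/O is intended to combine. The code is illustrative only and does not reflect the planned interface:

    /* Illustrative strided read: processor `rank` reads its own column of a
     * row-major nrows x ncols matrix of doubles, issuing one small request per row. */
    #include <stdio.h>

    void read_column(FILE *fp, double *col, long nrows, long ncols, int rank)
    {
        for (long i = 0; i < nrows; i++) {
            long offset = (i * ncols + rank) * (long)sizeof(double); /* stride = ncols elements */
            fseek(fp, offset, SEEK_SET);
            fread(&col[i], sizeof(double), 1, fp);   /* one tiny request per row */
        }
    }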
Point of Contact:
Dr. Paul Messina
Center for Advanced Computing Research
California Institute of Technology
(818) 395-3907