Introduction
The FLASH I/O benchmark routine measures the performance
of the FLASH parallel HDF 5 output. It recreates the primary
data structures in FLASH and produces a checkpoint file,
a plotfile with centered data, and a plotfile with corner
data. The plotfiles have single precision data.
The purpose of this routine is to tune the I/O performance in
a controlled environment. The I/O routines are identical to
the routines used by FLASH, so any performance improvements
made to the benchmark program will be shared by FLASH.
Makefiles for the ASCI Red (TFLOPS) machine at SNL, ASCI Blue
Pacific / Frost at LLNL, and SGI platforms are included. Information
on the performance and difficulties encountered with parallel HDF 5
will be posted on this page.
Current Issues
Below are some issues that came up at a meeting with the NCSA HDF
developers. These are issues that we hope to address with the
FLASH I/O benchmark program.
The HDF team suggested that setting the data transfer property list to use
collective I/O (as is done on Red) will perform the necessary compound
object creation; the Pablo instrumentation should confirm whether this is
indeed true. Additionally, there are a few parameters that we do not yet know how to
set -- one is telling ROMIO that 2-phase I/O might be a good idea.
This is most likely controlled by creating an MPI_Info object and
passing it through HDF 5 to the MPI_File_open call. We do not yet
know which hints to set, though this should be easy to find out. Passing
an info object to the underlying MPI-IO layer is already done in
the HDF 5 module in FLASH when we compile on TFLOPS. The second
parameter is the number of nodes that do the writing.
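As a rough illustration, here is a hedged sketch of how these pieces might fit
together in the HDF 5 C API. The hint names and values ("romio_cb_write",
"cb_nodes", "8") and the file name are illustrative guesses, not settings we
have verified on any of these platforms.

    #include <mpi.h>
    #include <hdf5.h>

    /* Hedged sketch: create an MPI_Info object carrying hints for the
       MPI-IO layer and request collective transfers. */
    void open_with_hints(hid_t *file, hid_t *xfer)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");  /* suggest 2-phase (collective buffering) writes */
        MPI_Info_set(info, "cb_nodes", "8");             /* number of nodes that do the writing */

        /* the info object is handed through HDF 5 to MPI_File_open via
           the file access property list */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
        *file = H5Fcreate("flash_io_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* data transfer property list set to collective I/O, as on Red */
        *xfer = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(*xfer, H5FD_MPIO_COLLECTIVE);

        H5Pclose(fapl);
        MPI_Info_free(&info);
    }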
Summary of a meeting with Richard Hedges and company of the parallel I/O
project at LLNL this past week to discuss I/O performance of FLASH on ASCI
Blue Pacific:
An older version of the library used to work fine with IBM MPI,
so it should be possible, in theory, to get around this problem.
There is no plan to provide a version of the library linked with
MPICH (which does not show this bug).
(7-1-01) This problem has been fixed. Blue
Pacific has been upgraded to the Mohonk version of the IBM MPI
library. Adding -bmaxdata:0x80000000 to the end of the link line
also helps fix this problem. I have had no problems on Blue
Pacific creating files as large as 4 GB.
The tracing that Richard performed showed that alignment was
hurting us. He suggested that we could get a 2-4x speedup if the
data were properly aligned.
According to
the HDF 5 1.5.6 Release Notes, this problem has been recognized
by the developers and fixed. I have not yet had the chance to
test this.
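For reference, the library does expose an alignment control on the file
access property list; a minimal sketch follows. The 512 KB threshold and
4 MB alignment are made-up example values, not tuned settings, and we have
not verified that this actually helps on Blue Pacific.

    /* hedged sketch: set on the file access property list before
       H5Fcreate/H5Fopen; objects at least as large as the threshold
       are aligned on the given boundary */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_alignment(fapl, 524288, 4194304);   /* threshold (bytes), alignment (bytes) */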
We pay a cost for each I/O write, and some of these records
contain only one number (a single real) to be written. Packing
these manually, or somehow instructing the library to
buffer a bunch of records before writing to disk, may help the
performance.
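A minimal sketch of the packing idea, using an HDF 5 compound type so that
several scalars go to disk in a single write. The struct members and dataset
name below are hypothetical, not the actual FLASH header records, and "file"
is assumed to be an already-open file identifier.

    /* hypothetical scalar records, each formerly written as its own
       one-element dataset */
    typedef struct sim_params_t {
        double time;
        double dt;
        int    nstep;
    } sim_params_t;

    void write_packed_header(hid_t file, const sim_params_t *params)
    {
        hid_t ptype = H5Tcreate(H5T_COMPOUND, sizeof(sim_params_t));
        H5Tinsert(ptype, "time",  HOFFSET(sim_params_t, time),  H5T_NATIVE_DOUBLE);
        H5Tinsert(ptype, "dt",    HOFFSET(sim_params_t, dt),    H5T_NATIVE_DOUBLE);
        H5Tinsert(ptype, "nstep", HOFFSET(sim_params_t, nstep), H5T_NATIVE_INT);

        hid_t space = H5Screate(H5S_SCALAR);
        hid_t dset  = H5Dcreate(file, "simulation parameters", ptype, space, H5P_DEFAULT);

        /* one write replaces several one-number writes */
        H5Dwrite(dset, ptype, H5S_ALL, H5S_ALL, H5P_DEFAULT, params);

        H5Dclose(dset);
        H5Sclose(space);
        H5Tclose(ptype);
    }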
Richard is going to look into fixes for the memory bug that is preventing
us from writing from a large number of processors. I will look into
packing the first few records into a single record. Finally, we need to
figure out why the alignment is not working. Apparently, Kim Yates had
it working in an earlier version of the library, and saw good results, but
at some point (~ HDF 5 1.2.2?) it stopped working.
View the README file included with the
I/O benchmark.
Download
Download the benchmark routine
Currently, this benchmark routine uses the following libraries
on the ASCI platforms:

Platform             HDF library        Compiler version                               MPI version
ASCI Red             1.4.0 (parallel)   FORTRAN: if90 Rel 3.1-4i; C: icc Rel 3.1-4i    MPICH 1.2.1
ASCI Blue Pacific    1.4.1              FORTRAN: newmpf90; C: newmpcc                  IBM MPI
Frost                1.4.1              FORTRAN: newmpf90; C: newmpcc                  IBM MPI (Mohonk)

History
Fixed a missing dimension definition (N_DIM) in the C routines
that was causing some of the smaller records to be defined with the
wrong dimensions.
Packed several of the header records into a single HDF 5 compound
object -- this reduces the number of writes required to store the
data.
added Rob and Dan's Chiba City specific code.
eliminated the hyperslab selection on the memory space for the
unknowns records -- this procedure was very slow, especially on
Chiba City. The interiors of the AMR blocks are now extracted
via a direct memory copy into a buffer array set up in the FORTRAN
routines. This buffer is then passed on to the HDF 5 routines.
changed the Red build to use the release version of the library.
updated the benchmark routine to include plotfiles with and
without corner data.
added some dataset chunking calls. I don't yet know how big of an
effect (if any) these have. Chunking can be enabled by setting
the CHUNK preprocessor directive.
added some MPI_Info hints for the TFLOPS platform.
added platform dependent code, delimited via preprocessor
directives. TFLOPS is the ASCI Red platform,
IBM is ASCI Blue Pacific, and SGI is
for a generic SGI machine.
initial version of the HDF 5 v. 1.4 I/O benchmark program
for FLASH.