Introduction
The FLASH I/O benchmark routine measures the performance
of the FLASH parallel HDF 5 output. It recreates the primary
data structures in FLASH and produces a checkpoint file,
a plotfile with centered data, and a plotfile with corner
data. The plotfiles have single precision data.
The purpose of this routine is to tune the I/O performance in
a controlled environment. The I/O routines are identical to
the routines used by FLASH, so any performance improvements
made to the benchmark program will be shared by FLASH.
Makefiles for the ASCI Red (TFLOPS) machine at SNL, ASCI Blue
Pacific / Frost at LLNL, and SGI platforms are included. Information
on the performance and difficulties encountered with parallel HDF 5
will be posted on this page.
Current Issues
Below are some issues that came up at a meeting with the NCSA HDF
developers. These are issues that we hope to address with the
FLASH I/O benchmark program.
The HDF team suggested that setting the data transfer property list to use
collective I/O (as is done on Red) will perform the necessary compound
object creation; the Pablo instrumentation should confirm whether this is
indeed true. Additionally, there are a few parameters that we do not yet know how to
set -- one is telling ROMIO that 2-phase I/O might be a good idea.
This is most likely controlled by creating an MPI_Info object and
passing it through HDF 5 to the MPI_File_open call. We do not yet
know which hints to set, though this should be easy to find out. Passing
an info object to the underlying MPI-IO layer is already done in
the HDF 5 module in FLASH when we compile on TFLOPS. The second
parameter is the number of nodes that do the writing.
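As a rough illustration, here is a hedged sketch of how these pieces might fit
together in the HDF 5 C API. The hint names and values ("romio_cb_write",
"cb_nodes", "8") and the file name are illustrative guesses, not settings we
have verified on any of these platforms.

    #include <mpi.h>
    #include <hdf5.h>

    /* Hedged sketch: create an MPI_Info object carrying hints for the
       MPI-IO layer and request collective transfers. */
    void open_with_hints(hid_t *file, hid_t *xfer)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");  /* suggest 2-phase (collective buffering) writes */
        MPI_Info_set(info, "cb_nodes", "8");             /* number of nodes that do the writing */

        /* the info object is handed through HDF 5 to MPI_File_open via
           the file access property list */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
        *file = H5Fcreate("flash_io_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* data transfer property list set to collective I/O, as on Red */
        *xfer = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(*xfer, H5FD_MPIO_COLLECTIVE);

        H5Pclose(fapl);
        MPI_Info_free(&info);
    }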
Summary of a meeting with Richard Hedges and company of the parallel I/O
project at LLNL this past week to discuss I/O performance of FLASH on ASCI
Blue Pacific:
An older version of the library used to work fine with IBM MPI,
so it should be possible, in theory, to get around this problem.
There is no plan to provide a version of the library linked with
MPICH (which does not show this bug).
(7-1-01) This problem has been fixed. Blue
Pacific has been upgraded to the Mohonk version of the IBM MPI
library. Adding -bmaxdata:0x80000000 to the end of the link line
also helps fix this problem. I have had no problems on Blue
Pacific creating files as large as 4 GB.
The tracing that Richard performed showed that alignment was
hurting us. He suggested that we could get a 2-4x speedup if the
data were properly aligned.
According to
the HDF 5 1.5.6 Release Notes, this problem has been recognized
by the developers and fixed. I have not yet had the chance to
test this.
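For reference, the library does expose an alignment control on the file
access property list; a minimal sketch follows. The 512 KB threshold and
4 MB alignment are made-up example values, not tuned settings, and we have
not verified that this actually helps on Blue Pacific.

    /* hedged sketch: set on the file access property list before
       H5Fcreate/H5Fopen; objects at least as large as the threshold
       are aligned on the given boundary */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_alignment(fapl, 524288, 4194304);   /* threshold (bytes), alignment (bytes) */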
We pay a cost for each I/O write, and some of these records
contain only one number (a single real) to be written. Packing
these manually, or somehow instructing the library to
buffer a bunch of records before writing to disk, may help the
performance.
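A minimal sketch of the packing idea, using an HDF 5 compound type so that
several scalars go to disk in a single write. The struct members and dataset
name below are hypothetical, not the actual FLASH header records, and "file"
is assumed to be an already-open file identifier.

    /* hypothetical scalar records, each formerly written as its own
       one-element dataset */
    typedef struct sim_params_t {
        double time;
        double dt;
        int    nstep;
    } sim_params_t;

    void write_packed_header(hid_t file, const sim_params_t *params)
    {
        hid_t ptype = H5Tcreate(H5T_COMPOUND, sizeof(sim_params_t));
        H5Tinsert(ptype, "time",  HOFFSET(sim_params_t, time),  H5T_NATIVE_DOUBLE);
        H5Tinsert(ptype, "dt",    HOFFSET(sim_params_t, dt),    H5T_NATIVE_DOUBLE);
        H5Tinsert(ptype, "nstep", HOFFSET(sim_params_t, nstep), H5T_NATIVE_INT);

        hid_t space = H5Screate(H5S_SCALAR);
        hid_t dset  = H5Dcreate(file, "simulation parameters", ptype, space, H5P_DEFAULT);

        /* one write replaces several one-number writes */
        H5Dwrite(dset, ptype, H5S_ALL, H5S_ALL, H5P_DEFAULT, params);

        H5Dclose(dset);
        H5Sclose(space);
        H5Tclose(ptype);
    }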
Richard is going to look into fixes for the memory bug that is preventing
us from writing from a large number of processors. I will look into
packing the first few records into a single record. Finally, we need to
figure out why the alignment is not working. Apparently, Kim Yates had
it working in an earlier version of the library, and saw good results, but
at some point (~ HDF 5 1.2.2?) it stopped working.
View the README file included with the
I/O benchmark.
Download
Download the benchmark routine
Currently, this benchmark routine uses the following libraries
on the ASCI platforms:

Platform             HDF library        Compiler version                               MPI version
ASCI Red             1.4.0 (parallel)   FORTRAN: if90 Rel 3.1-4i; C: icc Rel 3.1-4i    MPICH 1.2.1
ASCI Blue Pacific    1.4.1              FORTRAN: newmpf90; C: newmpcc                  IBM MPI
Frost                1.4.1              FORTRAN: newmpf90; C: newmpcc                  IBM MPI (Mohonk)

History
Fixed a missing dimension definition (N_DIM) in the C routines
that was causing some of the smaller records to be defined with the
wrong dimensions.
Packed several of the header records into a single HDF 5 compound
object -- this reduces the number of writes required to store the
data.
added Rob and Dan's Chiba City specific code.
eliminated the hyperslab selection on the memory space for the
unknowns records -- this procedure was very slow, especially on
Chiba City. The interiors of the AMR blocks are now extracted
via a direct memory copy into a buffer array set up in the FORTRAN
routines. This buffer is then passed on to the HDF 5 routines.
changed the Red build to use the release version of the library.
updated the benchmark routine to include plotfiles with and
without corner data.
added some dataset chunking calls. I don't yet know how big of an
effect (if any) these have. Chunking can be enabled by setting
the CHUNK preprocessor directive.
added some MPI_Info hints for the TFLOPS platform.
added platform dependent code, delimited via preprocessor
directives. TFLOPS is the ASCI Red platform,
IBM is ASCI Blue Pacific, and SGI is
for a generic SGI machine.
initial version of the HDF 5 v. 1.4 I/O benchmark program
for FLASH.