FITS headers map very well to records in a traditional RDBMS table. An archive of millions of FITS headers would not strain the limits of today's database engines; with proper maintenance and indexing, very rapid and intelligent searches on million-record tables are routinely done throughout the information systems world today. Now all we have to do is include in each record a reference to the location of the corresponding image file, or adequate information to instruct a jukebox to retrieve the file.
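As a minimal sketch of the idea (using SQLite for brevity; the table and column names here are purely illustrative, not a proposed schema), each record holds a few indexed header keywords plus enough information to locate the image in the jukebox:

```python
import sqlite3

# Illustrative header-index table: a handful of FITS keywords plus
# the jukebox volume and path needed to retrieve the image file.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE fits_headers (
        object   TEXT,   -- OBJECT keyword
        dateobs  TEXT,   -- DATE-OBS keyword
        exptime  REAL,   -- EXPTIME keyword, seconds
        volume   TEXT,   -- which jukebox volume holds the image
        path     TEXT    -- file location within that volume
    )
""")
con.execute(
    "INSERT INTO fits_headers VALUES (?,?,?,?,?)",
    ("NGC 4565", "1995-04-12", 600.0, "CD0042", "/images/d0412_0017.fits"))

# With an index, searches over million-record tables are routine.
con.execute("CREATE INDEX idx_object ON fits_headers (object)")
rows = con.execute(
    "SELECT volume, path FROM fits_headers WHERE object = ?",
    ("NGC 4565",)).fetchall()
print(rows)  # → [('CD0042', '/images/d0412_0017.fits')]
```

The query result is exactly what a retrieval front-end (or a jukebox robot) needs: the volume to mount and the file to fetch.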
Suppose that we can really acquire, say, 26GB of archive-worthy data every night (which won't actually happen, but we should plan for the worst case), and that DEIMOS is in use one quarter of the time (since there are 4 instruments planned for Keck-II). If in addition every night of the year is perfect, we have 91 nights of DEIMOS observing per year, and we end up with 2366GB of image data to archive every year -- a daunting prospect.
Realistically, Keck observing logs indicate that only about 38% of telescope time is spent actually acquiring images with Keck I. If we assume that we will do a little better with Keck II and make good use of at least 50% of our observing time, then we can reduce our yearly acquired data volume to 1200GB. We can further estimate that some nights -- perhaps a fifth of all available nights -- are lost to bad weather and engineering. 4/5 of 1200GB brings us down to 960GB, a little more than a third of our original estimate.
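The back-of-envelope arithmetic above can be summarized in a few lines (all figures are the estimates from the text, not measurements):

```python
# Yearly archive-volume estimates for DEIMOS on Keck-II.
nightly_gb = 26           # worst-case archive-worthy data per night
nights     = 365 // 4     # DEIMOS's quarter share of nights -> 91

worst_case = nightly_gb * nights      # every night perfect -> 2366 GB
efficient  = worst_case * 0.50        # only ~50% of time on sky -> ~1200 GB
realistic  = 1200 * 4 / 5             # lose 1/5 of nights -> 960 GB

print(worst_case, round(efficient), realistic)
```

The three figures (2366GB, roughly 1200GB, and 960GB) are the worst-case, efficiency-adjusted, and weather-adjusted estimates used in the discussion that follows.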
In either case, it is unreasonably costly to archive this quantity of data on traditional hard drives: the acquisition of the equipment, though appallingly expensive, would be less of a difficulty than the maintenance of such an enormous disk farm. The only technology capable of dealing economically with this volume of data is a database plus an archival-media jukebox, and today's more affordable jukebox technology is not capable of handling our worst-case volume of data. Our more optimistic estimate, however, is within the reach of today's tools.
A tape jukebox using high-speed, high-density cartridge tapes such as the DEC (Quantum) DLT may seem like a better option today, since the cartridges hold up to 40GB and provide relatively rapid seek times (as compared to other tape transports such as Exabyte). Still, the largest jukebox I have seen for DLT cartridges held only 10 cartridges (400GB), only a slight improvement over CDROM jukes in storage space. The seek time is not rapid when compared to CDROM media, nor are these jukeboxes cheap ($10-15K); and the media are more fragile than CDROM. For a long-term historical archive we should choose the most robust, damage-resistant media possible.
However, we must assume that no choice of media and format, however optimal at the time, will endure for the (indefinite) lifetime of our archive. At some point the technology will become obsolete or the volume of our data will exceed its practical capacity; we will then have to "migrate" the data to a new medium and format. This expense (of periodic migration) must be accepted and planned for, if an archive of lasting historical value is our aim.
If we acknowledge that any technology choice for archiving will have at most a 5-year lifetime in practice, then a CDROM jukebox is probably the optimal choice for the first generation. The 5-inch CDROM format is as close to universal as the industry offers; CDROMs are robust, and the robotics to handle them are well understood and mass-produced. The initial cost of a 500-CD jukebox of good quality is on the order of $25-30K.
How, then, does the archiving concept interact with the need for reliable backups of each night's observing? It seems foolish to run two completely separate software packages; if we assume that all images are archived and that all image headers are stored in our database, we have already taken care of all backup issues. One problem remains: the duplication of media (obviously the backup media want to remain on Mauna Kea, whereas the archival jukebox/WWW-server is likely to be elsewhere in Hawaii or even on the mainland).
Copying very large CDROMs may be time-consuming, and although it could conceivably be done during the day, daytime downtime and network interruptions could prevent the duplication. Ideally we would like to make two identical copies on the spot, during observing, with no additional use of network bandwidth or human time. If we can acquire hardware that duplicates signals to two identical SCSI devices, making them look like one device, then this is a practical and ideal strategy. We need to investigate the existence and/or cost of such hardware (effectively RAID mirroring for CDROM).
de@ucolick.org