Requirements for ICE Radio Data Archival and Storage

Because the IBM will not be with us for much longer, it is imperative that we retain the ICE data set by transferring it to a medium other than the 3480 tape cartridges on which it now resides. There are only a few alternatives for media: CD-ROM and 4mm tape. Both have the advantage of being easy to store, but CD-ROM retrieval is much faster and the medium is more durable. In either case, our long-term view is that this archive should have a useful lifetime of 10-15 years; we would expect to transfer the data 10+ years from now to whatever is the optimal medium at that time, given that our computer equipment is likely to change radically by then. So we aren't looking for eternity in our storage technology or protocol. With that in mind, the following are the requirements.

Which data to transfer?

We will transfer the 1.5s modulated data, the 1.5s despun data, and the 108s averaged data onto CD-ROMs, each of which will contain 2 months of the high bit-rate portion of the mission (i.e. approximately 10 years). Subsequent years will have the EDR, the 1.5s modulated data, and the 108s averages; we have no despun data during low bit rates. We will also make separate disks of the datapool and 30m averaged data - one CD should cover the entire mission for each of these.

We will not transfer the original EDRs for the high bit-rate period. Our reasoning is as follows:

a. The data set is large and would increase the number of CDs by a considerable fraction.

b. The programs used to process high bit-rate EDRs are heavily dependent on the IBM, using IBM-specific coding practices and system resources, and would be time-consuming to convert to the VAX.

c. The derived data sets are well proven, and the 1.5s modulated data provides an excellent basis for any further analysis, since no information is lost from the EDR, merely translated into a more convenient form.
Data on the CD will be compressed using an efficient compression utility. This utility will be available (uncompressed!) on the CD itself in case it ever disappears off our system or isn't available on another.

Ensuring data quality

We are very concerned that the data put on the CD-ROMs be identical to that maintained on the IBM. With that in mind, we have established the following requirements for data validation:

a. Continuity: We will compare graphically the data copied from 3480 tape onto IBM disk with the original daily dynamic spectra to ensure that the same coverage is present. This ensures that the copy jobs did not abort early due to tape errors. A representative pair of dynamic spectra and continuity plot are attached. The data file generated to do this, once checked, will form the yardstick for subsequent automatic comparison steps.

Three files are copied from the IBM to the VAX for each 24-hour period: the binary file containing the data for the given day, an ASCII file for the same day containing the date and time in milliseconds, and an ASCII file containing selected records (every 100th record for the 108s data and every 1000th record for the 1.5s modulated and despun data). The IBM binary file is converted to VAX binary, and the two ASCII files are regenerated from the file on the VAX. The two sets of ASCII files, one set created on the IBM and the other created on the VAX, are compared to ensure that the VAX binary file is identical to the file on the IBM. The VAX binary file is then compressed using the GZIP utility and transferred to the SUN. An image of the data files on the SUN is created for copying to CD-ROMs. Once the first CD of the set is made, its contents will be read on the VAX and a file of dates and times made, which will be compared one-to-one with the file made originally on the IBM. Only if they are identical will the remaining three CDs be made.
Note that, apart from the first visual comparison with the dynamic spectra, all comparisons are done by the computer and not by eye.

b. Integrity: When the data are copied onto IBM disk from the 3480 cartridges, an ASCII file is created that contains selected records from each day of data. This is the benchmark against which the subsequent copies are measured. Since we know that all records have been copied (from the continuity verification above), our concern is that the contents of the records are the same. We check the selected records again after the copy to the SUN and again after we make the first CD. If all copies are identical, the chance that the remaining records in the file are any different is very small.

Automation

The only manual process in the transfer/validation portion of the exercise is the comparison of the data presence plot (an example is attached) with the original dynamic spectra. That step validates the initial copy and allows us to feel comfortable doing subsequent comparisons automatically. Software checks the continuity and integrity of the data as well as determining its presence on each machine. The jobs are scheduled to run overnight to take advantage of lower rates for computer time. We are, at this point, a little unsure whether we can meet the design rate of two months of data per night, because system work on the IBM during the night sometimes impacts the running of our jobs and the operation of the network; a little more experience is required here. Assuming we can outfox the IBM, we should be able to transfer data and create 4 CD copies each working day. The suggestion is that 2 copies remain here (in Bldg 1 and 2 respectively), one goes to France (Sang Hoang), and one to NSSDC. Apart from the manipulation of the CDs themselves, the only input required of the operator is the dates to be copied and the IBM job class.
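The integrity spot-check rests on sampling every Nth fixed-length record from a day file. A rough sketch of that sampling step is below; the record length is an illustrative assumption (the actual record layouts are defined on the IBM side), but the sampling intervals are the ones given above:

```python
def select_records(data, record_len, every):
    """Yield (record_number, record_bytes) for every `every`-th
    fixed-length record in a day file, mirroring the spot-check
    listings: every 100th record for the 108s averages, every
    1000th for the 1.5s modulated and despun data.
    """
    n_records = len(data) // record_len
    for i in range(0, n_records, every):
        yield i, data[i * record_len:(i + 1) * record_len]
```

Running this once on the IBM-derived bytes, once on the VAX copy, and once on the first CD, and comparing the three listings, gives high confidence that the full files match without re-reading every byte three times.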
If the jobs do not complete successfully, the operator can identify those remaining, and where they failed, and resubmit the appropriate steps automatically.

Data Representation

Data on the CD-ROM are written in VAX binary representation. We considered using the IEEE standard binary representation, but we would then have trouble reading the CD on the VAX and some other possible platforms. Because we don't expect heavy use of the data set, we don't anticipate much extra effort, should the data be needed on other platforms, in converting them on the VAX to ASCII for onward transmission. The CD will contain simple programs for reading the unformatted data files and making formatted ones. This document will also be on the CD, with an AAREADME file containing the basic information required to handle the files.

Level of Effort and Cost

We will use approximately 300 CDs at a cost of ~$15 each, for a total of $4500 in materials. Brent Ignacio will spend a substantial portion of his half-time effort on this, at a cost of about $2400. In addition, Almaz and I have spent a substantial portion of our effort over the last few weeks preparing the automatic transfer processes (it is quite a complicated venture), but this doesn't really represent any extra cost - during this time we have maintained our other Ulysses duties as required. We have arranged to use existing LEP equipment, so there is no added expenditure there.

To do the same thing with 4mm tape would have cost about $2500 less (tapes are approximately the same cost as CDs but hold twice the data), assuming we would also have compressed the data. But 4mm tape is a less reliable medium and considerably slower to read, and we would probably have had to refresh the tapes after 5 or so years. The CDs will last the full 10 design years (and a lot longer if we don't decide to upgrade on general principles). An additional bonus is that we are acquiring valuable expertise for a similar venture to archive the Ulysses data set.
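As an illustration of the kind of "simple program" mentioned under Data Representation, the sketch below shows how a non-VAX platform could decode a single VAX F_floating value into a native float. The byte layout follows the published F_floating format (word-swapped 16-bit little-endian words, hidden-bit fraction of the form 0.1f, exponent bias 128); this is a sketch for orientation, not one of the programs that will actually ship on the CD:

```python
import struct

def vax_f_to_ieee(raw):
    """Decode one 4-byte VAX F_floating value to a Python float.

    The 32-bit pattern is stored as two little-endian 16-bit words
    with the high-order word first, so the bytes must be word-swapped
    before the sign/exponent/fraction fields can be extracted.
    """
    w_high, w_low = struct.unpack('<HH', raw)
    bits = (w_high << 16) | w_low
    exp = (bits >> 23) & 0xFF
    if exp == 0:
        return 0.0            # exponent 0, sign 0 is true zero on the VAX
    sign = -1.0 if bits & 0x80000000 else 1.0
    frac = bits & 0x7FFFFF
    # fraction is 0.1fff... with a hidden leading 1; bias is 128
    return sign * (0.5 + frac / float(1 << 24)) * 2.0 ** (exp - 128)
```

Note that an IEEE single carries the same 23 fraction bits, so the conversion is exact; only the exponent convention and byte order differ.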