what file format for long term archiving?

Discussion forum for Amadeus users

Moderator: Martin Hairer

What is the best file format for long-term archiving?

AIFF: 1 vote (100%)
Amadeus: 0 votes
Apple Lossless: 0 votes
FLAC: 0 votes
MP3: 0 votes
WAV: 0 votes

Total votes: 1

chetstone
Posts: 8
Joined: Sun May 17, 2009 2:10 am
Location: Crestone, CO, USA

what file format for long term archiving?

Post by chetstone »

FLAC seems to have the best compression of the lossless formats,
but what are the chances I will be able to decode the file 30 years from now?

In general I would think an open source format like FLAC would be more trustworthy over the long run than a proprietary format like Apple Lossless,
but I did have one freaky experience: I saved and closed a FLAC file, then tried to open it again, and Amadeus reported an error saying it couldn't open the file. Quitting Amadeus and starting it again fixed the problem, and I couldn't reproduce it.

rfwilmut
Posts: 255
Joined: Fri Nov 17, 2006 1:19 pm

Re: what file format for long term archiving?

Post by rfwilmut »

There can be no guarantees about this - just think what things were like 30 years ago - but the safest formats would be the most common - WAV and Audio CD. AIFF is more specialized; you should never archive in a compressed format; and with all due respect to Martin, Amadeus format would be a poor choice - who knows whether the program will be usable in 30 years.

Far more problematic is the medium on which you store your files. Recordable CDs and DVDs may only have a life of a few years - it varies widely, and they haven't really been around long enough to make assessing the situation reliable. For data storage magneto-optical may be the most stable, so it's a pity it's fallen completely out of computer use, and Sony MiniDiscs (the same technology) are pretty well dead (Sony no longer make the hifi machines, only the portables). Computer hard disks, left unconnected and stored carefully (and run up very occasionally to prevent stiction), may be fairly safe: magnetic images seem pretty stable under good conditions (I have audio tape recordings I made fifty years ago). With audio tape the weakest link is the binder which holds the oxide onto the plastic base; I don't know how mechanically stable hard disks are, but at least they're sealed.

But things change so fast that it's a genuine concern. Already some formats, both physical and file formats, are almost impossible to reproduce. We progress backwards - I have an original 78rpm record made 113 years ago and I can play it perfectly well.

Sonic Purity
Posts: 82
Joined: Sat Nov 10, 2007 11:58 pm
Location: Pasadena, California, U.S.A.

Re: what file format for long term archiving?

Post by Sonic Purity »

Excellent points about the medium.
rfwilmut wrote:There can be no guarantees about this - just think what things were like 30 years ago - but the safest formats would be the most common - WAV and Audio CD. AIFF is more specialized; you should never archive in a compressed format; and with all due respect to Martin, Amadeus format would be a poor choice - who knows whether the program will be usable in 30 years.
First, CDDA, standard Red Book audio CD format, lacks 100% error correction (the player is instructed to interpolate or mute what it cannot correct), making CDDA a poor archival format choice.

Second, AIFF is not more specialized than WAV (WAVE): they’re basically both CDDA with a 100% error correction wrapper. The main difference is the byte order. It can be argued that since there have been more Windows systems sold and those prefer WAV, that format (variant) may have greater longevity. Yet AIFF has been with us since the late 1980s, used by, who?, Sun as well as Apple (if not Sun, another then-big company). Given that for a long time in the 1990s Apple pretty much had a lock on personal computer-based pro audio and AIFF or SDII (another variant of CDDA, basically) “owned” the file format space, i consider AIFF (or SDII for that matter) to be safe archival formats.
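
To make the byte-order point concrete, here is a minimal Python sketch. It is only an illustration, not taken from either format's specification: it packs the same signed 16-bit sample the way each container normally stores PCM, big-endian for AIFF and little-endian for WAV. The function name pack_sample is made up for this example.

```python
import struct

def pack_sample(value: int) -> tuple[bytes, bytes]:
    """Return one signed 16-bit PCM sample as (AIFF-style, WAV-style) bytes."""
    big_endian = struct.pack(">h", value)     # AIFF: high byte first
    little_endian = struct.pack("<h", value)  # WAV: low byte first
    return big_endian, little_endian

aiff_bytes, wav_bytes = pack_sample(0x1234)
print(aiff_bytes.hex(), wav_bytes.hex())  # prints "1234 3412"
```

The sample value is identical in both cases; only the order of the two bytes on disk differs, which is why converting between the two formats loses nothing.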

(I didn’t vote in the poll because there is no one best format choice: i consider any non-lossy-compressed formats to be a good archival choice, especially open or openly documented, freely licensed, and popular proprietary formats [if such exist, and i don’t know licensing details… making a guess about what AIFF might be].)
))Sonic((

rfwilmut
Posts: 255
Joined: Fri Nov 17, 2006 1:19 pm

Re: what file format for long term archiving?

Post by rfwilmut »

Sonic Purity wrote:First, CDDA, standard Red Book audio CD format, lacks 100% error correction (the player is instructed to interpolate or mute what it cannot correct), making CDDA a poor archival format choice.
My understanding is that the error correction bits allow first-level errors to be corrected accurately: larger errors cause interpolation and even larger ones muting. Also the interleaving reduces the size of errors by spreading them about. By contrast a data disk containing WAV or AIFF contains no error correction.

Sonic Purity
Posts: 82
Joined: Sat Nov 10, 2007 11:58 pm
Location: Pasadena, California, U.S.A.

Re: what file format for long term archiving?

Post by Sonic Purity »

rfwilmut wrote:
Sonic Purity wrote:First, CDDA, standard Red Book audio CD format, lacks 100% error correction (the player is instructed to interpolate or mute what it cannot correct), making CDDA a poor archival format choice.
My understanding is that the error correction bits allow first-level errors to be corrected accurately: larger errors cause interpolation and even larger ones muting. Also the interleaving reduces the size of errors by spreading them about.
Yes, exactly. I was being lazy with my typing. Your CDDA description is much more accurate, and i thank you for typing it out to keep the discussion clear.
rfwilmut wrote:By contrast a data disk containing WAV or AIFF contains no error correction.
Not correct. The 100% error correction comes from the file system. Consider this: an error of a single byte to an application program file can convert the application from fully functional to fully nonfunctional. Nothing less than 100% error correction is acceptable for computer data file systems. If one does a Finder copy from one data volume to another and it cannot be done byte-perfect, Finder puts up an error message.

Put another way, the error correction is being handled as part of the file system rather than intrinsic to the file format.

Apple does not allow CDDA files to exist as such on Macs. Standard CDDA audio CDs have their contents extracted to AIFF, during which process the CDDA error correction, then interpolation, then if necessary muting is used.

I keep CDDA discs for playing. I keep my master audio files for those discs and for other audio projects as AIFF (or FLAC or Apple Lossless), on HFS+ volumes, with 100% error correction and no lossy compression. I can copy them over and over, as data files on a decent file system, and have them be byte-identical. This cannot be done with CDDA, which is read slightly differently every time the (same) disc is run, if there is even a slight error exceeding what the CDDA format can fully correct itself.
))Sonic((

rfwilmut
Posts: 255
Joined: Fri Nov 17, 2006 1:19 pm

Re: what file format for long term archiving?

Post by rfwilmut »

Sonic Purity wrote:The 100% error correction comes from the file system. Consider this: an error of a single byte to an application program file can convert the application from fully functional to fully nonfunctional. Nothing less than 100% error correction is acceptable for computer data file systems. If one does a Finder copy from one data volume to another and it cannot be done byte-perfect, Finder puts up an error message.

Put another way, the error correction is being handled as part of the file system rather than intrinsic to the file format.
But if there is an error, the correct value can only be restored if there are correction bits added to each byte, from which the error can be calculated and the correct value restored - this is how audio CDs handle first-level errors. You say that in the event of an error in data storage the error is flagged - but are you saying that the error can actually be corrected? - meaning that there is indeed redundancy in the file storage?
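
As a small illustration of why correction needs redundancy, here is a Python sketch of a Hamming(7,4) code: three check bits are added to every four data bits, and that is enough to locate and repair any single flipped bit. This is a textbook stand-in chosen for brevity; audio CDs use a much stronger Reed-Solomon based scheme (CIRC), and the function names are invented for this example.

```python
def hamming74_encode(nibble: list[int]) -> list[int]:
    """Encode 4 data bits as 7 bits (positions 1..7: p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code: list[int]) -> list[int]:
    """Repair at most one flipped bit, then return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3   # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                                # simulate one read error
print(hamming74_decode(word))               # prints [1, 0, 1, 1]
```

Without the three extra bits the decoder could at best notice that something was wrong; with them it can work out which bit to flip back. Two errors in the same 7-bit word would defeat this particular code.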

Sonic Purity
Posts: 82
Joined: Sat Nov 10, 2007 11:58 pm
Location: Pasadena, California, U.S.A.

Re: what file format for long term archiving?

Post by Sonic Purity »

…Doing research… need to get my facts straight (and references to cite) before commenting further. Hoping to post a proper reply tomorrow…
))Sonic((

Martin Hairer
Site Admin
Posts: 1975
Joined: Wed Nov 08, 2006 11:49 am

what file format for long term archiving?

Post by Martin Hairer »

What the discussion shows is that, to some extent, choosing the right media
and file system is at least as important as choosing the right file
format. Plain uncompressed AIFF and WAVE should always be readable, simply
because they contain hardly anything besides the raw data. However, if they
are stored on a file system that you don't have a driver for, they become useless.
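
A quick way to see how little a plain WAVE file contains besides the samples is to build one in memory with Python's standard wave module and measure what is left over once the raw data is subtracted. This is only a sketch under simple assumptions (16-bit mono PCM, no extra metadata chunks); the helper name tiny_wav is made up for the example.

```python
import io
import wave

def tiny_wav(samples: bytes, rate: int = 44100) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container, in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)      # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        wav_file.writeframes(samples)
    return buffer.getvalue()

raw = bytes(range(200))               # 100 arbitrary 16-bit samples
container = tiny_wav(raw)
print(len(container) - len(raw))      # bytes that are not sample data
```

For this simple case the overhead is a 44-byte header describing the sample rate, sample width and channel count; everything after it is the audio itself, which is why a damaged or unfamiliar file of this kind can usually still be recovered by a raw import.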

It is also not clear that your average computer will still have an optical drive 20 years
down the line, but you can never be 100% future-proof. For the moment, HFS or FAT
on a regular CD-ROM should be safe for the foreseeable future... Regards,

Martin

HairerSoft
http://www.hairersoft.com/


_______________________________________________
Amadeus forum mailing list
Unsubscribe / change settings at http://two.pairlist.net/mailman/listinfo/forum_list

chetstone
Posts: 8
Joined: Sun May 17, 2009 2:10 am
Location: Crestone, CO, USA

Conclusion: archiving format

Post by chetstone »

Thanks all for the good discussion. As far as storage media are concerned, I was using data DVD's but began to have worries about longevity, so I have decided to use 'live' hard drive storage. That is, I will just keep it on my main computer's hard drive. I just upgraded it to 750GB and my archive is a small percentage of that, and at the rate I'm working on my digitization project, it's liable to stay a small percentage as drives get bigger. This gives me automatic redundancy, with my Time Machine local backups and offsite CrashPlan backups.

I think I will continue with the FLAC format, since it is significantly smaller than Apple Lossless. 10 years from now, if I start to get worried about the viability of that format, it will be easy to transcode the files (using Amadeus batch mode if it's still around ;-) ) to something else, since the files are online.
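
If Amadeus were not around, a batch transcode along these lines could also be scripted. The following is a rough Python sketch that assumes the reference flac command-line tool is installed and on the PATH; it first asks flac to test each file, then decodes it to a WAV file next to the original. The function names and folder layout are hypothetical.

```python
import subprocess
from pathlib import Path

def decode_command(flac_path: Path) -> list[str]:
    """Build the flac command that decodes one file to WAV beside it."""
    wav_path = flac_path.with_suffix(".wav")
    return ["flac", "--decode", f"--output-name={wav_path}", str(flac_path)]

def transcode_folder(folder: Path) -> None:
    """Test and decode every .flac file in a folder (assumes flac is installed)."""
    for flac_path in sorted(folder.glob("*.flac")):
        # --test decodes without writing output and fails on a damaged stream
        subprocess.run(["flac", "--test", str(flac_path)], check=True)
        subprocess.run(decode_command(flac_path), check=True)

print(decode_command(Path("song.flac")))
```

Running the test step now and then, not only at transcode time, would also catch a file that has gone bad while there is still a backup to restore it from.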

Thanks again for the discussion. And Sonic, I will be interested in hearing about your research. I wasn't aware there was any error detection/correction in HFS+. But there must be, because file corruption is so rare these days you don't even think about it as a possibility (unless your hard drive is about to crash). Perhaps it's implemented in the low level hard drive firmware.

Lou Kash
Posts: 102
Joined: Wed Jul 16, 2008 1:39 pm

Re: Conclusion: archiving format

Post by Lou Kash »

chetstone wrote:I was using data DVD's but began to have worries about longevity
According to a test I once read on the renowned www.heise.de, Verbatim Archival Grade DVD-Rs are supposed to last for decades. When bought in larger quantities, they aren't even much more expensive than regular brand DVD-Rs, and you even get a free jewel case with each disc…

Regarding audio format, I prefer AIFF over WAVE because of its metadata and marker compatibility. For less important audio, such as digitized vinyl, I use ALAC in order to squeeze more tracks onto a DVD, burned in iTunes (which then also includes an iTunes-compatible playlist).

Sonic Purity
Posts: 82
Joined: Sat Nov 10, 2007 11:58 pm
Location: Pasadena, California, U.S.A.

Correction and Findings

Post by Sonic Purity »

First and most importantly, my assertion that the file system provides 100% error correction is totally and completely incorrect. 100% incorrect, in fact. I appreciate rfwilmut questioning my assertion, compelling me to think clearly and check my facts. I really hate it when i myself am the source of factual misinformation, so i appreciate the opportunity for this follow-up post to correct my error and set the record straight.

This is going to be a long post, so i’ll put the short answer and summary here:

* Error handling (detection, correction) for computer data volumes: It’s Complicated. It definitely exists, yet is in no way close to 100% pure uncorrupted data—far from it. Still, it is good enough that we’re not all running around screaming all the time about how numbers in our spreadsheets and databases are mysteriously changing, nor do most of us need to reinstall OSes and application software every few days on account of wholly uncorrected errors.

* Error handling for CDDA (audio CD format) discs: Correction (when possible), then Interpolation (when possible), then Muting as a last resort, as discussed already in this thread. A lesser level of error protection is deemed acceptable, given that small errors which can kill a spreadsheet or application tend to only be minor anomalies for audio.
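
The correct / interpolate / mute order can be pictured with a toy Python sketch. Samples the error correction could not restore are marked None; an isolated gap with good neighbours is filled with their average, and anything worse is silenced. The thresholds real players use are not stated anywhere in this thread, so this policy and the function name conceal are purely illustrative.

```python
from typing import Optional

def conceal(samples: list[Optional[int]]) -> list[int]:
    """Fill samples flagged as uncorrectable (None): interpolate or mute."""
    out: list[int] = []
    for i, value in enumerate(samples):
        if value is not None:
            out.append(value)                  # read or corrected exactly
            continue
        before = samples[i - 1] if i > 0 else None
        after = samples[i + 1] if i + 1 < len(samples) else None
        if before is not None and after is not None:
            out.append((before + after) // 2)  # isolated loss: interpolate
        else:
            out.append(0)                      # burst or edge loss: mute
    return out

print(conceal([10, None, 20, None, None, 30]))  # prints [10, 15, 20, 0, 0, 30]
```

The output is a plausible waveform, not the original one: the 15 is a guess and the zeros are silence, which is acceptable for listening but is exactly what a data file system may not do.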

I’m still going to go to sleep tonight at peace that my audio masters are stored as AIFF or FLAC or Apple Lossless data files on HFS or HFS+ volumes vs. on CDDA audio CDs. The rest of this post endeavors to explain why, and point out interesting information i found (with citations).

Data Volumes and Errors
There has to be some form of error management going on, otherwise none of us could use our computers: we’d have spreadsheets where $250 might spontaneously turn into $950 (if the unresolved error is a digit in our spreadsheet document file) and our application software would barely run. I base this last statement on what i have been told by computer programmers (of which i am not one): that changing a single byte of application code totally changes the program, likely in unknown and unpredictable ways. (This may apply down to the bit level. I don’t know, so i’m being cautious and typing “byte” instead.) One pundit sums it up this way:

http://storagemojo.com/zfs-threat-or-menace-pt-i/
“The truth? You can’t handle the truth!
Actually, in the storage world, we insist upon it. Data integrity is the sine qua non of data storage. Fast is good, accessible is good, but if it isn’t right, nothing else matters.

“To ensure data integrity, all systems use some form of checksum to ensure some level of integrity. Yet that integrity may not be nearly as good as your friendly SE [system engineer] has led you to believe.

“Most filesystems rely upon the hardware to detect and report errors. Even if disks were perfect, there are still many ways to damage data en route. In flight data corruption is a real problem.”
Of necessity, CERN has had to study this problem in depth, to ensure that the vast amount of data they collect is accurate. According to this report, they’re not at all pleased with what they found in terms of “silent” (no notification) data errors, and what they need to do about it to work around the problems. Still, they do note the following about the status quo:
“In principle the whole data flow chain is protected through the implementation of ECC (Error Correction Code) and CRC (Cyclic redundancy Check):
1. The memory is capable of correcting single bit error
2. the cache in the processor is ECC protected
3. PCIe and SATA connections have CRC implemented
4. the disk cache has ECC memory and the physical writing to disk has as well ECC as CRC in a complicated manner implemented to correct up to 32 byte errors (per 256 bytes) and detect any data corruption. The data is actually 5 times encoded before it reaches physically the disk.”
Wikipedia: Error Detection and Correction
“Modern hard drives use CRC codes to detect and Reed-Solomon codes to correct minor errors in sector reads, and to recover data from sectors that have ‘gone bad’ and store that data in the spare sectors.”
Comments to an Ars Technica article on ZFS:
“The key feature of ZFS data integrity is not that it will detect errors ‘instantly’ in all cases. It's that it will detect these kinds of errors at all. Other file systems won't.”
A different commenter:
“It helps that just about every modern disk to controller interconnect and up uses some form of ECC (SATA, SAS, SCSI, IDE UDMA) or parity (from the controller upwards)….”
This isn’t just a Mac thing, or an HFS thing. This apparently applies equally to all personal computer file systems, other than ZFS, which apparently is in use on Sun Solaris systems, and has been part of Leopard server (read only if i recall what i read correctly) yet not any older nor newer Mac OS (client or server) so far.

Even passing information about an error which has occurred (referred to as “error propagation” in the following item) fails to happen all too often in the real world:
“In conclusion, error propagation appears complex and hard to perform correctly in modern systems.”
Another overview of the problem. If you only read one article out of all for which i have links in this post, make it this one:
Data corruption is worse than you know
“The bottom line
CERN found an overall byte error rate of 3 * 10^7, a rate considerably higher than numbers like 10^14 or 10^12 spec’d for components would suggest. This isn’t sinister.
It’s the BER of each link in the chain from CPU to disk and back again plus the fact that for some traffic, such as transferring a byte from the network to a disk, requires 6 memory r/w operations. That really pumps up the data volume and with it the likelihood of encountering an error.”
So yes, storing AIFF or WAVE or FLAC or anything else on HFS or HFS+ or FAT(any) or NTFS or anything you care to name (other than ZFS) means that you and i cannot count on every byte stored being what was originally in the file. Now, let’s let the author of the above article put that in perspective:
“My system has 1 TB of data on it, so if the CERN numbers hold true for me I have 3 corrupt files. Not a big deal for most people today. But if the industry doesn’t fix it the silent data corruption problem will get worse.”
I don’t know about you, but despite my various audio efforts, i still have vastly less than 1 TB of master audio files… probably still under half that, in fact. So, if the CERN numbers hold true (and there are too many variables and we don’t know), i may have one file with a silent (unknown, undetected) error. If it is an audio file, depending where that error is, it may be inaudible, a slight glitch, a loud glitch, or i suppose the file might be unreadable as its native format and may need Amadeus’ raw import for recovery (so hopefully the file was uncompressed audio!). It might instead be Amadeus Pro, which would then probably act wonky, and need reinstallation. Or some important part of the OS, with the same result: reinstall.
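
Since the worry here is errors that nobody is told about, one practical response is to record a checksum for every master file and re-check it from time to time. Below is a minimal Python sketch of that idea using SHA-256 from the standard library. The function names and the notion of a "manifest" dictionary are assumptions made for this example, not something any of the cited articles prescribe.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audio files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to its SHA-256 digest."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }

def changed_files(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List recorded files that are now missing or hash differently."""
    current = build_manifest(folder)
    return [name for name, digest in manifest.items()
            if current.get(name) != digest]
```

A checksum only detects a change; it cannot repair one. Fixing a file that shows up in changed_files still depends on having a second copy somewhere whose checksum does match.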

CD Audio (CDDA or CD-DA) Discs and Errors
Data drives (all technologies) do repeated reads if their firmware or some higher-level software detects a problem. Drives in CD-DA mode do not, unless the controlling software tells them to. Audio CD players do not, at all: they correct, or interpolate, or mute.
My source of the following information is Principles of Digital Audio by Ken C. Pohlmann, first edition. ISBN 0-672-22388-0.
p. 256
“Theoretically, the raw bit error rate on a CD is between 10^-5 and 10^-6…. Following CIRC error correction, the bit error rate is reduced to 10^-10 or 10^-11…. In practice, because of the data density, even a mildly defective disc can exhibit a much higher bit error rate.”
And that’s for a mass-produced audio CD, not a CD-R.
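
To put those rates in proportion, here is a back-of-the-envelope Python calculation. It assumes a 74-minute disc and applies the quoted rates to the audio payload only (44,100 samples per second, 16 bits, 2 channels); both are simplifying assumptions of mine, not figures from Pohlmann.

```python
SAMPLE_RATE = 44_100      # samples per second per channel (CD audio)
BITS_PER_SAMPLE = 16
CHANNELS = 2

def expected_bit_errors(minutes: float, bit_error_rate: float) -> float:
    """Expected number of wrong bits in the audio payload of one disc."""
    total_bits = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS * minutes * 60
    return total_bits * bit_error_rate

for label, rate in [("raw, 1e-5", 1e-5), ("after CIRC, 1e-10", 1e-10)]:
    print(label, expected_bit_errors(74, rate))
```

A full disc carries roughly 6.3 billion audio bits, so the raw figure works out to tens of thousands of bit errors per disc, while the post-CIRC figure is well under one per disc. That gap is the work the error correction does on a good pressing, before any interpolation or muting is needed.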

Wikipedia: CD-ROM
“Unlike a music CD, a CD-ROM cannot rely on error concealment by interpolation, and therefore requires a higher reliability of the retrieved data. In order to achieve improved error correction and detection, a CD-ROM has a third layer of Reed–Solomon error correction.”
AIFF has no error detection nor correction mechanism within the format specification (AIFF spec. v. 1.3, Apple Computer).

Bottom Line
Whether i’ve successfully provided sufficient references or not, the bottom line, at least for me, is that data file storage, while imperfect, has a greater level of error minimization (however achieved) than CD-DA format audio CDs. Further, on the Mac and without using unusual software, it is difficult to store pure CD-DA data, and easy to store the very similar AIFF or WAV data. This implies that to store CD-DA format data, it is most likely going to be on a CD, probably a CD-R. Contrast this with the easy ability to store AIFF on any HFS/HFS+ volume (as well as other file structures).
Like many here, i have years of experience dealing with the CD formats in different capacities, including having spent a lot of time looking at signals off of good and defective CDs (audio and Mac HFS/HFS+ data, mostly) and players/drives. Also like many here, i’ve had some hard drives die, Macs crash, and general software woes. (I did mass storage Q.A. testing at Apple for awhile, 1996-97.) Even though my erroneous assertion regarding where error correction was taking place and its effectiveness was far off the reality mark, it remains my experience that storing audio files as data (AIFF, WAV, FLAC, etc. etc.) is likely to collect fewer errors over the years than storing as CD-DA.

Preparing to post other places where i have previously made my erroneous assertion, to straighten out the (unintended) misinformation,
))Sonic((

rfwilmut
Posts: 255
Joined: Fri Nov 17, 2006 1:19 pm

Post by rfwilmut »

Thanks for that very detailed and well-researched reply. It's an interesting can of worms, but you've de-wormed it by a significant amount. There is obviously never going to be a 100% archivally secure method; probably a good answer where possible is to have duplicates on a different method.

The problems divide into three groups:

Reliability of the stored data.

Reliability of the physical medium.

The ability to read that medium in the distant future.

A good hard disk has a lot going for it, particularly as the connection method - USB, Firewire, SCSI - doesn't matter as the disk can hopefully be put in a different enclosure if necessary. It's a pity that magneto-optical has pretty well gone out of use as it may be the most stable format - though magnetic images seem to hold up well; it's the physical side which can be problematic (the binder which holds the oxide onto the base, the bearings and drive mechanism in a hard disk, and so on).

I have my most important audio stored on several hard disks, CD-Rs and Sony MiniDiscs. As I'm 68, if I can get another 25 years out of them that will probably be enough!

CDJonah_alt
Posts: 379
Joined: Thu Mar 15, 2007 3:57 pm

what file format for long term archiving?

Post by CDJonah_alt »

I will toss in one more item; this comes from the old days and may not
be possible today. If one saves a file and there is a problem in a
localized region and if one can get it back (not sure how possible that
is today for most of us), an uncompressed file will have the problem in
a localized region and one can probably mitigate the problem by
interpolation, etc. On a compressed file, the error will affect a
larger portion of the data. In a lossless compression, one has removed
degeneracy so one might not be able to recover anything over the width
of the compression.
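
The difference can be demonstrated with a short Python experiment. It flips one byte in the middle of some raw 16-bit audio, and one byte in the middle of the same audio after compression. Note that zlib is used here only as a convenient general-purpose stand-in for "a lossless compressor"; it is not how FLAC or Apple Lossless work, and the function names are made up for the example.

```python
import math
import struct
import zlib

def make_pcm(n_samples: int = 4000) -> bytes:
    """Deterministic 16-bit little-endian test tone."""
    return b"".join(
        struct.pack("<h", int(12000 * math.sin(i / 10)))
        for i in range(n_samples)
    )

def flip_byte(data: bytes, index: int) -> bytes:
    """Return a copy of data with every bit of one byte inverted."""
    damaged = bytearray(data)
    damaged[index] ^= 0xFF
    return bytes(damaged)

def damage_report() -> tuple[int, bool]:
    """Return (bytes changed in raw audio, compressed copy still intact?)."""
    original = make_pcm()
    raw_damaged = flip_byte(original, len(original) // 2)
    raw_diff = sum(a != b for a, b in zip(original, raw_damaged))

    packed = zlib.compress(original)
    try:
        recovered = zlib.decompress(flip_byte(packed, len(packed) // 2))
        compressed_ok = recovered == original
    except zlib.error:
        compressed_ok = False      # the stream no longer decodes at all
    return raw_diff, compressed_ok

print(damage_report())
```

In the raw audio exactly one byte is wrong, which is one bad sample at most. The compressed copy does not come back intact; with zlib the usual outcome is that the decoder rejects the stream outright. Formats designed for audio can limit this by splitting the stream into independently checked frames, but the damaged stretch is still wider than a single sample.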

Chuck

--
Charles D Jonah
Building 200
Chemical Sciences and Engineering Division
9700 S. Cass Avenue
Argonne, IL 60439
630-252-3471 CDJonah@anl.gov
