I read Wolf’s I/O error treatise this morning (as well as Alastair’s response), and thought I’d write a bit about how SuperDuper! actually handles I/O errors, and why. (In fact, this is an expansion and reworking of some email I dropped to Jonathan after reading the article.)

Although Wolf says otherwise, ditto isn’t our underlying engine. We use a variety of APIs in Cocoa and Carbon, augmented with much additional metadata copying. However, when we get a failure with those (such as an I/O error), we retry twice more: once with copyfile and once—just in case—with ditto, verifying after each one.

We do this because we’ve seen the rare case where one API fails but others do not. Weird, I know, but it happens.

If all three attempts fail, we stop. This is done for safety: an I/O error could mean the drive is failing, and since you’re dealing with a live backup, it’s important to understand what’s going on. If a significant failure is occurring, steps should be taken to concentrate on recovering your user files, rather than trying to copy the whole drive. We don’t want a user (or SuperDuper!) to continue past the failure: we want them to stop, diagnose and, if necessary, get help. And since most users won’t know what to do (unlike Wolf or Alastair, clearly), we make it really easy to contact support.
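To make the retry-then-stop idea above concrete, here’s a minimal sketch in Python. It is not SuperDuper!’s code: the function names, the `CopyFailed` exception and the byte-for-byte comparison used as “verification” are all assumptions made for illustration, and `shutil.copy2` merely stands in for whatever the primary copy API is. Only the `ditto` command line tool is real, and it’s only available on a Mac.

```python
import filecmp
import shutil
import subprocess


class CopyFailed(Exception):
    """Hypothetical error: every strategy failed, so the caller should stop."""


def copy_with_shutil(src, dst):
    # Stand-in for a primary copy API; copies data plus basic metadata.
    shutil.copy2(src, dst)


def copy_with_ditto(src, dst):
    # Shells out to macOS's ditto; raises if the tool fails or is missing.
    subprocess.run(["ditto", str(src), str(dst)], check=True)


def verified(src, dst):
    # Assumed verification step: compare file contents byte for byte.
    return filecmp.cmp(src, dst, shallow=False)


def copy_with_fallbacks(src, dst, strategies):
    """Try each (name, function) pair in order, verifying after each one.

    Returns the name of the first strategy whose copy verifies.
    Raises CopyFailed with the collected errors if none succeed.
    """
    errors = []
    for name, strategy in strategies:
        try:
            strategy(src, dst)
            if verified(src, dst):
                return name
            errors.append((name, "verification mismatch"))
        except (OSError, subprocess.CalledProcessError) as exc:
            errors.append((name, str(exc)))
    # No automated recovery past this point: surface everything to a human.
    raise CopyFailed(errors)
```

A caller would pass something like `[("primary", copy_with_shutil), ("ditto", copy_with_ditto)]` and treat `CopyFailed` as the signal to halt the whole backup rather than skip the file.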

Our User’s Guide has a Troubleshooting section that helps a user determine whether the error is on the source or destination, as well as general steps for recovery. (I don’t explain there how to use the system log and a System Profiler report to locate the source, because it’s pretty obscure stuff; the amount of detail in our log is confusing enough for most.) But in most situations, 4K of 0s will pretty much be a fatal problem for the file. (I’m shocked, frankly, that Wolf’s Parallels disk was OK given the damage: he was very lucky.)

Most of the time, the problem is actually an iSight camera, iPod, or other bus-powered device misbehaving on the FireWire bus. On occasion, the problem is with the source.

Errors on the source are problematic. As Alastair mentions, modern disk controllers transparently relocate sectors when errors occur. Real problems happen when the drive’s out of spares, or when the on-disk error correction can’t handle the failure. And at this point, the drive has probably been silently failing for a while.

In many cases, SMART status will flag a drive that’s failing badly. SMART Reporter, a nice bit of freeware from Julian Mayer, can give you an obvious warning when this occurs, or even run a program (like SuperDuper!) to do a quick backup of critical files. But, often, it won’t, and experienced guidance and advice are necessary to help people understand what’s going on.
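If you’d rather peek at SMART status yourself, here’s a rough sketch. It assumes `diskutil info` prints a line such as `SMART Status: Verified`, and the status strings it checks for (`Verified`, `Not Supported`) are my assumption about that output, not a documented list. The function names are made up for the example, and this isn’t how SMART Reporter works.

```python
import re
import subprocess


def parse_smart_status(diskutil_output):
    """Pull the value of the 'SMART Status:' line out of diskutil's text."""
    match = re.search(r"^\s*SMART Status:\s*(.+?)\s*$",
                      diskutil_output, re.MULTILINE)
    return match.group(1) if match else None


def smart_status(disk="disk0"):
    # macOS only: asks diskutil about the given disk identifier.
    result = subprocess.run(["diskutil", "info", disk],
                            capture_output=True, text=True, check=True)
    return parse_smart_status(result.stdout)


def needs_attention(status):
    # Assumed values: anything other than these two is worth a closer look.
    return status is not None and status not in ("Verified", "Not Supported")
```

Remember the caveat above: a clean SMART result doesn’t prove the drive is healthy, so treat this as an early warning at best.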

Anyway, as Wolf’s article indicates, and Alastair agrees, it’s very difficult to continue in a way that ensures data is preserved as much as possible. It’s hard to know what really happened without being there, and an automated fix isn’t guaranteed. So, we’re super conservative. And while it’s obviously labor intensive, we think injecting a human into the process at this point provides the user with the best outcome.