By robb allan | Wed, 05/19/2021 - 18:59

Data backups: truly the one task every single computer or tablet owner should perform. Yet many don't, and those who do often don't do it correctly, in a way that lets them recover data that was damaged some time before they realized it. Since more and more of our valuable data is digital rather than paper, this issue has become increasingly important, and the consequences of poor backup decisions more devastating.

For most people, backups simply mean a copy of the selected device – disk, phone, or tablet. But several decisions interact when setting this up. Is the backup on the same disk? Is it performed on all the desired data? Is it repeated frequently enough to capture the most recently added or changed data?

Generally, same-disk backups are just wrong. Sometimes a single file may be corrupted or erased in a discrete action, perhaps by a program crash or inadvertent user error, and in that one circumstance a separate backup on the same disk offers a chance to recover. But far more often, the entire disk fails, losing all data, including resident backups (see Google's analysis of this in Failure Trends in a Large Disk Drive Population). Some more advanced users have created RAID arrays, assuming that this overcomes single-disk failure, and in most cases it does; but I have had a server lose its entire RAID array when the RAID hardware card itself failed, corrupting the entire array of disks. Sadly, one of those disks held my backup of the rest of the array. Hmmm.

Going external

So the most basic choice is to move the backup onto an external disk. For mission- or business-critical data, this might mean an external disk in a different physical location, such as a network-connected server or, more currently, the cloud. More expensive, of course, but more secure, since a fire, electrical surge, or other event that physically damages the main data device won't simultaneously destroy the backup device, too.

But let's examine the "more expensive" problem.

Imagine you own a desktop, with your family's photos, tax files, email, and other collected data on board. Then, you and your wife have iPhones, with precious photos. And perhaps you have an iPad, with a different set of photos.

Suppose you back all of these devices up to an external disk (using iTunes for the phones and iPad). Of course, you are taking new pictures every day, and you create new tax files every year, so one backup from a couple of years ago just won't cut it. The tempting solution? Back up all the devices, every day or two. After all, disks are cheap, right?

Fine. But wait: what do you do with the old backups? If you try to keep them (renaming the backup files, moving them into separate folders, etc.) you very quickly discover the limits of your external disk. So you buy another. But then you fill that, too. So your next temptation is to delete the old backups to free up space.

But what makes this a good idea? Perhaps the theory is that old backups aren't needed because a disk crash will be so obvious that you will know about it immediately, and the most recent backups are all that are necessary.

But not all data loss is obvious.

Sooner or later, every one of us will attempt to open a file and get the awful message that its contents can't be read. Sometimes, the file is apparently empty, with zero bytes. Sometimes its format seems to have been corrupted. Sometimes it is just missing. But in each case, the discovery is not associated with an obvious data loss event like a crash or accidental erasure, and often the time of the corruption is simply unknown. And this could mean that the most recent backups have backed up the corrupted version of the file.

Now suppose you've been doing this for a week or two, and erasing the oldest backups to make space. At some point, you will erase the last good version of the file, and all your backups will be corrupted. In other words, deleting backups is as bad as not backing up to start with.

Yet you can't physically keep every single full backup of everything. So then you start reducing the set of what you back up, eliminating unchanging system files, or applications, or anything else you feel you can reproduce from original disks or the internet or elsewhere. And this will work, until at some point a little farther down the road when you get the "no more disk space available" message yet again.

Enter incrementalism

Years ago data professionals devised a more elegant solution to this problem, called the "incremental backup". In this method, a single master backup is created. But each subsequent backup does not create an entirely duplicate backup set: instead, subsequent backups compare the original data against the existing master backup set, and only back up what has changed. Some days this could mean no new data is backed up at all; other days, many files have changed. But in either case, the subsequent backup sets are drastically smaller, allowing for older backups to be kept for much longer, and thus increasing the opportunity to find the last good version of a damaged file.

Incremental backups are now the standard proper data backup methodology. But they have one drawback: recovery management.

Consider this: you have a master backup set, and twenty or so incrementals. Now you accidentally erase an important file. Fine: off to the backup, where you look at the last incremental set. Perhaps yesterday's version of the file is there, so restoration is quick and simple. But maybe there was no change yesterday, or the day before, or the day before that – in fact, you don't really know when the last version was saved. Finding the last backed up version of a file can be a digital version of hide-and-go-seek. It can be made worse if you need to find a version of a file that was backed up in an earlier incremental, because now you have to inspect the contents of each version of the file to find the right one.

Let's look at another scenario: you crash your entire disk. After the required half hour of panic and cursing, you finally accept that you need to erase and rebuild the disk from scratch. So you dutifully erase it, restore the OS, and now have to restore your backed up files. With a basic collection of incremental backups, this means restoring the first master backup set, and (wait for it), one by one, restoring each subsequent incremental backup, every step overwriting any changed files with more recent ones. Why do this? Because, remember, the last incremental set will not have every file changed since the original master backup – only the files changed since the previous incremental. You have to restore every incremental set in sequence. Imagine 50 days of incremental backups.

You might wish that you had shortened the process by performing newer, more frequent master backups, thereby reducing the number of incrementals. But then you are back to the "erasing oldest backups" problem, in a new guise.

Time Machine

Some professional backup software houses have offered applications over the years to make this process more manageable, perhaps by providing utilities that list all the files in a set of incrementals alongside the original master to allow for more simplified browsing of versions, or by searching through all of a collection of incrementals to find the various versions of a selected file. Some have offered recovery utilities that automatically restore the original master, and all of the incrementals to a specified date.

These work until they don't. Software companies may discontinue applications, be acquired and shut down, go out of business, or change the internal format of a backup so that newer utilities can't read older backup sets. (Sometimes, as happened with NASA, backups are just stored and forgotten, until someone realizes too late what they contain.)

A better solution relies on an important but abstruse feature of many OSes, called a hard link. To understand why, let's use the following analogy. Imagine if your house had only one key to the front door. You couldn't leave the house in the morning, lock it, and take the key with you – your family could never lock the door themselves, or get back in if they were already out. So you all agree to leave the key under the mat. Now, of course, you all have to know that the key is under the mat, so you all discuss it and agree on that location. Perhaps you each carry a slip of paper that says "key under mat".

That works, until one day you lose the key. All the reminder notes are useless, and no one can get in.

Now you all decide that each of you should have your own key. Even if several of you lose your keys, as long as one remains, everyone can get into the house.

Now let's look at the way files are stored. Typically you save, say, a vacation photo into a single file on your disk. But what if you want to save it in several places, perhaps one folder for your vacation pictures, another for your pictures of your children, etc.? You'd normally make multiple copies, one for each folder. But of course this wastes disk space, and worse, if you edit the photo to remove the red-eyes from your kids, you have to duplicate that over and over again to all those folders.

So, instead, many OSes offer something called a "symbolic" or "soft" link, or an "alias" in the Mac world. After you have saved the original file, you create soft links or aliases to it, and put those into the other folders. Soft links are like the reminder notes your family had for the key under the mat: they are simply pointers to the original file. This is great, because if you change the original file, all the soft links automatically reflect the change, without any updating. And this is also awful, because if you delete the original file, all the soft links point to...nowhere, just like losing that one key.
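On a Unix-like system you can watch this happen with ln -s. The file names here are invented for the illustration.

```shell
# A soft link reads through to the original file until the
# original is deleted, at which point it dangles.
work=$(mktemp -d)
echo "photo data" > "$work/vacation.jpg"
ln -s "$work/vacation.jpg" "$work/kids-album.jpg"   # soft link (alias)

cat "$work/kids-album.jpg"      # reads through to the original file
rm "$work/vacation.jpg"
ls -l "$work/kids-album.jpg"    # the link itself still exists...
test -e "$work/kids-album.jpg" || echo "dangling"   # ...but points nowhere
```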

So, instead yet again, many OSes also offer "hard links". Each link refers to the file, but in such a way that you can't delete the original file until the last hard link is deleted. In fact, you can't tell the difference between the original file and each of its hard links: hard links act as though they were individual copies of the file, except that they all refer to only one file on your disk – just like multiple keys refer to one lock. You can delete any of the hard links, but since they all refer to the same file on disk, that file only disappears when the last hard link to it is removed.
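The same experiment with a hard link (ln without -s) shows the difference: both names point at the very same file on disk, and the data survives until the last name is removed. File names are again invented for the example.

```shell
# A hard link is a second first-class name for the same file.
work=$(mktemp -d)
echo "photo data" > "$work/vacation.jpg"
ln "$work/vacation.jpg" "$work/kids-album.jpg"   # hard link, no -s

ls -li "$work"               # both names show the same inode number
rm "$work/vacation.jpg"      # remove one name...
cat "$work/kids-album.jpg"   # ...the data survives under the other
```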

This seems a bit strange, and it's the reason that hard links are not often used. In the Mac world, which is based on Unix and therefore has hard link capability, the Finder offers no way to make one, and the vast majority of users have no idea they exist.

But Apple does, and it's the basis for Apple's fantastic incremental backup tool called Time Machine.

Time Machine combines several great features into a single, simple user utility. It creates incremental backups. It creates them automatically on a regular, frequent schedule. It creates them to an external disk. And it hard links them together.

What?

Time Machine creates incremental backups, saved in a series of folders on the backup disk in order by date. But each folder, when inspected, seems to contain all the files of the original disk, not just those that incrementally changed. How can this be?

It be, because with every incremental backup, Time Machine includes, in the incremental backup folder, hard links to all the unchanged files from the original master backup set.

Remember that hard links all refer to the same file. Therefore they take up almost no disk space: only the underlying file they refer to contains the data. But hard links behave as if they were the original file. Deleting other hard links does not delete the underlying file. So Time Machine can erase the earliest master backup folder and yet all the files remain on the backup disk, referred to by the hard links in the incremental backup folders. The files only disappear when the last hard links to them are removed, say, when the last incremental backup folder is deleted. So each and every incremental backup folder appears to contain an entire backup set, yet the amount of disk space necessary to save all of them is just the size of the original backup plus the size of the changed files. Brilliant.
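A tiny shell demonstration of the property Time Machine exploits: several backup folders sharing hard links to one file cost only one file's worth of space, and deleting the oldest folder harms nothing. The folder names here are made up to mimic a master set plus incrementals.

```shell
# Three "backup folders" holding hard links to one underlying file.
work=$(mktemp -d)
mkdir -p "$work/master" "$work/incr-1" "$work/incr-2"
echo "unchanged file" > "$work/master/doc.txt"
ln "$work/master/doc.txt" "$work/incr-1/doc.txt"
ln "$work/master/doc.txt" "$work/incr-2/doc.txt"

ls -l "$work/master/doc.txt"   # link count shows three names, one file
rm -r "$work/master"           # delete the oldest "backup" folder...
cat "$work/incr-2/doc.txt"     # ...the data is still there
```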

Even more brilliantly, it now becomes trivial to restore an entire disk: just restore the contents of the last incremental backup, since it contains the latest file changes plus hard links to the latest versions of every unchanged file.

As you can imagine, Time Machine has basically put out of business virtually all of the Mac-based backup software products once offered. By embedding Time Machine in the OS itself, creating a simple and visually dramatic utility to browse backup sets, and automating the creation of intelligently linked incremental backups, Apple has given users a state-of-the-art backup tool that even your grandmother can use. For free.

The only problem? It only works on a Mac. There is no Time Machine for the iPhone, or the iPad, or for other OSes.

Faking it

Well, actually you can get around this on the iPhone and iPad. How? By syncing your devices with the Mac. When photos, music files, contacts, etc. are synced from an iOS device to a Mac, copies exist on the Mac. And when Time Machine backs up that Mac, voilà, you have incremental backups of your iPhone's data.

So that leaves everyone else, poor dummies for not buying a Mac. Or does it? Well, surprise, it turns out that anyone else using a Unix-based computer, or even some Windows boxes, can write a simple utility themselves that essentially provides all of the capabilities of Time Machine, although without (sigh) the terrific Apple GUI. And they can do so with a wonderful tool called rsync.