How to use rsync

The file-transfer utilities rsync and unison have one very useful feature in common: If a file already exists at the destination end, only the differences are actually copied. In particular, if the files are the same, nothing gets transferred. This is most obviously useful when updating a massive set of files over a slow network, but is also a great timesaver if the transfer is interrupted part of the way through (restart to resume), or if the source fileset may have changed while we were copying it across.

Please note: The contents of this Page were originally taken as-is from a Top Tip of the Week of considerable antiquity by MaJoC (who himself is of considerable antiquity), filtered through the AstroCentralSupport pages of the old AstroWiki, and have now been dropped into Central Physics's IT Support pages. Please excuse us awhile while we clear up the mess. In particular, the continued advocation of Unison is a burden on our conscience.

If anything is unclear, outright wrong, or found missing, please contact IT Support (at the usual e-mail address) to suggest a suitable rewording, or add a Comment using the link at the end of this Page, and thereby help us improve this document.

rsync

An rsync transaction copies files in one direction. The following incantation is MaJoC's inter-system file-transfer weapon of choice:

rsync -avP -e ssh fromname@fromsystem:/fromdir/ todir/

.... recursively copies everything in the named remote source directory into the named local destination directory, using ssh for its data transport. See "man rsync" for the (ahem) surprising consequences of omitting the trailing "/" on the source directory name: without it, rsync creates fromdir itself inside todir, instead of copying just its contents. (The "-e ssh" is there in case your client hasn't been configured to use ssh by default. Most of them have been by now, so I'll omit it below.)
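
The trailing-slash behaviour is easiest to see on a throwaway pair of local directories. What follows is a minimal sketch, assuming rsync is installed locally; the directory and file names under the temporary directory are invented for illustration.

```shell
# Demonstrate the effect of a trailing "/" on the rsync SOURCE directory.
# Everything lives under a throwaway temporary directory (hypothetical names).
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping" >&2; exit 0; }

demo=$(mktemp -d)
mkdir -p "$demo/fromdir"
echo "hello" > "$demo/fromdir/a.txt"

# With the trailing slash: the CONTENTS of fromdir land directly in the destination.
rsync -a "$demo/fromdir/" "$demo/with_slash/"

# Without it: fromdir ITSELF is created inside the destination.
rsync -a "$demo/fromdir" "$demo/without_slash/"

# Record where a.txt ended up in each case, relative to the destination.
with_slash=$(cd "$demo/with_slash" && find . -type f | sort)
without_slash=$(cd "$demo/without_slash" && find . -type f | sort)
echo "with slash:    $with_slash"
echo "without slash: $without_slash"
```

The second form is the one that sprays an extra directory level where you didn't expect it.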

Updating only

If you've updated one or more of your files on the receiving end, rsync may inadvertently downdate them for you. If that's the case, then add the "--update" flag into the mix:

rsync -avP --update fromname@fromsystem:/fromdir/ todir/

This will, to quote the manual page, "skip files that are newer on the receiver", by looking at the respective timestamps. (This isn't the default because downdating is usually what you want to do to make sure everything's in sync.)
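
The difference between the two behaviours can be tried out safely on scratch directories. This is only a sketch, assuming a local rsync; the file names and contents are made up, and the source file is deliberately backdated with touch so that the destination copy is the newer one.

```shell
# Show that --update leaves a newer file at the receiving end alone.
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping" >&2; exit 0; }

demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/dst"
echo "old version from source" > "$demo/src/notes.txt"
echo "newer local edit" > "$demo/dst/notes.txt"

# Make the source copy clearly older than the destination copy.
touch -t 202001010000 "$demo/src/notes.txt"

# With --update, the newer destination file is skipped.
rsync -a --update "$demo/src/" "$demo/dst/"
after_update=$(cat "$demo/dst/notes.txt")

# Without --update, the older source copy replaces ("downdates") the local edit.
rsync -a "$demo/src/" "$demo/dst/"
after_plain=$(cat "$demo/dst/notes.txt")

echo "after --update run: $after_update"
echo "after plain run:    $after_plain"
```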

Changing ownership:

rsync -rlpt -vP -u /media/usb1/foo/ /data/system/user/bar/

.... which those in a hurry might write as:

rsync -rlptvPu /media/usb1/foo/ /data/system/user/bar/

.... copies from your USB drive to a data-mounted partition; exchange the directory names to reverse the process. Changing "-a" to "-rlpt" means you get to own the copies, even if your system and external USB drive don't agree on numeric IDs (people using USB drives formatted as HFS+ to transport data between Apple and Linux systems often suffer from this).

Test runs:

To test if you have the syntax correct, you can always perform a "dry run" first, by adding the "-n" option:

rsync -n -rlpt -vP -u /media/usb1/foo/ /data/system/user/bar/

This just lists all the files that rsync would have transferred, but without actually changing anything. Useful if you tend to forget about the above '/' issues on directory names, or if you plan to play with fire and use the delete options.
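
A dry run can itself be rehearsed on scratch directories, to see what the listing looks like and to confirm that the destination really is left alone. A minimal sketch, assuming a local rsync; the "-i" (itemize changes) option is an addition of ours to make the listing more explicit, and the paths are hypothetical.

```shell
# Rehearse a dry run and confirm that nothing is actually written.
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping" >&2; exit 0; }

demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/dst"
echo "data" > "$demo/src/new_file.txt"

# -n (--dry-run) reports what would be copied; -i itemises each planned change.
planned=$(rsync -n -rlpt -i "$demo/src/" "$demo/dst/")
echo "$planned"

# Count the files at the destination: the dry run should have created none.
dst_count=$(find "$demo/dst" -type f | wc -l | tr -d ' ')
echo "files in destination after dry run: $dst_count"
```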

Tip of the Day: People will often see me start an rsync transfer, interrupt it with a Control-C, and run ls -l, then rerun the rsync. This is an old habit, which makes use of the fact that a restarted rsync picks up from where it left off: it checks that I've got the incantation right, and in particular that I'm not spraying the contents of the source directory all over what should be the destination directory's parent.

Said persons will also see me wait for rsync to complete, then run it again with the arguments unchanged. This checks the transfer was complete, and that nothing's changed at the source end in the meantime. Please see also Apparent failure to copy below.

For more details on rsync, please see:

  • "man rsync" (rather long: skip forward to the "Usage" section);

  • find your way round "info rsync";

  • http://rsync.samba.org/, and look for the full documentation.

Unison

Unison, which uses an rsync-like algorithm to transfer only the changed parts of files, synchronises files in two directions at once. The intention is to keep two or more copies of each file in separate places, and to update each from the other. If one copy of a file has been modified, the default action is to change the other to match; if both have been changed, neither is modified, and the problem is brought to the attention of a responsible adult (you). This is all done by keeping notes, at both ends of the process, of the state of each file at the end of the last transaction. (Rsync by default goes by datestamps, which can be misleading. Unison by default doesn't propagate timestamps; this can be even more misleading, and is quite definitely a mess to sort out afterwards for us timestamp fetishists and other users of make.)

WARNING: Because Unison does a file-by-file bidirectional synchronisation between the folders in question, it has been known to cause problems: in particular, collections of files which should be treated as a set, such as a local git repository or one's svn directories, can get subtly corrupted. For this reason, the Unison Homesync offering formerly made available for Astrophysics laptops has been withdrawn.

This section is retained mainly to issue this warning.

Bearing in mind the above caveat, synchronising between machines, or between mounted filesystems, is as simple as:

unison ssh://farname@farsystem//fardir /data/system/user/neardir
unison -times /media/usb1/foo /data/system/user/bar

(The latter propagates timestamps.) This is by default interactive. The text-based client is a mite user-hostile; GUI clients are also available, though not necessarily present by default on all client systems. For more information on unison, please see:

  • "man unison" (very short);
  • "unison -doc tutorial | less" (longer than somewhat, hence the pipework);
  • http://www.cis.upenn.edu/~bcpierce/unison/index.html, Unison's home site, where those using their own laptops will also find native MS-Windows and Macintosh client binaries.

CAVEATS

Some of these may belong more properly in a different Article. Please excuse the mess.

Availability:

You will, of course, need a copy of whichever file-transfer program you use at both ends of the transaction. Availability varies: the command-line and GUI clients may not be fully present on MacOS X or Linux desktop systems; they would have to be installed on your own personal laptop; and they are less likely to be there by default under MS-Windows. For Unison, you'll also have to have a fully-functional ssh setup, which for MS-Windows probably means installing Cygwin (unless putty would suffice or the binary client package does all this for you .... information hereby humbly requested).

Wildcards:

With rsync, do not be tempted to use a "*" after the directory name. For example:

rsync -rlpt -vP -m -z \
    username@scp-astro.physics.ox.ac.uk:~/ \
    /Data/username/oldhome/

copies the entire contents of the named directory. However, if you instead say:

rsync -rlpt -vP -m -z \
    username@scp-astro.physics.ox.ac.uk:~/* \
    /Data/username/oldhome/ # WRONG

then the "*" wildcard fails to spot "dot-files" (eg .tcshrc or .gitignore) or similarly-hidden directories (eg .ssh/).

Upshot: your backup appears to have succeeded, but isn't all there. (There's also the insidious possibility that the wildcard will match something at the receiving end, in which case all bets is off. If in doubt, use single quotes, 'thus'.)
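
The dot-file blind spot is a property of shell wildcards in general, so it can be seen without going anywhere near rsync. The following sketch uses a scratch directory with invented contents, and compares what "*" expands to with what is really there.

```shell
# Show which names a bare "*" expands to, compared with what is really there.
demo=$(mktemp -d)
mkdir -p "$demo/home/.ssh"
touch "$demo/home/.tcshrc" "$demo/home/.gitignore" "$demo/home/thesis.tex"

# What the shell hands over when you write "dir/*": dot-files are left out.
globbed=$(cd "$demo/home" && echo *)

# What is actually in the directory (-A lists dot-files, minus "." and "..").
actual=$(ls -A "$demo/home" | sort | tr '\n' ' ')

echo "matched by *: $globbed"
echo "really there: $actual"
```

Only thesis.tex is matched by the wildcard; .tcshrc, .gitignore and .ssh are silently left behind.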

Please note: Under some circumstances, not copying .tcshrc or .ssh/ is a Good Thing. Use caution, a "staging" local directory, and probably something of the form:

diff -urs /Data/stagedir/ $HOME/ | less

Apparent failure to copy:

The wise user will repeat the command after a complete transfer using rsync: not because rsync doesn't work, but because some partitions are formatted as case-insensitive filesystems, and because of a quirk with timestamps on Microsoft filesystems. Telltale signs of such problems include rsync never being happy, and insisting there are still changes to be made just after a fresh sync.

  • If you get an avalanche of messages on the second time through, use tee to save the results to a file, so you've got chapter and verse on where rsync thinks the problem lies:

    rsync [....] |& tee rsync.try.out (tcsh)
    rsync [....] 2>&1 | tee rsync.try.out (bash)

  • If you have two files or directories whose names are only distinguished by the case of the letters (eg reduced_data and Reduced_Data, or .../lib/ and .../Lib/) in the same directory at the sending end, they look the same to the operating system, so they overwrite each other (or combine their contents, with extra chances for mischief) every time you rsync. This is an insidious problem, for which the only true fix is "so don't do that". This is admittedly easier said than done if it's been done on your behalf.

      • The Data partitions on Astrophysics's MacOS X systems are often case-sensitive, as historically some software got distressed otherwise. Sadly, home directories (whether local or network-mounted), and the system partition, have to be case-insensitive: case sensitivity breaks certain other software (most notoriously, Adobe's).

      • Linux's local filesystems are almost invariably case-sensitive. As and when we can offer network home directories for Linux systems, this may change.

  • If the target filesystem is VFAT, then you (like me) may fall foul of a combination of minor details: timestamps on VFAT have a two-second minimum quantum; the UNIX datestamp is precise to the second; and the sourcefile's timestamp, measured in seconds, has a one-in-two chance of being an odd number. The fix here is to add the argument "--modify-window=1", to tell rsync that timestamps a second apart are near-enough equal. (Alternatives using checksums at each end are left as an exercise, but are known to be computationally expensive.)

  • Before assuming the worst, check a test file with something of the form:

    diff sourcedir/testfile destdir/testfile
    ls -l sourcedir/testfile destdir/testfile

    This will permit you to distinguish between the above two cases, or perhaps suggest something else.
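
For the case-clash problem in particular, it may help to look for offending names at the sending end before copying anything. The following is a sketch only: case_clashes is a hypothetical helper of our own (not part of rsync), it assumes one path per line with no embedded newlines, and a fixed list of invented names stands in for the output of find.

```shell
# Print groups of paths that differ only in letter case.
# Reads one path per line on standard input.
case_clashes() {
    awk '{
        key = tolower($0)
        # Collect every original spelling that maps to the same lowercase key.
        seen[key] = (key in seen) ? seen[key] "  " $0 : $0
        count[key]++
    }
    END {
        for (key in count)
            if (count[key] > 1) print seen[key]
    }'
}

# Typical use would be:  find sourcedir | case_clashes
# Here a fixed list of hypothetical names stands in for the find output.
clashes=$(printf '%s\n' \
    data/reduced_data data/Reduced_Data \
    src/lib src/Lib src/main.c | case_clashes | sort)
echo "$clashes"
```

Each line of output is one group of names that a case-insensitive filesystem would treat as the same thing.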


Filenames, and file sizes:

If your external USB drive is formatted as FAT32 (alias VFAT, ie as-Microsoft), the person doing the mount gets to own the files. The Bad News is that certain characters (notably ":") tend to appear in names of files produced at certain European observatories, but are forbidden in FAT filenames. Your only reliable choice here would be to create a "tarball" (using tar and gzip), transport that on the USB drive, and untar it at the other end. This is agreed to be awkward: you can't then use the files in-place on the USB drive, nor easily check which files in the tarball are older or newer than your local copy. There's also the minor detail that the maximum size for a file on FAT32 is 4GBytes, as I've had brought forcibly to my attention on multiple occasions, so you'd need to create N separate tarballs, each under 4G in size. This also is agreed to be awkward, especially if your individual files are pushing the 4GBytes limit.
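
One way of producing the N separate tarballs is to let split cut a single compressed tar stream into pieces, and to glue them back together with cat at the other end. This is a sketch under stated assumptions: the directory and file names are invented, and the 1k piece size is only there to keep the demonstration quick. For a real FAT32 drive you would pick something safely under 4 GBytes, such as "split -b 3900m".

```shell
# Pack a directory into a compressed tarball cut into fixed-size pieces,
# then reassemble and unpack it somewhere else.
demo=$(mktemp -d)
mkdir -p "$demo/obs" "$demo/usb" "$demo/restored"

# A filename containing ":", which FAT would refuse if copied directly.
name="frame_2010-01-01T12:00:00.fits"
head -c 5000 /dev/urandom > "$demo/obs/$name"

# tar writes the archive to stdout ("-f -"); split cuts the stream into pieces
# with suffixes aa, ab, ... appended to the given prefix.
tar -C "$demo" -czf - obs | split -b 1k - "$demo/usb/obs.tar.gz.part-"

# At the other end: concatenate the pieces in suffix order and unpack.
cat "$demo/usb/obs.tar.gz.part-"* | tar -C "$demo/restored" -xzf -

pieces=$(ls "$demo/usb" | wc -l | tr -d ' ')
echo "pieces written: $pieces"
```

This does nothing for the in-place and older-or-newer awkwardness described above, but it does get the data across.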

Please note: It's possible to reformat an external USB drive to use HFS+, or Linux's ext2/ext3/ext4, or MS-Windows's NTFS. Each of these will permit larger files and outré filename characters, but reformatting erases the current contents of the partition, and none of them is as widely or fully supported by other operating systems as FAT32. In particular, partially-supported partitions may be mounted read-only.

File ownership:

Those using external USB hard drives with MacOS X systems have HFS+ available to them, but the numeric UIDs and GIDs will inevitably differ between OS X and Linux unless BOTH systems are installed in a physics-standard way. The fix for "reowning" copied files mentioned above is specific to rsync, and only works when copying from the USB drive; those using Unison, and those copying to HFS+, may have to investigate "man chown" instead for retrofixing ownerships. If there's a corresponding spell to cast over unison itself, or if the Linux HFS+ drivers can do some esoteric user-identity mapping trick at mount time, please let us all know.
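
For the chown route, a reasonable first step is to list which files are owned by somebody else before changing anything. The following is a sketch, not a recommendation: the target directory is a scratch stand-in for wherever you copied the files, and changing the ownership of files that belong to another user normally needs root privileges (eg via sudo), which may not be available to you on managed systems.

```shell
# List files under a directory that are not owned by the current user,
# then take ownership of the whole tree.
target=$(mktemp -d)          # stand-in for the directory you copied into
touch "$target/example.dat"

me=$(id -u)                  # numeric IDs sidestep any name-mapping confusion
mygroup=$(id -g)

# Files owned by somebody else (or by a numeric ID unknown to this system).
foreign=$(find "$target" ! -user "$me" -print)
echo "not owned by UID $me: ${foreign:-none}"

# On files you already own this is a harmless no-op; for files owned by
# another user it would have to be run as root.
chown -R "$me:$mygroup" "$target"
```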

Background: USB drives which are plugged into a Mac are automounted with the "noowners" option, so whoever happens to be logged on appears to own the files. This is a feature, though not (to my paranoid mind) a very secure one. The fun starts when MacOS X writes files on said drive, as it writes them using the identity of a user who is not of this Earth.
