Migrating Projects from SVN to Git

PLEASE NOTE:

New projects should use gitlab.physics instead (please see the separate Article on that subject).

This Article assumes the reader is familiar with svn, and with using git on one's own system. Day-to-day use of git in general, and setting-up and tweaking one's own gitlab.physics account and projects, are documented elsewhere. We also assume that the scm.physics project to be copied has no branches or tags, or that they're used only in the svn-approved manner. (If only the directory trunk in your project on scm.physics is non-empty, you aren't using branches or tags.) If in doubt, please contact itsupport (at) physics.

Please see also:

Executive Summary

The basic difference between git and svn is that an svn checkout relies on an upstream svn server (eg scm.physics) for its history, and retains only a cached copy of the last checkout on your system; in contrast, every git clone is a complete single-project repository in its own right, with full history and all commit attributions, which can optionally use a git server (eg gitlab.physics) as a remote for collaborative and/or backup purposes.

The fundamental process of migrating a project is:

  1. Use git svn (the svn subcommand in the git suite) to clone the project from the svn server to a directory on your local system.

  2. Handle various differences between svn and git.

  3. Push the resulting git project from your local system to the git server.

Please note that this doesn't require privileged access to either server at any stage.

Creating a local git clone

The following will create a git clone in the directory base-directory/project-name. This must be entirely separate from any existing svn checkouts, unless you enjoy disentangling wet spaghetti.

Please resist the urge to prune the svn project first. The cloning process copies the entire svn history by default; removing large and/or redundant branches in the obvious manner only makes their tips inaccessible to svn, without removing the history. We'll get back to this in the collapsible section below about excluding svn branches.

  • Set up a new empty project on gitlab.physics, ready to receive your files, as documented elsewhere. Resist the temptation to populate it.

  • Ensure you and your collaborators have all committed any pending updates to the svn server. (It is possible to use git svn to help keep svn and git versions of a project in sync, but doing so is decidedly awkward, and has not been tested for compatibility with the suggestions in this Article.)

  • Prepare a translation file to map your scm.physics identity, and those of any collaborators, to those for gitlab.physics, of the form:

    yourname = Your Name <your.name@physics.ox.ac.uk>
    collab1 = Collaborator One <first.collab@physics.ox.ac.uk>
    collab2 = Collaborator Two <second.collab@physics.ox.ac.uk>
    root = Your Name <your.name@physics.ox.ac.uk>

    .... where you should replace yourname by your identifier on scm.physics, and collab1 and collab2 are likewise the scm identities of your collaborators. (The last line maps root, your secret collaborator in whose name gforge created your project, to yourself. It's your project, right?) Put this file, say authors-transform.txt, in base-directory.

    This is the most tedious part of the exercise. Code to help you with this is believed to be in preparation, but in the meantime, saying the following may help to remind you of your collaborators' scm.physics identities:

    cd svn-checkout-directory
    svn log --xml | grep author | sort -u | \
        perl -pe 's/.*>(.*?)<.*/$1 = $1_streetname <$1_email>/' | \
        tee ../authors-transform.txt

    .... (where I've had to use backslash-escaping to fold the second line, twice) will produce something of the form:

    yourname = yourname_streetname <yourname_email>
    collab1 = collab1_streetname <collab1_email>
    collab2 = collab2_streetname <collab2_email>
    root = root_streetname <root_email>

    .... ready for you to manually modify to match reality, including assuming root's mantle. (Remember yourname etc are placeholders, not keywords.)

    Please note:

    • You can't miss anybody out of this list. Failure to mention any contributors will yield mysterious complaints later, when git svn finds it doesn't know to whom to attribute their commits.

    • If you have commits from long-gone contributors, it's possible to give them dummy identities instead (eg old-contributor@example.org), or map them onto someone else (as we did with root above). It's up to you to decide whether (from git's point of view) to give your old contributors unusable aliases, or to steal their contributions and pretend they didn't exist; either way, git svn (and gitlab thereafter) will happily go along with the changes. Your humble Author's instinct is to use originators' then-current names for attributing their contributions, even if they are no longer contributing members, to keep history straight.

    • We are setting up identities in git at this point; giving people access to the repo in gitlab is a matter of setting up login identities on the gitlab server, which is an entirely different matter. Were it otherwise, Linus Torvalds would have a semi-infinite number of people who could directly log into his personal git repository, one for every contributor to the Linux kernel. (Apologies for belabouring this distinction, but it's not always immediately obvious, and has been known to bend the brain.)
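Since a single missing contributor stops the clone, it may be worth checking the translation file against the svn history before starting. What follows is a minimal sketch, not a tested tool: the function name missing_authors is made up for this example, and it assumes one "svnname = Real Name <email>" entry per line with no leading whitespace.

```shell
# missing_authors AUTHORS_FILE
# Reads svn usernames (one per line) on stdin and prints each one that
# has no "username = ..." entry in AUTHORS_FILE. (Hypothetical helper.)
missing_authors() {
    authors_file=$1
    while IFS= read -r user; do
        [ -n "$user" ] || continue
        # Compare the text before the first "=" exactly, ignoring padding.
        awk -v u="$user" -F'[ \t]*=' \
            '$1 == u { found = 1 } END { exit !found }' "$authors_file" \
            || printf '%s\n' "$user"
    done
}

# Typical use, from inside an svn checkout:
#   svn log --xml | grep author | sed 's/.*>\(.*\)<.*/\1/' | sort -u |
#       missing_authors ../authors-transform.txt
```

No output means every svn identity in the history has an entry.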

  • Use git svn to create a fresh clone of your existing project, complete with full multi-collaborator history, in a new subdirectory on your own system.

    cd base-directory
    git svn clone \
        -A authors-transform.txt \
        --username yourname \
        --stdlayout \
        --prefix="svn/" \
        https://scm.physics.ox.ac.uk/svn/yourproject \
          project-name \
             |& tee -a yourproject.clone.out

    Please note:

    • This assumes that your svn project has the standard trunk, tags and branches layout. Please see the collapsed section below for advice about nonstandard layouts before continuing, to understand what this means, and what to do about it if necessary.

    • If your svn project has some unwontedly large files in it which should never have been committed, please see the collapsed sections below about excluding files and directories or (more drastically) excluding branches before continuing.

    • The second line has again had to be folded; the trailing "tee into a file" pipework is part of it. The "|&" form is accepted by tcsh, zsh and bash 4 or later; older bash (including the 3.2 that ships with macOS) needs the Bourne-shell standard form "2>&1 |" instead, which tcsh does not accept.

  • Wait a wee while: first git svn clone thinks for a bit, then it patiently checks out every svn commit on all branches and tags, cross-compares them to reconstruct the commit and branching history, lodges the corresponding sets of references in the git clone it's creating in the directory project-name, and tells you what it's doing in loving detail. This is agreed to be nerve-racking. For extra confidence, page through yourproject.clone.out afterwards, eg using less, to check quite what happened.

    Warning: The above procedure has been seen (twice) to stress OS X enough for Terminal to lock up. If this happens, close Terminal and restart it. We can check at the end whether the clone was fully successful; if not, we can always start again, possibly instructing git svn clone to ignore huge binary blobs, as suggested below in the collapsible section on excluding files.

    More information as we discover it.
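Paging through the log by eye is tedious, so here is a minimal sketch of a filter that pulls out the lines most likely to matter. The helper name clone_log_problems and the patterns are assumptions, not part of git svn. In particular, "not defined in" is believed to be the wording git svn uses when an author is missing from the translation file, but please check against what your own version prints.

```shell
# clone_log_problems LOGFILE
# Prints (with line numbers) the lines of a "git svn clone" log that look
# like trouble. Returns 0 if any were found, non-zero if the log looks clean.
clone_log_problems() {
    grep -n -i -E 'not defined in|^fatal:|^error:' "$1"
}

# Typical use:
#   clone_log_problems yourproject.clone.out || echo "nothing suspicious"
```

A clean result is reassuring but not proof; the comparison in the Quality Assurance section remains the real test.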

    Nonstandard layouts

    Not every svn project has the standard trunk/branches/tags layout. I shall use as my example a metaproject, encompassing multiple related-but-independent subprojects, where each subproject appears alongside trunk in the svn filestructure, like this:

    branches
    proj1
    proj1a
    proj2
    proj-libs
    proj-other
    tags
    trunk

    You can produce a single git project from this by omitting --stdlayout in the git svn clone incantation above. However, git is intended for single projects (eg the Linux kernel); metaprojects of the above form tend to be unwieldy, and may be too large for git or gitlab to cope with unaided. (On the one occasion I've tried that, macOS went to lunch for some hours after the clone command had nominally finished, as git decided to garbage-collect the result in the background.)

    Happily, many such projects tend to have the "embarrassingly parallel" nature, where all that's in common is a set of libraries or build tools, or possibly not even that. If, after discussion with your svn collaborators (and with Central Physics), you instead agree to split up the metaproject, you can proceed by replacing the --stdlayout argument to git svn clone with:

    --trunk=proj1 \

    for the first subproject,

    --trunk=proj1a \

    for the second, and so on. (Don't forget to add the trailing backslash if you've had to fold the line, and don't forget to use a different filename for the logfile each time. As you'll be doing all this repeatedly, now's the time to brush up on your shell-scripting skills.) Upshot: you will end up with N independent git projects, which of course works best if the projects really are effectively freestanding. You'll then, for example, be able to check out proj2 and proj-libs side by side, and use the latter to support the former.
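As a starting point for that scripting, here is a sketch which prints one clone command per subproject rather than running anything, so you can inspect the result first. The function name print_clone_cmds is invented for this example, the options are those used earlier in this Article, and the Bourne-shell "2>&1 |" form is used so that the output can be fed to sh.

```shell
# print_clone_cmds SVN_URL SUBPROJECT...
# Prints (does not run) one "git svn clone" command per subproject, each
# with its own target directory and logfile. (Hypothetical helper.)
print_clone_cmds() {
    url=$1
    shift
    for sub in "$@"; do
        printf 'git svn clone -A authors-transform.txt --username yourname --prefix=svn/ --trunk=%s %s %s 2>&1 | tee -a %s.clone.out\n' \
            "$sub" "$url" "$sub" "$sub"
    done
}

print_clone_cmds https://scm.physics.ox.ac.uk/svn/yourproject proj1 proj1a
```

Once the printed commands look right, pipe the output into sh from base-directory to run them in turn.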

    Excluding files and directories

    One can exclude files from an svn project by setting the svn:ignore property. Sadly, this has to be done for each directory separately, and it's all too easy to forget to set it, then inadvertently add some inappropriate (and possibly massive) files or subdirectories to the project.

    If you've fallen victim to this, you can prevent this propagating into the git repo you're about to produce by adding one or more options of the following form. Please note: the files will be transferred from the svn server, then discarded at the receiving end. (Please see the next collapsed section for a better way to exclude entire branches, which doesn't have this drawback, if that's more appropriate.)

        --ignore-paths='sdf$' \
        --ignore-paths='/tempdir/' \

    .... including the trailing backslashes, if you've had to fold the line.

    • The first excludes all files whose entire pathname ends in 'sdf' (use '\.sdf$' if you want to match only the .sdf extension); the equivalent for excluding DLLs is left as an exercise. Don't forget the trailing dollar sign: this is a regular expression, not a shell wildcard.

    • The second excludes a directory named tempdir anywhere in the project. You may in practice need to be more specific by adding the directory's parent, and perhaps grandparent. If in doubt, check the git repo afterwards, and if necessary adjust the regular expression and repeat.

    • .... and don't forget to add a trailing backslash to each clause if you've had to fold the line.

    While you're thinking about it, to make sure such accidents don't happen in your git repo, now's the time to make a note to add one or more corresponding lines to .gitignore (now using shell wildcarding syntax), which in this case would be:

    *.sdf
    tempdir/
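Before committing to a long clone, you can try an exclusion regex against a plain list of paths. The sketch below uses grep -E as a stand-in for the Perl regular expressions that git svn uses, which is close enough for simple patterns like these but not identical. The helper name would_ignore and the sample paths are made up. The two patterns are combined into one regex with "|", which avoids depending on how git svn treats a repeated --ignore-paths option (something this sketch makes no assumption about).

```shell
# would_ignore REGEX
# Filters a list of paths on stdin, printing those the regex matches.
# (Hypothetical helper; grep -E only approximates Perl regex syntax.)
would_ignore() {
    grep -E -e "$1"
}

printf '%s\n' \
    trunk/data/run1.sdf \
    trunk/docs/sdf-notes.txt \
    trunk/tempdir/scratch.o \
    trunk/src/main.c |
    would_ignore 'sdf$|/tempdir/'
# prints trunk/data/run1.sdf and trunk/tempdir/scratch.o
```

In practice the list of paths could come from svn list -R on your project URL.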

    Excluding svn branches

    It is entirely possible you may wish to not copy certain large and/or redundant branches from svn. Attempts to exclude them by pathnames (as in the previous section) will work, but the undesired files will be transferred from the svn server anyway, then discarded at the receiving end. Upshot: git svn clone produces a bunch of empty commits, and takes just as long to do so as if they weren't empty.

    The answer to this comes in two parts: transfer only the trunk, then add the desired set of branches to the git configuration, and fetch their contents.

    • Replace the --stdlayout line in the git svn clone incantation by:
          --trunk=trunk \
          --tags=tags \

      This omits all branches from the cloning, but copies over the tags (feel free to drop the second line if you're not bothered about those).

    • Now do the cloning, as amended, then add the desired branches to the configuration, by something of the form:
      cd project-name
      git config --local --list
      git config svn-remote.svn.branches \
          'branches/{b1,b2,b3}:refs/remotes/svn/*'
      git config --local --list

      .... where I've had to fold the third line at the backslash. Adjust the comma-separated list of branch names to taste.

    • If you're satisfied with the result, then:

      git svn fetch

      .... to bring over the desired branches. This may take a wee while, but should be quicker than cloning the entire project.

  • When the music stops (and with the above caveats):

    cd project-name

    .... and give the new git clone of your project a quick lookover: the directory contents should be the same as there would be in a fresh checkout of the trunk of your latest svn commit (apart from the respective sacred directories .svn/ and .git/). You can take the opportunity to clean it up a little, eg by adding or updating a README file in the base directory, and perhaps a .gitignore file as mentioned above (but see below about producing a .gitignore file from svn:ignore properties). We suggest, though, that you resist the temptation to do major surgery, at least until you've pushed your project to the git server.

    To see what git thinks is there, say:

    git branch -a

    This should show you something of the form:

    * master
      remotes/svn/my_first_branch
      remotes/svn/tags/release_1.0
      remotes/svn/trunk

    In this, master is a local branch corresponding to trunk which has been checked out for you, and branches and tags are named as themselves; the distinction between svn branches and tags should be clear. Note, by the way, the beneficial effect of the --prefix="svn/" argument: we would otherwise see:

    * master
      remotes/origin/my_first_branch
      remotes/origin/tags/release_1.0
      remotes/origin/trunk

    .... which is likely to lead to confusion when the time comes for origin to refer to your git server.

On Tags and Branches

To convert a remote svn branch to a local git branch, use something of the form:

git branch my_first_branch refs/remotes/svn/my_first_branch

(Don't forget svn/trunk is already checked out as master.) It's possible to automate this if you've a huge list, but it's instructive to do it at least once by hand for practice, after which you can script the process. Beware of spaces which may have crept into branch names (these will appear as "%20", but you should use underscores instead). As with all chainsaws, watch your wrists.
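For a long list of branches, the conversion might be scripted along the following lines. This is a sketch under assumptions: both function names are invented, it relies on the svn/ prefix used earlier in this Article, and it skips trunk and everything under tags/. Try it on a throwaway copy of the clone first.

```shell
# local_branch_name REF
# Maps a remote-tracking name such as "svn/my%20first%20branch" to a tidy
# local branch name: strips the "svn/" prefix and turns "%20" into "_".
local_branch_name() {
    printf '%s\n' "$1" | sed -e 's|^svn/||' -e 's|%20|_|g'
}

# convert_svn_branches
# Creates a local git branch for every svn branch, skipping trunk
# (already checked out as master) and anything under tags/.
convert_svn_branches() {
    git for-each-ref --format='%(refname:short)' refs/remotes/svn |
    while IFS= read -r ref; do
        case $ref in
            svn/trunk|svn/tags/*) continue ;;
        esac
        git branch "$(local_branch_name "$ref")" "refs/remotes/$ref"
    done
}

local_branch_name 'svn/my%20first%20branch'   # prints my_first_branch
```

Run convert_svn_branches from inside the new git clone, then check the result with git branch -a.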

Tags are a matter of personal taste or of house policy: a tag represents a snapshot in time, while a branch can (eg) accumulate production bug fixes. The main practical difference is that attempts to check out a tag yield a 'detached HEAD', and any work therein won't be saved without further effort.

If you wish to convert an svn tag to a git tag, use the following. Remember that until you do, the svn tag exists only in the svn remote information, and won't be saved to the git server later.

git tag -a -m"Converted svn tag" \
    release_1.0 \
    remotes/svn/tags/release_1.0

If you've been naughty and committed updates to the svn tag, or you wish to reserve that option under git, convert it to a git branch, thus:

git branch release/version_1.0 \
    remotes/svn/tags/release_1.0

(The suggested subdirectory-like branch naming will help segregate production-release branches from development ones, but means slightly more typing. It's your call.)

We suggest you use both tags and branches at the same time. The following creates a git branch from an svn tag, and tags the git branch by the svn tag's original name; both can then be pushed to the git server.

git branch release/version_1.0 \
    remotes/svn/tags/release_1.0
git tag -a -m"Converted svn tag" \
    release_1.0 \
    release/version_1.0
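With more than a handful of tags, the branch-then-tag pair can be generated rather than typed. The sketch below only prints the commands. The function name print_tag_cmds is invented, and for simplicity it names each branch release/ followed by the tag name unchanged, rather than renaming it as in the example above; adjust to your own naming policy.

```shell
# print_tag_cmds TAG...
# For each svn tag name, prints the command pair that creates a git
# branch from the svn tag and then tags that branch. (Hypothetical helper.)
print_tag_cmds() {
    for t in "$@"; do
        printf 'git branch release/%s remotes/svn/tags/%s\n' "$t" "$t"
        printf 'git tag -a -m "Converted svn tag" %s release/%s\n' "$t" "$t"
    done
}

print_tag_cmds release_1.0 release_1.1

# The tag names themselves can be listed from inside the clone with:
#   git for-each-ref --format='%(refname:short)' refs/remotes/svn/tags |
#       sed 's|^svn/tags/||'
```

Review the printed commands, then paste them into your shell or pipe them into sh from inside the clone.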

Handling svn:ignore

Nontrivial svn projects will use the svn:ignore property to selectively ignore certain files, and git uses the text file .gitignore for the same purpose. However, git svn clone will by default ignore all svn properties other than svn:executable.

To view the svn:ignore property of the base directory, say (while sat in it):

git svn propget svn:ignore

.... which leads to the following exceedingly quick-and-dirty way of creating .gitignore for yourself:

git svn propget svn:ignore | tee -a .gitignore

You may well need to repeat this exercise in subdirectories. Don't forget .gitignore is an ordinary file, so you'll need to add it into git in the normal way, both initially and whenever it's modified, so that the next git commit saves it:

git add .gitignore

This also pertains only to the current branch. If you've more than one, you'll need to repeat the whole thing from the top for every branch you're interested in.

HOT NEWS: the command:

git svn show-ignore

.... in the root of the git repo will show you the svn:ignore properties in all directories (in a git-compatible form), and:

git svn create-ignore

.... will add a matching .gitignore into each directory. These need to be checked in as a git commit; and this needs to be repeated for each branch of interest in which it applies.
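One subtlety if you build .gitignore by hand from git svn propget: an svn:ignore pattern applies only to the directory it is set on, whereas a .gitignore pattern without a slash also applies to every subdirectory below. Prefixing each pattern with "/" restores the svn behaviour. A minimal sketch follows; the function name anchor_ignores is made up for this example.

```shell
# anchor_ignores
# Reads svn:ignore patterns on stdin, drops blank lines, and prints each
# pattern with a leading "/" so that, in a .gitignore in the same
# directory, it matches only in that directory. (Hypothetical helper.)
anchor_ignores() {
    sed -e '/^[[:space:]]*$/d' -e 's|^|/|'
}

printf '%s\n' '*.o' '' 'build' | anchor_ignores
# prints /*.o and /build

# Typical use, while sat in the directory concerned:
#   git svn propget svn:ignore | anchor_ignores >> .gitignore
```

Whether you want the anchored or the recursive form is a matter of taste; the recursive form is often what people wished svn had done in the first place.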

Pushing to the git server

At this point, you've got a local git clone corresponding to your svn repository. Now to populate the project on gitlab.physics which you created at the start of play:

  • Tell your local git clone to use your new project space on gitlab.physics as a remote, associating the (traditional) name origin with it:
    git remote add origin \
          git@gitlab.physics.ox.ac.uk:project-name.git

    You can copy-paste the full URL from your project's homepage on gitlab.

  • Now push the master branch of your project to the server:

    git push -u origin master

    The option -u (in full: --set-upstream) tells git to record the branch master on the remote origin as the upstream of the local branch master, so that subsequent git push and git pull invocations on that branch need no further arguments.

    If you've got multiple branches and git tags, you can instead push them all at once (erm, twice):

    git push --all origin
    git push --tags origin

    If you don't wish to push git tags to the server, drop the second command. The defaults for git are that tags are transferred when copying from the server (by git fetch etc), but are only sent to the server by explicit request. This can be made to make sense.

Quality Assurance

Paranoia Dept: If OS X has lunched the Terminal process under which you did git svn clone, or if (like your humble Author) you have an abiding distrust of magic inscrutable software, you'd be wise to compare and contrast the svn and git versions of your project, as they appear on the respective servers.

  • Create two entirely separate and completely fresh copies somewhere else (eg base-directory/trial):
    cd somewhere_else
    svn co --username yourname \
        https://scm.physics.ox.ac.uk/svn/yourproject \
          checkout_svn_dir
    git clone \
        git@gitlab.physics.ox.ac.uk:project-name.git \
          checkout_git_dir
  • Do a recursive diff of the two copies, eg by:

    diff -ur checkout_svn_dir/ checkout_git_dir/ | less

    This will show you which files are present in one but not the other, and which files are present in both but differ in content.

  • The true paranoid (you know who you are) would loop back to the top and repeat this, once for each active branch of interest.

If you can account for all the differences, congratulations: the git server's copy of your project matches the svn server's sufficiently closely. If you can't, that's either a bug or a lack of clarity on our part, or just possibly a misunderstanding on yours; in any case, please send full details to us at itsupport@physics, and we'll investigate.
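The comparison is easier to read if the bookkeeping directories are left out of it. The sketch below wraps diff accordingly; the function name compare_checkouts is invented, and the -x option is assumed to be available (it is in GNU diff and in the diff shipped with macOS). Note that if you checked out the whole svn project with the standard layout, the directory to compare against the git clone is the trunk subdirectory of the svn checkout, not its top level.

```shell
# compare_checkouts SVN_DIR GIT_DIR
# Lists files that differ, or that exist on one side only, ignoring the
# .svn/ and .git/ bookkeeping directories. Returns diff's exit status:
# 0 if the trees match, 1 if differences were found. (Hypothetical helper.)
compare_checkouts() {
    diff -rq -x .svn -x .git "$1" "$2"
}

# Typical use:
#   compare_checkouts checkout_svn_dir/trunk checkout_git_dir | less
```

For another branch, switch the git clone to that branch, point the first argument at the matching directory of the svn checkout, and run it again.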

Postlude

At this point, you'll have a single-user Project on gitlab.physics which may happen to have others' commits in it. For how to proceed, please see:

Once you (and any collaborators you may have) are happily using gitlab.physics, feel free to clone the project from gitlab.physics into another fresh directory on your system (to leave any remaining links to scm.physics behind), and work in that. Continued use of git svn (to help keep svn and git versions of a project in sync) is possible, but this Article is already too long.
