A while ago I spent some time generating the perfect git mirror of mozilla-central, and it's now up on github. Here's the story behind the repository. If you're not interested in history lessons, scroll down to the “What does this mean for me?” section.
Jeff spent quite some time convincing me that git is superior to mercurial. He was right, and I'm glad I listened to him. So I decided that I wanted to use git for most of my Mozilla development. Some time before that, Chris Double had gone through the trouble of creating a git mirror of mozilla-central using hg-git, so I started to use that repository. All was well until one day Jeff taught me about grafting two git repositories. Grafting means replacing the parent of a commit in one repository so that it points to a commit in another local repository. Jeff had created a git mirror of the old Mozilla CVS repository. The curious reader may realize what this meant: you could graft the git mirror of mozilla-central onto the old CVS mirror, and you would get yourself a repository containing all of Mozilla's history. That's right! No more cross-blaming stuff on hg and bonsai. You would just do git log or git blame in the grafted repository, and things would work as if we had never abandoned multiple years of the project's history when we migrated from CVS to mercurial. Beautiful!
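For the curious, a graft is traditionally recorded as a single line in the repository's `.git/info/grafts` file: the SHA-1 of the commit being re-parented, followed by the SHA-1s of its fake parents, separated by spaces. Below is a minimal Python sketch of building such a line. The function name and the placeholder hashes are hypothetical, not the ones used for the real repositories.

```python
import re

# A full SHA-1 object name is 40 lowercase hex digits.
SHA1_RE = re.compile(r"[0-9a-f]{40}")

def graft_line(commit, parents):
    """Build one record for .git/info/grafts: '<commit> <parent> ...'."""
    for sha in [commit, *parents]:
        if not SHA1_RE.fullmatch(sha):
            raise ValueError(f"not a full SHA-1: {sha!r}")
    return " ".join([commit, *parents])

# Placeholder hashes, for illustration only.
first_hg_commit = "a" * 40   # oldest commit of the mozilla-central mirror
last_cvs_commit = "b" * 40   # newest commit of the CVS mirror
print(graft_line(first_hg_commit, [last_cvs_commit]))
```

Appending the printed line to `.git/info/grafts` is what makes git pretend that the first commit has the second one as its parent. Newer versions of git prefer `git replace --graft` for the same job.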
Now, grafting two repositories has some problems. Graft support was added to git as an afterthought, which means that you cannot publish a grafted repository so that others can consume it, and you might occasionally find that some git commands do not properly handle grafted repositories. So, I took it upon myself to share the joy of the full history with everyone else in the project. That was easier said than done!
We discovered that git's history rewriting tool, the filter-branch command, doesn't really know about grafts. This has an exciting side effect: if you issue a filter-branch command in your grafted repository starting at the parent of the graft point, filter-branch will create a full alternate history of your repository, with different commit SHA1s (since the parent SHA1 has changed), and the result is a real non-grafted git repository. So I took Chris’ and Jeff's repositories, grafted them together, and started running filter-branch to convert the grafted repository into a regular repository. After about a day or so (yes, git filter-branch is that slow), I had a nice error message complaining that I had a commit with an invalid author line. What the heck, the reader might ask? It turns out that mercurial is a hipster when it comes to author lines for commits, and git is bipolar.
Mercurial treats the author information for a given changeset as basically a free-form text field. You can put anything you want in there, and Mercurial will store it for you and display it as is. What you see is what you put into it (although not necessarily what you intended). Git, however, has a stricter notion of what an author line can be. To put it roughly, git expects the author information to be in the form of “Name <firstname.lastname@example.org>” (yes, it won't even allow multiple people to take credit for a commit!). The author lines that hg-git produces from mercurial changesets were sort of sanitized to conform to that format, but not quite, and weird things that we have in our mercurial history, such as this changeset from Ms2ger, confused hg-git. At this point, it was very easy to blame hg-git, or at least Ms2ger, but being the responsible person that I am(!), I decided to delve a little bit deeper into this. Having looked into git's source code, I found that most of its high-level tools enforce this author line format, but some of its internal tools don't, and readers who know anything about git's source code know that looking for anything remotely similar to consistency in its code is like looking for a needle in a pile of haystacks where you know there's no needle to be found. Hence the bipolarity diagnosis for git. Now, it was time to get practical and address the problem somehow.
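To make the mismatch concrete, here is a minimal Python sketch of squeezing a free-form Mercurial user string into git's “Name <email>” shape. This is not hg-git's actual code, nor the patch described in this post. The function name, the fallback values and the sample inputs are all made up for illustration.

```python
import re

def to_git_author(hg_user, fallback_email="none@none"):
    """Coerce a free-form Mercurial user field into 'Name <email>'."""
    hg_user = hg_user.strip()
    # Well-behaved case: some text followed by exactly one <...> pair.
    m = re.fullmatch(r"([^<>]*)<([^<>]*)>", hg_user)
    if m:
        name, email = m.group(1).strip(), m.group(2).strip()
    elif re.fullmatch(r"[^\s<>]+@[^\s<>]+", hg_user):
        # Bare email address: reuse the part before the first '@' as the name.
        name, email = hg_user.split("@")[0], hg_user
    else:
        # Anything else (several authors, stray brackets, plain names):
        # keep the text as the name and use the hypothetical fallback email.
        name, email = re.sub(r"[<>]", "", hg_user).strip(), fallback_email
    return f"{name or 'unknown'} <{email or fallback_email}>"

# Made-up inputs covering the three branches above.
for user in ["Jane Doe <jane@example.org>",
             "jane@example.org",
             "Jane <jane@example.org> and Bob <bob@example.org>"]:
    print(to_git_author(user))
```

The last sample is the interesting one: since git has no room for two authors, this sketch simply folds everything into the name field. A real converter would have to make its own call on which information to keep.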
So I decided to fix hg-git, because, “what could possibly go wrong?”. The fix itself was fairly easy, even for somebody like me who only pretends to know Python (and really just looks up all of the language constructs and library functions on Google and types them into his text editor). And I did that, and I tested my fix, and it avoided the Ms2ger problem! So I went ahead and attempted to convert mozilla-central's mercurial repository using my patched hg-git. Little did I know that hg-git is the slowest piece of software ever written! After 3-4 days, it finally finished converting the seventy-something thousand changesets in the source Mercurial repository. And after a day of running git filter-branch (remember what the workflow looks like?), I came in to the office one morning to find out that filter-branch had died on another commit, further down the history line, again because of a bad author line.
To keep this blog post short enough that you can actually download it on a fast connection, I'll just say that I had to do this whole round a few more times, each time fixing more problems that the hg-git authors did not anticipate. With a turn-around time of about a business week for every single round, you can probably guess why I grin when people complain these days about waiting 4-5 hours for their try server results.
Finally, I had fixed all of the hg-git bugs that the mozilla-central history helped me catch. And being a good open source citizen and all of that, I upstreamed my hg-git patches (well, really here's where I upstreamed them, since I was confused about the patch submission process for hg-git!).
So, I finally had a full git mirror of mozilla-central containing all of Mozilla's history. This was maybe a couple of months after I started this project (which I was working on in my free time!). I had shed enough blood and tears, and I thought it was useful enough for people, that I sneaked it in under Mozilla's github account.
Then, I decided that a git mirror that does not update based on the main repository is not worth much. So I spent a little time showing off my lack of shell scripting skills by creating a cron script which would update the git mirror based on the stuff pushed to mozilla-central. A few months later somebody (sorry, don't remember who… maybe jlebar?) pinged me and asked me whether my mirror had an inbound branch. I said no, and I wanted to explain why I didn't really have time to add that, but I realized that it would take me less time to modify the mozilla-central update script to also include mozilla-inbound, so I sat down and did that, and now I had a branch tracking mozilla-inbound!
I didn't really talk much to people about the existence of the repository, mostly because I wanted to write this blog post first (and it took me only about a year to do that). Then some time ago Andreas Gal told me that the b2g project is based on my repository, and that there are apparently tons of people who are using this repository for their day-to-day development. This was both nice to hear and frightening at the same time (scroll down to the Fun Facts section to know why!), and it motivated me to finally sit down and write this blog post (about a couple of months after talking to Andreas… I know, I know!).
What does this mean for me?
If you're a Mozilla developer who's fed up^H^H^H^H^H^H prefers to use git as opposed to mercurial, just clone the git mirror and start using git. There's even an inbound branch in that repository if you really want to live on the bleeding edge.
If you're a Mozilla developer who has been using Chris’ git mirror, you should switch to this mirror instead, since Chris has stopped updating his mirror. The update should be fairly painless if you pull my mirror's master branch and rebase your local branches on top of it. Once you have rebased all of your branches, git gc will kick in at some point and clean out the old history that you're no longer using.
If you're interested in having a repository with the full history of the Mozilla project, including the CVS history, either clone the git mirror and run git log and git blame locally, or use the github UI for blames (here's a demo). But be warned that github is sort of slow for large projects, so you will be much better off with a local clone and running git blame (or fugitive.vim, if you're a vim user.)
If you're interested in following my steps to do your own conversion, I have good news for you. I have documented the detailed steps for this conversion from the bare CVS and mercurial repositories to the final git repository. That directory also includes all of the files and resources that you will need for the conversion.
If you're interested in more goodies available for this git mirror, check out the latest git-mapfile, the latest git commit and the corresponding hg changeset (and the latest inbound git commit and the corresponding mozilla-inbound hg changeset). The mozilla-history-tools repository is constantly updated as my update scripts pick up newer changesets from mozilla-central and mozilla-inbound, so that it always points to the latest commits and git-mapfiles.
The update scripts are running on my Linux desktop machine at the office. The mozilla-central update script runs every 30 minutes, much less often than the mozilla-inbound update script, which runs every 10 minutes. The box is connected to a UPS, so we have at least some protection against power interruptions. I do very little monitoring on the update scripts to make sure that they continue to run smoothly. That monitoring consists of glancing over the emails that cron sends me with the stdout output of the update scripts, and fixing up the problems as they come up. As I have fixed more and more problems, the updates have been running fairly smoothly and without any major problems for the past few months. I did the original work on the repository in my free time, and I did it because I thought it was useful and I personally wanted better tools for my day-to-day job. I am glad that others have found it useful.