Monthly Archives: April 2011

Assisted landing of patches on mozilla-central

Imagine this for a second.  You work on fixing something, get the required reviews, and run your patch on the try server.  Everything looks good.  Then you set a flag or something on the bug, and go home, and enjoy your evening.  The next morning, you’re reading your bugmail while enjoying your coffee, and you see a message from landingbot@mozilla.org saying that the patch has successfully landed on mozilla-central, you smile, and wonder how people used to spend 4 hours watching the tree (and possibly getting yelled at by me or philor in the meantime) when they wanted to land something on mozilla-central.  This, my friend, is what we need to move towards, I think.

Now, I’m not entirely delusional here.  We have a very large number of tests covering all aspects of our code, including correctness and performance.  We already use these tests to judge the quality of a patch, and their results can just as easily be read by a machine to make the same judgement.  So, we can put these tests to use to automate this process.  Here is roughly what I have in mind:

  1. We would have a bot which constantly watches Bugzilla for automated landing requests.  Once such a request is found, it gets added to a queue.
  2. To land each change, the bot takes the head of the queue and imports the patch (or hg bundle, in the case of bugs with multiple patches) into a clone of mozilla-central.  If the import fails (for example, because the patch has bit-rotted), the bot aborts and reports the problem on the bug.  Otherwise, the bot pushes the changes to the try server.
  3. The bot would watch the try server for the results of the push.  If the push has more than one orange job, the bot aborts the landing process and reports the problem on the bug.  If the push has only one orange job, the bot retriggers that job, reports the possibility of an intermittent orange on the bug, and goes back to watching the try server push.  If the push is all green, the bot takes the change, transplants it onto a fresh clone of mozilla-central (and aborts if the patch has been bit-rotten since step 1) and pushes it to mozilla-central.
  4. The bot would watch the mozilla-central push.  For any orange job, the bot retriggers it.  If the second run of the job goes green, the bot reports the orange as intermittent on the bug.  Otherwise, the bot backs out the change.  When the push gets one green run for every job, the bot reports success on the bug, and marks the bug as RESOLVED FIXED.
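The retrigger logic in steps 3 and 4 above boils down to a small decision rule.  Here is a minimal sketch of it in Python; the function name and the job-result representation are my own invention for illustration, not part of any real bot:

```python
def triage_push(job_results, retried=None):
    """Decide the next action for a push, per steps 3-4 above.

    job_results: dict mapping job name -> "green" or "orange".
    retried: set of job names that have already been retriggered once.
    Returns a (action, jobs) tuple, where action is one of
    "land", "retrigger", or "abort".
    """
    retried = retried or set()
    oranges = sorted(n for n, s in job_results.items() if s == "orange")

    if not oranges:
        # All green: safe to land (or, on mozilla-central, report success).
        return ("land", [])
    if len(oranges) == 1 and oranges[0] not in retried:
        # A single orange may be intermittent: retrigger it once.
        return ("retrigger", oranges)
    # Multiple oranges, or an orange that stayed orange after a retrigger:
    # give up and report (or back out) instead of landing.
    return ("abort", oranges)
```

For example, a push with one orange mochitest job would get `("retrigger", ["mochitest"])` the first time around, and `("abort", ["mochitest"])` if the retriggered run comes back orange again.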

"Yeah, right!", you would say, "Like there’s ever a push which doesn’t see any intermittent orange!".  But if you’ve been watching the tree closely during the past week or so, you would have noticed that there are a lot of pushes which do not see any intermittent oranges (I’m not talking about oranges, or worse, reds, caused by people landing untested stuff; those will be reliably caught by our good robot).  These pushes are still not the majority of pushes, but we’re getting there.  Slowly, but surely.  Take a look at this image, which can be found here.

Orange Factor going down

The situation is not improving on its own.  It’s improving because of all of the wonderful developers who are working on fixing intermittent orange bugs in their areas of expertise (and some brilliant people who even go one step further and fix oranges in areas of code unfamiliar to them)!  You can help too.  But more on that in a future post.

Once we reach an average of 1 intermittent orange per push, we could make such a plan work for real.  I don’t know about you, but this makes me really excited.  I think we all have better stuff to do than watching the tree for hours after we land something.


Avoiding intermittent oranges

Writing tests which are resilient against intermittent failures is hard.  In the process of trying to fix a large number of intermittent orange bugs, I’ve found that a large portion of them are caused simply by mistakes in writing tests, and almost all of those mistakes fall into commonly recurring patterns.  It’s hard to avoid those mistakes unless you know how they lead to intermittent oranges, and how to avoid them.

In order to share my experience about what types of patterns can cause a test to fail intermittently, I’ve gathered a list of common intermittent failure patterns on MDN, and I urge everybody who writes tests for the Mozilla project to go ahead and take a look at that list.  I think if test writers and reviewers keep those items in mind when creating or reviewing a test, we can dramatically reduce the number of new tests which are susceptible to intermittent failures.
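To give one concrete example of such a pattern: waiting a fixed amount of time for something to happen, instead of waiting for the thing itself.  A hard-coded sleep passes on a fast machine and fails intermittently on a slow, loaded test machine.  Here is a generic illustration in Python (the helper name `wait_for` is mine, not from any particular test framework):

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll until condition() returns true, instead of sleeping a fixed time.

    A fixed `time.sleep(1)` before checking a result is a classic source of
    intermittent failures: one second may not be enough on a slow machine,
    while a generous polling timeout only runs to its limit when the test
    actually fails.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check, in case the condition became true right at the deadline.
    return condition()
```

The test then asserts on `wait_for(lambda: result_is_ready(), timeout=30)` rather than sleeping and hoping; the same idea applies in any test harness that lets you poll or listen for an event.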

And please make sure to add to the list if you know of other such patterns that I’ve missed.
