My experience adding a new build type using TaskCluster

TaskCluster is Mozilla's task queuing, scheduling and execution service. It allows the user to schedule a DAG representing a task graph that describes some tasks, their dependencies, and how to execute them, and it schedules them to run in the needed order on a number of slave machines.

As of a while ago, some of the continuous integration tasks have been running on TaskCluster, and I recently set out to enable static analysis optimized builds on Linux64 on top of TaskCluster. I had previously added a similar job for debug builds on OS X in buildbot, and I am amazed at how much the experience has improved! It is truly easy to add a new type of job now as a developer without being familiar with buildbot or anything like that. I'm writing this post to share my experience on how I did this.

The process of scheduling jobs in TaskCluster starts with a slave downloading a specific revision of a tree and running the ./mach taskcluster-graph command to generate a task graph definition. This is what happens in the “gecko-decision” jobs that you can see on TreeHerder. The task graph is computed using the task definition information in testing/taskcluster. All of the definitions are in YAML, and I found the naming of variables relatively easy to understand. The build definitions are located in testing/taskcluster/tasks/builds and after some poking around, I found linux64_clobber.yml.

If you look closely at that file, a lot of things are clear from the names. Here are important things that this file defines:

$inherits: These files have a single inheritance structure that allows you to refactor the common functionality into “base” definitions.
A lot of things have “linux64” in their name. This gave me a good starting point when I was trying to add a “linux64-st-an” (a made-up name) build by copying the existing definition.
payload.image contains the name of the docker image that this build runs. This is handy to know if you want to run the build locally (yes, you can do that!).
It points to builds/releng_base_linux_64_builds.py which contains the actual build definition.
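
To make the single inheritance idea behind $inherits a bit more concrete, here is a minimal Python sketch of how a chain of definitions could be resolved into one final definition. This is not TaskCluster's actual implementation. The file names, the fields, and the `from` key used to name the parent are all assumptions made up for illustration, and the definitions are plain dictionaries standing in for parsed YAML.

```python
from copy import deepcopy

# Hypothetical stand-ins for parsed YAML task files. The file names, fields
# and the "from" key are illustrative assumptions, not real tree contents.
TASK_FILES = {
    "base_build.yml": {
        "task": {
            "payload": {"image": "example/builder", "env": {"MOZCONFIG": "base"}},
            "metadata": {"name": "base build"},
        }
    },
    "linux64_example.yml": {
        "$inherits": {"from": "base_build.yml"},
        "task": {
            "payload": {"env": {"MOZCONFIG": "linux64"}},
            "metadata": {"name": "linux64 example build"},
        },
    },
}


def deep_merge(base, override):
    """Return a new dict in which values from override win; nested dicts are merged."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(name, files=TASK_FILES, seen=()):
    """Resolve a definition by following its single chain of parents."""
    if name in seen:
        raise ValueError(f"inheritance cycle at {name}")
    definition = dict(files[name])  # shallow copy so the source is not mutated
    parent = definition.pop("$inherits", None)
    if parent is None:
        return deepcopy(definition)
    base = resolve(parent["from"], files, seen + (name,))
    return deep_merge(base, definition)
```

Calling `resolve("linux64_example.yml")` gives a definition that keeps the docker image from the base file while using the child's own environment and name, which is the kind of reuse that made copying an existing definition such a cheap way to start.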

Looking at the build definition file, you will find the steps run in the build, whether the build should trigger unit tests or Talos jobs, the environment variables used during the build, and most importantly the mozconfig and tooltool manifest paths. (In case you're not familiar with Tooltool, it lets you upload your own tools to be used at build time. These can be new experimental toolchains or custom programs your build needs to run, which is useful for things such as performing actions on the build outputs.)

This basically gave me everything I needed to define my new build type. I did that in bug 1203390, and these builds are now visible on TreeHerder as “[Tier-2](S)” on Linux64. This is the gist of what I came up with.

I think this is really powerful since it finally allows you to fully control what happens in a job. For example, you can use this to create new build/test types on TreeHerder, do try pushes that test changes to the environment a job runs in, or do highly custom tasks such as creating code coverage results, which requires a custom build step, custom test steps and uploading of custom artifacts! Doing this under the old BuildBot system was unheard of. Even if you went out of your way to learn how to do that, as I understand it, there was a maximum number of build types that we were getting close to, which prevented us from adding new job types as needed! And it was much harder to iterate on (as I did when I was bootstrapping a whole new build type on the try server!) since your changes to BuildBot configs needed to be manually deployed.

Another thing to note is that I found out all of the above pretty much by myself, and didn't even have to learn every bit of what I encountered in the files that I copied and repurposed! This was extremely straightforward. I'm already on my way to adding another build type (using Ted's bleeding edge Linux to OS X cross compiling support)! I did hit hurdles along the way but almost none of them were related to TaskCluster, and with the few that were, I was shooting myself in the foot and Dustin quickly helped me out. (Thanks, Dustin!)

Another neat feature of TaskCluster is the inspector tool. In TreeHerder, you can click on a TaskCluster job, go to Job Details, and click on “Inspect Task”. You'll see a page like this. In that tool you can do a number of neat things. One is that it shows you a “live.log” file, which is the live log of what the slave is doing. This means that you can see what's happening in close to real time, without having to wait for the whole job to finish before you can inspect the log. Another neat feature is the “Run locally” commands that show you how to run the job in a local docker container. That will allow you to reproduce the exact same environment as the one we use on the infrastructure.

I highly encourage people to start thinking about the ways they can harness this power. I look forward to seeing what we'll come up with!