It's been 10 weeks since I started writing these newsletters (the number in the title isn't an off-by-one error; there was a one-week hiatus due to a work week!). We still have quite a bit of work ahead of us, but we have also accomplished a good amount. Finding a good metric for progress is hard, but we live and breathe in Bugzilla, so we use a bug-based burn-down chart. As you can see, the number of open bugs is starting to decrease, and this is while we are actively adding tens of new bugs to the pool in the weekly triage meetings.
The other thing this burn-down chart shows is that we need help! Very recently Kan-Ru came up with the great idea of creating the qf-bugs-upforgrabs tracker bug. These are reasonably self-contained bugs that require less specific domain knowledge and can be worked on by anyone in a reasonable time frame. Please consider taking a look at the dependency list of that bug to see if something interests you! (The similarity of this tracker bug to photon-perf-upforgrabs isn't an accident!)
On the telemetry hang reports data collection, the new data from hangs of 128ms or longer has been coming in, but there have been some wrinkles in actually receiving this data, and also in receiving the hang data correlated with user interactivity. Nika Layzell has been tirelessly at work on the BHR backend to make it suit our needs, and has been running into the limits of the available computational resources while symbolicating the BHR reports on people.mozilla.org (now moved to AWS!).
I realized we haven't had a performance mini-story for a while – I sort of dropped the ball on that. Running into this bug made me want to talk about a pretty well-known source of slowness in C++ code: virtual functions. The cost of virtual functions comes from several different aspects. Firstly, they effectively prevent the compiler from inlining the function; inlining enables a host of other compiler optimizations, essentially by letting the compiler see more of the code and optimize more effectively based on that. Then there is the runtime cost of the function call itself, which mostly comes from the indirect call.

On modern hardware, the majority of the performance penalty here is due to branch mispredictions when different implementations of a virtual function get called at the same call site. Remember that on modern desktop processors, the cost of a branch misprediction can be around 15-20 cycles (depending on the processor), so if what your function does is very trivial, and it has many overrides that can be called in hot code, chances are you are spending a considerable amount of time on mispredicted indirect branches at the calls to the virtual function in question. Of course, finding which virtual functions in your program are the expensive ones requires profiling the workloads you care about improving, but always keep an eye out for this problem, as unfortunately the object-oriented programming model in C++ really encourages writing code like this. This is the kind of issue that a native profiler is probably more suitable for discovering; for example, if you are using a simple native sampling profiler, these issues typically show up as a long time being spent on the first instruction of the virtual function being called (which is typically an inexpensive instruction otherwise).
Now it's time to acknowledge the work of all of you who have helped improve the performance of the browser in the last week. As always, I hope I'm not forgetting anyone:
Doug Thayer ported the Gecko Profiler add-on to be a WebExtension! One important impact of this work is that it makes it possible to profile Firefox using this add-on without incurring the performance impact of having an extension that uses the add-on SDK installed.
Kris Maglione added support for pre-loading scripts during startup on a background thread. This helps improve startup performance for the parent process.
David Anderson made us composite asynchronously on Windows when resizing a widget. This can reduce main thread jank, for example when opening a window. He also made PLayerTransaction's constructor async, removing a synchronous IPC message that we used to incur when opening a new window.
David Baron ensured that PLDHashTable's second hash doesn't get padded with 0 bits for tables with a capacity larger than 2^16. This hopefully reduces the risk of encountering long chains in large hash tables, which could improve some of the hash table performance issues we have noticed come up in profiles.
Cameron McCormack made dom::FontFace cache its gfxCharacterMap instead of rebuilding it every time.
William Chen made us reuse StackNodes in HTML parser TreeBuilder in order to avoid malloc overhead.
Gabor Krizsanits enabled preallocating content processes by default, which should give us perceived performance wins when opening new tabs and windows.
Nathan Froyd made it possible to profile Stylo Rayon threads using the Gecko profiler.
Bas Schouten moved pointers to DisplayDataItems directly onto nsIFrame. This will allow more efficient access to them by avoiding a lot of hashtable lookups, and providing better data locality.
Nika Layzell made us avoid checking for permissions that almost never exist, bypassing the overhead of nsContentBlocker in the common case where they don't.
William Chen flattened attribute storage in the HTML parser in order to avoid the cost of dynamic memory allocation.
Thinker Li added a shortcut to nsFrame::BuildDisplayListForChild() in order to improve display list construction speed by remembering the results of the previous rounds of computation.
Tim Taubert imposed a 2KB limit on the amount of session storage data preserved by session restore.
Jan de Mooij optimized Array.prototype.shift to have O(1) rather than O(n) behavior. This is especially nice considering JS libraries that use arrays as queues, which tend to call shift() inside a loop, causing us to exhibit quadratic behavior before this change.