Quantum Flow Engineering Newsletter #12

It has been a few weeks since I last gave an update on our progress in reducing the number of slow synchronous IPC messages that we send across our processes.  This isn’t because there hasn’t been a lot to talk about.  Quite the contrary: so much great work has happened here that for a while I decided it might be better to highlight other ongoing work instead.  But now that the development cycle of Firefox 55 is coming to a close, it’s time to have another look at where we stand on this issue.

I’ve prepared a new Sync IPC Analysis for today including data from both JS and C++ initiated sync IPCs.  The first bit of unfortunate news is that the historical data in the spreadsheet is lost because the server hosting the data had a few hiccups and Google Spreadsheets seems to really not like that.  The second bit of unfortunate news is that our hope that disabling the non-multiprocess compatible add-ons by default in Nightly would reduce some of the noise in this data doesn’t seem to have panned out.  The data still shows a lot of synchronous IPC triggered from JS as before, and the lion’s share of it is messages that are clearly coming from add-ons judging from their names.  My guess about why is that Nightly users have probably turned these add-ons back on manually.  So we will have to live with the noise in the data for now (this is an issue that we have to struggle with when dealing with a lot of telemetry data unfortunately, here is another recent example that wasted some time and energy).

This time I won’t give a percentage based break-down.  Now that many of these bugs have been fixed, the impact of really commonly occurring IPC messages such as the one we have for document.cookie makes the earlier method of exploring the data pointless (you can explore the pie chart to get a quick sense of why, I’ll just say that message alone is now 55% of the chart and that plus the second one together form 75% of the data).  This is a great problem to have, of course: it means that we’re now starting to get to the “long tail” part of this issue.

The current top offenders, besides the document.cookie bug mentioned above (on which great progress is still being made, BTW!), are add-on/browser CPOW messages, two graphics initialization messages that we send at content process startup, NotifyIMEFocus that’s in the process of being fixed, and window.open().  I’ve spent weeks on that last one but have yet to fix all of our tests to be able to land my fixes, and I’ve temporarily given up on it to work on something that isn’t this bug for a little while!  Besides those, if you look at the dependency list of the tracker bug, there are many other bugs that are very close to being fixed.  Firefox 55 is going to be much better from this perspective and I hope the future releases will improve on that!

The other effort that is moving ahead quite fast is optimizing for Speedometer V2.  See the chart of our progress on AreWeFastYet.com:

Last week, our score on this chart was about 84.  Now we are at about 91.  Not bad for a week’s worth of work!  If you’re curious to follow along, see our tracker bug.  Also, Speedometer is a very JS heavy benchmark, so a lot of the bugs that are filed and fixed for it happen inside SpiderMonkey, so watching the SpiderMonkey specific tracker bug is probably a good idea as well.

It’s time for a short performance story!  This one is about technical debt.  I’ve looked at many performance bugs over the past few months of the Quantum Flow project, and in many cases the solution has turned out to be just deleting the slow code, that’s it!  It turns out that as a large code base ages, it accumulates a lot of code that isn’t really serving any purpose any more, but nobody discovers this because it’s impractical to scrutinize every single line of code.  Some of this unnecessary code is bound to have severe performance issues, and when it does, your software ends up carrying that cruft for years!  Here are a few examples: a function call taking 2.7 seconds on a cold startup doing something that became unnecessary once we dropped support for Windows XP and Vista, some migration code that was doing synchronous IO during all startups to migrate users of Firefox 34 and older to a newer version, and an outdated telemetry probe that turned out to not be in use any more, scheduling many unnecessary timers and causing unneeded jank.

I’ve been thinking about what to do about these issues.  The first step is to fix them, which is what we are busy doing now, but finding these issues typically requires some work, and it would be nice if we had a systematic way of dealing with some of them.  For example, wouldn’t it be nice if we had a MINIMUM_WINDOWS macro that controlled all Windows specific code in the tree?  In the case of my earlier example, the original code would perhaps have checked that macro against the version it was needed for (anything older than 7), and when we bumped MINIMUM_WINDOWS up to 7 along with bumping our release requirements, such code would have turned itself into preprocessor waste (hurray!).  But of course, the hard part is finding all the code that needs to abide by this macro, and the harder part is enforcing this consistently going forward!  Some of the other issues aren’t possible to deal with this way, so we need to work on getting better at detecting these issues.  Not sure, definitely some food for thought!
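
To make the MINIMUM_WINDOWS idea a bit more concrete, here is a minimal sketch of what such a guard could look like.  Everything in it is an assumption made for illustration: the macro, the WINDOWS_* constants (which borrow the encoding of the Windows SDK’s _WIN32_WINNT values), and the StartupSteps() and SupportsWindowsOlderThan() helpers are hypothetical names, not code that exists in the tree today.

```cpp
#include <cassert>
#include <string>

// Hypothetical version constants, encoded the same way as the Windows SDK's
// _WIN32_WINNT values (major version in the high byte, minor in the low byte).
#define WINDOWS_XP    0x0501
#define WINDOWS_VISTA 0x0600
#define WINDOWS_7     0x0601

// Hypothetical tree-wide setting: the oldest Windows release still supported.
// In a real tree this would be set once by the build system.
#ifndef MINIMUM_WINDOWS
#define MINIMUM_WINDOWS WINDOWS_7
#endif

// True if the build still has to support a Windows release older than
// `version`. Usable in static_assert or `if constexpr` where #if is awkward.
constexpr bool SupportsWindowsOlderThan(int version) {
  return MINIMUM_WINDOWS < version;
}

// Stand-in for a startup routine that used to carry a slow workaround
// for XP and Vista. It returns the steps it ran so the effect is visible.
std::string StartupSteps() {
  std::string steps;
#if MINIMUM_WINDOWS < WINDOWS_7
  // Only compiled while XP or Vista is still supported. Once
  // MINIMUM_WINDOWS is bumped to WINDOWS_7 this branch disappears
  // from the binary without anyone having to find and delete it.
  steps += "legacy-workaround;";
#endif
  steps += "normal-startup;";
  assert(!steps.empty());  // the normal path always runs
  return steps;
}
```

With MINIMUM_WINDOWS at WINDOWS_7 the legacy branch is never compiled, and building with -DMINIMUM_WINDOWS=0x0501 brings it back.  The sketch doesn’t help with the hard parts mentioned above, namely finding existing code that should be guarded and keeping new code honest.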

I’ll stop here, and move on to acknowledge the great work of all of you who helped make Firefox faster this past week!  As per usual, apologies to those who I’m forgetting to mention here:

17 comments on “Quantum Flow Engineering Newsletter #12”
  1. Am I reading that Speedometer chart wrong, or isn’t a lower score better? In that case, it seems like performance has gotten steadily worse.

    • ehsan says:

      Those charts are the reverse, higher is better, and the unit is misreported, it’s the benchmark score, not milliseconds.

  2. Godot says:

    I had a hard time guessing why 91ms is better than 84ms. Then I realized it’s a score and both the graph label (“Execution Time (ms)”) and plot values unit (91ms) are wrong/misleading. You should probably take a look at it.
    Thanks for your great work and reports!

  3. Chris says:

    How much overhead is there for collecting telemetry data for expired telemetry probes? Is the data discarded on the client or server? Even if discarding expired probe data on the client side is cheap, the code to produce the data still has to run and might be slow (like the default browser check you linked to).

    btw, the Speedometer graph is a little confusing because the Y axis is mislabeled (on AWFY) as “Execution time (ms)”, where smaller would mean faster.

    • ehsan says:

      In this case the bad part about the code was the fact that it was scheduling repeated timers but we have also had cases where the actual collection code has been expensive, see for example this bug. It really depends on the case at hand. And I’m sure we also have useless telemetry probes lying around that in no way hurt performance also…

  4. Caspy7 says:

    Really excellent work everyone!
    Thanks for the update Ehsan!

    I think the best thing that Quantum Flow can do, besides its main performance mission right now, is to prevent future changes from reintroducing code with the mistakes it’s currently fixing.

    We simply can’t afford for things to get worse over time and then do another cleanup (and can’t rely on institutional knowledge to save us).

  5. Pascal Chevrel says:

    >My guess about why is that Nightly users have probably turned these add-ons back on manually. So we will have to live with the noise in the data for now

    If this is not effective, maybe extensions.allow-non-mpc-extensions should be set to true by default on Nightly?

    • ehsan says:

      That seems to be too controversial at this point, and we will probably turn these add-ons off at some point during the 57 time frame anyway… :-/

  6. These articles are very informative, thank you very much! I wanted to ask a couple of questions using quotes from the article:

    1. “Andrea Marchesini landed infrastructure that should allow background tabs to have lower process priority. As support for each platform lands…”
    I am trying to find a bug in order to track when this feature will be enabled in Nightly but I only found bug 1366358. Could you help, please?

    2. “..our hopes for disabling the non-multiprocess compatible add-ons don’t seem to have panned out. The data still shows a lot of synchronous IPC triggered from JS as before, and the lion’s share of it are messages that are coming from add-ons like Tab Mix Plus”
    Let’s imagine I want to hire Firefox developer in order to write an addon like Tab Mix Plus from scratch (using planned toolbar API or WebExtensions Experiment). How much will it cost (without support when Nightly update breaks something)? Key TMP functionality: multirow tab bar, switch tabs by scrolling the mouse wheel, change opening order of tabs, keyboard shortcuts like Ctrl-Q to reopen the last closed tab.

    • ehsan says:

      About the first one, I don’t think the follow-up bugs have been filed yet. I think he’ll put a link to them in the original bug once he files them.

      I have no idea how to evaluate the cost of writing such an add-on to answer your second question.

  7. Gerd Neumann says:

    One big pain point I notice very often (for instance, right now) when entering text into search, input or textarea boxes is that between text being entered (key press) and becoming visible there are sometimes 1-3 secs, sometimes up to 5s.

    Same for moving caret/cursor (the “|”) with the arrow keys.

    Is this being tracked soemwhere as well? (I’ll leave the “soemwhere” typo just as is, because this is one of the human outcomes of these hangs 😉