Quantum Flow Engineering Newsletter #2

This past week was another busy one chasing down performance issues in Firefox. We managed to knock out a few issues, got closer to closing out a couple of really high-impact ones, and are making good progress on getting performance data from telemetry, which will hopefully allow us to prioritize our efforts systematically and focus first on the issues that hurt our users the most in the wild.

Another nice aspect that we are starting to get some traction on is scaling up the engineering side of the project. Jean Gong has started to help out with the project management side of things, and we have started to triage our list of bugs, with the goal of identifying the highest priority ones and ensuring that they all have assignees, are being worked on, and won’t fall through the cracks. We would appreciate your help if someone approaches you asking for a fix, a code review, or an answer to a question about one of these bugs!

There is a work week for Quantum Flow on the week of March 27 here in Toronto. We’re preparing to meet face to face for the second time for this project. One of the things that I’m trying to have ready in time for this work week is telemetry data about where Firefox is performing really badly in the wild so that we can focus there first. Right now we have Background Hang Reports data, which can include a backtrace of a hung thread in two modes: if the thread is hung for more than 128ms, a backtrace using Gecko Profiler pseudo stacks is captured, and if a thread is hung for more than 8 seconds, a backtrace using the full native stack is captured. The pseudo stack backtrace doesn’t include a lot of information: it only consists of the manual annotations that we have added to the source code using PROFILER_LABEL. I have already skimmed over the pseudo stack data, and it’s really hard to gather much meaningful information from it. The native stack traces would be much more useful, but while 8 seconds of a thread being hung is really bad, that’s more of a hang scenario than a badly performing browser, so we’re trying to reduce this threshold in bug 1346415 to gather better data here. I hope to have some more information to share about this next week.
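The two capture modes described above can be sketched as a simple threshold check. This is a minimal illustration in Python, not the Background Hang Reports implementation: the function, enum, and constant names are hypothetical, and treating the two modes as mutually exclusive is an assumption made for brevity. Only the 128ms and 8 second thresholds come from the text.

```python
from enum import Enum

# Thresholds quoted in the text; the constant names are hypothetical.
PSEUDO_STACK_THRESHOLD_MS = 128
NATIVE_STACK_THRESHOLD_MS = 8_000


class Capture(Enum):
    NONE = "none"
    PSEUDO_STACK = "pseudo stack"
    NATIVE_STACK = "native stack"


def capture_for_hang(hang_ms: float) -> Capture:
    """Pick the kind of backtrace a hang of `hang_ms` would produce.

    Assumption: a hang past the native threshold is reported only as a
    native stack. The text does not say whether both are recorded.
    """
    if hang_ms > NATIVE_STACK_THRESHOLD_MS:
        return Capture.NATIVE_STACK
    if hang_ms > PSEUDO_STACK_THRESHOLD_MS:
        return Capture.PSEUDO_STACK
    return Capture.NONE
```

Under this sketch a 2 second hang still lands in the pseudo stack bucket, which is why lowering the native stack threshold should capture more of the "badly performing" cases rather than only the outright hangs.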

Now, time for the performance story of this week: page navigations! As web browser makers, we talk about page load times a lot, and we have all heard what usually gets discussed in this context many times. I’m going to talk about what usually doesn’t get talked about though: what can happen in real life when you navigate from page A to B. Firstly, with multiple content processes, we may need to start a new content process for the navigation. Right now when a content process starts up, it sends a number of synchronous IPC messages to the parent process in order to initialize various components (although we have removed all but the last few). This is especially bad since at this time the parent process is typically busy doing other work. For example, since the kind of navigation that results in a process switch typically happens in a new tab/window, the parent process is typically busy opening that new tab/window, and because of that, in really bad cases I have seen these synchronous messages add up to over a second during which the content process is simply paused, doing no work whatsoever. This can slow down navigations significantly. There is also a synchronous IPC that is on the path of all navigations (bug 1337064), so we run this risk on every navigation. We also do some synchronous IPCs in some situations if the navigation results in an error page, which is of less concern since those are less common (well, one would hope at least).
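To make the cost of this concrete, here is a toy model of a content process blocking on synchronous messages while the parent is busy. It is only a sketch under stated assumptions: the names are hypothetical, the durations are made-up illustrative numbers rather than measurements, and the parent is modelled as a single FIFO queue that receives no new work once the first message arrives.

```python
from dataclasses import dataclass


@dataclass
class ParentTask:
    name: str
    duration_ms: float


def sync_reply_wait_ms(parent_queue, handle_ms):
    """Time the content process stays blocked on one synchronous message.

    The message is handled only after every task already queued in the
    parent has run, so the wait is the queued work plus the handling time.
    """
    return sum(task.duration_ms for task in parent_queue) + handle_ms


def startup_blocked_ms(parent_queue, handle_ms, message_count):
    """Total blocked time for `message_count` back-to-back sync messages.

    Optimistic assumption: the first message drains the parent's existing
    queue and nothing new is queued afterwards.
    """
    if message_count == 0:
        return 0.0
    first = sync_reply_wait_ms(parent_queue, handle_ms)
    rest = (message_count - 1) * handle_ms
    return first + rest


# Illustrative numbers only, not measurements from Firefox.
idle_parent = []
busy_parent = [ParentTask("open new tab", 400.0), ParentTask("paint UI", 150.0)]

print(startup_blocked_ms(idle_parent, 2.0, 5))  # 10.0
print(startup_blocked_ms(busy_parent, 2.0, 5))  # 560.0
```

The handling cost of the messages is the same in both cases. What changes is how long the content process sits idle waiting for a busy parent, which is exactly the unpredictable part.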

Fixing each one of these doesn’t mean that navigations suddenly become faster, of course. The logic works more against us than in our favour: not fixing them means that we will always run the risk of page navigations being slow in Firefox due to unpredictable factors. What’s really worrying is that in general it’s really hard to know which performance cliffs like these are going to be on the path of any critical user interaction, and these issues have a way of creeping in over time. This is why a while ago we decided to disallow the addition of new synchronous IPC messages by default (bug 1336919), to keep more issues of this nature from being added to the code base. We may still decide to add a few more of these messages here and there, but only after really careful consideration and measurement. Like most other things in engineering, this requires careful thought and balancing, but it’s good to have default practices that don’t result in potentially disastrous performance cliffs. Next week, I’m going to give you another example of one of these cliffs, showing how, through an unintended chain of consequences, code that was trying to avoid doing main-thread I/O ended up blocking not one, but three threads, to do the said I/O!
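A "disallowed by default" policy like this can be pictured as an allowlist check: every synchronous message must be explicitly approved, and anything else is flagged. The sketch below is a hypothetical illustration of that idea in Python, not the mechanism implemented in bug 1336919, and the message names are invented.

```python
def unapproved_sync_messages(declared, allowlist):
    """Return the sync messages that are not explicitly approved.

    `declared` maps a message name to True if it is synchronous.
    `allowlist` is the set of sync messages approved after review.
    The result is sorted so that the output is stable.
    """
    return sorted(
        name
        for name, is_sync in declared.items()
        if is_sync and name not in allowlist
    )


# Invented message names, for illustration only.
declared = {
    "PExample::GetSetting": True,
    "PExample::NotifyReady": False,
    "PExample::ReadValue": True,
}
allowlist = {"PExample::GetSetting"}

violations = unapproved_sync_messages(declared, allowlist)
print(violations)  # ['PExample::ReadValue']
```

The point of the default is that adding a new synchronous message becomes a visible, deliberate step (editing the allowlist) instead of something that can slip in unnoticed.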

Now, on to the credits section. I’d like to take a moment to recognize the work of the following individuals who have helped with various aspects of the Quantum Flow project. Thank you very much for your help this past week! (Apologies to those who I’m probably forgetting to name here.)

Kan-Ru Chen’s patches for bug 1194751 (moving PScreenManager off of sync IPC) are still under review.
Amy Chung submitted a first iteration patch for bug 1331680 (moving document.cookie off of sync IPC) for feedback.
Kearwood (kip) Gilbert has been helping with removing various sync IPC messages used in the WebVR implementation (bug 1344216 dependencies).
Boris Zbarsky made various input/textarea selection management APIs faster in many cases (bug 1343275, bug 1332036, bug 1342197).
Greg Tatum and Markus Stange’s improvements to the https://perf-html.io/ profiler UI have significantly improved the responsiveness of the interface, making it much easier to look at profiles.
Nicholas Nethercote has been helping by fixing various threading, race and deadlock issues in the Gecko Profiler backend.
Nika Layzell has been helping with various telemetry data collection work.
Mike Conley has been teaching me how to use our telemetry analysis infrastructure.