Quantum Flow Engineering Newsletter #3

Another week, another Quantum Flow engineering newsletter! We have a lot to cover, so let me get started.

Nika Layzell is getting really close on her work on bug 1346415 to collect native stacks from Background Hang Reports through telemetry on Nightly. There are several practical concerns around this data collection, such as not blowing up our telemetry ping size and the processing of this data on the server side, and we have some ideas on how we can improve this in the future. Since some data is better than no data, we're starting by having each client send a maximum of 300 of these native stacks in each ping, and will hopefully grow this limit in the future to be able to collect more data. She has also been helping with writing some scripts for post-processing this data so that we can have an automatically generated nightly report set from these pings to triage. The triage itself, of course, will be a manual, excruciating (read: “fun”!) process for now, until we think of something better.

We have finished an initial round of triage of the Quantum Flow bugs. We are using a few tags, which are all described here. The most important bug tag to pay attention to at this point is [qf:p1] in the status whiteboard field. This tag means we believe this bug may have a large impact on performance, and it needs to be fixed now. We try our best to make it obvious why we believe this to be the case, and of course not all [qf:p1] bugs are of the same level of importance. If you believe there is strong evidence that a [qf:p1] bug isn't of utmost importance for performance, please feel free to raise the issue on the bug; it's best to correct any possible triage mistakes as soon as we can. Otherwise, we really appreciate your assistance in addressing these bugs. Note that we are dealing with a massive project (making the entire web browser faster for all users in all usage scenarios) under a very strict timeline (by Firefox 57!) and the longer we let these bugs live in Firefox, the longer they can mask smaller and less severe performance issues, putting the entire effort at risk.

Next week we are going to have a work week around Quantum Flow in the Toronto office. There are many people attending from different parts of Mozilla and it's going to be a really exciting and super packed week. Several things excite me personally. I expect to spend some more time profiling and delving into technical issues. I also expect to spend some time talking to people on various teams about how we can get more help from even more engineers on fixing the bugs that we are finding. One of my goals is to make the bottleneck of our pipeline be the discovery of new issues to fix, and I hope to get closer to achieving that after next week. Another exciting thing happening next week is that we have some members from the Quantum DOM team also attending the work week (including myself, as I'm still involved in that project as well.) We're hopefully going to have a more concrete plan around cooperative scheduling of JavaScript running on web pages, which is a really important part of the overall picture of improving the performance of the browser. I don't expect to be able to send out one of these newsletters next week though, so expect the next one in two weeks!

Now I want to talk a bit about our synchronous IPCs. I've talked about them before, but they deserve more air time, as based on the data we have so far, they are one of our biggest performance issues at this point. I have been thinking about good ways of making the extent of the problem more obvious. We already have a tracker bug, and some people have been helping with a few of these bugs (see below), but I still think our progress on this issue could be better. So let's open up this closet and take a look at our skeletons, shall we?

I have prepared a Sync IPC Report for 2017-03-23. It's a spreadsheet, with a chart! So cool. The first thing you'll notice is that I'm not great at data visualization. :-) With that out of the way, let's look at the data. We could sort this data in various ways, but I have chosen to stick to something super simple: sorting in descending order of the median time of each sync IPC multiplied by the number of times it happens in the wild. You can inspect the data yourself, but here is a human readable summary of where we are now:

PCookieService::Msg_GetCookieString (aka, what happens when a page calls document.cookie!) at 34%. This is the most horrible sync IPC that we have (and it's one of the most popular APIs on the web.) Amy Chung is actively working on fixing this, and Josh Matthews is helping her with providing feedback on her patch. Thanks to you both!
PContent::Msg_RpcMessage and PBrowser::Msg_RpcMessage at 26.9%. Together, these two form a big bucket consisting of all of the sync IPCs triggered from JS. In order to stop flying blind here, bug 1348113 was filed to collect specific telemetry on this bucket. I recently found out that a page calling navigator.userAgent to do UA sniffing (which is also super common) can result in a sync IPC that happens through JS, and this stayed hidden from us for a long time in this telemetry data…
A number of PScreenManager sync IPC messages at 12.8%. Kan-Ru Chen has done some amazing work to fix all of them, and the patch set should land any day now.
Then there is a bit of a longer tail, and I have looked at some of them in some detail:
- CPOW overhead: basically PJavaScript and anything under it. Some of this could be caused by add-ons that aren't e10s compatible yet. I need to investigate more to get a better sense of how true this statement is!
- Graphics initialization sync IPCs: PContent::Msg_GetGfxVars and PContent::Msg_GetGraphicsDeviceInitData. These should be easy to fix but we've had a bit of a difficult time getting help in fixing them. Gerald Squelart has recently stepped up to the task, thanks Gerald! These are important for navigation performance, as I mentioned in my previous newsletter.
- PContent::Msg_CreateWindow. This one also has a pretty bad impact on navigation, even when we don't need to start a new content process! I have a patch that fixes this enough to make things work for basic browsing, but it's far from passing tests still…

If you see an IPC message on this list that looks familiar to you and doesn't have a bug that tracks fixing it already, please feel free to file one. If you are familiar with an area of the code where one of these messages is being used, please consider fixing one or two. :-)
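
For anyone who wants to redo the ranking on their own copy of the data, here is a minimal sketch of the sort described above: cost is the median time multiplied by the count, sorted in descending order. The record type, the message names and the numbers are all placeholders invented for illustration, not values from the report, and treating the percentages as each message's share of the total cost is my assumption about how they were derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncIpcStat:
    """One row of a (hypothetical) sync IPC report."""
    message: str       # IPC message name
    median_ms: float   # median duration of one call, in milliseconds
    count: int         # number of times the message was observed

    @property
    def cost(self) -> float:
        # The ranking key: median time multiplied by number of occurrences.
        return self.median_ms * self.count


def rank_by_cost(stats):
    """Return (stat, share_of_total_cost) pairs, most expensive first."""
    total = sum(s.cost for s in stats)
    ranked = sorted(stats, key=lambda s: s.cost, reverse=True)
    return [(s, s.cost / total if total else 0.0) for s in ranked]


# Placeholder rows, invented for illustration only.
sample = [
    SyncIpcStat("PExample::Msg_A", median_ms=2.0, count=100),
    SyncIpcStat("PExample::Msg_B", median_ms=0.5, count=1000),
    SyncIpcStat("PExample::Msg_C", median_ms=10.0, count=30),
]

for stat, share in rank_by_cost(sample):
    print(f"{stat.message}: {share:.1%}")
```

Note how a cheap but very frequent message (Msg_B here) can outrank a slow but rare one, which is why the count matters as much as the median.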

Now, it's time for our performance story of the week! This time we're going to look at how not to do off-main-thread I/O. Usually when people talk about avoiding main-thread I/O, the goal is to make sure that the main thread doesn't end up calling a function that could block until the (potentially spinning) disk finishes an I/O operation. Typically this is done in one of two ways: either use a non-blocking I/O API that the underlying OS provides (to get the OS to call you back when the I/O is finished), or make a background thread call the blocking function and notify your main thread yourself when it is done. In our implementation of XMLHttpRequest in Gecko, in order to support the blob response type, we need to open a temporary file to write the incoming data to. Opening this file is an I/O operation, and we use the second strategy in order to avoid main-thread I/O.

Now, it turns out that we had this code which was expecting NS_OpenAnonymousTemporaryFile() to fail in the sandboxed content process, where, the author expected, opening the temporary file handle would fail. But that wasn't what that function was doing at all! That function was doing everything in its power to do what the caller asked it to, that is, to open an anonymous temporary file. When called on a background thread in the content process, the function would dispatch a synchronous runnable to the main thread, blocking the calling thread (in this case, the Gecko IO thread), and then send a synchronous IPC message to the parent process. At this point, two threads would be blocked in the content process. As if that weren't enough, the handler for the sync IPC in the parent process would then call the same function on the parent process main thread, leading to main-thread I/O on our UI thread! Of course, all of this was the unintended interaction of different parts of the code when combined together, and I'm glad to report that this is all now fixed on Nightly. :-)
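
To make the second strategy concrete, here is a minimal sketch of it in Python rather than Gecko's C++. The function names and the queue-based notification are my own illustration, not how Gecko does it: a background thread performs the blocking open itself and posts the result to a queue, and it never calls back synchronously into the main thread, which is exactly the step that went wrong in the story above.

```python
import queue
import tempfile
import threading


def open_temp_file_off_main_thread(done_queue):
    """Start a worker that opens an anonymous temporary file.

    The worker does the blocking open itself and posts the outcome to
    done_queue. It never waits on the main thread.
    """
    def worker():
        try:
            handle = tempfile.TemporaryFile()  # blocking I/O happens here
            done_queue.put(("ok", handle))
        except OSError as error:
            done_queue.put(("error", error))

    thread = threading.Thread(target=worker, name="io-worker")
    thread.start()
    return thread


def main_thread_loop():
    """Stand-in for an event loop that polls for the worker's result."""
    done_queue = queue.Queue()
    worker_thread = open_temp_file_off_main_thread(done_queue)

    while True:
        try:
            # A short timeout stands in for "process other events, then
            # check again"; a real event loop would be notified instead.
            status, payload = done_queue.get(timeout=0.01)
            break
        except queue.Empty:
            continue  # the main thread is free to do other work here

    worker_thread.join()
    if status == "error":
        raise payload
    with payload as handle:
        handle.write(b"incoming blob data")
        handle.seek(0)
        return handle.read()
```

Calling main_thread_loop() returns the bytes that were written once the worker has handed over the file. The important property is the direction of the dependency: the main thread may wait for news from the worker, but the worker never blocks waiting for the main thread.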

Last but not least, time for the credits section again. I would like to thank the following individuals for their help in making Firefox faster this past week. As always, apologies to those who I'm forgetting to name here.

Kris Maglione did some heroic work to avoid reparsing our content scripts every time we run them. This was a pretty severe performance issue that impacts a lot of add-ons that rely on content scripts, but fixing it wasn't very easy, and honestly when the bug was filed I wasn't very hopeful to see it fixed any time soon given the amount of work that was involved.
Sam Foster has been attacking a synchronous reflow that can happen when we (de)activate a browser window. The work is ongoing, but these types of front-end bugs, even though they may not be much fun to work on, are very important to fix and can remove a lot of jank that we won't be able to get rid of in any other way. Thank you Sam!
Mike Conley landed some instrumentation for tab closing. In case you're wondering, this means we're taking tab closing performance very seriously.
Mike Conley also made us create the about:blank placeholder document for lazily restored tabs after a session restore in the content process. If that sounds boring, how about this: he improved session restore times for users with hundreds of tabs by a lot. Users are reporting improvements on the scale of minutes (you read that right.)
Mike de Boer has been helping with triaging some session restore performance bugs.
Kearwood (kip) Gilbert has been continuing his work on removing the synchronous IPCs used in the WebVR implementation.
Nika Layzell removed a synchronous IPC which was used to initialize the permission manager's database. As an additional privacy win, the content process now only knows about the permissions belonging to the websites that you have visited, not all of the permissions stored in your profile!
Nika Layzell also added telemetry for IPC message serialization/deserialization that happens on the main thread. There's some evidence that this can be expensive, and this probe will help us find the IPC messages where this can be problematic in the wild.
Chris Pearce made media cache initialization use asynchronous IPC.
Jeff Muizelaar removed an async pan/zoom logging message which was slowing us down to log information that nobody was looking at!
Olli Pettay brought the performance of accessing MouseEvent.offsetX/Y on simulated click events on par with other engines.
Edgar Chen and Boris Zbarsky worked on a few optimizations for improving our innerHTML setter performance.
Henry Chang fixed a severe UI jank that could occur when using tracking protection (for example in private browsing windows).