But the emoji search add-in feature will be developed by a team in Mountain View...

korfuri · on Oct 16, 2022

Possibly, but probably not. Google tends to have follow-the-sun rotations staffed by SRE as the first line of defense, with an escalation path to devs if necessary.

SREs may not have intricate knowledge of the code but they have an array of tools that can mitigate problems, and they're software engineers too, so they can debug this themselves if need be. Their focus however won't be on fixing the bug, it will be on stopping the bleeding. Typically they'll check if this is an issue introduced in a recent rollout and roll that back. In the case of Search they also have an array of tactical tools - someone mentioned that a specific website was causing this, they probably have a way to quickly and temporarily delist this specific result.

The focus will be on recovering ASAP, figuring out the details and the long-term fix later. That later can be during business hours on Monday.

klodolph · on Oct 16, 2022

Any major product at Google has an on-call team with offices in multiple time zones. Typically, two engineers from different time zones will be on-call at any given time—they switch back and forth between primary and secondary, so nobody is primary on-call during the middle of the night, local time.

For example, you might have an SRE team in Mountain View, and an SRE team in Dublin. Maybe the engineer in Mountain View is primary on-call until midnight, and then it switches to the engineer in Dublin, who starts their shift at 8:00 AM.
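To make the handoff concrete, here is a minimal sketch of that two-site rotation. It is purely illustrative: the 12-hour split, the fixed summer UTC offsets, and the name `primary_site` are assumptions for the example, not a description of any real Google tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed summer offsets, ignoring daylight-saving transitions.
MOUNTAIN_VIEW = timezone(timedelta(hours=-7), "PDT")
DUBLIN = timezone(timedelta(hours=1), "IST")


def primary_site(now: datetime) -> str:
    """Return which site holds primary on-call at the given aware datetime.

    Assumed split: Dublin is primary from 08:00 to 20:00 Dublin time,
    Mountain View covers the other 12 hours (noon to midnight Pacific).
    """
    dublin_local = now.astimezone(DUBLIN)
    return "Dublin" if 8 <= dublin_local.hour < 20 else "Mountain View"


# 23:00 in Mountain View is 07:00 in Dublin, so Mountain View is still primary.
late_evening = datetime(2022, 10, 16, 23, 0, tzinfo=MOUNTAIN_VIEW)
# Midnight in Mountain View is 08:00 in Dublin, so primary hands over.
midnight = datetime(2022, 10, 17, 0, 0, tzinfo=MOUNTAIN_VIEW)
```

With these assumed offsets, nobody is primary between midnight and 08:00 in their own time zone, which is the point of the arrangement.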

If it’s getting code changes, it’s not getting code changes right now. The software engineers may be in Mountain View, but they won’t patch this and push out a new version during the night. Someone (read: an SRE) may change a configuration or push out a temporary fix now, and any code changes will only go out after significant testing. Generally speaking, you don’t hotfix high-profile production services. You roll back, you set up filters, you disable features, you turn off component services, you run in a degraded capacity. But if you want to actually change the code, you take your time and do it right. Rushing out a code change in the middle of the night is liable to make things worse, and nobody wants to do it.

jawilson · on Oct 16, 2022

For the right definition of "major". Search, Ads, Gmail, Docs, absolutely. If you only have 50 million DAU then you may not even have an SRE, let alone one in multiple time zones. Most teams have an on-call rotation and you get a phone call an hour after you fall asleep unless you recently did a push or go to bed early. (Or you just forget all the pages that happened during normal hours and only remember the annoying ones.)

jawilson · on Oct 16, 2022

It really depends on what is happening behind the scenes. Is it killing an entire server process or just an error in a single thread? Even if it kills an entire server process, Google has lots of these running and the ratio may not be high enough to page.
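As a rough illustration of "the ratio may not be high enough to page", a ratio-based alert could look something like the sketch below. The function name `should_page` and the 1% default threshold are made up for the example; real alerting rules are more involved and I'm not claiming this is how Google's work.

```python
def should_page(failed: int, total: int, threshold: float = 0.01) -> bool:
    """Page only when the failure ratio reaches the threshold.

    `threshold` is a hypothetical value; with no traffic there is no
    ratio to evaluate, so nothing pages.
    """
    if total == 0:
        return False
    return failed / total >= threshold


# A handful of crashed requests out of many stays below the threshold.
quiet = should_page(failed=3, total=10_000)
# A broad failure crosses it and would wake someone up.
loud = should_page(failed=500, total=10_000)
```

So one query that reliably kills a request can sit under the threshold for a long time if almost nobody searches for it.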

olliej · on Oct 16, 2022

That depends. I don't think that there needs to be anything emoji-specific (at the character level). It could be a special handler for inline results, marketing filters, etc. that's gated on some variation of that string.

Might be worth seeing if there's another "how many X on iOS" that faults as well, because I would bet that it reports an ISE on timeouts, and you could easily imagine some service making a follow-on request that now times out.

kevincox · on Oct 16, 2022

It's likely that there is some tool for blocking or rewriting certain queries or results that trigger errors, so that the code changes can be made with less rush, more testing, and during business hours.

Jensson · on Oct 16, 2022

Why do you think so? Lots of important systems are developed in European offices at least, and I wouldn't be surprised if some are in Asia as well. It isn't like Google search is only in MTV.

tex0 · on Oct 16, 2022

How do you know?