Google says sorry for Gmail's two-hour outage: Blames 'miscalculation'

Work designed to improve availability takes down service

Tags: gmail, google

By Tom Krazit

Published: 2 September 2009 08:27 GMT

Google's Gmail suffered an outage of nearly two hours on Tuesday, the result of a miscalculation of its system's capacity, the company said late on Tuesday.

Gmail was down from about 12:30 PDT to about 14:30 PDT on Tuesday, affecting millions of Gmail customers. The problem was caused by a classic cascade in which servers became overwhelmed with traffic in rapid succession.

According to Google, the problem began when it took several Gmail servers offline for maintenance, a routine procedure that is normally transparent to users. This time, however, Google had recently changed the routers that direct Gmail traffic to servers, hoping to improve reliability, and those changes backfired.

"As we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers - servers which direct web queries to the appropriate Gmail server for response," Google said in a post to its Gmail blog late Tuesday.

Ben Treynor, vice president of engineering and site reliability czar, said in the blog: "At about 12:30pm Pacific [time] a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!' This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded."
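The cascade Treynor describes can be illustrated with a toy model. The sketch below is not Google's system: it assumes traffic is split evenly across healthy request routers, that a router drops out as soon as its share exceeds its capacity, and that the capacities and loads shown are made-up numbers. The function name `simulate_cascade` is hypothetical.

```python
def simulate_cascade(total_load, capacities):
    """Toy model of a cascading overload (illustrative assumptions only).

    Traffic is assumed to be split evenly across healthy routers. Any
    router whose share exceeds its capacity drops out, and its traffic
    is spread over the routers that remain.

    Returns the number of healthy routers after each round.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        # Routers that can still handle their even share of the traffic.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # Stable: nobody is overloaded.
        healthy = survivors
        history.append(len(healthy))
    return history


# Hypothetical fleet: seven routers rated at 100 units, three at 90.
fleet = [100] * 7 + [90] * 3

print(simulate_cascade(850, fleet))  # load fits: [10]
print(simulate_cascade(950, fleet))  # slightly too much: [10, 7, 0]
```

In this model a load only slightly above what the weakest routers can bear takes out the whole fleet in two rounds, which is why a small underestimate of load can have such a large effect.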

Google fixed the problem by allocating traffic across the rest of its prodigious network, a luxury that it enjoys given the resources it has put in place to operate the world's leading search engine.

Google said it would focus on making sure that the request routers have sufficient headroom to handle future spikes in demand, as well as finding a way to isolate problems in one part of the system without bringing down the entire service. "We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements - Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity," Treynor wrote.
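To put the 99.9% figure in context, the arithmetic below shows how a two-hour outage compares with that target. Google does not say over what period it measures availability, so the yearly and 30-day windows here are assumptions, and the outage is rounded up to a full two hours. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24   # 8,760 hours (assumed measurement window)
HOURS_PER_MONTH = 30 * 24   # 720 hours (assumed 30-day window)


def allowed_downtime_hours(target, period_hours):
    """Downtime budget, in hours, for an availability target over a period."""
    return (1 - target) * period_hours


def availability(downtime_hours, period_hours):
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours


# A 99.9% target leaves roughly 8.76 hours of downtime per year.
print(round(allowed_downtime_hours(0.999, HOURS_PER_YEAR), 2))

# A two-hour outage measured over a year versus over a 30-day month.
print(round(availability(2, HOURS_PER_YEAR) * 100, 3))   # about 99.977
print(round(availability(2, HOURS_PER_MONTH) * 100, 3))  # about 99.722
```

Under these assumptions, a single two-hour outage stays within a 99.9% target measured over a year but falls short of it when measured over one month.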

Several Google Apps customers who use Gmail for internal email at their businesses and organisations did not return calls Tuesday seeking information on the degree to which they were affected, making it difficult to know the magnitude of the failure. However, Google has put an awful lot of time and money this year behind promoting Gmail as a back-end email software alternative to products from Microsoft and IBM, and embarrassments like this will not help it sell the service to other organisations.

"We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service," Treynor wrote. "Thus, right up front, I'd like to apologise to all of you - today's outage was a Big Deal, and we're treating it as such."

Original article: Gmail outage blamed on capacity miscalculation from CNET News.com
