Skip to main content

Why Do Internet Services Fail, and What Can Be Done About It?

The number and popularity of large-scale Internet services such as Google, MSN, and Yahoo! have grown significantly in recent years. Such services are poised to increase further in importance as they become the repository for data in ubiquitous computing systems and the platform upon which new global-scale services and applications are built. These services' large scale and need for 24x7 operation have led their designers to incorporate a number of techniques for achieving high availability. Nonetheless, failures still occur.  Although the architects and operators of these services might see such problems as failures on their part, these system failures provide important lessons for the systems community about why large-scale systems fail, and what techniques could prevent failures. In an attempt to answer the question "Why do Internet services fail, and what can be done about it?" we have studied over a hundred post-mortem reports of user-visible failures from three large-scale Internet services. In this paper we
  • identify which service components are most failure-prone and have the highest Time to Repair (TTR), so that service operators and researchers can know what areas most need improvement;
  • discuss in detail several instructive failure case studies;
  • examine the applicability of a number of failure mitigation techniques to the actual failures we studied; and
  • highlight the need for improved operator tools and systems, collection of industry-wide failure data, and creation of service-level benchmarks.
Read the paper here.

Ron - the problem management guy


Popular posts from this blog

LDWin: Link Discovery for Windows

LDWin supports the following methods of link discovery: CDP - Cisco Discovery Protocol LLDP - Link Layer Discovery Protocol Download LDWin from here.

Battery Room Explosion

A hydrogen explosion occurred in an Uninterruptible Power Source (UPS) battery room. The explosion blew a 400 ft2 hole in the roof, collapsed numerous walls and ceilings throughout the building, and significantly damaged a large portion of the 50,000 ft2 building. Fortunately, the computer/data center was vacant at the time and there were no injuries. Read more about the explosion over at hydrogen tools here .

STG (SNMP Traffic Grapher)

This freeware utility allows monitoring of supporting SNMPv1 and SNMPv2c devices including Cisco. Intended as fast aid for network administrators who need prompt access to current information about state of network equipment. Access STG here (original site) or alternatively here .