Monday, February 2, 2009

Data Center Failure As A Learning Experience

Key finding: hardware failures usually happen at the beginning or the end of a device's life.
I don't understand why some devices fail shortly after being put into service. But because they are newly deployed, admins should take special care of them so that the impact of any failure is limited. Wear-out failures as hardware ages should be more predictable: there are usually warning signs before hardware fails completely. For example, a disk may start to show I/O errors before it crashes. A careful monitoring and mining system can therefore help admins discover such potential problems and respond before the failure actually happens.
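The idea of watching for warning signs can be sketched as a simple monitor. This is a minimal, hypothetical example (the data source, threshold, and window size are all assumptions, not from the post): it flags any disk whose recent I/O error counts are either high in total or strictly rising, so an admin can act before the disk fails outright.

```python
# Hypothetical failure-warning monitor: flag disks whose recent I/O error
# counts are high or trending upward. Threshold and window are assumed values.

ERROR_THRESHOLD = 5  # total errors over the last three polling windows


def disks_at_risk(error_counts):
    """error_counts maps disk id -> list of I/O error counts per window."""
    at_risk = []
    for disk, counts in error_counts.items():
        recent = counts[-3:]  # look only at the last three windows
        rising = len(recent) == 3 and recent[0] < recent[1] < recent[2]
        if sum(recent) >= ERROR_THRESHOLD or rising:
            # either many errors overall, or a strictly rising trend
            at_risk.append(disk)
    return sorted(at_risk)


if __name__ == "__main__":
    sample = {
        "sda": [0, 0, 0, 0],  # healthy
        "sdb": [0, 1, 2, 4],  # rising trend -> flagged
        "sdc": [0, 0, 6, 1],  # burst of errors -> flagged
    }
    print(disks_at_risk(sample))  # -> ['sdb', 'sdc']
```

In a real deployment the counts would come from something like SMART attributes or kernel logs rather than a hard-coded dictionary.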

Since engineers today use software to improve availability, as Google does in its data centers, a failure can be tolerated as long as its scope stays small. But keeping a failure local is genuinely difficult.
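One common software technique for keeping a failure local is to spread each piece of data's replicas across distinct failure domains (racks, for instance), so that losing any one domain never loses every copy. The sketch below is an illustrative assumption, not a description of Google's actual placement policy; the function and rack names are hypothetical.

```python
# Hypothetical rack-aware replica placement: assign each block's replicas
# to distinct racks via a shared round-robin cycle, so no single rack
# failure can take out all copies of a block.
import itertools


def place_replicas(blocks, racks, replicas=3):
    """Assign `replicas` distinct racks to each block, round-robin."""
    assert len(racks) >= replicas, "need at least as many racks as replicas"
    placement = {}
    rack_cycle = itertools.cycle(racks)
    for block in blocks:
        # consecutive draws from the cycle are distinct while replicas <= len(racks)
        placement[block] = [next(rack_cycle) for _ in range(replicas)]
    return placement


if __name__ == "__main__":
    layout = place_replicas(["b1", "b2"], ["r1", "r2", "r3", "r4"])
    print(layout)  # -> {'b1': ['r1', 'r2', 'r3'], 'b2': ['r4', 'r1', 'r2']}
```

With this layout, any single rack going down leaves at least two live replicas of every block, which is exactly the "failure stays local" property the paragraph describes.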
