Monday, February 9, 2009

eBay Scalability

Every company wants to cut costs and expand its business. For websites, that means reducing server and storage expenses while serving more page views and transactions. It's difficult, especially for a site like eBay, which already handles 2B page views/day and $60B in transactions/year.

From the slides, we can learn some of eBay's secrets (not very surprising ones, though). The five principles cover partitioning, asynchrony, automation, failure, and inconsistency, all of which are classic CS research topics. So I guess the real challenges are implementation and practical trade-offs.

I don't quite understand why a 3rd-party cloud provider would help cut costs. It looks like enterprise networks are already heavily over-provisioned, so cloud providers can sell storage and cycles very cheaply compared with self-managed DCs. But how can data security be guaranteed? The cloud carries high risk, and with it a shadow cost.

I'd like to learn more about how the cost of maintaining an owned DC compares with cloud pricing.

Monday, February 2, 2009

Designing a highly available directory service

Make copies: that is the key idea of the chapter. Admins have to deal with various situations to replicate data safely and recover it when needed. What stands out is that possible failures, like node failure and link failure, are analyzed extensively. It's complex and difficult to ensure things keep working when multiple failures happen, so we need more replicas to ensure availability. Read performance is also critical in performance evaluation, so the sample topologies are optimized for reads. The 3 roles (master, hub, and consumer) help clarify the situation and improve efficiency.
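To see why more replicas help, here is a back-of-the-envelope sketch. It is my own illustration, not something from the chapter: the function name `availability` and the 1% failure probability are made up, and it assumes replicas fail independently. A shared link failure breaks exactly that assumption.

```python
def availability(p_fail: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is reachable.

    Assumption (not from the chapter): every replica fails independently
    with the same probability `p_fail`.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only if every replica is down at the same time.
    return 1.0 - p_fail ** replicas


if __name__ == "__main__":
    # Hypothetical 1% chance that any single replica is unavailable.
    for n in (1, 2, 3):
        print(n, "replica(s):", availability(0.01, n))
```

Each extra replica multiplies the chance of total unavailability by `p_fail`, but only while failures stay independent. That is probably why the chapter spends so much time on link failures and multiple simultaneous failures.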

Failure Trends in a Large Disk Drive Population

Monitoring 100,000 disk drives isn't easy. First, you need to have that many disks. Then you have to collect the data, store it, and process it. The authors of this paper managed to build a lightweight monitoring system years before the paper, logging data into Google's infrastructure such as Bigtable and processing it with the famous MapReduce.

Based on the nature of hard drives, it's not surprising to see that heat and utilization have little measurable effect on the lifetime of hard disks. The failure rate is also closely tied to the specific hard drive model. And we have to be careful with the collected data, because weird things happen when hardware has problems.

This work provides real-world numbers on hardware failures. These numbers are much more important to end users than manufacturers' labels. It also encourages admins with similar hardware setups to collect and analyze their own hardware data. I hope more data like this becomes available, so researchers can test methods and algorithms on real data.

Crash: Data Center Horror Stories

It's surprising to learn that, given so much money and human effort invested, DCs may still fail completely. We have to admit it is impossible to foresee all potential failures, especially their details. In other words, we need to plan for unexpected failures, and the most important thing is not to avoid failures but to recover from them. Redundancy doesn't guarantee resilience unless the redundant resources can be effectively utilized.

Data Center Failure As A Learning Experience

Key finding: hardware failures usually happen at the beginning or the end of the hardware's life.
I don't understand why hardware fails shortly after being put in service. But because it is newly deployed, admins should be taking special care of it, so the impact of a failure should be limited. The wear-out failures when hardware gets old should be predictable. There must be some signs before the hardware fails completely. For example, a disk may start to have I/O errors before it crashes. So a careful monitoring and mining system can help admins discover such potential problems and respond before failure happens.
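A minimal sketch of what such monitoring could look like, assuming we already collect a cumulative I/O error count per disk at regular intervals. Everything here (the `flag_suspect_disks` name, the window size, the threshold) is hypothetical and only for illustration. A real system would mine far richer signals.

```python
from collections import defaultdict


def flag_suspect_disks(samples, window=3, threshold=5):
    """Return the sorted ids of disks whose error count is growing quickly.

    samples:   iterable of (disk_id, cumulative_error_count), in time order
    window:    how many of the most recent samples per disk to compare
    threshold: minimum growth in errors within the window to raise a flag
    (The default values are arbitrary assumptions, not tuned numbers.)
    """
    history = defaultdict(list)
    for disk_id, errors in samples:
        history[disk_id].append(errors)

    suspects = []
    for disk_id, counts in history.items():
        recent = counts[-window:]
        # Need at least two samples to measure any growth.
        if len(recent) >= 2 and recent[-1] - recent[0] >= threshold:
            suspects.append(disk_id)
    return sorted(suspects)


if __name__ == "__main__":
    # Made-up readings: "sda" accumulates errors fast, "sdb" barely moves.
    readings = [
        ("sda", 0), ("sdb", 0),
        ("sda", 2), ("sdb", 0),
        ("sda", 9), ("sdb", 1),
    ]
    print(flag_suspect_disks(readings))
```

Looking at growth within a recent window instead of the absolute count means an old disk with a few long-ago errors is not flagged forever, while a disk that suddenly starts failing is caught early.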

Since engineers today use software to improve availability, as Google did in its DCs, a failure can be tolerated as long as its scope isn't large. But keeping a failure local is really difficult.