Monday, February 2, 2009

Failure Trends in a Large Disk Drive Population

Monitoring 100,000 disk drives isn't easy. First, you need to have that many disks. Then you have to collect the data, store it, and process it. The authors of this paper managed to build a lightweight monitoring system years before the paper was written, logging the data into Google's infrastructure, such as BigTable, and processing it with the famous MapReduce.

Given the nature of hard drives, it's not surprising to see that heat and utilization show little correlation with drive failures. The failure rate is also closely tied to the specific hard drive model. We also have to be careful with the collected data, because weird things happen when hardware has problems.
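To make the per-model comparison concrete, here is a minimal sketch of how an annualized failure rate (failures per drive-year of service) could be computed from collected logs. The record layout, the function name and the model names are all assumptions made up for illustration; this is not the paper's actual pipeline or data.

```python
from collections import defaultdict


def annualized_failure_rate(records):
    """Return failures per drive-year for each drive model.

    records: iterable of (model, days_in_service, failed) tuples,
    one tuple per drive. This layout is a hypothetical one.
    """
    drive_days = defaultdict(float)
    failures = defaultdict(int)
    for model, days, failed in records:
        drive_days[model] += days
        if failed:
            failures[model] += 1
    # Skip models with no recorded service time to avoid dividing by zero.
    return {
        model: failures[model] / (days / 365.0)
        for model, days in drive_days.items()
        if days > 0
    }


# Made-up sample: two models, four drives in total.
sample = [
    ("model_a", 365, False),
    ("model_a", 365, True),
    ("model_b", 730, False),
    ("model_b", 365, False),
]
afr = annualized_failure_rate(sample)
for model, rate in sorted(afr.items()):
    print(f"{model}: {rate:.2f} failures per drive-year")
```

Grouping by model before dividing matters here: a fleet-wide average would hide exactly the model-to-model differences mentioned above.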

This work provides real-world numbers on hardware failures. These numbers matter much more to end users than the manufacturers' labels. It also encourages admins with similar hardware setups to collect and analyze their own hardware data. I hope more data like this becomes available, so researchers can test methods and algorithms on real data.
