Facebook heat maps pinpoint data center trouble spots
A Facebook engineer developed heat-map technology to quickly identify server, rack or cluster failures
IDG News Service - Faced with the challenge of overseeing the health of large caching systems, a Facebook engineer developed heat-map software to quickly pinpoint problems in the social network's data centers.
The visualization monitoring tool, called Claspin, uses the heat map format to portray the working status of Facebook's servers.
"As Facebook grew both in size and complexity, it became more and more difficult to figure out which piece was broken when something went wrong," wrote Sean Lynch, an engineer with Facebook's cache performance team, in a blog entry explaining how he developed the technology.
The idea of using heat maps in overseeing data center operations is an emerging one. At least one Oracle engineer has investigated ways of using heat maps to quickly convey potential problems in the data center.
Whenever the popular social networking service experiences technical difficulties, the cache performance group must make sure that the caching mechanisms are not the problem, or part of the problem. A heat map can be an efficient way of representing the operational status of a large number of components. Each component is represented as a cell in a large matrix, and the color of the cell represents the health of the component. A green cell may represent a node that is operating within acceptable bounds, while a red cell may represent one that is not operating correctly.
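The cell-coloring idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not Claspin's code: the latency thresholds, the intermediate "yellow" state and the function names are all assumptions made for the example.

```python
from typing import List

# Hypothetical thresholds, chosen only for illustration; they are not
# values reported by Facebook.
WARN_LATENCY_MS = 5.0
CRIT_LATENCY_MS = 20.0


def cell_color(latency_ms: float) -> str:
    """Map one component's latency to a heat-map cell color."""
    if latency_ms >= CRIT_LATENCY_MS:
        return "red"      # outside acceptable bounds
    if latency_ms >= WARN_LATENCY_MS:
        return "yellow"   # assumed intermediate state, not from the article
    return "green"        # within acceptable bounds


def build_matrix(latencies: List[float], width: int) -> List[List[str]]:
    """Lay out one colored cell per component in rows of `width` cells."""
    colors = [cell_color(value) for value in latencies]
    return [colors[i:i + width] for i in range(0, len(colors), width)]


if __name__ == "__main__":
    for row in build_matrix([1.2, 3.4, 25.0, 6.1, 0.9], width=3):
        print(" ".join(row))
```

Because each value is compared against fixed bounds instead of being scaled relative to the other values, a cell's color says directly whether that component is healthy.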
Facebook's caching systems produce copious performance metrics, including latency, request rate and error rate statistics. According to Lynch, the caching team was already using a generic heat map to monitor performance, but it had three shortcomings. The software could not easily fit the visual data onto a single screen. The colors it used to represent different values offered little intuitive indication of whether a server was performing adequately. And it didn't interpret the source data in a way that could immediately indicate whether an individual host was running within acceptable bounds.
Lynch designed Claspin, named after a protein that monitors for DNA damage in cells, to give each cluster of servers its own heat map, ordered by rack number within a data center. Problems at the rack level or at the cluster level then become readily apparent simply by viewing the heat map.
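To see why ordering by rack makes rack-level failures stand out, consider the following minimal Python sketch. It is an assumption-laden illustration, not Lynch's implementation: the `Host` record, its fields and the text rendering are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Host(NamedTuple):
    # Hypothetical record; the field names are assumptions for this sketch.
    name: str
    cluster: str
    rack: int
    healthy: bool


def maps_by_cluster(hosts: List[Host]) -> Dict[str, List[Host]]:
    """Group hosts into one list per cluster, each ordered by rack number."""
    grouped: Dict[str, List[Host]] = defaultdict(list)
    for host in hosts:
        grouped[host.cluster].append(host)
    return {
        cluster: sorted(members, key=lambda h: (h.rack, h.name))
        for cluster, members in grouped.items()
    }


def render(hosts: List[Host]) -> str:
    """Draw one character per host: '.' for healthy, 'X' for unhealthy."""
    return "".join("." if h.healthy else "X" for h in hosts)


if __name__ == "__main__":
    fleet = [
        Host("h3", "a", 2, False),
        Host("h1", "a", 1, True),
        Host("h4", "a", 2, False),
        Host("h2", "a", 1, True),
        Host("h5", "b", 1, True),
    ]
    for cluster, members in sorted(maps_by_cluster(fleet).items()):
        print(cluster, render(members))
```

Because hosts in the same rack sit next to each other after sorting, a failed rack shows up as a contiguous run of unhealthy cells instead of scattered marks.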