Facebook heat maps pinpoint data center trouble spots
A Facebook engineer developed heat-map technology to quickly identify server, rack or cluster failures
IDG News Service - Faced with the challenge of overseeing the health of large caching systems, a Facebook engineer developed heat-map software to quickly pinpoint problems in the social network's data centers.
The visualization monitoring tool, called Claspin, uses the heat map format to portray the working status of Facebook's servers.
"As Facebook grew both in size and complexity, it became more and more difficult to figure out which piece was broken when something went wrong," wrote Sean Lynch, an engineer with Facebook's cache performance team, in a blog entry explaining how he developed the technology.
The idea of using heat maps in overseeing data center operations is an emerging one. At least one Oracle engineer has investigated ways of using heat maps to quickly convey potential problems in the data center.
Whenever the popular social networking service experiences technical difficulties, the cache performance group must make sure that the caching mechanisms are not the problem, or part of the problem. A heat map can be an efficient way of representing the operational status of a large number of components. Each component is represented as a cell in a large matrix, and the color of the cell represents the health of the component. A green cell may represent a node that is operating within acceptable bounds, while a red cell may represent one that is not operating correctly.
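The cell-coloring idea can be illustrated with a minimal Python sketch. It is not Claspin's code: the stat names, the bounds and the `cell_color` function are hypothetical, chosen only to show how a set of metrics might be reduced to one green or red cell.

```python
# Hypothetical acceptable bounds per stat: (minimum, maximum), inclusive.
# These names and numbers are illustrative assumptions, not Facebook's values.
ACCEPTABLE_BOUNDS = {
    "latency_ms": (0.0, 50.0),
    "error_rate": (0.0, 0.01),
}


def cell_color(stats):
    """Return "green" if every stat is within its bounds, otherwise "red"."""
    for name, value in stats.items():
        low, high = ACCEPTABLE_BOUNDS[name]
        if not (low <= value <= high):
            return "red"  # one out-of-bounds stat is enough to flag the host
    return "green"


healthy = cell_color({"latency_ms": 12.0, "error_rate": 0.001})
unhealthy = cell_color({"latency_ms": 120.0, "error_rate": 0.001})
```

In this sketch a single stat outside its bounds turns the whole cell red, so a viewer only has to look for red cells rather than read individual values.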
Facebook uses two major cache systems. One uses Memcache, and the other relies on a caching graph database called TAO.
Both systems produce copious performance metrics, including latency, request rate and error rate statistics. According to Lynch, the caching team was already using a generic heat map to monitor performance. That software, however, could not easily fit the visual data onto a single screen. The colors it used to represent different values gave little intuitive indication of whether a server was performing adequately. Nor did it interpret the source data in a way that could immediately show whether an individual host was running within acceptable bounds.
Lynch designed Claspin, named after a protein that monitors for DNA damage in cells, so that each cluster of servers gets its own heat map, ordered by rack number within a data center. As a result, problems at the rack level or the cluster level become readily apparent simply by viewing the heat map.
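A rough sketch of that layout step, in Python, might look like the following. The `Host` record and the `group_by_cluster` function are hypothetical names introduced for illustration; the article does not describe how Claspin stores its host list.

```python
from collections import defaultdict
from typing import NamedTuple


class Host(NamedTuple):
    # Hypothetical host record; the field names are assumptions.
    name: str
    cluster: str
    rack: int


def group_by_cluster(hosts):
    """Return {cluster: [hosts sorted by rack, then name]}, one list per heat map."""
    clusters = defaultdict(list)
    for host in hosts:
        clusters[host.cluster].append(host)
    return {
        cluster: sorted(members, key=lambda h: (h.rack, h.name))
        for cluster, members in sorted(clusters.items())
    }


hosts = [
    Host("cache-07", "cluster-b", 3),
    Host("cache-02", "cluster-a", 2),
    Host("cache-01", "cluster-a", 1),
    Host("cache-03", "cluster-a", 2),
]
layout = group_by_cluster(hosts)
```

Because hosts from the same rack end up next to each other, a failing rack would show up as a contiguous run of red cells, and a failing cluster as a mostly red map.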
"On a 30-inch screen we could easily fit 10,000 hosts at the same time, with 30 or more stats contributing to their color, updated in real time--usually in a matter of seconds or minutes," Lynch said. The code to parse and compile the operational metrics was written in JavaScript, and the heat maps were rendered using the SVG format.
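To give a sense of how a grid of colored cells maps onto SVG, here is a small sketch. Claspin itself was written in JavaScript; this example uses Python purely for illustration, and the `render_svg` function, the cell size and the column count are all assumptions rather than details from Lynch's post.

```python
def render_svg(colors, columns=4, cell=10, gap=1):
    """Render a list of cell colors as a grid of SVG <rect> elements.

    Cells are laid out left to right, top to bottom. The sizes are in
    pixels and are arbitrary illustrative defaults.
    """
    step = cell + gap
    rows = -(-len(colors) // columns)  # ceiling division
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{columns * step}" height="{rows * step}">'
    ]
    for index, color in enumerate(colors):
        x = (index % columns) * step
        y = (index // columns) * step
        parts.append(
            f'<rect x="{x}" y="{y}" width="{cell}" height="{cell}" fill="{color}"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = render_svg(["green", "red", "green"], columns=2)
```

Each host costs one small rectangle, which is part of why thousands of hosts can share a single screen.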