How Digg.com uses the LAMP stack to scale upward
Caching and 'sharding' data speeds up the social media Web site
Computerworld - Digg.com credits two particular features of its LAMP (Linux, Apache, MySQL and PHP) server cluster for helping the news aggregation site maintain speedy performance in the face of high growth.
The site, which lets its users vote on, or "digg," their favorite news stories hosted on other sites, recently passed the 1.2 million-user mark, according to Elliot White III, an engineer at San Francisco-based Digg Inc. He spoke at MySQL’s annual conference in Santa Clara, Calif., on Tuesday.
Today, Digg.com boasts 100 servers scattered in multiple data centers that host a total of 30GB of data, but the site started off in late 2004 as a single Linux server running Apache 1.3, PHP 4, and MySQL 4.0 using the default MyISAM storage engine, White said.
As more users dug Digg, the site moved to an architecture that uses a load balancer in the front that sends queries to PHP servers, MySQL slave servers that feed the PHP servers, and a MySQL master server that feeds data to the slaves.
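The read/write split described above can be illustrated with a minimal Python sketch. It is not Digg's code: the class name, the server names and the rule that anything other than a SELECT counts as a write are all assumptions made for illustration.

```python
import itertools


class ReplicatedRouter:
    """Pick a database server for a query (hypothetical helper).

    Writes go to the single master; reads rotate across the slaves.
    """

    def __init__(self, master, slaves):
        if not slaves:
            raise ValueError("at least one slave is required")
        self.master = master
        # Round-robin iterator over the slave servers.
        self._slaves = itertools.cycle(slaves)

    def server_for(self, sql):
        # Assumption: any statement that is not a SELECT is a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._slaves) if is_read else self.master


# Placeholder server names, not real hosts.
router = ReplicatedRouter("master-db", ["slave-1", "slave-2"])
print(router.server_for("SELECT * FROM stories"))
print(router.server_for("UPDATE stories SET diggs = diggs + 1"))
```

In this arrangement the master handles only writes and replication, so read capacity can grow by adding slaves.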
That's a fairly standard setup. But to get away from "sending raw queries against the database," White said, Digg.com uses software called Memcached. First developed for the LiveJournal site, Memcached is tailored for dynamic sites like Digg.com, which serve Web pages with content that is constantly changing and is personalized according to user preferences, White said.
Memcached stores chunks of data that can be pulled and used to dynamically create a Web page. Conventional caching systems, which store whole Web pages, would be too slow and inefficient for a site like Digg.
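The pattern of caching chunks of data rather than whole pages is often called cache-aside. The Python sketch below shows the idea under stated assumptions: FakeCache is an in-memory stand-in for a Memcached client, and load_story_from_db, get_story and the "story:" key format are hypothetical names, not anything Digg described.

```python
class FakeCache:
    """In-memory stand-in for a Memcached client (get and set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None on a miss.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


# Records every simulated database hit so the effect of caching is visible.
db_calls = []


def load_story_from_db(story_id):
    """Hypothetical database read; a real site would run a SQL query here."""
    db_calls.append(story_id)
    return {"id": story_id, "title": f"Story {story_id}"}


def get_story(cache, story_id):
    """Return one story, querying the database only on a cache miss."""
    key = f"story:{story_id}"
    story = cache.get(key)
    if story is None:
        story = load_story_from_db(story_id)
        cache.set(key, story)
    return story
```

A page can then be assembled from many such cached chunks, so a change to one story invalidates only that story's entry rather than every page that displays it.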
The other atypical feature of Digg’s setup is its use of what Tim Ellis, another Digg engineer, calls "sharding."
A term apparently coined by Google engineers, sharding involves breaking a database into smaller parts in order to isolate heavy loads for better performance.
"If 90% of your data is within a certain range, and you can get that part working really fast, then you can help customers," Ellis said. "Then it’s OK if the remaining 10% is slower."
A database can be sharded by table, date or range. It is similar to partitioning, says Ellis, but with several key differences. Sharding usually involves divvying up data onto different physical machines. Partitioning, in contrast, typically occurs on the same piece of hardware. And while MySQL does not natively allow sharding, it does support partitioned tables, federated tables and clusters.
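Sharding by range can be sketched as a lookup from a key to the machine that holds it. The Python example below is illustrative only: the user-ID boundaries, the shard names and the shard_for_user function are assumptions, since the article does not say how Digg divides its data.

```python
import bisect

# Hypothetical layout: shard i holds user IDs below SHARD_UPPER_BOUNDS[i]
# and at or above the previous bound.
SHARD_UPPER_BOUNDS = [500_000, 1_000_000, 1_500_000]
SHARD_HOSTS = ["shard-a", "shard-b", "shard-c"]


def shard_for_user(user_id):
    """Return the host that stores the given user ID."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # bisect_right counts the bounds that are <= user_id, which is
    # exactly the index of the shard whose range contains the ID.
    index = bisect.bisect_right(SHARD_UPPER_BOUNDS, user_id)
    if index == len(SHARD_HOSTS):
        raise ValueError(f"no shard configured for user_id {user_id}")
    return SHARD_HOSTS[index]
```

Because each range lives on its own machine, a heavily used range can be given faster hardware without touching the others.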
Digg only recently began sharding. While sharding is helping Digg.com achieve much faster performance overall, breaking a database into several smaller ones increases complexity, Ellis said. That can mean more work for developers and database administrators, because of the inability to use common SQL commands such as joining tables. "Developers don’t like this crazy stuff. That can create pushback," he said.
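The loss of SQL joins means the combining work moves into application code. A minimal Python sketch, assuming user names live on one shard and digg records on another (the data and the diggs_with_usernames function are hypothetical):

```python
# Rows as they might come back from two separate shards (made-up data).
users_shard = {1: "alice", 2: "bob"}
diggs_shard = [(1, "story-9"), (2, "story-9"), (1, "story-4")]


def diggs_with_usernames(diggs, users):
    """Join digg rows to user names in application code.

    Stands in for a SQL JOIN that cannot run across two machines.
    Diggs whose user is missing are dropped, like an inner join.
    """
    return [(users[user_id], story)
            for user_id, story in diggs
            if user_id in users]


print(diggs_with_usernames(diggs_shard, users_shard))
```

Each such join is code that developers must write and maintain themselves, which is the added complexity Ellis describes.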