Digg Dips Deep Into Open Source

April 30, 2007 12:00 PM ET

Computerworld - SANTA CLARA, Calif. -- Information technology staffers at Digg Inc. credit two particular features of the company's LAMP-based server cluster for helping its Digg.com news aggregation Web site maintain speedy performance in the face of rapid usage increases.

The site, which lets visitors vote on (or "digg") their favorite news stories hosted on other sites, recently passed the 1.2 million user mark, according to Elliot White III, an engineer at Digg who spoke at MySQL's user conference here last week.

Digg has about 100 servers that run a combination of Linux, the Apache Web server, the MySQL database and the PHP scripting language, all open-source technologies that are collectively referred to as LAMP. The systems, which are scattered in multiple data centers, include about 20 database servers, 30 Web servers and a few search servers running the open-source Lucene search engine. The rest of the systems operate as backup machines.

In Digg's architecture, a load balancer sends queries to PHP servers, MySQL slave servers that feed data to the PHP servers, and a MySQL master server that feeds data to the slaves. That's a fairly standard setup. But White said that to get away from sending raw queries against the database, the San Francisco-based company uses open-source memory caching software called Memcached.
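As a rough illustration of that master/slave layout (a sketch, not Digg's actual code), the Python below routes writes to a single master and rotates reads across the slaves. The class name, the host names and the "anything that is not a SELECT is a write" rule are all assumptions made for the example.

```python
import itertools


class ReplicatedPool:
    """Toy router: writes go to the master, reads rotate across slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)  # simple round-robin

    def server_for(self, sql):
        # Simplification: treat any statement that is not a SELECT as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._slaves) if is_read else self.master


# Hypothetical host names, for illustration only.
pool = ReplicatedPool("db-master", ["db-slave-1", "db-slave-2"])
print(pool.server_for("SELECT title FROM stories WHERE id = 7"))
print(pool.server_for("UPDATE stories SET diggs = diggs + 1 WHERE id = 7"))
```

A real deployment would also have to account for replication lag, since a slave may briefly serve data older than what the master holds.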

First developed for use by LiveJournal Inc.'s online journaling Web site, Memcached stores chunks of data that can be pulled out and used to dynamically create a Web page. Conventional caching technologies, which store entire Web pages, would be too slow and inefficient for a site that changes continuously like Digg.com, White said.
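The idea of caching chunks rather than whole pages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `FragmentCache` is an in-process stand-in for a Memcached client, and `FakeDatabase`, `get_story`, `render_front_page` and the 30-second expiry are hypothetical names and values, not anything Digg described.

```python
import time


class FragmentCache:
    """In-process stand-in for a Memcached client (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


class FakeDatabase:
    """Hypothetical database that counts how often it is queried."""

    def __init__(self):
        self.queries = 0

    def load_story(self, story_id):
        self.queries += 1
        return {"id": story_id, "title": f"Story {story_id}", "diggs": 100 + story_id}


def get_story(cache, db, story_id):
    key = f"story:{story_id}"
    story = cache.get(key)
    if story is None:  # miss: query the database, then cache the chunk
        story = db.load_story(story_id)
        cache.set(key, story, ttl_seconds=30)
    return story


def render_front_page(cache, db, story_ids):
    # The page is assembled per request from individually cached chunks.
    stories = (get_story(cache, db, i) for i in story_ids)
    return "\n".join(f"{s['diggs']} diggs: {s['title']}" for s in stories)


cache, db = FragmentCache(), FakeDatabase()
render_front_page(cache, db, [1, 2, 3])
render_front_page(cache, db, [2, 3, 4])  # only story 4 needs a database read
print(db.queries)  # prints 4
```

Because each story is cached on its own, a change to one story invalidates only that chunk, while the rest of the page keeps coming from memory.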

The other atypical feature of Digg's setup is its use of what engineer Tim Ellis called "sharding," a term apparently coined by developers at Google Inc. Sharding involves breaking a database into smaller parts to improve performance by isolating heavy workloads.

"If 90% of your data is within a certain range and you can get that part working really fast, you can help customers," Ellis said. "Then it's OK if the remaining 10% is slower."

A database can be sharded by table, date or range. The process is similar to partitioning but with some key differences, Ellis said. For example, sharding usually involves divvying up data onto different physical machines, but partitioning is typically done on the same piece of hardware.
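Sharding by range can be pictured as a lookup from a key to the machine that owns it. The sketch below assumes, purely for illustration, that stories are split by numeric ID across three hosts; the boundaries and host names are invented.

```python
import bisect

# Hypothetical layout: shard i holds story IDs below SHARD_UPPER_BOUNDS[i]
# and at or above the previous bound (0 for the first shard).
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_HOSTS = ["db-shard-a", "db-shard-b", "db-shard-c"]


def shard_for(story_id):
    """Return the host whose ID range contains story_id."""
    index = bisect.bisect_right(SHARD_UPPER_BOUNDS, story_id)
    if story_id < 0 or index >= len(SHARD_HOSTS):
        raise ValueError(f"no shard configured for id {story_id}")
    return SHARD_HOSTS[index]


print(shard_for(42))         # db-shard-a
print(shard_for(1_000_000))  # db-shard-b
print(shard_for(2_999_999))  # db-shard-c
```

Sharding by date works the same way with date boundaries in place of ID boundaries. Sharding by table needs no range lookup at all, only a mapping from table name to host.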

Breaking a database into several smaller pieces can mean more work because of the inability to use common SQL commands, such as table joins, Ellis noted. "Developers don't like this crazy stuff," he said.
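To see why joins become extra work, consider two tables that live on different machines. In the hedged sketch below, two separate in-memory SQLite databases stand in for two MySQL shards, and the join is done in application code with two queries and a lookup table. The schema and sample data are invented for the example.

```python
import sqlite3

# Two separate in-memory databases stand in for two physical shards.
users_shard = sqlite3.connect(":memory:")
stories_shard = sqlite3.connect(":memory:")

users_shard.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
users_shard.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

stories_shard.execute(
    "CREATE TABLE stories (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
)
stories_shard.executemany(
    "INSERT INTO stories VALUES (?, ?, ?)",
    [(10, 1, "Linux tips"), (11, 2, "PHP tricks"), (12, 1, "MySQL tuning")],
)


def stories_with_authors():
    """Emulate `stories JOIN users` with two queries and a lookup table."""
    stories = stories_shard.execute(
        "SELECT id, user_id, title FROM stories ORDER BY id"
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in stories})
    if not user_ids:
        return []
    placeholders = ",".join("?" * len(user_ids))
    names = dict(
        users_shard.execute(
            f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
        ).fetchall()
    )
    # Stitch the two result sets together in application code.
    return [(title, names.get(user_id)) for _, user_id, title in stories]


print(stories_with_authors())
```

On a single database the same result would be one `SELECT ... JOIN` statement. Once the tables are split, the application has to make two round trips and merge the rows itself.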

Digg is really lucky in that 98% of the time, users are reading data rather than writing it to the server, Ellis noted. "Most people come to Digg's front page, read it and leave, which is kind of nice," he said, drawing laughs from the audience.


