When the meteor and the 1PB database collide

Craters? No, ginormous amounts of celestial information in need of storage

Our fascination with the prospect of asteroids smashing into the Earth is as deep as the craters that can result from such cosmic fireballs. Think of all the movies Hollywood has made, from little-seen B flicks such as A Fire in the Sky to campy cult classics such as Night of the Comet to scientifically shaky blockbusters such as Meteor and Armageddon.

The 1990s were also awash with news of rocky passersby such as Comet Hale-Bopp and Comet Shoemaker-Levy 9, the latter of which unleashed fragments up to two kilometers wide upon Jupiter in 1994.

Once dismissed as the province of fringe cult groups, the fear of what astronomers call "impact events" turns out, thanks to improved satellite and telescopic monitoring, to be not so irrational after all.

Pan-STARRS on patrol

The latest and most ambitious effort to detect "near-Earth objects" (NEOs) is the Panoramic Survey Telescope and Rapid Response System, or Pan-STARRS.

A joint venture of the University of Hawaii, a number of other schools and the U.S. Air Force, Pan-STARRS is today testing a telescope mounted with the finest digital camera in existence, which boasts a resolution of 1.4 billion pixels.

When Pan-STARRS is fully operational several years from now, it will have four telescopes, each with a 1.4-gigapixel camera.

That will give Pan-STARRS a wider, faster and more-powerful view into space, and will enable it to meet its mandate of tracking virtually all NEOs larger than 300 meters in diameter as well as many smaller NEOs.

It will have plenty to see. About once a year, an asteroid of five to 10 meters in diameter explodes in the Earth's upper atmosphere, releasing as much energy as the atomic bomb used at Hiroshima. And if one slips through, it can cause a lot of damage -- even if it's not a big one.

The asteroid behind 1908's Tunguska Event was only about 50 meters in diameter, but it created an explosion equivalent to 10 to 15 megatons of TNT (about 1,000 times the Hiroshima bomb), knocking over an estimated 80 million trees in Siberia and causing an earthquake that's estimated to have measured a 5.0 on the Richter scale (which had not yet been invented at that time). And we are due for another event like that within 200 years, according to the late astronomer Eugene Shoemaker.

With just a single telescope, Pan-STARRS already generates 1.4 terabytes of raw image data nightly. Compressing, storing and crunching that data in an economical fashion turns out to be a feat of database engineering as impressive as the collection process.

Rather than turning to an expensive supercomputer equipped with hundreds or thousands of processors, Pan-STARRS will use a cluster of 50 PC servers connected to 1.1 petabytes of disk storage via fast Infiniband networking gear, according to Alex Szalay, a physics and astronomy professor at Johns Hopkins University and one of the architects of Pan-STARRS' database.
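To put those figures in proportion, here is a rough back-of-the-envelope sketch in Python. It uses only numbers quoted in this article (1.4TB of raw images per night from one telescope, four telescopes planned, 50 servers, 1.1PB of disk). Two things are assumptions rather than reported facts: that each additional telescope would produce the same nightly volume as the first, and that the disk is spread evenly across the servers.

```python
# Back-of-the-envelope arithmetic using figures quoted in the article.
# Anything marked ASSUMPTION is illustrative and not reported by the project.

RAW_TB_PER_NIGHT_PER_TELESCOPE = 1.4  # article: one telescope today
PLANNED_TELESCOPES = 4                # article: full configuration
SERVERS = 50                          # article: PC servers in the cluster
DISK_TB = 1100.0                      # article: 1.1PB, taken here as 1,100TB


def nightly_raw_tb(telescopes: int) -> float:
    """Raw image volume per night.

    ASSUMPTION: every telescope produces as much data as the first one.
    """
    return telescopes * RAW_TB_PER_NIGHT_PER_TELESCOPE


def disk_per_server_tb(disk_tb: float, servers: int) -> float:
    """Average disk per server.

    ASSUMPTION: storage is divided evenly across the cluster.
    """
    return disk_tb / servers


if __name__ == "__main__":
    print(f"Nightly raw data, four telescopes: {nightly_raw_tb(PLANNED_TELESCOPES):.1f} TB")
    print(f"Average disk per server: {disk_per_server_tb(DISK_TB, SERVERS):.0f} TB")
```

Under those assumptions, the full array would produce roughly four times today's nightly volume, and each server would front about 22TB of disk.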

And rather than using a database management program better-known for ultralarge data warehouses, such as IBM's DB2, a TeraData system or Oracle Database, Pan-STARRS will use Microsoft Corp.'s just-released SQL Server 2008.

Weighing the benefits

Even Microsoft would probably admit that despite improved data compression and a resource governor to manage multiple workloads, SQL Server 2008 is not the most intuitive choice for this clustered, "scaled-out" schema.

"SQL Server 2008 takes us to the next level, but that is within the 'scale-up' model," said Ted Kummert, a vice president in Microsoft's data and storage platform division, this week during a conference about the launch of the upgraded database. Rather, Microsoft's recent acquisition of DATAllegro Inc., a start-up vendor focused on large data warehouses, "will take us to the greatest level of scale-out," he said.

There are several reasons, though, why Pan-STARRS went with SQL Server 2008.

One is cost. Deploying Pan-STARRS will cost just $750,000, thanks to the low cost of the PC hardware and the heavy academic discounts offered by Microsoft for SQL Server and Windows Server 2008.

"People in academia are always operating on a shoestring budget, so we wanted to be able to create something others could emulate," Szalay said.

More important, however, is Microsoft's long involvement with the astronomical community, especially via its technical ambassador, Jim Gray. The noted database researcher, who disappeared at sea in early 2007 and is now presumed dead, was instrumental in building earlier databases, such as TerraServer, a massive free Web archive of satellite pictures of the Earth stored in SQL Server, and the 40TB SkyServer, a similar repository of astronomical images.

Indeed, the distributed database platform that Pan-STARRS (and, it is hoped, other applications) will run on is called GrayWulf in Gray's honor.

"Gray worked with us for more than a decade. All the credit should go to him," Szalay said.

"He changed astronomy as we know it," said Maria A. Nieto-Santisteban, a software engineer at Johns Hopkins and the technical lead of the Pan-STARRS project. "We still ask ourselves, 'How would Jim do this?'"

From magnifying glasses to megastorage

Astronomers first began storing data digitally in the mid-1970s, shortly after they began replacing conventional photographic plates with digital camera technology.

Efficiency-wise, digital cameras were still a vast improvement over those photographic plates, which required astronomers to hunch over them with magnifying glasses, counting galaxies and stars. But the digital image resolution back then left something to be desired -- just 260,000 pixels.

Data storage was also crude. Image data was, and still is, stored in a low-level format based on 80-character-long punch cards. But the flat files used to store the data proved difficult to search and otherwise manipulate.

Gray guided the building of SkyServer, which holds 100 billion rows of data and 1 million distinct IP addresses, and serves 10,000 to 15,000 professional astronomers as well as countless schoolchildren who use SkyServer to complete astronomy reports.

Pan-STARRS, which Gray helped conceive, will be far larger, containing, by the end of 2010, 300TB of data, with some individual tables as large as 20TB, Szalay said. The repository will include data on more than 140 billion cosmic objects and 5.5 billion actively tracked ones.

Though Pan-STARRS won't use up all 1PB of storage for many years, it will still rank as one of the world's largest databases.

Since Pan-STARRS is set up as a clustered system, the data will be partitioned, with a separate names database serving as the index. Since most cosmic objects don't have names such as Earth or Alpha Centauri, most searches will be done via a graphical interface that, according to Szalay, "looks and feels a lot like MapQuest or Google Maps."
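The idea of partitioned data plus a separate names index can be illustrated with a toy Python sketch. The article does not say how Pan-STARRS actually divides its data, so the partitioning rule here (equal bands of declination) is an assumption, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SkyObject:
    object_id: int
    ra_deg: float    # right ascension, degrees
    dec_deg: float   # declination, degrees (-90 to +90)
    name: Optional[str] = None  # most objects have no name


class PartitionedCatalog:
    """Toy catalog: objects live in partitions, names live in a separate index."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self.partitions: List[Dict[int, SkyObject]] = [
            {} for _ in range(num_partitions)
        ]
        # name -> (partition number, object_id); stands in for the names database
        self.name_index: Dict[str, Tuple[int, int]] = {}

    def partition_for(self, dec_deg: float) -> int:
        # ASSUMPTION: split the sky into equal-height declination bands.
        band = int((dec_deg + 90.0) / 180.0 * self.num_partitions)
        return min(band, self.num_partitions - 1)  # dec = +90 goes in the last band

    def insert(self, obj: SkyObject) -> None:
        p = self.partition_for(obj.dec_deg)
        self.partitions[p][obj.object_id] = obj
        if obj.name:
            self.name_index[obj.name] = (p, obj.object_id)

    def find_by_name(self, name: str) -> Optional[SkyObject]:
        # The index says which partition to ask, so only one is touched.
        location = self.name_index.get(name)
        if location is None:
            return None
        p, object_id = location
        return self.partitions[p][object_id]


if __name__ == "__main__":
    catalog = PartitionedCatalog(num_partitions=4)
    catalog.insert(SkyObject(1, 219.9, -60.8, name="Alpha Centauri"))
    catalog.insert(SkyObject(2, 10.0, 20.0))  # unnamed, reachable only by position
    print(catalog.find_by_name("Alpha Centauri"))
```

A lookup by name touches the small index first and then exactly one partition, which is the point of keeping the index separate. Unnamed objects, the vast majority, would instead be found by position.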

Besides being used to look up data on individual stars or galaxies, Pan-STARRS will also be used to do some deep data mining -- astronomical intelligence, if you will. For instance, Szalay hopes to import old astronomical data from the pre-digital age and run the information through a spatial cross-matching engine in order to create a master database that links all past and present data about every single star or planet.
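In its simplest form, spatial cross-matching means pairing entries from two catalogs whose sky positions fall within a small angle of each other. The Python sketch below shows that idea with a brute-force comparison. It is not the Pan-STARRS engine, whose design the article does not describe; the 2-arcsecond tolerance, the function names and the sample coordinates are all made up for illustration.

```python
import math
from typing import Dict, Tuple

Position = Tuple[float, float]  # (right ascension, declination) in degrees


def angular_separation_deg(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle angle between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(
    old_catalog: Dict[str, Position],
    new_catalog: Dict[str, Position],
    tolerance_arcsec: float = 2.0,  # ASSUMPTION: illustrative tolerance
) -> Dict[str, str]:
    """Link each old entry to the nearest new entry within the tolerance.

    Brute force (every pair is compared), which is fine for a demo but
    far too slow for billions of objects.
    """
    tolerance_deg = tolerance_arcsec / 3600.0
    matches: Dict[str, str] = {}
    for old_id, (ra_o, dec_o) in old_catalog.items():
        best_id, best_sep = None, tolerance_deg
        for new_id, (ra_n, dec_n) in new_catalog.items():
            sep = angular_separation_deg(ra_o, dec_o, ra_n, dec_n)
            if sep <= best_sep:
                best_id, best_sep = new_id, sep
        if best_id is not None:
            matches[old_id] = best_id
    return matches


if __name__ == "__main__":
    # Hypothetical identifiers and coordinates.
    plates = {"plate-1": (10.0, 20.0), "plate-2": (50.0, -30.0)}
    survey = {"ps-a": (10.0001, 20.0001), "ps-b": (200.0, 5.0)}
    print(cross_match(plates, survey))
```

A production engine would avoid comparing every pair, for example by first sorting objects into sky zones so that each one is checked only against its neighbors.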

Pan-STARRS will also serve as a cloud database for outside astronomers, who will be allowed to remotely run queries and store results within Pan-STARRS. An initial difficulty, Nieto-Santisteban acknowledged, is that most astronomers are used to writing applications in C++, not SQL.
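The shift from C++ to SQL is largely a shift from spelling out loops to describing the result wanted. As a minimal sketch, the Python snippet below runs a declarative query for bright objects inside a patch of sky. SQLite (bundled with Python) stands in for SQL Server here, and the table, columns and sample rows are hypothetical, not the Pan-STARRS schema.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a few made-up objects."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects ("
        "object_id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, magnitude REAL)"
    )
    conn.executemany(
        "INSERT INTO objects VALUES (?, ?, ?, ?)",
        [
            (1, 10.0, 20.0, 14.5),
            (2, 10.5, 20.2, 18.9),
            (3, 200.0, -5.0, 12.1),
        ],
    )
    return conn


def bright_objects_in_box(
    conn: sqlite3.Connection,
    ra_min: float,
    ra_max: float,
    dec_min: float,
    dec_max: float,
    max_magnitude: float,
) -> List[Tuple[int, float, float, float]]:
    """Objects inside a rectangular patch of sky, brightest first.

    Lower magnitude means brighter, hence the <= comparison.
    """
    query = """
        SELECT object_id, ra_deg, dec_deg, magnitude
        FROM objects
        WHERE ra_deg BETWEEN ? AND ?
          AND dec_deg BETWEEN ? AND ?
          AND magnitude <= ?
        ORDER BY magnitude
    """
    params = (ra_min, ra_max, dec_min, dec_max, max_magnitude)
    return conn.execute(query, params).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(bright_objects_in_box(conn, 9.0, 11.0, 19.0, 21.0, 16.0))
```

The query states what is wanted and leaves the how (which partitions or indexes to scan) to the database, which is what makes remote querying of a very large shared archive practical.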
