Managing Tons Of Data

How much data do large corporations manage? Tons of it. Referring to "tons" of data may be intuitive for paper records, but it's an unusual way to describe computer-stored information, which is usually measured by character counts and file sizes. Still, using tons may give an added sense of how much data a terabyte is. To be sure, measuring data by the ton isn't definitive because a disk drive's weight doesn't vary significantly over a wide range of storage capacities, but it's a handy starting point. A common 8GB hard drive weighs a little more than 1 lb. Figure that the weight of a shared enclosure, power supply and electronics will roughly double the drive's weight, and we can say that 8TB of data, which fills about 1,000 such drives at roughly 2 lb. apiece, is approximately equivalent to 1 ton. That much storage is cumbersome and ungainly.
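For readers who want to check the conversion, here is a minimal sketch of the rule of thumb above. It assumes decimal units (1TB = 1,000GB), a U.S. short ton of 2,000 lb. and exactly 1 lb. per 8GB drive; the function and constant names are illustrative only.

```python
# Rule-of-thumb conversion from terabytes to "tons of data".
# Assumptions (ours, for illustration): decimal units, 2,000 lb. per ton,
# and a drive weight rounded down to exactly 1 lb.
GB_PER_DRIVE = 8          # capacity of the "common" drive in the article
LB_PER_DRIVE = 1.0        # approximate weight of one bare drive
ENCLOSURE_FACTOR = 2.0    # enclosure, power supply and electronics double the weight
GB_PER_TB = 1000
LB_PER_TON = 2000


def tons_of_data(terabytes):
    """Return the approximate weight, in tons, of the disks holding `terabytes`."""
    drives = terabytes * GB_PER_TB / GB_PER_DRIVE
    pounds = drives * LB_PER_DRIVE * ENCLOSURE_FACTOR
    return pounds / LB_PER_TON


if __name__ == "__main__":
    for tb in (8, 174.6, 300):
        print(f"{tb}TB is roughly {tons_of_data(tb):.1f} tons")
```

Under these assumptions, 8TB works out to 1 ton, and the same arithmetic yields the tonnage figures quoted for the companies later in this article.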

How does an enterprise deal gracefully and effectively with such unwieldy mountains of information? We asked four data-intensive companies - Aetna Inc., The Boeing Co., Atos Origin and AT&T Corp. - to tell us about the problems they faced in managing massive data stores, and how they solved them. For each company, the data is a significant corporate asset resulting from huge investments of time and effort. The data is also the source of many trials and tribulations for the employees who keep vigilant watch over it.

While these companies say that good tools are important for managing terabytes of information, their IT and database administrators also agree that having a clear and comprehensive perspective on the data, via both logical and physical views, is even more critical. Security, data integrity and data availability aren't trivial concerns, they point out, and giving users easy access to the data is a never-ending job.

Tips for Managing Large Data Stores

Be selective in how you implement HSM. Instead of blindly giving all your data to a robotic HSM process, analyze and classify your company's data usage to know how often the data is reused and thus when HSM might be appropriate.

The logical view of your data is just as important as the physical view. Knowing which data elements are duplicated in your database and why tells you not only the degree of normalization but also what fraction of the database is involved in purely redundant I/O.

Perform data backup/restore fire drills periodically and religiously to make sure you don't lose lots of data to human error or natural disaster.

Recognize that you may have to develop your own transaction-aware backup software - especially if you have a growing database and your relational database engine doesn't support hot backups. It's not funny when you run out of time for making off-line backup copies.

Carefully segregate externally visible data from your internal data, for security purposes. An ounce of prevention is worth a ton of cure.


Insuring a Healthy 21.8 Tons

On a daily basis, Renee Zaugg, operations manager in the operational services central support area at Aetna, is responsible for 21.8 tons of data (174.6TB). She says 119.2TB reside on mainframe-connected disk drives, while the remaining 55.4TB sit on disks attached to midrange computers running IBM's AIX or Sun Microsystems Inc.'s Solaris. Almost all of this data is located in the company's headquarters in Hartford, Conn. Most of the information is in relational databases, handled by IBM's DB2 Universal Database (Versions 6 and 7 for OS/390), DB2 for AIX, Oracle8 on Solaris and Sybase Inc.'s Adaptive Server 12 on Solaris. To make matters even more interesting, Zaugg adds, outside customers have access to about 20TB of the information. Four interconnected data centers containing 14 mainframes and more than 1,000 midrange servers process the data. It takes more than 4,100 direct-access storage devices to hold Aetna's key databases.

Most of Aetna's ever-growing mountain of data is health care information. The insurance company maintains records for both health maintenance organization participants and customers covered by insurance policies. Aetna has detailed records of providers, such as doctors, hospitals, dentists and pharmacies, and it keeps track of all the claims it has processed. Some of Aetna's larger customers send tapes containing insured employee data, but Nancy Tillberg, head of strategic planning, says the firm is moving toward using the Internet to collect such data.

"Data integrity, backup, security and availability are our biggest concerns," Zaugg says. Her data handling tools, procedures and operations schedules have to stay ahead of not only the normal growth that results from the activities of the sales, underwriting and claims departments but also growth from corporate acquisitions and mergers.

Like Atos Origin and Boeing, Aetna uses IBM's Virtual Tape Servers (VTS) to reduce its tape drive bottleneck. Zaugg says Aetna has used VTS to shrink its tape library from almost 1 million volumes to just under a quarter of that amount. She emphasizes that the major impetus for the consolidation was the time required for tape processing and handling, not the cost of tapes.

Since DB2 V6 doesn't support hot backups, the operations area has to take the DB2 V6 systems off-line to make backup copies. VTS lets Aetna drastically cut the time it takes to back up the DB2 V6 and other data, which increases the time the data is available to users. "Aetna's goal is to soon have hands-off tape operations on its mainframe computers," says Tillberg.

She adds that Aetna has a server consolidation effort under way to reduce the effort necessary to manage data on the midrange machines. "Nonetheless," she says, "the need for server load balancing won't go away soon." For its Web servers, Aetna uses Sunnyvale, Calif.-based Resonate Inc.'s Global Dispatch to distribute HTTP traffic to the nearest available server that's least busy. Tillberg says she likes the way Global Dispatch manages mirrored Web servers located not only in the same room but also in geographically dispersed locations.

Tillberg also says the company is increasing its use of storage-area network (SAN) technology to centralize and streamline the management of that data. She points out that Aetna uses Global Enterprise Management software from Tivoli Systems Inc. in Austin, Texas, to monitor the network, distribute files and track data usage.

Aetna's database administrators maintain the more than 15,000 database table definitions with the ERWin data modeling tool, according to Michael Mathias, an information systems data storage expert at Aetna. Manual upkeep of the table definitions became impossible years ago, he says. Mathias sees the importance of viewing the maintenance of large amounts of data from a logical perspective. While the physical management of large data stores is certainly a nontrivial effort, Mathias says that failing to keep the data organized leads inexorably to user workflow problems, devaluation of the data as a corporate asset and, eventually, customer complaints.

Tons of Flying Data

LeaAnne Armstrong, director of distributed servers at Seattle-based Boeing, makes sure the approximately 50TB to 150TB (6 to almost 19 tons) of data the company owns remains as reliable and safe as the aircraft and spacecraft the company builds. She says the 50TB to 150TB estimate reflects Boeing's inability to know exactly how much data exists on its 150,000 desktop computers. Users don't necessarily store their data files on a server, which makes quantifying Boeing's data stores difficult, she says.

Like Aetna, Boeing has tens of mainframes and thousands of midrange servers running Unix and Windows NT. "Much of the data exists in relational form," Armstrong says, "but across the enterprise, Boeing uses virtually every file format known to man." According to Armstrong, Boeing's files run the gamut from Adobe Portable Document Format to computer-aided design and manufacturing machine and part descriptions. The relational databases are DB2 on the mainframe, Oracle on the Unix (HP-UX, AIX and Solaris) midrange machines and either SQL Server 7 or SQL Server 2000 on the smaller Intel-based Windows NT computers.

For Boeing's diverse terabytes, Armstrong shares Zaugg's basic concerns: data integrity, backup, security and availability. The two companies have similar philosophies and approaches to handling large amounts of data. Like Aetna, Boeing uses IBM's VTS to cache and manage its mainframe tapes and tape devices. Boeing plans to use SAN technology in the near future and to consolidate midrange servers rather than let them proliferate.

Armstrong also says effective use of virtual tape or any other hierarchical storage management (HSM) scheme depends on identifying the categories of data within the enterprise and treating each category appropriately. For example, she warns, Boeing makes a subtle but important distinction between backup tapes of transactional content vs. archive tapes of static aircraft design and manufacturing files. She says data must be classified carefully to get the most value from virtual tape.
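The kind of classification Armstrong describes might be sketched as follows. This is a hypothetical illustration, not Boeing's actual scheme: the category names, the 90-day threshold and the `DataSet` fields are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    days_since_last_access: int
    still_changing: bool  # True for transactional content, False for static files


def classify(ds, cold_after_days=90):
    """Assign a storage category; the threshold is an illustrative assumption."""
    if ds.still_changing:
        return "backup"       # transactional content: copy regularly for recovery
    if ds.days_since_last_access >= cold_after_days:
        return "archive"      # static and rarely reused: candidate for tape/HSM
    return "keep-online"      # static but still in frequent use


if __name__ == "__main__":
    samples = [
        DataSet("order-entry database", 0, True),
        DataSet("legacy airframe design files", 400, False),
        DataSet("current part descriptions", 3, False),
    ]
    for ds in samples:
        print(f"{ds.name}: {classify(ds)}")
```

The point of the sketch is only that the decision depends on two separate questions, whether the data is still changing and how often it is reused, rather than on a single blanket policy.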

Boeing's data stores are spread out across 27 states and a few overseas locations, but most computing takes place in the Puget Sound area of Washington. Armstrong says the company currently has dozens of different backup and restore software utilities. Each department buys its own backup media and performs its own backup and restore operations. A major data loss hasn't happened yet, says Armstrong, but she's aware of the risks and plans to centralize the backing up and restoring of files in the future.

Armstrong says she hopes the hard disk, optical disk and tape drive manufacturers will eventually offer Boeing vendor-neutral and highly interoperable data storage. Furthermore, although hard disks are inexpensive these days, Armstrong says data management costs on a per-disk or per-tape basis are high enough that she wants to significantly reduce the amount of disk and tape "white space" - the portion of the media that Boeing doesn't use.

Virtual tape technology helps, she says, but she wishes that all Boeing's tapes and disks were based on a "storage-on-demand" model, whereby Boeing could simply rent whatever capacity it needed from an outside vendor and not have to worry about running out of space.

Phone Calls Galore

Mark Francis, enterprise architecture director at AT&T, manages several terabytes of information. One of his biggest data stores is a multiterabyte mainframe DB2 database containing phone-call detail. When an AT&T customer makes or attempts to make a call, the switching equipment automatically inserts a new row in the huge database.

For Francis, however, the company's new 650GB operational database of customer data, work orders and billing data is more interesting. He says the company is merging diverse databases of various kinds of customer data into a single, cohesive and consistent database. The project is well under way. "The goal is for everyone within AT&T to have one place to go to get any and all customer data," says Francis.

In the past, IMS-DBDC, DB2, Oracle and Informix Corp. systems were all used to control access to parts of the data, but Francis and his group have chosen Oracle to be the single repository for the new consolidated customer database.

Mirrored across two data centers located in Georgia and Missouri, the new customer data store resides on Sun Ultra 10000 computers. Sun Ultra 5500 computers perform data backup chores, and the two data centers are optically linked to allow fast fail-over among the machines should disaster strike. Francis says the company allots Sundays for doing full backups and performing software maintenance. AT&T uses Veritas Software Corp.'s NetBackup to make copies of the customer database. While backing up Oracle redo logs provides an ongoing incremental copy of the data, Francis says the process is time-consuming, and he wishes it weren't such a bottleneck.

Francis schedules periodic fire drills to ensure that each of the two data centers can fail over quickly and painlessly. He points out that managing large data stores across multiple data centers means more than just monitoring hard-disk devices. "At fail-over time, an entire data center - computers, storage, computing infrastructure and network connections - must pick up the workload without skipping a beat."

To handle large data stores efficiently, Francis suggests, "Don't underestimate the time it takes to get the data model - i.e., the schema - and the operational environment correct." Like Aetna's Mathias, Francis stresses the importance of an accurate and well-organized logical view of large data stores.

Terabytes That Follow the Sun

Mark Eimer, director of global automation tooling at Atos Origin, is responsible for about 300TB (37.5 tons) of other people's data. The majority of the data is relational, but, like Boeing, the Paris-based company stores thousands of different file formats. Atos Origin provides outsourcing services and data operations for other companies. Eimer says one Atos Origin customer is itself an enterprise with 130,000 employees. These users access several terabytes of Lotus Notes data on 600 servers.
