Review: Deep dive into SQL Server 2008

1 2 Page 2
Page 2 of 2

It's difficult not to compare Microsoft's Resource Governor with Oracle's. Oracle Database has had the feature longer, and it's certainly richer and more mature. But there's one advantage to Microsoft's you just can't ignore: It's included in the license, while Oracle's is a fairly expensive add-on.

Change Data Capture and nonlogged inserts

Change Data Capture (CDC) and nonlogged inserts are two nice enhancements to the SQL Server engine that will make a world of difference for some shops. These enhancements mainly revolve around warehouses, but can be useful on some OLTP systems as well. CDC will alter the way developers write ETL processes for warehouses. CDC employs the same log reader used by replication to write the specifics of DML (Data Manipulation Language) operations to a table instead of a distribution database. This allows warehouses to be very easily loaded from tables that only contain changed data -- no more having to modify legacy applications or ETL processes with update columns in all the tables to provide this functionality. And you can track with perfect precision the changes made to data in your tables without altering your schema or performing a messy and expensive query to filter out the differences.

Retrieving the changes from the CDC tables is very easy also. Microsoft has provided a set of functions that retrieve the data, so all you have to do is call them. As you would expect, both Type 1 and Type 2 data can be stored, so you can have the exact level of change data you want.

Nonlogged inserts allow SQL Server to do minimal logging for large inserts. By not logging every data row inserted, you can speed up your data loads by orders of magnitude in some cases. At the very least, you will see significant performance gains.

You could do nonlogged inserts in previous versions of SQL Server using what's known as a "select into" statement, but "select into" actually creates the table for you. This is great for new tables, but if you have existing tables with security and other attributes defined, you don't want to delete and re-create them. Another problem with "select into" is that you have no control where your new table goes; it always joins the default file group.

The new nonlogged insert allows you to insert the rows into an existing table, maintaining the control you have over your space, performance, and security, and still get the benefit of an ultra-fast data load. The general rule of thumb I like to follow for warehouses (and any database, come to think of it) is no moving schema. That means you don't want to delete and re-create permanent objects every day. It's error prone and introduces complexity into your system you don't need, not to mention your tables aren't available during the load because they've been deleted.

In my testing, non-logged inserts performed right on par with their "select into" cousins. As the DBA of a large warehouse that faces data loading problems every day, I love this feature. There's just one thing I would change about it: I wish there were an option I could tack onto my insert statement to throw it into non-logged mode.

Data Compression

SQL Server's first attempt at Data Compression actually looks pretty good. As I noted in my beta preview (see "Microsoft's 'Katmai' is filled to the brim"), SQL Server 2008 provides two types of compression: row and page. Row compression is true compression, in which unused spaces at the ends of columns are removed to save storage. Page compression, aka dictionary compression, normalizes the data on each page and keeps a lookup pointer. In SQL Server 2008, page compression includes row compression. If you have page compression turned on, you get row compression in the bargain.

Microsoft provides a handy compression calculation wizard that will give you a good estimate of the benefits you can expect. The wizard runs a test compression scenario against your data for each compression type (row and page) and tells you what the new size of the table should be. I tested the compression calculator against a number of data sets, and on average the calculation deviated from my final results by only 1 or 2 percent. That's pretty good, considering that the calculation is based on a relatively small amount of data.

How much will a table compress? That depends on both the type of data and the type of compression. If your data is something like SAP data, which has a lot of trailing spaces, then row-level compression will serve you well. If you have a lot of repeating data, then table-level page compression will be your best friend.

The order of your data matters also. Because SQL Server compresses at the page level, repeating data that is clustered together will be de-duplicated. So it pays to be aware of both the nature and the order of your data.

All that said, just as with Advanced Compression in Oracle Database 11g (see the review and compression test results), my results with Data Compression in SQL Server 2008 were all over the map. I got as little as 17 percent compression on some data sets, and as much as 76 percent on another. The high of 76 percent was reached on denormalized order line data in a data warehouse.

When it comes to Data Compression, SQL Server has one advantage over Oracle: SQL Server will compress existing tables in a single statement. In Oracle Database 11g, you have to create a new table, insert the data, and then delete the old table. Not only does SQL Server spare you from creating an additional copy of the data, but you can still perform operations on the table during the compression operation, no downtime required.

An excellent new feature called Sparse Columns allows you to store null values without taking up any physical space. But here Microsoft made a big mistake: Sparse Columns aren't compatible with Data Compression. If you define Sparse Columns on a table, you can't also apply Data Compression. As I mentioned in my preview, this one may be worthy of a Darwin Award. Sparse Columns and Data Compression would be a perfect match. Let's hope Microsoft fixes this soon.

Backup Compression too

SQL Server 2008 also introduces Backup Compression. In some preliminary tests, I found the compression ratios to be mostly on par with those of Quest LiteSpeed and other third-party backup tools I've used. Still, SQL Server's Backup Compression has a number of strikes against it. First, Backup Compression is only available for the enterprise edition. Even in the enterprise, that's just going to make things harder to manage. Chances are the lion's share of SQL Server instances in any sizable organization aren't enterprise edition, and for these instances to be the exception in the backup plan just isn't acceptable.

Look at it like this: Database size isn't the reason you've chosen enterprise edition. Because most versions of SQL Server can handle unlimited data size, enterprise edition is generally chosen for the features. As a result, you're likely to have plenty of SQL Server boxes in your shop with very large data sets that aren't on enterprise, and they need compressed backups too, which means you probably already have a third-party backup solution in place. To switch backup routines for a small subset of your SQL Server boxes just doesn't make sense. You'll surely want to have the same solution for your entire environment.

Second, unlike SQL Server, the third-party backup solutions have object-level restore, which can come in very handy in a number of situations. If you go with SQL Server's Backup Compression for your enterprise servers, you're losing functionality.

Third, the third-party solutions have centralized repositories and provide centralized reporting and alerting. So if you use SQL Server's native compression, you've effectively eliminated centralized management of backups for those boxes.

There are other features that third-party backup solutions bring to the table, but incomplete coverage, object-level restore, and centralized backup management are the biggest reasons that SQL Server's Backup Compression isn't going to be viable for most shops.

Index improvements

Indexes have received an overhaul. Not only can indexes be compressed in SQL Server 2008, but you can also build filtered indexes. Filtered indexes have a "where" clause, allowing you to partially index a large table. This may not seem very useful at first, but there are situations where it's very beneficial.

For example, let's look at a Sparse Column situation. Say you have 400 million records in a table with a significant number of null values, and you've defined a Sparse Column, so you're not taking up space for all of those nulls. Well, if you don't want to take up storage space with null values, then you surely don't want to take up index space with them either. Here you would define a filtered index on that sparse column where the value is not null so that only rows with actual data in them are indexed.

Not only can you save a lot of index space with filtered indexes, but the queries that use those indexes will be faster because they're running against a subset of the entire data set. You also shrink the maintenance window and re-index space for the index.

Another good use of filtered indexes might be if you have, say, ten years of data in a table but nobody ever queries further than two years back. You could create a filtered index for just those two years and get the same benefits as in the first scenario.

SQL Server Reporting Services

SSRS (SQL Server Reporting Services) has received a major overhaul. For starters, it no longer relies on IIS for its report server capabilities. This is a huge leap forward for SSRS shops because as I've found many times, it can be difficult to get approval for Windows changes, such as installing IIS. Often, these shops don't like to make major changes to a Windows install, and adding a component such as IIS, which is perceived to be difficult to secure, makes them doubly hesitant; not requiring IIS is a good thing.

Another major enhancement, called Tablix, is a new data element type that combines table, matrix, and list. Tablix gives you an easy way to work with grouping by allowing you to group on columns or rows as well as define adjacent or parent/child groups. All of this sounds fairly intuitive, but it does take a little training. I quickly found that building groups the way I did in the previous version left me out in the cold. I had to slightly change the way I authored reports to be successful.

Report Builder has come a long way as well. Microsoft has removed a number of limitations that could make it difficult to work with. Now, you can do almost anything in Report Builder that you can do in Visual Studio, and Report Builder comes with wizards that make formatting data and charts automatic. I have almost no reason to write reports in Visual Studio anymore. This new version of Report Builder (2.0) wasn't available when Katmai shipped, but it is available as a separate download now.

SQL Server Integration Services

Microsoft has taken some big steps to ensure that SSIS (SQL Server Integration Services) can compete directly with competitors such as Informatica by offering lookup caching and CDC. And while CDC isn't strictly an SSIS enhancement, SSIS does take advantage of it to increase the speed of ETL processes. In fact, both lookup caching and CDC can improve the speed of loads by orders of magnitude. Let's examine a couple situations where you'll see dramatic performance gains.

Perhaps you're loading customer data, and you have to load each customer one at a time because you have to look up other information and make load decisions based on that lookup data. In SQL Server 2005, you would put this load inside a "for each" loop container and perform the lookup on each customer as you iterate through the loop. This means that you would perform the lookup itself for every customer. Well, if you have a large lookup table -- say, something in the neighborhood of 500 million rows -- you would have to query that data for every row you import. And if the lookup table is not on the same system or if it's not indexed appropriately for your lookup, those lookups could take a lot of time.

Now, in SQL Server 2008, you can perform that lookup once for all the customers and cache the data in a local file. The upshot is that iterating through the loop will go much faster.

Another scenario where SSIS improvements come into play is in handling changed data. Let's say a customer address change needs to be reflected in the database you're loading. In SQL Server 2005, you have to perform a binary or text comparison, on a row-by-row basis, for each column you want to track. This means that the larger your customer table, the longer your lookup will be. But with CDC, you can easily see which rows and columns have changed and avoid the lookup altogether. Here again, you could reduce the time to load tremendously.

There are other enhancements in SSIS as well. Improved threading in the data pipeline will increase performance by better managing the resources and throttling the pipeline appropriately. There's also a new data-profiling task that makes it easier to identify any data purity issues. Finally, SSIS scripts can now be written in both VB and C#.

This story, "Review: Deep dive into SQL Server 2008" was originally published by InfoWorld.

Copyright © 2008 IDG Communications, Inc.

1 2 Page 2
Page 2 of 2
7 inconvenient truths about the hybrid work trend
Shop Tech Products at Amazon