Saturday, October 11, 2008

Whoop de Doop for De-Dupe

De-duplication started out as a way to do backups without having to store mostly the same stuff over and over again. Companies like Data Domain, Diligent Technologies, and NetApp offered de-dupe for virtual tape libraries and direct-to-disk backup targets, delivering full backups that stored only the changes since the previous backup. The result: You could reap the same space savings you get with incremental backups, but without needing multiple restores to re-create an entire volume.
Now these same companies are advertising de-duplication of near-line storage, and even online storage in NetApp’s case, while other vendors are using de-duplication to reduce WAN traffic, shrink the size of databases, or compress e-mail archives. Yes, de-duping is going gangbusters. Heck, we might even dream of the day when you’d never need more than one copy of any file in the entire enterprise. Assuming it’s possible, is that something you’d want?
Currently, all storage de-duplication requires a gateway between the server and the storage. Methods of de-duplication vary widely. Some solutions function at the file level, some at the block level, and some work with units of storage even smaller than blocks, variously referred to as segments or chunklets. Processing for de-duplication can occur either "in-line" (i.e., before the data is written to storage) or "post process" (meaning after the data is initially written).
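To make the block-level, in-line flavor concrete, here is a minimal sketch in Python. It is not any vendor's actual implementation; the names (BlockStore, BLOCK_SIZE, write, read) and the choices of fixed 4 KiB blocks and SHA-256 fingerprints are illustrative assumptions. Data is split into blocks, each block is fingerprinted, only blocks with previously unseen fingerprints are written, and a file becomes just a list of fingerprints. Real products often use variable-size segments or chunklets, and some do the work post-process rather than in-line, but the bookkeeping is the same idea.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed 4 KiB blocks; real products often use variable-size chunks


class BlockStore:
    """Toy in-line de-duplicating store: identical blocks are kept only once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 digest -> block bytes (unique blocks only)

    def write(self, data):
        """Split data into blocks, store new blocks, return a 'recipe' of digests."""
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:  # in-line check: only unseen blocks hit storage
                self.blocks[digest] = block
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Reassemble the original data from its recipe of block digests."""
        return b"".join(self.blocks[d] for d in recipe)
```

The gateway's job is essentially this hash lookup plus the index of unique blocks, which is also why it has to sit in the data path for every read and write.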
There are applications where de-duplication is extremely effective, and ones where it isn’t. If data is largely the same, such as multiple backups of the same volume or boot images for virtual servers, de-duplication can provide enormous reductions in the storage space required. However, dynamic data, such as transactional databases or swap files, will show very little reduction in size and may also be sensitive to the latency introduced by de-duplication processing. In the case of databases, though, de-duplication can in fact improve I/O performance and speed up some queries (see "Oracle Database 11g Advanced Compression testbed, methodology, and results").
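To see why largely repetitive data de-dupes so well while dynamic data barely shrinks, here is a quick back-of-the-envelope sketch along the same lines, again assuming fixed 4 KiB blocks and SHA-256 fingerprints (the function name and the sizes are illustrative, not from any product): two full backups of the same volume with a small change between them reduce to roughly half the space, while constantly changing data stays close to its original size.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # same fixed 4 KiB block assumption as above


def dedupe_ratio(data):
    """Ratio of logical size to the size of the unique blocks it contains."""
    digests = {hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
               for i in range(0, len(data), BLOCK_SIZE)}
    unique = len(digests) * BLOCK_SIZE
    return len(data) / unique if unique else 1.0


# Two "full backups" of the same 4 MiB volume, with a 64 KiB change between them
base = os.urandom(4 * 1024 * 1024)
second = base[:1024 * 1024] + os.urandom(64 * 1024) + base[1024 * 1024 + 64 * 1024:]
print("repetitive backups:", round(dedupe_ratio(base + second), 2))  # close to 2x

# Constantly changing data (think swap or transactional pages) barely shrinks
print("random/dynamic data:", round(dedupe_ratio(os.urandom(8 * 1024 * 1024)), 2))  # ~1x
```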
But the biggest issue with de-duplication is that it creates a choke point: All data to be de-duplicated must be stored and retrieved through the de-duplication gateway. This isn't much of an issue with backups or even near-line archives. But for applications where access to the data is critical, or usage is heavy, the gateway becomes a hot spot, requiring redundant gateways, dual-path SAN infrastructure, and redundant storage. Given the investment necessary to support live data, where even short interruptions in access would cause major problems, it is typically cheaper to live with multiple copies.
There’s no question that de-duplication can provide great benefits in specialized applications, including backups, e-mail archives, and other cases where data is largely repetitive, such as VMware boot images. However, a fully de-duplicated enterprise, even if feasible, would require a massive and expensive infrastructure. Given that disk capacity continues to grow by leaps and bounds, scaling out de-duplication will be difficult to justify. It’s cheaper to keep buying more local storage than to put all your eggs in one basket.
