
Data De-Duplication
Top 10 Questions (and Answers) People Ask About Data De-duplication
1. What does the term "data de-duplication" really mean?
There's no industry-standard definition yet, but we're getting close.
Everybody agrees that it's a system for eliminating the need to store
redundant data, and most people limit it to systems that look for duplicate
data at the block level rather than the file level. That's an important
feature. Imagine 20 copies of a presentation that have different title
pages: to a file-level data reduction system they look like 20 completely
different files. Block-level approaches would see the commonality between
them and use much less storage.
The most powerful data de-duplication uses a variable-length block approach.
Products using this approach look at a sequence of data, segment it into
variable-length blocks, and when they see a repeated block, they store a
pointer to the original instead of storing the block again. Since the
pointer takes up less space than the block, you save space. In backup, where
the same blocks show up over and over, users can typically store 10 to 50
times more data than on conventional disk.
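As a rough illustration of the pointer idea above, here is a minimal Python sketch. The DedupStore class, the use of SHA-256 digests as pointers, and the pre-cut 4 KiB blocks are all assumptions made for the example, not a description of how any particular product works. It also ignores the space taken up by the pointers themselves.

```python
import hashlib


class DedupStore:
    """Toy block store: each unique block is kept once, keyed by its digest."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes (each block stored once)

    def write(self, segments):
        """Store a sequence of blocks and return the pointers (digests)."""
        pointers = []
        for block in segments:
            digest = hashlib.sha256(block).digest()
            # Keep the block only the first time it is seen.
            self.blocks.setdefault(digest, block)
            pointers.append(digest)
        return pointers

    def stored_bytes(self):
        """Bytes of unique block data actually held by the store."""
        return sum(len(b) for b in self.blocks.values())


store = DedupStore()
body = [bytes([i]) * 4096 for i in range(10)]  # ten shared 4 KiB blocks
logical = 0
for copy in range(20):
    title = b"title page %d" % copy  # the only part that differs per copy
    segments = [title] + body
    store.write(segments)
    logical += sum(len(s) for s in segments)

print("logical bytes:", logical)
print("stored bytes:", store.stored_bytes())
```

In this toy run, 20 copies that differ only in the title page take roughly the space of one copy plus 20 small title blocks.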
2. How can data de-duplication be applied to replication?
Replication is the process of sending duplicate data from a source to a
target. If you replicate all the backup data, then you need a relatively
high-performance network to get the job done. But with de-duplication, the
source system (the one sending data) looks for duplicate blocks in the
replication stream. If it has already transmitted a block to the target
system, then it doesn't have to transmit it again; it simply sends a
pointer. Since the pointer is much smaller than the block, replication
needs much less network bandwidth.
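The source-side logic described above can be sketched in a few lines of Python. The message format ("ref" vs. "data" tuples), the replicate and wire_bytes helpers, and the 32-byte SHA-256 pointer are hypothetical choices for illustration only; a real replication protocol will look different.

```python
import hashlib


def replicate(blocks, already_sent):
    """Build the messages the source would send for a list of blocks.

    Blocks the target already holds go out as a small ("ref", digest)
    pointer; new blocks go out as ("data", digest, block).
    """
    messages = []
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest in already_sent:
            messages.append(("ref", digest))
        else:
            already_sent.add(digest)
            messages.append(("data", digest, block))
    return messages


def wire_bytes(messages):
    """Approximate bytes on the wire: digest plus payload for new blocks."""
    total = 0
    for message in messages:
        total += len(message[1])
        if message[0] == "data":
            total += len(message[2])
    return total


sent = set()  # digests the target is known to hold
monday = [bytes([i]) * 4096 for i in range(8)]
tuesday = monday[:7] + [b"\xff" * 4096]  # one block changed overnight

first = replicate(monday, sent)
second = replicate(tuesday, sent)
print("first run bytes:", wire_bytes(first))
print("second run bytes:", wire_bytes(second))
```

The second run sends seven pointers and one full block, so it moves only a fraction of the data the first run did.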
3. What applications does data de-duplication work with? Are there any that it doesn't work with?
When it's being used for backup, it supports all applications (email,
databases, print and file applications, etc.) and all qualified backup
packages. Variable-block-length de-duplication can find redundant blocks in
the backup stream for all of them. Certain file types (some rich media
files, for example) don't see much advantage the first time they are sent
through de-duplication because the applications that write the files already
eliminate redundancy. But if those files are backed up multiple times or
backed up after small changes are made, de-duplication can still deliver
substantial capacity savings.
4. Is there any way to tell how much de-duplication advantage I will get with my data?
There are four primary variables: how much the data changes (that is,
how many new blocks get introduced), how well it can compress, what your
backup methodology is (full vs. incremental, for example), and how long you
plan to retain the data. Some vendors (Quantum is one) offer sizing
calculators to estimate the effects.
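To get a feel for how those variables interact, here is a deliberately simplified back-of-the-envelope model in Python. It is not any vendor's sizing calculator. The function name, its parameters, and the model itself are assumptions: every retained backup is a full backup, each backup after the first introduces a fixed fraction of new blocks, and all unique data compresses by the same factor.

```python
def estimate_dedup_ratio(full_size_gb, change_rate, compression, retained_fulls):
    """Estimate logical-to-physical ratio under a simple full-backup model.

    full_size_gb   -- size of one full backup
    change_rate    -- fraction of new blocks in each later backup (0.02 = 2%)
    compression    -- compression factor applied to unique data (2.0 = 2:1)
    retained_fulls -- number of full backups kept on disk
    """
    # What the backup application thinks it has stored.
    logical_gb = full_size_gb * retained_fulls
    # First full is all new; each later full only adds its changed blocks.
    unique_gb = full_size_gb + full_size_gb * change_rate * (retained_fulls - 1)
    # Unique data is compressed before it lands on disk.
    physical_gb = unique_gb / compression
    return logical_gb / physical_gb


# Example: 1 TB fulls, 2% change between backups, 2:1 compression, 12 kept.
ratio = estimate_dedup_ratio(1000, 0.02, 2.0, 12)
print(round(ratio, 1))
```

Under this model, a longer retention period or a lower change rate pushes the ratio up, while keeping only a few backups pulls it down. Incremental backups would need a different model.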
5. What is the real benefit of using data de-duplication?
There are really two. 1) Data de-duplication technology lets you keep more
backup data on disk than with any conventional disk backup system, which
means you can restore more data faster. 2) It makes it practical to use
standard WANs and replication for DR protection, which means users can
reduce their tape handling.
6. What is variable-block length data de-duplication? How do you get variable-length blocks and why would I want them?
It's easiest to think of the alternative. If you divided a stream of data
into fixed-length segments, every time data was inserted or removed at one
point, all the blocks downstream would shift and no longer match what was
stored before. The system of variable-length blocks allows some of the
segments to stretch or shrink, while leaving downstream blocks unchanged.
This increases the ability of the system to find duplicate data segments, so
it saves significantly more space.
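The difference is easy to see in a small experiment. The Python sketch below cuts a variable-length block wherever a checksum of the last 16 bytes hits a chosen value, so block boundaries follow the content rather than the byte offset. The window size, the divisor, and the use of CRC-32 are assumptions picked to keep the example short; production systems typically use faster rolling hashes and their own boundary rules.

```python
import hashlib
import random
import zlib

WINDOW = 16    # bytes examined when deciding on a boundary (assumed value)
DIVISOR = 256  # gives blocks averaging roughly 256 bytes (assumed value)


def variable_chunks(data):
    """Cut a block wherever the checksum of the preceding window matches."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])  # whatever is left is the final block
    return chunks


def fixed_chunks(data, size=256):
    """Cut a block every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(old_chunks, new_chunks):
    """Fraction of the new blocks that were already present in the old set."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(1 for c in new_chunks if hashlib.sha256(c).digest() in known)
    return hits / len(new_chunks)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = original[:100] + b"X" + original[100:]  # insert one byte near the start

fixed = shared_fraction(fixed_chunks(original), fixed_chunks(edited))
variable = shared_fraction(variable_chunks(original), variable_chunks(edited))
print("fixed-length blocks reused:   ", fixed)
print("variable-length blocks reused:", variable)
```

With fixed-length blocks, the one-byte insert shifts every later block, so almost nothing matches. With content-defined boundaries, only the blocks around the insert change and the rest are found again.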
7. If the data is divided into blocks, is it safe? How can it be restored?
The technology for using pointers to reference a sequence of data segments
has been standard in the industry for decades. You use it every day, and it
is safe. Whenever you write a large file to disk, it is stored in blocks on
different disk sectors in an order determined by space availability. When
you "read" a file, you are really reading pointers in the file's metadata,
which point to the various sectors in the right order. Block-based data
de-duplication applies a similar kind of technology. And de-duplication
vendors typically build in a variety of data integrity checks to verify that
the system is sound and the data remains available.
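Here is a minimal Python sketch of restoring from pointers with one kind of integrity check. The store_file and restore_file helpers, the fixed 8-byte blocks, and the SHA-256 comparison are assumptions chosen for illustration; actual products use their own block formats and verification methods.

```python
import hashlib


def store_file(data, block_size, blocks):
    """Split data into blocks, store unique ones, and return the pointer list."""
    recipe = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        blocks.setdefault(digest, block)  # stored only if not already present
        recipe.append(digest)
    return recipe


def restore_file(recipe, blocks):
    """Follow the pointers in order, verifying each block before using it."""
    out = bytearray()
    for digest in recipe:
        block = blocks[digest]
        if hashlib.sha256(block).hexdigest() != digest:
            raise ValueError("block failed integrity check: " + digest)
        out += block
    return bytes(out)


blocks = {}
data = b"ABCDEFGH" * 3 + b"tail"
recipe = store_file(data, 8, blocks)
restored = restore_file(recipe, blocks)
print(len(recipe), "pointers,", len(blocks), "unique blocks")
print("restored matches original:", restored == data)
```

The pointer list plays the same role as a file's metadata on an ordinary file system: it records which blocks to read and in what order.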
8. Where does data de-duplication take place during the backup process?
There are really two choices. You can send all your backup data to a backup
target and perform de-duplication there, or you can perform the
de-duplication on the host during backup. Both systems are available and
both have advantages. If you de-duplicate on the host during backup, you
send less data over your backup connection, but you have to manage software
on all the protected hosts, backup slows down because de-duplication adds
overhead, and it can slow down other applications running on the host
server. If you de-duplicate at the backup target you send more data over the
connection, but you can use any backup software, you only have to manage a
single target, and the performance is normally much higher because the
hardware system is specially built just for de-duplication.
9. Can de-duplication technology be used with tape?
No and yes. Data de-duplication needs random access to data blocks for both
writing and reading, so it needs to be implemented in a disk-based system.
But tape can easily be written from a de-duplication data store, and in fact
that is the norm. Most de-duplication customers plan on keeping a few weeks
or months of backup data on disk, and then use tape for longer-term storage.
When you create a tape from de-duplicated data, the data is re-expanded so
that it can be read directly in a tape drive and will not have to be written
back to a disk system first.
10. What do data de-duplication solutions really cost?
There's a lot of variability, but there is a pretty good rule-of-thumb
starting point. Assuming an average de-duplication advantage of 20:1 (a
number widely used in the industry), we have seen list prices in the range
of $1 per GB of retained backup data. So a system that could retain 20TB of
backup data would have a list price of around $20,000, which is much lower
than if you protected the same data using conventional disk. A note: options
could increase that price, and discounts from resellers or vendors could
reduce it.
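The rule of thumb works out as follows. This short Python calculation assumes decimal units (1TB = 1,000GB) and reads the $1/GB figure as a price per GB of retained backup data. The implied physical capacity and price per physical GB are inferences from those assumptions, not quoted figures.

```python
price_per_retained_gb = 1.00  # rule-of-thumb list price, dollars
dedup_ratio = 20              # assumed 20:1 de-duplication advantage
retained_tb = 20              # backup data the system can retain

retained_gb = retained_tb * 1000
list_price = retained_gb * price_per_retained_gb

# Physical disk implied by the 20:1 assumption.
physical_tb = retained_tb / dedup_ratio
price_per_physical_gb = list_price / (physical_tb * 1000)

print("list price: $%.0f" % list_price)
print("physical capacity: %.1f TB" % physical_tb)
print("price per physical GB: $%.2f" % price_per_physical_gb)
```

A lower real-world ratio than 20:1 raises the effective cost per retained GB proportionally.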
------------------------------------------------------------------------------------------------------------------------------------------------------------
Data Encryption
Over the past few years, data security breaches have cost companies millions of dollars and inflicted significant damage to the corporate images of these firms. With concerns around data security mounting, businesses of all sizes are beginning to integrate encryption into their backup and archive processes. Quantum understands that protecting data at rest and in transit are key challenges facing IT professionals today and has integrated data encryption features into its leading disk and tape solutions.
The cost of a data security breach continues to rise. According to the Ponemon Institute, data security breach incidents now cost companies $197 per compromised record, including lost opportunities and reputation as well as legal, investigative, administrative and customer support expenses. Losses associated with customer churn and acquisition account for 65 percent of data security breach costs.
Encryption can dramatically reduce, if not eliminate, the risk of a data security breach. That’s why a growing number of government and industry regulations call for the encryption of sensitive data. Many states require that companies disclose all data security breaches of non-encrypted data to the media and all customers potentially affected. Specific industry associations are also taking action to drive security standards, such as the Payment Card Industry (PCI) Data Security Standard. This standard mandates the encryption of stored data, including data on backup tapes, and noncompliance can result in monetary penalties ranging from $5,000 to $50,000 per month. Finally, a number of bills before Congress would require companies that store specific types of consumer data to establish security safeguards such as encryption.
IT managers are faced with the challenge of integrating encryption into their backup, recovery, and archive processes. This additional business requirement introduces another technical dimension to an already complex set of processes, leaving users with important questions to resolve:
How will I add encryption without affecting the backup window?
Will this change my backup processes and software environment?
How will I manage the encryption process?
Can I encrypt data being transported between sites, both via replication and on tapes?
Without the right approach and architecture, users will be forced to make painful tradeoffs to achieve data security and may be forced to settle for poor performance, hardware or software dependencies, and complex management.
Quantum understands the issues associated with encryption within backup, recovery, and archive. We offer encryption options for both our disk and tape solutions, giving customers the flexibility to choose what fits best with their business requirements. For disk-based backup, Quantum’s DXi-Series incorporates de-duplication and fully encrypted replication, allowing customers to securely link sites for enterprise-wide backup and disaster recovery. For tape systems, Quantum’s Encryption Key Manager (Q-EKM) is available for Scalar i500 and Scalar i2000 tape libraries in conjunction with LTO-4 drives.
-----------------------------------------------------------------------------------------------------------------------------