Data deduplication is arguably the most important new feature for backup and archival solutions in years. It can save your organization time and money, so it is worth investigating whether it is appropriate for your data environment. Although predominantly employed for backup and archival processes today, data deduplication for primary storage is on the horizon, which underscores the significance of this technology.
Data deduplication has grown in awareness and popularity due to the continuing explosion of data in the business world. Some marketing pundits now refer to the technique as "capacity optimization." Whatever it is called, the overall goal is the same: perform data backups faster while still maintaining data integrity.
As the name implies, data deduplication is the process of removing redundant data. It is most commonly used to make the data backup process more efficient, but it can also benefit data replication. The first point to understand is that deduplication can be implemented in different ways.
Levels of Data Deduplication
File-Level Data Deduplication
As the phrase implies, duplicate files are eliminated and replaced with pointers or links that all direct to the single remaining instance of the file. File-level data deduplication is also referred to as Single Instance Storage.
This is accomplished by comparing target files that are candidates for backup against files that are already archived, referencing the attributes stored in an index. For example, imagine that we want to back up several host servers that all have the same operating system installed. There will be many identical system files (static files that are never modified), which are obvious candidates for this process.
Even in this simplistic example, you can see how the math works to your advantage, especially when the duplicate files are numerous or large. This implementation of data deduplication is typically found as a feature within a backup software application, such as EMC® Avamar®.
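To make the mechanics concrete, here is a minimal Python sketch of file-level deduplication, assuming a simple in-memory index keyed on a whole-file SHA-256 hash. It is illustrative only, not how any particular product such as Avamar works internally.

```python
import hashlib
import os
import shutil

def file_digest(path):
    """Hash the entire file contents; identical files hash identically."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_file_level(paths, store_dir):
    """Copy each unique file into the store once; duplicates become pointers."""
    os.makedirs(store_dir, exist_ok=True)
    index = {}    # digest -> stored copy
    catalog = {}  # original path -> stored copy (the "pointer")
    for path in paths:
        digest = file_digest(path)
        if digest not in index:                 # first time we see this content
            stored = os.path.join(store_dir, digest)
            shutil.copyfile(path, stored)       # store the single instance
            index[digest] = stored
        catalog[path] = index[digest]           # duplicate -> pointer only
    return catalog
```

Every path in the returned catalog resolves to a stored copy, but identical files share one copy; that shared copy is the "single instance" the name refers to.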
Block-Level Data Deduplication
The block level is also referred to as the sub-file level because it operates beneath the file layer and is therefore a more granular approach. In this context, you will hear the parts of a file referred to as blocks, chunks, or segments.
For example, you may have three files that are identical except for one or two data blocks. In contrast to file-level deduplication, where entire files are dealt with, a block-level process achieves much greater space savings because only the unique blocks are stored rather than whole files.
As with the file-level process, an index is consulted to determine whether a block of data has been seen before, typically by comparing a compact fingerprint (hash) computed for each block. The underlying mathematics can get rather technical, and different vendors employ their own techniques depending on the data being evaluated.
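The following sketch illustrates the block-level idea using fixed-size blocks and SHA-256 fingerprints. Real products often use variable-size (content-defined) chunking and vendor-specific indexes, so treat this as a toy model of the technique rather than any vendor's implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks for simplicity; many products use
                   # variable-size (content-defined) chunking instead

def dedupe_blocks(streams, block_size=BLOCK_SIZE):
    """Store each unique block once; return per-stream recipes of block hashes."""
    store = {}    # block hash -> block bytes (the unique blocks)
    recipes = {}  # stream name -> ordered list of block hashes
    for name, data in streams.items():
        recipe = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # only unique blocks are kept
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes

# Three mostly identical "files": only one block differs between them,
# so the store holds the shared block once plus the single unique one.
base = b"A" * 4096 * 3
streams = {
    "file1": base,
    "file2": base[:4096] + b"B" * 4096 + base[8192:],
    "file3": base,
}
store, recipes = dedupe_blocks(streams)
raw = sum(len(d) for d in streams.values())
kept = sum(len(b) for b in store.values())
print(f"raw: {raw} bytes, stored: {kept} bytes")  # 36864 raw vs 8192 stored
```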
Byte-Level Data Deduplication
The byte-level method is even more granular than the block-level method, but with a caveat: the more granular the solution, the more processing power it requires. For this reason, byte-level data deduplication is usually implemented as a purpose-built appliance.
Using this method, the data stream is analyzed and compared byte by byte against previously stored streams. Byte-level products typically deduplicate post-process, meaning your data is first backed up in its native state onto a disk appliance; once it has all arrived, the deduplication pass begins. The appliance needs a sufficient amount of spare disk space to perform this work, but the working space is released once the pass finishes.
Some vendors will actually retain an unmodified full backup so that a full restore can be performed quickly, without the added overhead of rehydrating (un-deduplicating) the data as it is restored.
Proponents of the byte-level methodology argue that disk space is inexpensive, so we should leverage that fact and gain speed by not deduplicating in-line (on the fly), which adds significant processing overhead during the backup itself. This is not a trivial point: if backing up as quickly as possible is your primary goal, you should definitely consider a purpose-built appliance that post-processes your data.
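As a toy illustration of the byte-level comparison (not any vendor's actual algorithm), the sketch below diffs a new stream byte by byte against a previously stored reference and keeps only the ranges that differ. In a post-processing appliance, a pass like this would run after the raw backup has landed on disk.

```python
def byte_delta(reference, new):
    """Compare two streams byte by byte; keep only the ranges that differ."""
    delta, i, n = [], 0, min(len(reference), len(new))
    while i < n:
        if new[i] != reference[i]:
            start = i
            while i < n and new[i] != reference[i]:
                i += 1
            delta.append((start, new[start:i]))  # (offset, differing bytes)
        else:
            i += 1
    if len(new) > n:                  # the new stream grew past the reference
        delta.append((n, new[n:]))
    return delta

def apply_delta(reference, delta, length):
    """Rebuild the new stream from the reference plus the stored differences."""
    out = bytearray(reference[:length].ljust(length, b"\x00"))
    for offset, data in delta:
        out[offset:offset + len(data)] = data
    return bytes(out)

ref = b"the quick brown fox"
new = b"the quick red   fox"
d = byte_delta(ref, new)              # only the changed bytes are retained
assert apply_delta(ref, d, len(new)) == new
```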
In-Line Versus Post-Processing Deduplication
As stated above, if the main objective is to make your data backups as fast as possible, then post-processing is what you should consider. Conversely, if you are more concerned about conserving space on your backup disk target, then in-line is the better choice, since duplicate data never lands on the target in the first place.
If you are considering the replication of data from one location to another over a WAN link, in-line processing may likewise be the better choice, because deduplicating before transmission reduces the amount of data sent over the wire. If your data consists mostly of static files that are rarely (if ever) modified, a file-level solution may meet your requirements and would probably cost less than either a block-level or byte-level appliance.
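The schematic below contrasts the two modes on the same stream of chunks, with in-memory dictionaries standing in for the disk target and staging area. The function names and data structures are hypothetical, chosen only to show where the hash lookup happens in each mode.

```python
import hashlib

def inline_backup(chunks, target):
    """In-line: deduplicate before anything lands on the backup target."""
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in target:      # hash lookup sits in the write path, so
            target[digest] = chunk    # ingest is slower but never stores a duplicate
    return target

def post_process_backup(chunks, staging, target):
    """Post-process: land raw data fast, then deduplicate in a second pass."""
    staging.extend(chunks)            # raw ingest at full speed (needs staging space)
    for chunk in staging:             # background pass after the backup completes
        digest = hashlib.sha256(chunk).hexdigest()
        target.setdefault(digest, chunk)
    staging.clear()                   # working space is released when done
    return target

chunks = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
print(len(inline_backup(chunks, {})))            # 3 unique chunks stored
print(len(post_process_backup(chunks, [], {})))  # same end state, different timing
```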
In any case, as a best practice you should clean up your data first; there is no sense in processing data that truly does not need to be saved, regardless of the chosen backup methodology.
Ready to get started? Here is some general advice for the data deduplication process:
1) Evaluate and clean up your data first.
Your first step should be to evaluate your data. Does it consist mostly of common Microsoft Office files? That type of data is a good candidate for a deduplication solution.
Or does it consist mostly of video, audio, image, imaging-database, and encrypted files? These types of files do not deduplicate well (or at all, in the case of encrypted files), because such content is typically already compressed or, when encrypted, effectively random.
Cleaning up your data usually requires setting one or more policies regarding the data itself. For example, a company may have a policy that any files with an .mp3 extension are automatically deleted and therefore not retained. A data retention policy can weigh a multitude of criteria, used alone or in combination: file size, file type, date of last access, file extension, archive status, and more (a simple filter along these lines is sketched below). If this sorting out of data can be accomplished prior to backup or replication, the time required for the process will decrease accordingly.
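As a sketch of such a policy filter, the Python below screens files against a few hypothetical criteria: excluded extensions, a size ceiling, and a last-access cutoff. The thresholds are made up for illustration; real values would come from your own retention policy.

```python
import os
import time

# Hypothetical policy values, for illustration only.
EXCLUDED_EXTENSIONS = {".mp3", ".tmp"}
MAX_SIZE_BYTES = 10 * 1024 ** 3   # skip files larger than 10 GB
MAX_IDLE_DAYS = 365 * 3           # skip files untouched for three years

def should_back_up(path):
    """Apply the retention policy before the file ever reaches the backup job."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXCLUDED_EXTENSIONS:
        return False                            # excluded by file type
    st = os.stat(path)
    if st.st_size > MAX_SIZE_BYTES:
        return False                            # excluded by size
    idle_days = (time.time() - st.st_atime) / 86400
    if idle_days > MAX_IDLE_DAYS:
        return False                            # excluded by last access date
    return True
```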
2) Don't overcomplicate things.
Sometimes people think that the more stuff they buy, the better off they are. You may be tempted to compress your data before sending it to the deduplication appliance, but this is often a bad idea: the appliance needs to "see" the data in an unaltered state, and most appliances already include some form of compression of their own. Try to avoid layering solutions on top of one another unless you are absolutely certain they are compatible.
3) Ask for a demo.
Vendors that are confident their solution suits your data environment should be willing to provide a demo system for your evaluation, and to assist with its installation and configuration. You should also test more than one solution to get a complete picture of your options, regardless of how impressed you may be with the first one you evaluate.