
Deduplication software

Using deduplication resources effectively

Please note that this post relates to the old, deprecated ©XSIBackup-Classic software. Some of the facts herein may still apply to more recent versions, though.

For new installations please use the new ©XSIBackup, which is far more advanced than ©XSIBackup-Classic.

Deduplication series:

... (continues) to act as your NAS or iSCSI device. You already know that deploying a ZFS file system on your server and storing the VMs there won't do the job, at least when it comes to feeding the voracity of a growing, dynamic department or SME with lots of users competing for your limited resources.

Using SSDs instead of regular spinning HDs can improve performance, but it won't change the heart of the matter. So, what can be done to take advantage of deduplication?

IT and computer science have their own particularities, but in the end any thinking head is ruled by the same basic principles that let a human realize, 50,000 years ago, that "two trees" and "two cows" had some obscure point in common. Dissect your problem and divide it into parts that have something relevant in common.

If you host those 12 VMs, what you really have is a bunch of files, regardless of their function. Some of those files will be system files that will not change over time and will be read and loaded into memory from time to time; others will belong to programs, and those will be loaded into your VMs' memory a number of times per day; lastly, you'll have user files that will be hit randomly, being opened, modified or created.

Obviously every system is different; your VMs may be a mix of database, application, web or e-mail servers. You will have to do your own dissection and analysis depending on the nature of your system, but the steps to take are the same. For the sake of clarity, I'll assume three of them are database servers and the rest are file servers.

So far we have some clear ideas that we can start to use. A big proportion of our files will be system files: approximately 20 GB × 7 Windows servers + 4 GB × 5 Linux servers ≈ 160 GB. If we deduplicate those blocks, we'll end up with a system that only needs 25 to 35 GB to store all of our servers.
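If you want to gauge those savings before committing to anything, ZFS can simulate deduplication on existing data without touching it. A minimal sketch, assuming a hypothetical pool named "tank" that already holds copies of the VM system disks:

    # Read-only simulation: build a dedup table in memory and report
    # the ratio we would get if dedup were switched on for this pool.
    zdb -S tank
    # The summary line shows the projected ratio, along the lines of:
    #   dedup = 6.80, compress = 1.35, copies = 1.00, ...

A ratio well above 2x on the system-file pool confirms the reasoning above.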

A small SSD could handle all the VMs if we used deduplication. In our case the application files will correspond to the three database servers. If all of them run SQL Server, then we will only need space to host one copy of its files, as the rest will be deduplicated, so we should add a few gigabytes to our sum. If we were using MySQL, a few MB per instance would suffice.

As the database servers will typically be loaded once a day (supposing we do a nightly reboot), we can fit them alongside the rest of the system files and, in this particular case, logically blend the system files with the application files; this will work most of the time.

Great! I'll jot down that we need a decent 128 GB SSD installed in our example server. What do we have left? All the user files. We can expect a high degree of randomness in this type of file: only a few blocks will be shared, and all we could achieve would be savings of a few percentage points. Thus it's not worth placing these files on a deduplicated file system, as the CPU and RAM it would take just to compare hashes would far outweigh the effective benefit. In other words, it'd be like shooting yourself in the foot for fun.
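To put a rough figure on that RAM cost: ZFS, for instance, keeps an in-memory table entry for every unique block, and a commonly cited rule of thumb is around 320 bytes per entry. A back-of-the-envelope sketch for 1 TB of user files (the 320-byte figure and the default 128 KiB recordsize are assumptions, not measurements):

    # Unique blocks in 1 TiB at a 128 KiB block size:
    echo $(( (1024 ** 4) / (128 * 1024) ))    # 8388608 blocks
    # RAM needed to index them at ~320 bytes per dedup-table entry:
    echo $(( 8388608 * 320 / (1024 ** 2) ))   # ~2560 MiB

Some 2.5 GB of RAM spent indexing blocks that barely deduplicate is exactly the kind of cost those few saved percentage points will never pay back.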

So, what do we do with the user files? In our example we have them distributed across 9 file servers. Those file servers could all share a common storage device or each have their own. That matters in terms of organizing the data, but not in terms of what kind of storage to use.

The ideal thing would be to run a statistical analysis to find out what kinds of files we have and how they are distributed. But in most cases we can assume user files are not a good subject for deduplication, yet can, on the other hand, be compressed at high ratios (mainly because they will be documents, spreadsheets, databases, presentations, and so on).

They may also contain pictures, videos and/or audio files; depending on their number, proportion, resolution, etc., these will be the pièce de résistance of your user files. In any case, there is nothing you can do about them, except trying to automatically adjust their resolution, in the case of photos, and only if that resolution is downgradable from a business perspective.

Compressing a file system is a lot lighter on hardware resources than deduplicating it, so we'll use compression instead of deduplication for our user files.
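On a file system with transparent compression, turning it on is a one-liner. A minimal ZFS sketch, assuming a hypothetical dataset named "userdata" sitting on the big HD:

    # Enable the lightweight lz4 algorithm on the user-file dataset;
    # only blocks written from this point on are compressed.
    zfs set compression=lz4 userdata
    # Once data is in place, check the ratio actually achieved:
    zfs get compressratio userdata

lz4 is cheap enough to be practically free on any modern CPU, which is the whole point of preferring compression over deduplication here.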

OK, we have everything we need to start building the system. We have our "not so powerful" server and a 128 GB SSD, and let's arbitrarily assume we have 1 TB of user files that can be compressed to 20% of their size; that is, we will need 20 MB for every 100 MB of real data, so the 1 TB will fit in roughly 200 GB. We still need to add a big hard disk; I'll choose a regular 2 TB HD with a decent amount of cache, which leaves plenty of room to grow.

When choosing how to build the system we have many options available, especially if we take into account the number of possible combinations. Anyway, to shed some light on the matter and simplify our lives, I will appeal to that well-known American saying: "take it easy".

What I mean by that is that the simpler your solution is (without sacrificing performance), the happier your daily life will be. A Windows Server 2012 license could do the job very well, as it can deduplicate the 128 GB SSD as one volume and compress the second 2 TB HD. You would only need to install this OS directly on your small server and share both volumes through NFS or iSCSI, to be later attached to your ©ESXi box, where you would have one 128 GB deduplicated datastore to host the VMs and one 2 TB compressed datastore to host the data.

If you don't have the budget for a Windows Server 2012 license, you can achieve the same results with a Linux OS: you can rely on OpenDedup or ZFS to deduplicate the 128 GB SSD and apply compression to the 2 TB HD.
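As an illustration of the Linux route with ZFS, here is a minimal sketch; the device path /dev/sdb and the pool name "vmstore" are hypothetical (OpenDedup ships its own, analogous tooling):

    # Create a pool on the 128 GB SSD, switch deduplication on,
    # and export it over NFS so the ©ESXi host can mount it as a datastore:
    zpool create vmstore /dev/sdb
    zfs set dedup=on vmstore
    zfs set sharenfs=on vmstore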

In this case you could format the big HD with a regular FS like EXT4, still taking advantage of LVM, as ZFS would only be interesting for deduplicating data. You are now probably wondering why I suggest using ZFS at all: didn't I say before that it's too demanding for our small server?

Well, deduplicating just system files is not the same as deduplicating user files too. As we saw before, system files are read and written a lot less often than user files, so the deduplication overhead is paid far less frequently.
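To round off the big disk, here is a minimal sketch of the EXT4-over-LVM layout suggested above; the device path /dev/sdc and the volume names are hypothetical:

    # Put the 2 TB HD under LVM so the volume can grow later,
    # then format it with EXT4:
    pvcreate /dev/sdc
    vgcreate vg_user /dev/sdc
    lvcreate -n lv_files -l 100%FREE vg_user
    mkfs.ext4 /dev/vg_user/lv_files

LVM buys you the flexibility to extend the user-file volume with another disk later without rebuilding anything.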