Last updated on Monday 28th of February 2022 08:52:48 PM

Linux lightweight deduplication appliance

Deploying the appliance

Please note that this post refers to ©XSIBackup-Classic, which is old, deprecated software. Some of the facts it contains may still apply to more recent versions, though.

For new installations, please use the new ©XSIBackup, which is far more advanced than ©XSIBackup-Classic.


As we have seen in the previous chapters of this series on deduplication, a number of Linux and Windows systems can be used to take advantage of block-level deduplication. In this post we'll see how to build a deduplication appliance that can serve as a datastore for our daily backups with XSIBackup. With such an appliance we could store tens or even hundreds of backups in the same space where we can now fit only a few backup folders. Such a device would also let us move back in time through our set of backups to recover, for instance, a file lost three months ago.

If you read the first post, you know we mentioned some comments from Linus Torvalds in which he declared his skepticism about filesystems built on top of FUSE, mainly because they run in userspace. In our case we only need a storage FS where we can keep our daily backups, so that we can later browse through them and pick whatever we want to restore. So we don't need much: just the ability to read and write to our system at decent rates and to restore data effectively. We do not care if we can't connect 100 users reading and writing at the same time; all we need is an efficient deduplicated storage device that works with modest resources.

The set of tools that we have chosen (after thorough testing) is:

- CentOS 6.7 as the base O.S.
- Lessfs as the deduplicated filesystem.
- NFS as the transport protocol.

In this chapter of the tutorial we'll start with a clean CentOS 6.7 installation from a minimal installation CD. We must make sure that we install only what we need for our appliance. All unnecessary devices, such as floppy, USB and audio, should also be removed from the VM. For this post we have used a 1 TB hard disk with the default partition layout and an ext4 FS.

Minimum hardware required

We'll use PuTTY to connect to our newly installed CentOS 6.7 O.S. Once the OS is installed, we'll remove all unnecessary software and services, starting with SELinux:

# vi /etc/selinux/config => set it to => SELINUX=disabled
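If you prefer not to open an editor, the same change can be scripted. The following is only a sketch: set_selinux_disabled is a helper name we made up, and it assumes GNU sed (the one CentOS ships) and the usual SELINUX=... line in the config file.

```shell
# Hypothetical helper: rewrite the SELINUX= line of a config file in place.
# The path defaults to /etc/selinux/config, but any file can be passed
# as the first argument to try it out on a copy first.
set_selinux_disabled() {
    selinux_cfg="${1:-/etc/selinux/config}"
    # Replace whatever mode is set (enforcing, permissive) with "disabled".
    # The pattern does not touch the SELINUXTYPE= line.
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$selinux_cfg"
}
```

Calling set_selinux_disabled with no arguments as root edits the real file. The new mode is only read at boot, so it takes effect after a restart; in the meantime, setenforce 0 switches the running system to permissive mode.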

Next, we'll remove all unneeded services:

- crond
- iptables
- ip6tables
- iscsi
- iscsid
- mdmonitor
- multipathd
- netconsole
- postfix
- rdisc
- restorecond
- rsyslog
- saslauthd

CentOS services to be removed

It's up to you to decide whether you want to keep some of these services, like the iptables firewall, the iSCSI daemons, etc. In any case, we just want a lightweight storage appliance for our example, so we'll remove anything that doesn't directly serve our purpose. You can use the code below to disable the services listed above. Note that some are removed from chkconfig altogether and others are just switched off, depending on whether they could be useful later on.

chkconfig crond off; \
service crond stop; \
chkconfig --del ip6tables; \
service ip6tables stop; \
chkconfig iptables off; \
service iptables stop; \
chkconfig --del iscsi; \
service iscsi stop; \
chkconfig --del iscsid; \
service iscsid stop; \
chkconfig --del mdmonitor; \
service mdmonitor stop; \
chkconfig --del multipathd; \
service multipathd stop; \
chkconfig --del netconsole; \
service netconsole stop; \
chkconfig postfix off; \
service postfix stop; \
chkconfig --del rdisc; \
service rdisc stop; \
chkconfig --del restorecond; \
service restorecond stop; \
chkconfig --del rsyslog; \
service rsyslog stop; \
chkconfig --del saslauthd; \
service saslauthd stop

After copying and pasting the above code into your PuTTY window, you'll end up with a minimal set of services running in your CentOS 6.7 install.
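The same list can also be written as a loop, which makes it easier to review before touching the system. This is a sketch rather than the script we actually used: run, trim_services, DRY_RUN, DISABLE_ONLY and UNREGISTER are names of our own, and by default the commands are only printed.

```shell
# Services that are switched off but stay registered with chkconfig.
DISABLE_ONLY="crond iptables postfix"
# Services that are unregistered from chkconfig altogether.
UNREGISTER="ip6tables iscsi iscsid mdmonitor multipathd netconsole rdisc restorecond rsyslog saslauthd"

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Emit (or execute) the same chkconfig/service pairs as the list above.
trim_services() {
    for svc in $DISABLE_ONLY; do
        run chkconfig "$svc" off
        run service "$svc" stop
    done
    for svc in $UNREGISTER; do
        run chkconfig --del "$svc"
        run service "$svc" stop
    done
}
```

Run trim_services once to read through what it would do, then DRY_RUN=0 trim_services as root to apply it.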

CentOS memory consumption: the top command showing memory usage below 100 MB.
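If you'd rather check that figure from a script than from top, something like the following could work. It is only a sketch: used_mb is a name we made up, and it approximates used memory as MemTotal minus MemFree, Buffers and Cached from /proc/meminfo, which may differ slightly from what top reports.

```shell
# Hypothetical helper: print an estimate of used memory in MB.
# Reads /proc/meminfo by default; another file can be passed for testing.
used_mb() {
    meminfo_file="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }
        # Values in /proc/meminfo are in kB, so divide by 1024 for MB.
        END { printf "%d\n", (total - free - buffers - cached) / 1024 }
    ' "$meminfo_file"
}
```

On the appliance, used_mb with no arguments should print a number in the same range as the one top shows.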

Great! Now we have the base for our system, but we still need to install some software:

- First we will install open-vm-tools, a package that provides the same functionality as VMware Tools. It is open source and well tested, so don't worry.

- We will also need NFS-Ganesha, an NFS server that runs in userspace, which will ensure compatibility with our userspace filesystem, Lessfs.

- And also mhash, a library that will provide Lessfs with different hashing algorithms.

- And wget, a neat small tool for downloading files.

To install open-vm-tools and NFS-Ganesha we first need to add the EPEL repo to our CentOS server, so:

# yum install epel-release && \
yum install open-vm-tools nfs-ganesha mhash wget

On top of that, we must install the FUSE libraries and also TokyoCabinet. Although we will not be using it as our database system, Lessfs needs it as a dependency. So...

yum install tokyocabinet fuse-libs
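Before going on, it may be worth confirming that every package actually got installed. The helper below is a sketch under our own assumptions: missing_packages and the PKG_QUERY override are made-up names, and the check simply relies on rpm -q returning a non-zero status for packages that are not installed.

```shell
# Hypothetical helper: print the name of every package that is NOT installed.
# PKG_QUERY only exists so the function can be tried without rpm;
# it defaults to "rpm -q".
missing_packages() {
    for pkg in "$@"; do
        ${PKG_QUERY:-rpm -q} "$pkg" > /dev/null 2>&1 || echo "$pkg"
    done
}
```

For the packages in this post the call would be: missing_packages open-vm-tools nfs-ganesha mhash wget tokyocabinet fuse-libs. No output means nothing is missing.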

Now we can start with the specific software and services that will provide us with the deduplication functionality. We will use Lessfs 1.7.0 available here for download:

Lessfs is a FUSE FS. It is probably not the most award-winning deduplication FS out there, and maybe, as Linus Torvalds commented, it is a "toy" in comparison to well-known filesystems built from the ground up like ext(2|3|4), NTFS, ReiserFS, etc. But it does what we need, it does it reasonably well, and it does it with a limited set of resources. So we'll use it because it satisfies our needs as the base FS for a deduplicated backup device. We don't care if it does not support many concurrent users or if we need to tweak the startup script ourselves.

Apart from the Lessfs binaries, we will need a key/value database to store all the deduplicated blocks and their hashes. Lessfs can use various databases: Tokyo Cabinet, HamsterDB or BerkeleyDB. We will use the last one; it's not the fastest, but it is the safest. Tokyo Cabinet is really fast, but it's not very reliable in case of power outages. You can read more details about these facts here:

To be tidy, we would need two CentOS 6.7 installations: one with all the development tools installed, to compile the needed software, and the other to be used as our production OS. For the sake of brevity I'll provide the compiled Lessfs binaries and the startup script for CentOS 6.7. You can use the commands below to download them.

wget -O /usr/local/bin/lessfs && chmod 0755 /usr/local/bin/lessfs
wget -O /etc/init.d/lessfs && chmod 0755 /etc/init.d/lessfs
wget -O /etc/lessfs.cfg && chmod 0700 /etc/lessfs.cfg
wget -O /usr/local/sbin/mklessfs && chmod 0755 /usr/local/sbin/mklessfs

The last four commands will install the Lessfs binaries and the service startup script in their final locations, so don't worry, they are already where they need to be. Now we need the BerkeleyDB binaries. We'll be using version 4.8. The following command will download the BerkeleyDB binary compiled for CentOS 6.7.

wget -O /lib64/ && chmod 0755 /lib64/
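Once the downloads have finished, a quick check that the four Lessfs files ended up where we expect them can save some head scratching later. This is a sketch: check_files and LESSFS_FILES are names of our own, the paths are the ones used in the wget commands above, and the test for empty files is there because wget -O creates the output file even when the download fails.

```shell
# Paths taken from the wget commands above.
LESSFS_FILES="/usr/local/bin/lessfs /etc/init.d/lessfs /etc/lessfs.cfg /usr/local/sbin/mklessfs"

# Hypothetical helper: report every given path that is missing or empty.
# Prints nothing when all the files are in place.
check_files() {
    for f in "$@"; do
        [ -s "$f" ] || echo "missing or empty: $f"
    done
}
```

On the appliance you would run check_files $LESSFS_FILES (unquoted, so the list is split into separate paths).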

We must make sure that FUSE is loaded on startup, so we add this line to /etc/rc.d/rc.local:

echo "modprobe fuse > /dev/null 2>&1" >> /etc/rc.d/rc.local
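Keep in mind that running the echo above twice would add the line twice. A small helper could append it only when it is not there yet. This is a sketch, and ensure_line is a name we made up:

```shell
# Hypothetical helper: append a line to a file only if the file
# does not already contain exactly that line.
ensure_line() {
    wanted_line="$1"
    target_file="$2"
    # -x matches whole lines, -F treats the text as a fixed string,
    # -q keeps grep quiet; a missing file simply counts as "not found".
    grep -qxF -- "$wanted_line" "$target_file" 2>/dev/null \
        || echo "$wanted_line" >> "$target_file"
}
```

For our case the call would be: ensure_line 'modprobe fuse > /dev/null 2>&1' /etc/rc.d/rc.local. After running modprobe fuse by hand, lsmod | grep fuse should show whether the module is loaded.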

And that's it! We have all the packages installed and we are ready to learn how to use our datastore.