XSITools: block level deduplication over VMFS

XSITools is introduced as a new feature in XSIBACKUP-PRO 9.0. It will initially offer block de-duplication over VMFS and any other capable file system, such as ext3, ext4, XFS, NTFS or FAT32, that allows storing some tens of thousands of files per volume. As it evolves, it will also offer block synchronization for off-site backups.
VMFS 5 storage limits and XSITools
According to VMware's documentation: https://www.vmware.com/pdf/vsphere5/r55/vsphere-55-configuration-maximums.pdf
A VMFS5 volume can store up to 130,690 files; that number includes directories too. This means that, making some quick figures, if you have 130,000 free inodes for block storage and each block is 50 MB big, you could host up to 6.5 TB of real data in chunks. That could easily correspond to 30 to 70 TB of deduplicated data, which is more than enough to consider XSITools repositories a serious backup storage engine. You can obviously have one or more XSITools repos per volume, so you can easily grow your backup space by adding more disks or by splitting disks into smaller volumes.
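As a quick sanity check on those figures, here is the arithmetic as a sketch; the 130,000-inode budget is taken from the estimate above, and the deduplication ratios are illustrative, not XSITools internals:

```python
# Back-of-the-envelope capacity for an XSITools repo on a VMFS5 volume,
# assuming ~130,000 inodes are left free for 50 MB chunk files.
FREE_INODES = 130_000   # assumed chunk-file budget per volume
BLOCK_MB = 50           # default XSITools block size

raw_tb = FREE_INODES * BLOCK_MB / 1_000_000   # MB -> TB (decimal units)
print(f"Raw chunk capacity: {raw_tb:.1f} TB")  # → 6.5 TB

# Illustrative deduplication ratios spanning the 30-70 TB range:
for ratio in (5, 10):
    print(f"At {ratio}x dedup: {raw_tb * ratio:.1f} TB of logical data")
```

A 5x to 10x ratio over 6.5 TB of raw chunks lands in the 32.5 to 65 TB range, consistent with the 30 to 70 TB estimate above.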
XSITools uses block-level de-duplication with a big block size: 50 MB by default. We may offer adjustable granularity in the future to better control the outward flow of off-site backup data blocks. Some of you who are already familiar with de-duplication at the file-system layer might be thinking: "these guys have gone crazy; how come they use a 50 MB block size to de-duplicate data?" Well, the answer is simple: because it's more efficient. It's not more efficient if minimizing disk space through de-duplication is your only goal. It is more efficient when pursuing a wider goal: backing up a fixed set of production virtual machines.
Block de-duplication is a must when recursively backing up virtual disks, but using a small block size makes no sense in this context. De-duplication offers the big advantage of storing each data chunk only once; it is a simple and brilliant idea from a conceptual point of view. It has only one disadvantage: it is resource hungry. It is so CPU and memory intensive that, when it comes to backing up terabyte-sized files, you might end up needing extremely powerful (and therefore expensive) servers to handle billions of tiny chunks of data.
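A hash-indexed chunk store of this kind can be sketched in a few lines. This is a simplified illustration, not the actual XSITools on-disk format; the block size is a parameter so the same idea can be exercised on small files:

```python
import hashlib
import os

BLOCK_SIZE = 50 * 1024 * 1024  # XSITools' default 50 MB granularity

def backup_file(path, repo_dir, block_size=BLOCK_SIZE):
    """Split a file into fixed-size blocks and store each distinct block
    once, named after its SHA-1 digest. Returns the ordered digest list
    (the 'manifest') needed to reassemble the file later."""
    os.makedirs(repo_dir, exist_ok=True)
    manifest = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digest = hashlib.sha1(block).hexdigest()
            chunk = os.path.join(repo_dir, digest)
            if not os.path.exists(chunk):   # de-duplication happens here
                with open(chunk, "wb") as out:
                    out.write(block)
            manifest.append(digest)
    return manifest
```

Each subsequent backup cycle only writes chunks whose digest is not already in the repository, which is why unchanged regions of a virtual disk cost nothing but a hash computation.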
A fixed set of virtual machines in production constitutes a set of big files that will undergo only limited changes from one backup cycle to the next. It is database and user files that change from one day to the next. Still, most of those files will remain unchanged, as will the vast majority of the OS's constituent parts, which will likewise remain aligned with the previous day's version in terms of data structure.
An engineer's task is not to build perfect solutions; perfection is a job for mathematicians. Engineers must offer reasonable and practical solutions to everyday problems. Thus, as a software designer, I don't mind copying 50 MB of data when only one byte has changed in the extent if, as a counterpart, I am saving 99% of the CPU time and sparing memory a huge data storm that would freeze my server. There is a thin line that marks the optimum trade-off path we must follow to accomplish our task with the minimum set of resources. As mentioned throughout my posts, there must always be a philosophy behind every complex tool; as stated before, I'll follow the trader's maxim: "let others win the last dollar".
So, to summarize: XSITools uses a big block size because the resources needed to copy 50 MB are far smaller than those required to de-duplicate 12,800 4 KB blocks, and the disadvantage of a big block size is an increase in the real space used by deduplicated data. Still, a big space-saving ratio is achieved while enjoying the benefits of a lightweight deduplication engine.
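The index-size arithmetic behind that claim is easy to verify; the 1 TB disk size here is an illustrative figure:

```python
# Number of index entries needed for a 1 TB virtual disk at the two
# deduplication granularities compared above.
DISK_MB = 1024 * 1024                # 1 TB expressed in MB
small = DISK_MB * 1024 // 4          # 4 KB blocks
large = DISK_MB // 50                # 50 MB blocks (rounded down)

print(f"4 KB blocks to index:  {small:,}")   # → 268,435,456
print(f"50 MB blocks to index: {large:,}")   # → 20,971
print(f"Roughly {small // large:,}x fewer entries with 50 MB blocks")
```

Hashing, looking up and tracking twenty thousand entries per terabyte is trivial for any host; a quarter of a billion entries is not.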
The above holds in a local scenario: local storage pools or a NAS on a gigabit or 10 Gbit LAN. When we need to replicate data across "not so wide" networks, like WANs, things start to change. In that case the most valuable resource may be our Internet bandwidth, and under those new circumstances we might do well to "win" even the last dollar available.
We will be offering block-based replication shortly, planned for v. 9.2.0. When that time comes, we will probably offer a way to tweak granularity, so that when a 20 KB file changes, a 50 MB block is not sent over the network. A good balance would probably be a 10 MB block size. I'll keep this post updated as new functionalities are added...
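To see why granularity matters over a WAN, consider the time to ship one changed block at different block sizes. The 10 Mbit/s uplink is a hypothetical figure for illustration, not an XSITools benchmark:

```python
# Time to replicate a single changed block over an assumed 10 Mbit/s uplink.
UPLINK_MBIT = 10  # hypothetical WAN upload speed

for block_mb in (50, 10, 1):
    seconds = block_mb * 8 / UPLINK_MBIT   # MB -> Mbit, then divide by rate
    print(f"{block_mb:>2} MB block: {seconds:.1f} s per changed block")
```

A one-byte change inside a 50 MB block costs 40 seconds of uplink time in this scenario, versus 8 seconds at 10 MB granularity, which is why a smaller block size pays off for off-site replication even though it costs more CPU and memory locally.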