33HOPS, IT Consultants
33HOPS ::: IT Solutions Providers :: Madrid: +34 91 930 98 66
Avda. Castilla la Mancha, 95 - local posterior - 28700 S.S. de los Reyes - MADRID
33HOPS, Sistemas de Informacion y Redes, S.L.

©XSIBackup Datacenter: pruning old backups

Pruning deduplicated repositories

2020-10-22. We have rewritten this post to reflect how the new pruning mechanism works since version [...]. Up to this date, pruning was slow and probably one of the weakest points in ©XSIBackup-DC, still pending a thorough revision.

As said, pruning still lacked a fast algorithm capable of handling big repositories: its dependence on the sort binary led to out-of-memory errors when large amounts of data were involved.

As of version [...], a fully native sorting and block detection algorithm has been developed, so working out which blocks can be pruned when deleting a repo folder is much, much faster.

    What is pruning and how it works

    A deduplicated repository consists of a set of blocks which, ordered in different ways, produce different VMs at different points in time.

    When we decide to prune a repository folder, namely an ordered set of blocks, what we need to achieve is deleting all the blocks that belong exclusively to that particular set. In the first image in this post, those blocks are represented by the portions in green. All other blocks, which are shared with some other backup, must be kept to preserve the integrity of the other sets.

    That's a conceptually simple task. Nonetheless, things get tough when you have to identify which blocks can be deleted among some millions, as deciding whether a given block should be pruned implies looking for it in the general block set or, to be more precise, in the general block set minus the part being pruned.

    Prune blocks exclusive to one backup set
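
    The lookup described above can be sketched in C as follows. This is only an illustration of the idea, not ©XSIBackup-DC's actual code: block identifiers are assumed to be 64-bit integers for brevity, and the function name exclusive_blocks() is hypothetical. The IDs of the blocks that must be kept are sorted once, and every block of the set being pruned is then looked up with a binary search.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison of two block IDs, usable by qsort() and bsearch(). */
static int cmp_block_id(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: copies into `out` every ID of `pruned` that does not
 * appear in `kept` (the blocks referenced by all the other backups).
 * `kept` is sorted in place so that each lookup is a binary search.
 * `out` must have room for n_pruned entries. An ID that is repeated in
 * `pruned` is written once per occurrence.
 * Returns the number of IDs written to `out`.
 */
size_t exclusive_blocks(const uint64_t *pruned, size_t n_pruned,
                        uint64_t *kept, size_t n_kept,
                        uint64_t *out)
{
    assert(pruned != NULL && kept != NULL && out != NULL);

    qsort(kept, n_kept, sizeof *kept, cmp_block_id);

    size_t n_out = 0;
    for (size_t i = 0; i < n_pruned; i++) {
        /* Not found among the kept blocks: safe to delete. */
        if (bsearch(&pruned[i], kept, n_kept, sizeof *kept, cmp_block_id) == NULL)
            out[n_out++] = pruned[i];
    }
    return n_out;
}
```

    With a pruned set {5, 1, 9, 3} and a kept set {7, 3, 5, 2}, only blocks 1 and 9 are exclusive to the pruned set, so only those two would be deleted.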

    What's new about pruning

    Up until version [...], pruning depended on the classic Linux sort binary. This worked well, but it limited pruning capacity to that of sort. The sort binary in ©ESXi is BusyBox's, which has a somewhat low capacity and would cause the prune process to run out of memory at relatively low figures. Still, you could manage to prune repositories hosting many terabytes of real data and up to millions of blocks.

    Some users would solve this by using a bigger block size, which is fine. Still, we needed to improve this operation and make it capable of managing a virtually unlimited amount of data.

    The former implementation not only depended on sort, it also used a small index on the data being sought to minimize memory usage. Nonetheless, memory and CPU cycles are in a constant trade-off, and using small indexes implies a greater number of CPU cycles for each search.

    As ©XSIBackup-DC aims to become as versatile as possible and compatible with big data sets, we have redesigned pruning to be faster and to manage terabyte sets easily.

    So now it uses the native C function qsort() plus a single-pass duplicate removal function. With regard to indexing, we have created an adaptive depth index mechanism, such that the depth of the index varies depending on the number of blocks to manage.
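
    A minimal sketch of both ideas follows. It is an assumption about how such code might look, not the product's implementation. sort_unique() combines qsort() with a single compaction pass over 64-bit block IDs (the ID type is chosen for brevity). index_depth() is one possible reading of "adaptive depth": it picks how many leading hexadecimal digits of a block hash to index, so that buckets stay below a target size. Both function names, the bucket target and the cap of 8 digits are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison of two 64-bit block IDs for qsort(). */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * Sorts `ids` in place and compacts the array so that every value appears
 * only once. Duplicates are adjacent after sorting, so one pass is enough.
 * Returns the number of unique IDs left at the front of the array.
 */
size_t sort_unique(uint64_t *ids, size_t n)
{
    assert(ids != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(ids, n, sizeof *ids, cmp_u64);

    size_t last = 0;                 /* index of the last unique value kept */
    for (size_t i = 1; i < n; i++) {
        if (ids[i] != ids[last])
            ids[++last] = ids[i];
    }
    return last + 1;
}

/*
 * Hypothetical depth selection: each extra hex digit multiplies the number
 * of index buckets by 16. The depth grows until the average bucket holds
 * at most `target_per_bucket` entries, capped at 8 digits.
 */
unsigned index_depth(uint64_t n_blocks, uint64_t target_per_bucket)
{
    unsigned depth = 1;
    uint64_t buckets = 16;

    while (depth < 8 && n_blocks / buckets > target_per_bucket) {
        depth++;
        buckets *= 16;
    }
    return depth;
}
```

    Under these assumptions a small repository gets a shallow index that costs little memory, while a repository with millions of blocks gets a deeper one so that each lookup scans fewer entries.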

    As a result, we now offer a pruning argument that can detect the blocks to prune among millions of blocks in a negligible amount of time. Thus, if you are pruning a big repo, most of the time will be spent actually deleting the pruned blocks. Loading the data takes some seconds per million blocks, although the time will vary greatly depending on your storage hardware.

    Best practices

    Pruning is a local operation, which means it runs on the server hosting the data. Even if you run it over an IP link established by means of an RSA key, the operation is carried out by the server component.

    Nevertheless, you must pay attention to some special situations you may fall into if you aren't fully aware of your environment. If you attach an NFS or iSCSI device to your host via Ethernet, it will appear as local storage to the server, but there will still be a network in between. Every time you trigger a system call over, let's say, NFS (for instance, a block removal), you have to add the network latency to the storage system's data lookup time.

    If you take into account that a typical HD seek time is in the order of 10 ms, that of an SSD in the order of one tenth of a millisecond, and that your network latency could vary between 1 and 5 ms if your network hardware is behaving well, you will realize that it's not the same doing things one way or another: using HDs or SSDs, or operating in a congested network.

    Not only will the above hardware-related figures condition your prune operations; the NFS or iSCSI stack will also add latency on top of that attributable to the hardware.
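
    To get a feeling for what those figures mean, here is a back-of-the-envelope estimate in C. It rests on a deliberately simplified assumption that is ours, not a measurement: every block deletion costs one storage lookup plus one network round trip, with no caching, no parallelism and no NFS or iSCSI stack overhead. The function name prune_delete_seconds() is hypothetical.

```c
#include <assert.h>

/*
 * Rough estimate, in seconds, of the time spent deleting `n_blocks` blocks
 * when each deletion costs `storage_ms` of storage lookup plus `network_ms`
 * of network latency. Simplified model: deletions are strictly sequential.
 */
double prune_delete_seconds(double n_blocks, double storage_ms, double network_ms)
{
    assert(n_blocks >= 0 && storage_ms >= 0 && network_ms >= 0);
    return n_blocks * (storage_ms + network_ms) / 1000.0;
}
```

    Under this model, deleting one million blocks on a local HD at 10 ms per lookup takes about 10,000 seconds (roughly 2.8 hours), and on a local SSD at 0.1 ms about 100 seconds. Put that same SSD behind a link with 5 ms of latency and the estimate climbs to about 5,100 seconds (85 minutes). Real figures will differ, but the proportions show why the network in between matters so much.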

    Empirical figures show that running a prune operation from an ©ESXi host against an attached NFS NAS or iSCSI target takes much, much longer than doing it directly on the backup server. So always deploy your ©XSIBackup-DC solution taking this into account.

    Daniel J. García Fidalgo
    This page was last modified on 2020-10-23


    ©33HOPS Sistemas de Información y Redes, S.L. | VAT No: ESB83583716 | Avda. Castilla la Mancha, 95, local posterior, 28701 San Sebastián de los Reyes (Madrid) Spain
