#1 2021-05-23 18:09:00

kd.gundermann
Member
Registered: 2018-02-27
Posts: 20

XSIBackup-DC Perf Test

I had some time to run performance tests with XSIBackup-DC 1.5.0.3.

My setup:

Host: VMware ESXi 7.0 on HP DL360 G10, Intel Xeon Silver 4110 @ 2.1 GHz
Storage 1 (VM images): Synology NAS RS820RP+ with 4 x Samsung 1.8 TB SSD, RAID 5
Storage 2 (Backup): Synology NAS DS1817+ with 4 x WDC 8 TB HDD, RAID 5
Both storages are mounted on the host via NFS.

Throughput 1 (tested with VMware I/O Analyzer)

Storage 1: 800 MB/s read
Storage 2: 100 MB/s write (16 KB blocks, 0% random)

Throughput 2 (xsibackup --replica=cbt first run)

Storage 2 : around 25-30 MB/sec

Watching the block count during the xsibackup run, there is a short pause of about half a second after every 36 blocks.

Running IO-Meter in parallel against a second NAS gives the following numbers:

ESXTOP Disk Statistics (Device View), DAVG/KAVG/GAVG/QAVG in msec per command

Host            Physical Disk Device  DQLEN   ACTV  QUED  %USD  LOAD   CMDS/s  READS/s  WRITES/s  MBREADS/s  MBWRITES/s  DAVG/cmd  KAVG/cmd  GAVG/cmd  QAVG/cmd
192.168.10.102  NFS-vmware04           0.00   0.67  0.00  0.00  0.00   235.30   216.45     18.85      13.78        0.26      0.00      0.00      2.37      0.00
192.168.10.102  NFS-xsibackup01        0.00   0.00  0.00  0.00  0.00    13.86     0.00     13.86       0.00       13.74      0.00      0.00     27.96      0.00
192.168.10.102  NFS-xsibackup02        0.00  15.17  0.00  0.00  0.00  5637.65     0.01   5637.65       0.00       88.09      0.00      0.00      2.58      0.00

NFS-vmware04    : Backup source (VM)
NFS-xsibackup01 : Backup destination
NFS-xsibackup02 : IO-Meter Destination
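
The esxtop rows above can be cross-checked with a little arithmetic. The sketch below is an editorial illustration, not part of the original test. It assumes that GAVG/cmd is the average latency of one command in milliseconds, that 1 MB equals 1024 KB in these counters, and it applies Little's law (commands in flight = commands per second x latency). The dictionary keys and function names are made up for this example.

```python
# Figures copied from the esxtop rows above; their interpretation is an assumption.
rows = {
    "NFS-xsibackup01": {"writes_per_s": 13.86, "mb_written_per_s": 13.74, "gavg_ms": 27.96},
    "NFS-xsibackup02": {"writes_per_s": 5637.65, "mb_written_per_s": 88.09, "gavg_ms": 2.58},
}


def avg_write_size_kb(row):
    """Average size of one write command in KB (MB/s divided by writes/s)."""
    return row["mb_written_per_s"] / row["writes_per_s"] * 1024


def avg_outstanding(row):
    """Little's law: average commands in flight = rate x latency."""
    return row["writes_per_s"] * row["gavg_ms"] / 1000.0


def serial_ceiling_mb_per_s(row):
    """Throughput if exactly one write of the average size is in flight at a time."""
    writes_per_s = 1000.0 / row["gavg_ms"]
    mb_per_write = row["mb_written_per_s"] / row["writes_per_s"]
    return writes_per_s * mb_per_write


for name, row in rows.items():
    print(name,
          round(avg_write_size_kb(row)), "KB/write,",
          round(avg_outstanding(row), 2), "in flight,",
          round(serial_ceiling_mb_per_s(row), 1), "MB/s with one write in flight")
```

With these inputs the sketch suggests roughly 1 MB per write on the backup destination and a ceiling of about 35 MB/s when only one such write is in flight at a time, while the IO-Meter run keeps around 15 small 16 KB writes in flight. That is a consistency check on the numbers, not a diagnosis.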

#2 2021-05-24 07:50:11

admin
Administrator
Registered: 2017-04-21
Posts: 1,658

Re: XSIBackup-DC Perf Test

Thank you very much for your feedback.

Using a NAS to store VMs is not a very good idea unless you have fiber optic (FO) or 10 Gb NICs, as you create a 1 Gb bottleneck that you would not have with local storage.

The glitch every few blocks is probably caused by some buffer filling up.

On the other hand, maximum speed is achieved when using SSD as the target of backups. Backups over IP also yield better speed than local backups, since the load of the backup process is split between two CPUs instead of one.

#3 2021-05-24 20:26:48

kd.gundermann
Member
Registered: 2018-02-27
Posts: 20

Re: XSIBackup-DC Perf Test

Thank you for your response.

I understand that 1 Gb NICs may be a bottleneck (and that limits IOPerf to 88 MB/s),
but xsibackup only achieves 13.74 MB/s, which is only about 1/6 of that bottleneck
(see numbers above).

As suggested, I ran another test with a 4 x SSD RAID 5 as the backup target.
It took 0:30 h to back up 58 GB:

Backup end date: 2021-05-24T23:46:13
-------------------------------------------------------------------------------------------------------------
Time taken: 00:30:12 (1812 sec.)
-------------------------------------------------------------------------------------------------------------
Total time:     1812 sec.
-------------------------------------------------------------------------------------------------------------
Full file speed:                                                                            33.14 mb/s
-------------------------------------------------------------------------------------------------------------
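
The figures in this summary can be related to each other with simple arithmetic. The sketch below is an editorial illustration that assumes "mb/s" means MiB per second and that 1 GB equals 1024 MB; neither is stated in the log, and the function names are made up.

```python
def throughput_mb_per_s(size_gb, seconds, mb_per_gb=1024):
    """Average speed for copying size_gb in the given number of seconds."""
    return size_gb * mb_per_gb / seconds


def size_gb_from_speed(mb_per_s, seconds, mb_per_gb=1024):
    """Amount of data implied by an average speed sustained for `seconds`."""
    return mb_per_s * seconds / mb_per_gb


# 58 GB in 1812 s, as quoted in the post
print(round(throughput_mb_per_s(58, 1812), 2))
# Data volume implied by the logged 33.14 mb/s over the same 1812 s
print(round(size_gb_from_speed(33.14, 1812), 2))
```

Under these assumptions 58 GB in 1812 s works out to a little under 33 MB/s, and the logged 33.14 mb/s corresponds to about 58.6 GB, so the rounded "58 GB" and the logged speed agree.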

Last edited by kd.gundermann (2021-05-25 08:17:26)

#4 2021-05-25 10:56:58

admin
Administrator
Registered: 2017-04-21
Posts: 1,658

Re: XSIBackup-DC Perf Test

It's not only the theoretical 88MB/s bottleneck. You aren't taking many other things into account:

1/ That's just a theoretical limit. Depending on your cables, switch (this has a huge impact) and NICs (also a big concern), you may get figures well below that limit.

2/ Are there any other VMs running while you do the backups? Network bandwidth is not simply a matter of adding up partial figures. Depending on your switch and network equipment, combining different flows can degrade performance.

3/ SSH encryption and encapsulation require a modern CPU. Intel has SHA extensions which greatly enhance performance.

4/ Backing up from one non-local datastore (DS) to another adds latency and requires the use of the same CPU. Latency to local devices on the SCSI bus is really low. A remotely attached FS adds network latency to the I/O latency.

5/ Backing up over IP splits the CPU load between client and server. Still, the best results are obtained when backing up from local disks or FO to a fast NAS (SSD, or even better NVMe) over 10 Gb.
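
For reference, the payload ceiling of a 1 Gb link can be estimated from standard frame sizes. The sketch below is an editorial illustration under stated assumptions: a 1500-byte MTU, IPv4 and TCP headers without options, and no allowance for NFS/RPC or SSH overhead, so real figures will be lower. The constant and function names are made up.

```python
LINK_BITS_PER_S = 1_000_000_000   # 1 Gb Ethernet
MTU = 1500                        # assumed standard MTU (no jumbo frames)
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12  # preamble+SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20             # IPv4 + TCP, assuming no header options


def tcp_payload_ceiling_mb_per_s(link_bits_per_s=LINK_BITS_PER_S, mtu=MTU):
    """Upper bound on TCP payload throughput in MB/s (1 MB = 10**6 bytes)."""
    payload_bytes = mtu - IP_TCP_HEADERS
    wire_bytes = mtu + ETHERNET_OVERHEAD
    return link_bits_per_s / 8 * payload_bytes / wire_bytes / 1e6


ceiling = tcp_payload_ceiling_mb_per_s()
print(round(ceiling, 1), "MB/s payload ceiling")
print(round(88.09 / ceiling, 2), "of that ceiling reached by the 88.09 MB/s run")
```

Under these assumptions the ceiling is roughly 118 MB/s, so the 88 MB/s figure discussed in this thread sits below what the wire alone would allow.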

#5 2021-05-26 13:27:52

kd.gundermann
Member
Registered: 2018-02-27
Posts: 20

Re: XSIBackup-DC Perf Test

Please don't get me wrong, I love XSIBackup. We have been using it for over 4 years now, and it's fantastic that you created XSIBackup-DC.

But I am just curious:
1. How well will XSIBackup-DC perform?
2. Why does the backup of our SQL Server machine take so long?

I have been doing performance analysis for over 30 years now, so I am fully aware of the points you mentioned above.
So my setup is quite simple:
- HP DL360 Gen10 with 8 CPUs x Intel(R) Xeon(R) Silver 4110
- 2 x Synology NAS RS820RP+ with 4 x Samsung 1.8 TB SSD RAID
- connected to a single Cisco SG350X switch

> It's not only the theoretical 88MB/s bottleneck

The 88 MB/s is not a theoretical figure; it was measured on this system running IOPerf with a 16 KB block size.

Running a dd job which copies the data from one storage to the other WITHIN a virtual machine gives even better figures:

root@buddgie-vm:/mnt/source# time dd if=../source/bigfile of=../target/bigfile bs=1M
9216+0 records in
9216+0 records out
9663676416 bytes (9.7 GB, 9.0 GiB) copied, 92.1166 s, 105 MB/s

real   1m32.121s
user   0m0.046s
sys    0m25.742s
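
The dd summary above can be recomputed to see which unit the reported rate uses. This is an editorial sketch; the only inputs are the byte count and elapsed time printed by dd, and the helper name is made up.

```python
BYTES_COPIED = 9_663_676_416   # from the dd output
ELAPSED_S = 92.1166            # from the dd output

MB = 1000 ** 2    # decimal megabyte, which dd's summary line uses
MIB = 1024 ** 2   # binary mebibyte


def rate(bytes_copied, seconds, unit):
    """Average transfer rate in the given unit per second."""
    return bytes_copied / seconds / unit


# 9216 records of bs=1M account for the whole byte count
print(BYTES_COPIED == 9216 * MIB)
print(round(rate(BYTES_COPIED, ELAPSED_S, MB), 1), "MB/s")
print(round(rate(BYTES_COPIED, ELAPSED_S, MIB), 1), "MiB/s")
```

The decimal figure matches the 105 MB/s printed by dd; in binary units the same run is about 100 MiB/s, which matters when comparing against tools that report MiB/s.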

I tried cp and dd in the ESXi shell too, and they achieved the same speed as xsibackup: around 34 MB/s.

There is an old case study from Intel from 2010 (11 years ago!): https://www.intel.com/content/dam/suppo … yfinal.pdf
where they used scp and rsync and stated:
"but the standard tools are not very well threaded, so they don't take full advantage of .. this hardware platform"
and recommend: "Choose the right tools .. and use more parallelism when possible"

( I had the same problem on Windows when I tried to migrate a mailstore with millions of small files to a new host.
  The standard tools were too slow, so I developed a tool which uses multiple threads and large buffers and is able to saturate the 1 Gb link.)

So I am curious whether xsibackup uses a multithreaded, buffered architecture,
or is there really an artificial performance slowdown for the ESXi shell? (Someone mentioned this on Stack Exchange.)
What are the best figures you are getting with a 1 Gb link?
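
As an editorial illustration of what "multiple threads and large buffers" can mean, here is a minimal sketch of a copy loop in which reading and writing overlap. It is not the tool mentioned above and not how XSIBackup works; the function name, chunk size and queue depth are assumptions chosen for the example.

```python
import queue
import threading


def threaded_copy(src_path, dst_path, chunk_size=4 * 1024 * 1024, max_chunks=8):
    """Copy src to dst using a reader thread and the calling thread as writer.

    Large chunks travel through a bounded queue, so the next read can be in
    progress while the previous chunk is still being written.
    """
    chunks = queue.Queue(maxsize=max_chunks)
    errors = []

    def reader():
        try:
            with open(src_path, "rb") as src:
                while True:
                    data = src.read(chunk_size)
                    if not data:
                        break
                    chunks.put(data)
        except OSError as exc:
            errors.append(exc)
        finally:
            chunks.put(None)  # sentinel: nothing more to write

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    copied = 0
    with open(dst_path, "wb") as dst:
        while True:
            data = chunks.get()
            if data is None:
                break
            dst.write(data)
            copied += len(data)

    thread.join()
    if errors:
        raise errors[0]
    return copied
```

Whether this helps depends on where the time goes: overlapping reads and writes only pays off when both sides have noticeable latency, as is the case when source and target are separate NFS mounts.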

( And yes, I am just talking about the speed for the first run of "xsibackup --replica" )

Thank you for your good work!

Klaus

PS: I will try to set up some tests using backup over IP. If you are interested, I can report on these tests.

Last edited by kd.gundermann (2021-05-26 13:28:52)

#6 2021-05-26 15:56:10

admin
Administrator
Registered: 2017-04-21
Posts: 1,658

Re: XSIBackup-DC Perf Test

Of course not, we don't get you wrong. Thank you for your feedback, this is indeed the kind of talk we like.

dd just copies data in blocks from one file descriptor to another. I wonder whether it uses mmap; probably not. It's a utility for slicing files, a wonderful one I must say, but it probably wasn't tuned for speed.

(c)XSIBackup-DC is quite optimized. We still have room for some improvement, and pthreads is one of the items on our to-do list. So no, we haven't prioritized parallelism so far. Why?

1/ Hashing and compressing data runs at over 100 MB/s on an average core, provided that you back up over IP so the server part does not use the same CPU. By "average core" we mean a single-thread rating of 2000 at CPUBenchmark. On a second run, data is processed at around 250 MB/s on an average core, which is about the limit of a modern SATA HD. All of that on top of being zero aware.

2/ We are already applying parallelism since (c)XSIBackup-DC can work as a client/server.

3/ Full throttle is only required on the first pass, subsequent ones will hash and compare at 250MB/s on commodity hardware.

4/ We now offer native CBT support, so the above limit is simply waived by knowing in advance which blocks have changed.

5/ One of the aims of (c)XSIBackup has always been to work in the background with a minimal footprint. Running on a single core is a good way to avoid creating too much disturbance in a production system.

The above reasons have kept us from using pthreads.h so far.

On the other hand, your setup is optimized for parallelism, low temperature and low power consumption, but not for speed.
If you take a look at your CPU benchmark, you will see that its single-core performance is below average.
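
To make "hash and compare" and "zero aware" concrete, here is a minimal editorial sketch of block-level change detection. It is not XSIBackup's actual algorithm: the block size, the choice of SHA-1 and all names are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # hypothetical block size


def changed_blocks(data, previous_hashes, block_size=BLOCK_SIZE):
    """Return (block numbers to transfer, new hash list) for one pass.

    All-zero blocks are recorded as None and are neither hashed nor
    transferred. Other blocks are hashed and compared with the hash stored
    by the previous pass; only blocks whose hash differs are listed.
    """
    to_send = []
    new_hashes = []
    zeros = bytes(block_size)
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        number = offset // block_size
        if block == zeros[:len(block)]:
            new_hashes.append(None)
            continue
        digest = hashlib.sha1(block).hexdigest()
        new_hashes.append(digest)
        old = previous_hashes[number] if number < len(previous_hashes) else None
        if digest != old:
            to_send.append(number)
    return to_send, new_hashes


# First pass: nothing known yet, so every non-zero block is sent.
first, hashes = changed_blocks(bytes(4) + b"abcd" + b"efgh", [], block_size=4)
# Second pass: only the block that changed is sent.
second, _ = changed_blocks(bytes(4) + b"abcd" + b"EFGH", hashes, block_size=4)
print(first, second)
```

On the second pass every non-zero block still has to be read and hashed, which is why that pass is bounded by hashing and read speed rather than by the network. CBT removes even that step by reporting the changed blocks up front.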

(c)Intel (c)Xeon Silver 4110

Just in case, remember that you must enable your smart array controller cache to obtain maximum performance in the (c)ESXi shell.
Enable your HP controller cache

Also, Windows OSs don't play well when virtualized. Microsoft has kind of given up on the OS manufacturing niche. You probably already know this, but we'll take the chance to offer this link to an interesting post on block alignment.

VMFS and storage block alignment

When the guest FS is not aligned with the VMFS FS, the guest may need to access more VMFS blocks to reach the data it needs. If, on top of that, the underlying hardware's blocks aren't aligned with the VMFS ones, the performance of the system can be heavily degraded.

In your case it's only the VMFS to datastore volume block alignment that matters.
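
The effect of misalignment can be shown with a small calculation. This is an editorial sketch with illustrative numbers: a 4096-byte block size on both layers, a 1 MiB-aligned start, and the legacy 63-sector partition start as the misaligned case. None of these values come from the setup discussed in this thread.

```python
def blocks_touched(offset, length, block_size):
    """Number of underlying blocks covered by an I/O of `length` bytes at `offset`."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


BLOCK = 4096  # assumed block size of the underlying layer

# A 4096-byte request starting on a 1 MiB boundary stays within one block.
print(blocks_touched(1024 * 1024, 4096, BLOCK))
# The same request starting at sector 63 (63 * 512 bytes) straddles two blocks.
print(blocks_touched(63 * 512, 4096, BLOCK))
```

Each extra block touched is an extra read or read-modify-write on the layer below, and the effect multiplies when several layers are misaligned at once.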
