Category Archives: performance

VM Network Performance and CPU Scheduling

Over the years, I’ve been on quite a few network performance cases and have seen many reasons for performance trouble. One that is often overlooked is the impact of CPU contention and a VM’s inability to schedule CPU time effectively.

Today, I’ll be taking a quick look at the actual impact CPU scheduling can have on network throughput.

Testing Setup

To demonstrate, I’ll be using my dual-socket management host. As I did in my recent VMXNET3 ring buffer exhaustion post, I’ll be testing with VMs on the same host and port group to eliminate bottlenecks created by physical networking components. The VMs should be able to communicate as quickly as their compute resources will allow them.

Physical Host:

  • 2x Intel Xeon E5 2670 Processors (16 cores at 2.6GHz, 3.3GHz Turbo)
  • 96GB PC3-12800R Memory
  • ESXi 6.0 U3 Build 5224934

VM Configuration:

  • 1x vCPU
  • 1024MB RAM
  • VMXNET3 Adapter (1.1.29 driver with default ring sizes)
  • Debian Linux 7.4 x86 PAE
  • iperf 2.0.5

The VMs I used for this test are quite small with only a single vCPU and 1GB of RAM. This was done intentionally so that CPU contention could be more easily simulated. Much higher throughput would be possible with multiple vCPUs and additional RX queues.

The CPUs in my physical host are Xeon E5 2670 processors clocked at 2.6GHz per core. Because this processor supports Intel Turbo Boost, the maximum frequency of each core will vary depending on several factors and can be as high as 3.3GHz at times. To take this into consideration, I will test with a CPU limit of 2600MHz, as well as with no limit at all to show the benefit this provides.

To measure throughput, I’ll be using a pair of Debian Linux VMs running iperf 2.0.5. One will be the sending side and the other the receiving side. I’ll be running four simultaneous threads to maximize throughput and load.
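
The commands for this kind of test look roughly like the following. This is only a sketch – the receiver address, run duration and reporting interval are placeholders rather than the exact values I used:

# On the receiving VM: run iperf 2.0.5 in server mode (listens on TCP port 5001 by default)
iperf -s

# On the sending VM: four parallel streams (-P 4), reporting every 2 seconds (-i 2)
# for a 60 second run (-t 60). Replace <receiver-ip> with the receiving VM's address.
iperf -c <receiver-ip> -P 4 -t 60 -i 2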

I should note that my testing is far from precise and is not being done with the usual controls and safeguards to ensure accurate results. This said, my aim isn’t to be accurate, but rather to illustrate some higher-level patterns and trends.

Simulating CPU Contention

The E5 2670 processors in my host have plenty of power for home lab purposes, and I’m nowhere near CPU contention during day-to-day use. An easy way to simulate contention or CPU scheduling difficulty would be to manually set a CPU ‘Limit’ in the VM’s resource settings. In my tests, this seemed to be quite effective.

[Image: cpusched-2 – setting a CPU limit in the VM’s resource settings]

The challenge I had was what appeared to be a brief period of CPU burst that occurs before the set limit is enforced. For the first 10 seconds of load, the limit doesn’t appear to work and then suddenly kicks in. To circumvent this, I ran an iperf test until I saw the throughput drop, cancelled it, and then started the test again immediately for the full duration. This ensured that the entire test was done while CPU limited.

With an enforced limit, the receiving VM simply couldn’t get sufficient scheduling time on the host’s CPUs during load. CPU ready time was very high as a result.

Below is esxtop output illustrating the effect:

   ID      GID    NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT

 5572782  5572782 iperf-test2         6   66.50   32.82   28.20  571.70    0.17    0.12   67.73    0.01    0.00    0.00    0.00
 5572767  5572767 iperf-test1         6   37.77   29.20    3.56  505.83    0.13   69.60    3.19    0.02    0.00   69.52    0.00
   19840    19840 NSX_Controller_     9   11.60   12.63    0.36  894.20    0.00    0.17  391.13    0.04    0.00    0.00    0.00
   19825    19825 NSX_Controller_     9    9.22   10.43    0.11  896.38    0.00    0.19  392.96    0.08    0.00    0.00    0.00
   20529    20529 vc                 10    9.13   11.97    0.14  995.69    0.53    0.14  189.26    0.03    0.00    0.00    0.00
<snip>

With a CPU limit of 800MHz, we can see a very high CPU %RDY time of about 69%. This tells us that the guest had work ready to run, but 69% of the time it couldn’t get scheduled on a physical CPU. The nearly identical %MLMTD value tells us that this ready time was due to the configured limit rather than contention from other workloads.
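
If you’d rather capture these counters over the course of a test than watch them interactively, esxtop can also run in batch mode. A rough example – the delay and sample count here are arbitrary:

# Interactive: launch esxtop and press 'c' for the CPU view to watch %USED, %RDY and %MLMTD live
esxtop

# Batch mode: capture 60 samples at a 5 second interval to a CSV file for later analysis
esxtop -b -d 5 -n 60 > /tmp/esxtop-cpu.csv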

Let’s have a look at the test results.

Interpreting The Results

To simulate varying degrees of contention, I re-ran the test described in the previous section after decreasing the CPU limit in 200MHz intervals. Below is the result:

[Image: cpusched-1 – iperf throughput at each CPU limit]

The actual results, including the CPU limit set, are as follows:

No Limit (Full Frequency): 20.2 Gbps
2600MHz (100%): 19.2 Gbps
2400MHz (92%): 17.7 Gbps
2200MHz (85%): 16.4 Gbps
2000MHz (77%): 14.7 Gbps
1800MHz (69%): 12.1 Gbps
1600MHz (62%): 10.3 Gbps
1400MHz (54%): 8.5 Gbps
1200MHz (46%): 6.5 Gbps
1000MHz (38%): 5.2 Gbps
800MHz (31%): 4.1 Gbps
600MHz (23%): 2.7 Gbps
400MHz (15%): 2.2 Gbps

The end result was surprisingly linear and tells us several important things about this guest.

  1. CPU was clearly a bottleneck for this guest’s frame processing. Throughput continued to scale as CPU scheduling time increased.
  2. This guest probably wouldn’t have had difficulty with 10Gbps throughput levels until its CPU scheduling ability dropped below 70%.
  3. With Gigabit networking, this guest had plenty of cycles to process frames.

Realistically, this guest would be suffering from many other performance issues with CPU %RDY times that high, but it’s interesting to see the impact on the network stack and frame processing.

Conclusion

My tests were far from accurate, but they help to illustrate the importance of having ample CPU scheduling time for high network throughput. If your benchmarks or throughput levels aren’t where they should be, you should definitely have a look at your CPU %RDY times.

Stay tuned for other tests. In a future post, I hope to examine the role offloading features like LRO and TSO play in throughput testing. Please feel free to leave any questions or comments below!

VMXNET3 RX Ring Buffer Exhaustion and Packet Loss

ESXi is generally very efficient when it comes to basic network I/O processing. Guests are able to make good use of the physical networking resources of the hypervisor and it isn’t unreasonable to expect close to 10Gbps of throughput from a VM on modern hardware. Dealing with very network heavy guests, however, does sometimes require some tweaking.

I’ll quite often get questions from customers who observe TCP re-transmissions and other signs of packet loss when doing VM packet captures. The loss may not be significant enough to cause a real application problem, but may have some performance impact during peak times and during heavy load.

After doing some searching online, customers will quite often land on VMware KB 2039495 and KB 1010071, but there isn’t a lot of context and background to go with these directions. Today I hope to take an in-depth look at VMXNET3 RX buffer exhaustion and not only show how to increase the buffers, but also how to determine whether doing so is even necessary.

Rx Buffering

Not unlike physical network cards and switches, virtual NICs must have buffers to temporarily store incoming network frames for processing. During periods of very heavy load, the guest may not have the cycles to handle all the incoming frames and the buffer is used to temporarily queue up these frames. If that buffer fills more quickly than it is emptied, the vNIC driver has no choice but to drop additional incoming frames. This is what is known as buffer or ring exhaustion.

A Lab Example

To demonstrate ring exhaustion in my lab, I had to get a bit creative. There weren’t any easily reproducible ways to do this with 1Gbps networking, so I looked for other ways to push my test VMs as hard as I could. To do this, I simply ensured the two test VMs were sitting on the same ESXi host, in the same portgroup. This removes all physical networking and allows the guests to communicate as quickly as possible without the constraints of physical networking components.

To run this test, I used two VMs with Debian Linux 7.4 (3.2 kernel) on them. Both are very minimal deployments with only the essentials. From a virtual hardware perspective, both have a single VMXNET3 adapter, two vCPUs and 1GB of RAM.

Although VMware Tools is installed in these guests, they are using the distro-bundled ‘-k’ version of the VMXNET3 driver. The driver installed is 1.1.29.0.

root@iperf-test1:~# ethtool -i eth0
driver: vmxnet3
version: 1.1.29.0-k-NAPI
firmware-version: N/A
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

To generate large amounts of TCP traffic between the two machines, I used iperf 2.0.5 – a favorite that we use for network performance testing. The benefit of using iperf over other tools and methods is that it does not need to write or read anything from disk for the transfer – it simply sends/receives TCP data to/from memory as quickly as it can.

Although I could have done a bi-directional test, I decided to use one machine as the sender and the other as the receiver. This helps to ensure one side is especially RX heavy.
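
For completeness, the receiving VM (iperf-test1) just needs iperf running in server mode – something like:

# On iperf-test1: listen for the incoming TCP test streams (default port 5001)
iperf -s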

root@iperf-test2:~# iperf -c 172.16.10.151 -t 300 -i 2 -P 12
------------------------------------------------------------
Client connecting to 172.16.10.151, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[ 14] local 172.16.10.150 port 45280 connected with 172.16.10.151 port 5001
[ 4] local 172.16.10.150 port 45269 connected with 172.16.10.151 port 5001
[ 5] local 172.16.10.150 port 45271 connected with 172.16.10.151 port 5001
[ 6] local 172.16.10.150 port 45273 connected with 172.16.10.151 port 5001
[ 7] local 172.16.10.150 port 45272 connected with 172.16.10.151 port 5001
[ 8] local 172.16.10.150 port 45274 connected with 172.16.10.151 port 5001
[ 3] local 172.16.10.150 port 45270 connected with 172.16.10.151 port 5001
[ 9] local 172.16.10.150 port 45275 connected with 172.16.10.151 port 5001
[ 10] local 172.16.10.150 port 45276 connected with 172.16.10.151 port 5001
[ 11] local 172.16.10.150 port 45277 connected with 172.16.10.151 port 5001
[ 12] local 172.16.10.150 port 45278 connected with 172.16.10.151 port 5001
[ 13] local 172.16.10.150 port 45279 connected with 172.16.10.151 port 5001
[ ID] Interval Transfer Bandwidth
[ 14] 0.0- 2.0 sec 918 MBytes 3.85 Gbits/sec
[ 4] 0.0- 2.0 sec 516 MBytes 2.16 Gbits/sec
[ 5] 0.0- 2.0 sec 603 MBytes 2.53 Gbits/sec
[ 6] 0.0- 2.0 sec 146 MBytes 614 Mbits/sec
[ 7] 0.0- 2.0 sec 573 MBytes 2.40 Gbits/sec
[ 8] 0.0- 2.0 sec 894 MBytes 3.75 Gbits/sec
[ 3] 0.0- 2.0 sec 596 MBytes 2.50 Gbits/sec
[ 9] 0.0- 2.0 sec 916 MBytes 3.84 Gbits/sec
[ 10] 0.0- 2.0 sec 548 MBytes 2.30 Gbits/sec
[ 11] 0.0- 2.0 sec 529 MBytes 2.22 Gbits/sec
[ 12] 0.0- 2.0 sec 930 MBytes 3.90 Gbits/sec
[ 13] 0.0- 2.0 sec 540 MBytes 2.26 Gbits/sec
[SUM] 0.0- 2.0 sec 7.53 GBytes 32.3 Gbits/sec

On the sending VM (iperf client machine) I used the -P 12 option to execute twelve parallel streams. This equates to twelve separate TCP/IP sockets and generally taxes the machine quite heavily. I also let the test run for a five minute period using the -t 300 option. As you can see above, thanks to offloading features like LRO, we’re seeing more than 32Gbps of throughput while on the same host and portgroup.

Now, although this appears to be excellent performance, it doesn’t mean there wasn’t packet loss during the transfer. Packet loss leads to TCP re-transmissions, window-size adjustments and possibly a performance impact. Depending on your application and the severity of the loss, you may not notice any problems, but I can pretty much guarantee that a packet capture would contain TCP duplicate ACKs and re-transmissions.
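
If you want to verify this yourself, a short capture taken on the receiving guest during the test can be filtered for re-transmissions afterwards. This is just a sketch – the interface name and file path are assumptions:

# Capture a short sample of the iperf traffic on the receiving VM
tcpdump -i eth0 -s 0 -w /tmp/iperf-test.pcap port 5001

# Count the re-transmissions and duplicate ACKs flagged by Wireshark's analysis engine
tshark -r /tmp/iperf-test.pcap -Y "tcp.analysis.retransmission" | wc -l
tshark -r /tmp/iperf-test.pcap -Y "tcp.analysis.duplicate_ack" | wc -l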

Let’s have a look at the TCP socket statistics from the sender’s perspective. Was the sender receiving duplicate ACKs and re-transmitting as a result?

root@iperf-test2:~# netstat -s |grep -i retransmit
 86715 segments retransmited
 TCPLostRetransmit: 106
 68040 fast retransmits
 122 forward retransmits
 18444 retransmits in slow start
 4 SACK retransmits failed
root@iperf-test2:~# netstat -s |grep -i loss
 645 times recovered from packet loss by selective acknowledgements
 2710 TCP data loss events

Indeed it was. There were over 86K re-transmitted segments and 2710 data loss events recorded by the guest’s TCP/IP stack. So now that we know there was application impact to some degree, let’s have a look at the other VM – the receiving side. We can use the ethtool command to view VMXNET3 driver statistics from within the guest:

root@iperf-test1:~# ethtool -S eth0 |grep -i drop
 drv dropped tx total: 0
 drv dropped tx total: 0
 drv dropped rx total: 2305
 drv dropped rx total: 979

Above we can see that the driver dropped over 3,000 frames, but we’re more interested in the buffering statistics. Searching for the ‘OOB’ (out of buffers) string, we can see how many frames were dropped due to ring exhaustion:

root@iperf-test1:~# ethtool -S eth0 |grep OOB
 pkts rx OOB: 7129
 pkts rx OOB: 3241

There are two outputs listed above because there are two RX queues in Linux – one for each vCPU on the VM. We’ll look more closely at this in the next section. Clearly, we dropped many frames due to buffering. Over 10,000 frames were dropped.
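
If you’d like to watch these counters climb while a test is running, rather than checking them after the fact, something simple like this can be left running in the guest (eth0 is an assumption):

# Refresh the VMXNET3 drop and out-of-buffer counters every second during the test
watch -n 1 "ethtool -S eth0 | grep -iE 'OOB|drop'"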

Checking Buffer Statistics from ESXi

Most Linux distros provide some good driver statistic information, but that may not always be the case. Thankfully, you can also check statistics from ESXi.

To begin, we’ll need to find both the internal port number and name of the connected vSwitch. To find this, the net-stats -l command is very useful:

[root@esx0:~] net-stats -l
PortNum Type SubType SwitchName MACAddress ClientName
<snip>
33554464 5 9 vSwitch0 00:50:56:a6:55:f4 iperf-test1
33554465 5 9 vSwitch0 00:50:56:a6:44:72 iperf-test2

Since iperf-test1 is the receiving iperf VM, I’ve made a note of the port number, which is 33554464 and the name of the vSwitch, which is vSwitch0. If your VM happens to be on a distributed switch, you’ll have an internal vSwitch name such as ‘DvsPortset-0’ and not the normal friendly label it’s given during setup.

Note: In the next few paragraphs, we’ll be using an internal debugging shell called ‘vsish’. This is an unsupported tool and should be used with caution. It’s safer to use single vsish -e commands to get information rather than trying to navigate around in the vsish shell.

To begin, we can get some generic vSwitch port statistics to see if any drops occurred. You’ll simply need to modify the below command to replace vSwitch0 with the name of your vSwitch and 33554464 with the port number you found earlier with net-stats -l.

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554464/clientStats
port client stats {
 pktsTxOK:26292806
 bytesTxOK:1736589508
 droppedTx:0
 pktsTsoTxOK:0
 bytesTsoTxOK:0
 droppedTsoTx:0
 pktsSwTsoTx:0
 droppedSwTsoTx:0
 pktsZerocopyTxOK:1460809
 droppedTxExceedMTU:0
 pktsRxOK:54807350
 bytesRxOK:1806670750824
 droppedRx:10312
 pktsSwTsoRx:26346080
 droppedSwTsoRx:0
 actions:0
 uplinkRxPkts:3401
 clonedRxPkts:0
 pksBilled:0
 droppedRxDueToPageAbsent:0
 droppedTxDueToPageAbsent:0
}

As you can see above, the droppedRx count is over 10K – about what we observed in the Linux guest. This tells us that frames were dropped, but not why.

Next, we’ll have a look at some statistics reported by the VMXNET3 adapter:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554464/vmxnet3/rxSummary
stats of a vmxnet3 vNIC rx queue {
 LRO pkts rx ok:50314577
 LRO bytes rx ok:1670451542658
 pkts rx ok:50714621
 bytes rx ok:1670920359206
 unicast pkts rx ok:50714426
 unicast bytes rx ok:1670920332742
 multicast pkts rx ok:0
 multicast bytes rx ok:0
 broadcast pkts rx ok:195
 broadcast bytes rx ok:26464
 running out of buffers:10370
 pkts receive error:0
 # of times the 1st ring is full:7086
 # of times the 2nd ring is full:3284
 fail to map a rx buffer:0
 request to page in a buffer:0
 # of times rx queue is stopped:0
 failed when copying into the guest buffer:0
 # of pkts dropped due to large hdrs:0
 # of pkts dropped due to max number of SG limits:0
}

And again, we see some more specific statistics that help us to understand why frames were dropped. Both the first and second rings were exhausted thousands of times.

Determining the Current Buffer Settings

The default ring size varies from OS to OS and can also depend on the VMXNET3 driver version being used. I believe that some versions of the Windows VMXNET3 driver also allow dynamic sizing of the RX buffer based on load.

The ethtool command is useful for determining the current ring sizes in most Linux distros:

root@iperf-test1:~# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 512
RX Mini: 0
RX Jumbo: 0
TX: 1024

On the receiving node, we can see that the maximum possible value is 4096, but the current setting is 512. It’s important to note that rings are allocated per RX queue. On this machine there are actually two RX queues – one per vCPU – so that works out to 256 descriptors per queue.

[Image: vmxnet3ring-1]

In Windows, you can see the RX Ring and buffering settings in the network adapter properties window. Unfortunately, by default the value is just ‘Not Present’, indicating that the driver’s default is in use.

Once again, you can see the current RX queue buffer size from ESXi and this value is generally more trustworthy.

First, we can display the number of RX queues being used by the guest by running the following command:

[root@esx0:~] vsish -e ls /net/portsets/vSwitch0/ports/33554464/vmxnet3/rxqueues/
0/
1/

Above, we can see that this Linux VM has two queues – zero and one. Each has its own RX ring that can be viewed independently like this:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554464/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:256
 ring #2 size:256
}

My Windows 2008 R2 test box has only one RX queue despite having more than one vCPU. This is because Windows implements ‘multiqueue’ differently than Linux and it’s not used by default – more on this later. Despite this, we can see that its first ring is twice the size of the Linux VM’s by default:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554466/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:512
 ring #2 size:32
}

Increasing the RX Buffer Size in Linux

Now that we’ve determined that this guest could indeed benefit from a larger queue, let’s increase it to the maximum value of 4096.

Warning: Modifying NIC driver settings may cause a brief traffic disruption. If this is a production environment, be sure to do this in a scheduled outage/change window.

In Linux, there is more than one way to accomplish this, but the easiest is to use ethtool:

root@iperf-test1:~# ethtool -G eth0 rx 4096
root@iperf-test1:~# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 8192
RX Mini: 0
RX Jumbo: 0
TX: 2048

After setting it to 4096, we can see that the current hardware setting is actually 8192 (two RX queues of 4096 each).

Note: This ethtool setting will be lost as soon as the VM reboots. You’ll need to add this command to /etc/rc.local or some other startup script to ensure it persists across reboots.
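
On a Debian 7 guest like this one, a minimal way to make the change persistent is to add the ethtool line to /etc/rc.local. The interface name is an assumption, and newer distros may prefer a systemd unit or an interface post-up hook instead:

#!/bin/sh -e
# /etc/rc.local – executed at the end of the boot process on this Debian guest
# Re-apply the larger VMXNET3 RX ring after every reboot
ethtool -G eth0 rx 4096
exit 0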

From ESXi, we can also confirm that the setting took effect as we did previously:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554464/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:4096
 ring #2 size:256
}

Now we can see that ring #1 of queue 0 is set to 4096, as expected.

Increasing the RX Buffer Size in Windows

Making this change in Windows is a little different. VMware KB 2039495 outlines the process, but I’ll walk through it below.

Warning: Modifying NIC driver settings may cause a brief traffic disruption. If this is a production environment, be sure to do this in a scheduled outage/change window.

In theory, you can simply increase the RX Ring #1 size, but it’s also possible to boost the Small Rx Buffers that are used for other purposes.

From the network adapter properties page, I have increased Rx Ring #1 to 4096 and Small Rx Buffers to 8192.

[Image: vmxnet3ring-3 – Rx Ring #1 and Small Rx Buffers increased in the adapter’s advanced properties]

If you plan to use jumbo 9K frames in the guest, Windows can also benefit from a larger Rx Ring #2. It can be increased to 4096, which I did as well. The Large Rx Buffer value should also be maxed out if Rx Ring #2 is increased.

Once I did this, you can see that the values have taken effect from ESXi’s perspective:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554466/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:4096
 ring #2 size:4096
}

Enabling Multi-Queue Receive Side Scaling in Windows

As mentioned earlier, you’ll only have one RX queue by default with the VMXNET3 adapter in Windows. To take advantage of multiple queues, you’ll need to enable Receive Side Scaling. Again, this change will likely cause a momentary network ‘blip’ and impact existing TCP sessions. If this is a production VM, be sure to do this during a maintenance window.

This is done in the same advanced properties area:

[Image: vmxnet3ring-4 – enabling RSS in the adapter’s advanced properties]

Note: There have been some issues reported over the years with VMXNET3 and RSS in Windows. I didn’t experience any issues with modern builds of ESXi and the VMXNET3 driver, but this should be enabled with caution and thorough performance testing should be conducted to ensure it’s having a positive benefit.

Once this was done, I could see two queues with the maximum ring sizes:

[root@esx0:~] vsish -e ls /net/portsets/vSwitch0/ports/33554466/vmxnet3/rxqueues/
0/
1/
[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554466/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:4096
 ring #2 size:4096
}

Measuring Improvements

Now that I’ve maxed out the RX buffers, I’ll be rebooting the guests to clear the counters and then repeating the test I ran earlier.

Although my testing methodology is far from precise, I did notice a slight performance increase in the test. Previously I got about 32.5Gbps; now it’s consistently over 33Gbps. I suspect this improvement is due to a healthier TCP stream with fewer re-transmissions.

Let’s have a look:

root@iperf-test2:~# netstat -s |grep -i retransmit
 48 segments retransmited
 47 fast retransmits

That’s a huge improvement. We went from over 86,000 re-transmissions down to only a small handful. Next, let’s look at the VMXNET3 buffering:

root@iperf-test1:~# ethtool -S eth0 |grep -i OOB
 pkts rx OOB: 505
 pkts rx OOB: 1151

Although it’s not surprising that there was still a small amount of buffer exhaustion, these numbers are only about 15% of what they were previously. Let’s have a look from ESXi’s perspective:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554467/vmxnet3/rxSummary
stats of a vmxnet3 vNIC rx queue {
 LRO pkts rx ok:37113984
 LRO bytes rx ok:1242784172888
 pkts rx ok:37235828
 bytes rx ok:1242914664554
 unicast pkts rx ok:37235740
 unicast bytes rx ok:1242914656142
 multicast pkts rx ok:0
 multicast bytes rx ok:0
 broadcast pkts rx ok:88
 broadcast bytes rx ok:8412
 running out of buffers:1656
 pkts receive error:0
 # of times the 1st ring is full:0
 # of times the 2nd ring is full:1656
 fail to map a rx buffer:0
 request to page in a buffer:0
 # of times rx queue is stopped:0
 failed when copying into the guest buffer:0
 # of pkts dropped due to large hdrs:0
 # of pkts dropped due to max number of SG limits:0
}

As you can see above, all of the loss is now due to the second ring, which we did not increase. I hope to have another look at the second ring in a future post. Normally the second ring is used for jumbo frame traffic, but I’m not clear why my guests are using it as my MTU is set to 1500. The first ring can clearly handle the deluge of packets now and didn’t exhaust once during the test.

Memory Overhead Impact

VMware’s KB 2039495 mentions increased overhead with larger receive buffers. This isn’t surprising as the guest OS needs to pre-allocate memory to use for this purpose.

Let’s have a look at what kind of increase happened on my Linux VM. To do this, I did a fresh reboot of the VM with the default RX ring, waited five minutes for things to settle and recorded the memory utilization. I then repeated this with the maxed out RX buffer.

After five minutes with the default ring:

root@iperf-test1:~# free
             total       used       free     shared    buffers     cached
Mem:       1034088      74780     959308          0       9420      35352
-/+ buffers/cache:      30008    1004080
Swap:       265212          0     265212

And again after increasing the RX ring to 4096:

root@iperf-test1:~# free
             total       used       free     shared    buffers     cached
Mem:       1034088      93052     941036          0       9444      35352
-/+ buffers/cache:      48256     985832
Swap:       265212          0     265212

Although ~18MB may not seem like a lot of overhead, that could certainly add up over hundreds or thousands of VMs. Obviously ESXi’s memory sharing and conservation techniques will help to reduce this burden, but the key point to remember is that this extra buffering is not free from a resource perspective.

A good rule of thumb I like to tell customers is that increasing RX buffers is a great idea – just as long as a VM will actually benefit from it. The default values are probably sufficient for the vast majority of VM workloads, but if you have a VM exhibiting buffer exhaustion, there is no reason not to boost it up. I’d also go so far as to say that if you have a particular VM that you know will be traffic heavy – perhaps a new SQL box, or file server – proactively boost the buffers to the maximum possible.

Frequently Asked Questions

Q: Why are VMs more susceptible to buffer exhaustion? I don’t see these types of issues with physical servers.

A: This generally comes down to compute resources. If a VM – or physical server for that matter – can quickly process incoming frames, it’s unlikely that the buffer will get to a point where it’s full. When you have dozens or hundreds of VMs on a host all competing for compute resources, the guest may not be able to act on incoming frames quickly enough and the buffer can fill. A guest’s processing abilities may also vary greatly from one moment to the next, which increases the risk of exhaustion.

Q: Shouldn’t TCP window scaling prevent packet loss?

A: That is mostly correct – TCP will scale the flow of segments based on network conditions, but because the loss of TCP segments is the trigger for scaling back, it’s quite likely that the buffer had to be exhausted at least once already before TCP starts reducing window size. Because a VM’s processing capability can vary due to the shared nature of resources on the hypervisor, what was fine from a TCP perspective one moment may be too heavy in the next.

Q: Does having a larger receive buffer have any disadvantages?

A: This will depend on the type of applications you are using in the guest. Having a larger buffer means that more frames can queue up. Rather than being dropped, frames may still make it to the guest, but with slightly increased latency. In some real-time applications like voice or video this may not be desirable, and packet loss is preferred. That said, most transaction-based workloads like DB, email and file services would benefit from a larger buffer.

Q: What about increased overhead?

A: As mentioned, there is a small amount of memory overhead that the guest will use for the increased buffering. This is generally insignificant unless it’s increased across the entire environment. I generally recommend only increasing the RX buffers on VMs that will actually benefit from it.

Q: Can’t the physical NICs on the ESXi host contribute to packet loss as well?

A: They certainly can. Physical NICs on the ESXi hypervisor also have RX queues. These will vary from vendor to vendor and are sometimes configurable in the NIC driver parameters. Most physical NIC drivers are tuned appropriately for heavy traffic bursts, and it’s unusual to see any significant loss due to NIC buffers on modern hardware.
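
Depending on the ESXi version and NIC driver, the physical NIC ring sizes can usually be inspected from the ESXi shell. The commands below are a sketch – vmnic0 is a placeholder, and the esxcli ‘ring’ namespace only exists on newer ESXi releases:

# Older ESXi releases ship an ethtool binary that also works against physical NICs
ethtool -g vmnic0

# Newer ESXi releases expose the same information via esxcli
esxcli network nic ring current get -n vmnic0
esxcli network nic ring preset get -n vmnic0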

Q: Why does my Windows VM have only one RX queue?

A: The Windows VMXNET3 driver has RSS (Receive Side Scaling) disabled by default. Most modern Linux kernels will enable multiqueue support out of the box, but in Windows this will need to be turned on. Be sure to test thoroughly that RSS works correctly and that you see performance benefit.
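
If you want to confirm how many RX queues a Linux guest is actually using, a couple of quick checks from inside the guest will show it (eth0 is an assumption):

# Each RX queue appears as an rx-N directory under sysfs
ls /sys/class/net/eth0/queues/

# Each queue also gets its own interrupt line, visible in /proc/interrupts
grep eth0 /proc/interrupts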

Q: Is there any impact to existing flows when modifying RX buffers?

A: It’s always best to assume that there will be a traffic impact when modifying any driver settings like RX ring sizes. I can confirm that Windows disables/enables the adapter briefly when changing advanced VMXNET3 settings. The outage may be brief, but you’ll want to proceed with caution in a production environment.

Q: What if I’m using an E1000 adapter or something other than VMXNET3?

A: E1000 and other adapter types will often allow the tweaking of buffers as described in VMware KB 1010071. VMXNET3 has the largest configurable RX buffer sizes available of all the adapters and many other benefits. Unless there is a very specific reason for using an E1000 or other type of adapter, you should really consider moving to VMXNET3.

Conclusion

And there you have it – an often misunderstood setting that can help to mitigate packet loss and improve performance. Please feel free to leave any questions or comments below.