NSX 6.3.4 Now Available!

On Friday October 13th, VMware released NSX for vSphere 6.3.4. You may be surprised to see another 6.3.x version only two months after the release of 6.3.3. Unlike the usual build updates, 6.3.4 is a maintenance release containing only a small number of fixes for problems identified in 6.3.3. This is very similar to the 6.2.6 maintenance release that came out shortly after 6.2.5.

As always, the relevant detail can be found in the 6.3.4 Release Notes. You can also find the 6.3.4 upgrade bundle at the VMware NSX Download Page.

In the Resolved Issues section of the release notes, VMware outlines only three separate fixes that 6.3.4 addresses.

Resolved Issues

I’ll provide a bit of additional commentary around each of the resolved issues in 6.3.4:

Fixed Issue 1970527: ARP fails to resolve for VMs when Logical Distributed Router ARP table crosses 5K limit

This first problem was actually a regression in 6.3.3. In a previous release, the ARP table limit was increased to 20K, but in 6.3.3 it regressed back to the old limit of 5K. To be honest, not many customers have deployments at a scale where this would be a problem, but a small number of very large deployments may see issues in 6.3.3.

Fixed Issue 1961105: Hardware VTEP connection goes down upon controller reboot. A BufferOverFlow exception is seen when certain hardware VTEP configurations are pushed from the NSX Manager to the NSX Controller. This overflow issue prevents the NSX Controller from getting a complete hardware gateway configuration. Fixed in 6.3.4.

This buffer overflow issue could potentially cause datapath issues. Thankfully, not very many NSX designs include the use of Hardware VTEPs, but if yours does and you are running 6.3.3, it would be a good idea to consider upgrading to 6.3.4.

The final issue, and the one most likely to impact customers, is listed third in the release notes:

Fixed Issue 1955855: Controller API could fail due to cleanup of API server reference files. Upon cleanup of required files, workflows such as traceflow and central CLI will fail. If external events disrupt the persistent TCP connections between NSX Manager and controller, NSX Manager will lose the ability to make API connections to controllers, and the UI will display the controllers as disconnected. There is no datapath impact. Fixed in 6.3.4.

I discussed this issue in more detail in a recent blog post. You can also find more information on this issue in VMware KB 2151719. In a nutshell, the communication channel between NSX Manager and the NSX Control cluster can become disrupted due to files being periodically purged by a cleanup maintenance script. Usually, you wouldn’t notice until the connection needed to be re-established after a network outage or an NSX Manager reboot. Thankfully, as VMware mentions, there is no datapath impact and a simple workaround exists. Although it’s more of an annoyance than a serious problem, the vast majority of NSX users running 6.3.3 are likely to hit this at one time or another.

My Opinion and Upgrade Recommendations

The third issue in the release notes, described in VMware KB 2151719, is likely the most disruptive for the majority of NSX users. That said, I really don’t think it’s critical enough to have to drop everything and upgrade immediately. The workaround of restarting the controller API service is relatively simple and there should be no resulting datapath impact.
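For reference, the workaround is a single command run from the CLI of each impacted controller, as covered in KB 2151719 and later in this post:

nsx-controller # restart api-server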

The other two issues described are not likely to be encountered in the vast majority of NSX deployments, but are potentially more serious. Unless you are really pushing the scale limits or are using Hardware VTEPs, there is likely little reason to be concerned.

I certainly think that VMware did the right thing to patch these identified problems as quickly as possible. For new greenfield deployments, I think there is no question that 6.3.4 is the way to go. For those already running 6.3.3, it’s certainly not a bad idea to upgrade, but you may want to consider holding out for 6.3.5, which should include a much larger number of fixes.

On a positive note, if you do decide to upgrade, there are likely some components that will not need to be upgraded. Because there are only a small number of fixes relating to the control plane and logical switching, ESGs, DLRs and Guest Introspection will likely not have any code changes. You’ll also benefit from not having to reboot ESXi hosts for VIB patches thanks to changes in the 6.3.x upgrade process. Once I have a chance to go through the upgrade in my lab, I’ll report back on this.

Running 6.3.3 today? Let me know what your plans are!

Building a Retro Gaming Rig – Part 3

Welcome to the third installment of my Building a Retro Gaming Rig series. Today, I’ll be taking a look at another motherboard and CPU combo that I picked up from eBay on a bit of a whim.

In Part 1 of this series, I took an in-depth look at some Slot-1 gear, including the popular Asus P2B and some CPU options. As I was thinking ahead in the build, I got frustrated with the lack of simple and classic-looking ATX tower cases available these days. Everything looks far too modern, has too much bling or is just plain gigantic. Used tower cases from twenty years ago are usually yellowed pretty badly and just don’t look great. On the other hand, there are lots of small, simple and affordable micro ATX cases available.

Micro ATX – or mATX – motherboards were actually pretty uncommon twenty-odd years ago. PC tower cases were pretty large and in those days people really did use lots of expansion cards and needed the extra space. Only very compact systems and OEMs seemed to use the mATX form factor at that time. Many of these boards were heavily integrated, lacked expansion slots and stuck you with some pretty weak onboard video solutions.

MSI MS-6160 Motherboard

In an interesting twist, I came across an MSI MS-6160 mATX board based on the Intel 440LX chipset that seemed to tick many of the right boxes. The combo included a Celeron 400MHz processor and 512MB of SDRAM for only $35 CDN.


VM Network Performance and CPU Scheduling

Over the years, I’ve been on quite a few network performance cases and have seen many reasons for performance trouble. One that is often overlooked is the impact of CPU contention and a VM’s inability to schedule CPU time effectively.

Today, I’ll be taking a quick look at the actual impact CPU scheduling can have on network throughput.

Testing Setup

To demonstrate, I’ll be using my dual-socket management host. As I did in my recent VMXNET3 ring buffer exhaustion post, I’ll be testing with VMs on the same host and port group to eliminate bottlenecks created by physical networking components. The VMs should be able to communicate as quickly as their compute resources will allow them.

Physical Host:

  • 2x Intel Xeon E5 2670 Processors (16 cores at 2.6GHz, 3.3GHz Turbo)
  • 96GB PC3-12800R Memory
  • ESXi 6.0 U3 Build 5224934

VM Configuration:

  • 1x vCPU
  • 1024MB RAM
  • VMXNET3 Adapter (1.1.29 driver with default ring sizes)
  • Debian Linux 7.4 x86 PAE
  • iperf 2.0.5

The VMs I used for this test are quite small with only a single vCPU and 1GB of RAM. This was done intentionally so that CPU contention could be more easily simulated. Much higher throughput would be possible with multiple vCPUs and additional RX queues.

The CPUs in my physical host are Xeon E5 2670 processors clocked at 2.6GHz per core. Because this processor supports Intel Turbo Boost, the maximum frequency of each core will vary depending on several factors and can be as high as 3.3GHz at times. To take this into consideration, I will test with a CPU limit of 2600MHz, as well as with no limit at all to show the benefit this provides.

To measure throughput, I’ll be using a pair of Debian Linux VMs running iperf 2.0.5. One will be the sending side and the other the receiving side. I’ll be running four simultaneous threads to maximize throughput and load.
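For reference, the iperf invocations look roughly like the following. The receiver’s IP address and the test duration shown here are placeholders rather than the exact values I used:

# On the receiving VM, run iperf in server mode
iperf -s

# On the sending VM, run four parallel client streams against the receiver
iperf -c <receiver-ip> -P 4 -t 60 -i 2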

I should note that my testing is far from precise and is not being done with the usual controls and safeguards to ensure accurate results. This said, my aim isn’t to be accurate, but rather to illustrate some higher-level patterns and trends.

Simulating CPU Contention

The E5 2670 processors in my host have plenty of power for home lab purposes, and I’m nowhere near CPU contention during day-to-day use. An easy way to simulate contention or CPU scheduling difficulty would be to manually set a CPU ‘Limit’ in the VM’s resource settings. In my tests, this seemed to be quite effective.
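As an aside, if you’d rather not click through the UI, the same limit can also be expressed as a VMX option. This is just a rough sketch – the value is in MHz, and the VM needs to be powered off for a .vmx edit to take effect:

sched.cpu.max = "800"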

cpusched-2

The challenge I had was what appeared to be a brief period of CPU burst that occurs before the set limit is enforced. For the first 10 seconds of load, the limit doesn’t appear to work and then suddenly kicks in. To circumvent this, I ran an iperf test until I saw the throughput drop, cancelled it, and then started the test again immediately for the full duration. This ensured that the entire test was done while CPU limited.

With an enforced limit, the receiving VM (iperf-test1) simply couldn’t get sufficient scheduling time with the host’s CPU during load. CPU ready time was very high as a result.

Below is esxtop output illustrating the effect:

   ID      GID    NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT

 5572782  5572782 iperf-test2         6   66.50   32.82   28.20  571.70    0.17    0.12   67.73    0.01    0.00    0.00    0.00
 5572767  5572767 iperf-test1         6   37.77   29.20    3.56  505.83    0.13   69.60    3.19    0.02    0.00   69.52    0.00
   19840    19840 NSX_Controller_     9   11.60   12.63    0.36  894.20    0.00    0.17  391.13    0.04    0.00    0.00    0.00
   19825    19825 NSX_Controller_     9    9.22   10.43    0.11  896.38    0.00    0.19  392.96    0.08    0.00    0.00    0.00
   20529    20529 vc                 10    9.13   11.97    0.14  995.69    0.53    0.14  189.26    0.03    0.00    0.00    0.00
<snip>

With a CPU limit of 800MHz, we can see a very high CPU %RDY time of about 69%. This tells us that the guest had work ready to run, but about 69% of the time its vCPU couldn’t get scheduled onto a physical CPU.

Let’s have a look at the test results.

Interpreting The Results

To simulate varying degrees of contention, I re-ran the test described in the previous section after decreasing the CPU limit in 200MHz intervals. Below is the result:

cpusched-1

The actual results, including the CPU limit set for each run, are as follows:

No Limit (Full Frequency): 20.2 Gbps
2600MHz (100%): 19.2 Gbps
2400MHz (92%): 17.7 Gbps
2200MHz (85%): 16.4 Gbps
2000MHz (77%): 14.7 Gbps
1800MHz (69%): 12.1 Gbps
1600MHz (62%): 10.3 Gbps
1400MHz (54%): 8.5 Gbps
1200MHz (46%): 6.5 Gbps
1000MHz (38%): 5.2 Gbps
800MHz (31%): 4.1 Gbps
600MHz (23%): 2.7 Gbps
400MHz (15%): 2.2 Gbps

The end result was surprisingly linear and can tell us several important facts about this guest.

  1. CPU was clearly a bottleneck for this guest when processing frames. Throughput continued to scale as CPU scheduling time increased.
  2. This guest probably wouldn’t have difficulty with 10Gbps networking throughput levels until its CPU scheduling ability dropped below 70%.
  3. With Gigabit networking, this guest had plenty of cycles to process frames.

Realistically, this guest would be suffering from many other performance issues with CPU %RDY times that high, but it’s interesting to see its impact on the network stack and frame processing.

Conclusion

My tests were far from accurate, but they help to illustrate the importance of having ample CPU scheduling time for high network throughput. If your benchmarks or throughput levels aren’t where they should be, you should definitely have a look at your CPU %RDY times.
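If you’d rather capture %RDY over a period of time instead of watching it live, esxtop’s batch mode works well. The interval and sample count below are just examples:

# Interactive: run esxtop, press 'c' for the CPU view and watch the %RDY column
esxtop

# Batch mode: sample every 10 seconds for 30 iterations and save to CSV for later review
esxtop -b -d 10 -n 30 > /tmp/esxtop-capture.csv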

Stay tuned for other tests. In a future post, I hope to examine the role offloading features like LRO and TSO play in throughput testing. Please feel free to leave any questions or comments below!

NSX Engineering Mode ‘root shell’ Access Now Available to Customers

In an interesting move, VMware released public KB 2149630 on September 29th, providing information on how to access the root shell of the NSX Manager appliance.

If you’ve been on an NSX support call with VMware dealing with a complex issue, you may have seen your support engineer drop into a special shell called ‘Engineering Mode’. This is sometimes also referred to as ‘Tech Support Mode’. Regardless of the name used, this is basically a root bash shell on the underlying Linux based appliance. From here, system configuration files and scripts as well as most normal Linux functions can be accessed.

Normally, when you open a console or SSH session to NSX Manager, you are dropped into a restricted ‘admin’ shell with a hierarchical system of commands much like Cisco’s IOS. For the majority of what an administrator needs to do, this is sufficient. It’s only in more complex cases – especially those involving the Postgres DB or the underlying OS – that root shell access may be required.

There are several important statements and disclaimers that VMware makes in this KB article that I want to outline below:

“Important: Do not make any changes to the underlying system without the help of VMware Technical Support. All such changes are not supported and as a result, your system may no longer be supportable by GSS.”

In NSX 6.3.2 and later, you’ll also be greeted by the following disclaimer:

“Engineering Mode: The authorized NSX Manager system administrator is requesting a shell which is able to perform lower level unix commands/diagnostics and make changes to the appliance. VMware asks that you do so only in conjunction with a support call to prevent breaking your virtual infrastructure. Please enter the shell diagnostics string before proceeding. Type Exit to return to the NSX shell. Type y to continue:”

And finally, you’ll want to ensure you have a full backup of NSX Manager should anything need to be modified:

“VMware recommends to take full backup of the system before performing any changes after logging into the Tech Support Mode.”

Although it is very useful to take a ‘read-only’ look at things in the root shell, making any changes is not supported without direct assistance from VMware support.

A few people have asked whether or not making the root shell password public is a security issue, but the important point to remember is that you cannot even get to a position where you can enter the shell unless you are already logged in as an NSX enterprise administrator level account. For example, the built-in ‘admin’ account. For anyone concerned about this, VMware does allow the root password to be changed. It’s just critical that this password not be lost in case VMware support requires access to the root shell for troubleshooting purposes. More information on this can be found in KB 2149630.

To be honest, I’m a bit torn on this development. As someone who does backline support, I know what kind of damage that can be done from the root shell – even with the best intentions. But at the same time, I see this as empowering. It gives customers additional tools to troubleshoot and it also provides some transparency into how NSX Manager works rather than shielding it behind a restricted shell. I think that overall, the benefits outweigh the risks and this was a positive move for VMware.

When I think back to VI 3.5 and vSphere 4.0 when ESXi was shiny and new, VMware initially took a similar stance. You had to go so far as to type ‘UNSUPPORTED’ into the console to access a shell. Today, everyone has unrestricted root access to the hypervisor. The same holds true for the vCenter appliance – the potential for destruction is no different.

I’d welcome any comments or thoughts. Please share them below!

Controller Disconnect and API Bug in NSX 6.3.3

VMware just announced a new bug discovered in NSX 6.3.3. Those running 6.3.3 or planning to upgrade in the near-term may want to familiarize themselves with VMware KB 2151719.

As you may know, VMware moved from a Debian based distribution for the underlying OS of the NSX controllers to their Photon OS platform. This is why the upgrade process includes the complete redeployment of all three controller nodes.

It appears that a scheduled clean-up script on the controllers used to prevent the partitions from filling is also removing some files required for NSX Manager to communicate and authenticate with the controller via REST API.

Most folks running 6.3.3 in a stable deployment will likely not have noticed, but an event disrupting communication between Manager and the Controllers can prevent them from reconnecting. Some examples would include a reboot of the NSX manager, or a network disruption.

Thankfully, the NSX Controller core functions – managing the VXLAN and distributed logical routing control plane – will continue to work in this state and dataplane disruptions should not be experienced.

KB 2151719 discusses a pretty simple workaround of restarting the api-server service on any impacted controllers. This is a non-disruptive action and should be safe to do at any time. The command is the following:

nsx-controller # restart api-server

VMware will likely be addressing this in the next NSX release. If you are planning to upgrade, you may want to consider 6.3.2 or hold out for the next 6.3.x release.

VMXNET3 RX Ring Buffer Exhaustion and Packet Loss

ESXi is generally very efficient when it comes to basic network I/O processing. Guests are able to make good use of the physical networking resources of the hypervisor and it isn’t unreasonable to expect close to 10Gbps of throughput from a VM on modern hardware. Dealing with very network heavy guests, however, does sometimes require some tweaking.

I’ll quite often get questions from customers who observe TCP re-transmissions and other signs of packet loss when doing VM packet captures. The loss may not be significant enough to cause a real application problem, but may have some performance impact during peak times and during heavy load.

After doing some searching online, customers will quite often land on VMware KB 2039495 and KB 1010071, but there isn’t a lot of context and background to go with these directions. Today I hope to take an in-depth look at VMXNET3 RX buffer exhaustion and not only show how to increase the buffers, but also how to determine whether doing so is even necessary.

Rx Buffering

Not unlike physical network cards and switches, virtual NICs must have buffers to temporarily store incoming network frames for processing. During periods of very heavy load, the guest may not have the cycles to handle all the incoming frames and the buffer is used to temporarily queue up these frames. If that buffer fills more quickly than it is emptied, the vNIC driver has no choice but to drop additional incoming frames. This is what is known as buffer or ring exhaustion.

A Lab Example

To demonstrate ring exhaustion in my lab, I had to get a bit creative. There weren’t any easily reproducible ways to do this with 1Gbps networking, so I looked for other ways to push my test VMs as hard as I could. To do this, I simply ensured the two test VMs were sitting on the same ESXi host, in the same portgroup. This removes all physical networking and allows the guests to communicate as quickly as possible without the constraints of physical networking components.

To run this test, I used two VMs with Debian Linux 7.4 (3.2 kernel) on them. Both are very minimal deployments with only the essentials. From a virtual hardware perspective, both have a single VMXNET3 adapter, two vCPUs and 1GB of RAM.

Although VMware Tools is installed in these guests, they are using the ‘-k’ distro bundled version of the VMXNET3 driver. The driver installed is 1.1.29.0.

root@iperf-test1:~# ethtool -i eth0
driver: vmxnet3
version: 1.1.29.0-k-NAPI
firmware-version: N/A
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

To generate large amounts of TCP traffic between the two machines, I used iperf 2.0.5 – a favorite that we use for network performance testing. The benefit of using iperf over other tools and methods is that it does not need to read or write anything from disk for the transfer – it simply sends/receives TCP data to/from memory as quickly as it can.

Although I could have done a bi-directional test, I decided to use one machine as the sender and the other as the receiver. This helps to ensure one side is especially RX heavy.
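The receiving VM (iperf-test1) simply runs iperf in server mode – I haven’t included that output here, but the invocation is just:

iperf -s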

root@iperf-test2:~# iperf -c 172.16.10.151 -t 300 -i 2 -P 12
------------------------------------------------------------
Client connecting to 172.16.10.151, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[ 14] local 172.16.10.150 port 45280 connected with 172.16.10.151 port 5001
[ 4] local 172.16.10.150 port 45269 connected with 172.16.10.151 port 5001
[ 5] local 172.16.10.150 port 45271 connected with 172.16.10.151 port 5001
[ 6] local 172.16.10.150 port 45273 connected with 172.16.10.151 port 5001
[ 7] local 172.16.10.150 port 45272 connected with 172.16.10.151 port 5001
[ 8] local 172.16.10.150 port 45274 connected with 172.16.10.151 port 5001
[ 3] local 172.16.10.150 port 45270 connected with 172.16.10.151 port 5001
[ 9] local 172.16.10.150 port 45275 connected with 172.16.10.151 port 5001
[ 10] local 172.16.10.150 port 45276 connected with 172.16.10.151 port 5001
[ 11] local 172.16.10.150 port 45277 connected with 172.16.10.151 port 5001
[ 12] local 172.16.10.150 port 45278 connected with 172.16.10.151 port 5001
[ 13] local 172.16.10.150 port 45279 connected with 172.16.10.151 port 5001
[ ID] Interval Transfer Bandwidth
[ 14] 0.0- 2.0 sec 918 MBytes 3.85 Gbits/sec
[ 4] 0.0- 2.0 sec 516 MBytes 2.16 Gbits/sec
[ 5] 0.0- 2.0 sec 603 MBytes 2.53 Gbits/sec
[ 6] 0.0- 2.0 sec 146 MBytes 614 Mbits/sec
[ 7] 0.0- 2.0 sec 573 MBytes 2.40 Gbits/sec
[ 8] 0.0- 2.0 sec 894 MBytes 3.75 Gbits/sec
[ 3] 0.0- 2.0 sec 596 MBytes 2.50 Gbits/sec
[ 9] 0.0- 2.0 sec 916 MBytes 3.84 Gbits/sec
[ 10] 0.0- 2.0 sec 548 MBytes 2.30 Gbits/sec
[ 11] 0.0- 2.0 sec 529 MBytes 2.22 Gbits/sec
[ 12] 0.0- 2.0 sec 930 MBytes 3.90 Gbits/sec
[ 13] 0.0- 2.0 sec 540 MBytes 2.26 Gbits/sec
[SUM] 0.0- 2.0 sec 7.53 GBytes 32.3 Gbits/sec

On the sending VM (iperf client machine) I used the -P 12 option to execute twelve parallel streams. This equates to twelve separate TCP/IP sockets and generally taxes the machine quite heavily. I also let the test run for a five minute period using the -t 300 option. As you can see above, thanks to offloading features like LRO, we’re seeing more than 32Gbps of throughput while on the same host and portgroup.

Now, although this appears to be excellent performance, it doesn’t mean there wasn’t packet loss experienced during the transfer. Packet loss also equates to TCP re-transmissions, window size adjustment and possibly performance impact. Depending on your application and the severity of the loss, you may not notice any problems, but I can pretty much guarantee that a packet capture would contain TCP duplicate ACKs and re-transmissions.

Let’s have a look at the TCP socket statistics from the sender’s perspective. Was the sender receiving duplicate ACKs and as a result re-transmitted?

root@iperf-test2:~# netstat -s |grep -i retransmit
 86715 segments retransmited
 TCPLostRetransmit: 106
 68040 fast retransmits
 122 forward retransmits
 18444 retransmits in slow start
 4 SACK retransmits failed
root@iperf-test2:~# netstat -s |grep -i loss
 645 times recovered from packet loss by selective acknowledgements
 2710 TCP data loss events

Indeed it was. There were over 86K re-transmitted segments and 2710 data loss events recorded by the guest’s TCP/IP stack. So now that we know there was application impact to some degree, let’s have a look at the other VM – the receiving side. We can use the ethtool command to view VMXNET3 driver statistics from within the guest:

root@iperf-test1:~# ethtool -S eth0 |grep -i drop
 drv dropped tx total: 0
 drv dropped tx total: 0
 drv dropped rx total: 2305
 drv dropped rx total: 979

Above we can see that the driver dropped over 3000 incoming frames in total, but we’re more interested in the buffering statistics specifically. Searching for the ‘OOB’ string, we can see how many frames were dropped due to buffer exhaustion:

root@iperf-test1:~# ethtool -S eth0 |grep OOB
 pkts rx OOB: 7129
 pkts rx OOB: 3241

There are two outputs listed above because there are two RX queues in Linux – one for each vCPU on the VM. We’ll look more closely at this in the next section. Clearly, we dropped many frames due to buffer exhaustion – over 10,000 in total.

Checking Buffer Statistics from ESXi

Most Linux distros provide some good driver statistic information, but that may not always be the case. Thankfully, you can also check statistics from ESXi.

To begin, we’ll need to find both the internal port number and name of the connected vSwitch. To find this, the net-stats -l command is very useful:

[root@esx0:~] net-stats -l
PortNum Type SubType SwitchName MACAddress ClientName
<snip>
33554464 5 9 vSwitch0 00:50:56:a6:55:f4 iperf-test1
33554465 5 9 vSwitch0 00:50:56:a6:44:72 iperf-test2

Since iperf-test1 is the receiving iperf VM, I’ve made a note of the port number, which is 33554464 and the name of the vSwitch, which is vSwitch0. If your VM happens to be on a distributed switch, you’ll have an internal vSwitch name such as ‘DvsPortset-0’ and not the normal friendly label it’s given during setup.

Note: In the next few paragraphs, we’ll be using an internal debugging shell called ‘vsish’. This is an unsupported tool and should be used with caution. It’s safer to use single vsish -e commands to get information rather than trying to navigate around in the vsish shell.

To begin, we can get some generic vSwitch port statistics to see if any drops occurred. You’ll simply need to modify the below command to replace vSwitch0 with the name of your vSwitch and 33554464 with the port number you found earlier with net-stats -l.

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554464/clientStats
port client stats {
 pktsTxOK:26292806
 bytesTxOK:1736589508
 droppedTx:0
 pktsTsoTxOK:0
 bytesTsoTxOK:0
 droppedTsoTx:0
 pktsSwTsoTx:0
 droppedSwTsoTx:0
 pktsZerocopyTxOK:1460809
 droppedTxExceedMTU:0
 pktsRxOK:54807350
 bytesRxOK:1806670750824
 droppedRx:10312
 pktsSwTsoRx:26346080
 droppedSwTsoRx:0
 actions:0
 uplinkRxPkts:3401
 clonedRxPkts:0
 pksBilled:0
 droppedRxDueToPageAbsent:0
 droppedTxDueToPageAbsent:0
}

As you can see above, the droppedRx count is over 10K – about what we observed in the Linux guest. This tells us that frames were dropped, but not why.

Next, we’ll have a look at some statistics reported by the VMXNET3 adapter:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554464/vmxnet3/rxSummary
stats of a vmxnet3 vNIC rx queue {
 LRO pkts rx ok:50314577
 LRO bytes rx ok:1670451542658
 pkts rx ok:50714621
 bytes rx ok:1670920359206
 unicast pkts rx ok:50714426
 unicast bytes rx ok:1670920332742
 multicast pkts rx ok:0
 multicast bytes rx ok:0
 broadcast pkts rx ok:195
 broadcast bytes rx ok:26464
 running out of buffers:10370
 pkts receive error:0
 # of times the 1st ring is full:7086
 # of times the 2nd ring is full:3284
 fail to map a rx buffer:0
 request to page in a buffer:0
 # of times rx queue is stopped:0
 failed when copying into the guest buffer:0
 # of pkts dropped due to large hdrs:0
 # of pkts dropped due to max number of SG limits:0
}

And again, we see some more specific statistics that help us to understand why frames were dropped. Both the first and second rings were exhausted thousands of times.

Determining the Current Buffer Settings

The default ring size varies from OS to OS and can also differ depending on the VMXNET3 driver version being used. I believe that some versions of the Windows VMXNET3 driver also allow for dynamic sizing of the RX buffer based on load.

The ethtool command is useful for determining the current ring sizes in most Linux distros:

root@iperf-test1:~# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 512
RX Mini: 0
RX Jumbo: 0
TX: 1024

On the receiving VM, we can see that the maximum possible value is 4096, but the current setting is 512. It’s important to note that this value is spread across the RX queues. On this machine, there are actually two RX queues – one per vCPU – so that works out to 256 descriptors per queue.

vmxnet3ring-1

In Windows, you can see the RX Ring and buffering settings in the network adapter properties window. Unfortunately, by default the value is just ‘Not Present’ indicating that it’s using the default of the driver.

Once again, you can see the current RX queue buffer size from ESXi and this value is generally more trustworthy.

First, we can display the number of RX queues being used by the guest by running the following command:

[root@esx0:~] vsish -e ls /net/portsets/vSwitch0/ports/33554464/vmxnet3/rxqueues/
0/
1/

Above, we can see that this Linux VM has two queues – zero and one. Each will have its own RX ring that can be viewed independently like this:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554464/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:256
 ring #2 size:256
}

My Windows 2008 R2 test box has only one RX queue despite having more than one vCPU. This is because Windows implements ‘multiqueue’ differently than Linux and it’s not used by default – more on this later. Despite this, we can see that the first ring is actually twice the size of the Linux VM’s by default:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554466/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:512
 ring #2 size:32
}

Increasing the RX Buffer Size in Linux

Now that we’ve determined that this guest could indeed benefit from a larger queue, let’s increase it to the maximum value of 4096.

Warning: Modifying NIC driver settings may cause a brief traffic disruption. If this is a production environment, be sure to do this in a scheduled outage/change window.

In Linux, there is more than one way to accomplish this, but the easiest is to use ethtool:

root@iperf-test1:~# ethtool -G eth0 rx 4096
root@iperf-test1:~# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 8192
RX Mini: 0
RX Jumbo: 0
TX: 2048

After setting it to 4096, we can see that the current hardware setting is actually 8192 (two RX queues of 4096 descriptors each).

Note: This ethtool setting will be lost as soon as the VM reboots. You’ll need to add this command to /etc/rc.local or some other startup script to ensure it persists across reboots.
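As an example, one way to persist the setting on a Debian system using ifupdown is to hang it off the interface definition. This is just a sketch – adjust the interface name and addressing method to match your own configuration:

# /etc/network/interfaces
auto eth0
iface eth0 inet dhcp
    post-up /sbin/ethtool -G eth0 rx 4096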

From ESXi, we can also confirm that the setting took effect as we did previously:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554464/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:4096
 ring #2 size:256
}

Now we can see that queue 0 is set to 4096 as we wanted to see.

Increasing the RX Buffer Size in Windows

Making this change in Windows is a little different. VMware KB 2039495 outlines the process, but I’ll walk through it below.

Warning: Modifying NIC driver settings may cause a brief traffic disruption. If this is a production environment, be sure to do this in a scheduled outage/change window.

In theory, you can simply increase the RX Ring #1 size, but it’s also possible to boost the Small Rx Buffers that are used for other purposes.

From the network adapter properties page, I have increased Rx Ring #1 to 4096 and Small Rx Buffers to 8192.

vmxnet3ring-3

If you plan to use jumbo 9K frames in the guest, Windows can also benefit from a larger Rx Ring #2. It can be increased to 4096, which I did as well. The Large Rx Buffer value should also be maxed out if Rx Ring #2 is increased.

Once I did this, I could see that the values had taken effect from ESXi’s perspective:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554466/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:4096
 ring #2 size:4096
}

Enabling Multi-Queue Receive Side Scaling in Windows

As mentioned earlier, you’ll only have one RX queue by default with the VMXNET3 adapter in Windows. To take advantage of multiple queues, you’ll need to enable Receive Side Scaling. Again, this change will likely cause a momentary network ‘blip’ and impact existing TCP sessions. If this is a production VM, be sure to do this during a maintenance window.

This is done in the same advanced properties area:

vmxnet3ring-4

Note: There have been some issues reported over the years with VMXNET3 and RSS in Windows. I didn’t experience any issues with modern builds of ESXi and the VMXNET3 driver, but this should be enabled with caution and thorough performance testing should be conducted to ensure it’s having a positive benefit.

Once this was done, I could see two queues with the maximum ring sizes:

[root@esx0:~] vsish -e ls /net/portsets/vSwitch0/ports/33554466/vmxnet3/rxqueues/
0/
1/
[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554466/vmxnet3/rxqueues/0/status
status of a vmxnet3 vNIC rx queue {
 intr index:0
 stopped:0
 error code:0
 ring #1 size:4096
 ring #2 size:4096
}

Measuring Improvements

Now that I’ve maxed out the RX buffers, I’ll be rebooting the guests to clear the counters and then repeating the test I ran earlier.

Although my testing methodology is far from precise, I did notice a slight performance increase in the test. Previously I got about 32.5Gbps; now it’s consistently over 33Gbps. I suspect this improvement is due to a healthier TCP stream with fewer re-transmissions.

Let’s have a look:

root@iperf-test2:~# netstat -s |grep -i retransmit
 48 segments retransmited
 47 fast retransmits

That’s a huge improvement. We went from over 86,000 re-transmissions down to only a small handful. Next, let’s look at the VMXNET3 buffering:

root@iperf-test1:~# ethtool -S eth0 |grep -i OOB
 pkts rx OOB: 505
 pkts rx OOB: 1151

Although it’s not surprising that there was still some small amount of buffer exhaustion, these numbers are only about 16% of what they were previously. Let’s have a look from ESXi’s perspective:

[root@esx0:~] vsish -e get /net/portsets/vSwitch0/ports/33554467/vmxnet3/rxSummary
stats of a vmxnet3 vNIC rx queue {
 LRO pkts rx ok:37113984
 LRO bytes rx ok:1242784172888
 pkts rx ok:37235828
 bytes rx ok:1242914664554
 unicast pkts rx ok:37235740
 unicast bytes rx ok:1242914656142
 multicast pkts rx ok:0
 multicast bytes rx ok:0
 broadcast pkts rx ok:88
 broadcast bytes rx ok:8412
 running out of buffers:1656
 pkts receive error:0
 # of times the 1st ring is full:0
 # of times the 2nd ring is full:1656
 fail to map a rx buffer:0
 request to page in a buffer:0
 # of times rx queue is stopped:0
 failed when copying into the guest buffer:0
 # of pkts dropped due to large hdrs:0
 # of pkts dropped due to max number of SG limits:0
}

As you can see above, all of the loss is now due to the second ring, which we did not increase. I hope to have another look at the second ring in a future post. Normally the second ring is used for jumbo frame traffic, but I’m not clear why my guests are using it as my MTU is set to 1500. The first ring can clearly handle the deluge of packets now and didn’t exhaust once during the test.

Memory Overhead Impact

VMware’s KB 2039495 mentions increased overhead with larger receive buffers. This isn’t surprising as the guest OS needs to pre-allocate memory to use for this purpose.

Let’s have a look at what kind of increase happened on my Linux VM. To do this, I did a fresh reboot of the VM with the default RX ring, waited five minutes for things to settle and recorded the memory utilization. I then repeated this with the maxed out RX buffer.

After five minutes with the default ring:

root@iperf-test1:~# free
             total       used       free     shared    buffers     cached
Mem:       1034088      74780     959308          0       9420      35352
-/+ buffers/cache:      30008    1004080
Swap:       265212          0     265212

And again after increasing the RX ring to 4096:

root@iperf-test1:~# free
             total       used       free     shared    buffers     cached
Mem:       1034088      93052     941036          0       9444      35352
-/+ buffers/cache:      48256     985832
Swap:       265212          0     265212

Although roughly 18MB of additional used memory (about 74,780KB before versus 93,052KB after) may not seem like a lot of overhead, that could certainly add up over hundreds or thousands of VMs. Obviously ESXi’s memory sharing and conservation techniques will help to reduce this burden, but the key point to remember is that this extra buffering is not free from a resource perspective.

A good rule of thumb I like to tell customers is that increasing RX buffers is a great idea – just as long as a VM will actually benefit from it. The default values are probably sufficient for the vast majority of VM workloads, but if you have a VM exhibiting buffer exhaustion, there is no reason not to boost it up. I’d also go so far as to say that if you have a particular VM that you know will be traffic heavy – perhaps a new SQL box, or file server – proactively boost the buffers to the maximum possible.

Frequently Asked Questions

Q: Why are VMs more susceptible to buffer exhaustion? I don’t see these types of issues with physical servers.

A: This generally comes down to compute resources. If a VM – or a physical server for that matter – can quickly process incoming frames, it’s unlikely that the buffer will ever fill up. When you have dozens or hundreds of VMs on a host all competing for compute resources, a guest may not be able to act on incoming frames quickly enough and the buffer can fill. A guest’s processing ability may also vary greatly from one moment to the next, which increases the risk of exhaustion.

Q: Shouldn’t TCP window scaling prevent packet loss?

A: That is mostly correct – TCP will scale the flow of segments based on network conditions, but because the loss of TCP segments is the trigger for scaling back, it’s quite likely that the buffer had to be exhausted at least once already before TCP starts reducing window size. Because a VM’s processing capability can vary due to the shared nature of resources on the hypervisor, what was fine from a TCP perspective one moment may be too heavy in the next.

Q: Does having a larger receive buffer have any disadvantages?

A: This will depend on the type of applications you are using in the guest. Having a larger buffer means that more frames can queue up. Rather than being dropped, frames may sit in the queue longer and still make it to the guest, but with slightly increased latency. In some real-time applications like voice or video, this may not be desirable and packet loss is preferred. That said, most transaction based workloads like DB, email and file services would benefit from a larger buffer.

Q: What about increased overhead?

A: As mentioned, there is a small amount of memory overhead that the guest will use for the increased buffering. This is generally insignificant unless it’s increased across the entire environment. I generally recommend only increasing the RX buffers on VMs that will actually benefit from it.

Q: Can’t the physical NICs on the ESXi host contribute to packet loss as well?

A: They certainly can. Physical NICs on the ESXi hypervisor also have RX queues. These will vary from vendor to vendor and are sometimes configurable in the NIC driver parameters. Most physical NIC drivers are tuned appropriately for heavy traffic bursts, and it’s unusual to see significant loss due to NIC buffers on modern hardware.

Q: Why does my Windows VM have only one RX queue?

A: The Windows VMXNET3 driver has RSS (Receive Side Scaling) disabled by default. Most modern Linux kernels will enable multiqueue support out of the box, but in Windows this will need to be turned on. Be sure to test thoroughly that RSS works correctly and that you see performance benefit.

Q: Is there any impact to existing flows when modifying RX buffers?

A: It’s always best to assume that there will be a traffic impact when modifying any driver settings like RX ring sizes. I can confirm that Windows disables/enables the adapter briefly when changing advanced VMXNET3 settings. The outage may be brief, but you’ll want to proceed with caution in a production environment.

Q: What if I’m using an E1000 adapter or something other than VMXNET3?

A: E1000 and other adapter types will often allow the tweaking of buffers as described in VMware KB 1010071. VMXNET3 has the largest configurable RX buffer sizes available of all the adapters and many other benefits. Unless there is a very specific reason for using an E1000 or other type of adapter, you should really consider moving to VMXNET3.

Conclusion

And there you have it – an often misunderstood setting that can help to mitigate packet loss and improve performance. Please feel free to leave any questions or comments below.

Building a Retro Gaming Rig – Part 2

In Part 1 of this series, I took a close look at the Asus P2B slot-1 motherboard and some CPU options. Today, the focus will be on graphics cards.

Evaluating Video Card Options

If this were a pure DOS build, the card I chose wouldn’t matter very much. I’d simply be interested in something with decent 2D image quality and enough VRAM to support the resolutions I’d want to use. The majority of DOS games don’t offer 3D acceleration, but because I wanted a system that would run some Windows 9x based titles, a 3D card would be ideal for that genuine experience.

The years leading up to the millennium were very exciting from a graphics hardware perspective. In 1998, 3D accelerated cards were much more accessible to the average consumer and major strides were made in performance and price.

3dfx Interactive

If you were at all into PC gaming or PC hardware back in the late nineties, you will be well acquainted with 3dfx Interactive. Although the company went bankrupt in 2002, 3dfx was the pioneer in mainstream 3D gaming technology up until that point. The card that brought them to fame was their 3D-only ‘Voodoo Graphics’ adapter released in 1996. They were also well known for their proprietary Glide API that many games of the era supported. The turning point for me personally was when I first experienced GLQuake on a Voodoo card back in 1996. I remember being totally blown away after seeing the surreal lighting and 3D effects and was determined to buy one. After saving up while working a summer job in 1997, I bought the 6MB Canopus Pure 3D card based on the famous chipset and never regretted it for a moment. I used that card for several years until buying a Voodoo 3 3000 later on.

The 3dfx Voodoo Banshee

Since my retro rig was supposed to be of a 1998 vintage, I had originally started looking at the Voodoo 2 released that year, which was superior to the original in every respect. My next choice was the AGP version of the 3dfx Voodoo 3. Unfortunately, both of these models were fetching well over $100 on eBay, and were outside of my budget for this weekend project. They are simply in high demand and are genuine collectors’ items at this point.

Undeterred, I shifted focus to another part that was a bit less popular but still had all the 3dfx flair of 1998 – the 3dfx Voodoo Banshee. The Banshee was a first for 3dfx in two respects – it was their first single chip 2D/3D solution and their first AGP based card. The 3D portion of the core is almost the same as the popular Voodoo 2, but it had only one texture mapping unit instead of two. The resulting performance drop was partially offset by higher core and memory clock speeds, and it still had plenty of power for games of that generation. As an added bonus, the 2D performance and image quality of the Banshee were excellent – almost as good as the top notch 2D cards of the time – and removed the need for separate 2D and 3D cards in the system.

Today, the Banshee cards don’t sell for quite as much as the Voodoo 2 or Voodoo 3, but can still go for between $60 and $120 on eBay.

3dfx1-1

I was fortunate enough to stumble upon an awesome deal on a 16MB Ensoniq brand Banshee card for only $25 USD. It was probably only still available because the eBay listing didn’t include the words ‘3dfx’ or ‘Voodoo’ in the description. After the shipping, duty and exchange I think it came to under $50 CDN. Not too bad for a rare piece of history.

Ensoniq no longer exists, but was best known for their audio products. They were purchased by Creative Labs shortly after this card was produced. After seeing some of the other Banshee models from other companies, I actually like this card best because of the larger heatsink on it. Most of the other cards have rather small chipset style coolers. They all seem to be passive, but I can confirm that these cards do get rather toasty under load.

3dfx1-2

In 1998, dual monitor outputs and DVI were reserved only for elite workstation cards and specialty products. Just a simple VGA-out can be found on the Ensoniq Banshee. Some cards of this era differentiated themselves with TV S-Video and Composite outputs, but nothing that would interest anyone today.

It was interesting to see two mysterious headers on the card. One is a 40-pin female IDE-style connector, and the other is a 26 pin header of some sort. My first thought was that they may have been for future SLI support, but after doing some digging these turned out to be standard feature connectors. These were found on many cards of this vintage for daughter boards and other add-on modules. Technically, the 36 pin SLI connector found on Voodoo 2 cards is indeed a feature connector of sorts as it allows the cards to communicate over a high-bandwidth channel without having to saturate the PCI bus.

AGP Variations

The earliest 3D cards were all PCI based, but in 1998 many first generation AGP variants were on the market, including the Banshee. A lot of people building retro rigs seem to forget that there were actually several AGP ‘versions’ to hit the market over the years and not all were backward compatible. An AGP 1x card released in 1998 very likely won’t work or even fit in an AGP board released a few years later.

retro-slot1-23

AGP 1x and 2x are very similar and provide power to the card at 3.3V. AGP 4x and 8x came out a few years later and were limited to 1.5V. As you can imagine, putting 3.3V through a card designed for 1.5V could be pretty catastrophic. To prevent these problems, slot and card ‘keying’ were implemented to ensure you couldn’t physically insert a card into an incompatible slot. As you can see above, the 3.3V AGP 1x slot on the ASUS P2B has a plastic ‘key’ about a third of the way in from the left of the slot. A 1.5V 4x/8x slot would have that same key, but in a different position closer to the right of the slot.

Interestingly, some cards that came out later on were actually compatible with both 1.5V and 3.3V AGP slots and had two key indentations on the card. The card in the image above is a 2003 era ATI Radeon 9200 SE. I have used this same card in a newer 8x slot machine as well as the 1x ASUS P2B.

The budget priced Radeon 9200 SE is actually about 9 times more powerful than the Banshee. I had considered using this Windows 98 compatible card in my retro build, but it just didn’t have the same nostalgic value that a genuine 3dfx card does. For now, it’s been serving as a spare.

Conclusion

I was really excited to get a 3dfx card – especially without having to shell out too much. It’ll be great to use a card from that era, and to take advantage of 3dfx’s proprietary Glide API in many games as well. I think that the Banshee is a perfect match for the build and I look forward to getting everything put together!

Next up, I’ll take a look at a couple of sound card options, including both ISA and PCI cards. In an added twist, I also bought a compact micro ATX socket 370 system that I’ll evaluate as well.

Building a Retro Gaming Rig Series

Part 1 – ASUS P2B and Slot-1 CPUs
Part 2 – 3D Video Card Options