performance – vswitchzero

An In-depth Look at SR-IOV NIC Passthrough

SR-IOV or “Single Root I/O Virtualization” is a very interesting feature that can provide virtual machines shared access to physical network cards installed in the hypervisor. This may sound a lot like what a virtual NIC and a vSwitch does, but the feature works very similarly to PCI passthrough, granting a VM direct access to the NIC hardware. In order to understand SR-IOV, it helps to understand how PCI passthrough works. Here is a quote from a post I did a few years ago:

“PCI Passthrough – or VMDirectPath I/O as VMware calls it – is not at all a new feature. It was originally introduced back in vSphere 4.0 after Intel and AMD introduced the necessary IOMMU processor extensions to make this possible. For passthrough to work, you’ll need an Intel processor supporting VT-d or an AMD processor supporting AMD-Vi as well as a motherboard that can support this feature.

In a nutshell, PCI passthrough allows you to give a virtual machine direct access to a PCI device on the host. And when I say direct, I mean direct – the guest OS communicates with the PCI device via IOMMU and the hypervisor completely ignores the card.”

SR-IOV takes PCI passthrough to the next level. Rather than granting exclusive use of the device to a single virtual machine, the device is shared or ‘partitioned’. It can be shared between multiple virtual machines, or even shared between virtual machines and the hypervisor itself. For example, a single 10Gbps NIC could be ‘passed through’ to a couple of virtual machines for direct access, and at the same time it could be attached to a vSwitch being used by other VMs with virtual NICs and vmkernel ports too. Think shared PCI passthrough.

Continue reading “An In-depth Look at SR-IOV NIC Passthrough”

Low-Voltage DDR3 – Performance, Heat and Power

When I built my new compute nodes, I chose PC3L-12800R DIMMs that could run in both standard 1.5V and 1.35V low-voltage modes. What I didn’t realize, however, is that Intel’s specification for 1.35V registered memory on Socket R based systems limits low-voltage operation to 1333MHz. The Supermicro X9SRL-F boards I’m using enforce this, despite the modules being capable of the full 1600MHz at 1.35V.

This meant I could run the modules at 1333MHz and enjoy the power/heat reduction that goes with 1.35V operation, or I could force the modules to run at 1600MHz at the ‘standard’ 1.5 volts. As I considered different configuration options, a lot of questions came to mind. Today I’ll be taking an in-depth look at DDR3 memory bandwidth, latency, power consumption and heat. I hope to answer the following questions in this post:

What is the bandwidth difference between 1333MHz with tighter 9-9-9 timings versus 1600MHz at looser 11-11-11 timings?
What is the latency difference between 1333MHz with tighter 9-9-9 timings versus 1600MHz at looser 11-11-11 timings?
Just how much power savings can be realized at 1.35V versus 1.5V?
How much cooler will my DIMMs run at 1.35V versus 1.5V?
What about overclocking my DIMMs to 1867MHz at 12-12-12 timings?
How does memory bandwidth and latency fair with fewer than 4 DIMMs populated on the SandyBridge-EP platform?

A Note On Overclocking

Although there are IvyBridge-EP CPUs that support higher 1866MHz DIMM operation, my SandyBridge-EP based E5-2670s and PC3L-12800 DIMMs do not officially support this. The Supermicro X9SRL-F allows the forcing of 1866MHz at 12-12-12 timings even though the DIMMs don’t have this speed in their SPD tables. Despite being server-grade hardware, this is an ‘overclock’ plain and simple.

I would never consider doing this in a production environment, but in a lab for educational purposes – it was worth a shot. Thankfully, to my surprise, the system appears completely stable at 1866MHz and 1.5V. Because the DIMMs can technically run at 1600MHz with a lower 1.35V, it would seem logical that at 1.5V they’d have additional headroom available. This certainly appears to be true in my case – even with a mixture of three different brands of PC3L-12800R.

With my overclocking success, I decided to include results for 1866MHz operation at 1.5V to see if it’s worth pushing the DIMMs beyond their rated specification.

Bandwidth and Latency

Back in my days as an avid overclocker, increasing memory frequency would generally provide better overall performance than tightening the CAS, RAS and other timings of the modules. Obviously doing both was ideal but if you had to choose between higher frequency and tighter timings, you’d usually come out ahead by sacrificing tighter timings for higher frequencies.

I fully expected the 1600MHz memory running at 11-11-11 timings to be quicker overall than 1333MHz at 9-9-9, but the question was just how much quicker. To test, I booted one of my X9SRL-F systems from a USB stick running Lubuntu 18.04 and used Intel’s handy ‘Memory Latency Checker’ (MLC) tool. MLC allows you to get several performance metrics including peak bandwidth, latency as well as various cross-socket NUMA measurements that can be handy for multi-socket processor systems. I’ll only be looking at a few metrics:

Peak Read Bandwidth
1:1 Reads/Writes Bandwidth
Idle Latency

memspeed-1

As I expected, despite the looser timings, memory bandwidth increases substantially as the frequency increases. About a 12% boost is obtained going to 1600MHz at 11-11-11 timings. An additional 10% was obtained by overclocking the modules to 1866MHz. Mixed reads and writes get a pretty proportional boost in all three tests.

Continue reading “Low-Voltage DDR3 – Performance, Heat and Power”

Jumbo Frames and VXLAN Performance

Using an 8900 MTU for better VXLAN throughput and lower packet rates.

VXLAN overlay technology is part of what makes software defined networking possible. By encapsulating full frames into UDP datagrams, L2 networks can be stretched across all manners of routed topologies. This breaks down the barriers of physical networking and builds the foundation for the software defined datacenter.

VXLAN or Virtual Extensible LAN is an IETF standard documented in RFC 7348. L2 over routed topologies is made possible by encapsulating entire L2 frames into UDP datagrams. About 50 bytes of outer header data is added to every L2 frame because of this, meaning that for every frame sent on a VXLAN network, both an encapsulation and de-encapsulation task must be performed. This is usually performed by ESXi hosts in software but can sometimes be offloaded to physical network adapters as well.

In a perfect world, this would be done without any performance impact whatsoever. The reality, however, is that software defined wizardry often does have a small performance penalty associated with it. This is unavoidable, but that doesn’t mean there isn’t anything that can be done to help to minimize this cost.

If you’ve been doing some performance testing, you’ve probably noticed that VMware doesn’t post statements like “You can expect X number of Gbps on a VXLAN network”. This is because there are simply too many variables to consider. Everything from NIC type, switches, drivers, firmware, offloading features, CPU count and frequency can play a role here. All these factors must be considered. From my personal experience, I can say that there is a range – albeit a somewhat wide one – of what I’d consider normal. On a modern 10Gbps system, you can generally expect more than 4Gbps but less than 7Gbps with a 1500 MTU. If your NIC supports VXLAN offloading, this can sometimes be higher than 8Gbps. I don’t think I’ve ever seen a system achieve line-rate throughput on a VXLAN backed network with a 1500 MTU regardless of the offloading features employed.

What if we can reduce the amount of encapsulation and de-encapsulation that is so taxing on our hypervisors? Today I’m going to take an in-depth look at just this – using an 8900 MTU to reduce packet rates and increase throughput. The results may surprise you!

Continue reading “Jumbo Frames and VXLAN Performance”

Debunking the VM Link Speed Myth!

10Gbps from a 10Mbps NIC? Why not? Debunking the VM link speed myth once and for all!

** Edit on 11/6/2017: I hadn’t noticed before I wrote this post, but Raphael Schitz (@hypervisor_fr) beat me to the debunking! Please check out his great post on the subject as well here. **

I have been working with vSphere and VI for a long time now, and have spent the last six and a half years at VMware in the support organization. As you can imagine, I’ve encountered a great number of misconceptions from our customers but one that continually comes up is around VM virtual NIC link speed.

Every so often, I’ll hear statements like “I need 10Gbps networking from this VM, so I have no choice but to use the VMXNET3 adapter”, “I reduced the NIC link speed to throttle network traffic” and even “No wonder my VM is acting up, it’s got a 10Mbps vNIC!”

I think that VMware did a pretty good job documenting the role varying vNIC types and link speed had back in the VI 3.x and vSphere 4.0 era – back when virtualization was still a new concept to many. Today, I don’t think it’s discussed very much. People generally use the VMXNET3 adapter, see that it connects at 10Gbps and never look back. Not that the simplicity is a bad thing, but I think it’s valuable to understand how virtual networking functions in the background.

Today, I hope to debunk the VM link speed myth once and for all. Not with quoted statements from documentation, but through actual performance testing.

Continue reading “Debunking the VM Link Speed Myth!”

Boosting vSphere Web Client Performance in ‘Tiny’ Deployments

Getting service health alarms and poor Web Client performance in ‘Tiny’ size deployments? A little extra memory can go a long way if allocated correctly!

In my home lab, I’ve been pretty happy with the vCenter Server ‘Tiny’ appliance deployment size. For the most part, vSphere Web Client performance has been decent and the appliance doesn’t need a lot of RAM or vCPUs.

When I most recently upgraded my lab, I considered using a ‘Small’ deployment but really didn’t want to tie up 16GB of memory – especially with only a small handful of hosts and many services offloaded to an external PSC

Although things worked well for the most part, I had recently been getting vCenter alarms and would get occasional periods of slow refreshes and other oddities.

vsphereclientmem-1 — One of two alarms triggering frequently in my lab environment.

The two specific alarms were service health status alarms with the following text strings associated:

The vmware-dataservice-sca status changed from green to yellow

I’d also see this accompanied by a similar message referring to the vSphere Web Client:

The vsphere-client status changed from green to yellow

After doing some searching online, I quickly found VMware KB 2144950 on the subject. Although the cause of this seems pretty clear – insufficient memory allocation to the vsphere-client service – the workaround steps outlined in the KB are lacking context and could use some elaboration.

Continue reading “Boosting vSphere Web Client Performance in ‘Tiny’ Deployments”

VM Network Performance and CPU Scheduling

Over the years, I’ve been on quite a few network performance cases and have seen many reasons for performance trouble. One that is often overlooked is the impact of CPU contention and a VM’s inability to schedule CPU time effectively.

Today, I’ll be taking a quick look at the actual impact CPU scheduling can have on network throughput.

Testing Setup

To demonstrate, I’ll be using my dual-socket management host. As I did in my recent VMXNET3 ring buffer exhaustion post, I’ll be testing with VMs on the same host and port group to eliminate bottlenecks created by physical networking components. The VMs should be able to communicate as quickly as their compute resources will allow them.

Physical Host:

2x Intel Xeon E5 2670 Processors (16 cores at 2.6GHz, 3.3GHz Turbo)
96GB PC3-12800R Memory
ESXi 6.0 U3 Build 5224934

VM Configuration:

1x vCPU
1024MB RAM
VMXNET3 Adapter (1.1.29 driver with default ring sizes)
Debian Linux 7.4 x86 PAE
iperf 2.0.5

The VMs I used for this test are quite small with only a single vCPU and 1GB of RAM. This was done intentionally so that CPU contention could be more easily simulated. Much higher throughput would be possible with multiple vCPUs and additional RX queues.

The CPUs in my physical host are Xeon E5 2670 processors clocked at 2.6GHz per core. Because this processor supports Intel Turbo Boost, the maximum frequency of each core will vary depending on several factors and can be as high as 3.3GHz at times. To take this into consideration, I will test with a CPU limit of 2600MHz, as well as with no limit at all to show the benefit this provides.

To measure throughput, I’ll be using a pair of Debian Linux VMs running iperf 2.0.5. One will be the sending side and the other the receiving side. I’ll be running four simultaneous threads to maximize throughput and load.

I should note that my testing is far from precise and is not being done with the usual controls and safeguards to ensure accurate results. This said, my aim isn’t to be accurate, but rather to illustrate some higher-level patterns and trends.

Continue reading “VM Network Performance and CPU Scheduling”

VMXNET3 RX Ring Buffer Exhaustion and Packet Loss

ESXi is generally very efficient when it comes to basic network I/O processing. Guests are able to make good use of the physical networking resources of the hypervisor and it isn’t unreasonable to expect close to 10Gbps of throughput from a VM on modern hardware. Dealing with very network heavy guests, however, does sometimes require some tweaking.

I’ll quite often get questions from customers who observe TCP re-transmissions and other signs of packet loss when doing VM packet captures. The loss may not be significant enough to cause a real application problem, but may have some performance impact during peak times and during heavy load.

After doing some searching online, customers will quite often land on VMware KB 2039495 and KB 1010071 but there isn’t a lot of context and background to go with these directions. Today I hope to take an in-depth look at VMXNET3 RX buffer exhaustion and not only show how to increase buffers, but to also to determine if it’s even necessary.

Rx Buffering

Not unlike physical network cards and switches, virtual NICs must have buffers to temporarily store incoming network frames for processing. During periods of very heavy load, the guest may not have the cycles to handle all the incoming frames and the buffer is used to temporarily queue up these frames. If that buffer fills more quickly than it is emptied, the vNIC driver has no choice but to drop additional incoming frames. This is what is known as buffer or ring exhaustion.

Continue reading “VMXNET3 RX Ring Buffer Exhaustion and Packet Loss”