ESG/DLR tmpfs partition fills in NSX 6.3.6 and 6.4.1

If you are running NSX 6.3.6 or 6.4.1, you should take a close look at VMware KB 57003. A newly discovered issue can cause the tmpfs partition of DLRs and ESGs to fill up, rendering the appliances unmanageable.

On a positive note, there should be no datapath impact because of a full tmpfs partition. You just won’t be able to push any configuration changes to the ESG or DLR in this state.

This occurs because of a file related to HA in /run that will slowly grow until it fills the partition. The file in question is ‘ha.cid.Out’ and contains HA diagnostic information. You can find it in the /run/vmware/vshield/cmdOut directory.

If you have a very stable environment, it’s quite possible that you’ll never run into this problem. The ha.cid.Out file is created and updated only after an HA event occurs – a failover or split-brain recovery, for example. Once the file is created, however, it receives regular updates and will inevitably grow.

Based on the rate at which the file grows, a compact-size ESG or DLR has about a month after an HA event before this becomes a problem. Larger ESGs have more memory, and hence larger tmpfs partitions. Below is an estimate based on the tmpfs partition size of each appliance size (a rough calculation follows the list):

All DLRs (256MB tmpfs): 4 weeks
Compact ESG (256MB tmpfs): 4 weeks
Large ESG (497MB tmpfs): 8 weeks
Quad Large ESG (1024MB tmpfs): 4 months
X-Large ESG (3.9GB tmpfs): >1 year
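
As a rough sanity check on those numbers (my own back-of-the-envelope math, not anything from the KB), the compact estimate implies a growth rate of roughly 9 MB per day, and the other figures scale from there. Here’s a quick sketch:

# Back-of-the-envelope estimate only. Assumes ha.cid.Out grows at a steady
# ~9 MB/day (derived from "256MB fills in about 4 weeks"); the real rate will
# depend on how chatty HA is in your environment.
GROWTH_MB_PER_DAY = 256 / (4 * 7)   # ~9.1 MB/day

for label, tmpfs_mb in [("DLR / Compact ESG", 256), ("Large ESG", 497),
                        ("Quad Large ESG", 1024), ("X-Large ESG", 3994)]:
    weeks = tmpfs_mb / GROWTH_MB_PER_DAY / 7
    print(f"{label}: roughly {weeks:.0f} weeks until /run fills")

The output lines up reasonably well with the estimates above (about 4, 8, 16 and 62 weeks respectively).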

Unfortunately, it doesn’t appear that the ha.cid.Out file can be deleted or purged while the ESG/DLR is in operation. The file is locked for editing, and the only safe way to recover is to reboot the appliance. Again, all of the features, including routing and packet forwarding, will continue to work just fine with a full tmpfs partition. You just won’t be able to make any changes.

Disabling ESG HA will prevent this from happening, but I’d argue that being protected by HA is more important than the potential for an ESG to become unmanageable.

You can monitor your ESG’s tmpfs partition using the show system storage CLI command:

esg-lb1.vswitchzero.net-0> show system storage
Filesystem      Size   Used   Avail   Use%   Mounted on
/dev/root       444M   366M   55M     88%    /
tmpfs           497M   80K    497M    1%     /run
/dev/sda2       43M    2.2M   38M     6%     /var/db
/dev/sda3       27M    413K   25M     2%     /var/dumpfiles
/dev/sda4       32M    1.1M   29M     4%     /var/log

If you see the /run usage slowly creeping up at a regular interval, it would be a good idea to start planning for a maintenance window to reboot the appliance.
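
If you have a lot of edges to watch, a small script can flag any appliance where /run is getting full. Here’s a minimal sketch that parses output in the format shown above – how you collect that output (SSH, scheduled jobs, etc.) is up to you, and the 75% threshold is an arbitrary choice of mine:

# Minimal sketch: parse 'show system storage' output (format as shown above)
# and warn when the tmpfs partition mounted on /run crosses a threshold.
WARN_PERCENT = 75   # arbitrary threshold

def check_run_usage(storage_output: str) -> None:
    for line in storage_output.splitlines():
        fields = line.split()
        # Expecting: Filesystem Size Used Avail Use% Mounted-on
        if len(fields) >= 6 and fields[5] == "/run":
            used = int(fields[4].rstrip("%"))
            if used >= WARN_PERCENT:
                print(f"WARNING: /run is {used}% full -- plan a reboot window")
            else:
                print(f"/run usage looks fine ({used}%)")

Feeding it the sample output above would simply report /run usage as fine at 1%.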

I can’t comment on release dates, but it’s very likely that this will be fixed in the next release of 6.4.x, which should be out very soon. The 6.3.x fix for this may be further out, so a jump to 6.4.2 may be your best bet if this proves to be a serious problem for you.

I hope this is helpful.

Home Lab

One of the most important tools I use day-to-day is my lab. Although I’m fortunate to have access to some shared lab resources at VMware, I still choose to maintain a dedicated home lab. I like to have the freedom to build it up, tear it down and configure it in any way I see fit.

I’ve had a few people ask me about my home lab recently, so I wanted to take a moment to share my setup. I’m not going to go too much into how I use the lab or the software side of things, but will stay focused on the hardware for now.

My Goals

I’ve had several iterations of home lab over the years, but my most recent overhaul was done about two years ago in 2016. At that time, I had several goals in mind:

  1. To keep cost low. I chose mainly EOL, second hand hardware that was relatively inexpensive. I often looked for the ‘sweet spot’ to get the best performance for the dollar.
  2. To use server/workstation grade hardware wherever possible. I’ve had some mixed experiences with consumer grade equipment and prefer having IPMI and being able to run large amounts of registered ECC memory.
  3. Low noise. I really didn’t like the noise and heat generated by rackmount gear and tried to stick with custom-built server systems wherever possible.
  4. Power efficiency. Building custom machines with simple cooling systems allowed me to keep power consumption down. I also didn’t see the point of running the lab 24/7 and chose to automate power on and power off activities (see the sketch after this list).
  5. Sized right. Although more RAM and compute power is always desirable, I tried to keep things reasonably sized to keep costs and power consumption down. I wanted to be able to have some flexibility, but would try to keep VMs sized smaller and power down what I didn’t need.
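
For what it’s worth, the power on/off automation mentioned in point four doesn’t need to be fancy. The sketch below shows the sort of thing the Raspberry Pi can run from cron to power the hosts on via IPMI – the BMC hostnames and credentials are placeholders, and a matching power-off job should gracefully shut down VMs and hosts first rather than simply cutting power:

# Rough sketch: power the lab hosts on via IPMI from a cron job.
# BMC addresses and credentials below are placeholders, not my real ones.
import subprocess

BMC_HOSTS = ["mgmt-ipmi.lab.local", "compute1-ipmi.lab.local",
             "compute2-ipmi.lab.local", "compute3-ipmi.lab.local"]

for bmc in BMC_HOSTS:
    # Equivalent to running 'ipmitool ... chassis power on' by hand
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc,
         "-U", "admin", "-P", "password", "chassis", "power", "on"],
        check=True,
    )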

The Lab

[Image: homelabv1]

I’ll get more into each component, but here’s a summary:

  • 1x Management Node (2x Xeon E5-2670, 96GB RAM)
  • 3x Compute Nodes (Xeon X3440, 16GB RAM)
  • 1x FreeNAS Server (Dell T110, Xeon 3430, 8GB RAM)
  • 1x Raspberry Pi 3 Model B (Automation and remote access)
  • Quanta LB6M 24 port 10Gbps Switch (24x SFP+ ports)
  • D-link DGS-1210-16 Managed Switch (16x copper ports, 4x SFP)
  • Cyber Power PFCLCD1500 UPS system

All of the equipment sits comfortably in a wire shelf/rack in a corner of my unfinished basement. Here it can stay nice and cool and the noise it generates doesn’t bother anyone.

Continue reading “Home Lab”

Jumbo Frames and VXLAN Performance

VXLAN overlay technology is part of what makes software defined networking possible. By encapsulating full L2 frames into UDP datagrams, VXLAN allows L2 networks to be stretched across all manner of routed topologies. This breaks down the barriers of physical networking and builds the foundation for the software defined datacenter.

VXLAN, or Virtual Extensible LAN, is an IETF standard documented in RFC 7348. L2 over routed topologies is made possible by encapsulating entire L2 frames into UDP datagrams. About 50 bytes of outer header data is added to every frame as a result, and for every frame sent on a VXLAN network both an encapsulation and a de-encapsulation task must be performed. This is usually done by ESXi hosts in software but can sometimes be offloaded to physical network adapters as well.

In a perfect world, this would be done without any performance impact whatsoever. The reality, however, is that software defined wizardry often does have a small performance penalty associated with it. This is unavoidable, but that doesn’t mean there isn’t anything that can be done to help to minimize this cost.

If you’ve been doing some performance testing, you’ve probably noticed that VMware doesn’t publish statements like “You can expect X Gbps on a VXLAN network”. This is because there are simply too many variables to consider. Everything from NIC type, switches, drivers, firmware and offloading features to CPU count and frequency can play a role here. From my personal experience, I can say that there is a range – albeit a somewhat wide one – of what I’d consider normal. On a modern 10Gbps system, you can generally expect more than 4Gbps but less than 7Gbps with a 1500 MTU. If your NIC supports VXLAN offloading, this can sometimes be higher than 8Gbps. I don’t think I’ve ever seen a system achieve line-rate throughput on a VXLAN-backed network with a 1500 MTU, regardless of the offloading features employed.

What if we can reduce the amount of encapsulation and de-encapsulation that is so taxing on our hypervisors? Today I’m going to take an in-depth look at just this – using an 8900 MTU to reduce packet rates and increase throughput. The results may surprise you!
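
To put the ‘reduce packet rates’ idea into perspective before diving into the results, here’s a quick back-of-the-envelope calculation of my own. The ~50 bytes mentioned earlier breaks down as 14 bytes of outer Ethernet, 20 bytes of outer IPv4, 8 bytes of UDP and 8 bytes of VXLAN header, and moving from a 1500 to an 8900 MTU cuts the number of encapsulation operations per gigabyte of TCP payload by roughly a factor of six:

# Back-of-the-envelope VXLAN overhead math. Assumes IPv4 outer headers, no
# outer 802.1Q tag, TCP with no options; ignores preamble/FCS/inter-frame gap.
OUTER_ETH, OUTER_IP, OUTER_UDP, VXLAN_HDR = 14, 20, 8, 8
ENCAP_OVERHEAD = OUTER_ETH + OUTER_IP + OUTER_UDP + VXLAN_HDR   # 50 bytes

def encap_ops_per_gigabyte(inner_mtu: int) -> int:
    """Encap/de-encap operations needed to move 1 GB of TCP payload."""
    tcp_payload = inner_mtu - 20 - 20          # minus inner IP and TCP headers
    return -(-10**9 // tcp_payload)            # ceiling division

for mtu in (1500, 8900):
    ops = encap_ops_per_gigabyte(mtu)
    wire_frame = mtu + 14 + ENCAP_OVERHEAD     # inner frame + VXLAN outer headers
    print(f"MTU {mtu}: ~{ops:,} encap operations per GB, "
          f"~{wire_frame} bytes on the wire per frame")

Fewer, larger frames means far fewer encapsulation tasks for the hypervisor per gigabyte moved – which is exactly the effect the testing in the full post sets out to measure.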

Continue reading “Jumbo Frames and VXLAN Performance”

No Bridged Adapters in VMware Workstation

Although I only support vSphere and VMware’s enterprise products, I use VMware Workstation every day. My work laptop runs Windows 10, but I maintain a couple of Linux VMs for day to day use as well. After a large Windows 10 feature update – 1709 I believe – I noticed that my Linux VMs were booting up without any networking. Their virtual adapters were simply reporting ‘link down’.

I had not changed any of the Workstation network configuration since I had installed it and always just used the defaults. For my guest VMs, I had always preferred to use ‘Bridged’ networking rather than NAT:

[Image: wsnets-1]

What I found odd was that the VMnet0 connection usually associated with bridging was nowhere to be found in the ‘Virtual Network Editor’.

[Image: wsnets-2]

When trying to add a new bridged network, I’d get the following error:

[Image: wsnets-3]

The exact text is:

“Cannot change network to bridged: There are no un-bridged host network adapters.”

Clearly, Workstation thinks the adapters are already bridged, despite none being listed in the Virtual Network Editor.

Continue reading “No Bridged Adapters in VMware Workstation”

NSX Troubleshooting Scenario 11 – Solution

Welcome to the eleventh installment of my NSX troubleshooting series. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall in the first half, our fictional customer was seeing some HA heartbeat channel alarms in the new HTML5 NSX dashboard.

[Image: tshoot11a-1]

After doing some digging, we were able to determine that the ESG had an interface configured for HA on VLAN 16 and that from the CLI, the edge really was complaining about being unable to reach its peer.

[Image: tshoot11a-3]

You probably noticed in the first half that the HA interface doesn’t have an IP address configured. This may look odd, but it’s fine. Even if you did specify a custom /30 IP address for HA purposes, it would not show up as an interface IP address here. Rather, you’d need to look for one specified in the HA configuration settings here:

Continue reading “NSX Troubleshooting Scenario 11 – Solution”

NSX Troubleshooting Scenario 11

Welcome to the eleventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“One of my ESXi hosts has a hardware problem. Ever since putting it into maintenance mode, I’m getting edge high availability alarms in the NSX dashboard. I think this may be a false alarm, because the two appliances are in the correct active and standby roles and not in split-brain. Why is this happening?”

A good question. This customer is using NSX 6.4.0, so the new HTML5 dashboard is what they are referring to here. Let’s see the dashboard alarms first hand.

[Image: tshoot11a-1]

This is alarm code 130200, which indicates a failed HA heartbeat channel. This simply means that the two ESGs can’t talk to each other on the HA interface that was specified. Let’s have a look at edge-3, which is the ESG in question.
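
As an aside, if you prefer to confirm the edge HA configuration programmatically rather than through the UI, something along these lines can pull it from the NSX-v REST API. The endpoint path is from memory and the manager address, edge ID and credentials are placeholders, so treat this as a sketch and verify against the API guide for your release:

# Sketch only: pull an edge's HA configuration from the NSX-v REST API.
# Endpoint path is from memory; hostname, edge ID and credentials are
# placeholders -- verify against the API guide for your NSX release.
import requests

NSX_MANAGER = "nsxmanager.lab.local"
EDGE_ID = "edge-3"

resp = requests.get(
    f"https://{NSX_MANAGER}/api/4.0/edges/{EDGE_ID}/highavailability/config",
    auth=("admin", "password"),
    verify=False,   # lab only -- self-signed certificate
)
resp.raise_for_status()
print(resp.text)    # XML describing the HA vNIC, timers and any HA IPs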

Continue reading “NSX Troubleshooting Scenario 11”

Using the Upgrade Coordinator in NSX 6.4

If you’ve ever gone through an NSX upgrade, you know how many components there are to upgrade. You’ve got your NSX manager appliances, control cluster, ESXi host VIBs, edges, DLR and even guest introspection appliances. In the past, every one of these needed to be upgraded independently and in the correct order.

VMware hopes to make this process a lot more straightforward with the release of the new ‘Upgrade Coordinator’ feature. This is now included as of 6.4.0 in the HTML5 client.

The aim of the upgrade coordinator is to create an upgrade plan or checklist and then to execute it in the correct order. There are many aspects of the upgrade plan that can be customized, but for those looking for maximum automation, a single-click upgrade option exists as well.

It is important to note that although the upgrade coordinator helps to take some of the guess work out of upgrading, there are still tasks and planning you’ll want to do ahead of time. If you haven’t already, please read my Ten Tips for a Successful NSX Upgrade post.

Today I’ll be using the upgrade coordinator to go from 6.3.3 to 6.4.0 and walk you through the process.

Upgrading NSX Manager

Although the upgrade coordinator plan covers numerous NSX components, NSX manager is not one of them. You’ll still need to use the good old manager UI upgrade process as described on page 36 of the NSX 6.4 upgrade guide. Thankfully, this is the easiest part of the upgrade.

You’ll also notice that I can use the upgrade coordinator for my lab upgrade even though I’m currently at a 6.3.x release. This is because the NSX manager is upgraded first, which adds the management plane functionality used for the rest of the upgrade.

Note: If you are using a Cross-vCenter deployment of NSX, be sure to upgrade your primary, followed by all secondary managers before proceeding with the rest of the upgrade.

[Image: upgco-1]

Upgrading NSX Manager to 6.4.x should look very familiar as the process really hasn’t changed. Be sure to heed the warning banner about taking a backup before proceeding. For more info on this, please see my Ten Tips for a Successful NSX Upgrade post.

Continue reading “Using the Upgrade Coordinator in NSX 6.4”

The 286 Revival

Being a retro PC enthusiast, my eyes are always open for deals on old hardware. A couple of weeks ago I came across an eBay listing for an as-is “Motherboard with ISA slots”. Looking closely at the posted images, I could see that the board was late-80s to early-90s vintage with sockets for individual memory ICs rather than the usual 30-pin SIMMs. Straining my eyes, I could faintly make out the markings on a Siemens brand 12MHz 286 processor. Having never owned a 286, I thought this may make a fun new project.

It was listed as-is because the seller didn’t have the hardware to test it. This is always a risky proposition, but when dealing with AT based systems, chances are that most people genuinely won’t have what’s needed. This is especially true if the seller doesn’t specialize in vintage hardware – which seemed to be the case here. At only $17.99 CDN, I thought it was worth the risk and I bought it.

Continue reading “The 286 Revival”

Missing Labels in the HTML5 Plugin with NSX 6.4.

If you recently upgraded to NSX 6.4, you are probably anxious to check out the new HTML5 plugin. VMware added some limited functionality in HTML5, including the new dashboard and upgrade coordinator, as well as packet capture and support bundle collection tools. After upgrading NSX manager, you may notice that the plugin does not look the way it should. Many labels are missing. Rather than seeing tab titles like ‘Overview’ and ‘System Scale’, you see ‘dashboard.button.label.overview’ and ‘dashboard.button.label.systemScale’:

[Image: html5labels-1]

Obviously, things aren’t displaying as they should be, and some views – like the upgrade coordinator – are practically unusable:

[Image: html5labels-2]

Continue reading “Missing Labels in the HTML5 Plugin with NSX 6.4.”

Console Mouse Not Working in Windows VMs

I recently ran into some problems while deploying a Windows Server 2012 R2 VM in my vSphere 6.5 U2 lab. I’ve come to expect that the console mouse response is going to be terrible until VMware Tools is installed, but for some odd reason I had no mouse control whatsoever. Thinking it may be a quirk of the Web Console, I tried both the Remote Console and the HTML5 client to no avail.

The VM appeared to be healthy and would register keyboard input, but the mouse cursor was either erratic or wouldn’t move at all. Thinking that I just needed to battle on and get Tools installed, I attempted to use the keyboard for this purpose – what a chore. You’d think it would have been easy, but the installer kept losing focus and falling behind other open windows. Many of the Windows keyboard shortcuts I’d normally use were not functioning because they registered on my laptop – not in the console. I couldn’t RDP to the VM either, because the NIC still needed to be configured with a valid IP address.

After doing a bit of research, it appeared that display scaling could cause all sorts of mouse issues – but this didn’t appear to be applicable in my case. That’s when I stumbled upon a communities thread that mentioned adding a USB controller to the VM. Even though my VM was ‘Hardware Version 13’, the USB 2.0 controller isn’t added by default.

I managed to get to Device Manager using the keyboard, and you can see that the virtual hardware uses a PS/2 mouse in the absence of a USB controller:

[Image: consolemouse-2]

I then went ahead and added the basic USB 2.0 controller to the VM and booted it up.
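
If you’d rather make this change programmatically – handy if a template is missing the controller across many VMs – a rough pyVmomi sketch along these lines should do the same thing as the GUI ‘Add USB Controller’ step. The vCenter address, credentials and VM name are placeholders, and the VM should be powered off first:

# Rough pyVmomi sketch: add a basic USB 2.0 controller to a VM.
# vCenter address, credentials and the VM name are placeholders.
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVmomi import vim

# SmartConnectNoSSL skips certificate verification -- fine for a lab
si = SmartConnectNoSSL(host="vcenter.lab.local",
                       user="administrator@vsphere.local", pwd="password")
try:
    # Find the VM by name using a simple container view search
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "win2012r2-test")

    # Device change spec that adds the USB controller to the VM
    usb_spec = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
        device=vim.vm.device.VirtualUSBController(),
    )
    task = vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=[usb_spec]))
    # (Waiting on the task and error handling omitted for brevity)
finally:
    Disconnect(si)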

Continue reading “Console Mouse Not Working in Windows VMs”