NSX 6.4.3 Now Available!

Express maintenance release fixes two discovered issues.

If it feels like 6.4.2 was just released, you’d be correct – only three weeks ago. The new 6.4.3 release (build 9927516) is what’s referred to as an express maintenance release. These releases aim to correct specific customer-identified problems as quickly as possible rather than waiting many months for the next full patch release.

In this release, only two identified bugs have been fixed. The first is an SSO issue that can occur in environments with multiple PSCs:

“Fixed Issue 2186945: NSX Data Center for vSphere 6.4.2 will result in loss of SSO functionality under specific conditions. NSX Data Center for vSphere cannot connect to SSO in an environment with multiple PSCs or STS certificates after installing or upgrading to NSX Data Center for vSphere 6.4.2.”

The second is an issue with IP sets that can impact third-party security products – Palo Alto Networks and Check Point Net-X services, for example:

“Issue 2186968: Static IPset not reported to containerset API call. If you have service appliances, NSX might omit IP sets in communicating with Partner Service Managers. This can lead to partner firewalls allowing or denying connections incorrectly. Fixed in 6.4.3.”

You can find more information on these problems in VMware KB 57770 and KB 57834.

So knowing that these are the only two fixes included, the question obviously becomes – do I really need to upgrade?

If you are running 6.4.2 today, it depends on your environment. If you have more than one PSC associated with the vCenter Server that NSX Manager connects to, or if you use third-party firewall products that work in conjunction with NSX, the answer is yes. If not, there is really no benefit to upgrading to 6.4.3 and it would be best to save your efforts for the next major release.

That said, if you were already planning an upgrade to 6.4.2, it only makes sense to go to 6.4.3 instead. You’d get all the benefits of 6.4.2 plus these two additional fixes.

Kudos to the VMware NSBU engineering team for the quick turnaround in getting these issues fixed and 6.4.3 out the door.


Manual Upgrade of NSX Host VIBs

Complete manual control of the NSX host VIB upgrade process without the use of vSphere DRS.

NSX host upgrades are well automated these days. By taking advantage of ‘fully automated’ DRS, hosts in a cluster can be evacuated, put in maintenance mode, upgraded, and even rebooted without any user intervention. By relying on DRS for resource scheduling, NSX doesn’t have to worry about taking down too many hosts simultaneously, and the process can generally be completed without end-users even noticing.

But what if you don’t want this level of automation? Maybe you’ve got very sensitive VMs that can’t be migrated, or VMs pinned to hosts for some reason. Or maybe you just want maximum control of the upgrade process and which hosts are upgraded – and when.

There is no reason why you can’t have full control of the host upgrade process and leave DRS in manual mode. This is indeed supported.

Most of the documentation and guides out there assume that people will want to take advantage of DRS-driven upgrades, but that isn’t the only supported method. Today I’ll be walking through a fully manual upgrade in my lab as I upgrade to NSX 6.4.1.

Step 1 – Clicking the Upgrade Link

Once you’ve upgraded your NSX manager and control cluster, you should be ready to begin tackling your ESXi host clusters. Before you proceed, you’ll need to ensure your host clusters have DRS set to ‘Manual’ mode. Don’t disable DRS – that will get rid of your resource pools. Manual mode is sufficient.

Next, you’ll need to browse to the usual ‘Installation’ section in the UI and click on the ‘Host Preparation’ tab. From here, it’s now safe to click the ‘Upgrade Available’ link on the cluster to begin the upgrade process. Because DRS is in manual mode, nothing will actually happen yet. Hosts can’t be evacuated, and as a result, VIBs can’t be upgraded. In essence, the upgrade has started, but immediately stalls and awaits manual intervention.

 

upgnodrs-3
This upgrade is essentially hung up waiting for hosts to enter maintenance mode.

 

In 6.4.1 as shown above, a clear banner message is displayed reminding you that DRS is in manual mode and that hosts must be manually put in maintenance mode.

Continue reading “Manual Upgrade of NSX Host VIBs”

NSX Controller Issues with vRNI 3.8

Just a quick PSA to let everyone know that vRNI 3.8 Patch 4 (201808101430) is now available for download. If you are running vRNI 3.8 with NSX, be sure to patch as soon as possible or disable controller polling based on instructions in the workaround section of KB 57311.

Some changes were made in vRNI 3.8 for NSX controller polling. It now uses both the NSX Central CLI as well as SSH sessions to obtain statistics and information. In some situations, excessive numbers of SSH sessions are left open and memory exhaustion on controller nodes can occur.

If you do find a controller in this state, a reboot will get it back to normal again. Just be sure to keep a close eye on your control cluster. If two or three go down, you’ll lose controller majority and will likely run into control plane and data plane problems for VMs on logical switches or VMs using DLR routing.

More information on the NSX controller out-of-memory issue can be found in VMware KB 57311. For more information on vRNI 3.8 Patch 4 – including how to install it – see VMware KB 57683.

NSX 6.4.2 Now Available!

It’s always an exciting day when a new build of NSX is released. As of August 21st, NSX 6.4.2 (Build 9643711) is now available for download. Despite being just a ‘dot’ release, VMware has included numerous functional enhancements in addition to the usual bug fixes.

One of the first things you’ll probably notice is that VMware is now referring to ‘NSX for vSphere’ as ‘NSX Data Center for vSphere’. I’m not sure the name has a good ring to it, but we’ll go with that.

A few notable new features:

More HTML5 Enabled Features: It seems VMware is adding additional HTML5 functionality with each release now. In 6.4.2, we can now access the TraceFlow, User Domains, Audit Logs and Tasks and Events sections from the HTML5 client.

Multicast Support: This is a big one. NSX 6.4.2 now supports both IGMPv2 and PIM Sparse on both DLRs and ESGs. I hope to take a closer look at these changes in a future post.

MAC Limit Increase: Traditionally, we’ve always recommended limiting each logical switch to a /22 or smaller network to avoid exceeding the 2048 MAC entry limit. NSX 6.4.2 now doubles this to 4096 entries per logical switch.
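As a rough sanity check on those numbers, here’s a back-of-the-envelope sketch. The helper name and the one-MAC-per-vNIC assumption are mine, not VMware’s – in practice, VMs with multiple vNICs (or nested workloads) consume more than one entry per host, which is why the /22 guidance leaves headroom below the table limit:

```python
# Back-of-the-envelope check of how subnet size relates to the per-logical-
# switch MAC table limit. Assumes roughly one MAC entry per host vNIC;
# real environments can consume more entries per VM.

def usable_hosts(prefix_len: int) -> int:
    """Usable IPv4 host addresses in a subnet (minus network/broadcast)."""
    return 2 ** (32 - prefix_len) - 2

OLD_MAC_LIMIT = 2048  # per logical switch, pre-6.4.2
NEW_MAC_LIMIT = 4096  # per logical switch, NSX 6.4.2 and later

for prefix in (21, 22, 23):
    hosts = usable_hosts(prefix)
    print(f"/{prefix}: {hosts} usable hosts "
          f"(old limit {OLD_MAC_LIMIT}, new limit {NEW_MAC_LIMIT})")
```

A /22 caps you at 1022 hosts – comfortably under the old 2048-entry limit even with multi-vNIC VMs – and the doubled limit in 6.4.2 gives the same headroom to a /21.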

L7 Firewall Improvements: Additional L7 contexts added including EPIC, Microsoft SQL and BLAST AppIDs.

Firewall Rule Hit Count: Easily see which firewall rules are being hit, and which are not with these counters.

Firewall Section Locking: Great for allowing multiple people to work on the DFW simultaneously without conflicting with each other.

Additional Scale Dashboard Metrics: There are 25 new metrics added to ensure you stay within supported limits.

Controller NTP, DNS and Syslog: This is now finally exposed in the UI and fully supported. As someone who frequently looks at log bundles, it’ll be nice to finally have accurate timekeeping on controller nodes.

On the resolved issues front, I’m happy to report that 6.4.2 includes 21 documented bug fixes. You can find the full list in the release notes, but a couple of very welcome ones include issues 2132361 and 2147002. Those who are using Guest Introspection on NSX 6.4.1 should consider upgrading as soon as possible due to the service and performance problems outlined in KB 56734. NSX 6.4.2 corrects this problem.

Another issue fixed in 6.4.2 – though not listed in the release notes, possibly an oversight – is the full ESG tmpfs partition issue in 6.4.1. You can find more information on this issue in KB 57003 as well as in a recent post I did on it here.

Here are the relevant links for NSX 6.4.2 (Build 9643711):

I’m looking forward to getting my lab upgraded and will be trying out a few of these new features. Remember, if you are planning to upgrade, be sure to do the necessary planning and preparation.

ESG/DLR tmpfs partition fills in NSX 6.3.6 and 6.4.1

If you are running NSX 6.3.6 or 6.4.1, you should take a close look at VMware KB 57003. A newly discovered issue can result in the tmpfs partition of DLRs and ESGs filling up, rendering the appliances unmanageable.

On a positive note, there should be no datapath impact because of a full tmpfs partition. You just won’t be able to push any configuration changes to the ESG or DLR in this state.

This occurs because of a file related to HA in /run that slowly grows until it fills the partition. The file in question is ‘ha.cid.Out’ and contains HA diagnostic information. You can find it in the /run/vmware/vshield/cmdOut directory.

If you have a very stable environment, it’s quite possible that you’ll never run into this problem. The ha.cid.Out file is created and updated only after an HA event occurs – like a failover or split-brain recovery for example. Once the file is created, however, it receives regular updates and will inevitably grow.

Based on the rate at which the file grows, a compact-size ESG or DLR has about a month after an HA event before this becomes a problem. Larger ESGs have more memory, and hence larger tmpfs partitions. Below is an estimate based on the tmpfs partition size of each appliance size:

All DLRs (256MB tmpfs): 4 weeks
Compact ESG (256MB tmpfs): 4 weeks
Large ESG (497MB tmpfs): 8 weeks
Quad Large ESG (1024MB tmpfs): 4 months
X-Large ESG (3.9GB tmpfs): >1 year
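Working backward from these estimates (assuming roughly linear growth of about 64 MB per week – a figure inferred from the table above, not an official VMware number), you can sketch the fill time for any appliance size:

```python
# Rough fill-time estimate for the tmpfs partition after an HA event.
# Assumption: ha.cid.Out grows roughly linearly at ~64 MB/week, a rate
# inferred from the table above (256 MB / 4 weeks) rather than documented.

GROWTH_MB_PER_WEEK = 64.0

def weeks_until_full(tmpfs_mb: float, used_mb: float = 0.0) -> float:
    """Weeks until tmpfs fills, given current usage and assumed growth rate."""
    return (tmpfs_mb - used_mb) / GROWTH_MB_PER_WEEK

for name, size_mb in [("DLR / Compact ESG", 256),
                      ("Large ESG", 497),
                      ("Quad Large ESG", 1024),
                      ("X-Large ESG", 3994)]:
    print(f"{name}: ~{weeks_until_full(size_mb):.0f} weeks")
```

The numbers line up with the table: roughly 4 weeks for a 256 MB partition, 8 weeks for a Large ESG, about 4 months for a Quad Large, and over a year for an X-Large.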

Unfortunately, it doesn’t appear that the ha.cid.Out file can be deleted or purged while the ESG/DLR is in operation. The file is locked for editing and the only safe way to recover is to reboot the appliance. Again, all of the features including routing and packet forwarding will continue to work just fine with a full tmpfs partition. You just won’t be able to make any changes.

Disabling ESG HA will prevent this from happening, but I’d argue that being protected by HA is more important than the potential for an ESG to become unmanageable.

You can monitor your ESG’s tmpfs partition using the show system storage CLI command:

esg-lb1.vswitchzero.net-0> show system storage
Filesystem      Size   Used   Avail   Use%   Mounted on
/dev/root       444M   366M   55M     88%    /
tmpfs           497M   80K    497M    1%     /run
/dev/sda2       43M    2.2M   38M     6%     /var/db
/dev/sda3       27M    413K   25M     2%     /var/dumpfiles
/dev/sda4       32M    1.1M   29M     4%     /var/log

If you see it slowly creeping up in size at a regular interval, it would be a good idea to start planning for a maintenance window to reboot the appliance.
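If you pull this output regularly, a small script can flag partitions that are creeping toward full. The parser below is a sketch of mine, not a VMware tool – the function names are illustrative, and the column layout is assumed to match the sample output above:

```python
# Sketch: parse df-style 'show system storage' output and flag any
# filesystem whose Use% exceeds a threshold. Column layout assumed to
# match the sample CLI output shown above.

def parse_storage(output: str):
    """Return a list of (filesystem, use_pct, mountpoint) tuples."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        fs, use_pct, mount = parts[0], int(parts[4].rstrip("%")), parts[5]
        rows.append((fs, use_pct, mount))
    return rows

def over_threshold(output: str, pct: int = 80):
    """Filesystems at or above the given usage percentage."""
    return [(fs, use, mnt) for fs, use, mnt in parse_storage(output)
            if use >= pct]

sample = """\
Filesystem      Size   Used   Avail   Use%   Mounted on
/dev/root       444M   366M   55M     88%    /
tmpfs           497M   80K    497M    1%     /run
/dev/sda2       43M    2.2M   38M     6%     /var/db
"""
print(over_threshold(sample))  # flags /dev/root at 88%
```

Watching the /run line in particular over a few days would reveal whether ha.cid.Out is growing and roughly how long you have before a reboot window is needed.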

I can’t comment on release dates, but it’s very likely that this will be fixed in the next 6.4.x release, which should be out very soon. The 6.3.x fix may be further out, so a jump to 6.4.2 may be your best bet if this proves to be a serious problem for you.

I hope this is helpful.

Jumbo Frames and VXLAN Performance

VXLAN overlay technology is part of what makes software defined networking possible. By encapsulating full frames into UDP datagrams, L2 networks can be stretched across all manners of routed topologies. This breaks down the barriers of physical networking and builds the foundation for the software defined datacenter.

VXLAN, or Virtual Extensible LAN, is an IETF standard documented in RFC 7348. L2 over routed topologies is made possible by encapsulating entire L2 frames into UDP datagrams. About 50 bytes of outer header data is added to every frame, and for every frame sent on a VXLAN network, both an encapsulation and a de-encapsulation task must be performed. This is usually done in software by the ESXi hosts, but can sometimes be offloaded to physical network adapters as well.

In a perfect world, this would be done without any performance impact whatsoever. The reality, however, is that software defined wizardry often does have a small performance penalty associated with it. This is unavoidable, but that doesn’t mean there isn’t anything that can be done to help to minimize this cost.

If you’ve been doing some performance testing, you’ve probably noticed that VMware doesn’t post statements like “you can expect X number of Gbps on a VXLAN network”. That’s because there are simply too many variables to consider – everything from NIC type, switches, drivers, firmware, offloading features, CPU count and frequency can play a role here. From my personal experience, I can say that there is a range – albeit a somewhat wide one – of what I’d consider normal. On a modern 10Gbps system, you can generally expect more than 4Gbps but less than 7Gbps with a 1500 MTU. If your NIC supports VXLAN offloading, this can sometimes be higher than 8Gbps. I don’t think I’ve ever seen a system achieve line-rate throughput on a VXLAN-backed network with a 1500 MTU, regardless of the offloading features employed.
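To get a feel for why per-packet cost dominates, here’s a rough calculation. The helper names are mine, the ~50-byte outer-header figure comes from the text above, and the 40-byte IP+TCP header assumption is a simplification (real overhead varies with options and VLAN tags):

```python
# Illustration of why a larger MTU helps VXLAN throughput: fewer packets,
# and therefore fewer encap/decap operations, per unit of payload.
# Header sizes are approximations for a simple TCP-over-IPv4 case.

VXLAN_OVERHEAD = 50  # approximate outer headers added per frame (bytes)

def packets_per_gb(mtu: int, ip_tcp_headers: int = 40) -> int:
    """Packets needed to carry 1 GB of TCP payload at a given MTU."""
    payload_per_packet = mtu - ip_tcp_headers
    return -(-10**9 // payload_per_packet)  # ceiling division

def wire_efficiency(mtu: int, ip_tcp_headers: int = 40) -> float:
    """Fraction of each encapsulated frame that is TCP payload."""
    return (mtu - ip_tcp_headers) / (mtu + VXLAN_OVERHEAD)

for mtu in (1500, 8900):
    print(f"MTU {mtu}: {packets_per_gb(mtu):,} packets/GB, "
          f"{wire_efficiency(mtu):.1%} payload efficiency")
```

At a 1500 MTU the hypervisor performs roughly six times as many encapsulation operations per gigabyte as at 8900, which goes a long way toward explaining the throughput gap.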

What if we can reduce the amount of encapsulation and de-encapsulation that is so taxing on our hypervisors? Today I’m going to take an in-depth look at just this – using an 8900 MTU to reduce packet rates and increase throughput. The results may surprise you!

Continue reading “Jumbo Frames and VXLAN Performance”

NSX Troubleshooting Scenario 11 – Solution

Welcome to the eleventh installment of my NSX troubleshooting scenario series. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall in the first half, our fictional customer was seeing some HA heartbeat channel alarms in the new HTML5 NSX dashboard.

tshoot11a-1

After doing some digging, we were able to determine that the ESG had an interface configured for HA on VLAN 16 and that from the CLI, the edge really was complaining about being unable to reach its peer.

tshoot11a-3

You probably noticed in the first half, that the HA interface doesn’t have an IP address configured. This may look odd, but it’s fine. Even if you did specify a custom /30 IP address for HA purposes, it would not show up as an interface IP address here. Rather, you’d need to look for one specified in the HA configuration settings here:

Continue reading “NSX Troubleshooting Scenario 11 – Solution”

NSX Troubleshooting Scenario 11

Welcome to the eleventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“One of my ESXi hosts has a hardware problem. Ever since putting it into maintenance mode, I’m getting edge high availability alarms in the NSX dashboard. I think this may be a false alarm, because the two appliances are in the correct active and standby roles and not in split-brain. Why is this happening?”

A good question. This customer is using NSX 6.4.0, so the new HTML5 dashboard is what they are referring to here. Let’s see the dashboard alarms first hand.

tshoot11a-1

This is alarm code 130200, which indicates a failed HA heartbeat channel. This simply means that the two ESGs can’t talk to each other on the HA interface that was specified. Let’s have a look at edge-3, which is the ESG in question.

Continue reading “NSX Troubleshooting Scenario 11”

Using the Upgrade Coordinator in NSX 6.4

If you’ve ever gone through an NSX upgrade, you know how many components there are to upgrade. You’ve got your NSX manager appliances, control cluster, ESXi host VIBs, edges, DLR and even guest introspection appliances. In the past, every one of these needed to be upgraded independently and in the correct order.

VMware hopes to make this process a lot more straightforward with the release of the new ‘Upgrade Coordinator’ feature, included as of 6.4.0 in the HTML5 client.

The aim of the upgrade coordinator is to create an upgrade plan or checklist and then execute it in the correct order. There are many aspects of the upgrade plan that can be customized, but for those looking for maximum automation, a single-click upgrade option exists as well.

It is important to note that although the upgrade coordinator helps take some of the guesswork out of upgrading, there are still tasks and planning you’ll want to do ahead of time. If you haven’t already, please read my Ten Tips for a Successful NSX Upgrade post.

Today I’ll be using the upgrade coordinator to go from 6.3.3 to 6.4.0 and walk you through the process.

Upgrading NSX Manager

Although the upgrade coordinator plan covers numerous NSX components, NSX manager is not one of them. You’ll still need to use the good old manager UI upgrade process as described on page 36 of the NSX 6.4 upgrade guide. Thankfully, this is the easiest part of the upgrade.

You’ll also notice that I can use the upgrade coordinator for my lab upgrade even though I’m at a 6.3.x release currently. This is because the NSX manager is upgraded first, adding this management plane functionality to be used for the rest of the upgrade.

Note: If you are using a Cross-vCenter deployment of NSX, be sure to upgrade your primary, followed by all secondary managers before proceeding with the rest of the upgrade.

upgco-1

Upgrading NSX Manager to 6.4.x should look very familiar as the process really hasn’t changed. Be sure to heed the warning banner about taking a backup before proceeding. For more info on this, please see my Ten Tips for a Successful NSX Upgrade post.

Continue reading “Using the Upgrade Coordinator in NSX 6.4”

Missing Labels in the HTML5 Plugin with NSX 6.4.

If you recently upgraded to NSX 6.4, you are probably anxious to check out the new HTML5 plugin. VMware added some limited HTML5 functionality, including the new dashboard and upgrade coordinator, as well as packet capture and support bundle collection tools. After upgrading NSX Manager, you may notice that the plugin doesn’t look the way it should – many labels are missing. Rather than seeing tab titles like ‘Overview’ and ‘System Scale’, you see ‘dashboard.button.label.overview’ and ‘dashboard.button.label.systemScale’:

html5labels-1

Obviously, things aren’t displaying as they should be, and some views – like the upgrade coordinator – are practically unusable:

html5labels-2

Continue reading “Missing Labels in the HTML5 Plugin with NSX 6.4.”