Cisco nenic Driver Issue During NSX Upgrades

nenic driver versions prior to 1.0.11.0 may cause an outage during NSX upgrades.

If you are planning an NSX upgrade in a Cisco UCS environment, pay close attention to your ‘nenic’ driver version before you begin. The nenic driver is the new native driver replacement for the older vmklinux enic driver. It’s used exclusively for the Cisco VIC adapters found in UCS systems and is now the default in vSphere 6.5 and 6.7.

We’ve seen several instances now where Cisco VIC adapters can go link-down in an error state during NSX VIB upgrades. It doesn’t appear to matter what version of NSX is being upgraded from/to, but the common denominator is an older nenic driver version. This seems to be reproducible with nenic driver version 1.0.0.2 and possibly others. Version 1.0.11.0 and later appear to correct this problem. At the time of writing, 1.0.26.0 is the latest version available.

You can obtain your current nenic driver and firmware version using the following command:

# esxcli network nic get -n vmnicX
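
If you want to script this check across several hosts, a dotted version string can be compared numerically with sort -V. Below is a minimal sketch; the helper name is made up for illustration, and the 1.0.11.0 threshold is simply the first known-fixed release mentioned above. Parsing the version out of the esxcli output is left to the caller.

```shell
#!/bin/sh
# Sketch: flag a nenic driver version older than the first known-fixed
# release (1.0.11.0). The helper name is illustrative, not an official tool.
nenic_affected() {
    installed="$1"
    min_ok="1.0.11.0"
    # sort -V orders dotted version strings numerically; if the installed
    # version sorts strictly below the threshold, it is older.
    lowest=$(printf '%s\n%s\n' "$min_ok" "$installed" | sort -V | head -n1)
    [ "$installed" != "$min_ok" ] && [ "$lowest" = "$installed" ]
}

# e.g. with the version reported by 'esxcli network nic get -n vmnicX':
if nenic_affected "1.0.0.2"; then
    echo "nenic is older than 1.0.11.0 - plan a driver upgrade first"
fi
```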

Before you upgrade your drivers, be sure to reach out to Cisco to ensure your firmware is also at the recommended release version. Quite often vendors have a recommended driver/firmware combination for maximum stability and performance.

I expect a KB article and an update to the NSX release notes to be made public soon but wanted to ensure this information got out there as soon as possible.

NSX 6.4.3 Now Available!

Express maintenance release fixes two discovered issues.

If it feels like 6.4.2 was just released, you’d be correct – only three weeks ago. The new 6.4.3 release (build 9927516) is what’s referred to as an express maintenance release. These releases aim to correct specific customer-identified problems as quickly as possible rather than having to wait many months for the next full patch release.

In this release, only two identified bugs have been fixed. The first is an SSO issue that can occur in environments with multiple PSCs:

“Fixed Issue 2186945: NSX Data Center for vSphere 6.4.2 will result in loss of SSO functionality under specific conditions. NSX Data Center for vSphere cannot connect to SSO in an environment with multiple PSCs or STS certificates after installing or upgrading to NSX Data Center for vSphere 6.4.2.”

The second is an issue with IPsets that can impact third-party security products – like Palo Alto Networks and Check Point Net-X services, for example:

“Issue 2186968: Static IPset not reported to containerset API call. If you have service appliances, NSX might omit IP sets in communicating with Partner Service Managers. This can lead to partner firewalls allowing or denying connections incorrectly. Fixed in 6.4.3.”

You can find more information on these problems in VMware KB 57770 and KB 57834.

So knowing that these are the only two fixes included, the question obviously becomes – do I really need to upgrade?

If you are running 6.4.2 today, you might not need to. If you have more than one PSC associated with the vCenter Server that NSX Manager connects to, or if you use third-party firewall products that work in conjunction with NSX, the answer would be yes. If you don’t, there is really no benefit to upgrading to 6.4.3, and it would be best to save your efforts for the next major release.

That said, if you were already planning an upgrade to 6.4.2, it only makes sense to go to 6.4.3 instead. You’d get all the benefits of 6.4.2 plus these two additional fixes.

Kudos goes out to the VMware NSBU engineering team for their quick work in getting these issues fixed and getting 6.4.3 out so quickly.

Relevant Links:

VMware Tools 10.3.2 Now Available

New bundled VMXNET3 driver corrects PSOD crash issue.

As mentioned in a recent post, a problem in the tools 10.3.0 bundled VMXNET3 driver could cause host PSODs and connectivity issues. As of September 12th, VMware Tools 10.3.2 is now available, which corrects this issue.

The problematic driver was version 1.8.3.0 in tools 10.3.0. According to the release notes, it has been replaced with version 1.8.3.1. In addition to this fix, there are four resolved issues listed as well.

VMware mentions the following in the 10.3.2 release notes:

“Note: VMware Tools 10.3.0 is deprecated due to a VMXNET3 driver related issue. For more information, see KB 57796. Install VMware Tools 10.3.2, or VMware Tools 10.2.5 or an earlier version of VMware Tools.”

Kudos to the VMware engineering teams for getting 10.3.2 released so quickly after the discovery of the problem!

Relevant links:

PSOD and Connectivity Problems with VMware Tools 10.3.0

Downgrading to Tools 10.2.5 is an effective workaround.

If you have installed the new VMware Tools 10.3.0 release in VMs running recent versions of Windows, you may be susceptible to host PSODs and other general connectivity problems. VMware has just published KB 57796 regarding this problem, and has recalled 10.3.0 so that it’s no longer available for download.

Tools 10.3.0 includes a new version of the VMXNET3 vNIC driver – version 1.8.3.0 – for Windows, which seems to be the primary culprit. Thankfully, not every environment with Tools 10.3.0 will run into this. It appears that the following conditions must be met:

  1. You are running a build of ESXi 6.5.
  2. You have Windows 2012, Windows 8 or later VMs with VMXNET3 adapters.
  3. The VM hardware is version 13 (the version released along with vSphere 6.5).
  4. Tools 10.3.0 with the 1.8.3.0 VMXNET3 driver is installed in the Windows guests.

VMware is planning to have this issue fixed in the next release of Tools 10.3.x.

If you fall into the above category and are at risk, it would be a good idea to address this even if you haven’t run into any problems. Since this issue is specific to VMXNET3 version 1.8.3.0 – which is bundled only with Tools 10.3.0 – downgrading to Tools 10.2.5 is an effective workaround. Simply uninstall tools, and re-install version 10.2.5, which is available here.

Another option would be to replace VMXNET3 adapters with E1000E-based adapters in susceptible VMs. Both actions cause some VM impact, but I would personally rather downgrade to Tools 10.2.5, as the VMXNET3 adapter is far superior to the E1000E.

Again, you’d only need to do this for VMs that fall into the specific categories listed above. Other VMs can be left as-is running 10.3.0 without concern.

On a positive note, Tools 10.3.0 hasn’t been bundled with any builds of ESXi 6.5, so unless you’ve gone out and obtained tools directly from the VMware download page recently, you shouldn’t have it in your environment.

A New Look

A new theme that’s easier to read and mobile friendly.

You may have noticed that the blog has a bit of a fresh new look as of late. When I originally started the site over a year ago, I used the trusty ‘twenty twelve’ WordPress theme for its simplicity and ease of use. Although I liked its simple layout, it was dated and wasn’t particularly mobile-friendly. I’ve since moved over to ‘twenty sixteen’, which is much more customizable, easier to read and works a lot better on mobile devices. I hope that this will be a positive change for the site.

Please be patient with me over the next few days as I iron out the quirks and get the CSS styling to behave. I noticed that some of the images are not aligning correctly, among other things. Thanks for your patience!

NSX Controller Issues with vRNI 3.8

Just a quick PSA to let everyone know that vRNI 3.8 Patch 4 (201808101430) is now available for download. If you are running vRNI 3.8 with NSX, be sure to patch as soon as possible or disable controller polling based on instructions in the workaround section of KB 57311.

Some changes were made in vRNI 3.8 for NSX controller polling. It now uses both the NSX Central CLI as well as SSH sessions to obtain statistics and information. In some situations, excessive numbers of SSH sessions are left open and memory exhaustion on controller nodes can occur.
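
Since the symptom is leaked SSH sessions, a quick way to gauge exposure on any Linux-based appliance is to count established connections on port 22 from netstat-style output. The sample data below is fabricated for illustration; on a live system you would pipe in `netstat -tn` (or `ss -tn`) instead.

```shell
#!/bin/sh
# Sketch: count established inbound SSH sessions from netstat-style output.
# The sample text is fabricated for illustration only.
netstat_sample='tcp 0 0 10.0.0.5:22 10.0.0.9:41200 ESTABLISHED
tcp 0 0 10.0.0.5:22 10.0.0.9:41204 ESTABLISHED
tcp 0 0 10.0.0.5:443 10.0.0.7:51000 ESTABLISHED'

# Field 4 is the local address; match sessions terminating on port 22.
ssh_count=$(printf '%s\n' "$netstat_sample" | awk '$4 ~ /:22$/ && $6 == "ESTABLISHED"' | wc -l)
echo "established SSH sessions: $ssh_count"
```

A steadily climbing count between vRNI polling cycles would be a red flag worth investigating.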

If you do find a controller in this state, a reboot will get it back to normal again. Just be sure to keep a close eye on your control cluster. If two or three go down, you’ll lose controller majority and will likely run into control plane and data plane problems for VMs on logical switches or VMs using DLR routing.

More information on the NSX controller out-of-memory issue can be found in VMware KB 57311. For more information on vRNI 3.8 Patch 4 – including how to install it – see VMware KB 57683.

NSX 6.4.2 Now Available!

It’s always an exciting day when a new build of NSX is released. As of August 21st, NSX 6.4.2 (Build 9643711) is now available for download. Despite being just a ‘dot’ release, VMware has included numerous functional enhancements in addition to the usual bug fixes.

One of the first things you’ll probably notice is that VMware is now referring to ‘NSX for vSphere’ as ‘NSX Data Center for vSphere’. I’m not sure the name has a good ring to it, but we’ll go with that.

A few notable new features:

More HTML5 Enabled Features: It seems VMware is adding additional HTML5 functionality with each release now. In 6.4.2, we can now access the TraceFlow, User Domains, Audit Logs and Tasks and Events sections from the HTML5 client.

Multicast Support: This is a big one. NSX 6.4.2 now supports both IGMPv2 and PIM Sparse on both DLRs and ESGs. I hope to take a closer look at these changes in a future post.

MAC Limit Increase: Traditionally, we’ve always recommended limiting each logical switch to a /22 or smaller network to avoid exceeding the 2048 MAC entry limit. NSX 6.4.2 now doubles this to 4096 entries per logical switch.

L7 Firewall Improvements: Additional L7 contexts added including EPIC, Microsoft SQL and BLAST AppIDs.

Firewall Rule Hit Count: Easily see which firewall rules are being hit, and which are not with these counters.

Firewall Section Locking: Great for allowing multiple people to work on the DFW simultaneously without conflicting with each other.

Additional Scale Dashboard Metrics: There are 25 new metrics added to ensure you stay within supported limits.

Controller NTP, DNS and Syslog: This is now finally exposed in the UI and fully supported. As someone who is frequently looking at log bundles, it’ll be nice to finally be able to have accurate time keeping on controller nodes.

On the resolved issues front, I’m happy to report that 6.4.2 includes 21 documented bug fixes. You can find the full list in the release notes, but a couple of very welcome ones include issues 2132361 and 2147002. Those who are using Guest Introspection on NSX 6.4.1 should consider upgrading as soon as possible due to the service and performance problems outlined in KB 56734. NSX 6.4.2 corrects this problem.

Another issue fixed in 6.4.2 – though not listed in the release notes, possibly an oversight – is the full ESG tmpfs partition issue in 6.4.1. You can find more information on this issue in KB 57003 as well as in a recent post I did on it here.

Here are the relevant links for NSX 6.4.2 (Build 9643711):

I’m looking forward to getting my lab upgraded and will be trying out a few of these new features. Remember, if you are planning to upgrade, be sure to do the necessary planning and preparation.

ESG/DLR tmpfs partition fills in NSX 6.3.6 and 6.4.1

If you are running NSX 6.3.6 or 6.4.1, you should take a close look at VMware KB 57003. A newly discovered issue can result in the tmpfs partition of DLRs and ESGs filling up, rendering the appliances unmanageable.

On a positive note, there should be no datapath impact because of a full tmpfs partition. You just won’t be able to push any configuration changes to the ESG or DLR in this state.

This occurs because of a file related to HA in /run that will slowly grow until it fills the partition. The file in question is ‘ha.cid.Out’ and contains HA diagnostic information. You can find it in the /run/vmware/vshield/cmdOut directory.

If you have a very stable environment, it’s quite possible that you’ll never run into this problem. The ha.cid.Out file is created and updated only after an HA event occurs – like a failover or split-brain recovery for example. Once the file is created, however, it receives regular updates and will inevitably grow.

Based on the rate at which the file grows, a compact-size ESG or DLR has about a month after an HA event before this becomes a problem. Larger sized ESGs have more memory, and hence larger tmpfs partitions. Below is an estimate based on tmpfs partition size for each size of appliance:

All DLRs (256MB tmpfs): 4 weeks
Compact ESG (256MB tmpfs): 4 weeks
Large ESG (497MB tmpfs): 8 weeks
Quad Large ESG (1024MB tmpfs): 4 months
X-Large ESG (3.9GB tmpfs): >1 year
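
As a rough cross-check, the four-weeks-for-256MB figure implies ha.cid.Out grows at roughly 64MB per week. Applying that implied rate (an inference from the table above, not a measured value) to the other partition sizes approximately reproduces the estimates:

```shell
#!/bin/sh
# Back-of-envelope check: 256MB filling in ~4 weeks implies roughly
# 64MB/week of ha.cid.Out growth. The rate is inferred, not measured,
# and 3994MB stands in for the 3.9GB X-Large tmpfs.
rate_mb_per_week=64
for size_mb in 256 497 1024 3994; do
    weeks=$((size_mb / rate_mb_per_week))
    echo "${size_mb}MB tmpfs: ~${weeks} weeks to fill"
done
```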

Unfortunately, it doesn’t appear that the ha.cid.Out file can be deleted or purged while the ESG/DLR is in operation. The file is locked for editing and the only safe way to recover is to reboot the appliance. Again, all of the features including routing and packet forwarding will continue to work just fine with a full tmpfs partition. You just won’t be able to make any changes.

Disabling ESG HA will prevent this from happening, but I’d argue that being protected by HA is more important than the potential for an ESG to become unmanageable.

You can monitor your ESG’s tmpfs partition using the show system storage CLI command:

esg-lb1.vswitchzero.net-0> show system storage
Filesystem      Size   Used   Avail  Use%  Mounted on
/dev/root       444M   366M   55M    88%   /
tmpfs           497M   80K    497M   1%    /run
/dev/sda2       43M    2.2M   38M    6%    /var/db
/dev/sda3       27M    413K   25M    2%    /var/dumpfiles
/dev/sda4       32M    1.1M   29M    4%    /var/log

If you see it slowly creeping up in size at a regular interval, it would be a good idea to start planning for a maintenance window to reboot the appliance.
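
If you capture that output periodically, a few lines of awk can pull out the /run usage and flag it once it crosses a threshold. A minimal sketch follows; the sample text mirrors the listing above, and the 80% threshold is an arbitrary choice of mine, not a VMware recommendation.

```shell
#!/bin/sh
# Sketch: flag the /run tmpfs once usage crosses a threshold. The sample
# text mirrors 'show system storage' output; 80% is an arbitrary threshold.
storage_output='Filesystem      Size   Used   Avail  Use%  Mounted on
/dev/root       444M   366M   55M    88%   /
tmpfs           497M   80K    497M   1%    /run'

# Column 5 holds Use%; strip the '%' and compare numerically.
run_pct=$(printf '%s\n' "$storage_output" | awk '$6 == "/run" {gsub("%","",$5); print $5}')
if [ "$run_pct" -ge 80 ]; then
    echo "/run is ${run_pct}% full - plan a reboot window"
else
    echo "/run is ${run_pct}% full - OK"
fi
```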

I can’t comment on release dates, but it’s very likely that this will be fixed in the next release of 6.4.x, which should be out very soon. The 6.3.x fix for this may be further out, so a jump to 6.4.2 may be your best bet if this proves to be a serious problem for you.

I hope this is helpful.

One-Year Anniversary of vswitchzero

It’s hard to believe, but a full year has passed since I wrote my first blog post on vswitchzero.com. My first post was something very simple just to get used to the authoring process – suppressing shell warnings – written on June 3rd, 2017.

When I started, my goal was to share my knowledge with the community and to share some of the other things I enjoy as well. I really wasn’t sure if I’d keep up with it or enjoy the process, but it has turned out to be a great personal and professional experience for me. I find myself digging deeper into problems and technologies, and looking for new ways to share, challenge and educate. It’s also been a great outlet for me to share some of my hobbies – like building and restoring retro PCs.

To date, I’ve written 76 posts for an average of about a post and a half per week. When I was getting started, it took some time to get in the swing of releasing regular content. Now, it seems I never have fewer than two or three things on my mind to write about, which is great.

Over the last year, I’ve seen a steady rise in my visitor and view counts and have seen many of my posts work their way up in the google search rankings. Some of them have been surprisingly popular – like my post on VMXNET3 buffer exhaustion and the beacon probing deep dive. I’ve also gotten some really positive feedback on my ongoing NSX Troubleshooting Scenario posts, which I hope to continue with. Being recognized as a 2018 vExpert was also a big milestone for me and I look forward to applying for the 2018 NSX vExpert program as well.

I’d like to take a moment to thank William Lam (@lamw), and Matt Mancini (@vmexplorer) who were a big help in getting me started. They provided me with many great tips. Some of which I have embraced, and others that I still struggle with – like trying not to write 15,000-word posts. They also encouraged me to get on Twitter, which has proven to be an excellent tool to share my posts with the greater community.

Thank you all for your support and encouragement! I look forward to the many posts ahead.

NSX 6.4.1 Now Available!

On May 24th, VMware released NSX 6.4.1 – the first version of NSX to support vSphere 6.7. This is undoubtedly exciting news for those who have been waiting to upgrade their vSphere deployment. Although 6.4.1 sounds like a minor release, there are a slew of UI and usability enhancements as well as context-aware firewall improvements. There has also been some additional functionality introduced into the HTML5 client, which is very welcome news.

You’ll also notice in 6.4.1 that the service composer canvas view has been removed. This was a bit of an iconic overview page for service composer in the UI, but was not terribly useful and didn’t scale at all to large deployments with many security policies. I honestly don’t think anyone will be missing it.

On top of these enhancements, VMware engineering has been busy with bug fixes. NSX 6.4.1 includes 23 documented fixes across all areas of the product. A couple of notable ones include:

  • Fixed Issue 2035026: Network outage of around 40-50 seconds seen on Edge Upgrade
  • Fixed Issue 1971683: NSX Manager logs false duplicate IP message
  • Fixed Issue 2092730: NSX Edge stops responding with /var/log partition at 100% disk usage
  • Fixed Issue 1809387: Support for Weak Secure transport protocol – TLS v1.0 removed

You can find the complete list in the resolved issues section of the NSX 6.4.1 release notes.

Planning to upgrade? Remember to check the NSX upgrade matrix. Those running 6.2.0, 6.2.1 or 6.2.2 will need to refer to KB 51624 before upgrading. Have a look at my ‘Ten Tips for a Successful NSX Upgrade’ post for ways to ensure your upgrade is successful.

Relevant Links for NSX 6.4.1 (Build 8599035):