Cisco nenic Driver Issue During NSX Upgrades

nenic driver versions prior to 1.0.11.0 may cause an outage during NSX upgrades.

If you are planning an NSX upgrade in a Cisco UCS environment, pay close attention to your ‘nenic’ driver version before you begin. The nenic driver is the new native driver replacement for the older vmklinux enic driver. It’s used exclusively for the Cisco VIC adapters found in UCS systems and is now the default in vSphere 6.5 and 6.7.

We’ve seen several instances now where Cisco VIC adapters can go link-down in an error state during NSX VIB upgrades. It doesn’t appear to matter what version of NSX is being upgraded from/to, but the common denominator is an older nenic driver version. This seems to be reproducible with nenic driver version 1.0.0.2 and possibly others. Version 1.0.11.0 and later appear to correct this problem. At the time of writing, 1.0.26.0 is the latest version available.

You can obtain your current nenic driver and firmware version using the following command:

# esxcli network nic get -n vmnicX
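
The driver and firmware versions are listed in the ‘Driver Info’ section of the output. A trimmed example is below – the values shown are illustrative only:

   Driver Info:
         Driver: nenic
         Firmware Version: 4.2(2b)
         Version: 1.0.0.2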

Before you upgrade your drivers, be sure to reach out to Cisco to ensure your firmware is also at the recommended release version. Quite often vendors have a recommended driver/firmware combination for maximum stability and performance.
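
If an update is needed, the nenic driver is distributed as an offline bundle that can be applied with esxcli. A rough sketch – the bundle file name below is a placeholder for whatever you download from Cisco/VMware, and the host should be evacuated and in maintenance mode first:

# esxcli software vib list | grep nenic
# esxcli software vib update -d /vmfs/volumes/datastore1/VMW-ESX-6.5.0-nenic-1.0.26.0-offline_bundle.zip
# reboot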

I expect a KB article and an update to the NSX release notes to be made public soon but wanted to ensure this information got out there as soon as possible.

NSX Troubleshooting Scenario 12 – Solution

Welcome to the twelfth installment of my NSX troubleshooting series. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall in the first half, our fictional customer was getting some unexpected behavior from a couple of firewall rules. Despite the rules being properly constructed, one VM called linux-a3 continued to be accessible via SSH.

tshoot12a-2
The two rules in question – 1007 and 1008 – look to be constructed correctly.

We confirmed that the IP addresses for the machines in the security group were translated correctly by NSX and that the ruleset didn’t appear to be the problem. Let’s recap what we know:

  1. VM linux-a2 seems to be working correctly and SSH traffic is blocked.
  2. VM linux-a3 doesn’t seem to respect rule 1007 for some reason and remains accessible via SSH from everywhere.
  3. Host esx-a3 where linux-a3 resides doesn’t appear to log any activity for rule 1007 or 1008 even though those rules are configured to log.
  4. The two VMs are on different ESXi hosts (esx-a1 and esx-a3).
  5. VMs linux-a2 and linux-a3 are in different dvPortgroups.

Given these statements, there are several things I’d want to check:

  1. How can the two VMs have proper IP connectivity in VXLAN and VLAN portgroups as observed?
  2. Is the DFW working at all on host esx-a3?
  3. Did the last rule publication make it to host esx-a3 and does it match what we see in the UI?
  4. Is the DFW (slot-2) dvfilter applied to linux-a3 correctly? (See the command sketch below.)
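
For checks 2 through 4, the host itself is the best place to look. A quick sketch of what I’d run on esx-a3 – the dvfilter name shown is a placeholder, so copy the real slot-2 filter name from the summarize-dvfilter output:

# summarize-dvfilter | grep -A3 linux-a3
# vsipioctl getrules -f nic-12345-eth0-vmware-sfw.2
# vsipioctl getaddrsets -f nic-12345-eth0-vmware-sfw.2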

Continue reading “NSX Troubleshooting Scenario 12 – Solution”

NSX Troubleshooting Scenario 12

Welcome to the twelfth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

For this scenario today, I’ve created some supplementary video content to go along with this post:

The Scenario

As always, we’ll start with a brief problem statement:

“I am just getting started with the NSX distributed firewall and see that the rules are not behaving as they should be. I have two VMs, linux-a2 and linux-a3, that should allow SSH from only one specific jump box. The linux-a3 VM can be accessed via SSH from anywhere! Why is this happening?”

To get started with this scenario, we’ll most certainly need to look at how the DFW rules are constructed to get the desired behavior. The immense flexibility of the distributed firewall allows for dozens of different ways to achieve what is described.
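
As an example – and I’m only guessing at the exact construction here – one common way to achieve this is a pair of rules with a security group based on the ‘Linux-A VMs’ tag as the destination:

Allow | Service: SSH | Source: jump box | Destination: SG ‘Linux-A VMs’
Block | Service: SSH | Source: any | Destination: SG ‘Linux-A VMs’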

Here are the two VMs in question:

tshoot12a-4
The VM linux-a2 is currently on host esx-a1 with IP address 172.16.15.10. It’s sitting on a logical switch.

And linux-a3:

tshoot12a-5
The VM linux-a3 is currently on host esx-a3 with IP address 172.16.15.11. It’s sitting on a VLAN backed dvPortgroup.

There are a couple of interesting observations above. The first is that both VMs have a security tag applied called ‘Linux-A VMs’. The other is a bit more of an oddity – one VM is in a distributed switch VLAN backed portgroup called dvpg-a-vlan15, and the other is in a VXLAN backed logical switch. Despite this, both VMs are in the same 172.16.15.0/24 subnet.

Continue reading “NSX Troubleshooting Scenario 12”

Removing All Universal Objects using PowerNSX and PowerShell Scripting

Removing all universal objects is a requirement before a transit NSX manager appliance can assume the standalone role.

Cross-vCenter NSX is an awesome feature. Introduced back with NSX 6.2, it breaks down barriers and allows the spanning of logical networks well beyond what was possible before. Adding and synchronizing additional secondary NSX managers and their associated vCenter Servers is a relatively simple process and generally just works. But what about moving a secondary back to a standalone manager? Not quite so simple, unfortunately.

The process should be straightforward – disconnect all the VMs from the universal logical switches, make the secondary manager a standalone and then go through the documented removal process.

From a high level, that’s correct, but you’ll be stopped quickly in your tracks by the issue I outlined in NSX Troubleshooting Scenario 7. Before a secondary NSX manager can become a standalone, all universal objects must be removed from it.

Now, assuming Cross-VC NSX will still be used and there will be other secondary managers that still exist after this one is removed, we don’t want to completely remove all universal objects as they’ll still be used by the primary and other secondaries.

In this situation, the process to get a secondary NSX manager back to a standalone would look something like this:

  1. Remove all VMs from universal logical switches at the secondary location.
  2. Use the ‘Remove Secondary Manager’ option from the ‘Installation and Upgrade’ section in the vSphere Client. This will change the secondary’s role to ‘Transit’ and effectively stops all universal synchronization with the primary.
  3. Remove all universal objects from the new ‘Transit’ NSX manager. These universal objects were originally all created on the primary manager and synchronized to this one while it was a secondary. (A sketch of the underlying API calls follows this list.)
  4. Once all universal objects have been removed from the ‘Transit’ manager, its role can be changed to ‘Standalone’.
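
PowerNSX handles the enumeration and ordering in the script, but each removal in step 3 ultimately boils down to a REST DELETE against the transit manager. As a rough illustration of the kinds of calls involved – the hostname, credentials and object IDs are all placeholders:

# Universal logical switch:
curl -k -u admin:password -X DELETE https://nsxmgr/api/2.0/vdn/virtualwires/universalwire-1
# Universal security group:
curl -k -u admin:password -X DELETE https://nsxmgr/api/2.0/services/securitygroup/securitygroup-10
# Universal transport zone:
curl -k -u admin:password -X DELETE https://nsxmgr/api/2.0/vdn/scopes/universalvdnscope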

Much of the confusion surrounding this process revolves around a transitional NSX manager role type aptly named ‘Transit’. When a manager assumes the ‘Transit’ role, it is effectively disconnected from the primary and all universal synchronization to it stops. Even though it won’t synchronize, all the universal objects are preserved. This is done because Cross-VC NSX is designed to allow any of the secondary NSX managers to assume the primary role if necessary.

Continue reading “Removing All Universal Objects using PowerNSX and PowerShell Scripting”

Limiting User Scope and Permissions in NSX

Using REST API calls to limit NSX user permissions to specific objects only.

There is a constant stream of new features added with each release of NSX, but not all of the original features have survived. NSX Data Security was one such casualty, and the ‘Limit Scope’ option for user permissions was also removed from the NSX UI with the release of 6.2.0 back in 2015. Every so often I’ll get a customer asking where this feature went.

The ‘Limit Scope’ feature allows you to limit specific NSX users to specific objects within the inventory. For example, you may want to provide an application owner with full access to only one specific edge load balancer, and to ensure they have access to nothing else in NSX.

The feature was scrapped in 6.2 primarily because of UI problems that would occur for users restricted to only specific resources. To view the UI properly and as intended, you’d need access to the ‘global root’ object that is the parent for all other NSX managed objects. VMware KB 2136534 is about the only source I could find that discusses this.

REST API Calls Still Exist

Although the ‘Limit Scope’ option was removed from the UI in 6.2 and later, you may be surprised to discover that the API calls for this feature still exist.

To show how this works, I’ll be running through a simple scenario in my lab. For this test, we’ll assume that there are two edges – mercury-esg1 and mercury-dlr – that are related to a specific application deployment. A vCenter user called test in the vswitchzero.net domain requires access to these two edges, but we don’t want them to be able to access anything else.

limitscope-1
We want to limit access to only edge-4 and edge-5 for the ‘test’ user.

The two edges in question have morefs edge-4 and edge-5 respectively. For more information on finding moref IDs for NSX objects, see my post on the subject here.
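
Although the option is gone from the UI, the scoping itself is just a POST to the user’s role object via the usermgmt API. The general shape of the call is below – treat this as a sketch and verify the exact schema and role values against the NSX API guide for your version:

curl -k -u admin:password -X POST -H 'Content-Type: application/xml' \
  https://nsxmgr/api/2.0/services/usermgmt/role/test@vswitchzero.net -d '
<accessControlEntry>
    <role>security_admin</role>
    <resource>
        <resourceId>edge-4</resourceId>
        <resourceId>edge-5</resourceId>
    </resource>
</accessControlEntry>'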

Continue reading “Limiting User Scope and Permissions in NSX”

NSX 6.4.3 Now Available!

Express maintenance release fixes two discovered issues.

If it feels like 6.4.2 was just released, you’d be correct – it came out only three weeks ago. The new 6.4.3 release (build 9927516) is what’s referred to as an express maintenance release. These releases aim to correct specific customer-identified problems as quickly as possible rather than having to wait many months for the next full patch release.

In this release, only two identified bugs have been fixed. The first is an SSO issue that can occur in environments with multiple PSCs:

“Fixed Issue 2186945: NSX Data Center for vSphere 6.4.2 will result in loss of SSO functionality under specific conditions. NSX Data Center for vSphere cannot connect to SSO in an environment with multiple PSCs or STS certificates after installing or upgrading to NSX Data Center for vSphere 6.4.2.”

The second is an issue with IPsets that can impact third-party security products – Palo Alto Networks and Check Point NetX services, for example:

“Issue 2186968: Static IPset not reported to containerset API call. If you have service appliances, NSX might omit IP sets in communicating with Partner Service Managers. This can lead to partner firewalls allowing or denying connections incorrectly. Fixed in 6.4.3.”

You can find more information on these problems in VMware KB 57770 and KB 57834.

So knowing that these are the only two fixes included, the question obviously becomes – do I really need to upgrade?

If you are running 6.4.2 today, you might not need to. If you have more than one PSC associated with the vCenter Server that NSX Manager connects to, or if you use third-party firewall products that work in conjunction with NSX, the answer would be yes. If not, there is really no benefit to upgrading to 6.4.3, and it would be best to save your efforts for the next major release.

That said, if you were already planning an upgrade to 6.4.2, it only makes sense to go to 6.4.3 instead. You’d get all the benefits of 6.4.2 plus these two additional fixes.

Kudos goes out to the VMware NSBU engineering team for their quick work in getting these issues fixed and 6.4.3 released.

Manual Upgrade of NSX Host VIBs

Complete manual control of the NSX host VIB upgrade process without the use of vSphere DRS.

NSX host upgrades are well automated these days. By taking advantage of ‘fully automated’ DRS, hosts in a cluster can be evacuated, put in maintenance mode, upgraded, and even rebooted without any user intervention. By relying on DRS for resource scheduling, NSX doesn’t have to worry about doing too many hosts simultaneously and the process can generally be done without end-users even noticing.

But what if you don’t want this level of automation? Maybe you’ve got very sensitive VMs that can’t be migrated, or VMs pinned to hosts for some reason. Or maybe you just want maximum control of the upgrade process and which hosts are upgraded – and when.

There is no reason why you can’t have full control of the host upgrade process and leave DRS in manual mode. This is indeed supported.

Most of the documentation and guides out there assume that people will want to take advantage of DRS-driven upgrades, but this doesn’t mean it’s the only supported method. Today I’ll be walking through this in my lab as I upgrade to NSX 6.4.1.

Step 1 – Clicking the Upgrade Link

Once you’ve upgraded your NSX manager and control cluster, you should be ready to begin tackling your ESXi host clusters. Before you proceed, you’ll need to ensure your host clusters have DRS set to ‘Manual’ mode. Don’t disable DRS – that will get rid of your resource pools. Manual mode is sufficient.

Next, you’ll need to browse to the usual ‘Installation’ section in the UI and click on the ‘Host Preparation’ tab. From here, it’s now safe to click the ‘Upgrade Available’ link on the cluster to begin the upgrade process. Because DRS is in manual mode, nothing will actually happen yet. Hosts can’t be evacuated, and as a result, VIBs can’t be upgraded. In essence, the upgrade has started, but it immediately stalls and awaits manual intervention.

upgnodrs-3
This upgrade is essentially hung up waiting for hosts to enter maintenance mode.

In 6.4.1 as shown above, a clear banner message is displayed reminding you that DRS is in manual mode and that hosts must be manually put in maintenance mode.
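
From here, the process is entirely in your hands. For each host in sequence: migrate the VMs off yourself, put the host in maintenance mode, let EAM install the new VIBs, and reboot if required. A rough per-host sketch from the CLI – doing the same from the vSphere Client works just as well. In 6.4.x the host VIB is esx-nsxv, while 6.3.x and earlier use esx-vsip and esx-vxlan:

# vim-cmd hostsvc/maintenance_mode_enter
(EAM installs the new NSX VIBs; reboot if prompted)
# esxcli software vib list | grep -E 'esx-nsxv|esx-vsip|esx-vxlan'
# vim-cmd hostsvc/maintenance_mode_exit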

Continue reading “Manual Upgrade of NSX Host VIBs”

NSX Controller Issues with vRNI 3.8

Just a quick PSA to let everyone know that vRNI 3.8 Patch 4 (201808101430) is now available for download. If you are running vRNI 3.8 with NSX, be sure to patch as soon as possible or disable controller polling based on instructions in the workaround section of KB 57311.

Some changes were made in vRNI 3.8 for NSX controller polling. It now uses both the NSX Central CLI and SSH sessions to obtain statistics and information. In some situations, excessive numbers of SSH sessions are left open and memory exhaustion can occur on controller nodes.

If you do find a controller in this state, a reboot will get it back to normal again. Just be sure to keep a close eye on your control cluster. If two or three go down, you’ll lose controller majority and will likely run into control plane and data plane problems for VMs on logical switches or VMs using DLR routing.
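
An easy way to keep watch is the central CLI on the NSX manager – the same interface vRNI polls. The command below lists each controller node along with its status:

nsxmanager> show controller list all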

More information on the NSX controller out-of-memory issue can be found in VMware KB 57311. For more information on vRNI 3.8 Patch 4 – including how to install it – see VMware KB 57683.

NSX 6.4.2 Now Available!

It’s always an exciting day when a new build of NSX is released. As of August 21st, NSX 6.4.2 (Build 9643711) is now available for download. Despite being just a ‘dot’ release, VMware has included numerous functional enhancements in addition to the usual bug fixes.

One of the first things you’ll probably notice is that VMware is now referring to ‘NSX for vSphere’ as ‘NSX Data Center for vSphere’. I’m not sure the name has a good ring to it, but we’ll go with that.

A few notable new features:

More HTML5 Enabled Features: It seems VMware is adding additional HTML5 functionality with each release now. In 6.4.2, we can now access the Traceflow, User Domains, Audit Logs, and Tasks and Events sections from the HTML5 client.

Multicast Support: This is a big one. NSX 6.4.2 now supports both IGMPv2 and PIM Sparse on both DLRs and ESGs. I hope to take a closer look at these changes in a future post.

MAC Limit Increase: Traditionally, we’ve always recommended limiting each logical switch to a /22 or smaller network to avoid exceeding the 2048 MAC entry limit. NSX 6.4.2 now doubles this to 4096 entries per logical switch.

L7 Firewall Improvements: Additional L7 contexts added including EPIC, Microsoft SQL and BLAST AppIDs.

Firewall Rule Hit Count: Easily see which firewall rules are being hit, and which are not with these counters.

Firewall Section Locking: Great for allowing multiple people to work on the DFW simultaneously without conflicting with each other.

Additional Scale Dashboard Metrics: There are 25 new metrics added to ensure you stay within supported limits.

Controller NTP, DNS and Syslog: These settings are now finally exposed in the UI and fully supported. As someone who is frequently looking at log bundles, it’ll be nice to finally have accurate timekeeping on controller nodes.

On the resolved issues front, I’m happy to report that 6.4.2 includes 21 documented bug fixes. You can find the full list in the release notes, but a couple of very welcome ones include issues 2132361 and 2147002. Those who are using Guest Introspection on NSX 6.4.1 should consider upgrading as soon as possible due to the service and performance problems outlined in KB 56734. NSX 6.4.2 corrects this problem.

Another issue that was fixed in 6.4.2 but not listed in the release notes – it may have been missed – is the full ESG tmpfs partition issue in 6.4.1. You can find more information on this issue in KB 57003 as well as in a recent post I did on it here.

I’m looking forward to getting my lab upgraded and will be trying out a few of these new features. Remember, if you are planning to upgrade, be sure to do the necessary planning and preparation.

ESG/DLR tmpfs partition fills in NSX 6.3.6 and 6.4.1

If you are running NSX 6.3.6 or 6.4.1, you should take a close look at VMware KB 57003. A newly discovered issue can result in the tmpfs partition of DLRs and ESGs filling up, rendering the appliances unmanageable.

On a positive note, there should be no datapath impact because of a full tmpfs partition. You just won’t be able to push any configuration changes to the ESG or DLR in this state.

This occurs because of a file related to HA in /run that will slowly grow until it fills the partition. The file in question is ‘ha.cid.Out’ and contains HA diagnostic information. You can find it in the /run/vmware/vshield/cmdOut directory.

If you have a very stable environment, it’s quite possible that you’ll never run into this problem. The ha.cid.Out file is created and updated only after an HA event occurs – like a failover or split-brain recovery for example. Once the file is created, however, it receives regular updates and will inevitably grow.

Based on the rate at which the file grows – roughly 9MB per day, going by the estimates below – a compact size ESG or DLR has about a month after an HA event before this becomes a problem. Larger sized ESGs have more memory, and hence larger tmpfs partitions. Below are estimates based on the tmpfs partition size of each appliance size:

All DLRs (256MB tmpfs): 4 weeks
Compact ESG (256MB tmpfs): 4 weeks
Large ESG (497MB tmpfs): 8 weeks
Quad Large ESG (1024MB tmpfs): 4 months
X-Large ESG (3.9GB tmpfs): >1 year

Unfortunately, it doesn’t appear that the ha.cid.Out file can be deleted or purged while the ESG/DLR is in operation. The file is locked for editing and the only safe way to recover is to reboot the appliance. Again, all of the features including routing and packet forwarding will continue to work just fine with a full tmpfs partition. You just won’t be able to make any changes.

Disabling ESG HA will prevent this from happening, but I’d argue that being protected by HA is more important than the potential for an ESG to become unmanageable.

You can monitor your ESG’s tmpfs partition using the show system storage CLI command:

esg-lb1.vswitchzero.net-0> show system storage
Filesystem      Size   Used   Avail   Use%   Mounted on
/dev/root       444M   366M   55M     88%    /
tmpfs           497M   80K    497M    1%     /run
/dev/sda2       43M    2.2M   38M     6%     /var/db
/dev/sda3       27M    413K   25M     2%     /var/dumpfiles
/dev/sda4       32M    1.1M   29M     4%     /var/log

If you see it slowly creeping up in size at a regular interval, it would be a good idea to start planning for a maintenance window to reboot the appliance.

I can’t comment on release dates, but it’s very likely that this will be fixed in the next release of 6.4.x, which should be out very soon. The 6.3.x fix for this may be further out, so a jump to 6.4.2 may be your best bet if this proves to be a serious problem for you.

I hope this is helpful.