NSX Troubleshooting Scenario 6 – Solution

As we saw in the first half of scenario 6, a fictional administrator enabled the DFW in their management cluster, which caused some unexpected filtering to occur. Their vCenter Server was no longer allowed the necessary HTTPS port 443 traffic needed for the vSphere Web Client to work.

Since we can no longer manage the environment or the DFW using the UI, we’ll need to revert this change using some other method.

As mentioned previously, we are fortunate in that NSX Manager is always excluded from DFW filtering by default. This is done to protect against this very type of situation. Because the NSX management plane is still fully functional, we should – in theory – still be able to relay API-based DFW calls to NSX Manager. NSX Manager will in turn be able to publish these changes to the necessary ESXi hosts.

There are two relatively easy ways to fix this that come to mind:

  1. Use the approach outlined in KB 2079620. This is the equivalent of doing a factory reset of the DFW ruleset via API. This will wipe out all rules and they’ll need to be recovered or recreated.
  2. Use an API call to disable the DFW in the management cluster. This will essentially revert the exact change the user made in the UI that started this problem.

There are other options, but the two above will work to restore HTTP/HTTPS connectivity to vCenter. Once that is done, some remediation will be necessary to ensure this doesn’t happen again. Rather than picking a specific solution, I’ll go through both of them.
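As a rough illustration of the two options, here is a minimal Python sketch that builds the corresponding REST requests. The endpoint paths are written from memory of the NSX-V API guide and KB 2079620 and should be treated as assumptions to verify against the documentation for your NSX version. The hostname, cluster ID and credentials in the usage note are hypothetical placeholders.

```python
import base64
import ssl
import urllib.request

# Assumption: endpoint paths as recalled from the NSX-V API guide and
# KB 2079620. Verify them for your NSX version before sending anything.
RESET_PATH = "/api/4.0/firewall/globalroot-0/config"
CLUSTER_TOGGLE_PATH = "/api/4.0/firewall/{cluster_moid}/enable/{state}"


def build_request(nsx_manager, method, path, user, password):
    """Build (but do not send) a basic-auth request to NSX Manager."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{nsx_manager}{path}",
        method=method,
        headers={"Authorization": f"Basic {token}"},
    )


def reset_dfw_request(nsx_manager, user, password):
    """Option 1: reset the DFW ruleset to defaults (wipes all rules)."""
    return build_request(nsx_manager, "DELETE", RESET_PATH, user, password)


def disable_dfw_request(nsx_manager, cluster_moid, user, password):
    """Option 2: disable the DFW on a single cluster."""
    path = CLUSTER_TOGGLE_PATH.format(cluster_moid=cluster_moid, state="false")
    return build_request(nsx_manager, "PUT", path, user, password)


def send(request, verify_tls=True):
    """Send a prepared request and return the HTTP status code."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Lab use only: skips certificate checks for self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```

For example, `send(disable_dfw_request("nsxmanager.lab.local", "domain-c41", "admin", "password"))` would attempt option 2 against a hypothetical management cluster with the managed object ID `domain-c41`. Keeping request construction separate from sending makes it easy to inspect the method and URL before anything is changed.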

NSX Troubleshooting Scenario 6

Welcome to the sixth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

As always, we’ll start with a brief customer problem statement:

“Help! It looks like we accidentally blocked access to vCenter Server! We have two clusters, a compute and a management cluster. My colleague noticed the firewall was disabled on the management cluster and turned it on. As soon as he did that we lost all access to the vSphere Web Client.”

Well, this sounds like a ‘chicken or the egg’ dilemma – how can they recover if they can’t log in to the vSphere Web Client to revert the changes that broke things?

In speaking with our fictional customer, we learn that some rules are in place to block all HTTP/HTTPS access in the compute cluster. Because they are still deploying VMs and getting everything patched, they are using this as a temporary means to prevent all web access. Unfortunately, he can’t remember exactly what was configured in the firewall and there may be other restrictions in place.

This was a screenshot of the last thing he saw before his web client session started timing out:


Starting with some basic ping tests, we can see that the vCenter Server and NSX Manager are both still accessible from a layer-3 perspective:

Missing NSX vdrPort and Auto Deploy

If you are running Auto Deploy and noticed your VMs didn’t have connectivity after a host reboot or upgrade, you may have run into the problem described in VMware KB 52903. I’ve seen this a few times now with different customers and thought a PSA may be in order. You can find all the key details in the KB, but I thought I’d add some extra context here to help anyone who may want more information.

I recently helped to author VMware KB 52903, which has just been made public. Essentially, it describes a race condition causing a host to come up without its vdrPort connected to the distributed switch. The vdrPort is an important component on an ESXi host that funnels traffic to/from the NSX DLR module. If this port isn’t connected, traffic can’t make it to the DLR for east/west routing on that host. Technically, VMs in the same logical switches will be able to communicate across hosts, but none of the VMs on this impacted host will be able to route.

The Problem

The race condition occurs when the DVS registration of the host occurs too late in the boot process. Normally, the distributed switch should be initialized and registered long before the vdrPort gets connected. In some situations, however, DVS registration can be late. Obviously, if the host isn’t yet initialized/registered with the distributed switch, any attempt to connect something to it will fail. And this is exactly what happens.

Using the log lines from KB 52903 as an example, we can see that the host attempts to add the vdrPort to the distributed switch at 23:44:19:

2018-02-08T23:44:19.431Z error netcpa[3FFEDA29700] [Originator@6876 sub=Default] Failed to add vdr port on dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d, Not found

The operation fails because the DVS with the specified UUID is not found from the perspective of this host. It simply hasn’t been initialized yet. A few moments later, the DVS is finally ready for use on the host. Notice the timestamps – you can see the registration of the DVS about 9 seconds later:

2018-02-08T23:44:28.389Z info hostd[4F540B70] [Originator@6876 sub=Hostsvc.DvsTracker] Registered Dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d

The above message can be found in /var/log/hostd.log.
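To make the timing gap concrete, here is a small Python sketch that parses the timestamps from the two log lines above and computes how long after the failed vdrPort add the DVS was registered. The function names are my own and only illustrate the comparison. This is not a tool from the KB.

```python
from datetime import datetime

# The two log lines quoted above, from KB 52903.
VDR_FAILURE = (
    "2018-02-08T23:44:19.431Z error netcpa[3FFEDA29700] "
    "[Originator@6876 sub=Default] Failed to add vdr port on dvs "
    "96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d, Not found"
)
DVS_REGISTERED = (
    "2018-02-08T23:44:28.389Z info hostd[4F540B70] "
    "[Originator@6876 sub=Hostsvc.DvsTracker] Registered Dvs "
    "96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d"
)


def parse_timestamp(log_line):
    """Parse the leading UTC timestamp of an ESXi log line."""
    stamp = log_line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def seconds_between(earlier_line, later_line):
    """Return the number of seconds from the first log line to the second."""
    delta = parse_timestamp(later_line) - parse_timestamp(earlier_line)
    return delta.total_seconds()


gap = seconds_between(VDR_FAILURE, DVS_REGISTERED)
# A positive gap means the vdrPort add was attempted before the DVS existed.
race_hit = gap > 0
print(f"DVS registered {gap:.3f}s after the vdrPort add failed")
```

A positive gap means the vdrPort add was attempted before the DVS was registered on the host, which is the ordering the race condition produces.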

NSX Troubleshooting Scenario 5 – Solution

Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.

Please see the first half for more detail on the problem symptoms and some scoping.

Reader Suggestions

There were a few good suggestions from readers. Here are a couple from Twitter:

Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.

The suggestion about VMware tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.

Getting Started

In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.


Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:

  • VMs win-a1 and lubuntu-2 are working fine, i.e. they can’t browse the web.
  • VM lubuntu-1 is the only one not working, i.e. it can still browse the web.
  • The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
  • The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
  • All three VMs have the correct security tag applied.
  • All three VMs are indeed showing up correctly in the security group due to the tag.
  • The two working VMs are on host esx-a1 and the broken VM is on host esx-a2.

To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.
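The summary above can also be expressed as data, which makes it easy to see where the broken VM differs from both working ones. This is only an illustrative Python sketch. The dictionary layout and function name are my own, and the values come straight from the bullet points.

```python
# Observations from the scenario summary, keyed by VM name.
observations = {
    "win-a1": {"host": "esx-a1", "rule_hit": 1005, "tagged": True, "in_group": True},
    "lubuntu-2": {"host": "esx-a1", "rule_hit": 1005, "tagged": True, "in_group": True},
    "lubuntu-1": {"host": "esx-a2", "rule_hit": 1001, "tagged": True, "in_group": True},
}


def distinguishing_attributes(observations, broken_vm):
    """Return attributes whose value on the broken VM differs from every other VM."""
    broken = observations[broken_vm]
    others = [attrs for vm, attrs in observations.items() if vm != broken_vm]
    return sorted(
        key
        for key, value in broken.items()
        if all(other[key] != value for other in others)
    )


print(distinguishing_attributes(observations, "lubuntu-1"))
```

The tag and group membership are identical across all three VMs. The only differences are the rule being hit and the host, which is why host esx-a2 is a sensible place to start.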

NSX Troubleshooting Scenario 5

Welcome to the fifth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

As always, we’ll start with a brief customer problem statement:

“We’ve just deployed NSX and are doing some testing with the distributed firewall. We created a security tag that we can apply to VMs to prevent them from browsing the web. We applied this tag on three virtual machines. It seems to work on two of them, but the third can always browse the web! Something is not working here.”

After speaking to the customer, we were able to collect a bit more information about the VMs and traffic flows in question. Below are the VMs that should not be able to browse:

  • win-a1.lab.local – (static)
  • lubuntu-1.lab.local – (DHCP)
  • lubuntu-2.lab.local – (DHCP)

Only the VM called lubuntu-1 is still able to browse. The others are fine. The customer has been using an internal web server called web-a1.lab.local for testing. That machine is in the same cluster and has an IP address of It serves up a web page on port 80. All of the VMs in question are sitting in the same logical switch and the customer reports that all east-west and north-south routing is functioning normally.

To begin, let’s have a look at the DFW rules defined.


As you can see, they really did just start testing as there is only one new section and a single non-default rule. The rule is quite simple. Any HTTP/HTTPS traffic coming from VMs in the ‘No Browser’ security group should be blocked. We can also see that both this rule and the default were set to log as part of the troubleshooting activities.
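To illustrate the intended behaviour of this ruleset, here is a simplified Python model of top-down, first-match rule evaluation. It is a sketch under stated assumptions: group membership is modelled as a plain set of VM names, HTTP/HTTPS is modelled as TCP ports 80 and 443, and the rule IDs 1005 and 1001 are the ones referenced in the solution post. It does not reflect how the DFW is implemented on a host.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: int
    action: str
    source_group: Optional[FrozenSet[str]]  # None matches any source
    dest_ports: Optional[FrozenSet[int]]    # None matches any port


# Assumption: membership modelled as VM names for illustration only.
NO_BROWSER = frozenset({"win-a1", "lubuntu-1", "lubuntu-2"})

RULES = [
    Rule(1005, "block", NO_BROWSER, frozenset({80, 443})),  # HTTP/HTTPS
    Rule(1001, "allow", None, None),                        # default rule
]


def evaluate(rules, source_vm, dest_port):
    """Return the first rule, top down, that matches the flow."""
    for rule in rules:
        if rule.source_group is not None and source_vm not in rule.source_group:
            continue
        if rule.dest_ports is not None and dest_port not in rule.dest_ports:
            continue
        return rule
    raise LookupError("no rule matched; a default rule should always exist")


hit = evaluate(RULES, "lubuntu-1", 80)
print(hit.rule_id, hit.action)
```

In this model every VM in the group hits the block rule for web traffic. A VM that falls through to the default rule is therefore not being matched as a group member for that flow.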

NSX Troubleshooting Scenario 4 – Solution

Welcome to the fourth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario four. Today I’ll be performing some troubleshooting and will resolve the issue.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

During the scoping in the first half of the scenario, we saw that the problem was squarely in the customer’s new secondary NSX deployment. Two test virtual machines – linux-r1 and linux-r2 – could not be added to any of the universal logical switches.

From the ‘Logical Switches’ view in the NSX Web Client UI, we could see that these universal logical switches were synchronized across both NSX Managers. These existed from the perspective of the Primary and Secondary manager views:


Perhaps the most telling observation, however, was the absence of distributed port groups associated with the universal logical switches on the dvs-rem switch:


As we can see above, the port groups do exist for logical switches in the VNI 900x range. These are non-universal logical switches available to the secondary NSX deployment only.

In the host preparation section, we can see that dvs-rem is indeed the configured distributed switch for the compute-r cluster and that both hosts look good from a VTEP/VXLAN perspective:


So why are these port groups missing? Without them, VMs simply can’t be added to the associated logical switches.

The Solution

Although you’ve probably noticed that I like to dig deep in some of these scenarios, this one is actually pretty straightforward. It’s a simple but all too common problem – the cluster has not been added to the universal transport zone.


You’d be surprised how often I see this, but to be fair, it’s very easily overlooked. I sometimes need to remind myself to check all the basics first, especially when dealing with new deployments. The key symptom that raised red flags for me was the lack of auto-generated port groups on the distributed switch. The addition of the cluster to the transport zone will trigger the creation of these port groups. If they don’t exist, this should be the first thing that is checked.

As soon as I added the compute-r cluster to the Universal TZ transport zone, we saw an immediate slew of portgroup creation tasks:


I’ve now essentially told NSX that I want all the logical switches in that transport zone to span to the compute-r cluster. In NSX-V, we can think of a transport zone as a boundary spanning one or more clusters. Only clusters in that transport zone will have the logical switches available to them for use.

The concept of a ‘Universal Transport Zone’ just takes this a step further and allows clusters in different vCenter instances to connect to the same universal logical switches. The fact that we saw portgroups for the 9000-900X range of VNIs tells us that the compute-r cluster existed in the non-universal Transport Zone called ‘Remote TZ’, but was missing from ‘Universal TZ’.
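The boundary idea can be sketched in a few lines of Python. This is a toy model, not how NSX stores its configuration. The transport zone names and the compute-r cluster come from this scenario, while the logical switch names and the `compute-a` cluster are hypothetical placeholders.

```python
def available_switches(transport_zones, cluster):
    """Return the logical switches whose transport zone includes the cluster."""
    return sorted(
        switch
        for zone in transport_zones.values()
        if cluster in zone["clusters"]
        for switch in zone["logical_switches"]
    )


# Hypothetical switch names; compute-a stands in for a primary-site cluster.
transport_zones = {
    "Remote TZ": {
        "clusters": {"compute-r"},
        "logical_switches": {"Remote LS 9000", "Remote LS 9001"},
    },
    "Universal TZ": {
        "clusters": {"compute-a"},
        "logical_switches": {"Universal LS 1", "Universal LS 2"},
    },
}

# Before the fix: compute-r only sees the non-universal switches.
before = available_switches(transport_zones, "compute-r")

# The fix: add the cluster to the universal transport zone.
transport_zones["Universal TZ"]["clusters"].add("compute-r")
after = available_switches(transport_zones, "compute-r")

print(before)
print(after)
```

Adding the cluster to the zone is the only change, yet the set of switches available to it grows. That mirrors the burst of port group creation tasks seen on dvs-rem.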


Thanks again to everyone for posting their testing suggestions and theories! I hope you enjoyed this scenario. If you have other suggestions for troubleshooting scenarios you’d like to see, please leave a comment, or reach out to me on Twitter (@vswitchzero).



NSX 6.4.0 Upgrade Compatibility

Thinking about upgrading to NSX 6.4.0? As I discussed in my recent Ten Tips for a Successful NSX Upgrade post, it’s always a good idea to do your research before upgrading. Along with reading the release notes, checking the VMware compatibility matrix is essential.

VMware just updated some of the compatibility matrices to include information about 6.4.0. Here are the relevant links:

From an NSX upgrade path perspective, you’ll be happy to learn that any current build of NSX 6.2.x or 6.3.x should be fine. At the time of writing, this would be 6.2.9 and earlier as well as 6.3.5 and earlier.


NSX upgrade compatibility – screenshot from 1/17/2018.

On a positive note, VMware required a workaround for some older 6.2.x builds to go to 6.3.5, but this is no longer required for 6.4.0. The underlying issue that required this has been resolved.

From a vCenter and ESXi 6.0 and 6.5 perspective, the requirements for NSX 6.4.0 remain largely unchanged from late 6.3.x releases. What you’ll immediately notice is that NSX 6.4.0 is not supported with vSphere 5.5. If you are running vSphere 5.5, you’ll need to get to at least 6.0 U2 before considering NSX 6.4.0.

From the NSX 6.4.0 release notes:

For vSphere 6.0:
Supported: 6.0 Update 2, 6.0 Update 3
Recommended: 6.0 Update 3. vSphere 6.0 Update 3 resolves the issue of duplicate VTEPs in ESXi hosts after rebooting vCenter server. See VMware Knowledge Base article 2144605 for more information.

For vSphere 6.5:
Supported: 6.5a, 6.5 Update 1
Recommended: 6.5 Update 1. vSphere 6.5 Update 1 resolves the issue of EAM failing with OutOfMemory. See VMware Knowledge Base Article 2135378 for more information.

Note: vSphere 5.5 is not supported with NSX 6.4.0.
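If you want to sanity check a handful of environments against these requirements, the excerpt above can be captured in a small lookup. This is a minimal Python sketch of my own. It encodes only the releases quoted from the release notes at the time of writing, so anything newer will show as not listed until the table is updated.

```python
# vSphere releases quoted from the NSX 6.4.0 release notes excerpt above.
NSX_640_VSPHERE = {
    "6.0": {
        "supported": ("6.0 Update 2", "6.0 Update 3"),
        "recommended": "6.0 Update 3",
    },
    "6.5": {
        "supported": ("6.5a", "6.5 Update 1"),
        "recommended": "6.5 Update 1",
    },
}


def vsphere_status(release):
    """Classify a vSphere release string against the quoted requirements."""
    for entry in NSX_640_VSPHERE.values():
        if release == entry["recommended"]:
            return "recommended"
        if release in entry["supported"]:
            return "supported"
    return "not listed as supported"


for release in ("5.5 Update 3", "6.0 Update 2", "6.5 Update 1"):
    print(release, "->", vsphere_status(release))
```

The release strings must match the wording in the release notes exactly. A real check should use the current compatibility matrix as its source of truth.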

It doesn’t appear that the matrix has been updated yet for other VMware products that interact with NSX, such as vCloud Director.

Before rushing out to upgrade to NSX 6.4.0, be sure to check for compatibility – especially if you are using any third party products. It may be some time before other vendors certify their products for 6.4.0.

Stay tuned for a closer look at some of the new NSX 6.4.0 features!