Troubleshooting – Page 3

NSX Troubleshooting Scenario 9

Welcome to the ninth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“We’re in the process of deploying NSX. We were able to deploy the NSX Manager and Control Cluster, but every time we try to install the VIBs on the host, it fails with a licensing error. We have already added the license for NSX Enterprise in vCenter!”

Every time the customer tries to prepare cluster compute-a, they get the following error:

The exact error is:

“Operation is not allowed by the applied NSX license.”

Looking in the most obvious spot, we can see that the customer had indeed added a license for ‘NSX for vSphere – Enterprise’. Not only that, but there is also an ‘NSX for vShield Endpoint’ license.

Continue reading “NSX Troubleshooting Scenario 9”

NSX Troubleshooting Scenario 8 – Solution

Welcome to the eighth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario eight. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

In the first half of scenario 8, we saw that our fictional administrator was getting an error message while trying to deploy the first of three controller nodes.

The exact error was:

“Waiting for NSX controller ready controller-1 failed in deployment – Timeout on waiting for controller ready.”

Unfortunately, this doesn’t tell us a whole lot aside from the fact that the manager was waiting and eventually gave up.

tshoot8a-7

Now, before we begin troubleshooting, we should first think about the normal process for controller deployment. What exactly happens behind the scenes?

The necessary inputs are provided via the vSphere Client or REST API (i.e. deployment information like datastore, IP Pool etc).
NSX Manager then deploys a controller OVF template that is stored on it’s local filesystem. It does this using vSphere API calls via its inventory tie-in with vCenter Server.
Once the OVF template is deployed, it will be powered on.
During initial power on, the machine will receive an IP address, either via DHCP or via the pool assignment.
Once the controller node has booted, NSX Manager will begin to push the necessary configuration information to it via REST API calls.
Once the controller node is up, and is able to serve requests and communicate with NSX Manager, the deployment is considered successful and the status in the UI changes from ‘Deploying’ to ‘Connected’

Let’s have a look at the NSX Manager logging to see if we can get more information:

Continue reading “NSX Troubleshooting Scenario 8 – Solution”

NSX Troubleshooting Scenario 8

Welcome to the eighth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“I’m doing a new greenfield deployment of NSX and my control cluster is failing to deploy. It seems stuck at ‘Deploying’ and then after a long period of time, it gives me a failure and the appliance gets deleted.”

Let’s have a look and see what this fictional administrator is seeing:

tshoot8a-2

We can see that they’ve successfully deployed NSX Manager at version 6.3.2 and have no controllers successfully deployed yet.

tshoot8a-3

A valid looking IP pool has been created for the controllers with all the pertinent IP settings populated. The controller deployment is being done with the following settings:

Continue reading “NSX Troubleshooting Scenario 8”

NSX Troubleshooting Scenario 7 – Solution

Welcome to the seventh installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario seven. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

In the first half of this scenario, we saw that our fictional customer was hitting an exception every time they tried to convert their secondary – now a transit – NSX Manager to the standalone role. The error message seemed to imply that numerous universal objects were still in the environment.

tshoot7a-2

Our quick spot checks didn’t show any lingering universal objects, but looking at the NSX Manager logging can tell us a bit more about what still exists:

2018-03-26 22:27:21.779 GMT  INFO http-nio-127.0.0.1-7441-exec-1 ReplicationConfigurationServiceImpl:152 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Role validation successful
2018-03-26 22:27:21.792 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 1 universal objects exists for type VdnScope
2018-03-26 22:27:21.793 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): Universal TZ
2018-03-26 22:27:21.794 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 1 universal objects exists for type Edge
2018-03-26 22:27:21.797 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): dlr-universal
2018-03-26 22:27:21.798 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 5 universal objects exists for type VirtualWire
2018-03-26 22:27:21.806 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): Universal Transit, Universal Test, Universal App, Universal Web, Universal DB
2018-03-26 22:27:21.809 GMT  INFO http-nio-127.0.0.1-7441-exec-1 L2UniversalSyncListenerImpl:58 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Global VNI pool exists
2018-03-26 22:27:21.814 GMT  WARN http-nio-127.0.0.1-7441-exec-1 ReplicationConfigurationServiceImpl:101 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Setting role to TRANSIT because following object types have universal objects VniPool,VdnScope,Edge,VirtualWire
2018-03-26 22:27:21.816 GMT  INFO http-nio-127.0.0.1-7441-exec-1 AuditingServiceImpl:174 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] [AuditLog] UserName:'LAB\mike', ModuleName:'UniversalSync', Operation:'ASSIGN_STANDALONE_ROLE', Resource:'', Time:'Mon Mar 26 14:27:21.815 GMT 2018', Status:'FAILURE', Universal Object:'false'
2018-03-26 22:27:21.817 GMT  WARN http-nio-127.0.0.1-7441-exec-1 RemoteInvocationTraceInterceptor:88 - Processing of VsmHttpInvokerServiceExporter remote call resulted in fatal exception: com.vmware.vshield.vsm.replicator.configuration.facade.ReplicatorConfigurationFacade.setAsStandalonere.vshield.vsm.exceptions.InvalidArgumentException: core-services:125023:Unable to assign STANDALONE role. Universal objects of following types are present:

If you look closely at the messages above, you can see a list of what still exists. Keep in mind that a maximum of five objects per category is included in the log messages. In this case, they are:

Transport Zones: Universal TZ
Edges: dlr-universal
Logical Switches: Universal Transit, Universal Test, Universal App, Universal Web, Universal DB
VNI Pools: 1 exists

This is indeed a list of everything the customer claims to have deleted from the environment. From the perspective of the ‘Transit’ manager, these objects still exist for some reason.

How We Got Here

Looking back at the order of operations the user did tells us something important:

First, he disconnected the secondary NSX Manager from the primary. This was successful, and it changed its role from Secondary to ‘Transit’.
Next, he attempted to convert it to a ‘Standalone’ manager. This failed with the same error message mentioned earlier. This seemed valid, however, because those objects really did exist.
At this point, he removed the remaining universal logical switches, edges and transport zone. These were all deleted successfully.
The subsequent attempts to convert the manager to a ‘Standalone’ continue to fail with the same error message even though the objects are gone.

Notice the very first step – they disconnected the secondary from the primary NSX Manager.

Continue reading “NSX Troubleshooting Scenario 7 – Solution”

NSX Troubleshooting Scenario 7

Welcome to the seventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“I’m in the process of decommissioning a remote location. I’m trying to convert the secondary NSX manager to a standalone, but it fails every time saying that universal objects need to be deleted. I’ve removed all of them and the error persists!”

Well, this seems odd. Let’s look at the environment and try to reproduce the issue to see what the exact error is.

It looks like the 172.19.10.40 NSX manager is currently in Transit mode. This means that it was removed as a secondary at some point but has not been converted to a standalone. This is the operation that is failing:

tshoot7a-2

Continue reading “NSX Troubleshooting Scenario 7”

NSX Troubleshooting Scenario 6 – Solution

As we saw in the first half of scenario 6, a fictional administrator enabled the DFW in their management cluster, which caused some unexpected filtering to occur. Their vCenter Server was no longer allowed the necessary HTTPS port 443 traffic needed for the vSphere Web Client to work.

Since we can no longer manage the environment or the DFW using the UI, we’ll need to revert this change using some other method.

As mentioned previously, we are fortunate in that NSX Manager is always excluded from DFW filtering by default. This is done to protect against this very type of situation. Because the NSX management plane is still fully functional, we should – in theory – still be able to relay API based DFW calls to NSX Manager. NSX Manager will in turn be able to publish these changes to the necessary ESXi hosts.

There are two relatively easy ways to fix this that come to mind:

Use the approach outlined in KB 2079620. This is the equivalent of doing a factory reset of the DFW ruleset via API. This will wipe out all rules and they’ll need to be recovered or recreated.
Use an API call to disable the DFW in the management cluster. This will essentially revert the exact change the user did in the UI that started this problem.

There are other options, but above two will work to restore HTTP/HTTPS connectivity to vCenter. Once that is done, some remediation will be necessary to ensure this doesn’t happen again. Rather than picking a specific solution, I’ll go through both of them.

Continue reading “NSX Troubleshooting Scenario 6 – Solution”

NSX Troubleshooting Scenario 6

Welcome to the sixth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

NSX Troubleshooting Scenario 6

As always, we’ll start with a brief customer problem statement:

“Help! It looks like we accidentally blocked access to vCenter Server! We have two clusters, a compute and a management cluster. My colleague noticed the firewall was disabled on the management cluster and turned it on. As soon as he did that we lost all access to the vSphere Web Client.”

Well, this sounds like a ‘chicken or the egg dilemma’ – how can they recover if they can’t log in to the vSphere Web Client to revert the changes that broke things?

In speaking with our fictional customer, we learn that some rules are in place to block all HTTP/HTTPS access in the compute cluster. Because they are still deploying VMs and getting everything patched, they are using this as a temporary means to prevent all web access. Unfortunately, he can’t remember exactly what was configured in the firewall and there may be other restrictions in place.

This was a screenshot of the last thing he saw before his web client session started timing out:

tshoot6a-1

Starting with some basic ping tests, we can see that the vCenter Server and NSX Manager are both still accessible from a layer-3 perspective:

Continue reading “NSX Troubleshooting Scenario 6”

Missing NSX vdrPort and Auto Deploy

If you are running Auto Deploy and noticed your VMs didn’t have connectivity after a host reboot or upgrade, you may have run into the problem described in VMware KB 52903. I’ve seen this a few times now with different customers and thought a PSA may be in order. You can find all the key details in the KB, but I thought I’d add some extra context here to help anyone who may want more information.

I recently helped to author VMware KB 52903, which has just been made public. Essentially, it describes a race condition causing a host to come up without its vdrPort connected to the distributed switch. The vdrPort is an important component on an ESXi host that funnels traffic to/from the NSX DLR module. If this port isn’t connected, traffic can’t make it to the DLR for east/west routing on that host. Technically, VMs in the same logical switches will be able to communicate across hosts, but none of the VMs on this impacted host will be able to route.

The Problem

The race condition occurs when the DVS registration of the host occurs too late in the boot process. Normally, the distributed switch should be initialized and registered long before the vdrPort gets connected. In some situations, however, DVS registration can be late. Obviously, if the host isn’t yet initialized/registered with the distributed switch, any attempt to connect something to it will fail. And this is exactly what happens.

Using the log lines from KB 52903 as an example, we can see that the host attempts to add the vdrPort to the distributed switch at 23:44:19:

2018-02-08T23:44:19.431Z error netcpa[3FFEDA29700] [Originator@6876 sub=Default] Failed to add vdr port on dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d, Not found

The reason the operation fails is because the DVS switch with the UUID specified is not found from the perspective of this host. It simply hasn’t been initialized yet. A few moments later, the DVS is finally ready for use on the host. Notice the time stamps – you can see the registration of the DVS about 9 seconds later:

2018-02-08T23:44:28.389Z info hostd[4F540B70] [Originator@6876 sub=Hostsvc.DvsTracker] Registered Dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d

The above message can be found in /var/log/hostd.log.

Continue reading “Missing NSX vdrPort and Auto Deploy”

NSX Troubleshooting Scenario 5 – Solution

Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.

Please see the first half for more detail on the problem symptoms and some scoping.

Reader Suggestions

There were a few good suggestions from readers. Here are a couple from Twitter:

Can you run the below 2 commands to see if the firewall rules are deployed to the host?

show dfw host host-id summarize-dvfilter
show dfw host hostID filter filterID rules

If you migrate the Ubuntu VM to esx-a1 do the rules apply properly on the VM?

— Tom Otomanski 🇵🇱 (@Th3N1ghtH4wk) February 26, 2018

Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.

Hi mike ! It is possible to get the status of VMware tools on Ubuntu VM ? 😉

— Alexis La Goutte (@alagoutte) February 23, 2018

The suggestion about VMware tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.

Getting Started

In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.

Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:

VM win-a1 and lubuntu-2 are working fine. I.e. they can’t browse the web.
VM lubuntu-1 is the only one not working. I.e. it can still browse the web.
The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
All three VMs have the correct security tag applied.
All three VMs are indeed showing up correctly in the security group due to the tag.
The two working VMs are on host esx-a1 and the broken VM is on host esx-a2

To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.

Continue reading “NSX Troubleshooting Scenario 5 – Solution”

NSX Troubleshooting Scenario 5

Welcome to the fifth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

NSX Troubleshooting Scenario 5

As always, we’ll start with a brief customer problem statement:

“We’ve just deployed NSX and are doing some testing with the distributed firewall. We created a security tag that we can apply to VMs to prevent them from browsing the web. We applied this tag on three virtual machines. It seems to work on two of them, but the third can always browse the web! Something is not working here”

After speaking to the customer, we were able to collect a bit more information about the VMs and traffic flows in question. Below are the VMs that should not be able to browse:

win-a1.lab.local – 172.17.1.30 (static)
lubuntu-1.lab.local – 172.17.1.101 (DHCP)
lubuntu-2.lab.local – 172.17.1.104 (DHCP)

Only the VM called lubuntu-1 is still able to browse. The others are fine. The customer has been using an internal web server called web-a1.lab.local for testing. That machine is in the same cluster and has an IP address of 172.17.1.11. It serves up a web page on port 80. All of the VMs in question are sitting in the same logical switch and the customer reports that all east-west and north-south routing is functioning normally.

To begin, let’s have a look at the DFW rules defined.

As you can see, they really did just start testing as there is only one new section and a single non-default rule. The rule is quite simple. Any HTTP/HTTPS traffic coming from VMs in the ‘No Browser’ security group should be blocked. We can also see that both this rule and the default were set to log as part of the troubleshooting activities.

Continue reading “NSX Troubleshooting Scenario 5”