NSX Troubleshooting Scenario 11

Welcome to the eleventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“One of my ESXi hosts has a hardware problem. Ever since putting it into maintenance mode, I’m getting edge high availability alarms in the NSX dashboard. I think this may be a false alarm, because the two appliances are in the correct active and standby roles and not in split-brain. Why is this happening?”

A good question. This customer is using NSX 6.4.0, so the new HTML5 dashboard is what they are referring to here. Let’s see the dashboard alarms firsthand.

tshoot11a-1

This is alarm code 130200, which indicates a failed HA heartbeat channel. This simply means that the two ESGs can’t talk to each other on the HA interface that was specified. Let’s have a look at edge-3, which is the ESG in question.

Continue reading “NSX Troubleshooting Scenario 11”

NSX Troubleshooting Scenario 10 – Solution

Welcome to the tenth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As we saw in the first half, our fictional administrator was attempting to configure an ESG load balancer for both TCP and UDP port 514 traffic. Below is the high-level topology:

tshoot10a-1

One of the first things to keep in mind when troubleshooting the NSX load balancer is the mode in which it’s operating. In this case, we know the customer is using a one-armed load balancer. The tell-tale sign is that the ESG sits in the same VLAN as the pool members with a single interface. Also, the pool members do not have the ESG configured as their default gateway.

We also know based on the screenshots in the first half that the load balancer is not operating in ‘Transparent’ mode – so traffic to the pool members should appear as though it’s coming from the load balancer virtual IP, not from the actual syslog clients. The packet capture the customer did proves that this is actually not the case.

That said, how exactly does an NSX one-armed load balancer work?

As traffic comes in on one of the interfaces and ports configured as a ‘virtual server’, the load balancer will simply forward the traffic to one of the pool members based on the load balancing algorithm configured. In our case, it’s a simple ‘round robin’ rotation of the pool members per session/socket. But forwarding would imply that the syslog servers would see traffic coming from the originating source IP of the syslog client. This would cause a fundamental problem with asymmetry when the pool member needs to reply. When it does, the traffic would bypass the ESG and be sent directly back to the client. This would be fine with UDP, which is connectionless, but what about TCP?
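To make the round robin rotation and the asymmetry concern a bit more concrete, here is a minimal Python sketch. It is only an illustration of the idea and not how the ESG load balancer is actually implemented. The class and function names, the client addresses and the virtual IP 172.16.15.10 are all hypothetical; only the pool member names syslog-a1 and syslog-a2 come from the scenario.

```python
from itertools import cycle


class RoundRobinPool:
    """Hypothetical model of per-session round robin member selection."""

    def __init__(self, members):
        self._rotation = cycle(members)  # endless a1, a2, a1, a2, ...
        self._sessions = {}              # session tuple -> chosen member

    def pick(self, session):
        # A session is a (client_ip, client_port, protocol) tuple.
        # A new session takes the next member in the rotation; an
        # existing session sticks with the member it was first given.
        if session not in self._sessions:
            self._sessions[session] = next(self._rotation)
        return self._sessions[session]


def forwarded_packet(session, member, vip, transparent):
    """Return what the pool member would see for this session.

    If the client source IP is preserved, the member replies straight
    to the client and the reply bypasses the load balancer. If the
    source is rewritten to the load balancer address, the reply comes
    back through it.
    """
    client_ip, _client_port, proto = session
    src = client_ip if transparent else vip
    return {"src": src, "dst": member, "proto": proto, "reply_to": src}


if __name__ == "__main__":
    vip = "172.16.15.10"  # hypothetical virtual IP in VLAN 15
    pool = RoundRobinPool(["syslog-a1", "syslog-a2"])
    for session in [("10.0.0.1", 40001, "tcp"), ("10.0.0.2", 40002, "tcp")]:
        member = pool.pick(session)
        print(forwarded_packet(session, member, vip, transparent=False))
```

With transparent set to False, the reply_to field is always the load balancer address, which is what we would expect the pool members to see in a packet capture for this configuration.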

Continue reading “NSX Troubleshooting Scenario 10 – Solution”

NSX Troubleshooting Scenario 10

Welcome to the tenth installment of my NSX troubleshooting series – a milestone number for the one-year anniversary of vswitchzero.com. I wasn’t sure how many of these I’d write, but I’ve gotten lots of positive feedback so if I can keep thinking of scenarios, I’ll keep going!

What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

I’ll try to include some questions as well for educational purposes in each post.

The Scenario

As always, we’ll start with a brief problem statement:

“I’m using an ESG load balancer to send syslog traffic to a pool of two Linux servers. I can only seem to get UDP syslog traffic to arrive at the pool members. TCP based syslog traffic doesn’t work. I’m using a one-armed load balancer. If I do a packet capture, all I see is the UDP traffic but it’s not coming from the load balancer”

Using the NSX load balancer services for syslog purposes is not at all uncommon. We see this frequently with products like Splunk as well as others. Since syslog traffic can be very heavy, this is a good use case.

When it comes to troubleshooting NSX load balancer issues, triple-checking the configuration is key. In speaking with the customer, I confirmed that this is his desired outcome:

  • One-armed load balancer in VLAN 15.
  • No routing done by the edge. Default gateway configuration only and a single interface for simplicity.
  • Transparency is not required – the source IP can be the load balancer as the required source information is in the syslog data transmitted.
  • A mix of both TCP and UDP port 514 traffic is to be load balanced.

Here is a basic, high-level topology provided by the customer:

tshoot10a-1

The one-armed load balancer called esg-lb1 is sitting in VLAN 15. Its default gateway is the SVI interface of the physical switch (172.16.15.1). There is only one hop between the ESXi hosts – the syslog clients – and the ESG in VLAN 15. Because this is a one-armed topology, the syslog-a1 and syslog-a2 servers are using the same switch SVI as their default gateway.

Continue reading “NSX Troubleshooting Scenario 10”

Blank Error While Adding NSX DLR or ESG Interfaces

I recently deployed NSX 6.3.2 in my home lab to do some testing. After deploying a DLR, I went back in to add some additional interfaces and was greeted by a ‘blank’ or null error message. Having run into this problem before, I thought it may be a good idea to give some additional context to VMware KB 2151309.

dlrblankerror-1

As you can see above, there is no text associated with the error. There are no problems with the IP or mask I used, and it doesn’t seem clear why this would be failing.

You would expect to find more detail in the NSX Manager vsm.log file, but interestingly there is nothing there at all for this exception. That’s because this isn’t an NSX fault, but rather something in the vSphere Web Client.

Continue reading “Blank Error While Adding NSX DLR or ESG Interfaces”

NSX Troubleshooting Scenario 9 – Solution

Welcome to the ninth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario nine. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As we saw in the first half, our fictional administrator was unable to install the NSX VIBs on the cluster called compute-a:

tshoot9a-1

We also saw that there were two different NSX licenses added to vCenter: one called ‘Endpoint’ and the other ‘Enterprise’.

tshoot9b-1

You can see that the ‘Usage’ for both licenses is currently “0 CPUs”, but that’s because NSX hasn’t been installed on any ESXi hosts yet, so no CPU licenses have been consumed. What’s most telling, however, is the small grey exclamation mark on the license icon. If I hover over this, I get a message stating:

“The license is not assigned. To comply with the EULA, assign the license to at least one asset.”

Continue reading “NSX Troubleshooting Scenario 9 – Solution”

NSX Troubleshooting Scenario 9

Welcome to the ninth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“We’re in the process of deploying NSX. We were able to deploy the NSX Manager and Control Cluster, but every time we try to install the VIBs on the host, it fails with a licensing error. We have already added the license for NSX Enterprise in vCenter!”

Every time the customer tries to prepare cluster compute-a, they get the following error:

tshoot9a-1

The exact error is:

“Operation is not allowed by the applied NSX license.”

Looking in the most obvious spot, we can see that the customer had indeed added a license for ‘NSX for vSphere – Enterprise’. Not only that, but there is also an ‘NSX for vShield Endpoint’ license.

Continue reading “NSX Troubleshooting Scenario 9”

NSX Troubleshooting Scenario 8 – Solution

Welcome to the eighth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario eight. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

In the first half of scenario 8, we saw that our fictional administrator was getting an error message while trying to deploy the first of three controller nodes.

The exact error was:

“Waiting for NSX controller ready controller-1 failed in deployment – Timeout on waiting for controller ready.”

Unfortunately, this doesn’t tell us a whole lot aside from the fact that the manager was waiting and eventually gave up.

tshoot8a-7

Now, before we begin troubleshooting, we should first think about the normal process for controller deployment. What exactly happens behind the scenes?

  1. The necessary inputs are provided via the vSphere Client or REST API (i.e. deployment information like datastore, IP pool, etc.).
  2. NSX Manager then deploys a controller OVF template that is stored on its local filesystem. It does this using vSphere API calls via its inventory tie-in with vCenter Server.
  3. Once the OVF template is deployed, it will be powered on.
  4. During initial power on, the machine will receive an IP address, either via DHCP or via the pool assignment.
  5. Once the controller node has booted, NSX Manager will begin to push the necessary configuration information to it via REST API calls.
  6. Once the controller node is up, and is able to serve requests and communicate with NSX Manager, the deployment is considered successful and the status in the UI changes from ‘Deploying’ to ‘Connected’.
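The last step above, combined with the timeout error, suggests a wait-until-ready loop with a deadline. The Python sketch below is only a rough model of that pattern and not NSX Manager’s actual code. The function name, the is_ready callback and the timeout values are hypothetical; in a real deployment, the readiness check would be whatever NSX Manager uses to confirm that the controller is serving requests.

```python
import time


def wait_for_controller_ready(is_ready, timeout_s, poll_interval_s=1.0):
    """Poll a readiness check until it succeeds or the deadline passes.

    is_ready is a hypothetical callable returning True once the
    controller can serve requests and talk back to the manager.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return "Connected"       # status shown once deployment succeeds
        time.sleep(poll_interval_s)  # still 'Deploying', try again shortly
    # The manager gives up; in the scenario the appliance is then deleted.
    raise TimeoutError("Timeout on waiting for controller ready")


if __name__ == "__main__":
    # Simulate a controller that never becomes reachable.
    try:
        wait_for_controller_ready(lambda: False, timeout_s=0.3,
                                  poll_interval_s=0.1)
    except TimeoutError as err:
        print(f"Deployment failed: {err}")
```

The point of the model is that a timeout only tells us the readiness check never passed. Any of the earlier steps, such as IP assignment or connectivity back to NSX Manager, could be the real cause.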

Let’s have a look at the NSX Manager logging to see if we can get more information:

Continue reading “NSX Troubleshooting Scenario 8 – Solution”

NSX Troubleshooting Scenario 8

Welcome to the eighth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“I’m doing a new greenfield deployment of NSX and my control cluster is failing to deploy. It seems stuck at ‘Deploying’ and then after a long period of time, it gives me a failure and the appliance gets deleted.”

Let’s have a look and see what this fictional administrator is seeing:

tshoot8a-2

We can see that they’ve successfully deployed NSX Manager at version 6.3.2 and have no controllers successfully deployed yet.

tshoot8a-3

A valid-looking IP pool has been created for the controllers with all the pertinent IP settings populated. The controller deployment is being done with the following settings:

Continue reading “NSX Troubleshooting Scenario 8”

NSX Troubleshooting Scenario 7 – Solution

Welcome to the seventh installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario seven. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

In the first half of this scenario, we saw that our fictional customer was hitting an exception every time they tried to convert their secondary – now a transit – NSX Manager to the standalone role. The error message seemed to imply that numerous universal objects were still in the environment.

tshoot7a-2

Our quick spot checks didn’t show any lingering universal objects, but looking at the NSX Manager logging can tell us a bit more about what still exists:

2018-03-26 22:27:21.779 GMT  INFO http-nio-127.0.0.1-7441-exec-1 ReplicationConfigurationServiceImpl:152 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Role validation successful
2018-03-26 22:27:21.792 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 1 universal objects exists for type VdnScope
2018-03-26 22:27:21.793 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): Universal TZ
2018-03-26 22:27:21.794 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 1 universal objects exists for type Edge
2018-03-26 22:27:21.797 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): dlr-universal
2018-03-26 22:27:21.798 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 5 universal objects exists for type VirtualWire
2018-03-26 22:27:21.806 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): Universal Transit, Universal Test, Universal App, Universal Web, Universal DB
2018-03-26 22:27:21.809 GMT  INFO http-nio-127.0.0.1-7441-exec-1 L2UniversalSyncListenerImpl:58 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Global VNI pool exists
2018-03-26 22:27:21.814 GMT  WARN http-nio-127.0.0.1-7441-exec-1 ReplicationConfigurationServiceImpl:101 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Setting role to TRANSIT because following object types have universal objects VniPool,VdnScope,Edge,VirtualWire
2018-03-26 22:27:21.816 GMT  INFO http-nio-127.0.0.1-7441-exec-1 AuditingServiceImpl:174 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] [AuditLog] UserName:'LAB\mike', ModuleName:'UniversalSync', Operation:'ASSIGN_STANDALONE_ROLE', Resource:'', Time:'Mon Mar 26 14:27:21.815 GMT 2018', Status:'FAILURE', Universal Object:'false'
2018-03-26 22:27:21.817 GMT  WARN http-nio-127.0.0.1-7441-exec-1 RemoteInvocationTraceInterceptor:88 - Processing of VsmHttpInvokerServiceExporter remote call resulted in fatal exception: com.vmware.vshield.vsm.replicator.configuration.facade.ReplicatorConfigurationFacade.setAsStandalonere.vshield.vsm.exceptions.InvalidArgumentException: core-services:125023:Unable to assign STANDALONE role. Universal objects of following types are present:

If you look closely at the messages above, you can see a list of what still exists. Keep in mind that a maximum of five objects per category is included in the log messages. In this case, they are:

Transport Zones: Universal TZ
Edges: dlr-universal
Logical Switches: Universal Transit, Universal Test, Universal App, Universal Web, Universal DB
VNI Pools: 1 exists
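If you’d rather not pick through the log by eye, a few lines of Python can build the same summary. This is a minimal sketch that assumes the message wording matches the 6.x log excerpt above; the function name and regular expressions are my own and not part of any NSX tooling. The sample lines are trimmed to the message portion for brevity. Note that object names containing commas would be split incorrectly.

```python
import re

# Patterns based on the message text in the log excerpt above.
COUNT_RE = re.compile(r"(\d+) universal objects exists for type (\w+)")
NAMES_RE = re.compile(r"Some objects are \(printing maximum 5 names\): (.+)$")

SAMPLE = [
    "1 universal objects exists for type VdnScope",
    "Some objects are (printing maximum 5 names): Universal TZ",
    "1 universal objects exists for type Edge",
    "Some objects are (printing maximum 5 names): dlr-universal",
    "5 universal objects exists for type VirtualWire",
    "Some objects are (printing maximum 5 names): Universal Transit, "
    "Universal Test, Universal App, Universal Web, Universal DB",
    "Global VNI pool exists",
]


def summarize_universal_objects(lines):
    """Return {object_type: {"count": int, "names": [str, ...]}}."""
    summary = {}
    current_type = None
    for line in lines:
        line = line.rstrip()
        count_match = COUNT_RE.search(line)
        if count_match:
            current_type = count_match.group(2)
            summary[current_type] = {"count": int(count_match.group(1)),
                                     "names": []}
            continue
        names_match = NAMES_RE.search(line)
        if names_match and current_type:
            # The names line always follows the count line for its type.
            names = [n.strip() for n in names_match.group(1).split(",")]
            summary[current_type]["names"] = names
            current_type = None
            continue
        if "Global VNI pool exists" in line:
            summary["VniPool"] = {"count": 1, "names": []}
    return summary


if __name__ == "__main__":
    for obj_type, info in summarize_universal_objects(SAMPLE).items():
        print(obj_type, info["count"], ", ".join(info["names"]))
```

Because the patterns use search rather than match, the same function should also work on full log lines that include the timestamp and thread prefix.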

This is indeed a list of everything the customer claims to have deleted from the environment. From the perspective of the ‘Transit’ manager, these objects still exist for some reason.

How We Got Here

Looking back at the order of operations the user performed tells us something important:

  1. First, he disconnected the secondary NSX Manager from the primary. This was successful, and it changed its role from Secondary to ‘Transit’.
  2. Next, he attempted to convert it to a ‘Standalone’ manager. This failed with the same error message mentioned earlier. This seemed valid, however, because those objects really did exist.
  3. At this point, he removed the remaining universal logical switches, edges and transport zone. These were all deleted successfully.
  4. The subsequent attempts to convert the manager to ‘Standalone’ continued to fail with the same error message even though the objects were gone.

Notice the very first step – he disconnected the secondary from the primary NSX Manager.

Continue reading “NSX Troubleshooting Scenario 7 – Solution”

NSX Troubleshooting Scenario 7

Welcome to the seventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“I’m in the process of decommissioning a remote location. I’m trying to convert the secondary NSX manager to a standalone, but it fails every time saying that universal objects need to be deleted. I’ve removed all of them and the error persists!”

Well, this seems odd. Let’s look at the environment and try to reproduce the issue to see what the exact error is.

tshoot7a-1

It looks like the 172.19.10.40 NSX Manager is currently in Transit mode. This means that it was removed as a secondary at some point but has not been converted to a standalone. This is the operation that is failing:

tshoot7a-2

Continue reading “NSX Troubleshooting Scenario 7”