Troubleshooting – Page 2

NSX Troubleshooting Scenario 12 – Solution

Welcome to the twelfth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall in the first half, our fictional customer was getting some unexpected behavior from a couple of firewall rules. Despite the rules being properly constructed, one VM called linux-a3 continued to be accessible via SSH.

tshoot12a-2 — The two rules in question – 1007 and 1008 – look to be constructed correctly.

We confirmed that the IP addresses for the machines in the security group were translated correctly by NSX and that the ruleset didn’t appear to be the problem. Let’s recap what we know:

VM linux-a2 seems to be working correctly and SSH traffic is blocked.
VM linux-a3 doesn’t seem to respect rule 1007 for some reason and remains accessible via SSH from everywhere.
Host esx-a3 where linux-a3 resides doesn’t appear to log any activity for rule 1007 or 1008 even though those rules are configured to log.
The two VMs are on different ESXi hosts (esx-a1 and esx-a3).
VMs linux-a2 and linux-a3 are in different dvPortgroups.

Given these statements, there are several things I’d want to check:

How can the two VMs have proper IP connectivity in VXLAN and VLAN portgroups as observed?
Is the DFW working at all on host esx-a3?
Did the last rule publication make it to host esx-a3 and does it match what we see in the UI?
Is the DFW (slot-2) dvfilter applied to linux-a3 correctly?

NSX Troubleshooting Scenario 12

Welcome to the twelfth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

For this scenario today, I’ve created some supplementary video content to go along with this post.

The Scenario

As always, we’ll start with a brief problem statement:

“I am just getting started with the NSX distributed firewall and see that the rules are not behaving as they should be. I have two VMs, linux-a2 and linux-a3 that should allow SSH from only one specific jump box. The linux-a3 VM can be accessed via SSH from anywhere! Why is this happening?”

To get started with this scenario, we’ll most certainly need to look at how the DFW rules are constructed to get the desired behavior. The immense flexibility of the distributed firewall allows for dozens of different ways to achieve what is described.

Here are the two VMs in question:

tshoot12a-4 — The VM linux-a2 is currently on host esx-a1 with IP address 172.16.15.10. It’s sitting on a logical switch.

And linux-a3:

tshoot12a-5 — The VM linux-a3 is currently on host esx-a3 with IP address 172.16.15.11. It’s sitting on a VLAN-backed dvPortgroup.

There are a couple of interesting observations above. The first is that both VMs have a security tag applied called ‘Linux-A VMs’. The other is a bit more of an oddity – one VM is in a VLAN-backed distributed switch portgroup called dvpg-a-vlan15, and the other is in a VXLAN-backed logical switch. Despite this, both VMs are in the same 172.16.15.0/24 subnet.
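The shared subnet is easy to confirm with Python’s standard `ipaddress` module. This is only an illustrative check using the addresses quoted above; the helper name `in_same_subnet` is made up for this sketch, and the result says nothing about how the VXLAN and VLAN segments are actually connected.

```python
import ipaddress

# Addresses and prefix taken from the scenario description above.
subnet = ipaddress.ip_network("172.16.15.0/24")
vms = {
    "linux-a2": ipaddress.ip_address("172.16.15.10"),  # on the logical switch
    "linux-a3": ipaddress.ip_address("172.16.15.11"),  # on dvpg-a-vlan15
}


def in_same_subnet(addresses, network):
    """Return True if every address falls inside the given network."""
    return all(addr in network for addr in addresses)


same_subnet = in_same_subnet(vms.values(), subnet)
print(f"Both VMs in {subnet}: {same_subnet}")
```

Because both addresses fall inside the same /24, neither VM would send traffic for the other to a default gateway. They would expect to reach each other directly at layer 2, which is what makes the mixed VXLAN and VLAN placement worth a closer look.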

No Bridged Adapters in VMware Workstation

Although I only support vSphere and VMware’s enterprise products, I use VMware Workstation every day. My work laptop runs Windows 10, but I maintain a couple of Linux VMs for day-to-day use as well. After a large Windows 10 feature update – 1709 I believe – I noticed that my Linux VMs were booting up without any networking. Their virtual adapters were simply reporting ‘link down’.

I had not changed any of the Workstation network configuration since I had installed it and always just used the defaults. For my guest VMs, I had always preferred to use ‘Bridged’ networking rather than NAT:

wsnets-1

What I found odd was that the VMnet0 connection usually associated with bridging was nowhere to be found in the ‘Virtual Network Editor’.

wsnets-2

When trying to add a new bridged network, I’d get the following error:

wsnets-3

The exact text is:

“Cannot change network to bridged: There are no un-bridged host network adapters.”

Clearly, Workstation thinks the adapters are already bridged despite there not being any listed in the virtual network editor.

NSX Troubleshooting Scenario 11 – Solution

Welcome to the eleventh installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall in the first half, our fictional customer was seeing some HA heartbeat channel alarms in the new HTML5 NSX dashboard.

tshoot11a-1

After doing some digging, we were able to determine that the ESG had an interface configured for HA on VLAN 16 and that from the CLI, the edge really was complaining about being unable to reach its peer.

You probably noticed in the first half that the HA interface doesn’t have an IP address configured. This may look odd, but it’s fine. Even if you did specify a custom /30 IP address for HA purposes, it would not show up as an interface IP address here. Rather, you’d need to look for one specified in the HA configuration settings here:

NSX Troubleshooting Scenario 11

Welcome to the eleventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“One of my ESXi hosts has a hardware problem. Ever since putting it into maintenance mode, I’m getting edge high availability alarms in the NSX dashboard. I think this may be a false alarm, because the two appliances are in the correct active and standby roles and not in split-brain. Why is this happening?”

A good question. This customer is using NSX 6.4.0, so the new HTML5 dashboard is what they are referring to here. Let’s see the dashboard alarms first hand.

tshoot11a-1

This is alarm code 130200, which indicates a failed HA heartbeat channel. This simply means that the two ESGs can’t talk to each other on the HA interface that was specified. Let’s have a look at edge-3, which is the ESG in question.

NSX Troubleshooting Scenario 10 – Solution

Welcome to the tenth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As we saw in the first half, our fictional administrator was attempting to configure an ESG load balancer for both TCP and UDP port 514 traffic. Below is the high-level topology:

tshoot10a-1

One of the first things to keep in mind when troubleshooting the NSX load balancer is the mode in which it’s operating. In this case, we know the customer is using a one-armed load balancer. The tell-tale sign is that the ESG sits in the same VLAN as the pool members with a single interface. Also, the pool members do not have the ESG configured as their default gateway.

We also know based on the screenshots in the first half that the load balancer is not operating in ‘Transparent’ mode – so traffic to the pool members should appear as though it’s coming from the load balancer virtual IP, not from the actual syslog clients. The packet capture the customer did proves that this is actually not the case.

That said, how exactly does an NSX one-armed load balancer work?

As traffic comes in on one of the interfaces and ports configured as a ‘virtual server’, the load balancer will simply forward the traffic to one of the pool members based on the load balancing algorithm configured. In our case, it’s a simple ‘round robin’ rotation of the pool members per session/socket. But forwarding would imply that the syslog servers would see traffic coming from the originating source IP of the syslog client. This would cause a fundamental problem with asymmetry when the pool member needs to reply. When it does, the traffic would bypass the ESG and be sent directly back to the client. This would be fine with UDP, which is connection-less, but what about TCP?
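The difference between plain forwarding and a source-rewriting load balancer can be sketched with a toy model. This is not how the ESG is implemented; the class and function names are invented for illustration, and the virtual IP, pool member and client addresses are hypothetical because the post doesn’t list them. Each call to `forward` stands in for a new session.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


class OneArmedLB:
    """Toy model of a one-armed load balancer; not the ESG's real logic."""

    def __init__(self, vip, pool, transparent=False):
        self.vip = vip
        self.transparent = transparent
        self._members = cycle(pool)  # round-robin rotation of pool members

    def forward(self, packet):
        """Pick the next pool member; each call stands in for a new session."""
        member = next(self._members)
        # Non-transparent mode rewrites the source, so the reply must
        # come back through the load balancer.
        src = packet.src if self.transparent else self.vip
        return Packet(src=src, dst=member)


def reply_destination(received):
    """A pool member simply replies to whatever source address it saw."""
    return received.src


# Hypothetical addresses for illustration only.
VIP = "172.16.15.20"
POOL = ["172.16.15.31", "172.16.15.32"]
CLIENT = "172.16.10.50"

request = Packet(src=CLIENT, dst=VIP)

snat_lb = OneArmedLB(VIP, POOL)
transparent_lb = OneArmedLB(VIP, POOL, transparent=True)

for name, lb in (("non-transparent", snat_lb), ("transparent", transparent_lb)):
    for _ in range(2):
        delivered = lb.forward(request)
        print(f"{name}: {delivered.dst} sees source {delivered.src}, "
              f"replies to {reply_destination(delivered)}")
```

In the transparent case the pool member replies straight to the client, which is the asymmetry described above: a TCP client that opened its connection to the virtual IP would get a reply from an address it never contacted.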

NSX Troubleshooting Scenario 10

Welcome to the tenth installment of my NSX troubleshooting series – a milestone number for the one-year anniversary of vswitchzero.com. I wasn’t sure how many of these I’d write, but I’ve gotten lots of positive feedback so if I can keep thinking of scenarios, I’ll keep going!

What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

I’ll try to include some questions as well for educational purposes in each post.

The Scenario

As always, we’ll start with a brief problem statement:

“I’m using an ESG load balancer to send syslog traffic to a pool of two Linux servers. I can only seem to get UDP syslog traffic to arrive at the pool members. TCP-based syslog traffic doesn’t work. I’m using a one-armed load balancer. If I do a packet capture, all I see is the UDP traffic, but it’s not coming from the load balancer.”

Using the NSX load balancer services for syslog purposes is not at all uncommon. We see this frequently with products like Splunk as well as others. Since syslog traffic can be very heavy, this is a good use case.

When it comes to troubleshooting NSX load balancer issues, triple checking the configuration is key. In speaking with the customer, this is his desired outcome:

One-armed load balancer in VLAN 15.
No routing done by the edge. Default gateway configuration only and a single interface for simplicity.
Transparency is not required – the source IP can be the load balancer as the required source information is in the syslog data transmitted.
A mix of both TCP and UDP port 514 traffic is to be load balanced.

Here is a basic, high-level topology provided by the customer:

tshoot10a-1

The one-armed load balancer called esg-lb1 is sitting in VLAN 15. Its default gateway is the SVI interface of the physical switch (172.16.15.1). There is only one hop between the ESXi hosts – the syslog clients – and the ESG in VLAN 15. Because this is a one-armed topology, the syslog-a1 and syslog-a2 servers are using the same switch SVI as their default gateway.

NSX Troubleshooting Scenario 9 – Solution

Welcome to the ninth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario nine. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As we saw in the first half, our fictional administrator was unable to install the NSX VIBs on the cluster called compute-a:

We also saw that there were two different NSX licenses added to vCenter. One called ‘Endpoint’ and the other ‘Enterprise’.

You can see that the ‘Usage’ for both licenses is currently “0 CPUs”, but that’s because NSX hasn’t been installed on any ESXi hosts yet, so no CPUs are being consumed. What’s most telling, however, is the small grey exclamation mark on the license icon. If I hover over this, I get a message stating:

“The license is not assigned. To comply with the EULA, assign the license to at least one asset.”

NSX Troubleshooting Scenario 9

Welcome to the ninth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“We’re in the process of deploying NSX. We were able to deploy the NSX Manager and Control Cluster, but every time we try to install the VIBs on the host, it fails with a licensing error. We have already added the license for NSX Enterprise in vCenter!”

Every time the customer tries to prepare cluster compute-a, they get the following error:

The exact error is:

“Operation is not allowed by the applied NSX license.”

Looking in the most obvious spot, we can see that the customer had indeed added a license for ‘NSX for vSphere – Enterprise’. Not only that, but there is also an ‘NSX for vShield Endpoint’ license.

NSX Troubleshooting Scenario 8 – Solution

Welcome to the eighth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario eight. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

In the first half of scenario 8, we saw that our fictional administrator was getting an error message while trying to deploy the first of three controller nodes.

The exact error was:

“Waiting for NSX controller ready controller-1 failed in deployment – Timeout on waiting for controller ready.”

Unfortunately, this doesn’t tell us a whole lot aside from the fact that the manager was waiting and eventually gave up.

tshoot8a-7

Now, before we begin troubleshooting, we should first think about the normal process for controller deployment. What exactly happens behind the scenes?

The necessary inputs are provided via the vSphere Client or REST API (e.g. deployment information such as the datastore and IP pool).
NSX Manager then deploys a controller OVF template that is stored on its local filesystem. It does this using vSphere API calls via its inventory tie-in with vCenter Server.
Once the OVF template is deployed, it will be powered on.
During initial power on, the machine will receive an IP address, either via DHCP or via the pool assignment.
Once the controller node has booted, NSX Manager will begin to push the necessary configuration information to it via REST API calls.
Once the controller node is up, and is able to serve requests and communicate with NSX Manager, the deployment is considered successful and the status in the UI changes from ‘Deploying’ to ‘Connected’.
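The final step, where the manager waits and eventually gives up, is a classic poll-with-timeout pattern. The sketch below is a generic illustration of that pattern and not NSX Manager’s actual code; the function name `wait_for_controller`, the `is_ready` callback and the timeout and interval values are all assumptions made for the example.

```python
import time


def wait_for_controller(is_ready, timeout_s, poll_interval_s=5.0, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return "Connected"
        if time.monotonic() >= deadline:
            # Mirrors the wording of the error seen in the UI.
            raise TimeoutError("Timeout on waiting for controller ready")
        sleep(poll_interval_s)


# Demo with a fake readiness check that succeeds on the third poll.
polls = iter([False, False, True])
status = wait_for_controller(lambda: next(polls), timeout_s=10, poll_interval_s=0)
print(status)
```

A timeout from a loop like this only says that the readiness check never succeeded. It doesn’t say which of the earlier steps (OVF deployment, power on, IP assignment or configuration push) went wrong, which is why the error message alone tells us so little.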

Let’s have a look at the NSX Manager logging to see if we can get more information:
