Category Archives: Troubleshooting

NSX Troubleshooting Scenario 2 – Solution

Welcome to the second installment of a new series of NSX troubleshooting scenarios. This is the second half of scenario two, where I’ll perform some troubleshooting and resolve the problem.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As mentioned in the first half, the problem is limited to a host called esx-a1. As soon as a guest moves to that host, it has no network connectivity. If we move a guest off of the host, its connectivity is restored.

[Image: tshoot2b-1]

We have one VM called win-a1 on host esx-a1 for testing purposes at the moment. As expected, the VM can’t be reached.

To begin, let’s have a look at the host from the CLI to figure out what’s going on. We know that the UI is reporting that it’s not prepared and that it doesn’t have any VTEPs created. In reality, we know a VTEP exists but let’s confirm.

First, we’ll check to see if any of the VIBs are installed on this host. With NSX 6.3.x, we expect to see two VIBs listed – esx-vsip and esx-vxlan.
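As a rough sketch of that check, the Python below scans text captured from `esxcli software vib list` for the two VIB names mentioned above. The helper name `missing_vibs` and the sample text are made up for illustration; only the VIB names come from this scenario, and the code assumes the VIB name is the first column on each line.

```python
# Hypothetical helper: report which expected NSX VIBs are absent from
# the text output of `esxcli software vib list`.
EXPECTED_VIBS = {"esx-vsip", "esx-vxlan"}


def missing_vibs(vib_list_output, expected=EXPECTED_VIBS):
    """Return the expected VIB names not found in the first column."""
    installed = set()
    for line in vib_list_output.splitlines():
        fields = line.split()
        if fields:
            # Assumption: the VIB name is the first whitespace-separated field.
            installed.add(fields[0])
    return set(expected) - installed


# Made-up sample text, not captured from a real host. The version is
# deliberately left as a placeholder.
SAMPLE_OUTPUT = """\
Name       Version  Vendor  Acceptance Level  Install Date
---------  -------  ------  ----------------  ------------
esx-vsip   x.y.z    VMware  VMwareCertified   2017-01-01
"""

print(sorted(missing_vibs(SAMPLE_OUTPUT)))
```

An empty result would mean both VIBs are listed. Anything else names the VIBs that appear to be missing on the host.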


NSX Troubleshooting Scenario 2

I got some overwhelmingly positive feedback after posting the first troubleshooting scenario and solution recently. Thanks to everyone who reached out to me via Twitter with feedback and suggestions! Please keep those suggestions and comments coming.

Today, I’m going to post a similar but briefer scenario. This is something that we see regularly in GSS – issues surrounding host preparation!

NSX Troubleshooting Scenario 2

Let’s begin with the usual vague customer problem description:

“We took a host out of the compute-a cluster to do some hardware maintenance. Now it’s been added back and when VMs move to this host, they have no connectivity! We’re using NSX 6.3.2”

This is a fictional scenario of course, but let’s assume that we’ve started taking a look at the environment and collecting some additional data.

As the customer mentioned, they are running NSX 6.3.2 and have a cluster called compute-a:

[Image: tshoot2a-1]

The host that was taken out of the cluster for maintenance was esx-a1.lab.local. Similar to the previous scenario, the L3 design is pretty much the same:

[Image: tshoot-1]

The web-a1 VM was migrated to each host, and the customer has confirmed that whenever it goes to host esx-a1, it can’t ping anything. The vMotion operation always completes successfully.

From a console of the web-a1 VM, the following destinations were tested:

  1. Default gateway (DLR at 172.17.1.1)
  2. DNS Server (172.16.10.10)
  3. Internet Location (8.8.8.8)
  4. Upstream router (172.17.0.10)
  5. VM in the same subnet/VXLAN and in compute-a cluster (172.17.1.12)
  6. VM in the same subnet/VXLAN and in compute-b cluster (172.17.1.35)

None of the above worked.
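If you want to repeat this kind of test each time the VM lands on a different host, a small script keeps the destination list consistent. This is only a sketch: the function names are hypothetical, the addresses are the ones listed above, and the default probe assumes a guest whose `ping` accepts `-c 1` (on Windows the equivalent flag is `-n 1`).

```python
import subprocess

# Destinations tested from the web-a1 console in this scenario.
DESTINATIONS = [
    ("Default gateway (DLR)", "172.17.1.1"),
    ("DNS server", "172.16.10.10"),
    ("Internet location", "8.8.8.8"),
    ("Upstream router", "172.17.0.10"),
    ("VM in same VXLAN, compute-a", "172.17.1.12"),
    ("VM in same VXLAN, compute-b", "172.17.1.35"),
]


def ping_once(address, timeout=3):
    """Send a single ICMP echo request; True if ping exits with status 0."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", address],  # assumption: Unix-style ping flags
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def run_tests(destinations, probe=ping_once):
    """Map each destination label to the outcome of probing its address."""
    return {label: probe(address) for label, address in destinations}
```

Calling `run_tests(DESTINATIONS)` from the guest returns one True/False per destination, which makes it easy to compare results between esx-a1 and the other hosts. The `probe` argument is there so a different check can be swapped in without changing the list.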

As soon as the VM is migrated back to esx-a2.lab.local or esx-a3.lab.local, it can communicate once again.

The web-a1 virtual machine is currently in VXLAN 5001 called the ‘Blue Network’:

[Image: tshoot2a-5]

Taking a look in the NSX vSphere Client UI, we can see that the NSX manager and controllers appear to be in good shape:

[Image: tshoot2a-2]

The Host Preparation page shows that host esx-a1 is not prepared for some reason!

[Image: tshoot2a-3]

There is also a VXLAN error that reads:

“VTEP has not been created successfully on the Host.”

Oddly though, if we look at the vmkernel adapter view of the vSphere Web Client, we can see a VXLAN VTEP that exists!

[Image: tshoot2a-4]

So is this host prepared or not? Either the UI is wrong or there is a problem. After asking the customer some more questions, we’ve been able to determine the steps that were taken with esx-a1 to get to this state. Unfortunately, the customer we’re talking to is not the same person who made the changes. That individual is on vacation now and can’t be reached – typical.

  1. Host esx-a1 was put into maintenance mode and evacuated of all VMs.
  2. Host esx-a1 was removed from the cluster.
  3. The host was then powered off.
  4. It took several weeks to get the replacement memory for the host, but eventually it was installed.
  5. Host esx-a1 was powered back on and looked good from a hardware/vSphere perspective.
  6. Host esx-a1 was added back to the compute-a cluster. No errors were reported when this was done.
  7. The host was taken out of maintenance mode.
  8. After a day or two, DRS migrated some VMs to esx-a1 and that was when we noticed applications becoming inaccessible.
  9. To work around the issue, we migrated VMs that were on esx-a1 to other hosts in the cluster and then put DRS into manual mode to prevent anything else from moving to esx-a1.

The customer believes there may have been other things done during troubleshooting but is unsure.

What’s Next?

In a day or two I’ll post the solution and troubleshooting steps necessary to find the underlying cause of this problem. Have a look through the information provided above and let me know what you would check or what you think the problem may be! I want to hear your suggestions!

EDIT 12/12/2017: The solution to troubleshooting scenario two is now live!

Not only do we want to figure out how things got into this state in the first place, but also how to fix this problem and get things back into a good state.

What other information would you need to see? What tests would you run? What do you know is NOT the problem based on the information and observations here?

Please feel free to leave a comment below or via Twitter (@vswitchzero).

NSX Troubleshooting Scenario 1 – Solution

Welcome to the second half of ‘NSX Troubleshooting Scenario 1’. For detail on the problem and some initial scoping, please see the first part of the scenario that I posted a few days ago. In this half, I’ll walk through some of the troubleshooting I did to find the underlying cause of this problem as well as the solution.

Where to Start?

The scoping done in the previous post gives us a lot of useful information, but it’s not always clear where to start. In my experience, it’s helpful to make educated ‘assertions’ based on what I think the issue is – or more often what I think the issue is not.

I’ll begin by translating the scoping observations into statements:

  • It’s clear that basic L2/L3 connectivity is working to some degree. This isn’t a guarantee that there aren’t other problems, but it looks okay at a glance.
  • We know that win-b1 and web-a1 are both on the same VXLAN logical switch. We also know they are in the same subnet, so that eliminates a lot of the routing as a potential problem. The DLR and ESGs should not really be in the picture here at all.
  • The DFW is enabled, but looks to be configured with the default ‘allow’ rules only. It’s unlikely that this is a DFW problem, but we may need to prove this because the symptoms seem to be specific to HTTP.
  • We also know that VMs in the compute-b cluster are having the same types of symptoms accessing internet-based web sites. We know that the infrastructure needed to get to the internet – ESGs, physical routers, etc. – is all accessed via the compute-a cluster.
  • It was also mentioned by the customer that the compute-b cluster was newly added. This may seem like an insignificant detail, but really increases the likelihood of a configuration or preparation problem.

Based on the testing done so far, the issue appears to be impacting a TCP service – port 80 HTTP. ICMP doesn’t seem impacted. We don’t know if other protocols are seeing similar issues.

Before we start health checking various NSX components, let’s do a bit more scoping to see if we can’t narrow this problem down even further. Right off the bat, the two questions I want answered are:

  1. Are we really talking to the device we expect from an L2 perspective?
  2. Is the problem really limited to the HTTP protocol?
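For the second question, one quick way to scope is to try TCP connections to a handful of ports on the same destination. The sketch below uses Python's standard `socket` module; the function names are hypothetical, and which ports are worth testing depends on what the destination VM actually listens on.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_ports(host, ports, timeout=2.0):
    """Map each port to whether a TCP connection could be opened."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

Something like `probe_ports("172.17.1.12", [22, 80, 443])` (address and ports chosen purely as an example) gives a quick reachability picture per port. Keep in mind that a completed handshake only shows the port is reachable, not that a full request and response will succeed.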


NSX Troubleshooting Scenario 1

Welcome to the first of what I hope to be many NSX troubleshooting posts. As someone who has been working in back-line support for many years, troubleshooting is really the bread and butter of what I do every day. Solving problems in vSphere can be challenging enough, but NSX adds another thick layer of complexity to wrap your head around.

I find that there is a lot of NSX documentation out there but most of it is on how to configure NSX and how it works – not a whole lot on troubleshooting. What I hope to do in these posts is spark some conversation and share some of the common issues I run across from day to day. Each scenario will hopefully be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there. I hope to leave a gap of a few days between the problem and solution posts to give people some time to comment, ask questions and provide their thoughts on what the problem could be!

NSX Troubleshooting Scenario 1

As always, let’s start with a somewhat vague customer problem description:

“Help! I’ve deployed a new cluster (compute-b) and for some reason I can’t access internal web sites on the compute-a cluster or at any other internet site.”

Of course, this is really only a small description of what the customer believes the problem to be. One of the key tasks for anyone working in support is to scope the problem and put together an accurate problem statement. But before we begin, let’s have a look at the customer’s environment to better understand how the new compute-b cluster fits into the grand scheme of things.
