Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.
Please see the first half for more detail on the problem symptoms and some scoping.
There were a few good suggestions from readers. Here are a couple from Twitter:
Can you run the below 2 commands to see if the firewall rules are deployed to the host?
show dfw host host-id summarize-dvfilter
show dfw host hostID filter filterID rules
If you migrate the Ubuntu VM to esx-a1 do the rules apply properly on the VM?
Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.
Hi mike ! It is possible to get the status of VMware tools on Ubuntu VM ? 😉
The suggestion about VMware tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.
In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.
Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:
VM win-a1 and lubuntu-2 are working fine. I.e. they can’t browse the web.
VM lubuntu-1 is the only one not working. I.e. it can still browse the web.
The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
All three VMs have the correct security tag applied.
All three VMs are indeed showing up correctly in the security group due to the tag.
The two working VMs are on host esx-a1 and the broken VM is on host esx-a2.
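To make the difference between hitting rule 1005 and hitting rule 1001 a bit more concrete, here is a minimal Python sketch of first-match rule evaluation. It is only a model and not how the DFW is actually implemented: I’m assuming the security group resolves to a set of source IP addresses, and that rule 1005 matches TCP ports 80 and 443. The `Rule` class, the `first_match` function and the 172.17.1.50 address are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    rule_id: int
    sources: Optional[frozenset]  # source IPs the rule matches; None means "any"
    ports: Optional[frozenset]    # destination TCP ports; None means "any"
    action: str                   # "block" or "allow"


def first_match(rules, src_ip, dst_port):
    """Return the first rule, evaluated top to bottom, that matches the flow."""
    for rule in rules:
        src_ok = rule.sources is None or src_ip in rule.sources
        port_ok = rule.ports is None or dst_port in rule.ports
        if src_ok and port_ok:
            return rule
    raise LookupError("no rule matched; the rule set should end with a default rule")


# Assumption: the 'No Browser' group resolves to the three VM addresses.
no_browser = frozenset({"172.17.1.30", "172.17.1.101", "172.17.1.104"})

rules = [
    Rule(1005, no_browser, frozenset({80, 443}), "block"),
    Rule(1001, None, None, "allow"),  # default rule at the bottom
]

# Intended behaviour: HTTP from a group member stops at rule 1005.
print(first_match(rules, "172.17.1.101", 80).rule_id)
# A source outside the group (hypothetical address) falls through to 1001.
print(first_match(rules, "172.17.1.50", 80).rule_id)
```

In this model, a flow only lands on rule 1001 if either the source or the service fails to match rule 1005. That is why the host’s view of the rules and of the VM’s dvFilter is worth checking.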
To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.
Welcome to the fifth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 5
As always, we’ll start with a brief customer problem statement:
“We’ve just deployed NSX and are doing some testing with the distributed firewall. We created a security tag that we can apply to VMs to prevent them from browsing the web. We applied this tag on three virtual machines. It seems to work on two of them, but the third can always browse the web! Something is not working here”
After speaking to the customer, we were able to collect a bit more information about the VMs and traffic flows in question. Below are the VMs that should not be able to browse:
win-a1.lab.local – 172.17.1.30 (static)
lubuntu-1.lab.local – 172.17.1.101 (DHCP)
lubuntu-2.lab.local – 172.17.1.104 (DHCP)
Only the VM called lubuntu-1 is still able to browse. The others are fine. The customer has been using an internal web server called web-a1.lab.local for testing. That machine is in the same cluster and has an IP address of 172.17.1.11. It serves up a web page on port 80. All of the VMs in question are sitting in the same logical switch and the customer reports that all east-west and north-south routing is functioning normally.
To begin, let’s have a look at the DFW rules defined.
As you can see, they really did just start testing as there is only one new section and a single non-default rule. The rule is quite simple. Any HTTP/HTTPS traffic coming from VMs in the ‘No Browser’ security group should be blocked. We can also see that both this rule and the default were set to log as part of the troubleshooting activities.
Welcome to the fourth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario four. Today I’ll be performing some troubleshooting and will resolve the issue.
Please see the first half for more detail on the problem symptoms and some scoping.
During the scoping in the first half of the scenario, we saw that the problem was squarely in the customer’s new secondary NSX deployment. Two test virtual machines – linux-r1 and linux-r2 – could not be added to any of the universal logical switches.
From the ‘Logical Switches’ view in the NSX Web Client UI, we could see that these universal logical switches were synchronized across both NSX Managers. These existed from the perspective of the Primary and Secondary manager views:
Perhaps the most telling observation, however, was the absence of distributed port groups associated with the universal logical switches on the dvs-rem switch:
As we can see above, the port groups do exist for logical switches in the VNI 900x range. These are non-universal, logical switches available to the secondary NSX deployment only.
In the host preparation section, we can see that dvs-rem is indeed the configured distributed switch for the compute-r cluster and that both hosts look good from a VTEP/VXLAN perspective:
So why are these port groups missing? Without them, VMs simply can’t be added to the associated logical switches.
Although you’ve probably noticed that I like to dig deep in some of these scenarios, this one is actually pretty straightforward. It’s a simple but all too common problem – the cluster has not been added to the universal transport zone.
You’d be surprised how often I see this, but to be fair, it’s very easily overlooked. I sometimes need to remind myself to check all the basics first, especially when dealing with new deployments. The key symptom that raised red flags for me was the lack of auto-generated port groups on the distributed switch. The addition of the cluster to the transport zone will trigger the creation of these port groups. If they don’t exist, this should be the first thing that is checked.
As soon as I added the compute-r cluster to the Universal TZ transport zone, we saw an immediate slew of port group creation tasks:
I’ve now essentially told NSX that I want all the logical switches in that transport zone to span to the compute-r cluster. In NSX-V, we can think of a transport zone as a boundary spanning one or more clusters. Only clusters in that transport zone will have the logical switches available to them for use.
The concept of a ‘Universal Transport Zone’ just takes this a step further and allows clusters in different vCenter instances to connect to the same universal logical switches. The fact that we saw port groups for the 900x range of VNIs tells us that the compute-r cluster existed in the non-universal transport zone called ‘Remote TZ’, but was missing from ‘Universal TZ’.
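The relationship between transport zones, clusters and port groups can be sketched as a small Python model. This is just an illustration of the idea above, not anything NSX exposes: the dictionary layout and the `port_group_vnis` and `add_cluster` functions are hypothetical, and the VNI numbers are made up (the post only tells us the non-universal switches are in the 900x range).

```python
# Simplified view from the secondary site. Zone and cluster names follow the
# post; the VNI values are invented for illustration.
transport_zones = {
    "Remote TZ": {"clusters": {"compute-r"}, "vnis": {9000, 9001}},
    "Universal TZ": {"clusters": set(), "vnis": {10000, 10001}},
}


def port_group_vnis(zones, cluster):
    """VNIs whose port groups should exist on the cluster's distributed switch."""
    vnis = set()
    for zone in zones.values():
        if cluster in zone["clusters"]:
            vnis |= zone["vnis"]
    return vnis


def add_cluster(zones, zone_name, cluster):
    """Model adding a cluster to a transport zone."""
    zones[zone_name]["clusters"].add(cluster)


before = port_group_vnis(transport_zones, "compute-r")
add_cluster(transport_zones, "Universal TZ", "compute-r")
after = port_group_vnis(transport_zones, "compute-r")

print("before:", sorted(before))
print("after: ", sorted(after))
```

Before the cluster is added, only the ‘Remote TZ’ VNIs are returned. That matches the symptom we saw: port groups for the local logical switches, but none for the universal ones.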
Thanks again to everyone for posting their testing suggestions and theories! I hope you enjoyed this scenario. If you have other suggestions for troubleshooting scenarios you’d like to see, please leave a comment, or reach out to me on Twitter (@vswitchzero).
Time for another NSX troubleshooting scenario! Welcome to the fourth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 4
As always, we’ll start with a customer problem statement:
“We recently deployed Cross-vCenter NSX for a remote datacenter location. When we try to add VMs to the universal logical switches, there are no VM vNICs in the list to add. This works fine at the primary datacenter.”
This customer’s environment will be the same as what we outlined in scenario 3. Keep in mind that this should be treated separately. Forget everything from the previous scenario.
The main location is depicted on the left. A three-host cluster called compute-a exists there. All of the VLAN-backed networks route through a router called vyos. The Universal Control Cluster exists at this location, as does the primary NSX Manager.
The ‘remote datacenter’ is to the right of the dashed line. The single ‘compute-r’ cluster there is associated with the secondary NSX manager at that location. According to the customer, this was only recently added.
Thanks to everyone who took the time to comment on the first half of scenario 3, both here and on Twitter. There were many great suggestions, and some were spot-on!
For more detail on the problem, some diagrams and other scoping information, be sure to check out the first half of scenario 3.
During the initial scoping in the first half, we didn’t really see too much out of the ordinary in the UI aside from some odd ‘red alarm’ exclamation marks on the compute-r hosts in the Host Preparation section.
More than one commenter pointed out that this needs to be investigated. I wholeheartedly agree. Despite seeing a green status for host VIB installation, firewall status and VXLAN, there is clearly still a problem. That problem is related to ‘Communication Channel Health’.
The communication channel health check was a new feature added in NSX 6.2 and makes it easy to see which hosts are having problems communicating with both NSX Manager and the Control Cluster. In our case, both esx-r1 and esx-r2 are reporting problems with their control plane agent (netcpa) to all three controllers.
Welcome to the third installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 3
I’ll start off again with a brief customer problem description:
“We’ve recently deployed Cross-vCenter NSX for a remote datacenter location. All of the VMs at that location don’t have connectivity. They can’t ping their gateway, nor can they ping VMs on other hosts. Only VMs on the same host can ping each other.”
This is a pretty vague description of the problem, so let’s have a closer look at this environment. To begin, let’s look at the high-level physical interconnection between datacenters in the following diagram:
There isn’t a lot of detail above, but it helps to give us some talking points. The main location is depicted on the left. A three-host cluster called compute-a exists there. All of the VLAN-backed networks route through a router called vyos. The Universal Control Cluster exists at this location, as does the primary NSX Manager.
If you are new to NSX or looking to evaluate it in the lab, there is one very common issue that you may run into. After going through the initial steps of deploying and registering NSX Manager with vCenter, you may be surprised to find that there are no manageable NSX managers listed under ‘Networking and Security’ in the Web Client. Although the registration and Web Client plugin installation appears successful, there is often an extra step needed before you can manage things.
One of the first tasks involved in deploying NSX is to register NSX Manager with a vCenter Server. This is done for inventory management and synchronization purposes. The NSX Manager can be optionally registered with SSO as well.
The vCenter user that is used for registration needs to have the highest level of privileges for NSX to work correctly. The NSX install guide clearly states that this must be the vCenter ‘Administrator’ role.
From the NSX Install Guide:
“You must have a vCenter Server user account with the Administrator role to synchronize NSX Manager with the vCenter Server. If your vCenter password has non-ASCII characters, you must change it before synchronizing the NSX Manager with the vCenter Server.”
Because of these requirements, it’s quite common to use the SSO administrator account – usually firstname.lastname@example.org. A service account is also often created for this purpose to more easily identify and distinguish NSX tasks. Either way, these are not normally accounts that you’d use for day-to-day administration in vSphere.
By default, NSX will only assign its ‘Enterprise Administrator’ role to the user account that was used to register it with vCenter Server. This means that by default, only that specific vCenter user will have access to the NSX manager from within the Web Client.
That said, if you are experiencing this problem, you are probably not logged in with the vCenter user that was used for registration purposes. To grant access to other users, you’ll need to log into the vSphere Web Client using the registration user account, and then add additional users and groups.
In my lab, I’ve just logged in with an Active Directory user called ‘email@example.com’. This user has full administrator privileges in vCenter, but has no access to any NSX Managers:
If I log out, and log back in with the firstname.lastname@example.org account that was used for vCenter registration, I can see the NSX managers that were registered.
In my lab, I’ve got a secondary deployed as well, but we’ll focus only on 172.16.10.40. If I click on that manager in the list, I’m able to go to the ‘Users’ tab to see what the default permissions look like:
As you can see, only one user – the SSO administrator account used for registration – has the requisite role for administration via the Web Client. In my lab, I want to provide full access to an AD group called ‘VMware Admins’ and an individual user called ‘Test’.
Both vCenter users and groups can be specified here. As long as vCenter can authenticate them – either via SSO, local authentication or even AD – they are fair game.
Another common mistake is selecting the NSX Administrator role rather than Enterprise Administrator. NSX Administrator sounds like the highest privilege level, but it’s actually Enterprise Administrator that gives you all the keys to the kingdom. You won’t be able to administer certain things – including user permissions – unless Enterprise Administrator is chosen.
Once this is done, you’ll see the users and groups listed and should now have the correct permissions to administer the NSX deployment!
Keep in mind that if you’ve got more than one NSX manager deployed, you’ll need to set this on each independently.
Have any questions or want more information? Please feel free to leave a comment below or reach out to me on Twitter (@vswitchzero)
Welcome to the second installment of a new series of NSX troubleshooting scenarios. This is the second half of scenario two, where I’ll perform some troubleshooting and resolve the problem.
Please see the first half for more detail on the problem symptoms and some scoping.
As mentioned in the first half, the problem is limited to a host called esx-a1. As soon as a guest moves to that host, it has no network connectivity. If we move a guest off of the host, its connectivity is restored.
We have one VM called win-a1 on host esx-a1 for testing purposes at the moment. As expected, the VM can’t be reached.
To begin, let’s have a look at the host from the CLI to figure out what’s going on. We know that the UI is reporting that it’s not prepared and that it doesn’t have any VTEPs created. In reality, we know a VTEP exists but let’s confirm.
First, we’ll check whether any of the NSX VIBs are installed on this host. With NSX 6.3.x, we expect to see two VIBs listed – esx-vsip and esx-vxlan.
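If you wanted to script this check, a rough Python sketch might look like the one below. It assumes you have already captured the text output of `esxcli software vib list` from the host, and it only looks at the first column (the VIB name). The `missing_vibs` function is hypothetical, and the sample listing is made up for illustration; the version, vendor and date values in it are not real.

```python
EXPECTED_VIBS = {"esx-vsip", "esx-vxlan"}  # expected with NSX 6.3.x, per the post


def missing_vibs(vib_list_output, expected=EXPECTED_VIBS):
    """Return the expected VIB names that do not appear in the listing."""
    installed = set()
    for line in vib_list_output.splitlines():
        fields = line.split()
        if fields:
            installed.add(fields[0])  # first column is the VIB name
    return set(expected) - installed


# Illustrative listing only; every value after the name column is invented.
sample = """\
Name       Version        Vendor  Acceptance Level  Install Date
---------  -------------  ------  ----------------  ------------
esx-vsip   0.0.0-0000000  VMware  VMwareCertified   2017-01-01
"""

print(sorted(missing_vibs(sample)))
```

An empty result would mean both VIBs are present. Anything else points to a host that isn’t fully prepared, whatever the UI claims.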
I got some overwhelmingly positive feedback after posting the first troubleshooting scenario and solution recently. Thanks to everyone who reached out to me via Twitter with feedback and suggestions! Please keep those suggestions and comments coming.
Today, I’m going to post a similar but more brief scenario. This is something that we see regularly in GSS – issues surrounding host preparation!
NSX Troubleshooting Scenario 2
Let’s begin with the usual vague customer problem description:
“We took a host out of the compute-a cluster to do some hardware maintenance. Now it’s been added back and when VMs move to this host, they have no connectivity! We’re using NSX 6.3.2”
This is a fictional scenario of course, but let’s assume that we’ve started taking a look at the environment and collecting some additional data.
As the customer mentioned, they are running NSX 6.3.2 and have a cluster called compute-a:
The host that was taken out of the cluster for maintenance was esx-a1.lab.local. Similar to the previous scenario, the L3 design is pretty much the same:
Welcome to the second half of ‘NSX Troubleshooting Scenario 1’. For detail on the problem and some initial scoping, please see the first part of the scenario that I posted a few days ago. In this half, I’ll walk through some of the troubleshooting I did to find the underlying cause of this problem as well as the solution.
Where to Start?
The scoping done in the previous post gives us a lot of useful information, but it’s not always clear where to start. In my experience, it’s helpful to make educated ‘assertions’ based on what I think the issue is – or more often what I think the issue is not.
I’ll begin by translating the scoping observations into statements:
It’s clear that basic L2/L3 connectivity is working to some degree. This isn’t a guarantee that there aren’t other problems, but it looks okay at a glance.
We know that win-b1 and web-a1 are both on the same VXLAN logical switch. We also know they are in the same subnet, so that eliminates a lot of the routing as a potential problem. The DLR and ESGs should not really be in the picture here at all.
The DFW is enabled, but looks to be configured with the default ‘allow’ rules only. It’s unlikely that this is a DFW problem, but we may need to prove this because the symptoms seem to be specific to HTTP.
We also know that VMs in the compute-b cluster are having the same types of symptoms accessing internet-based web sites. We know that the infrastructure needed to get to the internet – ESGs, physical routers, etc. – is all accessed via the compute-a cluster.
It was also mentioned by the customer that the compute-b cluster was newly added. This may seem like an insignificant detail, but really increases the likelihood of a configuration or preparation problem.
Based on the testing done so far, the issue appears to be impacting a TCP service – port 80 HTTP. ICMP doesn’t seem impacted. We don’t know if other protocols are seeing similar issues.
Before we start health checking various NSX components, let’s do a bit more scoping to see if we can’t narrow this problem down even further. Right off the bat, the two questions I want answered are:
Are we really talking to the device we expect from a L2 perspective?
Is the problem really limited to the HTTP protocol?
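For the second question, one quick way to get a first impression is to try a handful of TCP ports from an affected guest. Below is a minimal Python sketch using only the standard library. The `tcp_port_open` and `probe` helpers are hypothetical names, and which ports are actually worth probing depends on what the target VM is listening on.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP handshake to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def probe(host, ports, timeout=2.0):
    """Map each port to whether a TCP connection could be established."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

From win-b1, something like `probe("web-a1.lab.local", [22, 80, 443])` would give a rough picture. Keep in mind that a completed handshake only proves that a few small packets made it through in both directions. It doesn’t tell us whether a full HTTP response actually arrives, so this complements a real browser or file transfer test rather than replacing it.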