An in-depth look at the NSX DFW’s IP discovery methods, including VMware Tools reporting and ARP/DHCP snooping.
One of the best features of the DFW is the flexibility it provides in using objects in rules instead of IP addresses or groups of IP addresses. For example, for a source/destination you could use a VM in the inventory, a cluster or a security group containing all sorts of dynamic criteria. Underneath all of this, however, NSX needs to be able to inspect segment and packet headers to enforce the rules. These headers are only going to contain identifying information like IP addresses and TCP ports so it must keep track of which object is associated with which IP address or addresses. And because of the ‘distributed’ nature of the DFW, each of these translations must ultimately reach the ESXi hosts for enforcement.
There are three ways in which NSX can associate IPs with VMs – VMware Tools reporting, ARP snooping and DHCP snooping. The latter two are disabled by default.
In recent builds of NSX, you can see the detection types enabled in the host preparation section. As can be seen above, DHCP and ARP snooping are disabled by default leaving only VMware Tools address reporting.
VMware Tools Reporting
As you have probably noticed, VMs with VMware Tools installed conveniently report their configured IP addresses in the vSphere Client.
Virtual machine linux-a2 is reporting 172.16.15.10 as well as an IPv6 address on the summary tab in the vSphere Client. This information comes from VMware Tools and will be recorded in the NSX Manager database. Whenever we use a rule that references the VM linux-a2, NSX will look up this IP address for rule enforcement. These rules could contain a parent object, like the cluster compute-a, or a security group, a logical switch – anything that linux-a2 belongs to.
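To make the object-to-IP translation concrete, here is a minimal sketch in Python. It is not how NSX Manager stores this data internally. The dictionaries and the `resolve` helper are hypothetical and use only the names from the example above (linux-a2, compute-a and 172.16.15.10).

```python
# Hypothetical inventory, for illustration only.
# IP addresses each VM has reported (the IPv6 address is left out because
# its value is not given in the text).
vm_addresses = {
    "linux-a2": {"172.16.15.10"},
}

# Parent objects (cluster, security group, logical switch...) and their members.
memberships = {
    "compute-a": {"linux-a2"},
}


def resolve(obj):
    """Return the set of IP addresses a rule object translates to."""
    if obj in vm_addresses:
        return set(vm_addresses[obj])
    addresses = set()
    for member in memberships.get(obj, ()):
        addresses |= resolve(member)
    return addresses


print(resolve("linux-a2"))   # the VM itself
print(resolve("compute-a"))  # a parent object containing the VM
```

Whichever object a rule references, the end result is a set of addresses, and that set is what has to reach the ESXi hosts for enforcement.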
A useful tool for troubleshooting DFW publication failures.
If you’ve ever been on a support call for DFW publication or rule troubleshooting, you may have heard reference to a ‘firewall generation number’ at one time or another. Whenever a change is made to the firewall rules, the NSX management plane (NSX Manager) will push these changes to all ESXi hosts, where the rules will be enforced. Because of the distributed nature of this firewalling system, it’s very important that all ESXi hosts have the latest version of the ruleset.
The NSX UI does a good job of reporting on host publication failures, but it’s not always clear exactly what version of the rules a problematic host is enforcing.
This is where firewall generation numbers can come in handy. The ‘generation number’ represents the point in time a publish operation occurs. Although it may look like a seemingly random thirteen-digit number, it’s actually a Unix epoch timestamp (in milliseconds) that can be converted to an actual date/time. For example, an epoch timestamp of 1548677100000 equates to Monday, January 28th, 2019 at 12:05:00 UTC. There are several online tools available to help you convert these values, including this one.
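If you would rather not rely on an online converter, the same conversion takes a couple of lines of Python. This is just a sketch using the standard library; the function name `generation_to_utc` is my own.

```python
from datetime import datetime, timezone


def generation_to_utc(generation_number):
    """Convert a generation number (Unix epoch in milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(generation_number / 1000, tz=timezone.utc)


stamp = generation_to_utc(1548677100000)
print(stamp.strftime("%A, %B %d, %Y at %H:%M:%S UTC"))
# Monday, January 28, 2019 at 12:05:00 UTC
```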
Let’s have a look at the current generation number reported on a pair of ESXi hosts. One host, esx-a2 has been reporting publication failures.
To determine the generation number, you could in theory take the last reported publication date from the UI and convert it into a Unix epoch number. In my experience, the UI timestamp isn’t precise enough and you may not get an exact match. The better way to do it is to look for the “Sending rules to Cluster” log messages in the NSX Manager vsm.log file. This can be done via an SSH session, or more easily using a filter in vRealize Log Insight.
[root@nsxmanager /home/secureall/secureall/logs]# cat vsm.log |grep "Sending rules to Cluster"
2018-11-29 01:47:55.317 GMT+00:00 INFO TaskFrameworkExecutor-9 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: null Object Generation Number 1543456074899.
2018-11-29 01:47:57.422 GMT+00:00 INFO TaskFrameworkExecutor-16 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: 1543337228980 Object Generation Number 1543456074899.
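When there are a lot of these messages, pulling the numbers out with a script can be easier than reading them by eye. Below is a rough sketch that parses the two log lines above with a regular expression and converts the numbers to UTC timestamps. The pattern is based only on the message format shown here, so treat it as an assumption rather than a guaranteed log format. The helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Pattern derived from the sample messages above (an assumption, not a spec).
PUBLISH_RE = re.compile(
    r"Sending rules to Cluster (?P<cluster>[^,\s]+), "
    r"Generation Number: (?P<generation>\d+|null) "
    r"Object Generation Number (?P<object_generation>\d+)"
)


def parse_publish_line(line):
    """Extract the cluster and both generation numbers from a vsm.log line."""
    match = PUBLISH_RE.search(line)
    if match is None:
        return None
    generation = match.group("generation")
    return {
        "cluster": match.group("cluster"),
        "generation": None if generation == "null" else int(generation),
        "object_generation": int(match.group("object_generation")),
    }


def to_utc(epoch_ms):
    """Convert epoch milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


sample = (
    'Sending rules to Cluster domain-c41, Generation Number: 1543337228980 '
    'Object Generation Number 1543456074899.'
)
entry = parse_publish_line(sample)
print(entry["cluster"], to_utc(entry["object_generation"]))
```

For the sample line, the object generation number converts to 2018-11-29 at 01:47:54 UTC, which lines up with the timestamp on the log messages themselves.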
Welcome to the twelfth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.
Please see the first half for more detail on the problem symptoms and some scoping.
As you’ll recall in the first half, our fictional customer was getting some unexpected behavior from a couple of firewall rules. Despite the rules being properly constructed, one VM called linux-a3 continued to be accessible via SSH.
We confirmed that the IP addresses for the machines in the security group were translated correctly by NSX and that the ruleset didn’t appear to be the problem. Let’s recap what we know:
VM linux-a2 seems to be working correctly and SSH traffic is blocked.
VM linux-a3 doesn’t seem to respect rule 1007 for some reason and remains accessible via SSH from everywhere.
Host esx-a3 where linux-a3 resides doesn’t appear to log any activity for rule 1007 or 1008 even though those rules are configured to log.
The two VMs are on different ESXi hosts (esx-a1 and esx-a3).
VMs linux-a2 and linux-a3 are in different dvPortgroups.
Given these statements, there are several things I’d want to check:
How can the two VMs have proper IP connectivity in VXLAN and VLAN portgroups as observed?
Is the DFW working at all on host esx-a3?
Did the last rule publication make it to host esx-a3 and does it match what we see in the UI?
Is the DFW (slot-2) dvfilter applied to linux-a3 correctly?
Welcome to the twelfth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
For this scenario today, I’ve created some supplementary video content to go along with this post:
As always, we’ll start with a brief problem statement:
“I am just getting started with the NSX distributed firewall and see that the rules are not behaving as they should be. I have two VMs, linux-a2 and linux-a3 that should allow SSH from only one specific jump box. The linux-a3 VM can be accessed via SSH from anywhere! Why is this happening?”
To get started with this scenario, we’ll most certainly need to look at how the DFW rules are constructed to get the desired behavior. The immense flexibility of the distributed firewall allows for dozens of different ways to achieve what is described.
Here are the two VMs in question:
There are a couple of interesting observations above. The first is that both VMs have a security tag applied called ‘Linux-A VMs’. The other is a bit more of an oddity – one VM is in a distributed switch VLAN backed portgroup called dvpg-a-vlan15, and the other is in a VXLAN backed logical switch. Despite this, both VMs are in the same 172.16.15.0/24 subnet.
As we saw in the first half of scenario 6, a fictional administrator enabled the DFW in their management cluster, which caused some unexpected filtering to occur. Their vCenter Server was no longer allowed the HTTPS port 443 traffic needed for the vSphere Web Client to work.
Since we can no longer manage the environment or the DFW using the UI, we’ll need to revert this change using some other method.
As mentioned previously, we are fortunate in that NSX Manager is always excluded from DFW filtering by default. This is done to protect against this very type of situation. Because the NSX management plane is still fully functional, we should – in theory – still be able to relay API based DFW calls to NSX Manager. NSX Manager will in turn be able to publish these changes to the necessary ESXi hosts.
There are two relatively easy ways to fix this that come to mind:
Use the approach outlined in KB 2079620. This is the equivalent of doing a factory reset of the DFW ruleset via API. This will wipe out all rules and they’ll need to be recovered or recreated.
Use an API call to disable the DFW in the management cluster. This will essentially revert the exact change the user made in the UI that started this problem.
There are other options, but the two above will work to restore HTTP/HTTPS connectivity to vCenter. Once that is done, some remediation will be necessary to ensure this doesn’t happen again. Rather than picking a specific solution, I’ll go through both of them.
Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.
Please see the first half for more detail on the problem symptoms and some scoping.
There were a few good suggestions from readers. Here are a couple from Twitter:
Can you run the below 2 commands to see if the firewall rules are deployed to the host?
show dfw host host-id summarize-dvfilter
show dfw host hostID filter filterID rules
If you migrate the Ubuntu VM to esx-a1 do the rules apply properly on the VM?
Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.
Hi mike ! It is possible to get the status of VMware tools on Ubuntu VM ? 😉
The suggestion about VMware tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.
In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.
Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:
VM win-a1 and lubuntu-2 are working fine. I.e. they can’t browse the web.
VM lubuntu-1 is the only one not working. I.e. it can still browse the web.
The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
All three VMs have the correct security tag applied.
All three VMs are indeed showing up correctly in the security group due to the tag.
The two working VMs are on host esx-a1 and the broken VM is on host esx-a2.
To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.
Welcome to the fifth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 5
As always, we’ll start with a brief customer problem statement:
“We’ve just deployed NSX and are doing some testing with the distributed firewall. We created a security tag that we can apply to VMs to prevent them from browsing the web. We applied this tag on three virtual machines. It seems to work on two of them, but the third can always browse the web! Something is not working here”
After speaking to the customer, we were able to collect a bit more information about the VMs and traffic flows in question. Below are the VMs that should not be able to browse:
win-a1.lab.local – 172.17.1.30 (static)
lubuntu-1.lab.local – 172.17.1.101 (DHCP)
lubuntu-2.lab.local – 172.17.1.104 (DHCP)
Only the VM called lubuntu-1 is still able to browse. The others are fine. The customer has been using an internal web server called web-a1.lab.local for testing. That machine is in the same cluster and has an IP address of 172.17.1.11. It serves up a web page on port 80. All of the VMs in question are sitting in the same logical switch and the customer reports that all east-west and north-south routing is functioning normally.
To begin, let’s have a look at the DFW rules defined.
As you can see, they really did just start testing as there is only one new section and a single non-default rule. The rule is quite simple. Any HTTP/HTTPS traffic coming from VMs in the ‘No Browser’ security group should be blocked. We can also see that both this rule and the default were set to log as part of the troubleshooting activities.
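As a rough mental model of how a ruleset like this is evaluated, here is a small Python sketch of top-down, first-match rule processing. It is a simplification under stated assumptions, not the DFW's actual implementation. The rule IDs (1005 for the block rule, 1001 for the default allow) are assumed for illustration, the `Rule` class and `evaluate` function are hypothetical, and the security group is represented by the three VM addresses listed above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: int
    sources: Optional[FrozenSet[str]]  # None means "any source"
    ports: Optional[FrozenSet[int]]    # None means "any port"
    action: str


# Addresses the 'No Browser' security group is assumed to translate to.
no_browser = frozenset({"172.17.1.30", "172.17.1.101", "172.17.1.104"})

ruleset = [
    Rule(1005, no_browser, frozenset({80, 443}), "block"),  # HTTP/HTTPS from the group
    Rule(1001, None, None, "allow"),                        # default rule at the bottom
]


def evaluate(src_ip, dst_port):
    """Return the first rule, top to bottom, that matches the source and port."""
    for rule in ruleset:
        source_ok = rule.sources is None or src_ip in rule.sources
        port_ok = rule.ports is None or dst_port in rule.ports
        if source_ok and port_ok:
            return rule
    return None


print(evaluate("172.17.1.30", 80).rule_id)  # address in the group, HTTP
print(evaluate("172.17.1.11", 80).rule_id)  # address outside the group, HTTP
```

The point of the sketch is that matching happens on addresses, not on VM names. Any source address that is not in the translated set simply falls through to the default rule.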