An in-depth look at the NSX DFW’s IP discovery methods including Tools and ARP/DHCP snooping.
One of the best features of the DFW is the flexibility it provides in using objects in rules instead of IP addresses or groups of IP addresses. For example, for a source/destination you could use a VM in the inventory, a cluster or a security group containing all sorts of dynamic criteria. Underneath all of this, however, NSX needs to be able to inspect segment and packet headers to enforce the rules. These headers are only going to contain identifying information like IP addresses and TCP ports so it must keep track of which object is associated with which IP address or addresses. And because of the ‘distributed’ nature of the DFW, each of these translations must ultimately reach the ESXi hosts for enforcement.
There are three ways in which NSX can associate IPs with VMs – VMware Tools reporting, ARP snooping and DHCP snooping. The latter two are disabled by default.
In recent builds of NSX, you can see the detection types enabled in the host preparation section. As can be seen above, DHCP and ARP snooping are disabled by default leaving only VMware Tools address reporting.
VMware Tools Reporting
As you have probably noticed, VMs with VMware Tools installed conveniently report their configured IP addresses in the vSphere Client.
Virtual machine linux-a2 is reporting 172.16.15.10 as well as an IPv6 address on the summary tab in the vSphere Client. This information comes from VMware Tools and will be recorded in the NSX Manager database. Whenever we use a rule that references the VM linux-a2, NSX will look up this IP address for rule enforcement. These rules could contain a parent object, like the cluster compute-a, or a security group, a logical switch – anything that linux-a2 belongs to.
A useful tool for troubleshooting DFW publication failures.
If you’ve ever been on a support call for DFW publication or rule troubleshooting, you may have heard reference to a ‘firewall generation number’ at one time or another. Whenever a change is made to the firewall rules, the NSX management plane (NSX Manager) will push these changes to all ESXi hosts, where the rules will be enforced. Because of the distributed nature of this firewalling system, it’s very important that all ESXi hosts have the latest version of the ruleset.
The NSX UI does a good job of reporting on host publication failures, but its not always clear exactly what version of the rules a problematic host is enforcing.
This is where firewall generation numbers can come in handy. The ‘generation number’ represents the point in time a publish operation occurs. Although it may look like a seemingly random thirteen-digit number, it’s actually a Unix epoch timestamp (in milliseconds) that can be converted to an actual date/time. For example, an epoch timestamp of 1548677100000 equates to Monday, January 28th, 2019 at 12:05:00 UTC. There are several online tools available to help you convert these values, including this one.
Let’s have a look at the current generation number reported on a pair of ESXi hosts. One host, esx-a2 has been reporting publication failures.
To determine the generation number, you could in theory take the last reported publication date from the UI and convert it into a Unix epoch number. In my experience, there isn’t enough accuracy and you may not get an exact match. The better way to do it is to look for a “Sending rules to Cluster” log messages in the NSX manager vsm.log file. This can be done via SSH session, or more easily using a filter in vRealize Log Insight.
[root@nsxmanager /home/secureall/secureall/logs]# cat vsm.log |grep "Sending rules to Cluster"
2018-11-29 01:47:55.317 GMT+00:00 INFO TaskFrameworkExecutor-9 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: null Object Generation Number 1543456074899.
2018-11-29 01:47:57.422 GMT+00:00 INFO TaskFrameworkExecutor-16 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: 1543337228980 Object Generation Number 1543456074899.
Welcome to the twelfth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.
Please see the first half for more detail on the problem symptoms and some scoping.
As you’ll recall in the first half, our fictional customer was getting some unexpected behavior from a couple of firewall rules. Despite the rules being properly constructed, one VM called linux-a3 continued to be accessible via SSH.
We confirmed that the IP addresses for the machines in the security group where translated correctly by NSX and that the ruleset didn’t appear to be the problem. Let’s recap what we know:
VM linux-a2 seems to be working correctly and SSH traffic is blocked.
VM linux-a3 doesn’t seem to respect rule 1007 for some reason and remains accessible via SSH from everywhere.
Host esx-a3 where linux-a3 resides doesn’t appear to log any activity for rule 1007 or 1008 even though those rules are configured to log.
The two VMs are on different ESXi hosts (esx-a1 and esx-a3).
VMs linux-a2 and linux-a3 are in different dvPortgroups.
Given these statements, there are several things I’d want to check:
How can the two VMs have proper IP connectivity in VXLAN and VLAN porgroups as observed?
Is the DFW working at all on host esx-a3?
Did the last rule publication make it to host esx-a3 and does it match what we see in the UI?
Is the DFW (slot-2) dvfilter applied to linux-a3 correctly?