An in-depth look at the NSX DFW’s IP discovery methods including Tools and ARP/DHCP snooping.
One of the best features of the DFW is the flexibility it provides in using objects in rules instead of IP addresses or groups of IP addresses. For example, for a source/destination you could use a VM in the inventory, a cluster or a security group containing all sorts of dynamic criteria. Underneath all of this, however, NSX needs to be able to inspect segment and packet headers to enforce the rules. These headers are only going to contain identifying information like IP addresses and TCP ports so it must keep track of which object is associated with which IP address or addresses. And because of the ‘distributed’ nature of the DFW, each of these translations must ultimately reach the ESXi hosts for enforcement.
There are three ways in which NSX can associate IPs with VMs – VMware Tools reporting, ARP snooping and DHCP snooping. The latter two are disabled by default.
In recent builds of NSX, you can see the detection types enabled in the host preparation section. As can be seen above, DHCP and ARP snooping are disabled by default leaving only VMware Tools address reporting.
VMware Tools Reporting
As you have probably noticed, VMs with VMware Tools installed conveniently report their configured IP addresses in the vSphere Client.
Virtual machine linux-a2 is reporting 172.16.15.10 as well as an IPv6 address on the summary tab in the vSphere Client. This information comes from VMware Tools and will be recorded in the NSX Manager database. Whenever we use a rule that references the VM linux-a2, NSX will look up this IP address for rule enforcement. These rules could contain a parent object, like the cluster compute-a, or a security group, a logical switch – anything that linux-a2 belongs to.
Continue reading “Understanding NSX IP Discovery”
A useful tool for troubleshooting DFW publication failures.
If you’ve ever been on a support call for DFW publication or rule troubleshooting, you may have heard reference to a ‘firewall generation number’ at one time or another. Whenever a change is made to the firewall rules, the NSX management plane (NSX Manager) will push these changes to all ESXi hosts, where the rules will be enforced. Because of the distributed nature of this firewalling system, it’s very important that all ESXi hosts have the latest version of the ruleset.
The NSX UI does a good job of reporting on host publication failures, but its not always clear exactly what version of the rules a problematic host is enforcing.
This is where firewall generation numbers can come in handy. The ‘generation number’ represents the point in time a publish operation occurs. Although it may look like a seemingly random thirteen-digit number, it’s actually a Unix epoch timestamp (in milliseconds) that can be converted to an actual date/time. For example, an epoch timestamp of 1548677100000 equates to Monday, January 28th, 2019 at 12:05:00 UTC. There are several online tools available to help you convert these values, including this one.
Let’s have a look at the current generation number reported on a pair of ESXi hosts. One host, esx-a2 has been reporting publication failures.
To determine the generation number, you could in theory take the last reported publication date from the UI and convert it into a Unix epoch number. In my experience, there isn’t enough accuracy and you may not get an exact match. The better way to do it is to look for a “Sending rules to Cluster” log messages in the NSX manager vsm.log file. This can be done via SSH session, or more easily using a filter in vRealize Log Insight.
[root@nsxmanager /home/secureall/secureall/logs]# cat vsm.log |grep "Sending rules to Cluster"
2018-11-29 01:47:55.317 GMT+00:00 INFO TaskFrameworkExecutor-9 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: null Object Generation Number 1543456074899.
2018-11-29 01:47:57.422 GMT+00:00 INFO TaskFrameworkExecutor-16 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: 1543337228980 Object Generation Number 1543456074899.
Continue reading “Understanding NSX DFW Generation Numbers”
Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.
Please see the first half for more detail on the problem symptoms and some scoping.
There were a few good suggestions from readers. Here are a couple from Twitter:
Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.
The suggestion about VMware tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.
In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.
Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:
- VM win-a1 and lubuntu-2 are working fine. I.e. they can’t browse the web.
- VM lubuntu-1 is the only one not working. I.e. it can still browse the web.
- The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
- The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
- All three VMs have the correct security tag applied.
- All three VMs are indeed showing up correctly in the security group due to the tag.
- The two working VMs are on host esx-a1 and the broken VM is on host esx-a2
To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.
Continue reading “NSX Troubleshooting Scenario 5 – Solution”
Welcome to the fifth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 5
As always, we’ll start with a brief customer problem statement:
“We’ve just deployed NSX and are doing some testing with the distributed firewall. We created a security tag that we can apply to VMs to prevent them from browsing the web. We applied this tag on three virtual machines. It seems to work on two of them, but the third can always browse the web! Something is not working here”
After speaking to the customer, we were able to collect a bit more information about the VMs and traffic flows in question. Below are the VMs that should not be able to browse:
- win-a1.lab.local – 172.17.1.30 (static)
- lubuntu-1.lab.local – 172.17.1.101 (DHCP)
- lubuntu-2.lab.local – 172.17.1.104 (DHCP)
Only the VM called lubuntu-1 is still able to browse. The others are fine. The customer has been using an internal web server called web-a1.lab.local for testing. That machine is in the same cluster and has an IP address of 172.17.1.11. It serves up a web page on port 80. All of the VMs in question are sitting in the same logical switch and the customer reports that all east-west and north-south routing is functioning normally.
To begin, let’s have a look at the DFW rules defined.
As you can see, they really did just start testing as there is only one new section and a single non-default rule. The rule is quite simple. Any HTTP/HTTPS traffic coming from VMs in the ‘No Browser’ security group should be blocked. We can also see that both this rule and the default were set to log as part of the troubleshooting activities.
Continue reading “NSX Troubleshooting Scenario 5”