Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.
Please see the first half for more detail on the problem symptoms and some scoping.
Reader Suggestions
There were a few good suggestions from readers. Here are a couple from Twitter:
Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.
The suggestion about VMware Tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.
Getting Started
In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.
Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:
- VMs win-a1 and lubuntu-2 are working fine, i.e. they can’t browse the web.
- VM lubuntu-1 is the only one not working, i.e. it can still browse the web.
- The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
- The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
- All three VMs have the correct security tag applied.
- All three VMs are indeed showing up correctly in the security group due to the tag.
- The two working VMs are on host esx-a1 and the broken VM is on host esx-a2.
To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.
First, let’s ensure the vShield-Stateful-Firewall service is running on the host. This service doesn’t actually perform the filtering, but it is responsible for the RabbitMQ management plane communication with NSX Manager and for the synchronization of firewall rules.
[root@esx-a2:~] /etc/init.d/vShield-Stateful-Firewall status
vShield-Stateful-Firewall is running
[root@esx-a2:~] esxcli network ip connection list |grep 5671
tcp 0 0 172.16.10.22:53998 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
tcp 0 0 172.16.10.22:21529 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
tcp 0 0 172.16.10.22:39057 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
We can also see that there are established TCP 5671 sockets with NSX Manager. Based on this, we know there is a communication channel open to NSX Manager and that any firewall changes should get pushed down to this host.
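If you need to repeat this check on several hosts, the socket test can be scripted. Here is a minimal Python sketch, assuming the nine-column layout seen in the esxcli output from this host; the function name and the sample text are mine, not part of any NSX tooling.

```python
# Sample lines copied from the esxcli output on esx-a2.
SAMPLE = """\
tcp 0 0 172.16.10.22:53998 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
tcp 0 0 172.16.10.22:21529 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
tcp 0 0 172.16.10.22:39057 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
"""


def established_to_port(output, port=5671, world_name="vsfwd"):
    """Return (local, remote) pairs for ESTABLISHED sockets to the remote port."""
    matches = []
    for line in output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state,
        # world id, cc algo, world name. Anything else is skipped.
        if len(fields) != 9:
            continue
        local, remote, state, owner = fields[3], fields[4], fields[5], fields[8]
        remote_port = remote.rsplit(":", 1)[-1]
        if remote_port == str(port) and state == "ESTABLISHED" and owner == world_name:
            matches.append((local, remote))
    return matches


print(len(established_to_port(SAMPLE)))
```

An empty result would suggest that vsfwd has no message bus channel to NSX Manager, which is worth ruling out before looking at the rules themselves.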
Next, let’s ensure that the lubuntu-1 virtual machine has the DFW dvFilter applied to it.
[root@esx-a2:~] summarize-dvfilter
Fastpaths:
agent: dvfilter-faulter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter
agent: ESXi-Firewall, refCount: 6, rev: 0x1010000, apiRev: 0x1010000, module: esxfw
agent: dvfilter-generic-vmware, refCount: 2, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter-generic-fastpath
agent: dvfilter-generic-vmware-swsec, refCount: 2, rev: 0x1010000, apiRev: 0x1010000, module: nsx-dvfilter-switch-security
agent: bridgelearningfilter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vdrb
agent: vmware-sfw, refCount: 3, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vsip
Slowpaths:
slowPath: 2, agent vmware-sfw, refCount: 2, rev: 0x4, apiRev: 0x3, capabilities:
Filters:
<snip>
world 69414 vmm0:lubuntu-1 vcUuid:'50 26 1a b6 b1 b8 cc f3-40 22 71 ac 8c d1 29 df'
 port 50331659 lubuntu-1.eth0
  vNic slot 2
   name: nic-69414-eth0-vmware-sfw.2
   agentName: vmware-sfw
   state: IOChain Attached
   vmState: Attached
   failurePolicy: failClosed
   slowPathID: 2
   filter source: Dynamic Filter Creation
  vNic slot 1
   name: nic-69414-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Alternate Opaque Channel
The summarize-dvfilter command is extremely useful in DFW troubleshooting and can tell us a lot. There are two key things here that I want to check. First, we should see a FastPath filter called vmware-sfw listed and associated with the module called nsx-vsip. This tells us that the DFW module on the host is running. Second, we need to look in the output for the virtual machine lubuntu-1. We need to ensure that a slot-2 dvFilter is applied to the VM. Slot-2 is the position in the dvFilter I/O chain for the NSX distributed firewall and never changes.
In the output above, we can see that a slot-2 filter called nic-69414-eth0-vmware-sfw.2 is indeed applied to lubuntu-1. We know that some sort of DFW filtering will be done on this VM, but the question remains – why is lubuntu-1 behaving differently than the other two VMs?
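The slot-2 check can also be sketched in a few lines of Python for cases where you have saved summarize-dvfilter output from many hosts. This is only an illustration: the parsing assumes the "world", "vNic slot" and "name:" lines look like the ones in this post, and the helper name is hypothetical.

```python
import re

# Trimmed sample based on the summarize-dvfilter output from esx-a2.
SAMPLE = """\
world 69414 vmm0:lubuntu-1 vcUuid:'50 26 1a b6 b1 b8 cc f3-40 22 71 ac 8c d1 29 df'
 port 50331659 lubuntu-1.eth0
  vNic slot 2
   name: nic-69414-eth0-vmware-sfw.2
   agentName: vmware-sfw
   state: IOChain Attached
  vNic slot 1
   name: nic-69414-eth0-dvfilter-generic-vmware-swsec.1
   agentName: dvfilter-generic-vmware-swsec
"""


def slot2_filters(output, vm_name):
    """Return the names of slot-2 filters listed under the given VM's world."""
    vm_pattern = re.compile(r"vmm\d+:" + re.escape(vm_name) + r"(\s|$)")
    filters = []
    in_vm = False     # inside the "world ..." block for vm_name
    in_slot2 = False  # inside a "vNic slot 2" sub-block of that VM
    for line in output.splitlines():
        text = line.strip()
        if text.startswith("world "):
            in_vm = vm_pattern.search(text) is not None
            in_slot2 = False
        elif text.startswith("vNic slot "):
            in_slot2 = in_vm and text == "vNic slot 2"
        elif in_slot2 and text.startswith("name:"):
            filters.append(text.split(":", 1)[1].strip())
    return filters


print(slot2_filters(SAMPLE, "lubuntu-1"))
```

An empty list for a VM that should be protected would mean no DFW filter is attached to its vNIC, which is a very different problem from the one in this scenario.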
The next thing I want to check is whether the DFW rule set is up to date and synchronized on host esx-a2. According to the customer, at some point on February 22nd, this new firewall rule was published and pushed down to hosts in the environment. How do we know that esx-a2 successfully received this information? In theory, lubuntu-1 is behaving as if rule 1005 simply doesn’t exist. If that rule didn’t make it to the host, that could explain the problem.
Confirming DFW Synchronization
As a next step, there are two things we are going to check. First, we want to ensure the DFW configuration is synchronized on host esx-a2. Second, we’ll take a look at the actual filter ruleset that is applied to lubuntu-1 to ensure that rule 1005 exists.
Firewall synchronization can be checked several different ways, but the best is to look for what’s referred to as the ‘generation number’. The generation number is essentially just a timestamp in ‘epoch’ or Unix time (in milliseconds) that is used for version control. Every time NSX pushes DFW configuration to a host, it should do so with a generation number. At any given time, the generation number in ESXi’s configuration should match that of NSX Manager.
First, let’s use a simple API call to get the generation number from NSX’s perspective:
GET https://nsxmanager.lab.local/api/4.0/firewall/globalroot-0/config
The complete DFW configuration will be returned, but we’re only interested in the generation number near the top of the output:
<?xml version="1.0" encoding="UTF-8"?>
<firewallConfiguration timestamp="1519339055838">
  <contextId>globalroot-0</contextId>
  <layer3Sections>
    <section id="1004" name="Test Section" generationNumber="1519339055838" timestamp="1519339055838" tcpStrict="false" stateless="false" useSid="false" type="LAYER3">
      <rule id="1005" disabled="false" logged="true">
        <name>Browse Tag Enforce</name>
        <action>deny</action>
        <appliedToList>
          <appliedTo>
            <name>compute-a</name>
            <value>domain-c121</value>
            <type>ClusterComputeResource</type>
            <isValid>true</isValid>
          </appliedTo>
<snip>
So according to NSX Manager, the latest firewall configuration ‘version’ was generation number 1519339055838. This epoch or ‘Unix Time’ number can easily be converted by any number of epoch converter sites online. This number represents Thursday, February 22, 2018 10:37:35.838 PM UTC.
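If you’d rather not paste values into a web site, the same conversion is a couple of lines of Python. The only assumption in this sketch is that the generation number is epoch time in milliseconds, which the value above suggests; the function name is my own.

```python
from datetime import datetime, timezone


def generation_to_utc(generation):
    """Convert a generation number (epoch milliseconds) to an aware UTC datetime."""
    seconds, millis = divmod(int(generation), 1000)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.replace(microsecond=millis * 1000)


# Prints 2018-02-22T22:37:35.838+00:00
print(generation_to_utc(1519339055838).isoformat(timespec="milliseconds"))
```

Splitting off the milliseconds with divmod avoids floating point rounding, so the .838 survives exactly.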
Next, we need to compare this to the generation number in ESXi. There are a few different ways this can be done, but I’ll simply look in /var/log/vsfwd.log for the publish instructions. When NSX Manager pushes this down, the vsfwd.log file will record the generation number as well as the application of new rules to the vNIC filters.
2018-02-22T22:34:20Z vsfwd: [INFO] Applying firewall config to vnic list on host host-225
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 on vnic 50261ab6-b1b8-ccf3-4022-71ac8cd129df.000
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 on vnic 50264b96-2cf4-8a8c-72c2-0dc31f61c109.000
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 for all vnics
As you can see above, the host did indeed act upon the new firewall configuration on the 22nd and applied this ruleset to vNICs. If you look back at the summarize-dvfilter output, you’ll remember that vnic 50261ab6-b1b8-ccf3-4022-71ac8cd129df.000 is associated with lubuntu-1.
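Matching the two identifiers by eye is error-prone because summarize-dvfilter prints the vcUuid as space-separated bytes while vsfwd.log uses the usual hyphenated form. A small Python sketch of that matching follows; the helper names are hypothetical, and the log format is assumed to be exactly what is shown in this post.

```python
import re
import uuid

LOG = """\
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 on vnic 50261ab6-b1b8-ccf3-4022-71ac8cd129df.000
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 on vnic 50264b96-2cf4-8a8c-72c2-0dc31f61c109.000
"""


def normalize_vcuuid(vc_uuid):
    """Turn '50 26 1a b6 ...-40 22 ...' into the hyphenated 8-4-4-4-12 form."""
    hex_digits = vc_uuid.replace(" ", "").replace("-", "")
    return str(uuid.UUID(hex_digits))  # raises ValueError on malformed input


def rulesets_applied(log_text, instance_uuid):
    """Return the RuleSet generation numbers vsfwd logged for this VM's vNICs."""
    pattern = re.compile(
        r"Applied RuleSet (\d+) on vnic " + re.escape(instance_uuid) + r"\.\d+"
    )
    return [int(m.group(1)) for m in pattern.finditer(log_text)]


vm_uuid = normalize_vcuuid("50 26 1a b6 b1 b8 cc f3-40 22 71 ac 8c d1 29 df")
print(vm_uuid, rulesets_applied(LOG, vm_uuid))
```

The sketch matches any numeric suffix after the UUID rather than only .000, since I am assuming that suffix distinguishes the vNICs of a VM.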
Another way you can get an ESXi host’s generation number is to run the following vsipioctl command:
[root@esx-a2:~] vsipioctl loadruleset | head -10
Loading ruleset file: /etc/vmware/vsfwd/vsipfw_ruleset.dat
##################################################
# ruleset message dump #
##################################################
ActionType : replace
Id : domain-c121
Name : domain-c121
Generation : 1519339055838
Rule Count : 8
layer2 rule 1004 {
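Comparing the two generation numbers can be automated once you have both outputs saved as text. The following Python sketch is illustrative only: the XML is a trimmed, well-formed stand-in for the API response, and taking the newest section generationNumber as "the" manager value is my assumption for environments with more than one section.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed stand-in for the NSX Manager API response shown earlier.
API_XML = """\
<firewallConfiguration timestamp="1519339055838">
  <contextId>globalroot-0</contextId>
  <layer3Sections>
    <section id="1004" name="Test Section" generationNumber="1519339055838" type="LAYER3"/>
  </layer3Sections>
</firewallConfiguration>
"""

# Relevant lines from the vsipioctl loadruleset output on the host.
HOST_OUTPUT = """\
ActionType : replace
Id : domain-c121
Name : domain-c121
Generation : 1519339055838
Rule Count : 8
"""


def manager_generation(xml_text):
    """Newest generationNumber found on any <section> element, or None."""
    root = ET.fromstring(xml_text)
    values = [
        int(section.get("generationNumber"))
        for section in root.iter("section")
        if section.get("generationNumber")
    ]
    return max(values) if values else None


def host_generation(loadruleset_text):
    """Generation number from the 'Generation : N' line, or None."""
    match = re.search(r"^\s*Generation\s*:\s*(\d+)\s*$", loadruleset_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def in_sync(xml_text, loadruleset_text):
    """True when both sides report the same, non-missing generation number."""
    manager = manager_generation(xml_text)
    return manager is not None and manager == host_generation(loadruleset_text)


print(in_sync(API_XML, HOST_OUTPUT))
```

A mismatch here would point toward a publish or message bus problem rather than anything specific to a single VM.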
A Closer Look at the Filter
So now that we know the host got the new configuration and acted upon it, let’s have a closer look at the dvFilter in question. Looking back at the summarize-dvfilter output, we’ll first want to get the slot-2 filter name applied to the lubuntu-1 VM. In this case it is:
nic-69414-eth0-vmware-sfw.2
With this filter name, we can query ESXi for the actual ruleset applied to the vNIC. In theory, because the generation number matches, this ruleset should be consistent across all hosts. Let’s confirm:
[root@esx-a2:~] vsipioctl getrules -f nic-69414-eth0-vmware-sfw.2
ruleset domain-c121 {
  # Filter rules
  rule 1005 at 1 inout protocol tcp from addrset ip-securitygroup-11 to any port 80 drop with log;
  rule 1005 at 2 inout protocol tcp from addrset ip-securitygroup-11 to any port 443 drop with log;
  rule 1003 at 3 inout protocol ipv6-icmp icmptype 135 from any to any accept;
  rule 1003 at 4 inout protocol ipv6-icmp icmptype 136 from any to any accept;
  rule 1002 at 5 inout protocol udp from any to any port 68 accept;
  rule 1002 at 6 inout protocol udp from any to any port 67 accept;
  rule 1001 at 7 inout protocol any from any to any accept with log;
}
ruleset domain-c121_L2 {
  # Filter rules
  rule 1004 at 1 inout ethertype any stateless from any to any accept;
}
Sure enough, we can see that rule 1005 is there and applied to lubuntu-1. So that raises the question: why doesn’t it block HTTP traffic from this VM? Well, have a closer look at rule 1005. What don’t we see in the rule? IP addresses. Sure, the rule is there, but how do we know what ip-securitygroup-11 actually contains?
Using vsipioctl again, we’ll find out:
[root@esx-a2:~] vsipioctl getaddrsets -f nic-69414-eth0-vmware-sfw.2
addrset ip-securitygroup-11 {
  ip 172.17.1.30,
  ip 172.17.1.104,
  ip fe80::250:56ff:fea6:4deb,
}
Now we’re getting somewhere. From the first half, you’ll recall that lubuntu-1 has an IP address of 172.17.1.101. It’s not contained in the address set despite the VM being in the security group from the UI!
Just because a dynamic object has been included in a security group doesn’t mean that NSX knows its IP address.
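Checking whether a VM’s IP made it into an address set is easy to script if you need to do it for more than a handful of VMs. Below is a rough Python sketch that parses getaddrsets output in the format shown above; the function name is mine, and the parsing assumes every entry has the form "ip <address>,".

```python
import ipaddress
import re

# The getaddrsets output from esx-a2 before the fix.
SAMPLE = """\
addrset ip-securitygroup-11 {
ip 172.17.1.30,
ip 172.17.1.104,
ip fe80::250:56ff:fea6:4deb,
}
"""


def parse_addrsets(output):
    """Map each addrset name to the set of IP addresses it contains."""
    addrsets = {}
    for name, body in re.findall(r"addrset\s+(\S+)\s*\{(.*?)\}", output, re.DOTALL):
        entries = re.findall(r"\bip\s+([0-9A-Fa-f:.]+)", body)
        addrsets[name] = {ipaddress.ip_address(entry) for entry in entries}
    return addrsets


members = parse_addrsets(SAMPLE)["ip-securitygroup-11"]
print(ipaddress.ip_address("172.17.1.101") in members)  # lubuntu-1
print(ipaddress.ip_address("172.17.1.104") in members)
```

Using ipaddress objects instead of raw strings means differently written but equivalent IPv6 addresses still compare equal.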
IP Discovery
One of the best features of the DFW is the flexibility it provides in using objects in rules instead of IP addresses or groups of IP addresses. For example, for a source/destination you could use a VM in the inventory, a cluster or a security group containing all sorts of dynamic criteria. Underneath all of this, however, the ESXi host needs to be able to inspect segment and packet headers to enforce the rules. These headers are only going to contain identifying information like IP addresses and TCP ports etc. NSX must keep track of which object is associated with which IP address or addresses. It does this by populating address sets as seen in the earlier command output.
There are three ways in which NSX can associate IPs with VMs: VMware Tools reporting, ARP snooping and DHCP snooping. The latter two are disabled by default.
In NSX 6.4.0, a column has been added in the host preparation section to display the enabled IP detection methods. As can be seen above, DHCP and ARP snooping are disabled leaving only VMware Tools address reporting.
ARP snooping can be very effective for IP detection as all VMs communicating at layer-3 will have to ARP out at some point. ESXi will intercept these packets and keep track of the VM IPs. The only unfortunate thing about ARP snooping is that in many releases of NSX, having more than one IP address per vNIC can cause problems. I’d recommend looking at VMware KB 2147907.
Thankfully there are very few use cases where VMware Tools installation is not feasible. Today, the vast majority of Linux distros either include open-vm-tools or make it easy to install. Companies providing virtual appliances also realize the importance of tools and will quite often ensure it’s included.
As you can see above, someone simply forgot to install tools in this lubuntu 16.04 VM. The win-a1 and lubuntu-2 machines had it installed.
After installing tools, firewall rule 1005 began to work exactly as intended and we can now see the IP address included in the address set.
[root@esx-a2:~] vsipioctl getaddrsets -f nic-69414-eth0-vmware-sfw.2
addrset ip-securitygroup-11 {
  ip 172.17.1.30,
  ip 172.17.1.101,
  ip 172.17.1.104,
  ip fe80::250:56ff:fea6:4deb,
  ip fe80::250:56ff:fea6:8d4c,
}
Conclusion
And there you have it! The importance of IP detection cannot be overstated when using inventory objects in rules and security groups. Admittedly, I went about troubleshooting this problem in a roundabout way, but being methodical allowed me to illustrate some important points on how the DFW works.
I hope this was useful. Please keep the troubleshooting scenario suggestions coming! Feel free to leave a comment below or reach out to me on Twitter (@vswitchzero).