NSX Troubleshooting Scenario 5 – Solution

Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.

Please see the first half for more detail on the problem symptoms and some scoping.

Reader Suggestions

There were a few good suggestions from readers. Here are a couple from Twitter:

Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.

The suggestion about VMware Tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.

Getting Started

In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.

[Image: tshoot5a-1]

Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:

  • VM win-a1 and lubuntu-2 are working fine, i.e. they can’t browse the web.
  • VM lubuntu-1 is the only one not working, i.e. it can still browse the web.
  • The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
  • The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
  • All three VMs have the correct security tag applied.
  • All three VMs are indeed showing up correctly in the security group due to the tag.
  • The two working VMs are on host esx-a1 and the broken VM is on host esx-a2.

To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.

First, let’s ensure the vShield-Stateful-Firewall service is running on the host. This service doesn’t actually perform the filtering, but it is responsible for RabbitMQ management plane communication with NSX Manager and for the synchronization of firewall rules.

[root@esx-a2:~] /etc/init.d/vShield-Stateful-Firewall status
vShield-Stateful-Firewall is running

[root@esx-a2:~] esxcli network ip connection list |grep 5671
tcp 0 0 172.16.10.22:53998 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
tcp 0 0 172.16.10.22:21529 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
tcp 0 0 172.16.10.22:39057 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd

We can also see that there are established TCP 5671 sockets with NSX Manager. Based on this, we know there is a communication channel open to NSX Manager and that any firewall changes should get pushed down to this host.
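If the connection list has been saved off the host, the same check can be scripted. The following is only a minimal sketch: the function name `established_to_port` is hypothetical, the sample text is copied from the output above, and the column layout (state in the sixth field, world name in the last) is assumed to match what is shown here.

```python
# Sample copied from the esxcli output above.
SAMPLE = """\
tcp 0 0 172.16.10.22:53998 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
tcp 0 0 172.16.10.22:21529 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
tcp 0 0 172.16.10.22:39057 172.16.10.40:5671 ESTABLISHED 68767 newreno vsfwd
"""


def established_to_port(output, port, process="vsfwd"):
    """Return (local, remote) pairs for ESTABLISHED sockets to `port` owned by `process`."""
    matches = []
    for line in output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state,
        # world ID, congestion algorithm, world name.
        if len(fields) < 9:
            continue
        local, remote, state, name = fields[3], fields[4], fields[5], fields[-1]
        remote_port = remote.rsplit(":", 1)[-1]
        if state == "ESTABLISHED" and name == process and remote_port == str(port):
            matches.append((local, remote))
    return matches


connections = established_to_port(SAMPLE, 5671)
print(len(connections), "established vsfwd connections to port 5671")
```

An empty result would suggest the message bus channel to NSX Manager is down, which would be worth chasing before looking at anything else.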

Next, let’s ensure that the lubuntu-1 virtual machine has the DFW dvFilter applied to it.

[root@esx-a2:~] summarize-dvfilter
Fastpaths:
agent: dvfilter-faulter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter
agent: ESXi-Firewall, refCount: 6, rev: 0x1010000, apiRev: 0x1010000, module: esxfw
agent: dvfilter-generic-vmware, refCount: 2, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter-generic-fastpath
agent: dvfilter-generic-vmware-swsec, refCount: 2, rev: 0x1010000, apiRev: 0x1010000, module: nsx-dvfilter-switch-security
agent: bridgelearningfilter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vdrb
agent: vmware-sfw, refCount: 3, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vsip

Slowpaths:
slowPath: 2, agent vmware-sfw, refCount: 2, rev: 0x4, apiRev: 0x3, capabilities:

Filters:
<snip>
world 69414 vmm0:lubuntu-1 vcUuid:'50 26 1a b6 b1 b8 cc f3-40 22 71 ac 8c d1 29 df'
 port 50331659 lubuntu-1.eth0
 vNic slot 2
 name: nic-69414-eth0-vmware-sfw.2
 agentName: vmware-sfw
 state: IOChain Attached
 vmState: Attached
 failurePolicy: failClosed
 slowPathID: 2
 filter source: Dynamic Filter Creation
 vNic slot 1
 name: nic-69414-eth0-dvfilter-generic-vmware-swsec.1
 agentName: dvfilter-generic-vmware-swsec
 state: IOChain Attached
 vmState: Detached
 failurePolicy: failClosed
 slowPathID: none
 filter source: Alternate Opaque Channel

The summarize-dvfilter command is extremely useful in DFW troubleshooting and can tell us a lot. There are two key things here that I want to check. First, we should see a FastPath filter called vmware-sfw listed and associated with the module called nsx-vsip. This tells us that the DFW module on the host is running. Second, we need to look in the output for the virtual machine lubuntu-1. We need to ensure that a slot-2 dvFilter is applied to the VM. Slot-2 is the position in the dvFilter I/O chain for the NSX distributed firewall and never changes.

In the output above, we can see that a slot-2 filter called nic-69414-eth0-vmware-sfw.2 is indeed applied to lubuntu-1. We know that some sort of DFW filtering will be done on this VM, but the question remains – why is lubuntu-1 behaving differently than the other two VMs?
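On a host with many VMs, picking the right filter name out of this output by eye gets tedious. Here is a rough sketch of how a saved copy of the output could be parsed instead. The function name `slot_filters` is hypothetical, and the parsing assumes the layout shown above (a `world` line per VM, followed by `vNic slot` and `name:` lines).

```python
import re

# Trimmed copy of the "Filters" section shown above.
FILTERS = """\
world 69414 vmm0:lubuntu-1 vcUuid:'50 26 1a b6 b1 b8 cc f3-40 22 71 ac 8c d1 29 df'
 port 50331659 lubuntu-1.eth0
 vNic slot 2
 name: nic-69414-eth0-vmware-sfw.2
 agentName: vmware-sfw
 vNic slot 1
 name: nic-69414-eth0-dvfilter-generic-vmware-swsec.1
 agentName: dvfilter-generic-vmware-swsec
"""


def slot_filters(output, vm_name):
    """Map vNIC slot number -> filter name for the given VM."""
    filters = {}
    in_vm = False
    slot = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("world "):
            # Each 'world' line starts a new VM block.
            match = re.match(r"world \d+ vmm\d+:(\S+)", line)
            in_vm = bool(match) and match.group(1) == vm_name
            slot = None
        elif in_vm and line.startswith("vNic slot "):
            slot = int(line.split()[-1])
        elif in_vm and slot is not None and line.startswith("name: "):
            filters[slot] = line[len("name: "):]
    return filters


found = slot_filters(FILTERS, "lubuntu-1")
print(found.get(2, "no slot-2 (DFW) filter found"))
```

A VM with no entry for slot 2 would have no DFW filter attached at all, which is a different problem from the one in this scenario.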

The next thing I want to check is that the DFW rule set is up to date and synchronized on host esx-a2. According to the customer, at some point on February 22nd, this new firewall rule was published and pushed down to hosts in the environment. How do we know that esx-a2 successfully received this information? lubuntu-1 is behaving as if rule 1005 simply doesn’t exist. If that rule didn’t make it to the host, that could explain the problem.

Confirming DFW Synchronization

As a next step, there are two things we are going to check. First, we want to ensure the DFW configuration is synchronized on host esx-a2. Second, we’ll take a look at the actual filter ruleset that is applied to lubuntu-1 to ensure that rule 1005 exists.

Firewall synchronization can be checked several different ways, but the best is to look for what’s referred to as the ‘generation number’. The generation number is essentially just a timestamp in ‘epoch’ or Unix time, expressed in milliseconds, that is used for version control. Every time NSX pushes DFW configuration to a host, it should do so with a generation number. At any given time, the generation number in ESXi’s configuration should match that of NSX Manager.

First, let’s use a simple API call to get the generation number from NSX’s perspective:

GET https://nsxmanager.lab.local/api/4.0/firewall/globalroot-0/config

The complete DFW configuration will be returned, but we’re only interested in the generation number near the top of the output:

<?xml version="1.0" encoding="UTF-8"?>
<firewallConfiguration timestamp="1519339055838">
 <contextId>globalroot-0</contextId>
 <layer3Sections>
 <section id="1004" name="Test Section" generationNumber="1519339055838" timestamp="1519339055838" tcpStrict="false" stateless="false" useSid="false" type="LAYER3">
 <rule id="1005" disabled="false" logged="true">
 <name>Browse Tag Enforce</name>
 <action>deny</action>
 <appliedToList>
 <appliedTo>
 <name>compute-a</name>
 <value>domain-c121</value>
 <type>ClusterComputeResource</type>
 <isValid>true</isValid>
 </appliedTo>
<snip>

So according to NSX Manager, the latest firewall configuration ‘version’ was generation number 1519339055838. This epoch or ‘Unix Time’ number can easily be converted by any number of epoch converter sites online. This number represents Thursday, February 22, 2018 10:37:35.838 PM UTC.
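If you would rather not paste the number into a website, the conversion is easy to do locally. The sketch below reads the `generationNumber` attribute from a saved response and converts it to UTC. The XML here is a trimmed, hand-closed stand-in for the real response (which is much longer), and the helper name `to_utc` is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Trimmed, hand-closed stand-in for the API response; only the attributes
# used below are kept.
RESPONSE = """\
<firewallConfiguration timestamp="1519339055838">
  <contextId>globalroot-0</contextId>
  <layer3Sections>
    <section id="1004" name="Test Section" generationNumber="1519339055838"/>
  </layer3Sections>
</firewallConfiguration>
"""


def to_utc(generation_ms):
    """Convert a millisecond epoch value to a timezone-aware UTC datetime."""
    # divmod keeps the millisecond part exact instead of relying on float division.
    seconds, millis = divmod(generation_ms, 1000)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.replace(microsecond=millis * 1000)


root = ET.fromstring(RESPONSE)
for section in root.iter("section"):
    generation = int(section.get("generationNumber"))
    print(section.get("name"), generation,
          to_utc(generation).isoformat(timespec="milliseconds"))
# prints: Test Section 1519339055838 2018-02-22T22:37:35.838+00:00
```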

Next, we need to compare this to the generation number in ESXi. There are a few different ways this can be done, but I’ll simply look in /var/log/vsfwd.log for the publish instructions. When NSX Manager pushes this down, the vsfwd.log file will record the generation number as well as the application of new rules to the vNIC filters.

2018-02-22T22:34:20Z vsfwd: [INFO] Applying firewall config to vnic list on host host-225
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 on vnic 50261ab6-b1b8-ccf3-4022-71ac8cd129df.000
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 on vnic 50264b96-2cf4-8a8c-72c2-0dc31f61c109.000
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 for all vnics

As you can see above, the host did indeed act upon the new firewall configuration on the 22nd and applied this ruleset to vNICs. If you look back at the summarize-dvfilter output, you’ll remember that vnic 50261ab6-b1b8-ccf3-4022-71ac8cd129df.000 is associated with lubuntu-1.
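The link between the two outputs is the VM's vcUuid: written without spaces and hyphenated as a standard UUID, it forms the first part of the vnic ID in the log. A small sketch of that cross-check follows, using the log lines above as sample data. The function names are hypothetical, and the log line format is assumed to be exactly as shown here.

```python
import re
import uuid

# Sample copied from the vsfwd.log excerpt above.
LOG = """\
2018-02-22T22:34:20Z vsfwd: [INFO] Applying firewall config to vnic list on host host-225
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 on vnic 50261ab6-b1b8-ccf3-4022-71ac8cd129df.000
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 on vnic 50264b96-2cf4-8a8c-72c2-0dc31f61c109.000
2018-02-22T22:34:20Z vsfwd: [INFO] Applied RuleSet 1519339055838 for all vnics
"""

APPLIED = re.compile(r"Applied RuleSet (\d+) on vnic (\S+)")


def applied_rulesets(log_text):
    """Return {vnic ID: most recent generation number applied to it}."""
    latest = {}
    for line in log_text.splitlines():
        match = APPLIED.search(line)
        if match:
            latest[match.group(2)] = int(match.group(1))
    return latest


def vnic_prefix(vc_uuid):
    """Turn the spaced vcUuid from summarize-dvfilter into hyphenated UUID form."""
    return str(uuid.UUID(hex=re.sub(r"[^0-9a-fA-F]", "", vc_uuid)))


manager_generation = 1519339055838  # value reported by NSX Manager
prefix = vnic_prefix("50 26 1a b6 b1 b8 cc f3-40 22 71 ac 8c d1 29 df")
for vnic, generation in applied_rulesets(LOG).items():
    if vnic.startswith(prefix):
        print(vnic, "in sync" if generation == manager_generation else "STALE")
```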

Another way you can get an ESXi host’s generation number is to run the following vsipioctl command:

[root@esx-a2:~] vsipioctl loadruleset | head -10
Loading ruleset file: /etc/vmware/vsfwd/vsipfw_ruleset.dat
##################################################
# ruleset message dump #
##################################################
ActionType : replace
Id : domain-c121
Name : domain-c121
Generation : 1519339055838
Rule Count : 8
layer2 rule 1004 {

A Closer Look at the Filter

So now that we know the host got the new configuration and acted upon it, let’s have a closer look at the dvFilter in question. Looking back at the summarize-dvfilter output, we’ll first want to get the slot-2 filter name applied to the lubuntu-1 VM. In this case it is:
nic-69414-eth0-vmware-sfw.2.

With this filter name, we can query ESXi for the actual ruleset applied to the vNIC. In theory, because the generation number matches, this ruleset should be consistent across all hosts. Let’s confirm:

[root@esx-a2:~] vsipioctl getrules -f nic-69414-eth0-vmware-sfw.2
ruleset domain-c121 {
 # Filter rules
 rule 1005 at 1 inout protocol tcp from addrset ip-securitygroup-11 to any port 80 drop with log;
 rule 1005 at 2 inout protocol tcp from addrset ip-securitygroup-11 to any port 443 drop with log;
 rule 1003 at 3 inout protocol ipv6-icmp icmptype 135 from any to any accept;
 rule 1003 at 4 inout protocol ipv6-icmp icmptype 136 from any to any accept;
 rule 1002 at 5 inout protocol udp from any to any port 68 accept;
 rule 1002 at 6 inout protocol udp from any to any port 67 accept;
 rule 1001 at 7 inout protocol any from any to any accept with log;
}

ruleset domain-c121_L2 {
 # Filter rules
 rule 1004 at 1 inout ethertype any stateless from any to any accept;
}

Sure enough, we can see that rule 1005 is there and applied to lubuntu-1. So that raises the question: why doesn’t it block HTTP traffic from this VM? Well, have a closer look at rule 1005. What don’t we see in the rule? IP addresses. Sure, the rule is there, but how do we know what ip-securitygroup-11 actually contains?

Using vsipioctl again, we’ll find out:

[root@esx-a2:~] vsipioctl getaddrsets -f nic-69414-eth0-vmware-sfw.2
addrset ip-securitygroup-11 {
ip 172.17.1.30,
ip 172.17.1.104,
ip fe80::250:56ff:fea6:4deb,
}

Now we’re getting somewhere. From the first half, you’ll recall that lubuntu-1 has an IP address of 172.17.1.101. It’s not contained in the address set despite the VM being in the security group from the UI!
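With larger address sets, spotting a missing IP by eye is error-prone, so a membership test is handy. The following is a minimal sketch that parses a saved copy of the getaddrsets output. The function name `parse_addrsets` is hypothetical, and it assumes every entry is a single address on an `ip` line, as in the output above.

```python
import ipaddress
import re

# Sample copied from the getaddrsets output above.
OUTPUT = """\
addrset ip-securitygroup-11 {
ip 172.17.1.30,
ip 172.17.1.104,
ip fe80::250:56ff:fea6:4deb,
}
"""


def parse_addrsets(text):
    """Return {address set name: set of ipaddress objects}."""
    sets = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"addrset (\S+) \{", line)
        if header:
            current = sets.setdefault(header.group(1), set())
        elif line == "}":
            current = None
        elif current is not None and line.startswith("ip "):
            # Drop the leading "ip " and the trailing comma.
            current.add(ipaddress.ip_address(line[3:].rstrip(",").strip()))
    return sets


addrsets = parse_addrsets(OUTPUT)
vm_ip = ipaddress.ip_address("172.17.1.101")  # lubuntu-1
print(vm_ip in addrsets["ip-securitygroup-11"])  # prints: False
```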

[Image: tshoot5a-2]

Just because a dynamic object has been included in a security group doesn’t mean that NSX knows its IP address.

IP Discovery

One of the best features of the DFW is the flexibility it provides in using objects in rules instead of IP addresses or groups of IP addresses. For example, for a source/destination you could use a VM in the inventory, a cluster or a security group containing all sorts of dynamic criteria. Underneath all of this, however, the ESXi host needs to be able to inspect segment and packet headers to enforce the rules. These headers are only going to contain identifying information like IP addresses and TCP ports etc. NSX must keep track of which object is associated with which IP address or addresses. It does this by populating address sets as seen in the earlier command output.
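To make the consequence concrete, here is a deliberately simplified model of top-down, first-match rule evaluation against an address set. It is not how the vsip kernel module is implemented. The `Rule` class and `first_match` function are hypothetical, the rule list is a trimmed version of the getrules output for lubuntu-1, and the address set holds only the IPv4 entries the host had before the fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    rule_id: int
    action: str
    protocol: Optional[str] = None      # None means "any"
    src_addrset: Optional[str] = None   # None means "any"
    dst_port: Optional[int] = None      # None means "any"


# Trimmed from the getrules output (the ipv6-icmp rules are left out).
RULES = [
    Rule(1005, "drop", "tcp", "ip-securitygroup-11", 80),
    Rule(1005, "drop", "tcp", "ip-securitygroup-11", 443),
    Rule(1002, "accept", "udp", None, 68),
    Rule(1002, "accept", "udp", None, 67),
    Rule(1001, "accept"),
]

# The address set as the host knew it before the fix.
ADDRSETS = {"ip-securitygroup-11": {"172.17.1.30", "172.17.1.104"}}


def first_match(src_ip, protocol, dst_port):
    """Return the first rule, top to bottom, that matches the packet."""
    for rule in RULES:
        if rule.protocol is not None and rule.protocol != protocol:
            continue
        if rule.dst_port is not None and rule.dst_port != dst_port:
            continue
        if (rule.src_addrset is not None
                and src_ip not in ADDRSETS.get(rule.src_addrset, set())):
            continue
        return rule
    return None


for ip in ("172.17.1.30", "172.17.1.101"):
    hit = first_match(ip, "tcp", 80)
    print(ip, "-> rule", hit.rule_id, hit.action)
# prints:
# 172.17.1.30 -> rule 1005 drop
# 172.17.1.101 -> rule 1001 accept
```

A source IP that is missing from the address set never matches rule 1005, so its HTTP traffic falls through to the default allow rule 1001, which is exactly the symptom seen on lubuntu-1.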

There are three ways in which NSX can associate IPs with VMs: VMware Tools reporting, ARP snooping and DHCP snooping. The latter two are disabled by default.

[Image: tshoot5b]

In NSX 6.4.0, a column has been added in the host preparation section to display the enabled IP detection methods. As can be seen above, DHCP and ARP snooping are disabled leaving only VMware Tools address reporting.

ARP snooping can be very effective for IP detection as all VMs communicating at layer-3 will have to ARP out at some point. ESXi will intercept these packets and keep track of the VM IPs. The only unfortunate thing about ARP snooping is that in many releases of NSX, having more than one IP address per vNIC can cause problems. I’d recommend looking at VMware KB 2147907.

Thankfully there are very few use cases where VMware Tools installation is not feasible. Today, the vast majority of Linux distros include open-vm-tools or it can be easily installed. Companies providing virtual appliances also realize the importance of tools and will quite often ensure it’s included.

[Image: tshoot5b-3]

As you can see above, someone simply forgot to install tools in this lubuntu 16.04 VM. The win-a1 and lubuntu-2 machines had it installed.

After installing tools, firewall rule 1005 began to work exactly as intended and we can now see the IP address included in the address set.

[root@esx-a2:~] vsipioctl getaddrsets -f nic-69414-eth0-vmware-sfw.2
addrset ip-securitygroup-11 {
ip 172.17.1.30,
ip 172.17.1.101,
ip 172.17.1.104,
ip fe80::250:56ff:fea6:4deb,
ip fe80::250:56ff:fea6:8d4c,
}
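Comparing the address set before and after the fix makes the change obvious. A tiny sketch, with both sets copied from the two getaddrsets outputs in this post:

```python
# Address set contents before and after installing VMware Tools,
# copied from the two getaddrsets outputs.
before = {
    "172.17.1.30",
    "172.17.1.104",
    "fe80::250:56ff:fea6:4deb",
}
after = {
    "172.17.1.30",
    "172.17.1.101",
    "172.17.1.104",
    "fe80::250:56ff:fea6:4deb",
    "fe80::250:56ff:fea6:8d4c",
}

added = sorted(after - before)
removed = sorted(before - after)
print("added:", added)      # prints: added: ['172.17.1.101', 'fe80::250:56ff:fea6:8d4c']
print("removed:", removed)  # prints: removed: []
```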

Conclusion

And there you have it! The importance of IP detection cannot be overstated when using inventory objects in rules and security groups. Admittedly, I went about troubleshooting this problem in a roundabout way, but being methodical allowed me to illustrate some important points on how the DFW works.

I hope this was useful. Please keep the troubleshooting scenario suggestions coming! Feel free to leave a comment below or reach out to me on Twitter (@vswitchzero).

 
