NSX Troubleshooting Scenario 12 – Solution

Welcome to the twelfth installment of my series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll walk through the troubleshooting and show how I arrived at the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall from the first half, our fictional customer was getting some unexpected behavior from a couple of firewall rules. Despite the rules being properly constructed, one VM called linux-a3 remained accessible via SSH.

The two rules in question – 1007 and 1008 – look to be constructed correctly.

We confirmed that the IP addresses for the machines in the security group were translated correctly by NSX and that the ruleset didn’t appear to be the problem. Let’s recap what we know:

  1. VM linux-a2 seems to be working correctly and SSH traffic is blocked.
  2. VM linux-a3 doesn’t seem to respect rule 1007 for some reason and remains accessible via SSH from everywhere.
  3. Host esx-a3 where linux-a3 resides doesn’t appear to log any activity for rule 1007 or 1008 even though those rules are configured to log.
  4. The two VMs are on different ESXi hosts (esx-a1 and esx-a3).
  5. VMs linux-a2 and linux-a3 are in different dvPortgroups.

Given these statements, there are several things I’d want to check:

  1. How can the two VMs have proper IP connectivity when one is in a VXLAN portgroup and the other in a VLAN portgroup, as observed?
  2. Is the DFW working at all on host esx-a3?
  3. Did the last rule publication make it to host esx-a3 and does it match what we see in the UI?
  4. Is the DFW (slot-2) dvfilter applied to linux-a3 correctly?

Before we begin, let’s have a quick look at how linux-a2 and linux-a3 could be on different portgroups – one VXLAN and one VLAN – yet still be in the same subnet. If my hunch is correct, there must be a layer-2 bridge or a hardware VTEP in place for this to be working correctly.

An L2 bridge explains how the two VMs could be on VXLAN and VLAN backed portgroups yet share a subnet.

Sure enough, we can see that dlr-a1 is bridging between dvpg-a-vlan15 and the logical switch called ‘Purple Network’. The fact that one VM is attached to a VLAN-backed portgroup and the other to a logical switch has no bearing on DFW functionality. Because the DFW filters at each vNIC, it makes no difference where that vNIC is connected. We know that layer-3 connectivity to both VMs is good, so I don’t think we need to look any further into this point.

Next, we’ll do a quick migration test: move VM linux-a2 to host esx-a3 and see if we get the same behavior. This will help rule out a host-level problem. We’ll also move linux-a3 to host esx-a1.

It doesn’t seem to matter which host the VMs are on. The same behavior persists.

Sure enough, the behavior persists. The location of the VMs doesn’t appear to be playing a role here. After testing this, I moved the VMs back to their original locations to keep things consistent.

Next, let’s log in to host esx-a1 to have a look at the linux-a2 VM that seems to be working correctly. We’ll collect some data for comparison purposes.

First, we’ll run the summarize-dvfilter command to get the vNic filters for linux-a2:

[root@esx-a1:~] summarize-dvfilter
Fastpaths:
agent: dvfilter-faulter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter
agent: ESXi-Firewall, refCount: 6, rev: 0x1010000, apiRev: 0x1010000, module: esxfw
agent: dvfilter-generic-vmware, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter-generic-fastpath
agent: dvfilter-generic-vmware-swsec, refCount: 5, rev: 0x1010000, apiRev: 0x1010000, module: nsx-dvfilter-switch-security
agent: bridgelearningfilter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vdrb
agent: vmware-sfw, refCount: 5, rev: 0x1010000, apiRev: 0x1010000, module: nsx-vsip

Slowpaths:
slowPath: 2, agent vmware-sfw, refCount: 4, rev: 0x4, apiRev: 0x3, capabilities:

Filters:
<snip>
world 69977 vmm0:linux-a2 vcUuid:'50 26 f0 a9 f9 5e af 2a-e5 b2 19 f8 91 f5 03 e5'
 port 67108884 linux-a2.eth0
  vNic slot 2
  name: nic-69977-eth0-vmware-sfw.2
 agentName: vmware-sfw
   state: IOChain Attached
   vmState: Attached
   failurePolicy: failClosed
   slowPathID: 2
   filter source: Dynamic Filter Creation
  vNic slot 1
  name: nic-69977-eth0-dvfilter-generic-vmware-swsec.1
 agentName: dvfilter-generic-vmware-swsec
   state: IOChain Attached
   vmState: Detached
   failurePolicy: failClosed
   slowPathID: none
   filter source: Alternate Opaque Channel
<snip>

The filter associated with the DFW is called vmware-sfw and always occupies slot 2. We can see here that there is indeed a slot-2 filter attached to linux-a2. The filter name is nic-69977-eth0-vmware-sfw.2.
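As an aside, if you only need the slot-2 filter name for a particular VM, you can chain a couple of greps against the same output. A minimal sketch based on the output format shown above:

[root@esx-a1:~] summarize-dvfilter | grep -A3 vmm0:linux-a2 | grep vmware-sfw
  name: nic-69977-eth0-vmware-sfw.2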

Next, let’s have a look at the ruleset that actually made it to host esx-a1 and ultimately to the vNIC filter that’s used by the DFW for this guest:

[root@esx-a1:~] vsipioctl getrules -f nic-69977-eth0-vmware-sfw.2
ruleset domain-c41 {
  # Filter rules
  rule 1007 at 1 inout protocol tcp from addrset ip-vm-341 to addrset ip-securitygroup-11 port 22 accept with log;
  rule 1008 at 2 inout protocol tcp from any to addrset ip-securitygroup-11 port 22 reject with log;
  rule 1003 at 3 inout protocol ipv6-icmp icmptype 136 from any to any accept;
  rule 1003 at 4 inout protocol ipv6-icmp icmptype 135 from any to any accept;
  rule 1002 at 5 inout protocol udp from any to any port 67 accept;
  rule 1002 at 6 inout protocol udp from any to any port 68 accept;
  rule 1001 at 7 inout protocol any from any to any accept;
}

ruleset domain-c41_L2 {
  # Filter rules
  rule 1004 at 1 inout ethertype any stateless from any to any accept;
}

We can clearly see that rules 1007 and 1008 made it to this vNIC filter and appear to match what we saw in the UI earlier. Just to be thorough, we can check to ensure that the object identifiers ip-vm-341 and ip-securitygroup-11 contain the IPs we expect:

[root@esx-a1:~] vsipioctl getaddrsets -f nic-69977-eth0-vmware-sfw.2
addrset ip-securitygroup-11 {
ip 172.16.15.10,
ip 172.16.15.11,
ip fe80::250:56ff:fea6:902,
ip fe80::250:56ff:fea6:c0b0,
}
addrset ip-vm-341 {
ip 172.16.1.151,
}

Sure enough, they match perfectly.

We can also compare the ‘generation number’ of the ruleset on host esx-a1 to ensure it matches the last publication date from NSX manager.

[root@esx-a1:~] vsipioctl loadruleset |head -9
Loading ruleset file: /etc/vmware/vsfwd/vsipfw_ruleset.dat
##################################################
#             ruleset message dump               #
##################################################
ActionType      : replace
Id              : domain-c41
Name            : domain-c41
Generation      : 1543456074899
Rule Count      : 8

The generation number represents the point in time a publish operation occurred. It’s actually a Unix epoch timestamp (in milliseconds) that can be converted to an actual date and time. In this instance, the number above equates to Thursday, November 29, 2018 1:47:54.899 AM UTC. There are a number of online tools that will convert these values.
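If you’d rather not rely on a website, any machine with Python can do the conversion. A minimal sketch, dividing by 1000 to convert milliseconds to seconds:

$ python -c "import datetime; print(datetime.datetime.utcfromtimestamp(1543456074899/1000.0))"
2018-11-29 01:47:54.899000

This even works on the ESXi host itself, since ESXi ships with a Python interpreter.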

We can find this in the NSX Manager logs for comparison purposes. Searching for the message logged when rules are pushed to the clusters is usually a safe bet:

[root@nsxmanager /home/secureall/secureall/logs]# cat vsm.log |grep "Sending rules to Cluster"
<snip>
2018-11-29 01:47:55.317 GMT+00:00  INFO TaskFrameworkExecutor-9 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: null Object Generation Number 1543456074899.
2018-11-29 01:47:57.422 GMT+00:00  INFO TaskFrameworkExecutor-16 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: 1543337228980 Object Generation Number 1543456074899.

The last rule publication in the log references the same generation number, and the log timestamps match when converted from epoch format. You can confirm that the cluster moref (domain-c41) matches the cluster you’re interested in by looking it up in the vCenter managed object browser.
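To be extra certain, you can also grep the NSX Manager log for the exact generation number taken from the host:

[root@nsxmanager /home/secureall/secureall/logs]# grep "1543456074899" vsm.log

This should return the same ‘Sending rules to Cluster’ entries shown above.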

Now let’s shift our focus to esx-a3. If the last publication task was successful, it should also have a matching generation number:

[root@esx-a3:~] vsipioctl loadruleset |head -9
Loading ruleset file: /etc/vmware/vsfwd/vsipfw_ruleset.dat
##################################################
#             ruleset message dump               #
##################################################
ActionType      : replace
Id              : domain-c41
Name            : domain-c41
Generation      : 1543456074899
Rule Count      : 8

Sure enough, it does. We can rest assured that esx-a1 and esx-a3 have the exact same ruleset. Let’s take a look at the filters applied to linux-a3:

[root@esx-a3:~] summarize-dvfilter |grep -i -A10 linux-a3
[root@esx-a3:~]

Nothing at all! This means that there are no dvfilters applied whatsoever to the linux-a3 VM. Without a slot-2 dvfilter, the DFW can’t possibly filter traffic for this VM. There are really only two possible explanations for this:

  1. The slot-2 dvfilter failed to apply to the vNIC of this VM for some reason – either due to a host problem or an issue with the VM itself. The VM would likely have no connectivity at all if this was the case.
  2. The linux-a3 VM is in the DFW exclusion list. The exclusion list doesn’t work like a special ‘allow all’ rule; excluded VMs simply don’t get a slot-2 dvfilter at all, so there is no DFW inspection whatsoever on them.

The linux-a3 VM is on the DFW exclusion list. Since this is not a system VM, someone must have added it here.

And there you have it. Sometimes it’s something very simple that explains odd behavior. After discussing further with our fictional customer, it turned out that someone had added the VM to the exclusion list to do some testing and simply forgot to remove it.
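As an aside, the exclusion list can also be viewed and modified via the NSX Manager REST API, which makes it easy to audit for forgotten entries like this one. A minimal sketch, assuming the NSX-v excludelist endpoint; the VM moref shown is a hypothetical example:

# List the current DFW exclusion list members (returns XML):
$ curl -k -u admin https://nsxmanager/api/2.1/app/excludelist

# Remove a member by its vCenter moref (vm-343 is a hypothetical ID):
$ curl -k -u admin -X DELETE https://nsxmanager/api/2.1/app/excludelist/vm-343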

After removing it, we immediately see the slot-2 filter applied again:

[root@esx-a3:~] summarize-dvfilter |grep -i -A10 linux-a3
world 70806 vmm0:linux-a3 vcUuid:'50 26 9d 4b 32 5c 18 71-c8 b0 46 ca 25 63 74 8c'
 port 67108882 linux-a3.eth0
  vNic slot 2
  name: nic-70806-eth0-vmware-sfw.2
 agentName: vmware-sfw
   state: IOChain Attached
   vmState: Attached
   failurePolicy: failClosed
   slowPathID: 2
   filter source: Dynamic Filter Creation

SSH now works exactly as expected.
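To verify, it’s worth testing from both a permitted and a non-permitted source. Rule 1008 is a reject rule, so blocked attempts should fail immediately with a TCP reset rather than time out. A quick sketch, with a placeholder for linux-a3’s address:

# From a source NOT matching ip-vm-341, the connection should now be rejected:
$ ssh user@<linux-a3 IP>
ssh: connect to host <linux-a3 IP> port 22: Connection refused

# From the permitted source in rule 1007 (172.16.1.151), SSH should still succeed.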

Reader Feedback

I got some great feedback from a few individuals on Twitter this time around. Some excellent suggestions from @vVadster, who was right on the money.

As usual, @alagoutte recognized the potential problem almost immediately 🙂

Conclusion

Understanding how the DFW works behind the scenes is important for troubleshooting it effectively. I hope this scenario was helpful. If you have any questions or suggestions for future scenarios, please feel free to leave a comment below or reach out to me on Twitter (@vswitchzero).
