NSX Troubleshooting Scenario 6

Welcome to the sixth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

As always, we’ll start with a brief customer problem statement:

“Help! It looks like we accidentally blocked access to vCenter Server! We have two clusters, a compute and a management cluster. My colleague noticed the firewall was disabled on the management cluster and turned it on. As soon as he did that we lost all access to the vSphere Web Client.”

Well, this sounds like a classic ‘chicken or the egg’ dilemma: how can they recover if they can’t log in to the vSphere Web Client to revert the changes that broke things?

In speaking with our fictional customer, we learn that some rules are in place to block all HTTP/HTTPS access in the compute cluster. Because they are still deploying VMs and getting everything patched, they are using this as a temporary means to prevent all web access. Unfortunately, he can’t remember exactly what was configured in the firewall and there may be other restrictions in place.

This was a screenshot of the last thing he saw before his web client session started timing out:

[Screenshot: tshoot6a-1]

Starting with some basic ping tests, we can see that the vCenter Server and NSX Manager are both still accessible from a layer-3 perspective:

[Screenshot: tshoot6a-2]
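
For reference, the checks themselves are nothing fancier than a couple of pings from a machine on the management network. Something like the following, where the NSX Manager address is just a placeholder for this example (vCenter’s 172.16.10.15 address appears again in the netcat tests below):

ping 172.16.10.15    # vCenter Server appliance - responds normally
ping 172.16.10.40    # NSX Manager (placeholder address) - responds normally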

NSX Manager should always remain accessible regardless of DFW changes because it’s included in the NSX DFW exclusion list by default. This isn’t the case with vCenter Server and other core infrastructure VMs that may be strewn about.
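
As an aside, since NSX Manager is still reachable, its REST API can show you exactly which VMs are on the exclusion list. A call along these lines should return the list as XML (the manager hostname and credentials here are placeholders):

curl -k -u 'admin:password' https://nsxmanager.lab.local/api/2.1/app/excludelist

The -k flag simply skips certificate validation, which is typical in a lab with self-signed certificates.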

Everything was fine until the customer enabled the DFW in the management cluster, so some of the rules must have impacted web client functionality. Based on the symptoms, it’s probably HTTPS port 443 that’s being blocked as the customer suspects.

Since I have access to an ESXi host with its management vmkernel port in the same subnet as vCenter, I’ll use netcat to do some tests:

[root@esx0:~] esxcli network firewall unload
[root@esx0:~] nc -zv 172.16.10.15 80
nc: connect to 172.16.10.15 port 80 (tcp) failed: Connection timed out
[root@esx0:~] nc -zv 172.16.10.15 443
nc: connect to 172.16.10.15 port 443 (tcp) failed: Connection timed out

Note: I had to disable the ESXi firewall to allow netcat to work unhindered. The ESXi firewall is a very basic stateless firewall used for ESXi’s vmkernel networking and has nothing to do with the NSX DFW.

If I repeat this with other ports I know vCenter Server is listening on – like 1514 for syslog collection and 5480 for the appliance management interface – I get successful connections:

[root@esx0:~] nc -zv 172.16.10.15 1514
Connection to 172.16.10.15 1514 port [tcp/*] succeeded!
[root@esx0:~] nc -zv 172.16.10.15 5480
Connection to 172.16.10.15 5480 port [tcp/*] succeeded!
[root@esx0:~] esxcli network firewall load

I reloaded the ESXi firewall when I was finished using netcat. Ports 80 and 443 are indeed being blocked, just as the customer feared. Let’s see if we can get a better understanding of the rules applied to the management cluster and the possible consequences.

There are a couple of different ways to get the current rule set. Because NSX Manager is automatically excluded from the DFW, we should be able to use API calls to get the firewall rules. Another way would be to query the dvFilter applied to a VM from the ESXi host it’s registered on. Since I’m already logged into the ESXi host where vCenter resides, I’ll use the latter method.
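
For completeness, the API route would be a single GET against NSX Manager for the current DFW configuration. Something like this should dump the entire rule set as XML (hostname and credentials are again placeholders):

curl -k -u 'admin:password' https://nsxmanager.lab.local/api/4.0/firewall/globalroot-0/config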

Before we do that, let’s confirm that the DFW was indeed applied to the vCenter Server. We can do this by looking at the output of the summarize-dvfilter command:

[root@esx0:~] summarize-dvfilter
<snip>
world 69262 vmm0:vc vcUuid:'52 bf b9 f3 99 9b ac bc-16 33 a0 d4 d7 74 59 42'
 port 33554445 vc
 vNic slot 2
 name: nic-69262-eth0-vmware-sfw.2
 agentName: vmware-sfw
 state: IOChain Attached
 vmState: Attached
 failurePolicy: failClosed
 slowPathID: 2
 filter source: Dynamic Filter Creation
<snip>

Notice that the VM called vc has a slot 2 dvfilter applied. The mere existence of this slot 2 filter on its vNIC confirms that vCenter Server is not in the exclusion list and is subject to all of the DFW rules configured.
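
If you don’t have the summarize-dvfilter output handy, vsipioctl can also enumerate the filters present on a host, which is another way to find the filter name we’ll need in the next step:

[root@esx0:~] vsipioctl getfilters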

Next, let’s get a glimpse of the rules associated with the filter above called nic-69262-eth0-vmware-sfw.2 using the vsipioctl command:

[root@esx0:~] vsipioctl getrules -f nic-69262-eth0-vmware-sfw.2
ruleset domain-c205 {
 # Filter rules
 rule 1006 at 1 inout protocol tcp from ip 172.16.0.0/12 to any port 80 drop with log;
 rule 1006 at 2 inout protocol tcp from ip 172.16.0.0/12 to any port 443 drop with log;
 rule 1007 at 3 inout protocol tcp from ip 172.16.0.0/12 to any port 22 drop;
 rule 1008 at 4 inout protocol tcp from ip 172.17.0.0/16 to ip 172.16.0.0/16 port 3389 drop;
 rule 1003 at 5 inout protocol ipv6-icmp icmptype 135 from any to any accept;
 rule 1003 at 6 inout protocol ipv6-icmp icmptype 136 from any to any accept;
 rule 1002 at 7 inout protocol udp from any to any port 68 accept;
 rule 1002 at 8 inout protocol udp from any to any port 67 accept;
 rule 1001 at 9 inout protocol any from any to any accept with log;
}

ruleset domain-c205_L2 {
 # Filter rules
 rule 1004 at 1 inout ethertype any stateless from any to any accept;
}

And there you have it: the entire 172.16.0.0/12 RFC 1918 range is being blocked for both HTTP and HTTPS. Rule 1006 is the culprit behind the web client outage, and rule 1007 blocks SSH from the same range for good measure. The fact that these rules made it to the management cluster implies that the ‘applied to’ field of the DFW was not utilized to limit them to the compute cluster.
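
For comparison, a rule that had been scoped using ‘applied to’ would carry an appliedToList element in the DFW API payload pointing at the compute cluster instead of applying everywhere. Roughly like the fragment below (the compute cluster name and domain ID are made up for illustration):

<appliedToList>
    <appliedTo>
        <name>Compute Cluster A</name>
        <value>domain-c26</value>
        <type>ClusterComputeResource</type>
    </appliedTo>
</appliedToList>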

So in summary, we know the following:

  • All HTTP/HTTPS (TCP 80 and 443) traffic sourced from the 172.16.0.0/12 range will be blocked.
  • All SSH (TCP 22) traffic sourced from the 172.16.0.0/12 range will be blocked.
  • RDP port 3389 traffic is blocked from the 172.17.0.0/16 subnet to 172.16.0.0/16.
  • All other traffic sources, destinations and protocols appear to be allowed as the default rule is any/any accept.
  • The vCenter Server was definitely not added to the NSX exclusion list as we can see the slot-2 dvfilter applied.
  • Other VMs in the management cluster may be similarly impacted. At this time we don’t know what other services are broken as a result of these rules (a quick way to check is sketched below).
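
To get a rough idea of the blast radius, the dvfilter check can be repeated on each host in the management cluster to see which VMs have a vmware-sfw filter attached. Something like this (the host name is just an example) gives a quick list:

[root@esx1:~] summarize-dvfilter | grep -E 'vmm0:|vmware-sfw'

Any VM that shows up with a vmware-sfw filter on its vNICs is subject to the same rule set we saw above.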

What’s Next

Obviously, we’ve got to find a way to undo what was changed here. If you look around in the VMware knowledge base or the NSX documentation, you can probably find a sledgehammer-type solution without too much difficulty. That said, I can think of at least three different ways we can go about fixing this one. The question is: which one will be the quickest, the easiest, and leave the least amount of collateral damage in its wake?

How would you handle this scenario? Let me know! The solution for scenario 6 is now available. Please feel free to leave a comment below or via Twitter (@vswitchzero).

Solution to Scenario 6    >>
