NSX Troubleshooting Scenario 12

Welcome to the twelfth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

For this scenario, I’ve created some supplementary video content to go along with this post:

The Scenario

As always, we’ll start with a brief problem statement:

“I am just getting started with the NSX distributed firewall and see that the rules are not behaving as they should be. I have two VMs, linux-a2 and linux-a3 that should allow SSH from only one specific jump box. The linux-a3 VM can be accessed via SSH from anywhere! Why is this happening?”

To get started with this scenario, we’ll most certainly need to look at how the DFW rules are constructed to get the desired behavior. The immense flexibility of the distributed firewall allows for dozens of different ways to achieve what is described.

Here are the two VMs in question:

tshoot12a-4
The VM linux-a2 is currently on host esx-a1 with IP address 172.16.15.10. It’s sitting on a logical switch.

And linux-a3:

tshoot12a-5
The VM linux-a3 is currently on host esx-a3 with IP address 172.16.15.11. It’s sitting on a VLAN backed dvPortgroup.

There are a couple of interesting observations above. The first is that both VMs have a security tag applied called ‘Linux-A VMs’. The other is a bit more of an oddity – one VM is in a distributed switch VLAN backed portgroup called dvpg-a-vlan15, and the other is in a VXLAN backed logical switch. Despite this, both VMs are in the same 172.16.15.0/24 subnet.
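As a quick sanity check on the addressing, here is a minimal Python sketch using the standard `ipaddress` module. It only restates the addresses listed in this post. The dictionary and variable names are mine, for illustration.

```python
import ipaddress

# Subnet and VM addresses as listed in this post.
subnet = ipaddress.ip_network("172.16.15.0/24")
vms = {
    "linux-a2": "172.16.15.10",
    "linux-a3": "172.16.15.11",
    "2k12-m1": "172.16.1.151",
    "cc": "172.16.1.100",
}

# True when the VM's address falls inside 172.16.15.0/24.
in_subnet = {name: ipaddress.ip_address(ip) in subnet for name, ip in vms.items()}

for name, member in in_subnet.items():
    print(f"{name}: {'inside' if member else 'outside'} {subnet}")
```

Both Linux VMs land inside the subnet while the two jump boxes do not, so SSH from either jump box has to be routed to reach them.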

Below is some basic information from the environment, including IP addresses and locations of pertinent VMs:

linux-a2
IP address: 172.16.15.10/24
Host: esx-a1
Cluster: compute-a

linux-a3
IP address: 172.16.15.11/24
Host: esx-a3
Cluster: compute-a

2k12-m1 (jump box, should be allowed)
IP address: 172.16.1.151/24
Host: esx-m1
Cluster: management

cc (jump box, should NOT be allowed)
IP address: 172.16.1.100/24
Host: esx-m1
Cluster: management

To reproduce this issue, we’ve been using a different jump box VM called ‘cc’ that should not be allowed to connect to either linux-a2 or linux-a3 via SSH. When we try to connect, we can see that we’re not able to connect to linux-a2, but we are always able to connect to linux-a3 successfully:

tshoot12a-1
The linux-a3 VM always allows SSH for some reason.

Looking at the NSX ruleset, we can see two rules at the very top that should be giving us the desired outcome:

tshoot12a-2
At first glance, rules 1007 and 1008 appear to be correct and should give the desired outcome.

An allow rule specifically calls out the VM object 2k12-m1 as the source, and a security group called ‘Linux-A VMs’ as the destination. Below that, a source of ‘Any’ is rejected for SSH traffic to the same security group. At first glance, these two rules look like they should do the trick. Here is what we know:

  1. The VM 2k12-m1 should be allowed to access SSH on the VMs contained within the security group ‘Linux-A VMs’.
  2. All other source VMs and IPs should be rejected when trying to access SSH on the VMs contained within the security group ‘Linux-A VMs’.

Now naturally, the question becomes – what VMs are contained in the security group ‘Linux-A VMs’? Thankfully NSX makes this easy to figure out via the UI. Regardless of how this security group was constructed, we can see that ultimately the two VMs made it inside.

tshoot12a-3
The two VMs made it into the security group, probably via the security tags attached to them.

But can NSX translate these VM names into usable IP addresses? Remember – all the inventory objects that can be added to DFW rules must ultimately be converted to IP addresses for processing. Clicking IP addresses in the ‘Effective Members’ dialog should tell us.

tshoot12a-6
It appears that the expected IP addresses were obtained from the VMs, most likely via VMware Tools.

And sure enough, both IPs have been translated successfully.
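With the group translated to IP addresses, the outcome we expect on paper can be sketched as a simple top-down, first-match evaluation. This is a deliberately simplified model and not how the DFW is implemented. The `Rule` class and `evaluate` function are hypothetical names, and real rules carry more fields than shown here. I'm also assuming the allow rule is 1007, since the logs show the reject rule as 1008.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    """Hypothetical, stripped-down stand-in for a DFW rule."""
    rule_id: int
    sources: Optional[frozenset]  # None models a source of 'Any'
    destinations: frozenset
    port: int
    action: str

# Effective members of the 'Linux-A VMs' security group, per the UI.
LINUX_A_VMS = frozenset({"172.16.15.10", "172.16.15.11"})

RULES = [
    # Allow SSH from 2k12-m1 (172.16.1.151). Rule ID 1007 is an assumption.
    Rule(1007, frozenset({"172.16.1.151"}), LINUX_A_VMS, 22, "ALLOW"),
    # Reject SSH from any other source.
    Rule(1008, None, LINUX_A_VMS, 22, "REJECT"),
]

def evaluate(src, dst, port, rules=RULES):
    """Return (rule_id, action) for the first rule that matches, else None."""
    for rule in rules:
        src_ok = rule.sources is None or src in rule.sources
        if src_ok and dst in rule.destinations and port == rule.port:
            return rule.rule_id, rule.action
    return None

# What the ruleset should do for SSH from 'cc' (172.16.1.100):
for dst in sorted(LINUX_A_VMS):
    print(dst, evaluate("172.16.1.100", dst, 22))
```

In this model, SSH from cc is rejected by rule 1008 for both destinations, which is exactly what is not happening for 172.16.15.11.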

So why is 172.16.15.11 behaving differently? We can see that logging is enabled for the two rules. Let’s have a quick look at the dfwpktlogs.log file on hosts esx-a1 and esx-a3 for some clues.

[root@esx-a1:/var/log] cat dfwpktlogs.log |grep -i 172.16.15.
2018-11-25T16:47:21.175Z 60860 INET match REJECT domain-c41/1008 IN 52 TCP 172.16.1.100/50284->172.16.15.10/22 S
2018-11-25T16:47:21.690Z 60860 INET match REJECT domain-c41/1008 IN 52 TCP 172.16.1.100/50284->172.16.15.10/22 S
2018-11-25T16:47:22.205Z 60860 INET match REJECT domain-c41/1008 IN 52 TCP 172.16.1.100/50284->172.16.15.10/22 S

On host esx-a1 where linux-a2 resides, we can see exactly what we expect. Traffic sourced from 172.16.1.100 (cc) to 172.16.15.10 is listed as matched and rejected by rule 1008. This matches the behavior from the SSH client and the ‘connection refused’ response we’ve been getting.
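If you need to sift through a lot of these entries, a small parser can help. The sketch below is based only on the three sample lines above. The field names are my own labels inferred from those samples, not an official format description, and the second field (60860) is deliberately skipped because I don't want to guess at it. Lines with a different layout simply return `None`.

```python
import re

# Pattern inferred from the sample lines in this post. Other packet types
# may be logged with a different layout and will not match.
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+\S+\s+INET\s+match\s+(?P<action>\S+)\s+"
    r"(?P<ruleset>[^/\s]+)/(?P<rule_id>\d+)\s+(?P<direction>IN|OUT)\s+"
    r"(?P<length>\d+)\s+(?P<protocol>\S+)\s+"
    r"(?P<src_ip>[\d.]+)/(?P<src_port>\d+)->(?P<dst_ip>[\d.]+)/(?P<dst_port>\d+)"
    r"(?:\s+(?P<flags>\S+))?\s*$"
)

def parse_line(line):
    """Return a dict of fields for a matching log line, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("rule_id", "length", "src_port", "dst_port"):
        fields[key] = int(fields[key])
    return fields

sample = (
    "2018-11-25T16:47:21.175Z 60860 INET match REJECT domain-c41/1008 "
    "IN 52 TCP 172.16.1.100/50284->172.16.15.10/22 S"
)
entry = parse_line(sample)
print(entry["action"], entry["rule_id"], entry["src_ip"], "->", entry["dst_ip"], entry["dst_port"])
```

From here it is easy to filter on a rule ID or destination IP instead of relying on a loose grep pattern.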

What about host esx-a3?

[root@esx-a3:~] cat /var/log/dfwpktlogs.log |grep -i 172.16.15.
[root@esx-a3:~]

Very odd – nothing at all.

What’s Next

I’ll post the solution in the next day or two, but what would you check next? Why would there be no logging on host esx-a3 for rules that are clearly configured to log? What might this tell us?

How would you handle this scenario? Let me know! Please feel free to leave a comment below or via Twitter (@vswitchzero).
