Welcome to the fifth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 5
As always, we’ll start with a brief customer problem statement:
“We’ve just deployed NSX and are doing some testing with the distributed firewall. We created a security tag that we can apply to VMs to prevent them from browsing the web. We applied this tag on three virtual machines. It seems to work on two of them, but the third can always browse the web! Something is not working here.”
After speaking to the customer, we were able to collect a bit more information about the VMs and traffic flows in question. Below are the VMs that should not be able to browse:
- win-a1.lab.local – 172.17.1.30 (static)
- lubuntu-1.lab.local – 172.17.1.101 (DHCP)
- lubuntu-2.lab.local – 172.17.1.104 (DHCP)
Only the VM called lubuntu-1 is still able to browse. The others are fine. The customer has been using an internal web server called web-a1.lab.local for testing. That machine is in the same cluster and has an IP address of 172.17.1.11. It serves up a web page on port 80. All of the VMs in question are sitting in the same logical switch and the customer reports that all east-west and north-south routing is functioning normally.
To begin, let’s have a look at the DFW rules defined.
As you can see, they really did just start testing, as there is only one new section and a single non-default rule. The rule is quite simple: any HTTP/HTTPS traffic coming from VMs in the ‘No Browsing’ security group should be blocked. We can also see that both this rule and the default rule were set to log as part of the troubleshooting activities.
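To make the intent of this rule table concrete, here is a minimal Python sketch of top-down, first-match rule evaluation. It is a simplified model and not how the DFW is implemented: the `Rule` class and `evaluate` function are hypothetical names of my own, group membership is modeled as a set of VM names purely for readability, and the rule IDs 1005 and 1001 are taken from the log output shown later in this post.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified stand-in for one firewall rule."""
    rule_id: int
    action: str                             # "DROP" or "PASS", as in the logs
    source_group: Optional[FrozenSet[str]]  # None means "any source"
    dst_ports: Optional[FrozenSet[int]]     # None means "any service"


def evaluate(rules: List[Rule], src_vm: str, dst_port: int) -> Rule:
    """Return the first rule, from the top down, that matches the flow."""
    for rule in rules:
        if rule.source_group is not None and src_vm not in rule.source_group:
            continue
        if rule.dst_ports is not None and dst_port not in rule.dst_ports:
            continue
        return rule
    raise ValueError("no rule matched; the table should end with a default rule")


# Assumption: the group is modeled by VM name only to keep the sketch short.
NO_BROWSING = frozenset({"win-a1", "lubuntu-1", "lubuntu-2"})

RULES = [
    Rule(1005, "DROP", NO_BROWSING, frozenset({80, 443})),  # block HTTP/HTTPS
    Rule(1001, "PASS", None, None),                         # default allow
]

if __name__ == "__main__":
    for vm, port in [("win-a1", 80), ("win-a1", 22), ("web-a1", 80)]:
        hit = evaluate(RULES, vm, port)
        print(f"{vm} -> port {port}: {hit.action} by rule {hit.rule_id}")
```

In this model, a flow only reaches the default rule when it fails to match the source group or the service of the rule above it, which is a useful way to read the rule IDs that show up in the packet logs.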
Another point worth noting is that the ‘Applied To’ field was used to apply this rule only to VMs in the compute-a cluster.
Clicking on the Security Group called ‘No Browsing’, we see the three VMs the customer mentioned: one Windows VM and two Linux VMs. Again, only lubuntu-1 is the problem.
Looking at the Hosts and Clusters view, we can see that all three VMs are on hosts in the compute-a cluster. One host in the cluster appears to have been powered off by DPM.
Looking at the individual VMs, we can see that they all have the security tag applied, and are correctly put into the ‘No Browsing’ security group as a result.
One interesting point to note is that the two working VMs are registered on host esx-a1.lab.local. The one that is not working is currently registered on host esx-a2.lab.local. According to the customer, host esx-a2 was recently rebuilt due to a local storage failure.
It looks like the DFW is enabled on the compute-a cluster. All hosts are showing as enabled with a green installation status. Host esx-a3 is showing with a red alarm, but because it’s in standby mode, this isn’t a concern.
Let’s take a quick look at the Security Group in Service Composer:
It appears to be quite simple – dynamic membership is used to put any VM with the security tag ‘Tag_NoBrowser’ into the group. No other objects are explicitly included or excluded.
Since the customer has this rule set to ‘log’, we can check the dfwpktlogs.log file on the ESXi hosts to see what’s happening. If the customer had vRealize Log Insight, this would be a much easier task, but we can still look at these files manually via SSH.
We’ll start with the working VMs on host esx-a1. I will filter the log file to look for flows on port 80 headed to the web server web-a1 (172.17.1.11):
[root@esx-a1:/var/log] cat dfwpktlogs.log |grep -i 172.17.1.11/80 |grep S
2018-02-22T23:34:18.172Z 27958 INET match DROP domain-c121/1005 OUT 52 TCP 172.17.1.30/50999->172.17.1.11/80 S
2018-02-22T23:34:21.177Z 27958 INET match DROP domain-c121/1005 OUT 52 TCP 172.17.1.30/50998->172.17.1.11/80 S
2018-02-22T23:58:39.582Z 22535 INET match DROP domain-c121/1005 OUT 60 TCP 172.17.1.104/37720->172.17.1.11/80 S
2018-02-22T23:58:39.834Z 22535 INET match DROP domain-c121/1005 OUT 60 TCP 172.17.1.104/37722->172.17.1.11/80 S
<snip>
In the log, there were numerous examples of TCP SYN segments matched and dropped from 172.17.1.30 (win-a1.lab.local) and 172.17.1.104 (lubuntu-2.lab.local). These flows hit rule 1005, so we know the rule is working on this host. This explains why those two guests can’t browse to a web page, which is the expected behavior.
On the esx-a2.lab.local host, we’ll do a similar filter in the dfwpktlogs.log:
[root@esx-a2:/var/log] cat dfwpktlogs.log |grep -i 172.17.1.11/80 |grep S
2018-02-22T23:00:53.723Z 39916 INET match PASS domain-c121/1001 OUT 60 TCP 172.17.1.101/38526->172.17.1.11/80 S
2018-02-22T23:00:53.723Z 39916 INET match PASS domain-c121/1001 OUT 60 TCP 172.17.1.101/38528->172.17.1.11/80 S
2018-02-22T23:00:53.723Z 39916 INET match PASS domain-c121/1001 OUT 60 TCP 172.17.1.101/38530->172.17.1.11/80 S
2018-02-22T23:00:54.031Z 39916 INET match PASS domain-c121/1001 OUT 60 TCP 172.17.1.101/38538->172.17.1.11/80 S
Interestingly, these flows still hit the dfwpktlogs.log file, so we know that they are being inspected by the distributed firewall. What’s off, however, is that the flows are being passed by rule 1001. This is the default catch-all ‘allow’ rule at the bottom of the DFW.
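If you copy the log files off the hosts, the same comparison can be done offline. Below is a rough Python sketch that parses lines in the format shown above and counts SYN flows to the web server by source IP, action and rule ID. The regular expression and its field names are my own reading of the sample lines rather than an official format definition, and `parse_line` and `summarize` are hypothetical helper names.

```python
import re
from collections import Counter

# Field names are assumptions inferred from the sample lines in this post.
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<id_field>\d+)\s+INET\s+match\s+"
    r"(?P<action>\w+)\s+(?P<ruleset>[^/\s]+)/(?P<rule_id>\d+)\s+"
    r"(?P<direction>IN|OUT)\s+(?P<length>\d+)\s+(?P<proto>\w+)\s+"
    r"(?P<src_ip>[\d.]+)/(?P<src_port>\d+)->"
    r"(?P<dst_ip>[\d.]+)/(?P<dst_port>\d+)"
    r"(?:\s+(?P<flags>\S+))?\s*$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines, dst_ip, dst_port):
    """Count SYN-only flows to dst_ip:dst_port per (source IP, action, rule ID)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if rec["dst_ip"] != dst_ip or rec["dst_port"] != str(dst_port):
            continue
        if rec["flags"] != "S":
            continue
        counts[(rec["src_ip"], rec["action"], rec["rule_id"])] += 1
    return counts


# A few lines copied from the esx-a1 and esx-a2 output above.
SAMPLE = [
    "2018-02-22T23:34:18.172Z 27958 INET match DROP domain-c121/1005 OUT 52 TCP 172.17.1.30/50999->172.17.1.11/80 S",
    "2018-02-22T23:34:21.177Z 27958 INET match DROP domain-c121/1005 OUT 52 TCP 172.17.1.30/50998->172.17.1.11/80 S",
    "2018-02-22T23:58:39.582Z 22535 INET match DROP domain-c121/1005 OUT 60 TCP 172.17.1.104/37720->172.17.1.11/80 S",
    "2018-02-22T23:00:53.723Z 39916 INET match PASS domain-c121/1001 OUT 60 TCP 172.17.1.101/38526->172.17.1.11/80 S",
    "2018-02-22T23:00:53.723Z 39916 INET match PASS domain-c121/1001 OUT 60 TCP 172.17.1.101/38528->172.17.1.11/80 S",
]

if __name__ == "__main__":
    for (src, action, rule), n in sorted(summarize(SAMPLE, "172.17.1.11", 80).items()):
        print(f"{src}: {n} SYN(s) {action} by rule {rule}")
```

One reason to match on parsed fields rather than plain text: `grep S` matches an uppercase S anywhere on the line, including the S in PASS, and the unescaped `172.17.1.11/80` pattern would also match a destination port such as 8080. The sketch compares the destination and the flags field exactly.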
If you are interested, have a look through the information provided above and let me know what you would check or what you think the problem may be! I want to hear your suggestions!
** As of 2/26/2018 the solution to scenario 5 has been posted. You can find it here.
What other information would you need to see? What tests would you run? What do you know is NOT the problem based on the information and observations here?
I will update this post with a link to the solution as soon as it’s completed. Please feel free to leave a comment below or via Twitter (@vswitchzero).