Welcome to the twelfth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.
Please see the first half for more detail on the problem symptoms and some scoping.
As you’ll recall from the first half, our fictional customer was getting some unexpected behavior from a couple of firewall rules. Despite the rules being properly constructed, one VM called linux-a3 continued to be accessible via SSH.
We confirmed that the IP addresses for the machines in the security group were translated correctly by NSX and that the ruleset didn’t appear to be the problem. Let’s recap what we know:
VM linux-a2 seems to be working correctly and SSH traffic is blocked.
VM linux-a3 doesn’t seem to respect rule 1007 for some reason and remains accessible via SSH from everywhere.
Host esx-a3 where linux-a3 resides doesn’t appear to log any activity for rule 1007 or 1008 even though those rules are configured to log.
The two VMs are on different ESXi hosts (esx-a1 and esx-a3).
VMs linux-a2 and linux-a3 are in different dvPortgroups.
Given these statements, there are several things I’d want to check:
How can the two VMs have proper IP connectivity in VXLAN and VLAN portgroups as observed?
Is the DFW working at all on host esx-a3?
Did the last rule publication make it to host esx-a3 and does it match what we see in the UI?
Is the DFW (slot-2) dvfilter applied to linux-a3 correctly?
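The third check, whether the published ruleset matches what the host actually holds, boils down to a set comparison. Here is a minimal Python sketch, assuming the rule IDs have already been collected from the UI and from the host by whatever means. The function name and the sample data are hypothetical (1001 merely stands in for a default rule) and do not represent the actual findings in this scenario:

```python
def compare_rulesets(published, on_host):
    """Return rule IDs missing from the host and unexpected extras on it."""
    published_set, host_set = set(published), set(on_host)
    return {
        "missing_on_host": sorted(published_set - host_set),
        "unexpected_on_host": sorted(host_set - published_set),
    }


# Hypothetical sample data, for illustration only.
ui_rules = [1007, 1008, 1001]   # rule IDs as shown in the UI
host_rules = [1001]             # rule IDs as collected from the host

result = compare_rulesets(ui_rules, host_rules)
print(result)
```

An empty result on both sides would suggest the publication reached the host, and attention can move on to whether the dvfilter is applied to the vNIC.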
Welcome to the twelfth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
For this scenario today, I’ve created some supplementary video content to go along with this post:
As always, we’ll start with a brief problem statement:
“I am just getting started with the NSX distributed firewall and see that the rules are not behaving as they should be. I have two VMs, linux-a2 and linux-a3 that should allow SSH from only one specific jump box. The linux-a3 VM can be accessed via SSH from anywhere! Why is this happening?”
To get started with this scenario, we’ll most certainly need to look at how the DFW rules are constructed to get the desired behavior. The immense flexibility of the distributed firewall allows for dozens of different ways to achieve what is described.
Here are the two VMs in question:
There are a couple of interesting observations above. The first is that both VMs have a security tag applied called ‘Linux-A VMs’. The other is a bit more of an oddity – one VM is in a distributed switch VLAN backed portgroup called dvpg-a-vlan15, and the other is in a VXLAN backed logical switch. Despite this, both VMs are in the same 172.16.15.0/24 subnet.
As we saw in the first half of scenario 6, a fictional administrator enabled the DFW in their management cluster, which caused some unexpected filtering to occur. Their vCenter Server was no longer allowed the necessary HTTPS port 443 traffic needed for the vSphere Web Client to work.
Since we can no longer manage the environment or the DFW using the UI, we’ll need to revert this change using some other method.
As mentioned previously, we are fortunate in that NSX Manager is always excluded from DFW filtering by default. This is done to protect against this very type of situation. Because the NSX management plane is still fully functional, we should – in theory – still be able to relay API based DFW calls to NSX Manager. NSX Manager will in turn be able to publish these changes to the necessary ESXi hosts.
There are two relatively easy ways to fix this that come to mind:
Use the approach outlined in KB 2079620. This is the equivalent of doing a factory reset of the DFW ruleset via API. This will wipe out all rules and they’ll need to be recovered or recreated.
Use an API call to disable the DFW in the management cluster. This will essentially revert the exact change the user made in the UI that started this problem.
There are other options, but the two above will work to restore HTTP/HTTPS connectivity to vCenter. Once that is done, some remediation will be necessary to ensure this doesn’t happen again. Rather than picking a specific solution, I’ll go through both of them.
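Both options come down to a single authenticated HTTPS request to NSX Manager. The Python sketch below only builds the two requests and does not send them. The endpoint paths are written from memory and should be treated as assumptions to confirm against KB 2079620 and the NSX API guide. The manager hostname, credentials, and cluster ID are placeholders:

```python
import base64
import urllib.request


def build_request(manager, path, method, user, password):
    """Build (but do not send) a basic-auth request to NSX Manager."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(f"https://{manager}{path}", method=method)
    req.add_header("Authorization", f"Basic {token}")
    return req


# ASSUMED endpoint paths: verify before use.
RESET_PATH = "/api/4.0/firewall/globalroot-0/config"          # option 1: reset ruleset
DISABLE_PATH = "/api/4.0/firewall/{cluster_id}/enable/false"  # option 2: disable DFW on a cluster

# Placeholder hostname, credentials, and cluster ID.
reset = build_request("nsxmanager.lab.local", RESET_PATH, "DELETE", "admin", "changeme")
disable = build_request(
    "nsxmanager.lab.local",
    DISABLE_PATH.format(cluster_id="domain-c1"),
    "PUT",
    "admin",
    "changeme",
)

print(reset.get_method(), reset.full_url)
print(disable.get_method(), disable.full_url)
```

Sending either request would be a matter of passing it to urllib.request.urlopen, with certificate handling appropriate to the environment. Keep in mind that the first option wipes all rules, as noted above.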
Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.
Please see the first half for more detail on the problem symptoms and some scoping.
There were a few good suggestions from readers. Here are a couple from Twitter:
Can you run the below 2 commands to see if the firewall rules are deployed to the host?
show dfw host host-id summarize-dvfilter
show dfw host hostID filter filterID rules
If you migrate the Ubuntu VM to esx-a1 do the rules apply properly on the VM?
Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.
Hi mike ! It is possible to get the status of VMware tools on Ubuntu VM ? 😉
The suggestion about VMware tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.
In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.
Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:
VM win-a1 and lubuntu-2 are working fine. I.e. they can’t browse the web.
VM lubuntu-1 is the only one not working. I.e. it can still browse the web.
The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
All three VMs have the correct security tag applied.
All three VMs are indeed showing up correctly in the security group due to the tag.
The two working VMs are on host esx-a1 and the broken VM is on host esx-a2.
To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.
Welcome to the fifth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 5
As always, we’ll start with a brief customer problem statement:
“We’ve just deployed NSX and are doing some testing with the distributed firewall. We created a security tag that we can apply to VMs to prevent them from browsing the web. We applied this tag on three virtual machines. It seems to work on two of them, but the third can always browse the web! Something is not working here”
After speaking to the customer, we were able to collect a bit more information about the VMs and traffic flows in question. Below are the VMs that should not be able to browse:
win-a1.lab.local – 172.17.1.30 (static)
lubuntu-1.lab.local – 172.17.1.101 (DHCP)
lubuntu-2.lab.local – 172.17.1.104 (DHCP)
Only the VM called lubuntu-1 is still able to browse. The others are fine. The customer has been using an internal web server called web-a1.lab.local for testing. That machine is in the same cluster and has an IP address of 172.17.1.11. It serves up a web page on port 80. All of the VMs in question are sitting in the same logical switch and the customer reports that all east-west and north-south routing is functioning normally.
To begin, let’s have a look at the DFW rules defined.
As you can see, they really did just start testing as there is only one new section and a single non-default rule. The rule is quite simple. Any HTTP/HTTPS traffic coming from VMs in the ‘No Browser’ security group should be blocked. We can also see that both this rule and the default were set to log as part of the troubleshooting activities.
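Since the DFW evaluates rules top-down and stops at the first match, the behavior described can be modelled in a few lines. This is a simplified Python sketch, not how NSX implements it. The rule IDs 1005 and 1001 and the VM addresses come from this scenario, but the Rule class, the first_match function, and the assumption that one address is absent from the enforced source set are hypothetical and purely illustrative, not a diagnosis:

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Rule:
    rule_id: int
    sources: Optional[Set[str]]  # None means "any source"
    ports: Optional[Set[int]]    # None means "any destination port"
    action: str


def first_match(rules, src_ip, dst_port):
    """Walk the rules top-down and return the first one that matches."""
    for rule in rules:
        if rule.sources is not None and src_ip not in rule.sources:
            continue
        if rule.ports is not None and dst_port not in rule.ports:
            continue
        return rule
    raise LookupError("no rule matched")


# HYPOTHETICAL: suppose the enforced source set lacks 172.17.1.101.
no_browser = {"172.17.1.30", "172.17.1.104"}
rules = [
    Rule(1005, no_browser, {80, 443}, "block"),  # 'No Browser' HTTP/HTTPS rule
    Rule(1001, None, None, "allow"),             # default rule
]

for ip in ("172.17.1.30", "172.17.1.101", "172.17.1.104"):
    hit = first_match(rules, ip, 80)
    print(ip, "->", hit.rule_id, hit.action)
```

The point of the model is that a source which fails to match the block rule for any reason silently falls through to the default rule, which is consistent with a VM hitting rule 1001 instead of 1005.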