Understanding NSX DFW Generation Numbers

A useful tool for troubleshooting DFW publication failures.

If you’ve ever been on a support call for DFW publication or rule troubleshooting, you may have heard reference to a ‘firewall generation number’ at one time or another. Whenever a change is made to the firewall rules, the NSX management plane (NSX Manager) will push these changes to all ESXi hosts, where the rules will be enforced. Because of the distributed nature of this firewalling system, it’s very important that all ESXi hosts have the latest version of the ruleset.

The NSX UI does a good job of reporting on host publication failures, but it’s not always clear exactly which version of the rules a problematic host is enforcing.

This is where firewall generation numbers can come in handy. The ‘generation number’ represents the point in time a publish operation occurs. Although it may look like a seemingly random thirteen-digit number, it’s actually a Unix epoch timestamp (in milliseconds) that can be converted to an actual date/time. For example, an epoch timestamp of 1548677100000 equates to Monday, January 28th, 2019 at 12:05:00 UTC. There are several online tools available to help you convert these values, including this one.
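If you would rather not paste values into a website, the conversion is easy to script. Here is a minimal Python sketch; the function name generation_to_utc is my own, and it simply treats the generation number as milliseconds since the Unix epoch, as described above.

```python
from datetime import datetime, timedelta, timezone


def generation_to_utc(generation: int) -> datetime:
    """Convert a generation number (epoch milliseconds) to a UTC datetime."""
    # Split into whole seconds and leftover milliseconds so that no
    # floating-point rounding creeps into the result.
    seconds, millis = divmod(int(generation), 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


print(generation_to_utc(1548677100000).isoformat())  # 2019-01-28T12:05:00+00:00
```

Keeping the result in UTC makes it easy to line up with the GMT timestamps in the NSX Manager logs.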

An Example

Let’s have a look at the current generation number reported on a pair of ESXi hosts. One host, esx-a2, has been reporting publication failures.

To determine the generation number, you could in theory take the last reported publication date from the UI and convert it into a Unix epoch number. In my experience, the date shown in the UI isn’t precise enough and you may not get an exact match. The better way to do it is to look for “Sending rules to Cluster” log messages in the NSX Manager vsm.log file. This can be done via an SSH session, or more easily using a filter in vRealize Log Insight.

[root@nsxmanager /home/secureall/secureall/logs]# cat vsm.log |grep "Sending rules to Cluster"
<snip>
2018-11-29 01:47:55.317 GMT+00:00 INFO TaskFrameworkExecutor-9 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: null Object Generation Number 1543456074899.
2018-11-29 01:47:57.422 GMT+00:00 INFO TaskFrameworkExecutor-16 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: 1543337228980 Object Generation Number 1543456074899.

If you have multiple NSX-prepared clusters, you’ll need to know which one you are looking at. For more information on looking up your cluster moref identifier, see this post.

The above command returned several instances of rule publication, but the most recent was generation number 1543456074899. This epoch number translates to Thursday, November 29, 2018 1:47:54.899 AM UTC. Notice that the second message also reports the previous generation number, 1543337228980.
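If you have a long vsm.log to work through, the same information can be pulled out with a short script. The sketch below is only an illustration: the regular expression is based on the two log lines shown above, the field names "previous" and "current" reflect my reading of those messages, and other NSX versions may format the line differently.

```python
import re

# Pattern derived from the sample vsm.log lines in this post (an assumption,
# not an official log format definition).
PUBLISH_RE = re.compile(
    r"Sending rules to Cluster (?P<cluster>[^,\s]+), "
    r"Generation Number: (?P<previous>null|\d+) "
    r"Object Generation Number (?P<current>\d+)"
)


def parse_publish_events(lines):
    """Return one dict per 'Sending rules to Cluster' message, in log order."""
    events = []
    for line in lines:
        match = PUBLISH_RE.search(line)
        if match is None:
            continue  # unrelated log line
        previous = match.group("previous")
        events.append({
            "cluster": match.group("cluster"),
            "previous": None if previous == "null" else int(previous),
            "current": int(match.group("current")),
        })
    return events


sample_lines = [
    '2018-11-29 01:47:55.317 GMT+00:00 INFO TaskFrameworkExecutor-9 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: null Object Generation Number 1543456074899.',
    '2018-11-29 01:47:57.422 GMT+00:00 INFO TaskFrameworkExecutor-16 ConfigurationPublisher:110 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Sending rules to Cluster domain-c41, Generation Number: 1543337228980 Object Generation Number 1543456074899.',
]

events = parse_publish_events(sample_lines)
latest = events[-1]
print(latest["cluster"], latest["previous"], latest["current"])
```

Filtering the resulting list by cluster gives you a per-cluster publication history, which is handy when more than one cluster is prepared for NSX.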

If you don’t have access to the logging, an alternative method you can use to get the generation number is a simple API call. This won’t allow you to get a history of ‘Sending rules to Cluster’ messages like the logging can, but it will give you the current generation number.

GET https://nsxmanager.lab.local/api/4.0/firewall/globalroot-0/config

The complete DFW configuration will be returned, but we’re only interested in the generation number near the top of the output:

<?xml version="1.0" encoding="UTF-8"?>
<firewallConfiguration timestamp="1543456074899">
  <contextId>globalroot-0</contextId>
  <layer3Sections>
    <section id="1004" name="Test Section" generationNumber="1519339055838" timestamp="1519339055838" tcpStrict="false" stateless="false" useSid="false" type="LAYER3">
      <rule id="1005" disabled="false" logged="true">
        <name>Browse Tag Enforce</name>
        <action>deny</action>
        <appliedToList>
          <appliedTo>
            <name>compute-a</name>
            <value>domain-c41</value>
            <type>ClusterComputeResource</type>
            <isValid>true</isValid>
          </appliedTo>
<snip>
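
If you are scripting the API call, the value can be read straight from the root element of the response. This is a rough sketch using only the Python standard library; it assumes you have already fetched the response body as bytes with whatever HTTP client and credentials you normally use. The function name current_generation and the shortened sample document are mine, trimmed down from the output above.

```python
import xml.etree.ElementTree as ET


def current_generation(config_xml: bytes) -> int:
    """Read the timestamp attribute from the <firewallConfiguration> root element."""
    root = ET.fromstring(config_xml)
    if root.tag != "firewallConfiguration":
        raise ValueError(f"unexpected root element: {root.tag}")
    return int(root.attrib["timestamp"])


# Shortened, well-formed stand-in for the API response shown above.
sample_response = b"""<?xml version="1.0" encoding="UTF-8"?>
<firewallConfiguration timestamp="1543456074899">
  <contextId>globalroot-0</contextId>
  <layer3Sections/>
</firewallConfiguration>"""

print(current_generation(sample_response))  # 1543456074899
```

Note that each section carries its own generationNumber attribute as well; the sketch only reads the timestamp on the root element, which is the value discussed here.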

Now that we know the most recent generation number that represents the current state of the firewall ruleset, we can check individual ESXi hosts to see if they match.

Although it’s possible to get the generation number from the vsfwd.log file, a vsipioctl command can make easy work of it. Let’s check host esx-a1 first:

[root@esx-a1:~] vsipioctl loadruleset |head -9
Loading ruleset file: /etc/vmware/vsfwd/vsipfw_ruleset.dat
##################################################
#              ruleset message dump              #
##################################################
ActionType : replace
Id : domain-c41
Name : domain-c41
Generation : 1543456074899
Rule Count : 8

On this host, we have eight rules, and the generation number matches. This tells us that the state of the ruleset at the management plane matches perfectly on this host. Now let’s check esx-a2, which has been having publication issues:

[root@esx-a2:~] vsipioctl loadruleset |head -9
Loading ruleset file: /etc/vmware/vsfwd/vsipfw_ruleset.dat
##################################################
#              ruleset message dump              #
##################################################
ActionType : replace
Id : domain-c41
Name : domain-c41
Generation : 1543337228980
Rule Count : 7

Converting this epoch timestamp to a human-readable date/time gives us Tuesday, November 27, 2018 4:47:08.980 PM UTC. As we saw earlier in the vsm.log file, this was the previous generation number before the last ruleset change. You’ll also notice the total rule count is 7 as opposed to 8 on esx-a1. Clearly, esx-a2 isn’t in sync with the management plane and we’ll need to figure out why it’s not getting the latest ruleset.
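With more than a couple of hosts, eyeballing this output gets tedious. The following sketch shows one way to automate the comparison, assuming you have already captured the text of "vsipioctl loadruleset" from each host (over SSH, for example). The helper names ruleset_summary and is_in_sync are hypothetical, and the parsing relies only on the "Generation" and "Rule Count" lines shown above.

```python
import re

# Matches the "Generation : N" and "Rule Count : N" lines of the ruleset dump.
FIELD_RE = re.compile(r"^(Generation|Rule Count)\s*:\s*(\d+)\s*$", re.MULTILINE)


def ruleset_summary(output: str) -> dict:
    """Extract the generation number and rule count from captured command output."""
    # If a field appears more than once, the last occurrence wins.
    return {name: int(value) for name, value in FIELD_RE.findall(output)}


def is_in_sync(host_output: str, manager_generation: int) -> bool:
    """True when the host reports the same generation as the management plane."""
    return ruleset_summary(host_output).get("Generation") == manager_generation


# Captured output from the two hosts in this post (header lines omitted).
esx_a1 = """ActionType : replace
Id : domain-c41
Name : domain-c41
Generation : 1543456074899
Rule Count : 8
"""
esx_a2 = esx_a1.replace("1543456074899", "1543337228980").replace(
    "Rule Count : 8", "Rule Count : 7"
)

manager_generation = 1543456074899
for name, output in (("esx-a1", esx_a1), ("esx-a2", esx_a2)):
    status = "in sync" if is_in_sync(output, manager_generation) else "OUT OF SYNC"
    print(name, ruleset_summary(output), status)
```

A mismatch only tells you that the host is enforcing an older ruleset; finding out why still means digging into vsfwd.log on that host.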

I hope this was helpful. Please feel free to leave a comment below if you have any questions or comments.
