NSX Troubleshooting Scenario 13 – Solution

Welcome to the thirteenth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall from the first half, our fictional customer was having issues re-deploying ESGs and DLRs that had been removed from the vCenter inventory. This was all part of a cleanup activity following a SAN outage.

Not a very informative error message. We’ll need to refer to the logging to find out more.

To begin, we’ll really need more information on exactly why NSX is failing to re-deploy these ESGs and DLRs. The message in the UI is not very informative. As mentioned, there are no failed tasks in the vCenter recent tasks pane when an attempt is made, so we’ll need to go digging into the logging to find out more.

Taking a look at the vsm.log file on the NSX Manager appliance, we can see a backtrace occur at the same time as the deployment attempt:

2018-12-06 23:43:17.681 GMT+00:00 ERROR TaskFrameworkExecutor-19 Worker:219 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] BaseException thrown while executing task instance taskinstance-100656 com.vmware.vshield.edge.exception.EdgeVmDeploymentFailedException: nested exception is com.vmware.vshield.vsm.inventory.vcoperations.OvfManagerInternalErrorException:
core-services:1100:OVF Manager internal error. For more details, refer to the rootCauseString or the VC logs:
Managed object id datastore-26 of type Datastore was not found in VC.

The key part of the message is the following:

“Managed object id datastore-26 of type Datastore was not found in VC.”
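When vsm.log is large, it can help to pull these references out programmatically. The following is a minimal sketch, assuming the log text has already been copied off the NSX Manager and that the message wording matches the line quoted above. The function name find_missing_objects is hypothetical and is not part of any NSX tooling.

```python
import re

# Matches the wording seen in the vsm.log excerpt above.
# Assumption: other occurrences of this error use the same phrasing.
MISSING_MOID = re.compile(
    r"Managed object id (\S+) of type (\w+) was not found in VC"
)


def find_missing_objects(log_text):
    """Return a sorted list of unique (moid, type) pairs reported as missing."""
    return sorted(set(MISSING_MOID.findall(log_text)))


sample = (
    "core-services:1100:OVF Manager internal error. For more details, "
    "refer to the rootCauseString or the VC logs:\n"
    "Managed object id datastore-26 of type Datastore was not found in VC.\n"
)

for moid, obj_type in find_missing_objects(sample):
    print(f"{obj_type} {moid} is referenced by NSX but missing in vCenter")
```

Each managed object ID printed this way is one that NSX still references but vCenter no longer knows about.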


NSX Troubleshooting Scenario 13

Welcome to the thirteenth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“After recovering from a storage outage, we’re unable to re-deploy any of our missing DLRs and ESGs. Help!”

With this type of a problem description, the first order of business is to find out EXACTLY what happened. After a lengthy discussion with the fictional customer, we were able to piece together the following sequence of events:

  1. The SAN suffered a catastrophic failure.
  2. All of the LUNs were continuously replicated to another SAN over the years, so these replicated LUNs were presented to the hosts in the compute-a cluster.
  3. After a rescan, the VMFS volumes were re-signatured and the datastores and all files were again accessible.
  4. All of the VMs on those datastores were manually added back to the vCenter Inventory except the DLRs and ESGs.
  5. All DLRs and ESGs were deleted from the datastore so that they could be freshly re-deployed.

In step 5 above, the customer correctly realized that any ESGs re-added to the inventory would no longer be valid because of mismatched UUIDs. Deleting these from disk and re-deploying was a good idea.

NSX is throwing many high and critical events because of the missing ESG and DLR appliances, as expected.


There are six appliances in total, including three DLRs and three ESGs.


ESG/DLR tmpfs partition fills in NSX 6.3.6 and 6.4.1

If you are running NSX 6.3.6 or 6.4.1, you should take a close look at VMware KB 57003. A newly discovered issue can result in the tmpfs partition of DLRs and ESGs filling up, rendering the appliances unmanageable.

On a positive note, there should be no datapath impact because of a full tmpfs partition. You just won’t be able to push any configuration changes to the ESG or DLR in this state.
This occurs because of a file related to HA in /run that will slowly grow until it fills the partition. The file in question is ‘ha.cid.Out’ and contains HA diagnostic information. You can find it in the /run/vmware/vshield/cmdOut directory.

If you have a very stable environment, it’s quite possible that you’ll never run into this problem. The ha.cid.Out file is created and updated only after an HA event occurs – like a failover or split-brain recovery for example. Once the file is created, however, it receives regular updates and will inevitably grow.

Based on the rate at which the file grows, a compact size ESG or DLR has about a month after an HA event before this becomes a problem. Larger sized ESGs have more memory, and hence larger tmpfs partitions. Below is an estimate based on the tmpfs partition size of each appliance size:

All DLRs (256MB tmpfs): 4 weeks
Compact ESG (256MB tmpfs): 4 weeks
Large ESG (497MB tmpfs): 8 weeks
Quad Large ESG (1024MB tmpfs): 4 months
X-Large ESG (3.9GB tmpfs): >1 year
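The estimates above are consistent with a simple linear model. The sketch below assumes a constant growth rate derived from the 256MB in 4 weeks figure, which works out to roughly 64MB per week. That rate is inferred from the list above and is not a number taken from the KB, so treat the output as a rough guide only.

```python
# Assumption: ha.cid.Out grows linearly, at the rate implied by a
# 256MB tmpfs partition filling in about 4 weeks.
BASELINE_MB = 256
BASELINE_WEEKS = 4
GROWTH_MB_PER_WEEK = BASELINE_MB / BASELINE_WEEKS  # 64 MB per week


def weeks_to_fill(tmpfs_mb):
    """Estimate the weeks until a tmpfs partition of the given size is full."""
    return tmpfs_mb / GROWTH_MB_PER_WEEK


appliances = {
    "DLR / Compact ESG": 256,
    "Large ESG": 497,
    "Quad Large ESG": 1024,
    "X-Large ESG": 3.9 * 1024,
}

for name, size_mb in appliances.items():
    print(f"{name}: about {weeks_to_fill(size_mb):.1f} weeks")
```

Under this assumption a Large ESG lands at just under 8 weeks and a Quad Large at 16 weeks, which lines up with the figures in the list.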

Unfortunately, it doesn’t appear that the ha.cid.Out file can be deleted or purged while the ESG/DLR is in operation. The file is locked for editing and the only safe way to recover is to reboot the appliance. Again, all of the features including routing and packet forwarding will continue to work just fine with a full tmpfs partition. You just won’t be able to make any changes.

Disabling ESG HA will prevent this from happening, but I’d argue that being protected by HA is more important than the potential for an ESG to become unmanageable.

You can monitor your ESG’s tmpfs partition using the show system storage CLI command:

esg-lb1.vswitchzero.net-0> show system storage
Filesystem      Size   Used   Avail   Use%   Mounted on
/dev/root       444M   366M   55M     88%    /
tmpfs           497M   80K    497M    1%     /run
/dev/sda2       43M    2.2M   38M     6%     /var/db
/dev/sda3       27M    413K   25M     2%     /var/dumpfiles
/dev/sda4       32M    1.1M   29M     4%     /var/log

If you see it slowly creeping up in size at a regular interval, it would be a good idea to start planning for a maintenance window to reboot the appliance.
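If you are saving this output periodically, a small script can pull out the /run usage so it can be tracked over time. This is a minimal sketch that only parses text in the format shown above. How the output is collected from the appliance is left out, and the function name run_usage_percent and the 80 percent threshold are both hypothetical choices for illustration.

```python
def run_usage_percent(output, mount="/run"):
    """Return the Use% value for the given mount point, or None if not found."""
    for line in output.splitlines():
        fields = line.split()
        # Data rows have six columns: filesystem, size, used, avail, use%, mount.
        if len(fields) >= 6 and fields[-1] == mount and fields[-2].endswith("%"):
            return int(fields[-2].rstrip("%"))
    return None


sample = """\
esg-lb1.vswitchzero.net-0> show system storage
Filesystem      Size   Used   Avail   Use%   Mounted on
/dev/root       444M 366M 55M 88% /
tmpfs           497M 80K 497M 1% /run
/dev/sda2        43M 2.2M 38M 6% /var/db
"""

WARN_THRESHOLD = 80  # hypothetical threshold, adjust to taste

usage = run_usage_percent(sample)
if usage is not None and usage >= WARN_THRESHOLD:
    print(f"/run is at {usage}%, plan a maintenance window")
else:
    print(f"/run is at {usage}%")
```

Comparing the value between runs shows whether the partition is growing steadily or sitting idle.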

I can’t comment on release dates, but it’s very likely that this will be fixed in the next release of 6.4.x, which should be out very soon. The 6.3.x fix for this may be further out, so a jump to 6.4.2 may be your best bet if this proves to be a serious problem for you.

I hope this is helpful.

Missing NSX vdrPort and Auto Deploy

If you are running Auto Deploy and noticed your VMs didn’t have connectivity after a host reboot or upgrade, you may have run into the problem described in VMware KB 52903. I’ve seen this a few times now with different customers and thought a PSA may be in order. You can find all the key details in the KB, but I thought I’d add some extra context here to help anyone who may want more information.

I recently helped to author VMware KB 52903, which has just been made public. Essentially, it describes a race condition causing a host to come up without its vdrPort connected to the distributed switch. The vdrPort is an important component on an ESXi host that funnels traffic to/from the NSX DLR module. If this port isn’t connected, traffic can’t make it to the DLR for east/west routing on that host. Technically, VMs in the same logical switches will be able to communicate across hosts, but none of the VMs on this impacted host will be able to route.

The Problem

The race condition occurs when the DVS registration of the host occurs too late in the boot process. Normally, the distributed switch should be initialized and registered long before the vdrPort gets connected. In some situations, however, DVS registration can be late. Obviously, if the host isn’t yet initialized/registered with the distributed switch, any attempt to connect something to it will fail. And this is exactly what happens.

Using the log lines from KB 52903 as an example, we can see that the host attempts to add the vdrPort to the distributed switch at 23:44:19:

2018-02-08T23:44:19.431Z error netcpa[3FFEDA29700] [Originator@6876 sub=Default] Failed to add vdr port on dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d, Not found

The operation fails because the DVS with the specified UUID is not found from the perspective of this host. It simply hasn’t been initialized yet. A few moments later, the DVS is finally ready for use on the host. Notice the time stamps: you can see the registration of the DVS about 9 seconds later:

2018-02-08T23:44:28.389Z info hostd[4F540B70] [Originator@6876 sub=Hostsvc.DvsTracker] Registered Dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d

The above message can be found in /var/log/hostd.log.
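To confirm the ordering on your own hosts, you can compare the timestamps of the two messages. The sketch below is a minimal example that uses the two timestamps quoted above and assumes the ISO-style format with millisecond precision and a trailing Z shown in those lines. The function name seconds_between is hypothetical.

```python
from datetime import datetime

# Timestamp format as it appears in the log lines quoted above.
LOG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def seconds_between(first, second):
    """Return the number of seconds from the first timestamp to the second."""
    start = datetime.strptime(first, LOG_TIME_FORMAT)
    end = datetime.strptime(second, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


vdr_port_failure = "2018-02-08T23:44:19.431Z"  # netcpa: Failed to add vdr port
dvs_registered = "2018-02-08T23:44:28.389Z"    # hostd: Registered Dvs

gap = seconds_between(vdr_port_failure, dvs_registered)
if gap > 0:
    print(f"DVS registered {gap:.3f}s after the vdrPort attempt")
else:
    print("DVS was registered before the vdrPort attempt")
```

A positive gap means the vdrPort connection was attempted before the DVS was registered, which is the ordering the race condition produces.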


The NSX DLR and ARP Suppression

ARP suppression is one of the fundamental features in NSX that helps to make the product scalable. By intercepting ARP requests from VMs before they are broadcast out on a logical switch, the hypervisor can do a simple ARP lookup in its own cache or on the NSX control cluster. If an ARP entry exists on the host or control cluster, the hypervisor can respond directly, avoiding a costly broadcast that would likely need to be replicated to many hosts.

ARP Suppression has existed in NSX since the beginning, but it was only available for VMs connected to logical switches. Up until NSX 6.2.4, the DLR kernel module did not benefit from ARP suppression and every non-cached entry needed to be broadcast out. Unfortunately, the DLR – like most routers – needs to ARP frequently. This can be especially true due to the easy L3 separation that NSX allows using logical switches and efficient east-west DLR routing.

Despite having code in the 6.2.4 and later version DLRs to take advantage of ARP suppression, a large number of deployments are likely not actually taking advantage of this feature due to a recently identified problem.

VMware KB 51709 briefly describes this issue, and makes note of the following conditions:

“DLR ARP Suppression may not be effective under some conditions which can result in a larger volume of ARP traffic than expected. ARP traffic sent by a DLR will not be suppressed if an ESXi host has more than one active port connected to the destination VNI, for example the DLR port and one or more VM vNICs.”

What isn’t clear in the KB article, but can be inferred from the solution, is that the problem is related to VLAN tagging on logical switch dvPortgroups. Any dvPortgroup that is associated with a logical switch and has a VLAN ID specified is impacted by this problem.
