NSX Troubleshooting Scenario 13 – Solution

Welcome to the thirteenth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall from the first half, our fictional customer was having issues re-deploying ESGs and DLRs that had been removed from the vCenter inventory. This was all part of a cleanup activity that followed a SAN outage.

tshoot13a-5 — Not a very informative error message. We’ll need to refer to the logging to find out more.

To begin, we’ll really need more information on exactly why NSX is failing to re-deploy these ESGs and DLRs. The message in the UI is not very informative. As mentioned, there are no failed tasks in the vCenter recent tasks pane when an attempt is made, so we’ll need to go digging into the logging to find out more.

Taking a look at the vsm.log file on the NSX Manager appliance, we can see a backtrace occur at the same time as the deployment attempt:

2018-12-06 23:43:17.681 GMT+00:00 ERROR TaskFrameworkExecutor-19 Worker:219 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] BaseException thrown while executing task instance taskinstance-100656 com.vmware.vshield.edge.exception.EdgeVmDeploymentFailedException: nested exception is com.vmware.vshield.vsm.inventory.vcoperations.OvfManagerInternalErrorException:
core-services:1100:OVF Manager internal error. For more details, refer to the rootCauseString or the VC logs:
Managed object id datastore-26 of type Datastore was not found in VC.

The key part of the message is the following:

“Managed object id datastore-26 of type Datastore was not found in VC.”
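When vsm.log is large, it can help to pull these messages out programmatically. The following is a minimal sketch that assumes the message always follows the phrasing seen in the excerpt above; the function name and the sample string are only for illustration.

```python
import re

# Pattern based on the phrasing in the vsm.log excerpt above.
# Other NSX versions may word this differently (assumption).
MISSING_MOREF = re.compile(
    r"Managed object id (?P<moref>[\w-]+) of type (?P<type>\w+) was not found in VC"
)


def find_missing_objects(log_text):
    """Return a sorted list of (object type, moref) pairs reported as missing."""
    found = {
        (match.group("type"), match.group("moref"))
        for match in MISSING_MOREF.finditer(log_text)
    }
    return sorted(found)


# Sample text copied from the log excerpt in this post.
sample = (
    "core-services:1100:OVF Manager internal error. For more details, "
    "refer to the rootCauseString or the VC logs:\n"
    "Managed object id datastore-26 of type Datastore was not found in VC.\n"
)

print(find_missing_objects(sample))  # [('Datastore', 'datastore-26')]
```

Using a set before sorting means a moref that appears in many repeated deployment attempts is only reported once.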

If you recall the sequence of events discussed in the first half of the scenario, all of the datastores were re-added from replicated LUNs on another SAN. Although the datastores have the same names, and all of the files were retained, the VMFS volumes were re-signatured. This means that from the perspective of vCenter, these are totally new datastores with new UUIDs.

Let’s have a quick look at the moref identifier associated with the shared-hdd0 datastore where dlr-a1 used to reside. Is it still datastore-26?

tshoot3b-1 — Using the URL string trick, we can see that shared-hdd0 has a moref of datastore-682.

Not anymore. It’s now datastore-682. This means that every time we try to re-deploy, NSX is attempting to do so based on the old location in its database. Since datastore-26 doesn’t exist, it can’t even start the process.
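The comparison itself is simple enough to script if you have many edges to check. This is only a sketch: it assumes you have already collected a mapping of datastore names to their current morefs by some other means (the MOB, PowerCLI and so on), and the function name is hypothetical.

```python
def is_moref_valid(stored_moref, current_inventory):
    """Return True if the stored moref still exists in the current inventory.

    current_inventory maps datastore name -> current moref. How that
    mapping is gathered from vCenter is outside the scope of this sketch.
    """
    return stored_moref in set(current_inventory.values())


# Values taken from this scenario.
inventory = {"shared-hdd0": "datastore-682"}

print(is_moref_valid("datastore-26", inventory))   # False, the stale ID NSX still holds
print(is_moref_valid("datastore-682", inventory))  # True, the ID after re-signaturing
```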

Thankfully, fixing this is quite straightforward. Even though the appliance doesn’t exist, we can still update its configuration. We simply need to change datastore-26 to a valid datastore that is accessible to the selected deployment cluster.

tshoot3b-2 — Editing the appliance allows us to change its configuration even if it’s not deployed.

This can be updated via an API call, but in recent versions of NSX, we can simply change this from the UI. Notice the drop-down for datastore is blank because of the invalid selection that was there previously.
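If you do go the API route, the edit boils down to swapping the stale moref in the edge's XML configuration before sending it back. The sketch below covers only that swap. The element names (appliance, datastoreId, highAvailabilityIndex) and the trimmed sample document are assumptions about the layout, so compare them with a real GET response from your own NSX Manager, and check the NSX API guide for the exact call used to submit the change.

```python
import xml.etree.ElementTree as ET


def replace_datastore(edge_xml, old_moref, new_moref):
    """Swap old_moref for new_moref in every appliance's datastoreId.

    Returns (updated_xml, number_of_appliances_changed).
    Element names are assumed, not taken from this post.
    """
    root = ET.fromstring(edge_xml)
    changed = 0
    for appliance in root.iter("appliance"):
        datastore = appliance.find("datastoreId")
        if datastore is not None and datastore.text == old_moref:
            datastore.text = new_moref
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical, heavily trimmed edge configuration with two HA appliances.
sample_xml = """
<edge>
  <id>edge-1</id>
  <appliances>
    <appliance>
      <highAvailabilityIndex>0</highAvailabilityIndex>
      <datastoreId>datastore-26</datastoreId>
    </appliance>
    <appliance>
      <highAvailabilityIndex>1</highAvailabilityIndex>
      <datastoreId>datastore-26</datastoreId>
    </appliance>
  </appliances>
</edge>
"""

updated_xml, count = replace_datastore(sample_xml, "datastore-26", "datastore-682")
print(count)  # 2
```

Matching on the exact old moref, rather than overwriting every datastoreId, leaves any appliance that already points at a valid datastore untouched.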

tshoot3b-3 — Deployment occurs immediately after changing the datastore to a valid one.

Immediately after modifying the appliance and selecting shared-hdd0 as the datastore, the appliance started deploying on its own. A manual re-deploy task is not necessary.

A quick GET /api/4.0/edges/edge-1 REST API call confirms that edge-1 is now using datastore-682:

tshoot3b-4
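The same check can be scripted. The sketch below uses only the Python standard library and assumes basic authentication against NSX Manager, an XML response, and appliance elements named as shown; the hostname, credentials and sample response are placeholders rather than values from this environment.

```python
import base64
import ssl
import urllib.request
import xml.etree.ElementTree as ET


def fetch_edge_xml(manager, edge_id, user, password, verify_tls=True):
    """GET /api/4.0/edges/<edge_id> and return the response body as text."""
    url = f"https://{manager}/api/4.0/edges/{edge_id}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/xml"},
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Lab use only: skips certificate validation for self-signed certs.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode("utf-8")


def appliance_datastores(edge_xml):
    """Map each appliance's HA index to its datastoreId (assumed element names)."""
    root = ET.fromstring(edge_xml)
    return {
        appliance.findtext("highAvailabilityIndex", default="?"):
            appliance.findtext("datastoreId")
        for appliance in root.iter("appliance")
    }


# Hypothetical trimmed response, used here instead of a live call.
sample_response = (
    "<edge><id>edge-1</id><appliances><appliance>"
    "<highAvailabilityIndex>0</highAvailabilityIndex>"
    "<datastoreId>datastore-682</datastoreId>"
    "</appliance></appliances></edge>"
)

print(appliance_datastores(sample_response))  # {'0': 'datastore-682'}
```

Against a live system you would pass the output of fetch_edge_xml("nsxmanager.example.local", "edge-1", user, password) to appliance_datastores instead of the sample string.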

The above process works for both ESGs and DLRs, but remember to change both appliances if HA is enabled. Once both HA appliances have been changed to a valid datastore, they’ll start deploying.

Reader Feedback

Great feedback from @MarcelMertens! Pretty much right on the money:

https://t.co/epqQ1KVAnw
"Redeploy might not work in the following cases:
The datastore on which the NSX Edge was installed is corrupted/unmounted or in-accessible.
…you must update the MoId of the resource pool, datastore, or dvPortGroup using a REST API call."

— Marcel Mertens (@MarcelMertens) December 6, 2018

The public documentation references an API call to update the moref, but it does seem that you can make this change in the UI, at least in recent versions of NSX.

Thanks to everyone who left comments and suggestions!

Conclusion

You’ll find that vSphere, NSX and the majority of VMware’s products rely on unique identifiers as opposed to names to identify inventory objects. Knowing how to find and compare these can make all the difference in troubleshooting. For more information on how to find common moref identifiers used by NSX, see this post.

I hope this scenario was helpful. If you have any questions or suggestions for future scenarios, please feel free to leave a comment below or reach out to me on Twitter (@vswitchzero).
