VMware has just announced a newly discovered bug in NSX 6.3.3. Those running 6.3.3 or planning to upgrade in the near term may want to familiarize themselves with VMware KB 2151719.
As you may know, with 6.3.3 VMware moved from a Debian-based distribution for the underlying OS of the NSX controllers to their Photon OS platform. This is why the upgrade process includes the complete redeployment of all three controller nodes.
It appears that a scheduled clean-up script on the controllers, used to prevent the partitions from filling up, is also removing some files that NSX Manager requires to communicate and authenticate with the controllers via REST API.
Most folks running 6.3.3 in a stable deployment will likely not have noticed, but any event that disrupts communication between NSX Manager and the controllers, such as a reboot of NSX Manager or a network disruption, can prevent the two from reconnecting.
Thankfully, the NSX Controller core functions – managing the VXLAN and distributed logical routing control plane – will continue to work in this state, and data plane disruptions should not be experienced.
KB 2151719 describes a pretty simple workaround: restarting the api-server service on any impacted controllers. This is a non-disruptive action and should be safe to do at any time. The command is the following:
nsx-controller # restart api-server
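If you have several impacted controllers, the restart could be scripted rather than typed on each node. The following is only a rough sketch, not something taken from the KB: it assumes the controllers accept a command passed non-interactively over SSH as the admin user (worth verifying in your environment first), and the function names and the IP addresses in the usage note are made up for illustration. It defaults to a dry run that only returns the commands it would issue.

```python
import subprocess

# The workaround command from KB 2151719.
RESTART_COMMAND = "restart api-server"


def build_ssh_command(host, user="admin"):
    """Return the argv list that would run the restart on one controller.

    Assumption: the controller CLI accepts a command passed over SSH
    non-interactively. If it does not, log in and run the command by hand.
    """
    return ["ssh", f"{user}@{host}", RESTART_COMMAND]


def restart_api_servers(hosts, dry_run=True):
    """Build (and optionally run) the restart for each controller in turn.

    Returns the list of argv lists, so a dry run can be reviewed first.
    """
    issued = []
    for host in hosts:
        argv = build_ssh_command(host)
        issued.append(argv)
        if not dry_run:
            # check=True stops at the first controller that returns an error.
            subprocess.run(argv, check=True)
    return issued
```

Calling restart_api_servers(["192.0.2.11", "192.0.2.12", "192.0.2.13"]) with placeholder addresses like these returns the three ssh commands without touching anything; passing dry_run=False would actually run them one controller at a time.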
VMware will likely be addressing this in the next NSX release. If you are planning to upgrade, you may want to consider 6.3.2 or hold out for the next 6.3.x release.