New NSX Controller Issue Identified in 6.3.3 and 6.3.4.

Having difficulty deploying NSX controllers in 6.3.3? You are not alone. VMware has just published details of a newly discovered bug affecting the Photon OS-based NSX controllers in NSX 6.3.3 and 6.3.4. VMware KB 000051144 provides a detailed summary of the symptoms, but in short:

  • New NSX 6.3.3 Controllers will fail to deploy after November 2nd, 2017.
  • New NSX 6.3.4 Controllers will fail to deploy after January 1st, 2018.
  • Controllers deployed before these dates will prompt for a new password when you attempt to log in.
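The cutoff dates in the list above can be expressed as a small lookup, for example to check whether a planned deployment date falls in the affected window. This is a minimal Python sketch only: the CUTOFFS table and the fresh_deploy_blocked function are hypothetical names, the sketch applies only to the original (not repackaged) builds, and it assumes "after" means strictly later than the listed date.

```python
from datetime import date

# Fresh-deployment cutoff dates for the ORIGINAL 6.3.3/6.3.4 builds,
# taken from the list above. (Hypothetical table name.)
CUTOFFS = {
    "6.3.3": date(2017, 11, 2),
    "6.3.4": date(2018, 1, 1),
}

def fresh_deploy_blocked(version: str, on: date) -> bool:
    """Return True if a new controller from the original build is expected
    to fail to deploy on the given date.

    Assumes "after" means strictly later than the cutoff date. Raises
    KeyError for versions this sketch does not cover.
    """
    return on > CUTOFFS[version]

print(fresh_deploy_blocked("6.3.4", date(2017, 12, 15)))  # False
```

Raising KeyError for unlisted versions is deliberate: silently returning False would wrongly suggest that any other version is safe.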

In other words, if you attempted a fresh deployment of NSX 6.3.3 today, you would not be able to deploy a control cluster.

The issue appears to stem from the root and admin account credentials expiring 90 days after the NSX build was created. The clock does not start when a controller is deployed; it starts when VMware creates the build. This is why NSX 6.3.3 runs into trouble after November 2nd, while 6.3.4 will be fine until January 1st, 2018.
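The 90-day arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the build dates it prints are only what the 90-day figure implies when you work backwards from the published expiry dates. They are not build dates published by VMware.

```python
from datetime import date, timedelta

# The 90-day password lifetime described above.
PASSWORD_MAX_AGE = timedelta(days=90)

def expiry_from_build(build_created: date) -> date:
    # The clock starts when the build is created, not when a controller is deployed.
    return build_created + PASSWORD_MAX_AGE

def implied_build_date(expiry: date) -> date:
    # Work backwards from a published expiry date.
    return expiry - PASSWORD_MAX_AGE

for version, expiry in [("6.3.3", date(2017, 11, 2)), ("6.3.4", date(2018, 1, 1))]:
    print(version, implied_build_date(expiry))
# 6.3.3 2017-08-04
# 6.3.4 2017-10-03
```

Note that expiry_from_build takes no deployment date at all. A controller deployed from the original 6.3.3 build in mid-October would still hit the November 2nd expiry.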

Some important points:

  • If you have already deployed NSX 6.3.3 or 6.3.4, don’t worry – your controllers will continue to function just fine. Having expired admin/root passwords will not break communication between NSX components.
  • This issue has no datapath impact. It only causes problems if you attempt a fresh deployment, an upgrade, or a delete and re-deploy of controllers.
  • Until you’ve had a chance to implement the workaround in KB 000051144, you should avoid the workflows mentioned above.

VMware appears to be re-releasing the existing 6.3.3 and 6.3.4 downloads as new builds with the fix in place, and the fix will also be included in 6.3.5 and future releases. VMware has already added the following text to the 6.3.3 and 6.3.4 release notes:

Important information about NSX 6.3.3: NSX for vSphere 6.3.3 has been repackaged to address the problems mentioned in VMware Knowledge Base articles 2151719 and 000051144. The originally released build 6276725 is replaced with build 7087283. Please refer to the Knowledge Base articles for more detail. See Upgrade Notes for upgrade information.

Old 6.3.3 Build Number: 6276725
New 6.3.3 Build Number: 7087283

Old 6.3.4 Build Number: 6845891
New 6.3.4 Build Number: 7087695
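If you track NSX Manager versions in an inventory script, the build numbers listed above are enough to tell whether an environment is still on an original build or on the repackaged one. A minimal Python sketch follows. The BUILDS table and the classify_build function are hypothetical names, and the sketch assumes you already have the version and build number from NSX Manager; retrieving them is not shown.

```python
# Build numbers listed above: original (affected) vs. repackaged (fixed).
BUILDS = {
    "6.3.3": {"original": 6276725, "repackaged": 7087283},
    "6.3.4": {"original": 6845891, "repackaged": 7087695},
}

def classify_build(version: str, build: int) -> str:
    """Return 'original', 'repackaged', or 'unknown' for a version/build pair."""
    for label, number in BUILDS.get(version, {}).items():
        if number == build:
            return label
    # Any other version or build number is outside the scope of this sketch.
    return "unknown"

print(classify_build("6.3.3", 6276725))  # original
```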

As an added bonus, VMware took advantage of this situation to include the fix for the 6.3.3 NSX controller disconnect issue, described in VMware KB 2151719, in the repackaged build as well. Despite what the 6.3.4 release notes say, only 6.3.3 was susceptible to the issue outlined in KB 2151719.

If you’ve already found yourself in this predicament, VMware has provided an API call that can be used as a workaround. The API call appears to correct the issue by setting the affected accounts to never expire. If a password has already expired, the call resets it, and it’s then up to you to change the password. Detailed steps can be found in KB 000051144.
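For anyone scripting the workaround rather than using a REST client, the general shape of an authenticated NSX Manager API request is sketched below in Python. This is deliberately incomplete. The HTTP method, URL path, and request body must be copied from KB 000051144, and the values in the usage example ("nsxmgr.example.com", "/api/PLACEHOLDER") are hypothetical placeholders. The sketch assumes HTTP basic authentication and an XML body, which is how the NSX for vSphere API is generally called. The build_nsx_request function is a hypothetical name and only prepares the request without sending it.

```python
import base64
import urllib.request
from typing import Optional

def build_nsx_request(manager: str, path: str, method: str,
                      user: str, password: str,
                      body: Optional[bytes] = None) -> urllib.request.Request:
    """Prepare (but do not send) an HTTPS request to the NSX Manager REST API.

    `path`, `method` and `body` are placeholders: take the exact values
    from VMware KB 000051144, not from this sketch.
    """
    # HTTP basic authentication header built from the NSX Manager credentials.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    if body is not None:
        # Assumes an XML payload.
        headers["Content-Type"] = "application/xml"
    return urllib.request.Request(f"https://{manager}{path}", data=body,
                                  headers=headers, method=method)

# Hypothetical values; substitute the real method and path from the KB.
req = build_nsx_request("nsxmgr.example.com", "/api/PLACEHOLDER", "PUT",
                        "admin", "secret")
```

To actually send the request you would pass it to urllib.request.urlopen. If NSX Manager still uses its default self-signed certificate, default certificate verification will reject the connection, so the certificate needs to be trusted first.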

It’s unfortunate that another controller issue has surfaced after the controller disconnect issue discovered in 6.3.3. Whenever there is a major change, such as the introduction of a new underlying OS platform, issues like these can slip through. Thankfully, the impact on existing deployments is more of an inconvenience than a serious problem. Kudos to the VMware engineering team for working so quickly to get these fixes and workarounds released!

 

7 thoughts on “New NSX Controller Issue Identified in 6.3.3 and 6.3.4.”

  1. Hi Mike,

    Great article. I was going to blog about the same thing, but you beat me to it.

    Re: KB 000051144, the following statement is wrong:

    “If any or all of the Controllers are redeployed repeat the preceding steps again”

    This seems to imply that running the API calls again will fix the issue on redeployed controllers. Unfortunately, that’s not the case: any newly deployed controller will hang until the root and admin accounts are manually fixed.

    1. Thanks for the feedback, Giuliano! I haven’t had a chance to run through this procedure myself yet but hoping to reproduce it in my lab soon. I will reach out to the KB team to see if they can correct/clarify that statement as well.

    2. I reported this to VMware and they are going to look at getting the KB amended. No doubt it will just say that the API call doesn’t fix it, as it seems you need a “special” account to resolve it.

  2. Thanks, Mike, for walking through this issue and explaining it.
    Unfortunately, I changed the password on 3 controllers manually after getting the “expired” message directly on the controllers. Now they are showing as disconnected, and even the API call couldn’t fix the issue. How can I change the controller password in NSX Manager so that it can run the script? I even tried changing the password on the controller back to the original, but I got the “ERROR: Failed to update user: admin” error.
    Any thoughts?
    Regards
    Heidar

    1. Hi Heidar,

      That’s odd – not sure of a way to update NSX manager’s password entries for the control cluster. Given that the fix is out now, it may be best to simply upgrade NSX manager to the revised build of 6.3.3/6.3.4 (or a newer version if you prefer) then delete/redeploy each controller. The redeployed controllers should no longer have the password expiry problem if deployed from a good build of NSX manager. I’m not sure what the health of the cluster will look like during the delete/redeploy so it would be best to do this during a maintenance window. VMware support could help you determine the best sequence of tasks and help you to assess the situation.

      I hope this helps.

      Thanks,
      Mike

      1. Hi Mike,
        I just want to share my workaround here. I managed to change the controller password by hacking the Photon OS boot process and changing the password back to what it was with a simple passwd in Linux. Right after that, NSX Manager regained control and was able to talk to the controllers again.
        Thanks anyway,
        Cheers,
        Heidar
