Welcome to the third instalment of a new series of NSX-T troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.
Please see the first half for more detail on the problem symptoms and some scoping.
As we saw in the first half, the customer’s management cluster was in a degraded state. This was due to one manager – 172.16.1.41 – being in a wonky half-broken state. Although we could ping it, we could not login and all of the services it was contributing to the NSX management cluster were down.
What was most telling, however, was the screenshot of the VM’s console window.
The most important keyword there was “Read-only file system”. As many readers had correctly guessed, this is a very common response to an underlying storage problem. Like most flavors of Linux, the Linux-based OS used in the NSX appliances will set their ext4 partitions to read-only in the event of a storage failure. This is a protective mechanism to prevent data corruption and further data loss.
When this happens, the guest may be partially functional, but anything that requires write access to the read-only partitions will obviously be in trouble. This is why we could ping the manager appliance, but all other functionality was broken. The manager cluster uses ZooKeeper for clustering services. ZooKeeper requires consistent and low-latency write access to disk. Because this wasn’t available to 172.16.1.41, it was marked as down in the cluster.
After discussing this with our fictional customer, we were able to confirm that an ESXi host esx-e3 experienced a total storage outage for a few minutes and that it had since been fixed. They had assumed it was not related because the appliance was on esx-e1, not esx-e3.
It’s been a while since I’ve posted anything, so what better way to get back into the swing of things than a troubleshooting scenario! These last few months I’ve been busy learning the ropes in my new role as an SRE supporting NSX and VMware Cloud on AWS. Hopefully I’ll be able to start releasing regular content again soon.
Welcome to the third NSX-T troubleshooting scenario! What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
As always, we’ll start with a fictional customer problem statement:
“I’m not experiencing any problems, but I noticed that my NSX-T 2.4.1 manager cluster is in a degraded state. One of the unified appliances appears to be down. I can ping it just fine, but I can’t seem to login to the appliance via SSH. I’m sure I’m putting in the right password, but it won’t let me in. I’m not sure what’s going on. Please help!”
From the NSX-T Overview page, we can see that one appliance is red.
Let’s have a look at the management cluster in the UI:
The problematic manager is 172.16.1.41. It’s reporting its cluster connectivity as ‘Down’ despite being reachable via ping. It appears that all of the services including controller related services are down for this appliance as well.
Strangely, it doesn’t appear to be accepting the admin or root passwords via SSH. We always get an ‘Access Denied’ response. We can login successfully to the other two appliances without issue using the same credentials.
Opening a console window to 172.16.1.41 greets us with the following:
Error messages appear to continually scroll by from system-journald mentioning “Failed to write entry”. Hitting enter gives us the login prompt, but we immediately get the same error messages and can’t login.
It seems pretty clear that there is something wrong with 172.16.1.41, but what may have caused this problem? How would you fix this and most importantly, how can you root cause this?
I’ll post the solution in the next day or two, but how would you handle this scenario? Let me know! Please feel free to leave a comment below or via Twitter (@vswitchzero).
Using REST API calls to limit NSX user permissions to specific objects only.
There is a constant stream of new features added with each release of NSX, but not all of the original features have survived. NSX Data Security was one such feature, but VMware also removed the ‘Limit Scope’ option for user permissions in the NSX UI with the release of 6.2.0 back in 2015. Every so often I’ll get a customer asking where this feature went.
The ‘Limit Scope’ feature allows you to limit specific NSX users to specific objects within the inventory. For example, you may want to provide an application owner with full access to only one specific edge load balancer, and to ensure they have access to nothing else in NSX.
The primary reason that the feature was scrapped in 6.2 was because of UI problems that would occur for users restricted to only specific resources. To view the UI properly and as intended, you’d need access to the ‘global root’ object that is the parent for all other NSX managed objects. VMware KB 2136534 is about the only source I could find that discusses this.
REST API Calls Still Exist
Although the ‘Limit Scope’ option was removed from the UI in 6.2 and later, you may be surprised to discover that the API calls for this feature still exist.
To show how this works, I’ll be running through a simple scenario in my lab. For this test, we’ll assume that there are two edges – mercury-esg1 and mercury-dlr that are related to a specific application deployment. A vCenter user called test in the vswitchzero.net domain requires access to these two edges, but we don’t want them to be able to access anything else.
The two edges in question have morefs edge-4 and edge-5 respectively. For more information on finding moref IDs for NSX objects, see my post on the subject here.
If you are new to NSX or looking to evaluate it in the lab, there is one very common issue that you may run into. After going through the initial steps of deploying and registering NSX Manager with vCenter, you may be surprised to find that there are no manageable NSX managers listed under ‘Networking and Security’ in the Web Client. Although the registration and Web Client plugin installation appears successful, there is often an extra step needed before you can manage things.
One of the first tasks involved in deploying NSX is to register NSX Manager with a vCenter Server. This is done for inventory management and synchronization purposes. The NSX Manager can be optionally registered with SSO as well.
The vCenter user that is used for registration needs to have the highest level of privileges for NSX to work correctly. The NSX install guide clearly states that this must be the vCenter ‘Administrator’ role.
From the NSX Install Guide:
“You must have a vCenter Server user account with the Administrator role to synchronize NSX Manager with the vCenter Server. If your vCenter password has non-ASCII characters, you must change it before synchronizing the NSX Manager with the vCenter Server.”
Because of these requirements, it’s quite common to use the SSO administrator account – usually firstname.lastname@example.org. A service account is also often created for this purpose to more easily identify and distinguish NSX tasks. Either way, these are not normally accounts that you’d use for day-to-day administration in vSphere.
By default, NSX will only assign its ‘Enterprise Administrator’ role to the user account that was used to register it with vCenter Server. This means that by default, only that specific vCenter user will have access to the NSX manager from within the Web Client.
That said, if you are experiencing this problem, you are probably not logged in with the vCenter user that was used for registration purposes. To grant access to other users, you’ll need to log into the vSphere Web Client using the registration user account, and then add additional users and groups.
In my lab, I’ve just logged in with an active directory user called ‘email@example.com’. This user has full administrator privileges in vCenter, but has no access to any NSX Managers:
If I log out, and log back in with the firstname.lastname@example.org account that was used for vCenter registration, I can see the NSX managers that were registered.
In my lab, I’ve got a secondary deployed as well, but we’ll focus only on 172.16.10.40. If I click on that manager in the list, I’m able to go to the ‘Users’ tab to see what the default permissions look like:
As you can see, only one user – the SSO administrator account used for registration – has the requisite role for administrator via the Web Client. In my lab, I want to provide full access to an AD group called ‘VMware Admins’ and an individual user called ‘Test’.
Both vCenter users and groups can be specified here. As long as vCenter can authenticate them – either via SSO, local authentication or even AD – they are fair game.
Another common mistake made is selecting the NSX Administrator role rather than Enterprise Administrator. NSX Administrator sounds like the highest privilege level, but it’s actually Enterprise Administrator that gives you all the keys to the kingdom. You won’t be able to administer certain things – including user permissions – unless Enterprise Administrator is chosen.
Once this is done, you’ll see the users and groups listed and should now have the correct permissions to administer the NSX deployment!
Keep in mind that if you’ve got more than one NSX manager deployed, you’ll need to set this on each independently.
Have any questions or want more information? Please feel free to leave a comment below or reach out to me on Twitter (@vswitchzero)
Unlike previous 6.3.x releases, 6.3.5 has some new upgrade minimum version compatibility requirements. This is not only true from a vSphere perspective, but also for the version of NSX 6.2.x you are running. If you are running an older 6.2.0, 6.2.1 or 6.2.2 release of NSX, you’ll need to upgrade to at least 6.2.4 before taking the big step up to 6.3.5. VMware has just updated the NSX Upgrade Matrix to reflect this requirement:
I expect that VMware will update the 6.3.5 release notes and release a new KB article very shortly. I’ll provide some more detail when that is out. In the meantime, please be sure to heed the version requirements or you will most likely run into problems.
Thankfully there aren’t too many customers still running these old releases of 6.2.x, but if you have already attempted the upgrade and hit problems, you’ll need to roll back. If you took a cold-snapshot of the manager or a clone, you can roll back that way. Otherwise, you’ll need to deploy the original 6.2.x OVA again and restore your FTP backup.
** Edit 11/29/2017: VMware has just updated the NSX 6.3.5 release notes to include mention of the minimum version requirements. The following statement was added:
Important: If you are upgrading NSX 6.2.0, 6.2.1, or 6.2.2 to NSX 6.3.5, you must complete a workaround before starting the upgrade. See VMware Knowledge Base article 000051624 for details.
VMware calls it a “workaround” but it’s basically just upgrading to an interim version before going to 6.3.5. In KB 000051624, VMware recommends going to 6.2.9 as that workflow has been tested. I.e. upgrading from 6.2.0 to 6.2.9, and then to 6.3.5. On a positive note, you only need to upgrade your NSX Manager to 6.2.9, no other components need to be upgraded before proceeding on to 6.3.5.
If you attempt an upgrade from 6.2.2 or older releases, my understanding is that the upgrade will appear to be completed successfully, but your configuration will be missing. VMware calls out the remediation steps of rolling back to the previous version should you run into this issue.
In an interesting move, VMware has released public KB 2149630 on September 29th, providing information on how to access the root shell of the NSX Manager appliance.
If you’ve been on an NSX support call with VMware dealing with a complex issue, you may have seen your support engineer drop into a special shell called ‘Engineering Mode’. This is sometimes also referred to as ‘Tech Support Mode’. Regardless of the name used, this is basically a root bash shell on the underlying Linux based appliance. From here, system configuration files and scripts as well as most normal Linux functions can be accessed.
Normally, when you open a console or SSH session to NSX manager, you are dropped into a restricted ‘admin’ shell with a hierarchical system of commands like Cisco’s IOS. For the majority of what an administrator needs to do, this is sufficient. It’s only in more complex cases – especially when dealing with issues in the Postgres DB – or issues with the underling OS that this may be required.
There are several important statements and disclaimers that VMware makes in this KB article that I want to outline below:
“Important: Do not make any changes to the underlying system without the help of VMware Technical Support. All such changes are not supported and as a result, your system may no longer be supportable by GSS.”
In NSX 6.3.2 and later, you’ll also be greeted by the following disclaimer:
“Engineering Mode: The authorized NSX Manager system administrator is requesting a shell which is able to perform lower level unix commands/diagnostics and make changes to the appliance. VMware asks that you do so only in conjunction with a support call to prevent breaking your virtual infrastructure. Please enter the shell diagnostics string before proceeding.Type Exit to return to the NSX shell. Type y to continue:”
And finally, you’ll want to ensure you have a full backup of NSX Manager should anything need to be modified:
VMware recommends to take full backup of the system before performing any changes after logging into the Tech Support Mode.
Although it is very useful to take a ‘read only’ view at some things in the root shell, making any changes is not supported without getting direct assistance from VMware support.
A few people have asked whether or not making the root shell password public is a security issue, but the important point to remember is that you cannot even get to a position where you can enter the shell unless you are already logged in as an NSX enterprise administrator level account. For example, the built-in ‘admin’ account. For anyone concerned about this, VMware does allow the root password to be changed. It’s just critical that this password not be lost in case VMware support requires access to the root shell for troubleshooting purposes. More information on this can be found in KB 2149630.
To be honest, I’m a bit torn on this development. As someone who does backline support, I know what kind of damage that can be done from the root shell – even with the best intentions. But at the same time, I see this as empowering. It gives customers additional tools to troubleshoot and it also provides some transparency into how NSX Manager works rather than shielding it behind a restricted shell. I think that overall, the benefits outweigh the risks and this was a positive move for VMware.
When I think back to VI 3.5 and vSphere 4.0 when ESXi was shiny and new, VMware initially took a similar stance. You had to go so far as to type ‘UNSUPPORTED’ into the console to access a shell. Today, everyone has unrestricted root access to the hypervisor. The same holds true for the vCenter appliance – the potential for destruction is no different.
I’d welcome any comments or thoughts. Please share them below!
NSX isn’t just a few virtual machines that can be deleted – there are hooks into numerous vCenter objects and it must be removed properly.
Admittedly, removing NSX from an environment was not my first choice of topics to cover, but I have found that the process is often misunderstood and done improperly. NSX isn’t just a few virtual machine appliances that can be deleted – there are hooks into numerous vCenter objects, your ESXi hosts and vCenter Server itself. To save yourself from some grief and a lot of manual cleanup, the removal must be done properly.
Thankfully, VMware does provide some high level instructions to follow in the public documentation. You’ll find these public docs for NSX 6.2.x and 6.3.x respectively here and here.
There are many reasons that someone may wish to remove NSX from a vSphere environment – maybe you’ve installed an evaluation copy to run a proof of concept or just want to start fresh again in your lab environment. In my case I need to completely remove NSX 6.2.5 and install an older version of NSX for some version-specific testing in my home lab.
From a high level, the process should look something like this:
Remove all VMs from Logical Switches.
Remove NSX Edges and Distributed Logical Routers.
Remove all Logical Switches.
Uninstall NSX from all ESXi hosts in prepared clusters.
Delete any Transport Zones.
Delete the NSX Manager and NSX Controller appliances.
Remove the NSX Manager hooks into vCenter, including the plugin/extension.
Cleaning up the vSphere Web Client leftovers on the vCenter Server.