No NSX Managers Listed in the Web Client

If you are new to NSX or looking to evaluate it in the lab, there is one very common issue that you may run into. After going through the initial steps of deploying and registering NSX Manager with vCenter, you may be surprised to find that there are no manageable NSX Managers listed under ‘Networking and Security’ in the Web Client. Although the registration and Web Client plugin installation appear successful, there is often an extra step needed before you can manage anything.

One of the first tasks involved in deploying NSX is to register NSX Manager with a vCenter Server. This is done for inventory management and synchronization purposes. The NSX Manager can be optionally registered with SSO as well.

[Screenshot: In my lab, I’ve used the SSO administrator account for registration.]

The vCenter user that is used for registration needs to have the highest level of privileges for NSX to work correctly. The NSX install guide clearly states that this must be the vCenter ‘Administrator’ role.

From the NSX Install Guide:

“You must have a vCenter Server user account with the Administrator role to synchronize NSX Manager with the vCenter Server. If your vCenter password has non-ASCII characters, you must change it before synchronizing the NSX Manager with the vCenter Server.”

Because of these requirements, it’s quite common to use the SSO administrator account – usually administrator@vsphere.local. A service account is also often created for this purpose to more easily identify and distinguish NSX tasks. Either way, these are not normally accounts that you’d use for day-to-day administration in vSphere.

By default, NSX will only assign its ‘Enterprise Administrator’ role to the user account that was used to register it with vCenter Server. This means that by default, only that specific vCenter user will have access to the NSX manager from within the Web Client.

That said, if you are experiencing this problem, you are probably not logged in with the vCenter user that was used for registration purposes. To grant access to other users, you’ll need to log into the vSphere Web Client using the registration user account, and then add additional users and groups.

In my lab, I’ve just logged in with an Active Directory user called ‘test@lab.local’. This user has full administrator privileges in vCenter, but has no access to any NSX Managers:

[Screenshot: No managers to manage!]

If I log out, and log back in with the administrator@vsphere.local account that was used for vCenter registration, I can see the NSX managers that were registered.


In my lab, I’ve got a secondary NSX Manager deployed as well, but we’ll focus only on 172.16.10.40. If I click on that manager in the list, I’m able to go to the ‘Users’ tab to see what the default permissions look like:


Only one user – the SSO administrator account used for registration – has the requisite role to administer NSX via the Web Client. In my lab, I want to provide full access to an AD group called ‘VMware Admins’ and an individual user called ‘Test’.


Both vCenter users and groups can be specified here. As long as vCenter can authenticate them – either via SSO, local authentication or even AD – they are fair game.


Another common mistake is selecting the NSX Administrator role rather than Enterprise Administrator. NSX Administrator sounds like the highest privilege level, but it’s actually Enterprise Administrator that gives you all the keys to the kingdom. You won’t be able to administer certain things – including user permissions – unless Enterprise Administrator is chosen.


Once this is done, you’ll see the users and groups listed and should now have the correct permissions to administer the NSX deployment!

Keep in mind that if you’ve got more than one NSX Manager deployed, you’ll need to set this on each of them independently.
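
If you’d rather script this – especially with a primary and secondary in play – the NSX Manager REST API exposes the same role assignments. The listing below is only a sketch: it assumes the NSX 6.x user-management endpoint and role identifiers as I remember them from the API guide (enterprise_admin scoped to globalroot-0), the secondary manager address is hypothetical, and the exact user or group identifier format can vary with how authentication is set up, so verify against the API guide for your release.

  # Grant the Enterprise Administrator role to a vCenter user on each manager.
  # 172.16.10.41 is a hypothetical address standing in for the secondary manager.
  for mgr in 172.16.10.40 172.16.10.41; do
    curl -k -u 'admin:<nsx-admin-password>' -X POST \
      -H 'Content-Type: application/xml' \
      -d '<accessControlEntry><role>enterprise_admin</role><resource><resourceId>globalroot-0</resourceId></resource></accessControlEntry>' \
      "https://$mgr/api/2.0/services/usermgmt/role/test@lab.local"
  done

For a group such as ‘VMware Admins’, my understanding is that the same call takes the group identifier along with an isGroup=true query parameter.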

Have any questions or want more information? Please feel free to leave a comment below or reach out to me on Twitter (@vswitchzero)


The NSX DLR and ARP Suppression

ARP suppression is one of the key features in NSX that helps make the product scalable. By intercepting ARP requests from VMs before they are broadcast out on a logical switch, the hypervisor can do a simple ARP lookup in its own cache or on the NSX control cluster. If an ARP entry exists on the host or control cluster, the hypervisor can respond directly, avoiding a costly broadcast that would likely need to be replicated to many hosts.
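
If you want to see what is actually cached, both sides of this lookup can be inspected. The commands below are a sketch from memory rather than gospel – the VNI (5001) and VDS name are placeholders, and option names can shift slightly between NSX versions:

  # On an NSX controller (the one owning the VNI): view the ARP table for a logical switch
  show control-cluster logical-switches arp-table 5001

  # On an ESXi host: view the locally cached ARP entries for the same VNI
  esxcli network vswitch dvs vmware vxlan network arp list --vds-name=Compute_VDS --vxlan-id=5001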

ARP suppression has existed in NSX since the beginning, but it was only available for VMs connected to logical switches. Prior to NSX 6.2.4, the DLR kernel module did not benefit from ARP suppression, and every non-cached entry needed to be broadcast out. Unfortunately, the DLR – like most routers – needs to ARP frequently. This is especially true in NSX, where logical switches make L3 separation easy and the DLR provides efficient east-west routing.

Although the DLR kernel module in 6.2.4 and later includes code to take advantage of ARP suppression, a large number of deployments are likely not benefiting from it due to a recently identified problem.

VMware KB 51709 briefly describes this issue, and makes note of the following conditions:

“DLR ARP Suppression may not be effective under some conditions which can result in a larger volume of ARP traffic than expected. ARP traffic sent by a DLR will not be suppressed if an ESXi host has more than one active port connected to the destination VNI, for example the DLR port and one or more VM vNICs.”

What isn’t clear in the KB article, but can be inferred from the solution, is that the problem is related to VLAN tagging on logical switch dvPortgroups. Any dvPortgroup associated with a logical switch that has a VLAN ID specified is impacted by this problem.
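
A quick way to gauge whether a host is exposed is to look at its VXLAN configuration and per-VNI state, where both the VLAN ID used for VXLAN traffic and the ARP entry counts are visible. This is only a sketch – the output of net-vdl2 varies a little between builds:

  # On an ESXi host: list VXLAN configuration, including the VLAN ID used for
  # VXLAN traffic and the ARP entry count for each VXLAN network (VNI)
  net-vdl2 -l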

Continue reading “The NSX DLR and ARP Suppression”

Building a Retro Gaming Rig – Part 6

It’s been a while since my last retro build post, but the build is finally complete! Actually, it’s been done for several weeks, but I just haven’t had time to break out the camera and get some pictures.

Without further ado, let’s have a look at the finishing touches and the final build!

Completed Build

For the case, I decided to use an old Antec NSK 3480 that I had lying around. I love the very simple appearance, but most of all, the small dimensions. With a depth of only 14 inches, it’s only as deep as it is tall. This can make for some challenges, but also makes it a perfect fit for slim boards.

Although it really doesn’t look like a retro box, I like that it’s very unassuming and feels a lot like a ‘sleeper’ build. I’ve got an old Pentium 90 and yellowed 486 rig that I’ll save for that true old-school appearance!


The Antec case was perfect for the MSI MS-6160 board. It even left enough space at the front of the case for the IDE-to-SD adapter board. For more detail on the motherboard, check out Part 3 of the series.

Continue reading “Building a Retro Gaming Rig – Part 6”

NSX Troubleshooting Scenario 2 – Solution

Welcome to the second installment of a new series of NSX troubleshooting scenarios. This is the second half of scenario two, where I’ll perform some troubleshooting and resolve the problem.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As mentioned in the first half, the problem is limited to a host called esx-a1. As soon as a guest moves to that host, it has no network connectivity. If we move a guest off of the host, its connectivity is restored.


We have one VM called win-a1 on host esx-a1 for testing purposes at the moment. As expected, the VM can’t be reached.

To begin, let’s have a look at the host from the CLI to figure out what’s going on. We know that the UI is reporting that it’s not prepared and that it doesn’t have any VTEPs created. In reality, we know a VTEP exists but let’s confirm.

First, we’ll check whether the NSX VIBs are installed on this host. With NSX 6.3.x, we expect to see two VIBs listed – esx-vsip and esx-vxlan.
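
For reference, here’s roughly what that looks like from the ESXi shell. The grep pattern just matches the two VIB names we expect on 6.3.x, and the VTEP will normally show up as a vmkernel interface on the vxlan netstack:

  # Confirm the NSX VIBs are installed on the host
  esxcli software vib list | grep -E 'esx-vsip|esx-vxlan'

  # Confirm a VTEP vmkernel interface exists and has an IP address
  esxcli network ip interface list
  esxcli network ip interface ipv4 get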

Continue reading “NSX Troubleshooting Scenario 2 – Solution”

NSX Troubleshooting Scenario 2

I got some overwhelmingly positive feedback after posting the first troubleshooting scenario and solution recently. Thanks to everyone who reached out to me via Twitter with feedback and suggestions! Please keep those suggestions and comments coming.

Today, I’m going to post a similar but more brief scenario. This is something that we see regularly in GSS – issues surrounding host preparation!

NSX Troubleshooting Scenario 2

Let’s begin with the usual vague customer problem description:

“We took a host out of the compute-a cluster to do some hardware maintenance. Now it’s been added back and when VMs move to this host, they have no connectivity! We’re using NSX 6.3.2”

This is a fictional scenario of course, but let’s assume that we’ve started taking a look at the environment and collecting some additional data.

As the customer mentioned, they are running NSX 6.3.2 and have a cluster called compute-a:


The host that was taken out of the cluster for maintenance was esx-a1.lab.local. The L3 design is pretty much the same as in the previous scenario:

Continue reading “NSX Troubleshooting Scenario 2”

NSX Troubleshooting Scenario 1 – Solution

Welcome to the second half of ‘NSX Troubleshooting Scenario 1’. For detail on the problem and some initial scoping, please see the first part of the scenario that I posted a few days ago. In this half, I’ll walk through some of the troubleshooting I did to find the underlying cause of this problem as well as the solution.

Where to Start?

The scoping done in the previous post gives us a lot of useful information, but it’s not always clear where to start. In my experience, it’s helpful to make educated ‘assertions’ based on what I think the issue is – or more often what I think the issue is not.

I’ll begin by translating the scoping observations into statements:

  • It’s clear that basic L2/L3 connectivity is working to some degree. This isn’t a guarantee that there aren’t other problems, but it looks okay at a glance.
  • We know that win-b1 and web-a1 are both on the same VXLAN logical switch. We also know they are in the same subnet, so that eliminates a lot of the routing as a potential problem. The DLR and ESGs should not really be in the picture here at all.
  • The DFW is enabled, but looks to be configured with the default ‘allow’ rules only. It’s unlikely that this is a DFW problem, but we may need to prove this because the symptoms seem to be specific to HTTP.
  • We also know that VMs in the compute-b cluster are having the same types of symptoms accessing internet-based web sites. We know that the infrastructure needed to get to the internet – ESGs, physical routers, etc. – is all accessed via the compute-a cluster.
  • It was also mentioned by the customer that the compute-b cluster was newly added. This may seem like an insignificant detail, but it really increases the likelihood of a configuration or preparation problem.

Based on the testing done so far, the issue appears to be impacting a TCP service – port 80 HTTP. ICMP doesn’t seem impacted. We don’t know if other protocols are seeing similar issues.

Before we start health checking various NSX components, let’s do a bit more scoping to see if we can’t narrow this problem down even further. Right off the bat, the two questions I want answered are the following (with a quick check for each sketched after the list):

  1. Are we really talking to the device we expect from a L2 perspective?
  2. Is the problem really limited to the HTTP protocol?
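
For both questions, a couple of quick checks from win-b1 go a long way. The addresses and ports below are placeholders – substitute web-a1’s real IP, and pick a second TCP port that the server is actually listening on:

  # 1. Is the ARP entry for web-a1 the MAC address we expect? (compare with the vNIC MAC in vCenter)
  arp -a 172.17.1.10

  # 2. Is the problem limited to HTTP? Test TCP 80, then another TCP port on the same server
  curl -v http://172.17.1.10/
  curl -v telnet://172.17.1.10:3389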

Continue reading “NSX Troubleshooting Scenario 1 – Solution”

NSX Troubleshooting Scenario 1

Welcome to the first of what I hope to be many NSX troubleshooting posts. As someone who has been working in back-line support for many years, troubleshooting is really the bread and butter of what I do every day. Solving problems in vSphere can be challenging enough, but NSX adds another thick layer of complexity to wrap your head around.

I find that there is a lot of NSX documentation out there, but most of it is about how to configure NSX and how it works – not a whole lot on troubleshooting. What I hope to do in these posts is spark some conversation and share some of the common issues I run across from day to day. Each scenario will hopefully be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there. I hope to leave a gap of a few days between the problem and solution posts to give people some time to comment, ask questions and provide their thoughts on what the problem could be!

NSX Troubleshooting Scenario 1

As always, let’s start with a somewhat vague customer problem description:

“Help! I’ve deployed a new cluster (compute-b) and for some reason I can’t access internal web sites on the compute-a cluster or at any other internet site.”

Of course, this is really only a small description of what the customer believes the problem to be. One of the key tasks for anyone working in support is to scope the problem and put together an accurate problem statement. But before we begin, let’s have a look at the customer’s environment to better understand how the new compute-b cluster fits into the grand scheme of things.

Continue reading “NSX Troubleshooting Scenario 1”

Check NSX 6.2.x Compatibility Before Upgrading to 6.3.5!

Unlike previous 6.3.x releases, 6.3.5 introduces some new minimum version requirements for upgrades. This is true not only from a vSphere perspective, but also for the version of NSX 6.2.x you are running. If you are running an older 6.2.0, 6.2.1 or 6.2.2 release of NSX, you’ll need to upgrade to at least 6.2.4 before taking the big step up to 6.3.5. VMware has just updated the NSX Upgrade Matrix to reflect this requirement:

[Screenshot taken from the VMware Interoperability Matrix site.]

I expect that VMware will update the 6.3.5 release notes and release a new KB article very shortly. I’ll provide some more detail when that is out. In the meantime, please be sure to heed the version requirements or you will most likely run into problems.

Thankfully there aren’t too many customers still running these old releases of 6.2.x, but if you have already attempted the upgrade and hit problems, you’ll need to roll back. If you took a cold-snapshot of the manager or a clone, you can roll back that way. Otherwise, you’ll need to deploy the original 6.2.x OVA again and restore your FTP backup.

** Edit 11/29/2017: VMware has just updated the NSX 6.3.5 release notes to include mention of the minimum version requirements. The following statement was added:

Important: If you are upgrading NSX 6.2.0, 6.2.1, or 6.2.2 to NSX 6.3.5, you must complete a workaround before starting the upgrade. See VMware Knowledge Base article 000051624 for details.

VMware calls it a “workaround” but it’s basically just upgrading to an interim version before going to 6.3.5. In KB 000051624, VMware recommends going to 6.2.9, as that workflow has been tested – i.e. upgrading from 6.2.0 to 6.2.9, and then to 6.3.5. On a positive note, only the NSX Manager needs to be upgraded to 6.2.9; no other components need to be upgraded before proceeding on to 6.3.5.
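
Before kicking anything off, it’s worth confirming exactly which 6.2.x build your manager is running. The manager’s summary page shows this, but if you prefer the API, something like the call below should work – hedged, as I’m going from memory on the appliance-management endpoint, and the hostname is a placeholder:

  # Query NSX Manager version and build information
  curl -k -u 'admin:<password>' https://nsxmgr.lab.local/api/1.0/appliance-management/global/info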

If you attempt an upgrade from 6.2.2 or older releases, my understanding is that the upgrade will appear to be completed successfully, but your configuration will be missing. VMware calls out the remediation steps of rolling back to the previous version should you run into this issue.

NSX Transport Zone Cluster Removal Issues

Ever remove a cluster from your NSX transport zone only to see it reappear on the list of clusters available for disconnection? Unfortunately, the task likely failed but NSX doesn’t always do a very good job of telling you why in the UI.

I was recently attempting to remove a cluster called compute-b from my transport zone so that I could remove and rebuild the hosts within. Needless to say, I ran into some difficulties and wanted to share my experience.

If you are interested in some more detailed instructions on how to decommission NSX prepared hosts, you can check out my post on Completely Removing NSX. From a high level, the steps I wanted to do were the following:

  1. Disconnect all VMs from logical switches in the cluster to be removed.
  2. Remove the cluster from the transport zone. This will remove all port groups associated with the logical switches (assuming no other clusters are connected to the same distributed switch).
  3. ‘Unconfigure’ VXLAN from the ‘Logical Network Preparation’ tab to remove all VTEPs.
  4. Uninstall the NSX VIBs from the Host Preparation Tab.

To begin, I used the ‘Remove VM’ button from the Logical Switches view in NSX. I removed all four of the VMs attached to the only Logical Switch being used at the moment. I saw a bunch of VM reconfigure tasks complete, and assumed it had completed successfully.

I then went to disconnect compute-b from the transport zone called Primary TZ. After removing the cluster and clicking OK, the dialog closed, giving the impression that the task was successful. Oddly though, I didn’t see the tasks related to port group removal that I expected to see.


Sure enough, I went back into the ‘Disconnect Clusters’ dialog and saw the compute-b cluster still in the list. Unfortunately, NSX doesn’t appear to report failures for this particular workflow in the UI.

Having worked in support for many years, I followed my first instinct and checked the NSX Manager vsm.log file for detail on why the operation failed. I found the below failure details:

2017-11-24 18:50:13.016 GMT+00:00 INFO taskScheduler-8 JobWorker:243 - Updating the status for jobinstance-101742 to EXECUTING
2017-11-24 18:50:13.022 GMT+00:00 INFO taskScheduler-8 SchedulerQueueServiceImpl:64 - [TF] Created a new bucket for module default_module and total number of buckets 1
2017-11-24 18:50:13.022 GMT+00:00 INFO taskScheduler-8 SchedulerQueueServiceImpl:80 - The task ShrinkVdnScope-vdnscope-1 (1511549412299) [id:task-102755] is added to the SchedulerQueue
2017-11-24 18:50:13.022 GMT+00:00 INFO pool-10-thread-1 ScheduleSynchronizer:48 - Start executing task: task-102755 and running executor threads 1
2017-11-24 18:50:13.042 GMT+00:00 INFO TaskFrameworkExecutor-3 VdnScopeServiceImpl$2:995 - New VDS (count: 1) is being removed when shrinking scope vdnscope-1. Shrinkingwires.
2017-11-24 18:50:13.061 GMT+00:00 ERROR TaskFrameworkExecutor-3 VirtualWireServiceImpl:1577 - validation failed at delete backing for dvportgroup-815 in the scope: vdnscope-1
2017-11-24 18:50:13.061 GMT+00:00 ERROR TaskFrameworkExecutor-3 VdnScopeServiceImpl$2:1015 - Shrink operation failed on TZ vdnscope-1
2017-11-24 18:50:13.061 GMT+00:00 ERROR TaskFrameworkExecutor-3 Worker:219 - BaseException thrown while executing task instance taskinstance-166334
com.vmware.vshield.vsm.vdn.exceptions.XvsException: core-services:819:Transport zone vdnscope-1 contraction error.
 at com.vmware.vshield.vsm.vdn.service.VirtualWireServiceImpl.validateShrink(VirtualWireServiceImpl.java:1578)
 at com.vmware.vshield.vsm.vdn.service.VdnScopeServiceImpl$2.doTask(VdnScopeServiceImpl.java:1001)
 at com.vmware.vshield.vsm.vdn.service.task.AbstractVdnRunnableTask.run(AbstractVdnRunnableTask.java:80)
 at com.vmware.vshield.vsm.task.service.Worker.runtask(Worker.java:184)
 at com.vmware.vshield.vsm.task.service.Worker.executeAsync(Worker.java:122)
 at com.vmware.vshield.vsm.task.service.Worker.run(Worker.java:99)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
2017-11-24 18:50:13.068 GMT+00:00 INFO TaskFrameworkExecutor-3 JobWorker:243 - Updating the status for jobinstance-101742 to FAILED

There is quite a bit there, but the key takeaways are that the task clearly failed and that the reason was: “validation failed at delete backing for dvportgroup-815 in the scope: vdnscope-1”.

This doesn’t tell us why exactly, but it seems clear that the operation can’t delete dvportgroup-815 and fails. In my experience, 99% of the time this is because there is still something connected to the portgroup.

Since there were only four VMs in the cluster and no ESGs or DLRs, I wasn’t sure what could possibly still be connected. I even shut down all four disconnected VMs and put all three hosts in maintenance mode just to be sure. None of these actions helped.
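
Before heading to the Networking view, a quick host-side check from the ESXi shell can also reveal which switch ports are in use and which client owns them. Just a sketch – output formatting differs a bit between builds:

  # List all in-use switch ports on this host, along with the attached client
  # (VM vNICs, vmkernel interfaces and vmnic uplinks all appear here)
  net-stats -l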

I then navigated to the Networking view in vCenter to have a look at the DVS associated with the compute-b cluster. In the ‘Ports’ view, you can get a good idea of what exactly is still connected to the distributed switch. To my surprise a VM called win-b1 was actually still showing as ‘Active’ and ‘Connected’ to the dvPortgroup associated with a Logical Switch!


This dvPort state is clearly wrong – first of all, the VM was powered off, so it could not be ‘Link Up’. Secondly, I thought I had removed the VM. Or did I?


Although I didn’t see any failures, it doesn’t appear that this VM was removed from the Logical Switch. Maybe I missed it, or perhaps it was a quirk due to the bug outlined in KB 2145889, where DirectPath I/O is enabled on VMs created with the vSphere Web Client. This was the only VM that had this option enabled, but despite my best efforts I could not reproduce the problem. Regardless, knowing what the problem was, I could simply disconnect the NIC and add it to another temporary portgroup.

This adjustment appeared to refresh the DVS port state, and I was then able to remove the cluster from the transport zone successfully.

When in doubt, don’t hesitate to dig into the NSX Manager logging. If the UI doesn’t tell you why something didn’t work or is light on details, the logging can often set you in the right direction!
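
If you’re working from a downloaded log bundle instead of the live appliance, even something as simple as the filter below would have surfaced the failure above in vsm.log:

  # Filter the manager log for errors around the transport zone shrink task
  grep -E 'ERROR|Shrink' vsm.log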

NSX 6.3.5 Now Available!

Late yesterday, VMware made available NSX 6.3.5 (Build Number 7119875) for download. This is a full maintenance release including over 32 documented bug fixes. As you may recall, 6.3.4 was a minor patch release with only a few fixes, which is why 6.3.5 has been released so soon afterward.

There are numerous fixes in this release that will be of interest to a lot of customers. Of special note are the fixes related to Guest Introspection – a feature leveraged by 3rd party AV and security products, as well as the IDFW. I was very happy to see that GI got front and center attention in 6.3.5. Along with some needed CPU utilization fixes, there is also a fix for issue number 1897878, outlined in VMware KB 2151235, which I’ve seen quite a bit of in the wild.

Another interesting behavior change is also quoted in the ‘What’s New in 6.3.5’ section:

“Guest Introspection service VM will now ignore network events sent by guest VMs unless Identity Firewall or Endpoint Monitoring is enabled”

This is a feature that we’ve sometimes manually disabled in older releases in very large deployments to improve 3rd party A/V scalability. The vast majority of customers don’t use ‘Network Introspection’ services, so it’s good to see that it’s now off by default unless needed.

Those using Guest Introspection should definitely consider upgrading to 6.3.5.

Another statement that I’m sure many people will be interested in is:

“Host prep now has troubleshooting enhancements, including additional information for “not ready” errors”

EAM generally hasn’t provided very descriptive information in the ‘Not Ready’ dialog, so this is also a very welcome change.

And of course, the controller disconnect issue and password expiry issues are also fixed in 6.3.5. This obviously brings up a good question – should you still patch with the re-released versions of 6.3.3/6.3.4, or go straight to 6.3.5? I see no reason why you would not go straight to 6.3.5.

Here is some of the relevant information:

NSX 6.3.5 Download Page
NSX 6.3.5 Release Notes

Definitely have a read through the release notes – there are some real gems in there. I hope to get this deployed in my home lab sometime in the next few days!