Time for another NSX troubleshooting scenario! Welcome to the fourth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 4
As always, we’ll start with a customer problem statement:
“We recently deployed Cross-vCenter NSX for a remote datacenter location. When we try to add VMs to the universal logical switches, there are no VM vNICs in the list to add. This works fine at the primary datacenter.”
This customer’s environment is the same one outlined in scenario 3, but keep in mind that this scenario should be treated independently. Forget everything from the previous scenario.
The main location is depicted on the left. A three host cluster called compute-a exists there. All of the VLAN backed networks route through a router called vyos. The Universal Control Cluster exists at this location, as does the primary NSX manager.
The ‘remote datacenter’ is to the right of the dashed line. The single ‘compute-r’ cluster there is associated with the secondary NSX manager at that location. According to the customer, this was only recently added.
Thinking about upgrading to NSX 6.4.0? As I discussed in my recent Ten Tips for a Successful NSX Upgrade post, it’s always a good idea to do your research before upgrading. Along with reading the release notes, checking the VMware Compatibility Matrix is essential.
VMware just updated some of the compatibility matrices to include information about 6.4.0. Here are the relevant links:
From an NSX upgrade path perspective, you’ll be happy to learn that any current build of NSX 6.2.x or 6.3.x should be fine. At the time of writing, this would be 6.2.9 and earlier as well as 6.3.5 and earlier.
On a positive note, the workaround VMware required for some older 6.2.x builds to get to 6.3.5 is no longer needed for 6.4.0. The underlying issue that required it has been resolved.
From a vCenter and ESXi 6.0 and 6.5 perspective, the requirements for NSX 6.4.0 remain largely unchanged from late 6.3.x releases. What you’ll immediately notice is that NSX 6.4.0 is not supported with vSphere 5.5. If you are running vSphere 5.5, you’ll need to get to at least 6.0 U2 before considering NSX 6.4.0.
From the NSX 6.4.0 release notes:
vSphere 6.0 – Supported: 6.0 Update 2, 6.0 Update 3. Recommended: 6.0 Update 3. vSphere 6.0 Update 3 resolves the issue of duplicate VTEPs on ESXi hosts after rebooting vCenter Server. See VMware Knowledge Base article 2144605 for more information.
vSphere 6.5 – Supported: 6.5a, 6.5 Update 1. Recommended: 6.5 Update 1. vSphere 6.5 Update 1 resolves the issue of EAM failing with OutOfMemory. See VMware Knowledge Base article 2135378 for more information.
Note: vSphere 5.5 is not supported with NSX 6.4.0.
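The support statements above can be encoded as a quick preflight check. This is just my own illustrative sketch – the version lists mirror the release notes quoted above, but the function name and structure are invented for this example, and the real source of truth is always the interoperability matrix.

```python
# Illustrative helper encoding the vSphere support statements from the
# NSX 6.4.0 release notes. Not an official API; the structure is my own.

SUPPORTED_VSPHERE = {
    "6.0": {"supported": ["6.0 U2", "6.0 U3"], "recommended": "6.0 U3"},
    "6.5": {"supported": ["6.5a", "6.5 U1"], "recommended": "6.5 U1"},
}

def check_vsphere_for_nsx_640(version: str) -> str:
    """Return a short support verdict for a given vSphere release."""
    if version.startswith("5.5"):
        return "not supported: vSphere 5.5 cannot run NSX 6.4.0"
    for info in SUPPORTED_VSPHERE.values():
        if version in info["supported"]:
            note = " (recommended)" if version == info["recommended"] else ""
            return f"supported{note}"
    return "unknown: check the VMware interoperability matrix"
```

Running `check_vsphere_for_nsx_640("6.0 U3")` would report the release as supported and recommended, while any 5.5 build is flagged immediately.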
It doesn’t appear that the matrix has been updated yet for other VMware products that interact with NSX, such as vCloud Director.
Before rushing out to upgrade to NSX 6.4.0, be sure to check for compatibility – especially if you are using any third-party products. It may be some time before other vendors certify their products for 6.4.0.
Stay tuned for a closer look at some of the new NSX 6.4.0 features!
The long anticipated NSX 6.4.0 (build number 7564187) has finally been released. There is a long list of new features that I’ll hopefully cover in more depth in a future post. For now, here are a few highlights:
Layer-7 Distributed Firewall – The DFW now supports some layer-7 application context.
IDFW – The Identity Firewall now supports user sessions via RDP.
HTML5 – Some of the new functionality is done in HTML5. You can see the supported functions here.
New Upgrade Coordinator – A single portal that simplifies the upgrade process! More on this later.
New HTML5 dashboard – The new NSX default home page.
System Scale – A new dashboard that helps you to monitor maximums and supported scale parameters.
Single Click Support Bundles – An awesome feature that GSS will appreciate!
API Improvements – The API now supports both JSON and XML formatting.
NSX Edge Enhancements – Many improvements to BGP and NAT, along with faster HA failovers.
I think most people will agree that the new L7 firewall is the most exciting addition. It really takes micro-segmentation to the next level and provides functionality that was previously only possible with third-party add-on products.
It’s also important to note that NSX 6.4.0 isn’t just about new features – it includes more than 33 fixes for bugs found in earlier releases!
Thanks to everyone who took the time to comment on the first half of scenario 3, both here and on Twitter. There were many great suggestions, and some were spot-on!
For more detail on the problem, some diagrams and other scoping information, be sure to check out the first half of scenario 3.
During the initial scoping in the first half, we didn’t really see too much out of the ordinary in the UI aside from some odd ‘red alarm’ exclamation marks on the compute-r hosts in the Host Preparation section.
More than one commenter pointed out that this needs to be investigated. I wholeheartedly agree. Despite seeing a green status for host VIB installation, firewall status and VXLAN, there is clearly still a problem. That problem is related to ‘Communication Channel Health’.
The communication channel health check was a new feature added in NSX 6.2 and makes it easy to see which hosts are having problems communicating with both NSX Manager and the Control Cluster. In our case, both esx-r1 and esx-r2 are reporting problems with their control plane agent (netcpa) to all three controllers.
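The health view is essentially a per-host rollup of channel states. As a conceptual sketch (the data structure and channel names here are my own illustration, not the actual NSX API format), it boils down to something like:

```python
# Conceptual sketch of the communication channel health rollup described
# above. The dict keys and states are illustrative, not NSX internals.

def channel_health(host_status: dict) -> str:
    """host_status maps channel name -> 'UP' or 'DOWN'; return a verdict."""
    down = [name for name, state in host_status.items() if state != "UP"]
    return "OK" if not down else "DEGRADED: " + ", ".join(sorted(down))
```

For a host like esx-r1, every netcpa-to-controller channel would report DOWN, which is exactly the red alarm condition seen in the UI.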
Welcome to the third installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.
NSX Troubleshooting Scenario 3
I’ll start off again with a brief customer problem description:
“We’ve recently deployed Cross-vCenter NSX for a remote datacenter location. All of the VMs at that location don’t have connectivity. They can’t ping their gateway, nor can they ping VMs on other hosts. Only VMs on the same host can ping each other.”
This is a pretty vague description of the problem, so let’s have a closer look at this environment. To begin, let’s look at the high-level physical interconnection between datacenters in the following diagram:
There isn’t a lot of detail above, but it helps to give us some talking points. The main location is depicted on the left. A three host cluster called compute-a exists there. All of the VLAN backed networks route through a router called vyos. The Universal Control Cluster exists at this location, as does the primary NSX manager.
As you’ve probably noticed, VMware is regularly releasing new NSX versions and updates to introduce new features and to improve stability and scalability. Eventually, you’ll find yourself in a situation where you’ll either want or need to upgrade. Maybe you want to take advantage of some new features, encountered a problem or your version isn’t supported any more. Whatever the reason, and whatever the version, here are ten tips that will help to ensure your upgrade is successful!
Tip 1 – Check The Compatibility Matrix
Before getting started, you’ll want to thoroughly check the compatibility of your target NSX version. That doesn’t just mean checking if you can upgrade from version X to version Y, but rather to check everything that interacts with NSX in the environment.
Start with the NSX Upgrade Path found at the VMware Interoperability Matrices page. There you may be surprised to find that several NSX upgrade paths are not supported. For example, you can’t upgrade from NSX 6.2.8 to 6.3.2, nor from 6.2.6 to 6.3.1.
Once you’ve confirmed that your target version is supported in the upgrade path, you’ll want to look at the Interoperability Matrix to ensure products like vSphere and Cloud Director are compatible. Again, there are several incompatible releases that you may not expect. For example, NSX 6.3.3 and later releases aren’t compatible with vCenter Server 6.0 U1 and older, but are compatible with all releases of 5.5. Another example is the initial release of vSphere 6.5. Only 6.5a or later can be used with any version of NSX 6.3.x.
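A simple way to avoid tripping over these is to script a preflight check against the known-bad paths. The set below contains only the two examples mentioned above – it is illustrative, not a complete list, and the authoritative source remains the VMware Interoperability Matrices page.

```python
# Preflight sketch using the two unsupported paths called out above as
# examples. Illustrative subset only; always verify against the matrix.

BLOCKED_UPGRADE_PATHS = {
    ("6.2.8", "6.3.2"),
    ("6.2.6", "6.3.1"),
}

def upgrade_path_ok(current: str, target: str) -> bool:
    """Return False for paths known to be unsupported (illustrative subset)."""
    return (current, target) not in BLOCKED_UPGRADE_PATHS
```

A check like this is cheap insurance before kicking off an upgrade window.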
If you are new to NSX or looking to evaluate it in the lab, there is one very common issue that you may run into. After going through the initial steps of deploying and registering NSX Manager with vCenter, you may be surprised to find that there are no manageable NSX managers listed under ‘Networking and Security’ in the Web Client. Although the registration and Web Client plugin installation appears successful, there is often an extra step needed before you can manage things.
One of the first tasks involved in deploying NSX is to register NSX Manager with a vCenter Server. This is done for inventory management and synchronization purposes. The NSX Manager can be optionally registered with SSO as well.
The vCenter user that is used for registration needs to have the highest level of privileges for NSX to work correctly. The NSX install guide clearly states that this must be the vCenter ‘Administrator’ role.
From the NSX Install Guide:
“You must have a vCenter Server user account with the Administrator role to synchronize NSX Manager with the vCenter Server. If your vCenter password has non-ASCII characters, you must change it before synchronizing the NSX Manager with the vCenter Server.”
Because of these requirements, it’s quite common to use the SSO administrator account – usually firstname.lastname@example.org. A service account is also often created for this purpose to more easily identify and distinguish NSX tasks. Either way, these are not normally accounts that you’d use for day-to-day administration in vSphere.
By default, NSX will only assign its ‘Enterprise Administrator’ role to the user account that was used to register it with vCenter Server. This means that by default, only that specific vCenter user will have access to the NSX manager from within the Web Client.
That said, if you are experiencing this problem, you are probably not logged in with the vCenter user that was used for registration purposes. To grant access to other users, you’ll need to log into the vSphere Web Client using the registration user account, and then add additional users and groups.
In my lab, I’ve just logged in with an Active Directory user called ‘email@example.com’. This user has full administrator privileges in vCenter, but has no access to any NSX Managers:
If I log out, and log back in with the firstname.lastname@example.org account that was used for vCenter registration, I can see the NSX managers that were registered.
In my lab, I’ve got a secondary deployed as well, but we’ll focus only on 172.16.10.40. If I click on that manager in the list, I’m able to go to the ‘Users’ tab to see what the default permissions look like:
As you can see, only one user – the SSO administrator account used for registration – has the requisite role for administration via the Web Client. In my lab, I want to provide full access to an AD group called ‘VMware Admins’ and an individual user called ‘Test’.
Both vCenter users and groups can be specified here. As long as vCenter can authenticate them – either via SSO, local authentication or even AD – they are fair game.
Another common mistake is selecting the NSX Administrator role rather than Enterprise Administrator. NSX Administrator sounds like the highest privilege level, but it’s actually Enterprise Administrator that gives you all the keys to the kingdom. You won’t be able to administer certain things – including user permissions – unless Enterprise Administrator is chosen.
Once this is done, you’ll see the users and groups listed and should now have the correct permissions to administer the NSX deployment!
Keep in mind that if you’ve got more than one NSX manager deployed, you’ll need to set this on each independently.
Have any questions or want more information? Please feel free to leave a comment below or reach out to me on Twitter (@vswitchzero)
ARP suppression is one of the key fundamental features in NSX that helps to make the product scalable. By intercepting ARP requests from VMs before they are broadcast out on a logical switch, the hypervisor can do a simple ARP lookup in its own cache or on the NSX control cluster. If an ARP entry exists on the host or control cluster, the hypervisor can respond directly, avoiding a costly broadcast that would likely need to be replicated to many hosts.
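The decision flow described above can be sketched in a few lines. This is purely a conceptual illustration of the lookup order – the caches and function here are simplified stand-ins, not actual NSX internals:

```python
# Conceptual sketch of the ARP suppression lookup order described above.
# Real NSX behavior involves netcpa and controller tables; this is a toy.

def resolve_arp(ip, host_cache, controller_table):
    """Return (mac, how_resolved) for an ARP request on a logical switch."""
    if ip in host_cache:                      # host answers from its own cache
        return host_cache[ip], "host cache"
    if ip in controller_table:                # control cluster supplies the entry
        return controller_table[ip], "control cluster"
    return None, "broadcast"                  # fall back to flooding the VNI
```

Only the last case results in the costly broadcast that may need to be replicated to many hosts.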
ARP Suppression has existed in NSX since the beginning, but it was only available for VMs connected to logical switches. Up until NSX 6.2.4, the DLR kernel module did not benefit from ARP suppression and every non-cached entry needed to be broadcast out. Unfortunately, the DLR – like most routers – needs to ARP frequently. This can be especially true due to the easy L3 separation that NSX allows using logical switches and efficient east-west DLR routing.
Although the 6.2.4 and later DLR kernel modules include code to take advantage of ARP suppression, a large number of deployments are likely not actually benefiting from the feature due to a recently identified problem.
VMware KB 51709 briefly describes this issue, and makes note of the following conditions:
“DLR ARP Suppression may not be effective under some conditions which can result in a larger volume of ARP traffic than expected. ARP traffic sent by a DLR will not be suppressed if an ESXi host has more than one active port connected to the destination VNI, for example the DLR port and one or more VM vNICs.”
What isn’t clear in the KB article, but can be inferred from the solution, is that the problem is related to VLAN tagging on logical switch dvPortgroups. Any dvPortgroup associated with a logical switch that has a VLAN ID specified is impacted by this problem.
It’s been a while since my last retro build post, but the build is finally complete! Actually, it’s been done for several weeks, but I just haven’t had time to break out the camera and get some pictures.
Without further ado, let’s have a look at the finishing touches and the final build!
For the case, I decided to use an old Antec NSK 3480 that I had lying around. I love the very simple appearance, but most of all, the small dimensions. With a depth of only 14 inches, it’s only as deep as it is tall. This can make for some challenges, but also makes it a perfect fit for slim boards.
Although it really doesn’t look like a retro box, I like that it’s very unassuming and feels a lot like a ‘sleeper’ build. I’ve got an old Pentium 90 and yellowed 486 rig that I’ll save for that true old-school appearance!
The Antec case was perfect for the MSI MS-6160 board. It even left enough space at the front of the case for the IDE-to-SD adapter board. For more detail on the motherboard, check out Part 3 of the series.
Welcome to the second installment of a new series of NSX troubleshooting scenarios. This is the second half of scenario two, where I’ll perform some troubleshooting and resolve the problem.
Please see the first half for more detail on the problem symptoms and some scoping.
As mentioned in the first half, the problem is limited to a host called esx-a1. As soon as a guest moves to that host, it has no network connectivity. If we move a guest off of the host, its connectivity is restored.
We have one VM called win-a1 on host esx-a1 for testing purposes at the moment. As expected, the VM can’t be reached.
To begin, let’s have a look at the host from the CLI to figure out what’s going on. The UI is reporting that the host isn’t prepared and that it doesn’t have any VTEPs created. In reality, we know a VTEP exists, but let’s confirm.
To begin, we’ll check to see if any of the VIBs are installed on this host. With NSX 6.3.x, we expect to see two VIBs listed – esx-vsip and esx-vxlan.
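If you do this across many hosts, the check is easy to script. This sketch parses the first column of `esxcli software vib list` output and reports which of the two expected NSX 6.3.x VIBs are missing – the sample output lines in the test are made up for illustration, but the VIB names come straight from the check described above.

```python
# Scripted version of the VIB check above. Parses the name column of
# `esxcli software vib list` output; sample input is illustrative.

EXPECTED_NSX_VIBS = {"esx-vsip", "esx-vxlan"}

def missing_nsx_vibs(vib_list_output: str) -> set:
    """Return the expected NSX VIBs absent from esxcli output."""
    installed = {
        line.split()[0]
        for line in vib_list_output.splitlines()
        if line.strip()
    }
    return EXPECTED_NSX_VIBS - installed
```

An empty result means both VIBs are present; anything returned is a host preparation problem worth chasing.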