NSX Troubleshooting Scenario 1

Welcome to the first of what I hope to be many NSX troubleshooting posts. As someone who has been working in back-line support for many years, troubleshooting is really the bread and butter of what I do every day. Solving problems in vSphere can be challenging enough, but NSX adds another thick layer of complexity to wrap your head around.

I find that there is a lot of NSX documentation out there, but most of it covers how to configure NSX and how it works – not a whole lot covers troubleshooting. What I hope to do in these posts is spark some conversation and share some of the common issues I run across from day to day. Each scenario will hopefully be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there. I hope to leave a gap of a few days between the problem and solution posts to give people some time to comment, ask questions and provide their thoughts on what the problem could be!

NSX Troubleshooting Scenario 1

As always, let’s start with a somewhat vague customer problem description:

“Help! I’ve deployed a new cluster (compute-b) and for some reason I can’t access internal web sites on the compute-a cluster or at any other internet site.”

Of course, this is really only a small description of what the customer believes the problem to be. One of the key tasks for anyone working in support is to scope the problem and put together an accurate problem statement. But before we begin, let’s have a look at the customer’s environment to better understand how the new compute-b cluster fits into the grand scheme of things.

Continue reading “NSX Troubleshooting Scenario 1”

Check NSX 6.2.x Compatibility Before Upgrading to 6.3.5!

Unlike previous 6.3.x releases, 6.3.5 introduces new minimum version compatibility requirements for upgrades. This is true not only from a vSphere perspective, but also for the version of NSX 6.2.x you are running. If you are running an older 6.2.0, 6.2.1 or 6.2.2 release of NSX, you’ll need to upgrade to at least 6.2.4 before taking the big step up to 6.3.5. VMware has just updated the NSX Upgrade Matrix to reflect this requirement:

[Screenshot taken from the VMware Interoperability Matrix site.]

I expect that VMware will update the 6.3.5 release notes and release a new KB article very shortly. I’ll provide some more detail when that is out. In the meantime, please be sure to heed the version requirements or you will most likely run into problems.

Thankfully there aren’t too many customers still running these old releases of 6.2.x, but if you have already attempted the upgrade and hit problems, you’ll need to roll back. If you took a cold snapshot of the NSX Manager or a clone, you can roll back that way. Otherwise, you’ll need to deploy the original 6.2.x OVA again and restore your FTP backup.

** Edit 11/29/2017: VMware has just updated the NSX 6.3.5 release notes to include mention of the minimum version requirements. The following statement was added:

Important: If you are upgrading NSX 6.2.0, 6.2.1, or 6.2.2 to NSX 6.3.5, you must complete a workaround before starting the upgrade. See VMware Knowledge Base article 000051624 for details.

VMware calls it a “workaround” but it’s basically just upgrading to an interim version before going to 6.3.5. In KB 000051624, VMware recommends going to 6.2.9, as that workflow has been tested; i.e. upgrading from 6.2.0 to 6.2.9, and then to 6.3.5. On a positive note, only your NSX Manager needs to be upgraded to 6.2.9; no other components need to be upgraded before proceeding on to 6.3.5.
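
On that note, it’s worth confirming exactly which version and build your NSX Manager is running before planning the hop. Below is a minimal Python sketch using the NSX-v appliance management API; the hostname and credentials are placeholders, and the response field names should be verified against the API guide for your release.

# Check the current NSX Manager version before planning an upgrade path.
# The endpoint is from the NSX-v appliance management API; hostname and
# credentials below are lab placeholders.
import requests
from requests.auth import HTTPBasicAuth

NSX_MANAGER = "https://nsxmanager.lab.local"  # placeholder hostname

resp = requests.get(
    NSX_MANAGER + "/api/1.0/appliance-management/global/info",
    auth=HTTPBasicAuth("admin", "password"),
    headers={"Accept": "application/json"},
    verify=False,  # lab only; use proper certificate validation in production
)
resp.raise_for_status()
info = resp.json()["versionInfo"]  # field names may vary slightly by release
print("NSX Manager {majorVersion}.{minorVersion}.{patchVersion} "
      "build {buildNumber}".format(**info))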

If you attempt an upgrade from 6.2.2 or older releases, my understanding is that the upgrade will appear to complete successfully, but your configuration will be missing. VMware calls out the remediation steps of rolling back to the previous version should you run into this issue.

NSX Transport Zone Cluster Removal Issues

Ever remove a cluster from your NSX transport zone only to see it reappear on the list of clusters available for disconnection? Unfortunately, the task likely failed but NSX doesn’t always do a very good job of telling you why in the UI.

I was recently attempting to remove a cluster called compute-b from my transport zone so that I could remove and rebuild the hosts within. Needless to say, I ran into some difficulties and wanted to share my experience.

If you are interested in some more detailed instructions on how to decommission NSX prepared hosts, you can check out my post on Completely Removing NSX. From a high level, the steps I wanted to perform were the following:

  1. Disconnect all VMs from logical switches in the cluster to be removed.
  2. Remove the cluster from the transport zone. This will remove all port groups associated with the logical switches (assuming no other clusters are connected to the same distributed switch). The same operation can also be driven via the NSX API, as shown in the sketch after this list.
  3. ‘Unconfigure’ VXLAN from the ‘Logical Network Preparation’ tab to remove all VTEPs.
  4. Uninstall the NSX VIBs from the ‘Host Preparation’ tab.
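
Incidentally, the transport zone ‘shrink’ operation in step 2 is exposed through the NSX REST API, and one nice side effect of using it is that a failure comes back in the HTTP response instead of being silently dropped. Here’s a rough Python sketch; the scope ID, cluster MOID, hostname and credentials are placeholders, and the request body schema should be double-checked against the NSX API guide for your release.

# Remove (shrink) a cluster from a transport zone via the NSX-v REST API.
# vdnscope-1, domain-c26 and the manager hostname are placeholders; verify
# the body schema against the API guide for your release.
import requests
from requests.auth import HTTPBasicAuth

NSX_MANAGER = "https://nsxmanager.lab.local"  # placeholder hostname

body = """<vdnScope>
  <objectId>vdnscope-1</objectId>
  <clusters>
    <cluster>
      <cluster>
        <objectId>domain-c26</objectId>
      </cluster>
    </cluster>
  </clusters>
</vdnScope>"""

resp = requests.post(
    NSX_MANAGER + "/api/2.0/vdn/scopes/vdnscope-1?action=shrink",
    data=body,
    headers={"Content-Type": "application/xml"},
    auth=HTTPBasicAuth("admin", "password"),
    verify=False,  # lab only
)
# Unlike the UI workflow, a failed shrink surfaces its error message here.
print(resp.status_code, resp.text)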

To begin, I used the ‘Remove VM’ button from the Logical Switches view in NSX. I removed all four of the VMs attached to the only Logical Switch in use at the moment. I saw a bunch of VM reconfigure tasks complete, and assumed the operation had been successful.

I then went to disconnect compute-b from the transport zone called Primary TZ. After removing the cluster and clicking OK, the dialog closed, giving the impression that the task was successful. Oddly though, I didn’t see the tasks related to port group removal that I expected to see.


Sure enough, I went back into the ‘Disconnect Clusters’ dialog and saw the compute-b cluster still in the list. Unfortunately, NSX doesn’t appear to report failures for this particular workflow in the UI.

Having worked in support for many years, I followed my first instinct and checked the NSX Manager vsm.log file for detail on why the operation failed. I found the following failure details:

2017-11-24 18:50:13.016 GMT+00:00 INFO taskScheduler-8 JobWorker:243 - Updating the status for jobinstance-101742 to EXECUTING
2017-11-24 18:50:13.022 GMT+00:00 INFO taskScheduler-8 SchedulerQueueServiceImpl:64 - [TF] Created a new bucket for module default_module and total number of buckets 1
2017-11-24 18:50:13.022 GMT+00:00 INFO taskScheduler-8 SchedulerQueueServiceImpl:80 - The task ShrinkVdnScope-vdnscope-1 (1511549412299) [id:task-102755] is added to the SchedulerQueue
2017-11-24 18:50:13.022 GMT+00:00 INFO pool-10-thread-1 ScheduleSynchronizer:48 - Start executing task: task-102755 and running executor threads 1
2017-11-24 18:50:13.042 GMT+00:00 INFO TaskFrameworkExecutor-3 VdnScopeServiceImpl$2:995 - New VDS (count: 1) is being removed when shrinking scope vdnscope-1. Shrinkingwires.
2017-11-24 18:50:13.061 GMT+00:00 ERROR TaskFrameworkExecutor-3 VirtualWireServiceImpl:1577 - validation failed at delete backing for dvportgroup-815 in the scope: vdnscope-1
2017-11-24 18:50:13.061 GMT+00:00 ERROR TaskFrameworkExecutor-3 VdnScopeServiceImpl$2:1015 - Shrink operation failed on TZ vdnscope-1
2017-11-24 18:50:13.061 GMT+00:00 ERROR TaskFrameworkExecutor-3 Worker:219 - BaseException thrown while executing task instance taskinstance-166334
com.vmware.vshield.vsm.vdn.exceptions.XvsException: core-services:819:Transport zone vdnscope-1 contraction error.
 at com.vmware.vshield.vsm.vdn.service.VirtualWireServiceImpl.validateShrink(VirtualWireServiceImpl.java:1578)
 at com.vmware.vshield.vsm.vdn.service.VdnScopeServiceImpl$2.doTask(VdnScopeServiceImpl.java:1001)
 at com.vmware.vshield.vsm.vdn.service.task.AbstractVdnRunnableTask.run(AbstractVdnRunnableTask.java:80)
 at com.vmware.vshield.vsm.task.service.Worker.runtask(Worker.java:184)
 at com.vmware.vshield.vsm.task.service.Worker.executeAsync(Worker.java:122)
 at com.vmware.vshield.vsm.task.service.Worker.run(Worker.java:99)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
2017-11-24 18:50:13.068 GMT+00:00 INFO TaskFrameworkExecutor-3 JobWorker:243 - Updating the status for jobinstance-101742 to FAILED

There is quite a bit there, but the key takeaways are that the task clearly failed and that the reason was: “validation failed at delete backing for dvportgroup-815 in the scope: vdnscope-1”.

This doesn’t tell us why exactly, but it seems clear that the operation can’t delete dvportgroup-815 and fails. In my experience, 99% of the time this is because there is still something connected to the portgroup.

Since there were only four VMs in the cluster and no ESGs or DLRs, I wasn’t sure what could possibly still be connected. I even shut down all four disconnected VMs and put all three hosts in maintenance mode just to be sure. None of these actions helped.

I then navigated to the Networking view in vCenter to have a look at the DVS associated with the compute-b cluster. In the ‘Ports’ view, you can get a good idea of exactly what is still connected to the distributed switch. To my surprise, a VM called win-b1 was still showing as ‘Active’ and ‘Connected’ to the dvPortgroup associated with a Logical Switch!
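
For larger environments, eyeballing the Ports view doesn’t scale well, and the same check can be scripted. Below is a pyVmomi sketch that lists every dvPort with something still connected; the vCenter hostname, credentials and DVS name are placeholders for my lab.

# List all connected dvPorts on a DVS to find what is blocking the shrink.
# Hostname, credentials and DVS name are lab placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.DistributedVirtualSwitch], True)
    for dvs in view.view:
        if dvs.name != "dvs-compute-b":  # the DVS backing compute-b
            continue
        # Only fetch ports that currently report a connected entity.
        for port in dvs.FetchDVPorts(vim.dvs.PortCriteria(connected=True)):
            entity = port.connectee.connectedEntity if port.connectee else None
            print(port.key, port.portgroupKey,
                  entity.name if entity else "<unknown>")
    view.DestroyView()
finally:
    Disconnect(si)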


This dvPort state is clearly wrong – first of all, the VM was powered off, so it could not be ‘Link Up’. Secondly, I thought I had removed the VM. Or did I?


Although I didn’t see any failures, it doesn’t appear that this VM was ever removed from the Logical Switch. Maybe I missed it, or perhaps it was a quirk due to the bug outlined in KB 2145889, where DirectPath I/O is enabled on VMs created with the vSphere Web Client. This was the only VM that had this option enabled, but despite my best efforts I could not reproduce the problem. Regardless, knowing what the problem was, I could simply disconnect the NIC and add it to another temporary portgroup.
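
Since only one VM was affected, I did this through the UI, but with many VMs in this state it could be scripted. Below is a rough pyVmomi sketch that moves a VM’s first network adapter to a standard portgroup; the VM name and ‘temp-pg’ portgroup are placeholders, and it assumes ‘temp-pg’ already exists on the host.

# Move a VM's first vNIC to a temporary standard portgroup, releasing the
# stale dvPort. VM name and portgroup name are placeholders; 'temp-pg' is
# assumed to already exist on the host.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "win-b1")
    nic = next(dev for dev in vm.config.hardware.device
               if isinstance(dev, vim.vm.device.VirtualEthernetCard))
    # Repoint the adapter from the logical switch dvPortgroup to 'temp-pg'.
    nic.backing = vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(
        deviceName="temp-pg")
    spec = vim.vm.ConfigSpec(deviceChange=[
        vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
            device=nic)])
    vm.ReconfigVM_Task(spec=spec)  # wait for this task before re-checking
    view.DestroyView()
finally:
    Disconnect(si)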

This adjustment appeared to refresh the DVS port state, and I was then able to remove the cluster from the transport zone successfully.

When in doubt, don’t hesitate to dig into the NSX Manager logging. If the UI doesn’t tell you why something didn’t work or is light on details, the logging can often set you in the right direction!
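
If you’d rather not poke around on the appliance itself, the manager’s tech support bundle – which includes vsm.log – can also be pulled over the API. A sketch follows; the endpoint is from the NSX-v appliance management API, and the hostname and credentials are placeholders, so verify the details against the API guide for your release.

# Download the NSX Manager tech support bundle (which includes vsm.log)
# via the appliance management API instead of the UI. Endpoint, hostname
# and credentials should be verified against your release's API guide.
import requests
from requests.auth import HTTPBasicAuth

NSX_MANAGER = "https://nsxmanager.lab.local"  # placeholder hostname

resp = requests.get(
    NSX_MANAGER + "/api/1.0/appliance-management/techsupportlogs/NSX",
    auth=HTTPBasicAuth("admin", "password"),
    verify=False,  # lab only
    stream=True,
)
resp.raise_for_status()
with open("nsx-manager-tech-support.tar.gz", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        f.write(chunk)
print("Saved nsx-manager-tech-support.tar.gz")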

NSX 6.3.5 Now Available!

Late yesterday, VMware made NSX 6.3.5 (Build Number 7119875) available for download. This is a full maintenance release that includes over 32 documented bug fixes. As you may recall, 6.3.4 was a minor patch release with only a few fixes, which is why 6.3.5 has been released so soon afterward.

There are numerous fixes in this release that will be of interest to a lot of customers. Of special note are the fixes related to Guest Introspection – a feature leveraged by 3rd party AV and security products, as well as the IDFW. I was very happy to see that GI got front and center attention in 6.3.5. Along with some needed CPU utilization fixes, there is also a fix for issue number 1897878, outlined in VMware KB 2151235, which I’ve seen quite a bit of in the wild.

Another interesting behavior change is also quoted in the ‘What’s New in 6.3.5’ section:

“Guest Introspection service VM will now ignore network events sent by guest VMs unless Identity Firewall or Endpoint Monitoring is enabled”

This is a feature that we’ve sometimes manually disabled in older releases in very large deployments to improve 3rd party A/V scalability. The vast majority of customers don’t use ‘Network Introspection’ services, so it’s good to see that it’s now off by default unless needed.

Those using Guest Introspection should definitely consider upgrading to 6.3.5.

Another statement that I’m sure many people will be interested in is:

“Host prep now has troubleshooting enhancements, including additional information for “not ready” errors”

EAM generally hasn’t provided very descriptive statements in the ‘Not Ready’ dialog, so this is also a very welcome change.

And of course, the controller disconnect issue and password expiry issues are also fixed in 6.3.5. This obviously brings up a good question – should you still patch with the re-released versions of 6.3.3/6.3.4, or go straight to 6.3.5? I see no reason why you would not go straight to 6.3.5.

Here is some of the relevant information:

NSX 6.3.5 Download Page
NSX 6.3.5 Release Notes

Definitely have a read through the release notes – there are some real gems in there. I hope to get this deployed in my home lab sometime in the next few days!

New NSX Controller Issue Identified in 6.3.3 and 6.3.4

Having difficulty deploying NSX controllers in 6.3.3? You are not alone. VMware has just made public a newly discovered bug impacting NSX controllers based on the Photon OS platform. This includes NSX 6.3.3 and 6.3.4. VMware KB 000051144 provides a detailed summary of the symptoms, but essentially:

  • New NSX 6.3.3 Controllers will fail to deploy after November 2nd, 2017.
  • New NSX 6.3.4 Controllers will fail to deploy after January 1st, 2018.
  • Controllers deployed before these dates will prompt for a new password on the next login attempt.

In other words, if you attempted a fresh deployment of NSX 6.3.3 today, you would not be able to deploy a control cluster.

The issue appears to stem from root and admin account credentials expiring 90 days after the creation of the NSX build. This is not 90 days after it’s deployed, but rather 90 days after the build was created by VMware. This is why NSX 6.3.3 will begin having issues after November 2nd, while 6.3.4 will be fine until January 1st, 2018.

Some important points:

  • If you have already deployed NSX 6.3.3 or 6.3.4, don’t worry – your controllers will continue to function just fine. Having expired admin/root passwords will not break communication between NSX components.
  • This issue does not pose any kind of datapath impact. It will only pose issues if you attempt a fresh deployment, attempt to upgrade or delete and re-deploy controllers.
  • Until you’ve had a chance to implement the workaround in KB 000051144, you should obviously avoid any of the mentioned workflows. (If you want to keep an eye on your existing controllers in the meantime, see the sketch below.)
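
For reference, a quick way to check on your existing control cluster while you wait is the controller API. A hedged Python sketch follows; the hostname and credentials are placeholders, and the response fields should be verified against the API guide for your release.

# Check NSX controller status via the NSX-v API.
# Hostname and credentials are placeholders; verify response field names
# against the API guide for your release.
import xml.etree.ElementTree as ET
import requests
from requests.auth import HTTPBasicAuth

NSX_MANAGER = "https://nsxmanager.lab.local"  # placeholder hostname

resp = requests.get(
    NSX_MANAGER + "/api/2.0/vdn/controller",
    auth=HTTPBasicAuth("admin", "password"),
    verify=False,  # lab only
)
resp.raise_for_status()
for ctrl in ET.fromstring(resp.content).iter("controller"):
    print(ctrl.findtext("id"), ctrl.findtext("status"))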

It appears that VMware will be re-releasing new builds of the existing 6.3.3 and 6.3.4 downloads with the fix in place, along with a fix in 6.3.5 and future releases. They have already added the following text to the 6.3.3 and 6.3.4 release notes:

Important information about NSX 6.3.3: NSX for vSphere 6.3.3 has been repackaged to address the problems mentioned in VMware Knowledge Base articles 2151719 and 000051144. The originally released build 6276725 is replaced with build 7087283. Please refer to the Knowledge Base articles for more detail. See Upgrade Notes for upgrade information.

Old 6.3.3 Build Number: 6276725
New 6.3.3 Build Number: 7087283

Old 6.3.4 Build Number: 6845891
New 6.3.4 Build Number: 7087695

As an added bonus, VMware took advantage of this situation to include the fix for the NSX controller disconnect issue in 6.3.3 as well. This other issue is described in VMware KB 2151719. Despite what it says in the 6.3.4 release notes, only 6.3.3 was susceptible to the issue outlined in KB 2151719.

If you’ve already found yourself in this predicament, VMware has provided an API call that can be used as a workaround. The API call appears to correct the issue by setting the appropriate accounts to never expire. If the password has already expired, it’ll reset it. It’s then up to you to change the password. Detailed steps can be found in KB 000051144.

It’s unfortunate that another controller issue has surfaced after the controller disconnect issue discovered in 6.3.3. Whenever there is a major change, like the introduction of a new underlying OS platform, things like this can be missed. Thankfully the impact to existing deployments is more of an inconvenience than a serious problem. Kudos to the VMware engineering team for working so quickly to get these fixes and workarounds released!


Building a Retro Gaming Rig – Part 5

In Part 4 of this series, I took a look at some sound card options. I’m now getting a lot closer to having this build finished, but there is still one key piece missing – storage.

When dealing with old hardware, hard drives simply don’t age well. Anything with moving parts is prone to failure and degradation over time. Not only that, but as the bearings wear down, the drives develop an annoying whine and droning noise that can be heard rooms away.

I’m all for the genuine nostalgic experience, but slow and noisy drives with 20 years of wear behind them are not something I’m particularly interested in. With that in mind, I knew that I wanted to retrofit a modern storage solution to work with this machine.

Challenges and Limitations

Having worked with older hardware before, I was prepared for some challenges along the way. There are numerous drive size limitations and other BIOS quirks that I’d need to navigate around. Below are just a few, with the arithmetic behind them sketched after the list:

  • Most 486 and older systems are limited to a 504MB hard drive due to a limit of 1024 cylinders being supported in the BIOS.
  • Many systems in the late nineties simply didn’t support drives larger than 32GB due to other BIOS limitations.
  • With a newer BIOS, some IDE systems can support drives as large as 128GB, which was the 28-bit LBA limit of the ATA interface.
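
The first and last of those ceilings fall out of simple arithmetic, shown below in Python. The 32GB limit, by contrast, came from firmware quirks in specific BIOSes rather than a clean addressing boundary.

# Where the BIOS-era capacity ceilings come from.
# CHS geometry: the BIOS int 13h interface exposed at most 1024 cylinders
# x 16 heads x 63 sectors, at 512 bytes per sector.
chs_bytes = 1024 * 16 * 63 * 512
print(f"CHS limit:   {chs_bytes / 1024**2:.0f} MiB")    # ~504 MiB

# 28-bit LBA: 2**28 addressable sectors of 512 bytes each.
lba28_bytes = (2 ** 28) * 512
print(f"LBA28 limit: {lba28_bytes / 1024**3:.0f} GiB")  # 128 GiB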

Clearly there are newer IDE drives with capacities beyond 128GB, but these drives require newer Ultra ATA 100/133 controllers. After doing some testing, I discovered that the Asus P2B that I outlined in Part 1 of this series had a 32GB drive limitation with the latest production BIOS and a 128GB limitation with the newest beta BIOS. The MSI MS-6160 that I covered in Part 3 was limited to 32GB. Since this was the board I wanted to use, I could only consider IDE solutions of 32GB or less if I wanted to stick with the onboard controller.

Continue reading “Building a Retro Gaming Rig – Part 5”

VUM Challenges During vCenter 6.5 Upgrade

After procrastinating for a while, I finally started the upgrade process in my home lab to go from vSphere 6.0 to 6.5. The PSC upgrade was smooth, but I hit a roadblock when I started the upgrade process on the vCenter Server appliance.

After going through some of the first steps in the process, I ran into the following error when trying to connect to the source appliance.


The exact text of the error reads:

“Unable to retrieve the migration assistant extension on source vCenter Server. Make sure migration assistant is running on the VUM server.”

I had forgotten that I even had Update Manager deployed. Because my lab is small, I generally applied updates manually to my hosts via the CLI. What I do remember, however, is being frustrated that I had to deploy a full-scale Windows VM to run the Update Manager service.

Continue reading “VUM Challenges During vCenter 6.5 Upgrade”

Building a Retro Gaming Rig – Part 4

Welcome to part 4 of my Building a Retro Gaming Rig series. Today I’ll be looking at some sound cards for the build.

Back in the early nineties when I first started taking an interest in PC gaming, most entry-level systems didn’t come with a proper sound card. I still remember playing the original Wolfenstein 3D using the integrated PC speaker on my friend’s 386 system. All of the beeps, boops and tones that speaker could produce still feel somewhat nostalgic to me. We had a lot of fun with games of that era, so we didn’t really think much about it. It wasn’t until 1994 that I got my first 486 system and a proper Sound Blaster 16. It was then that I realized what I had been missing out on. Despite having really crappy non-amplified speakers, the FM synthesized MIDI music and sound effects were just so awesome. And who can forget messing around with ‘Sound Recorder’ or playing CD audio in Windows 3.11!

With all that in mind, it was clear that I needed a proper sound card for my retro build. But that isn’t just a ‘checkbox’ to tick – on machines of this era there was quite a difference between cards, and to get the proper vintage experience I’d have to choose correctly.

Continue reading “Building a Retro Gaming Rig – Part 4”

Debunking the VM Link Speed Myth!

10Gbps from a 10Mbps NIC? Why not? Debunking the VM link speed myth once and for all!

** Edit on 11/6/2017: I hadn’t noticed before I wrote this post, but Raphael Schitz (@hypervisor_fr) beat me to the debunking! Please check out his great post on the subject as well here. **

I have been working with vSphere and VI for a long time now, and have spent the last six and a half years at VMware in the support organization. As you can imagine, I’ve encountered a great number of misconceptions from our customers but one that continually comes up is around VM virtual NIC link speed.

Every so often, I’ll hear statements like “I need 10Gbps networking from this VM, so I have no choice but to use the VMXNET3 adapter”, “I reduced the NIC link speed to throttle network traffic” and even “No wonder my VM is acting up, it’s got a 10Mbps vNIC!”

I think that VMware did a pretty good job documenting the role that varying vNIC types and link speeds played back in the VI 3.x and vSphere 4.0 era – back when virtualization was still a new concept to many. Today, I don’t think it’s discussed very much. People generally use the VMXNET3 adapter, see that it connects at 10Gbps and never look back. Not that the simplicity is a bad thing, but I think it’s valuable to understand how virtual networking functions in the background.

Today, I hope to debunk the VM link speed myth once and for all. Not with quoted statements from documentation, but through actual performance testing.
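
To give a flavour of the approach, here is a minimal Python sketch of the kind of test that settles the question: a bare TCP sender and receiver you can run between two VMs. This is an illustration of the method rather than the exact tooling used in the testing. If the advertised vNIC link speed were a hard cap, a 10Mbps adapter could never push much beyond about 1.25MB/s here.

# Minimal throughput test between two VMs, assuming Python 3 on both.
# Run with --server on one VM, then point --client at its IP from the other.
# If the vNIC "link speed" were a real cap, a 10Mbps adapter could never
# exceed roughly 1.25 MB/s in this test.
import argparse
import socket
import time

CHUNK = 64 * 1024  # 64 KiB send/receive buffer

def server(port):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", port))
        s.listen(1)
        conn, addr = s.accept()
        total, start = 0, time.time()
        with conn:
            while True:
                data = conn.recv(CHUNK)
                if not data:  # client closed the connection
                    break
                total += len(data)
        elapsed = time.time() - start
        print(f"Received {total / 1e6:.1f} MB in {elapsed:.1f}s "
              f"= {total * 8 / elapsed / 1e6:.1f} Mbps")

def client(host, port, seconds):
    payload = b"\x00" * CHUNK
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((host, port))
        end = time.time() + seconds
        while time.time() < end:
            s.sendall(payload)

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--server", action="store_true")
    p.add_argument("--client", metavar="HOST")
    p.add_argument("--port", type=int, default=5001)
    p.add_argument("--seconds", type=int, default=10)
    args = p.parse_args()
    if args.server:
        server(args.port)
    elif args.client:
        client(args.client, args.port, args.seconds)
    else:
        p.error("specify --server or --client HOST")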

Continue reading “Debunking the VM Link Speed Myth!”

NSX 6.2.9 Now Available for Download!

Although NSX 6.3.x is getting more time in the spotlight, VMware continues to patch and maintain the 6.2.x release branch. On October 26th, VMware made NSX for vSphere 6.2.9 (Build Number 6926419) available for download.

Below are the relevant links:

NSX 6.2.9 Download Page
NSX 6.2.9 Release Notes

This is a full patch release, not a minor maintenance release like 6.2.6 and 6.3.4 were. VMware documents a total of 26 fixed issues in the release notes. Some of these are pretty significant, relating to everything from DFW to EAM, and there are even some host PSOD fixes. Definitely have a look through the resolved issues section of the release notes for more detail.

On a personal note, I’m really happy to see NSX continue to mature and become more and more stable over time. Working in the support organization, I can confidently say that many of the problems we used to see often are just not around any more – especially with host preparation and the control plane. The pace in which patch releases for NSX come out is pretty quick and some may argue that it is difficult to keep up with. I think this is just something that must be expected when you are working with state of the art technology like NSX. That said, kudos to VMware Engineering for the quick turnaround on many of these identified issues!