NSX Troubleshooting Scenario 6 – Solution

As we saw in the first half of scenario 6, a fictional administrator enabled the DFW in their management cluster, which caused some unexpected filtering to occur. Their vCenter Server was no longer allowed the necessary HTTPS port 443 traffic needed for the vSphere Web Client to work.

Since we can no longer manage the environment or the DFW using the UI, we’ll need to revert this change using some other method.

As mentioned previously, we are fortunate in that NSX Manager is always excluded from DFW filtering by default. This is done to protect against this very type of situation. Because the NSX management plane is still fully functional, we should – in theory – still be able to send API-based DFW calls to NSX Manager. NSX Manager will in turn be able to publish these changes to the necessary ESXi hosts.

There are two relatively easy ways to fix this that come to mind:

  1. Use the approach outlined in KB 2079620. This is the equivalent of doing a factory reset of the DFW ruleset via API. This will wipe out all rules and they’ll need to be recovered or recreated.
  2. Use an API call to disable the DFW in the management cluster. This will essentially revert the exact change the user did in the UI that started this problem.

There are other options, but the two above will work to restore HTTP/HTTPS connectivity to vCenter. Once that is done, some remediation will be necessary to ensure this doesn’t happen again. Rather than picking a specific solution, I’ll go through both of them.
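To make the two options a little more concrete, here is a minimal Python sketch that only builds the method and URL for each call and does not send anything. The endpoint paths are my recollection of the NSX for vSphere API guide and KB 2079620, so treat them as assumptions and check them against the documentation for your version. The hostname nsxmanager.lab.local and the cluster ID domain-c41 are hypothetical placeholders.

```python
from typing import Tuple
from urllib.parse import quote


def dfw_reset_request(nsx_manager: str) -> Tuple[str, str]:
    """Option 1: reset the DFW ruleset to defaults (wipes all rules).

    Assumed endpoint, as I recall it from KB 2079620.
    """
    return ("DELETE", f"https://{nsx_manager}/api/4.0/firewall/globalroot-0/config")


def dfw_cluster_toggle_request(nsx_manager: str, cluster_moid: str,
                               enabled: bool) -> Tuple[str, str]:
    """Option 2: enable or disable the DFW on a single cluster.

    Assumed endpoint; cluster_moid is the cluster's managed object ID.
    """
    state = "true" if enabled else "false"
    path = f"/api/4.0/firewall/{quote(cluster_moid)}/enable/{state}"
    return ("PUT", f"https://{nsx_manager}{path}")


# Hypothetical values, for illustration only.
reset = dfw_reset_request("nsxmanager.lab.local")
disable = dfw_cluster_toggle_request("nsxmanager.lab.local", "domain-c41", False)
print(*reset)
print(*disable)
```

Either request would still need NSX Manager admin credentials and a REST client such as curl to actually send it.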

Continue reading

NSX Troubleshooting Scenario 6

Welcome to the sixth installment of my new NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

As always, we’ll start with a brief customer problem statement:

“Help! It looks like we accidentally blocked access to vCenter Server! We have two clusters, a compute and a management cluster. My colleague noticed the firewall was disabled on the management cluster and turned it on. As soon as he did that we lost all access to the vSphere Web Client.”

Well, this sounds like a ‘chicken or the egg’ dilemma – how can they recover if they can’t log in to the vSphere Web Client to revert the changes that broke things?

In speaking with our fictional customer, we learn that some rules are in place to block all HTTP/HTTPS access in the compute cluster. Because they are still deploying VMs and getting everything patched, they are using this as a temporary means to prevent all web access. Unfortunately, he can’t remember exactly what was configured in the firewall and there may be other restrictions in place.

This was a screenshot of the last thing he saw before his web client session started timing out:


Starting with some basic ping tests, we can see that the vCenter Server and NSX Manager are both still accessible from a layer-3 perspective:

Continue reading

Unboxing a 22 Year Old Microsoft Mouse

Finding a functional serial mouse for my ongoing 486 restoration project has been a challenge. Up until now, my retro rigs have had PS/2 ports that work with a variety of older optical mice. This isn’t the case with many custom-built systems from the early to mid-nineties. Unless your system was an IBM or some other name brand, you likely had to use a serial mouse.

Because of the peripheral divide in those days, there was demand for PS/2 as well as serial mice. This prompted manufacturers to create what was then known as ‘combo mice’. These mice would come with a simple PS/2 to serial adapter to allow support for both standards. When it came to keyboards, most if not all PS/2 keyboards were compatible with the common 5-pin DIN connector with a simple adapter. This is because the two connectors are electrically compatible and just need pin translation. With mice, however, this is not the case. For a PS/2 mouse to work with a PS/2 to serial adapter, it must have hardware support for both standards under the hood. Today I’m going to be looking at one of the iconic combo mice from the mid-nineties – the Microsoft Mouse.


I was fortunate enough to find this ‘new old stock’ mouse on eBay from a Canadian seller. It was brand new and still sealed, which is quite rare these days. Most of the serial-compatible mice I’ve come across are much the worse for wear and command exorbitant prices.

Continue reading

Missing NSX vdrPort and Auto Deploy

If you are running Auto Deploy and noticed your VMs didn’t have connectivity after a host reboot or upgrade, you may have run into the problem described in VMware KB 52903. I’ve seen this a few times now with different customers and thought a PSA may be in order. You can find all the key details in the KB, but I thought I’d add some extra context here to help anyone who may want more information.

I recently helped to author VMware KB 52903, which has just been made public. Essentially, it describes a race condition causing a host to come up without its vdrPort connected to the distributed switch. The vdrPort is an important component on an ESXi host that funnels traffic to/from the NSX DLR module. If this port isn’t connected, traffic can’t make it to the DLR for east/west routing on that host. Technically, VMs in the same logical switches will be able to communicate across hosts, but none of the VMs on this impacted host will be able to route.

The Problem

The race condition occurs when the host registers with the DVS too late in the boot process. Normally, the distributed switch should be initialized and registered long before the vdrPort gets connected. In some situations, however, DVS registration can be late. Obviously, if the host isn’t yet initialized/registered with the distributed switch, any attempt to connect something to it will fail. And this is exactly what happens.

Using the log lines from KB 52903 as an example, we can see that the host attempts to add the vdrPort to the distributed switch at 23:44:19:

2018-02-08T23:44:19.431Z error netcpa[3FFEDA29700] [Originator@6876 sub=Default] Failed to add vdr port on dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d, Not found

The operation fails because the DVS with the specified UUID is not found from the perspective of this host. It simply hasn’t been initialized yet. A few moments later, the DVS is finally ready for use on the host. Notice the time stamps – you can see the registration of the DVS about 9 seconds later:

2018-02-08T23:44:28.389Z info hostd[4F540B70] [Originator@6876 sub=Hostsvc.DvsTracker] Registered Dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d

The above message can be found in /var/log/hostd.log.
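If you want to measure the gap rather than eyeball it, the timestamps can be compared directly. This is a small Python sketch using the two log lines quoted above; the helper name log_timestamp is my own and isn’t part of any VMware tooling.

```python
from datetime import datetime

# The two log lines quoted from KB 52903.
NETCPA_LINE = ("2018-02-08T23:44:19.431Z error netcpa[3FFEDA29700] "
               "[Originator@6876 sub=Default] Failed to add vdr port on dvs "
               "96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d, Not found")
HOSTD_LINE = ("2018-02-08T23:44:28.389Z info hostd[4F540B70] "
              "[Originator@6876 sub=Hostsvc.DvsTracker] Registered Dvs "
              "96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d")


def log_timestamp(line: str) -> datetime:
    """Parse the leading UTC timestamp of an ESXi log line."""
    return datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%fZ")


# Seconds between the failed vdrPort add and the DVS registration.
gap = (log_timestamp(HOSTD_LINE) - log_timestamp(NETCPA_LINE)).total_seconds()
print(f"DVS registered {gap:.3f} seconds after the vdrPort attempt")
```

A positive gap means the vdrPort attempt came first, which is the ordering that triggers the problem.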

Continue reading

Configuring a Proxy in Photon OS

I’ve been playing around recently with VMware’s new Photon OS platform. Thanks to its incredibly small footprint and virtualization-specific tuning, it looks like an excellent building block for a custom appliance I’m hoping to build. To keep the appliance as small as possible, I used the minimal deployment and then planned to install packages as required.

After deploying the appliance, I hit a roadblock as the package management tool called tdnf couldn’t reach any of the repositories. This was expected as my home lab is isolated and I have to go through a squid proxy server to get to the outside world.

root@photon-machine [ ~ ]# tdnf repolist
curl#7: Couldn't connect to server
Error: Failed to synchronize cache for repo 'VMware Photon Linux 2.0(x86_64) Updates' from 'https://dl.bintray.com/vmware/photon_updates_2.0_x86_64'
Disabling Repo: 'VMware Photon Linux 2.0(x86_64) Updates'
curl#7: Couldn't connect to server
Error: Failed to synchronize cache for repo 'VMware Photon Linux 2.0(x86_64)' from 'https://dl.bintray.com/vmware/photon_release_2.0_x86_64'
Disabling Repo: 'VMware Photon Linux 2.0(x86_64)'
curl#7: Couldn't connect to server
Error: Failed to synchronize cache for repo 'VMware Photon Extras 2.0(x86_64)' from 'https://dl.bintray.com/vmware/photon_extras_2.0_x86_64'
Disabling Repo: 'VMware Photon Extras 2.0(x86_64)'

When trying to build the package cache, you can see that synchronization fails for each repository because tdnf can’t reach its HTTPS location over port 443.

After having a quick look through the Photon administration guide, I was surprised to see that there wasn’t anything regarding proxy configuration listed – at least not at the time of writing. Doing some digging online turned up several possibilities. There seem to be numerous places in which a proxy can be defined – including in the Kubernetes configuration, or specifically for the tdnf package manager.

The simplest way to get your proxy configured for tdnf, as well as other tools like wget and curl, is to define a system-wide proxy. You’ll find the relevant configuration in the /etc/sysconfig/proxy file:
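As a rough illustration of what a sysconfig-style proxy file looks like, the Python sketch below parses shell-style KEY="value" lines and reports whether a proxy is switched on. The sample contents are hypothetical: the variable names are what I’d expect to find in such a file, and proxy.lab.local:3128 is a made-up squid address, so compare against the actual file on your system.

```python
import shlex

# Hypothetical file contents; variable names and values are assumptions.
SAMPLE = '''\
# Enable a generation of the proxy settings to the profile.
PROXY_ENABLED="yes"
HTTP_PROXY="http://proxy.lab.local:3128"
HTTPS_PROXY="http://proxy.lab.local:3128"
NO_PROXY="localhost, 127.0.0.1"
'''


def parse_sysconfig(text: str) -> dict:
    """Parse shell-style KEY="value" lines, skipping blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parts = shlex.split(value)  # strips the surrounding quotes
        settings[key.strip()] = parts[0] if parts else ""
    return settings


settings = parse_sysconfig(SAMPLE)
proxy_enabled = settings.get("PROXY_ENABLED") == "yes"
print("proxy enabled:", proxy_enabled)
print("HTTPS proxy:", settings.get("HTTPS_PROXY"))
```

To check a real system, the same function could be pointed at the text read from /etc/sysconfig/proxy instead of SAMPLE.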

Continue reading

The 486 Restoration – Part 3

Welcome to part three of my 486 restoration project! Check out parts one and two for more information on the parts I rescued from a badly neglected machine. I’m happy to report that the purchase of this banged up machine was not in vain. It didn’t come without its share of challenges but as you’ll see in this installment – it’s alive!

After removing the barrel battery and constructing an external battery pack in part 2, the next order of business was to get the machine put together on the work bench and powered up.


My test-bench isn’t pretty but it’s functional!

I’m using a modern PFC Seasonic 350W power supply with an AT 12-pin adapter. These old systems run almost entirely on 5 volt power, drawing nothing from the 3.3V rail and little from the +12V rail. This can cause problems with some newer PSUs, but this Seasonic model fares well with a 130W rating on the 5V rail. The only side effect of this power draw imbalance is a higher-than-usual 12.6V reading on the +12V rail. It’s not ideal, but I’d rather have this than a flaky 25 year old AT power supply.

Since the system didn’t come with a video card, I pulled out an old ATI Mach 32 ISA card from the parts bin.

I recently picked this up from the great folks in the computer recycling department of The Working Center in Kitchener. It was sitting in a box full of old PCI graphics cards destined for e-waste. It’s always awesome to keep classic parts out of the landfill and support a great cause at the same time.

Continue reading

NSX Troubleshooting Scenario 5 – Solution

Welcome to the fifth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario five. Today I’ll be performing some troubleshooting and will resolve the issue.

Please see the first half for more detail on the problem symptoms and some scoping.

Reader Suggestions

There were a few good suggestions from readers. Here are a couple from Twitter:

Good suggestions – we want to ensure that the distributed firewall dvFilters are applied to the vNICs of the VMs in question. Looking at the rules from the host’s perspective is also a good thing to check.

The suggestion about VMware tools may not seem like an obvious thing to check, but you’ll see why in the troubleshooting below.

Getting Started

In the first half of this scenario, we saw that the firewall rule and security group were correctly constructed. As far as we could tell, it was working as intended with two of the three VMs in question.


Only the VM lubuntu-1.lab.local seemed to be ignoring the rule and was instead hitting the default allow rule at the bottom of the DFW. Let’s summarize:

  • VM win-a1 and lubuntu-2 are working fine. I.e. they can’t browse the web.
  • VM lubuntu-1 is the only one not working. I.e. it can still browse the web.
  • The win-a1 and lubuntu-2 VMs are hitting rule 1005 for HTTP traffic.
  • The lubuntu-1 VM is hitting rule 1001 for HTTP traffic.
  • All three VMs have the correct security tag applied.
  • All three VMs are indeed showing up correctly in the security group due to the tag.
  • The two working VMs are on host esx-a1 and the broken VM is on host esx-a2.

To begin, we’ll use one of the reader suggestions above. I first want to take a look at host esx-a2 and confirm the DFW is correctly synchronized and that the lubuntu-1 VM does indeed have the DFW dvFilter applied to it.

Continue reading