If you’re running Auto Deploy and have noticed that your VMs lost connectivity after a host reboot or upgrade, you may have run into the problem described in VMware KB 52903. I’ve seen this a few times now with different customers and thought a PSA might be in order. You can find all the key details in the KB, but I thought I’d add some extra context here to help anyone who wants more information.
I recently helped to author VMware KB 52903, which has just been made public. Essentially, it describes a race condition that causes a host to come up without its vdrPort connected to the distributed switch. The vdrPort is an important component on an ESXi host that funnels traffic to and from the NSX DLR module. If this port isn’t connected, traffic can’t make it to the DLR for east/west routing on that host. Technically, VMs in the same logical switches will still be able to communicate across hosts, but none of the VMs on the impacted host will be able to route.
The race condition occurs when the DVS registration of the host occurs too late in the boot process. Normally, the distributed switch should be initialized and registered long before the vdrPort gets connected. In some situations, however, DVS registration can be late. Obviously, if the host isn’t yet initialized/registered with the distributed switch, any attempt to connect something to it will fail. And this is exactly what happens.
Using the log lines from KB 52903 as an example, we can see that the host attempts to add the vdrPort to the distributed switch at 23:44:19:
2018-02-08T23:44:19.431Z error netcpa[3FFEDA29700] [Originator@6876 sub=Default] Failed to add vdr port on dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d, Not found
The operation fails because the DVS with the specified UUID is not found from this host’s perspective. It simply hasn’t been initialized yet. A few moments later, the DVS is finally ready for use on the host. Notice the timestamps: the registration of the DVS occurs about nine seconds later:
2018-02-08T23:44:28.389Z info hostd[4F540B70] [Originator@6876 sub=Hostsvc.DvsTracker] Registered Dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d
The above message can be found in /var/log/hostd.log.
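If you want to scan for these signatures without eyeballing the logs, a simple grep does the job. Below is a hypothetical sketch that uses a sample file standing in for the netcpa log; on a live host you would point it at the real log instead (the document only names /var/log/hostd.log for the registration message, so treat the netcpa log path as an assumption to verify on your build).

```shell
#!/bin/sh
# Hypothetical sketch: grep a copy of the host logs for the KB 52903
# failure signature quoted above. The sample file below stands in for the
# real netcpa log on an ESXi host (path may vary by NSX/ESXi version).

# Sample log line taken verbatim from the KB example above.
cat > /tmp/netcpa.log <<'EOF'
2018-02-08T23:44:19.431Z error netcpa[3FFEDA29700] [Originator@6876 sub=Default] Failed to add vdr port on dvs 96 ff 2c 50 0e 5d ed 4a-e0 15 b3 36 90 19 41 5d, Not found
EOF

# This failure string is the key indicator that the vdrPort connect
# attempt lost the race against DVS registration.
if grep -q "Failed to add vdr port on dvs" /tmp/netcpa.log; then
    echo "Possible KB 52903 hit: vdrPort connect failed before DVS registration"
fi
```

The same pattern makes a good basis for the Log Insight alert discussed later in the post.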
Although technically this problem could occur on any ESXi host with a long delay in DVS registration, it seems much more likely to occur on stateless Auto Deploy hosts. Because these hosts perform many boot-time activities relating to host profiles and the distributed switch, the conditions for the race are more favorable.
In my experience, the likelihood of hitting this issue varies greatly. There are many factors, but the size of the environment as well as vCenter performance can play a role. One customer I’ve worked with seems to have about a one in five chance of running into this after a reboot. Another customer seems to hit it almost every time. And of course, there are many Auto Deploy customers I’ve worked with who have never hit this issue.
Identifying an Impacted Host
Unfortunately, this issue isn’t easily identifiable unless you know what you are looking for. Hosts that come up without a vdrPort do not report any kind of health check error in the UI. From the perspective of ‘host preparation’ as well as communication channel health, the host will look good.
To see if you’ve hit this problem, you’ll need to run net-vdr -C -l from the ESXi command line after the host has booted up, or look for the logging examples mentioned earlier. Below is an example from my lab showing a missing vdrPort:
[root@esx-a1:~] net-vdr -C -l

Host locale Id: 4226cdee-1dda-9ff9-9e2a-8fdd64facd35

Connection Information:
-----------------------

DvsName           VdrPort           NumLifs   VdrVmac
-------           -------           -------   -------
You’ll also notice that the vdrPort isn’t listed when you run esxtop and view the ‘n’ screen.
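The check above is easy to script. Here is a hypothetical sketch that inspects captured net-vdr -C -l output for a vdrPort row; the table layout is assumed from the lab output, and the sample file stands in for a live capture (net-vdr -C -l > /tmp/netvdr.txt on a real host):

```shell
#!/bin/sh
# Hypothetical detection sketch: look for a vdrPort data row in captured
# `net-vdr -C -l` output. Table layout assumed from the lab example above.

# Sample standing in for an impacted host's output (header rows only).
cat > /tmp/netvdr.txt <<'EOF'
Host locale Id: 4226cdee-1dda-9ff9-9e2a-8fdd64facd35
Connection Information:
-----------------------
DvsName VdrPort NumLifs VdrVmac
------- ------- ------- -------
EOF

# A healthy host shows a data row containing "vdrPort" (lowercase v).
# The column header reads "VdrPort", so a case-sensitive grep skips it.
if grep -q "vdrPort" /tmp/netvdr.txt; then
    echo "vdrPort connected"
else
    echo "vdrPort missing"
fi
```

Run against the impacted lab output above, this reports the vdrPort as missing.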
Thankfully, the workaround to get a host in this condition up again is quite simple. All you need to do is restart the control plane agent on the host. Restarting netcpad will cause the host to check on its vdrPort and, when it realizes it’s not there, re-connect it. Because the host is registered with the DVS by this point, it’ll work every time. This can be done from the ESXi command line:
[root@esx-a1:~] /etc/init.d/netcpad restart

watchdog-netcpaMonitor: Terminating watchdog process with PID 19773228
netCP agent service monitor is stopped
watchdog-netcpa: Terminating watchdog process with PID 19773192
Memory reservation released for netcpa
netCP agent service is stopped
Memory reservation set for netcpa
Reload security domains
netCP agent service starts
netCP agent service monitor is started
You can then check on the vdrPort using the following net-vdr command:
[root@esx-a1:~] net-vdr -C -l

Host locale Id: 4226cdee-1dda-9ff9-9e2a-8fdd64facd35

Connection Information:
-----------------------

DvsName           VdrPort           NumLifs   VdrVmac
-------           -------           -------   -------
dvs-compute-a     vdrPort           8         02:50:56:56:44:52

Vdr Switch Port:  33554456

Teaming Policy: Default Teaming
Uplink : Uplink 1(33554436): 00:50:56:f9:75:f4(Team member)
Uplink : Uplink 2(33554434): 00:50:56:f9:a8:33(Team member)

Stats  :       Pkt Dropped    Pkt Replaced   Pkt Skipped
Input  :       0              0              98039311
Output :       71020          2429           68459936
Notice that this time it’s listed with a MAC address beginning with the OUI 02:50:56 and a valid switch port number.
Now obviously, the important thing here is to recognize that this problem has occurred before you move any production workloads to the host. That is, you need to confirm that the host’s vdrPort is connected before taking it out of maintenance mode.
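That pre-maintenance-mode check can be sketched as a simple gate. The function below is hypothetical and takes captured net-vdr -C -l output so it can be exercised off-host; on a real ESXi host you would capture the live command output and run /etc/init.d/netcpad restart yourself if the port is missing, exactly as shown earlier.

```shell
#!/bin/sh
# Hypothetical pre-production gate: before taking a rebooted host out of
# maintenance mode, confirm the vdrPort is connected. Takes a file holding
# captured `net-vdr -C -l` output so it can be tested off-host.

ensure_vdrport() {
    outfile="$1"   # captured output of: net-vdr -C -l
    if grep -q "vdrPort" "$outfile"; then
        echo "OK: vdrPort connected - safe to exit maintenance mode"
    else
        echo "MISSING: vdrPort absent - run /etc/init.d/netcpad restart and re-check"
    fi
}

# Impacted host: table header only, no vdrPort row (per the lab output above).
printf 'DvsName VdrPort NumLifs VdrVmac\n------- ------- ------- -------\n' > /tmp/bad.txt
ensure_vdrport /tmp/bad.txt

# Healthy host: vdrPort row present.
printf 'dvs-compute-a vdrPort 8 02:50:56:56:44:52\n' > /tmp/good.txt
ensure_vdrport /tmp/good.txt
```

Wiring something like this into your post-reboot runbook (or a remote SSH check) makes it much harder to return an impacted host to service by accident.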
Another thing you can do is create an alert or report in Log Insight to watch for failed vdrPort creation based on the log entries described earlier. The VMware engineering team is working on a proper fix for this issue that will be released in a future release of NSX, but until then it’s worth doing a quick check on rebooted hosts.
I hope this helps to provide some additional context to KB 52903. If there are any questions, please feel free to comment below or reach out to me on Twitter (@vswitchzero).