Creating a Python Replacement for the Test-Connection cmdlet.

I’ve been working on a PowerShell script to automatically power up and shut down all devices in my home lab, both in the interest of saving on hydro and to reduce its environmental impact. To do this, I’ve been using a low-power Raspberry Pi 3 B+, which will stay powered on to orchestrate these and other scripted activities.

One great PowerShell cmdlet that comes in handy is Test-Connection. Although it can be used for numerous network testing purposes, it’s most commonly used to send an ICMP echo request to a host to see whether it responds. The cmdlet returns a simple true or false result, which makes it very handy for scripting. I was really hoping to use it because it makes it easy to determine whether a device is still online and whether it’s okay to move on to the next phase of powering things on or shutting things down. For example, I don’t want to power on my management ESXi host until the FreeNAS SAN is online.

The problem I ran into was that not all PowerShell cmdlets have been ported over to PowerShell Core for Linux. Test-Connection is unfortunately one of the cmdlets that hasn’t been ported yet, due to the way it uses the Microsoft Windows network stack to function. Looking at the GitHub PR, it seems they are getting close to porting it over, but I decided to try my hand at creating a Python script with similar basic functionality.

I should note that I’m not a programmer by any stretch of the imagination and it’s been many years since I’ve done any serious coding. This is actually the first bit of Python code I’ve written, so I’m sure there are many more efficient ways to achieve what I’ve done here:

import os
import sys

# If no arguments are passed, display the greeting
if len(sys.argv) == 1:
    print("vswitchzero.com ICMP response script. Specify up to 3 hosts to ping separated by spaces.")
    print("example: python ./pinghost.py 172.16.10.15 192.168.1.1 vc.lab.local")

# If one argument is passed, ping a single host
if len(sys.argv) == 2:
    host1 = sys.argv[1]
    # os.system returns 0 when ping received a reply
    response = os.system("ping -c 1 " + host1 + " > /dev/null")
    if response == 0:
        print(host1 + ' is responding')
    else:
        print(host1 + ' is not responding')

# If two arguments are passed, ping two hosts
if len(sys.argv) == 3:
    host1 = sys.argv[1]
    host2 = sys.argv[2]
    response1 = os.system("ping -c 1 " + host1 + " > /dev/null")
    if response1 == 0:
        print(host1 + ' is responding')
    else:
        print(host1 + ' is not responding')
    response2 = os.system("ping -c 1 " + host2 + " > /dev/null")
    if response2 == 0:
        print(host2 + ' is responding')
    else:
        print(host2 + ' is not responding')

# If three arguments are passed, ping three hosts
if len(sys.argv) == 4:
    host1 = sys.argv[1]
    host2 = sys.argv[2]
    host3 = sys.argv[3]
    response1 = os.system("ping -c 1 " + host1 + " > /dev/null")
    if response1 == 0:
        print(host1 + ' is responding')
    else:
        print(host1 + ' is not responding')
    response2 = os.system("ping -c 1 " + host2 + " > /dev/null")
    if response2 == 0:
        print(host2 + ' is responding')
    else:
        print(host2 + ' is not responding')
    response3 = os.system("ping -c 1 " + host3 + " > /dev/null")
    if response3 == 0:
        print(host3 + ' is responding')
    else:
        print(host3 + ' is not responding')

# If more than three arguments are passed, display an error
if len(sys.argv) > 4:
    print('vswitchzero.com ICMP response script. Specify up to 3 hosts to ping separated by spaces.')
    print('example: python ./pinghost.py 172.16.10.15 192.168.1.1 vc.lab.local')
    print(' ')
    print('ERROR: Too many arguments specified. Use 1-3 IPs or hostnames only. Each should be separated by a space')
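
For comparison, here is a rough sketch of how the same idea could be condensed with a loop, so any number of hosts can be checked. It is not the script from this post: it assumes Python 3.5 or later and a Linux-style ping that accepts -c, and the names is_responding, wait_until_online and main, the retry defaults and the exit codes are all made up for illustration. The wait_until_online helper is one possible way to hold off on the next power-on phase until a host (the FreeNAS SAN, for example) answers.

```python
import subprocess
import sys
import time


def is_responding(host):
    """Send one ICMP echo request with the system ping; True if it was answered."""
    # Passing the command as a list (no shell) avoids quoting problems.
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_online(host, attempts=30, delay=10, probe=is_responding):
    """Poll a host until it answers; False if it never does within `attempts`.

    The attempts/delay defaults are arbitrary assumptions, not tested values.
    """
    for attempt in range(attempts):
        if probe(host):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before the next try
    return False


def main(argv=None):
    """Ping every host named on the command line and report each result."""
    if argv is None:
        argv = sys.argv
    hosts = argv[1:]
    if not hosts:
        print("usage: python ./pinghost.py HOST [HOST ...]")
        return 2
    all_up = True
    for host in hosts:
        if is_responding(host):
            print(host + " is responding")
        else:
            print(host + " is not responding")
            all_up = False
    # 0 if every host answered, 1 otherwise (an illustrative convention)
    return 0 if all_up else 1
```

To use this as a script, end the file with a call such as sys.exit(main()). The probe parameter exists only so the waiting logic can be exercised without sending real pings.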


Dell BMC Problems with FreeNAS/FreeBSD

I’ve recently been working on a scripting project to orchestrate the power up and power down of my entire lab environment. As part of this, I’ve been using IPMI commands to power on physical servers in the correct order and at the correct time.

As I discussed in my recent FreeNAS Build Series, I’ve been using a Dell T110 tower server for storage purposes in my lab. Being an entry-level server, the T110 has a very trimmed down iDRAC BMC (Baseboard Management Controller) that doesn’t have a dedicated NIC or a web-based management page. Despite its limitations, I can still use the IPMI protocol to gather information and to run simple tasks, like powering it on and off.

pi@raspberrypi:~ $ ipmitool -I lanplus -H 172.16.10.67 -U root -P "vmware" lan print 1
Set in Progress : Set Complete
Auth Type Support : NONE MD2 MD5 PASSWORD 
Auth Type Enable : Callback : MD2 MD5 
 : User : MD2 MD5 
 : Operator : MD2 MD5 
 : Admin : MD2 MD5 
 : OEM : 
IP Address Source : Static Address
IP Address : 172.16.10.67
Subnet Mask : 255.255.255.0
MAC Address : b8:ac:6f:92:0b:e9
SNMP Community String : public
IP Header : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
Default Gateway IP : 172.16.10.1
Default Gateway MAC : 00:00:00:00:00:00
Backup Gateway IP : 0.0.0.0
Backup Gateway MAC : 00:00:00:00:00:00
802.1q VLAN ID : Disabled
802.1q VLAN Priority : 0

Above you can see a simple IPMI query using the ipmitool application available for Linux and other operating systems. In this example, I’m pulling the network configuration of the BMC.
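
If a script needs to act on this output, the key/value layout is simple to parse. Below is a minimal sketch in Python, assuming the text passed in is only the command output (without the shell prompt line) and that it keeps the "Key : Value" layout shown above. The function name parse_lan_print is hypothetical, and joining continuation rows with "; " is just one possible choice.

```python
def parse_lan_print(text):
    """Turn 'Key : Value' lines from `ipmitool lan print` into a dict.

    Rows with an empty key (such as the per-role 'Auth Type Enable'
    rows) are treated as continuations of the previous setting.
    """
    settings = {}
    last_key = None
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        # Split on the first colon only; values such as MAC addresses
        # contain colons of their own.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key:
            settings[key] = value
            last_key = key
        elif last_key is not None:
            settings[last_key] += "; " + value
    return settings
```

With the output above, a lookup such as settings["IP Address"] would give 172.16.10.67, which could then be checked before sending power commands to the BMC.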

I hadn’t had a need to use IPMI with the Dell T110 until recently, but was surprised to see that the BMC was not responding to ping. Thinking the BMC was just hung up, I did a cold power cycle and double checked the configuration. After several frustrating reboots, it became clear to me that FreeNAS/FreeBSD was not playing nicely with the Dell BMC. It appeared that it would work just fine until the FreeBSD kernel used by FreeNAS started loading. As soon as the bge driver claimed the BCM5722 card, the BMC couldn’t be accessed over the network.

To make things even more frustrating, it would not recover when the machine was shut down or rebooted. I could only get the BMC on the network again after doing a cold power-cycle of the server, or after going into the BMC configuration, changing something, and rebooting.

After doing some digging, I came across a thread on the FreeBSD forum that described my symptoms exactly. I’m not the only one who has run into this issue with Dell BMCs and shared Broadcom adapters in FreeBSD. This thread then led me to FreeBSD bug 196944 regarding a regression in the Broadcom bge driver. It looks like this has actually been broken for some time – all the way back to FreeBSD 9.2 – and is still a problem in 11.1 as well.

A few people were able to work around this issue by recompiling the kernel with the Broadcom driver from back in FreeBSD 9.1. I really didn’t feel comfortable doing this level of tinkering with FreeNAS – especially since any subsequent FreeNAS patches would likely just break it again.

Thankfully, someone in comment 6 of the bug describes a potential workaround that involves nothing more than enabling the PXE bootrom of the onboard Broadcom adapter in the BIOS. This was reported as having mixed results on varying models of Dell servers, but I was willing to give it a try. After changing my onboard NIC from ‘Enabled’ to ‘Enabled with PXE’ in the BIOS, the problem disappeared!

If you have this problem, give it a shot. It’s a simple workaround and the only downside is the extra 2-3 seconds at boot up.

NSX 6.3.6 Now Available!

As of March 29th, the long-anticipated NSX 6.3.6 release is now available to download from VMware. NSX 6.3.6, build number 8085122, is a maintenance release and includes a total of 20 documented bug fixes. You can find details on these in the Resolved Issues section of the NSX 6.3.6 release notes.

Aside from bug fixes, there are a couple of interesting changes to note. From the release notes:

“If you have more than one vSphere Distributed Switch, and if VXLAN is configured on one of them, you must connect any Distributed Logical Router interfaces to port groups on that vSphere Distributed Switch. Starting in NSX 6.3.6, this configuration is enforced in the UI and API. In earlier releases, you were not prevented from creating an invalid configuration.”

Since confusion with multiple DVS switches is something I’ve run into with customers in the past, I’m happy to see that this is now being enforced.

Another great addition is an automatic backup function included in 6.3.6. From the public documentation:

“When you upgrade NSX Manager to NSX 6.3.6, a backup is taken and saved locally as part of the upgrade process. You must contact VMware customer support to restore this backup. This automatic backup is intended as a failsafe in case the regular backup fails.”

As part of the upgrade process, a backup file is saved to the local filesystem of the NSX Manager as an extra bit of insurance. It’s important to note, however, that this does not remove the need to back up prior to upgrading. Consider this the backup of last resort in case something goes horribly wrong.

Another point to note is that NSX 6.3.6 continues to be incompatible with upgrades from 6.2.2, 6.2.1 or 6.2.0. You can see VMware KB 51624 for more information, but don’t try it: it won’t work and you’ll be forced to restore from backup. Upgrading to 6.2.9 before going to 6.3.6 is the correct workaround. I covered more about this issue in a recent post.
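
The rule above is easy to encode as a pre-flight check in an upgrade script. This is a minimal sketch in Python, assuming the running version is available as a plain string such as "6.2.1". The function name is hypothetical, and it covers only the three versions named above; it says nothing about any other release, so the release notes remain the real reference.

```python
# Versions this post names as unable to upgrade directly to 6.3.6.
BLOCKED_DIRECT = {"6.2.0", "6.2.1", "6.2.2"}
INTERMEDIATE = "6.2.9"


def needs_intermediate_upgrade(current_version):
    """True if this version must go to 6.2.9 before moving to 6.3.6."""
    return current_version.strip() in BLOCKED_DIRECT
```

A script could refuse to continue when this returns True and print a reminder to upgrade to the INTERMEDIATE release first.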

There are a number of great bug fixes included in 6.3.6 – far too many for me to cover here, but a couple that I’m really happy to see include:

“Fixed Issue 2035026: Network outage of ~40-50 seconds seen on Edge Upgrade. During Edge upgrade, there is an outage of approximately 40-50 seconds. Fixed in 6.3.6”

This one is self-explanatory – not the expected amount of downtime to experience during an edge upgrade, so glad to see it’s been resolved.

“Fixed Issue 2058636: After upgrading to 6.3.5, the routing loop between DLR and ESG’s causes connectivity issues in certain BGP configurations. A routing loop is causing a connectivity issue. Fixed in 6.3.6”

I hope to write a separate post on this one, but in short, some loop prevention code was removed in 6.3.5, and because the AS PATH is stripped with private BGP autonomous systems, this can lead to loops. If you are running iBGP between your DLR and ESGs, this isn’t a problem, but if your AS numbers differ between DLR and ESG, you could run into this. In 6.4.0 a toggle switch was included to avoid stripping the AS PATH, so this is more of an issue in 6.3.5.

As always, if you are planning to upgrade, be sure to thoroughly go through the release notes. I’d also recommend taking a look through my recent post ‘Ten Tips for a Successful NSX Upgrade’.


NSX Troubleshooting Scenario 7 – Solution

Welcome to the seventh installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario seven. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

In the first half of this scenario, we saw that our fictional customer was hitting an exception every time they tried to convert their secondary – now a transit – NSX Manager to the standalone role. The error message seemed to imply that numerous universal objects were still in the environment.

[Image: tshoot7a-2]

Our quick spot checks didn’t show any lingering universal objects, but looking at the NSX Manager logging can tell us a bit more about what still exists:

2018-03-26 22:27:21.779 GMT  INFO http-nio-127.0.0.1-7441-exec-1 ReplicationConfigurationServiceImpl:152 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Role validation successful
2018-03-26 22:27:21.792 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 1 universal objects exists for type VdnScope
2018-03-26 22:27:21.793 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): Universal TZ
2018-03-26 22:27:21.794 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 1 universal objects exists for type Edge
2018-03-26 22:27:21.797 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): dlr-universal
2018-03-26 22:27:21.798 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:61 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] 5 universal objects exists for type VirtualWire
2018-03-26 22:27:21.806 GMT  INFO http-nio-127.0.0.1-7441-exec-1 DefaultUniversalSyncListener:66 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Some objects are (printing maximum 5 names): Universal Transit, Universal Test, Universal App, Universal Web, Universal DB
2018-03-26 22:27:21.809 GMT  INFO http-nio-127.0.0.1-7441-exec-1 L2UniversalSyncListenerImpl:58 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Global VNI pool exists
2018-03-26 22:27:21.814 GMT  WARN http-nio-127.0.0.1-7441-exec-1 ReplicationConfigurationServiceImpl:101 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] Setting role to TRANSIT because following object types have universal objects VniPool,VdnScope,Edge,VirtualWire
2018-03-26 22:27:21.816 GMT  INFO http-nio-127.0.0.1-7441-exec-1 AuditingServiceImpl:174 - - [nsxv@6876 comp="nsx-manager" subcomp="manager"] [AuditLog] UserName:'LAB\mike', ModuleName:'UniversalSync', Operation:'ASSIGN_STANDALONE_ROLE', Resource:'', Time:'Mon Mar 26 14:27:21.815 GMT 2018', Status:'FAILURE', Universal Object:'false'
2018-03-26 22:27:21.817 GMT  WARN http-nio-127.0.0.1-7441-exec-1 RemoteInvocationTraceInterceptor:88 - Processing of VsmHttpInvokerServiceExporter remote call resulted in fatal exception: com.vmware.vshield.vsm.replicator.configuration.facade.ReplicatorConfigurationFacade.setAsStandalonere.vshield.vsm.exceptions.InvalidArgumentException: core-services:125023:Unable to assign STANDALONE role. Universal objects of following types are present:

If you look closely at the messages above, you can see a list of what still exists. Keep in mind that a maximum of five objects per category is included in the log messages. In this case, they are:

Transport Zones: Universal TZ
Edges: dlr-universal
Logical Switches: Universal Transit, Universal Test, Universal App, Universal Web, Universal DB
VNI Pools: 1 exists
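
Pulling this summary out of the log by eye gets tedious in a larger environment. As a rough sketch, the two message formats seen above can be matched with regular expressions. This assumes Python and that the log text uses exactly the wording shown in this excerpt; the function name is hypothetical, and object names that contain commas would be split incorrectly.

```python
import re

# Patterns copied from the wording of the log lines shown above.
COUNT_RE = re.compile(r"(\d+) universal objects exists for type (\w+)")
NAMES_RE = re.compile(r"Some objects are \(printing maximum 5 names\): (.*)$")


def summarize_universal_objects(log_text):
    """Map each object type to {'count': int, 'names': [str, ...]}."""
    summary = {}
    current_type = None
    for line in log_text.splitlines():
        count_match = COUNT_RE.search(line)
        if count_match:
            current_type = count_match.group(2)
            summary[current_type] = {
                "count": int(count_match.group(1)),
                "names": [],
            }
            continue
        names_match = NAMES_RE.search(line)
        if names_match and current_type is not None:
            # The names line belongs to the most recent count line.
            names = [n.strip() for n in names_match.group(1).split(",")]
            summary[current_type]["names"] = [n for n in names if n]
    return summary
```

The "Global VNI pool exists" message has no count line, so it is not picked up by this sketch and would need its own check.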

This is indeed a list of everything the customer claims to have deleted from the environment. From the perspective of the ‘Transit’ manager, these objects still exist for some reason.

How We Got Here

Looking back at the order of operations the user performed tells us something important:

  1. First, they disconnected the secondary NSX Manager from the primary. This was successful, and it changed its role from Secondary to ‘Transit’.
  2. Next, they attempted to convert it to a ‘Standalone’ manager. This failed with the same error message mentioned earlier. This seemed valid, however, because those objects really did exist.
  3. At this point, they removed the remaining universal logical switches, edges and transport zone. These were all deleted successfully.
  4. The subsequent attempts to convert the manager to a ‘Standalone’ continue to fail with the same error message even though the objects are gone.

Notice the very first step: they disconnected the secondary from the primary NSX Manager. Because it’s in the ‘Transit’ role, we know this was achieved by using the ‘Disconnect from primary’ option in the secondary manager’s ‘Actions’ menu. Because they were simply looking to remove cross-VC functionality, this was not the appropriate action. What they should have done is use the ‘Remove Secondary NSX Manager’ option from the primary’s ‘Actions’ menu, which would have moved the secondary directly to the standalone role.


NSX Troubleshooting Scenario 7

Welcome to the seventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“I’m in the process of decommissioning a remote location. I’m trying to convert the secondary NSX manager to a standalone, but it fails every time saying that universal objects need to be deleted. I’ve removed all of them and the error persists!”

Well, this seems odd. Let’s look at the environment and try to reproduce the issue to see what the exact error is.

[Image: tshoot7a-1]

It looks like the 172.19.10.40 NSX manager is currently in Transit mode. This means that it was removed as a secondary at some point but has not been converted to a standalone. This is the operation that is failing:

[Image: tshoot7a-2]


USB Passthrough and vMotion

I was recently speaking with someone about power management in a home lab environment. Their plan was to use USB passthrough to connect a UPS to a virtual machine in a vSphere cluster. From there, they could use PowerCLI scripting to gracefully power off the environment if the UPS battery got too low. This sounded like a wise plan.

Their concern was that the VM would need to be pinned to the host where the USB cable was connected and that vMotion would not be possible. To their pleasant surprise, I told them that support for vMotion of VMs with USB passthrough had been added at some point in the past and it was no longer a limitation.

When I started looking more into this feature, however, I discovered that this was not a new addition at all. In fact, this has been supported ever since USB passthrough was introduced in vSphere 4 over seven years ago. Have a look at the vSphere Administration Guide for vSphere 4 on page 105 for more information.

I had done some work with remote serial devices in the past, but I’ve never been in a situation where I needed to vMotion a VM with a USB device attached. It’s time to finally take this functionality for a test drive.


NSX Troubleshooting Scenario 6 – Solution

As we saw in the first half of scenario 6, a fictional administrator enabled the DFW in their management cluster, which caused some unexpected filtering to occur. Their vCenter Server was no longer allowed the necessary HTTPS port 443 traffic needed for the vSphere Web Client to work.

Since we can no longer manage the environment or the DFW using the UI, we’ll need to revert this change using some other method.

As mentioned previously, we are fortunate in that NSX Manager is always excluded from DFW filtering by default. This is done to protect against this very type of situation. Because the NSX management plane is still fully functional, we should – in theory – still be able to relay API based DFW calls to NSX Manager. NSX Manager will in turn be able to publish these changes to the necessary ESXi hosts.

There are two relatively easy ways to fix this that come to mind:

  1. Use the approach outlined in KB 2079620. This is the equivalent of doing a factory reset of the DFW ruleset via API. This will wipe out all rules and they’ll need to be recovered or recreated.
  2. Use an API call to disable the DFW in the management cluster. This will essentially revert the exact change the user made in the UI that started this problem.

There are other options, but the two above will restore HTTP/HTTPS connectivity to vCenter. Once that is done, some remediation will be necessary to ensure this doesn’t happen again. Rather than picking a specific solution, I’ll go through both of them.
