ESXi – vswitchzero

Synology Active Backup for Business

Synology’s Active Backup for Business is a powerful, license-free backup tool included with many of their higher end “plus” and rackmount NAS units. Here is my latest video taking a look at its integration with VMware vSphere and ESXi. I walk through installation, setup, backup and restore.

In short, it is an excellent tool that provides features you’d expect to see from enterprise “paid” backup solutions. For those with home labs or smaller environments, it makes the value proposition of buying a Synology NAS much more enticing!

Links:

Synology’s Active Backup for Business page: https://www.synology.com/en-ca/dsm/fe…
Supported NAS models: https://www.synology.com/en-ca/dsm/pa…
vSphere Support: https://www.synology.com/en-ca/knowle…
User access requirements for vSphere: https://www.synology.com/en-us/knowle…

Scheduling Tasks in ESXi Using Cron

The utility “cron” is a job scheduler in Linux/Unix based operating systems. It is very useful for scheduling scripts or specific commands to run on a defined schedule – daily, weekly, monthly and everything in between. Thankfully, ESXi includes an implementation of the cron utility that can be accessed from the root shell. Normally, you wouldn’t need to use cron, but there are some situations where scheduling CLI commands can be useful.

A few use cases where I’ve personally done this over the years include:

To collect switchport statistics at certain times overnight to troubleshoot packet loss or performance issues.
Restarting a specific service every 24 hours to prevent a memory leak from getting out of hand.
Executing a python or shell script to collect various data from the CLI.

ESXi’s implementation of cron is similar to that of most linux distributions, but it is not exactly the same. The popular ‘crontab’ command isn’t included and can’t be used to easily add jobs. In addition, any cron changes you make won’t take effect until the crond service is restarted.

Preparation

Before changing the cron configuration, you’ll want to test the command or script you plan to schedule. In my case, I’m going to simply run an esxcli command every two minutes that will add a mark entry into the system log files:

 [root@esx1:~] esxcli system syslog mark --message="My cron job just ran!"

 [root@esx1:~] cat /var/log/vmkernel.log |grep "mark:"
 2021-02-17T17:39:00Z esxcfg-syslog[2102126]: mark: My cron job just ran!

As you can see above, this is a great way to test out cron because every time it runs you’ll get proof in the logging along with a date/time stamp.

Tip: If you are adding a script to your host, avoid the /tmp location as it is not persistent across reboots. I like to use /opt in older releases, or the OSData partition in ESXi 7.0.

Also, if you are not familiar with crontab formatting, I’d recommend reading up on the subject to make sure your jobs run as expected. There are a number of good resources online that will show you the various scheduling options. I have included a few examples at the end of this post as well.

Adding a Cron Job

You’ll first need to SSH into your ESXi host. Once there, you can see the current crontab file at /var/spool/cron/crontabs/root. In ESXi 7.0, the file contains:

 [root@esx1:~] cat /var/spool/cron/crontabs/root
 #min hour day mon dow command
 1    1    *   *   *   /sbin/tmpwatch.py
 1    *    *   *   *   /sbin/auto-backup.sh
 0    *    *   *   *   /usr/lib/vmware/vmksummary/log-heartbeat.py
 */5  *    *   *   *   /bin/hostd-probe.sh ++group=host/vim/vmvisor/hostd-probe/stats/sh
 00   1    *   *   *   localcli storage core device purge
 */10 *    *   *   *   /bin/crx-cli gc
 */2 * * * * esxcli system syslog mark --message="My cron job just ran!"

As you can see, ESXi already uses cron to schedule several internal housekeeping routines. Before changing the file, be sure to back it up just in case:

cp /var/spool/cron/crontabs/root /var/spool/cron/crontabs/root.old

You can modify the file using ‘vi’. For those not familiar with Linux, there is a bit of a learning curve to vi so I’d recommend reading up on how to navigate around in it. There are quite a few good turorials available online.

[root@esx1:~] vi /var/spool/cron/crontabs/root

Note: When using :wq to save your changes, you’ll likely get a warning that the file is read only. You don’t need to fiddle with the permissions. Simply use :wq! and the file will be written successfully.

I have added a single line at the bottom of the file. Here is the updated root crontab file:

#min hour day mon dow command
 1    1    *   *   *   /sbin/tmpwatch.py
 1    *    *   *   *   /sbin/auto-backup.sh
 0    *    *   *   *   /usr/lib/vmware/vmksummary/log-heartbeat.py
 */5  *    *   *   *   /bin/hostd-probe.sh ++group=host/vim/vmvisor/hostd-probe/stats/sh
 00   1    *   *   *   localcli storage core device purge
 */10 *    *   *   *   /bin/crx-cli gc
 */2  *    *   *   *   esxcli system syslog mark --message="My cron job just ran!"

Note: As mentioned previously, if you are not familiar with the min/hour/day/mon/dow formatting that cron uses, there are a number of good resources online that can help.

Despite updating the file, your changes will not take effect until the crond service is restarted on the host. First, get the crond PID (process identifier) by running the following command:

[root@esx1:~] cat /var/run/crond.pid
2098663

Next, kill the crond PID. Be sure to change the PID number to what you obtained in the previous step.

[root@esx1:~] kill 2098663

Once the process is stopped, you can use BusyBox to launch it again:

[root@esx1:~] /usr/lib/vmware/busybox/bin/busybox crond

You’ll know it was restarted successfully if you have a new PID now:

[root@esx1:~] cat /var/run/crond.pid
2103414

After leaving the host idle for a few minutes, you can see the command has been running every two minutes as desired:

[root@esx1:~] cat /var/log/vmkernel.log |grep -i mark:
2021-02-17T17:39:00Z esxcfg-syslog[2102126]: mark: My cron job just ran!
2021-02-17T20:16:00Z esxcfg-syslog[2103370]: mark: My cron job just ran!
2021-02-17T20:18:00Z esxcfg-syslog[2103382]: mark: My cron job just ran!
2021-02-17T20:20:00Z esxcfg-syslog[2103396]: mark: My cron job just ran!

As you can imagine, the possibilities are endless here. I will share some of the scripts I have used to collect some performance metrics via cron in a future post.

Crontab Examples

Run a command every two minutes:

#min hour day mon dow command
*/2    *    *   *   *   esxcli system syslog mark --message="My cron job just ran!"

Run a command every hour:

#min hour day mon dow command
*    */1    *   *   *   esxcli system syslog mark --message="My cron job just ran!"

Run a command at midnight every night:

#min hour day mon dow command
00    0    *   *   *   esxcli system syslog mark --message="My cron job just ran!"

Run a command at 3:30PM every Thursday:

#min hour day mon dow command
30    15    *   *   5   esxcli system syslog mark --message="My cron job just ran!"

Run a command at midnight and at noon every day:

#min hour day mon dow command
00    0,12    *   *   *   esxcli system syslog mark --message="My cron job just ran!"

Overheating NVMe Flash Drives

I recently deployed an all-NVMe based vSAN configuration in my home lab. I’ll be posting more information on my setup soon, but I decided to use OEM Samsung based SSDs. I’ve got 256GB SM961 MLC based drives for my cache tier, and larger 1TB enterprise-grade PM953s for capacity. These drives are plenty quick for vSAN and can be had for great prices on eBay if you know where to look.

nvmeheat-1 — The Samsung Polaris based SM961 is similar to the 960 Pro and well suited for vSAN caching.

Being OEM drives, they don’t have any heatsinks and are pretty bare. As I started running some performance tests using synthetic tools like Crystal Disk Mark and ATTO, I began to see instability. My guest running the test would completely hang after a few minutes of testing and I’d be forced to reboot the ESXi host to recover.

Looking through the logs, it became clear what had happened:

2019-08-16T15:43:26.083Z cpu0:2341677)nvme:AsyncEventReportComplete:3050:Smart health event: Temperature above threshold
2019-08-16T15:43:26.087Z cpu9:2097671)nvme:NvmeExc_ExceptionHandlerTask:317:Critical warnings detected in smart log [2], failing controller
2019-08-16T15:43:26.087Z cpu9:2097671)nvme:NvmeExc_RegisterForEvents:370:Async event registration requested while controller is in Health Degraded state.

One of my nvme drives had overheated! The second time I tried the test, I watched more closely.

Sure enough, it wasn’t the older PM953s overheating, but the newer Polaris based SM961 cache drives. As soon as the heavy writes started, the drive’s temperature steadily increased until it approached 70’C. The moment it hit 70, the guest hung. Looking more closely in ESXi, I could see that the drive completely disappeared. I.e. it was no longer listed as a NVMe device or HBA in the system. It appears that this is safety measure to stop the controller from cooking itself to the point of permanent damage. Since I had no idea it was running so hot, I’d say I’m thankful for this feature – but none the less, I’d have to figure out some way to keep these drives cooler.

ESXi has a limited implementation of SMART monitoring and can pull a few specific metrics. Thankfully, drive temperature is one of them. First, I needed to get the t10 identifier for my nvme drives:

[root@esx-e1:~] esxcli storage core device list |grep SAMSUNG
t10.NVMe____SAMSUNG_MZVPW256HEGL2D000H1______________6628B171C9382499
Display Name: Local NVMe Disk (t10.NVMe____SAMSUNG_MZVPW256HEGL2D000H1______________6628B171C9382499)
Devfs Path: /vmfs/devices/disks/t10.NVMe____SAMSUNG_MZVPW256HEGL2D000H1______________6628B171C9382499
Model: SAMSUNG MZVPW256
t10.NVMe____SAMSUNG_MZ1LV960HCJH2D000MU______________1505216B24382888
Display Name: Local NVMe Disk (t10.NVMe____SAMSUNG_MZ1LV960HCJH2D000MU______________1505216B24382888)
Devfs Path: /vmfs/devices/disks/t10.NVMe____SAMSUNG_MZ1LV960HCJH2D000MU______________1505216B24382888
Model: SAMSUNG MZ1LV960

Running a four second refresh interval using ‘watch’ is a useful way to monitor the drive under stress.

[root@esx-e1:~] watch -n 4 "esxcli storage core device smart get -d t10.NVMe____SAMSUNG_MZVPW256HEGL2D000H1______________6628B171C9382499"
Parameter Value Threshold Worst
---------------------------- ----- --------- -----
Health Status OK N/A N/A
Media Wearout Indicator N/A N/A N/A
Write Error Count N/A N/A N/A
Read Error Count N/A N/A N/A
Power-on Hours 974 N/A N/A
Power Cycle Count 62 N/A N/A
Reallocated Sector Count 0 95 N/A
Raw Read Error Rate N/A N/A N/A
Drive Temperature 35 70 N/A
Driver Rated Max Temperature N/A N/A N/A
Write Sectors TOT Count N/A N/A N/A
Read Sectors TOT Count N/A N/A N/A
Initial Bad Block Count N/A N/A N/A

As you can see, the maximum temperature is listed as 70’C. This isn’t a suggestion as I’ve come to learn the hard way.

To get things cooler I decided to move my fans around in my Antec VSK4000 cases. My lab is geared toward silence more than cooling so the airflow near the PCIe slots is pretty poor. I’ve now got a 120mm fan on the side-panel cooling the slots directly. This benefits my Solarflare 10Gbps NICs as well, which can get quite toasty. This helped significantly, but if I leave a synthetic test running long enough, it will eventually get to 70’C again. Clearly, I’ll need to add passive heatsinks to the SM961s if I want to keep them cool in these systems.

Realistically, it’s only synthetic and very heavy write tests that seem to get the temperature climbing to those levels. It’s unlikely that day-to-day use would cause a problem. None the less, I’m going to look into heatsinks for the drives. They can be had for $5-10 on Amazon, so it seems like a small investment for some extra peace of mind.

The morale of the story – keep an eye on your NVMe controller temps!

ipmitool 1.8.11 vib for ESXi

Run ipmitool directly from the ESXi command line instead of having to boot to Linux.

I just created a packaged vib that includes ipmitool 1.8.11 that can be run directly from the ESXi CLI. I needed to be able to modify fan thresholds to keep my slow-spinning fans from triggering critical alarms on my hosts. These fan thresholds aren’t exposed in the web UI and I have to modify them using ipmitool. Normally, to do this I’d have to shut down the host, and boot it up using an install of Debian on a USB stick – a bit of a pain. Why not just run ipmitool from directly within ESXi instead?

You can find the vib download, some background, installation instructions and example uses on the static page here.

Manually Patching an ESXi Host from the CLI

Manually patching standalone ESXi hosts without access to vCenter or Update Manager using offline bundles and the CLI.

There are many different reasons you may want to patch your ESXi host. VMware regularly releases bug fixes and security patches, or perhaps you need a newer build for compatibility with another application or third-party tool.

Update 3/15/2021: See my video tutorial on how to update your ESXi 7.x host from the CLI:

In my situation, the ESXi 6.7 U1 ESXi hosts (build 10302608) are not compatible with NSX-T 2.4.0, so I need to get them patched to at least 6.7 EP06 (build 11675023).

hostupgcli-1

Before you get started, you’ll want to figure out which patch release you want to update to. There is quite often some confusion surrounding the naming of VMware patch releases. In some cases, a build number is referenced, for example, 10302608. In other cases, a friendly name is referenced – something like 6.7 EP06 or 6.5 P03. The EP in the name denotes an ‘Express Patch’ with a limited number of fixes released outside of the regular patch cadence, where as a ‘P’ release is a standard patch. In addition to this, major update releases are referred to as ‘U’, for example, 6.7 U1. And to make things more confusing, a special ‘Release Name’ is quite often referenced in security bulletins and other documents. Release names generally contain the release date in them. For example, ESXi670-201903001 for ESXi 6.7 EP07.

The best place to start is VMware KB 1014508, which provides links to numerous KB articles that can be used for cross referencing build numbers with friendly versions names. The KB we’re interested in for ESXi is KB 2143832.

Continue reading “Manually Patching an ESXi Host from the CLI”

Updating NIC Drivers with VMware Update Manager

Using VUM and DRS to make quick work of driver updates in larger environments.

In my last video, I showed you how to update ESXi NIC drivers from the command line. This method is great for one-off updates, or for small environments, but it really isn’t scalable. Thankfully, VMware Update Manager can make quick work out of driver updates. By taking advantage of fully-automated DRS, VUM can make the entire process seamless and orchestrate everything from host evacuation, driver installation and even the host reboots.

In today’s video, I walk you through how to upload a custom patch into VUM and create a baseline that can be used to update a driver.

Remember, some server vendors require specific or minimum firmware levels to go along with their drivers. The firmware version listed in the compatibility guide is only the version used to test/qualify the driver. It’s not necessarily the best or only choice. VMware always recommends reaching out to your hardware vendor for the final word on driver/firmware interoperability.

I hope you found this video helpful. For more instructional videos, please head over to my YouTube channel. Please feel free to leave any comments below, or on YouTube.

Updating NIC Drivers in ESXi from the CLI

A video walk-through on updating your NIC drivers from the command line for maximum control.

There are a number of reasons you may want to update your NIC drivers and firmware. Maybe it’s just a best practice recommendation from the vendor, or perhaps you’ve run into a bug or performance problem that warrants this. Whatever the reason, keeping your NIC drivers up to date is always a good idea.

There are several ways to go about updating your drivers, but the tried and tested ‘esxcli’ method works well for small environments. It’s also a good choice to ensure you have maximum control over the process. The below video will walk you through the update process:

Remember that finding the correct NIC on the VMware Compatibility Guide is one of the most important steps in the driver update process. For help on narrowing down your exact NIC make/model based on PCI identifiers, be sure to check out this video.

Another important point to remember is that some server vendors require specific or minimum firmware levels to go along with their drivers. The firmware version listed in the compatibility guide is only the version used to test/qualify the driver. It’s not necessarily the best or only choice. VMware always recommends reaching out to your hardware vendor for the final word on driver/firmware interoperability.

Stay tuned for another video on using VMware Update Manager to create a baseline for automating the driver update process!

I hope you found this video helpful. For more instructional videos, please head over to my YouTube channel. Please feel free to leave any comments below, or on YouTube.

Identifying NICs based on PCI VID and DID

A better way to find your exact NIC model on the VMware Compatibility Guide.

If you’ve ever tried to search for a NIC in the VMware Compatibility Guide, you may have come up a much longer list of results than you expected. Many cards with similar names have subtle differences. Some have multiple hardware revisions, varying numbers or types of ports and may also be released by different OEMs. In some situations, the name of the card in the vSphere UI may not match what it truly is, adding to the confusion.

Thankfully, there is a much better way to identify your card. You can use the PCI VID, DID, SVID and SSID identifiers. The below video will walk through how to find these identifiers, as well as how to use them to find your specific card on the HCG.

Please feel free to leave any comments or questions below or on YouTube.

Console Mouse Not Working in Windows VMs

I recently ran into some problems while deploying a Windows Server 2012 R2 VM in my vSphere 6.5 U2 lab. I’ve come to expect that the console mouse response is going to be terrible until VMware Tools is installed, but for some odd reason I had no mouse control whatsoever. Thinking it may be a quirk of the Web Console, I tried both the Remote Console and the HTML5 client to no avail.

The VM appeared to be healthy and would register keyboard input, but the motion of the mouse cursor was erratic or the cursor would not move at all. Thinking that I just needed to battle on and get Tools installed, I attempted to use the keyboard for this purpose – what a chore. You think it would have been easy, but the installer kept losing focus and falling behind other open windows. Many of the windows keyboard shortcuts I’d normally use were not functioning because they register on my laptop – not in the console. I couldn’t RDP to the VM either because the NIC needed to be configured with a valid IP address.

After doing a bit of research, it appeared that display scaling could cause all sorts of mouse issues – but this didn’t appear to be applicable in my case. That’s when I stumbled upon a communities thread that mentioned adding a USB controller to the VM. Even though my VM was ‘Hardware Version 13’, the USB 2.0 controller isn’t added by default.

I managed to get to the device manager using the keyboard, and you can see that the virtual hardware will use a PS/2 a mouse in the absence of a USB controller:

consolemouse-2

I then went ahead and added the basic USB 2.0 controller to the VM and booted it up.

Continue reading “Console Mouse Not Working in Windows VMs”

Certificate Error During Datastore Upload

I have recently rebuilt my home lab – an all too common occurrence due to the number of times I intentionally try to break things. In the process of rebuilding, I had some ISO files I wanted to copy over to a datastore. The process failed and the Web Client greeted me with an uncharacteristically long error message.

The exact text reads:

“The operation failed for an undetermined reason. Typically, this problem occurs due to certificates that the browser does no trust. If you are using self-signed or custom certificates, open the URL below in a new browser tab and accept the certificate, then retry the operation.”

In my case, the URL that it listed was to one of my ESXi hosts in the compute-a cluster called esx-a2. The error then goes on to reference VMware KB 2147256.

It may seem odd that the vSphere Client would be telling you to visit a random ESXi host’s UI address when you are trying to upload a file via vCenter. But if you stop to think about it for a second, vCenter has no access whatsoever to your datastores. Whether you are trying to create a new VMFS datastore, upload a file or even just browse, vCenter must rely on an ESXi host with the necessary access to do the actual legwork. That ESXi host then relays the information back through the Web Client.

Continue reading “Certificate Error During Datastore Upload”