VM Network Performance and CPU Scheduling

Over the years, I’ve been on quite a few network performance cases and have seen many reasons for performance trouble. One that is often overlooked is the impact of CPU contention and a VM’s inability to schedule CPU time effectively.

Today, I’ll be taking a quick look at the actual impact CPU scheduling can have on network throughput.

Testing Setup

To demonstrate, I’ll be using my dual-socket management host. As I did in my recent VMXNET3 ring buffer exhaustion post, I’ll be testing with VMs on the same host and port group to eliminate bottlenecks created by physical networking components. The VMs should be able to communicate as quickly as their compute resources will allow them.

Physical Host:

  • 2x Intel Xeon E5 2670 Processors (16 cores at 2.6GHz, 3.3GHz Turbo)
  • 96GB PC3-12800R Memory
  • ESXi 6.0 U3 Build 5224934

VM Configuration:

  • 1x vCPU
  • 1024MB RAM
  • VMXNET3 Adapter (1.1.29 driver with default ring sizes)
  • Debian Linux 7.4 x86 PAE
  • iperf 2.0.5

The VMs I used for this test are quite small with only a single vCPU and 1GB of RAM. This was done intentionally so that CPU contention could be more easily simulated. Much higher throughput would be possible with multiple vCPUs and additional RX queues.

The CPUs in my physical host are Xeon E5 2670 processors clocked at 2.6GHz per core. Because this processor supports Intel Turbo Boost, the maximum frequency of each core will vary depending on several factors and can be as high as 3.3GHz at times. To take this into consideration, I will test with a CPU limit of 2600MHz, as well as with no limit at all to show the benefit this provides.

To measure throughput, I’ll be using a pair of Debian Linux VMs running iperf 2.0.5. One will be the sending side and the other the receiving side. I’ll be running four simultaneous threads to maximize throughput and load.

I should note that my testing is far from precise and is not being done with the usual controls and safeguards to ensure accurate results. This said, my aim isn’t to be accurate, but rather to illustrate some higher-level patterns and trends.

Continue reading “VM Network Performance and CPU Scheduling”

NSX Engineering Mode ‘root shell’ Access Now Available to Customers

In an interesting move, VMware has released public KB 2149630 on September 29th, providing information on how to access the root shell of the NSX Manager appliance.

If you’ve been on an NSX support call with VMware dealing with a complex issue, you may have seen your support engineer drop into a special shell called ‘Engineering Mode’. This is sometimes also referred to as ‘Tech Support Mode’. Regardless of the name used, this is basically a root bash shell on the underlying Linux based appliance. From here, system configuration files and scripts as well as most normal Linux functions can be accessed.

Normally, when you open a console or SSH session to NSX manager, you are dropped into a restricted ‘admin’ shell with a hierarchical system of commands like Cisco’s IOS. For the majority of what an administrator needs to do, this is sufficient. It’s only in more complex cases – especially when dealing with issues in the Postgres DB – or issues with the underling OS that this may be required.

There are several important statements and disclaimers that VMware makes in this KB article that I want to outline below:

“Important: Do not make any changes to the underlying system without the help of VMware Technical Support. All such changes are not supported and as a result, your system may no longer be supportable by GSS.”

In NSX 6.3.2 and later, you’ll also be greeted by the following disclaimer:

“Engineering Mode: The authorized NSX Manager system administrator is requesting a shell which is able to perform lower level unix commands/diagnostics and make changes to the appliance. VMware asks that you do so only in conjunction with a support call to prevent breaking your virtual infrastructure. Please enter the shell diagnostics string before proceeding.Type Exit to return to the NSX shell. Type y to continue:”

And finally, you’ll want to ensure you have a full backup of NSX Manager should anything need to be modified:

VMware recommends to take full backup of the system before performing any changes after logging into the Tech Support Mode.

Although it is very useful to take a ‘read only’ view at some things in the root shell, making any changes is not supported without getting direct assistance from VMware support.

A few people have asked whether or not making the root shell password public is a security issue, but the important point to remember is that you cannot even get to a position where you can enter the shell unless you are already logged in as an NSX enterprise administrator level account. For example, the built-in ‘admin’ account. For anyone concerned about this, VMware does allow the root password to be changed. It’s just critical that this password not be lost in case VMware support requires access to the root shell for troubleshooting purposes. More information on this can be found in KB 2149630.

To be honest, I’m a bit torn on this development. As someone who does backline support, I know what kind of damage that can be done from the root shell – even with the best intentions. But at the same time, I see this as empowering. It gives customers additional tools to troubleshoot and it also provides some transparency into how NSX Manager works rather than shielding it behind a restricted shell. I think that overall, the benefits outweigh the risks and this was a positive move for VMware.

When I think back to VI 3.5 and vSphere 4.0 when ESXi was shiny and new, VMware initially took a similar stance. You had to go so far as to type ‘UNSUPPORTED’ into the console to access a shell. Today, everyone has unrestricted root access to the hypervisor. The same holds true for the vCenter appliance – the potential for destruction is no different.

I’d welcome any comments or thoughts. Please share them below!

Controller Disconnect and API Bug in NSX 6.3.3

VMware just announced a new bug discovered in NSX 6.3.3. Those running 6.3.3 or planning to upgrade in the near-term may want to familiarize themselves with VMware KB 2151719.

As you may know, VMware moved from a Debian based distribution for the underlying OS of the NSX controllers to their Photon OS platform. This is why the upgrade process includes the complete redeployment of all three controller nodes.

It appears that a scheduled clean-up script on the controllers used to prevent the partitions from filling is also removing some files required for NSX Manager to communicate and authenticate with the controller via REST API.

Most folks running 6.3.3 in a stable deployment will likely not have noticed, but an event disrupting communication between Manager and the Controllers can prevent them from reconnecting. Some examples would include a reboot of the NSX manager, or a network disruption.

Thankfully, the NSX Controller core functions – managing the VXLAN and distributed logical routing control plane – will continue to work in this state and dataplane disruptions should not be experienced.

KB 2151719 discusses a pretty simple workaround of restarting the api-server service on any impacted controllers. This is a non-disruptive action and should be safe to do at any time. The command is the following:

nsx-controller # restart api-server

VMware will likely be addressing this in the next NSX release. If you are planning to upgrade, you may want to consider 6.3.2 or hold out for the next 6.3.x release.

VMXNET3 RX Ring Buffer Exhaustion and Packet Loss

ESXi is generally very efficient when it comes to basic network I/O processing. Guests are able to make good use of the physical networking resources of the hypervisor and it isn’t unreasonable to expect close to 10Gbps of throughput from a VM on modern hardware. Dealing with very network heavy guests, however, does sometimes require some tweaking.

I’ll quite often get questions from customers who observe TCP re-transmissions and other signs of packet loss when doing VM packet captures. The loss may not be significant enough to cause a real application problem, but may have some performance impact during peak times and during heavy load.

After doing some searching online, customers will quite often land on VMware KB 2039495 and KB 1010071 but there isn’t a lot of context and background to go with these directions. Today I hope to take an in-depth look at VMXNET3 RX buffer exhaustion and not only show how to increase buffers, but to also to determine if it’s even necessary.

Rx Buffering

Not unlike physical network cards and switches, virtual NICs must have buffers to temporarily store incoming network frames for processing. During periods of very heavy load, the guest may not have the cycles to handle all the incoming frames and the buffer is used to temporarily queue up these frames. If that buffer fills more quickly than it is emptied, the vNIC driver has no choice but to drop additional incoming frames. This is what is known as buffer or ring exhaustion.

Continue reading “VMXNET3 RX Ring Buffer Exhaustion and Packet Loss”

Building a Retro Gaming Rig – Part 2

In Part 1 of this series, I took a close look at the Asus P2B slot-1 motherboard and some CPU options. Today, the focus will be on graphics cards.

Evaluating Video Card Options

If this were a pure a DOS build, the card I chose wouldn’t matter very much. I’d simply be interested in something with decent 2D image quality and enough VRAM to support the resolutions I’d want to use. The majority of DOS games don’t offer 3D acceleration but because I wanted a system that would run some Windows 9x based titles, a 3D card would be ideal for that genuine experience.

The years leading up to the millennium were very exciting from a graphics hardware perspective. In 1998, 3D accelerated cards were much more accessible to the average consumer and major strides were made in performance and price.

3dfx Interactive

If you were at all into PC gaming or PC hardware back in the late nineties, you will be well acquainted with 3dfx Interactive. Although the company went bankrupt in 2002, 3dfx was the pioneer in mainstream 3D gaming technology up until that point. The card that brought them to fame was their 3D-only ‘Voodoo Graphics’ adapter released in 1996. They were also well known for their proprietary Glide API that many games of the era supported. The turning point for me personally was when I first experienced GLQuake on a Voodoo card back in 1996. I remember being totally blown away after seeing the surreal lighting and 3D effects and was determined to buy one. After saving up while working a summer job in 1997, I bought the 6MB Canopus Pure 3D card based on the famous chipset and never regretted it for a moment. I used that card for several years until buying a Voodoo 3 3000 later on.

The 3dfx Voodoo Banshee

Being that my retro rig was supposed to be of a 1998 vintage, I had originally started looking at the Voodoo 2 released that year, which was superior to the original in every respect. My next choice was the AGP version of the 3dfx Voodoo 3. Unfortunately, both of these models were fetching well over $100 on eBay, and outside of my budget for this weekend project. They are simply in high demand and are genuine collectors items at this point.

Undeterred, I shifted focus to another part that was a bit less popular but still had all the 3dfx flare of 1998 – the 3dfx Voodoo Banshee. The Banshee was a first for 3dfx in two respects – it was their first single chip 2D/3D solution and their first AGP based card. The 3D portion of the core is almost the same as the popular Voodoo 2, but had only one texture mapping unit instead of two. The resulting performance drop was partially offset by a higher core and memory clock speed but it still had plenty of power for games of that generation. As an added bonus, the 2D performance and image quality of the Banshee was excellent – almost as good as the top notch 2D cards of the time and removed the need for a separate 2D and 3D card in the system.

Today, the Banshee cards don’t sell for quite as much as the Voodoo 2 or Voodoo 3, but can still go for between $60 and $120 on eBay.

3dfx1-1

I was fortunate enough to stumble upon an awesome deal on a 16MB Ensoniq brand Banshee card for only $25 USD. It was probably only still available because the eBay listing didn’t include the words ‘3dfx’ or ‘Voodoo’ in the description. After the shipping, duty and exchange I think it came to under $50 CDN. Not too bad for a rare piece of history.

Ensoniq no longer exists, but was best known for their audio products. They were purchased by Creative Labs shortly after this card was produced. After seeing some of the other Banshee models from other companies, I actually like this card best because of the larger heatsink on it. Most of the other cards have rather small chipset style coolers. They all seem to be passive, but I can confirm that these cards do get rather toasty under load.

3dfx1-2

In 1998, dual monitor outputs and DVI were reserved only for elite workstation cards and specialty products. Just a simple VGA-out can be found on the Ensoniq Banshee. Some cards of this era differentiated themselves with TV S-Video and Composite outputs, but nothing that would interest anyone today.

It was interesting to see two mysterious headers on the card. One is a 40-pin female IDE-style connector, and the other is a 26 pin header of some sort. My first thought was that they may have been for future SLI support, but after doing some digging these turned out to be standard feature connectors. These were found on many cards of this vintage for daughter boards and other add-on modules. Technically, the 36 pin SLI connector found on Voodoo 2 cards is indeed a feature connector of sorts as it allows the cards to communicate over a high-bandwidth channel without having to saturate the PCI bus.

AGP Variations

The earliest 3D cards were all PCI based, but in 1998 many first generation AGP variants were on the market, including the Banshee. A lot of people building retro rigs seem to forget that there were actually several AGP ‘versions’ to hit the market over the years and not all were backward compatible. An AGP 1x card released in 1998 very likely won’t work or even fit in an AGP board released a few years later.

retro-slot1-23

AGP 1x and 2x are very similar and provide power to the card at 3.3V. AGP 4x and 8x came out a few years later and were limited to 1.5V. As you can imagine, putting 3.3V through a card designed for 1.5V could be pretty catastrophic. To prevent these problems, slot and card ‘keying’ were implemented to ensure you couldn’t physically insert a card into an incompatible slot. As you can see above, the 3.3V AGP 1x slot on the ASUS P2B has a plastic ‘key’ about a third of the way in from the left of the slot. A 1.5V 4x/8x slot would have that same key, but in a different position closer to the right of the slot.

Interestingly, some cards that came out later on were actually compatible with both 1.5V and 3.3V AGP slots and had two key indentations on the card. The card in the image above is a 2003 era ATI Radeon 9200 SE. I have used this same card in a newer 8x slot machine as well as the 1x ASUS P2B.

The budget priced Radeon 9200 SE is actually about 9 times more powerful than the Banshee. I had considered using this Windows 98 compatible card in my retro build, but it just didn’t have the same nostalgic value that a genuine 3dfx card does. For now, it’s been serving as a spare.

Conclusion

I was really excited to get a 3dfx card – especially without having to shell out too much. It’ll be great to use a card from that era and to use the 3dfx proprietary glide API for many games as well. I think that the Banshee is a perfect match for the build and I look forward to getting everything put together!

Next up, I’ll take a look at a couple of sound card options, including both ISA and PCI cards. In an added twist, I also bought a compact micro ATX socket 370 system that I’ll evaluate as well.

Part 3 – The mATX MSI 6160 and BIOS Woes >>

Building a Retro Gaming Rig Series

Part 1 – ASUS P2B and Slot-1 CPUs
Part 2 – 3D Video Card Options
Part 3 – The mATX MSI MS-6160 and BIOS Woes
Part 4 – Choosing a Sound Card
Part 5 – Hard Drive/Storage Options
Part 6 – Completed Build!

Using FreeNAS for NSX FTP Backups

FreeNAS is a very powerful storage solution and is quite popular with those running vSphere and NSX home labs. I recently built a new FreeNAS 9.10 system and wanted to share some of my experiences getting NSX FTP backups going.

To get this configured, I found the FTP section of the FreeNAS 9.10 documentation to be very useful. I’d definitely recommend giving it a read through as well.

Before Getting Started

Before enabling the FTP service in FreeNAS, you’ll want to decide where to put your NSX backups. In theory, you can dump them in any of your volumes or datasets but you may want to set aside a specific amount of storage space for them. To do this in my lab, I created a dedicated dataset with a 60GB quota for FTP purposes. I like to separate it out to ensure nothing else competes with the backups and the amount of space available is predictable.

FreeNASNSXbackups-1

If you plan to use FTP for more than just NSX, it would be a good idea to create a subdirectory in the dataset or other location you want them to reside. In my case, I created a directory called ‘NSX’ in the dataset:

[root@freenas] ~# cd /mnt/vol1/dataset-ftp
[root@freenas] /mnt/vol1/dataset-ftp# mkdir NSX
[root@freenas] /mnt/vol1/dataset-ftp# ls -lha
total 2
drwxr-xr-x 3 root wheel 3B Sep 7 09:23 ./
drwxr-xr-x 5 root wheel 5B Sep 5 16:13 ../
drwxr-xr-x 2 root wheel 2B Sep 7 09:23 NSX/
[root@freenas] /mnt/vol1/dataset-ftp#

Setting Permissions

One step that is often missed during FreeNAS FTP configuration is to set the appropriate permissions. The proftpd service in FreeNAS uses the built in ftp user account. If that user does not have the appropriate permissions to the location you intend to use, backups will not write successfully.

Since I used a dedicated dataset for FTP called dataset-ftp, I can easily set permissions recursively for this location from the UI:

FreeNASNSXbackups-2

As shown above, we want to set both the owner user and group to ftp. Because I created the NSX directory within the dataset, I’ll be setting permission recursively as well.

If I log into FreeNAS via SSH or console again, I can confirm that this worked because the dataset-ftp mount is now owned by ftp as is the NSX subdirectory within.

[root@freenas] /mnt/vol1# ls -lha
total 14
drwxrwxr-x  5 root  wheel     5B Sep  5 16:13 ./
drwxr-xr-x  4 root  wheel   192B Sep  5 16:09 ../
drwxr-xr-x  3 ftp   ftp       3B Sep  7 09:23 dataset-ftp/
drwxrwxr-x  5 root  wheel    13B Jul 29 16:23 dataset-static/
drwxrwxr-x  2 root  wheel     2B Sep  5 16:13 dataset-tftp/
[root@freenas] /mnt/vol1# ls -lha dataset-ftp
total 2
drwxr-xr-x  3 ftp   ftp       3B Sep  7 09:23 ./
drwxrwxr-x  5 root  wheel     5B Sep  5 16:13 ../
drwxr-xr-x  2 ftp   ftp       2B Sep  7 13:46 NSX/

The Easy Option – Anonymous FTP Access

Setting up anonymous FTP access requires the least amount of effort and is usually sufficient for home lab purposes. I would strongly discourage the use of anonymous access in a production or security sensitive environment as anyone on the network can access the FTP directory configured.

First, configure FTP under services in the FreeNAS UI:

FreeNASNSXbackups-4

As you’d obviously expect, the ‘Allow Anonymous Login’ option needs to be checked off in order for anonymous FTP to work. The ‘Allow Local Users Login’ option should be unchecked if you don’t want to use authentication. It’s also important to select the ‘Path’ to the FTP root directory you wish to use. In my example above, any anonymous FTP logins will go directly into the NSX subdirectory I created earlier.

If you want to use FTP for more than just NSX backups, you can make the path the root of the dataset and NSX can be configured to use a specific subdirectory within as I’ll show later.

Once that’s done, you can enable the FTP service. It’ll be off by default:

FreeNASNSXbackups-5

Now we can do some basic tests to ensure FTP is functional. You can use an FTP client like FileZilla if you like, but I’m just going to use the good old Windows FTP command line utility. First, let’s make sure we can login anonymously:

C:\Users\mike.LAB\Desktop>ftp freenas.lab.local
Connected to freenas.lab.local.
220 ProFTPD 1.3.5a Server (freenas.lab.local FTP Server) [::ffff:172.16.10.17]
User (freenas.lab.local:(none)): anonymous
331 Anonymous login ok, send your complete email address as your password
Password:
230 Anonymous access granted, restrictions apply

A return status of 230 is what we’re looking for here and this seems to work fine. Keep in mind that it technically doesn’t matter what password you enter for the anonymous username. You can just hit enter, but I usually just re-enter the username. It’s not necessary to enter anything that resembles an email address.

Next, let’s make sure we have permission to write to this location. I’ll do an FTP ‘PUT’ of a small text file:

ftp> bin
200 Type set to I
ftp> put C:\Users\mike.LAB\Desktop\test.txt
200 PORT command successful
150 Opening BINARY mode data connection for test.txt
226 Transfer complete
ftp: 14 bytes sent in 0.00Seconds 14000.00Kbytes/sec.
ftp> dir
200 PORT command successful
150 Opening ASCII mode data connection for file list
-rw-r----- 1 ftp ftp 14 Sep 7 13:56 test.txt
226 Transfer complete
ftp: 65 bytes received in 0.01Seconds 10.83Kbytes/sec.
ftp>

As seen above, the file was written successfully with a 226 return code. The last step I’d recommend doing before configuring NSX is to confirm the relative path after login from the FTP server’s perspective. Because I stayed in the FTP root directory, it simply lists a forward slash as shown below:

ftp> pwd
257 "/" is the current directory
ftp>

NSX expects this path as you’ll see shortly. Now that we know anonymous FTP is working, we can configure the FTP server from the NSX appliance UI:

FreeNASNSXbackups-3

As you can see above, I’ve entered ‘anonymous’ as the user name, and entered the same as the password string. The backup directory is the location you want NSX to write backups to. If you had a specific directory you wanted to use within the FTP root directory that was configured, you could enter it here. For example, /backups. As mentioned earlier, my FTP root directory is the NSX directory so it’s not necessary in my case.

Two other pieces of information are mandatory in NSX – the filename prefix and the pass phrase. The filename prefix is just that – a string that is appended to the beginning of the filename. It usually makes sense to identify the environment or NSX manager by name here. This is especially important if you have multiple NSX managers all backing up to one location. The pass phrase is a password used to encrypt the backup binary file generated. Be sure not to lose this or you will not be able to restore your backups.

After hitting OK, we can then do a quick backup to ensure it can connect and write to the location configured.

FreeNASNSXbackups-6

If everything was successful, you should then see your file listed in the backup history pane at the bottom of the view:

FreeNASNSXbackups-7

FTP User Authentication

Anonymous FTP may be sufficient for most home lab purposes, but there are several advantages to configuring users and authentication. FTP by nature transmits in plain text and is not secure, but adding authentication provides a bit more control over who can access the backups and allows the direction of users to specific FTP locations. This can be useful if you plan to use your FTP server for more than just NSX.

Before we begin, let’s create a user in FreeNAS that we’ll use for NSX backups:

FreeNASNSXbackups-8

Some of the key things you’ll need to ensure is that the user’s primary group is the built-in ftp group used by proftpd and that the user’s home directory is where you want them to land after log in. In my example above, I’m creating a user called nsxftpuser with a home directory of the FTP root directory I configured earlier.

Keep in mind that by default FreeNAS will create a new home directory hence the wording “Create Home Directory In:”. I expect the home directory to actually be /mnt/vol1/dataset-ftp/nsxftpuser and not /mnt/vol1/dataset-ftp/.

Next, we need to modify the FTP settings slightly:

FreeNASNSXbackups-9

Since we want to use local user authentication, we need to check ‘Allow Local User Login’. I’ve also unchecked ‘Allow Anonymous Login’ to ensure only authenticated users can now login.

In order to test that we’re dumped into the user’s home directory after login, I changed the FTP default path one level back to the root of the dataset.

As a last step, it’s necessary to stop and start the FTP service again for the changes to take effect.

Before we test this new user, let’s double check that the home directory is located where we want it:

[root@freenas] /mnt/vol1/dataset-ftp# ls -lha
total 18
drwxr-xr-x  4 ftp         ftp       4B Sep  7 14:42 ./
drwxrwxr-x  5 root        wheel     5B Sep  5 16:13 ../
drwxr-xr-x  2 ftp         ftp       5B Sep  7 14:43 NSX/
drwxr-xr-x  2 nsxftpuser  ftp      10B Sep  7 14:42 nsxftpuser/

As you can see above, we now have a home directory matching the username in the FTP root location.

Now let’s try to log in using the nsxftpuser account:

C:\Users\mike.LAB\Desktop>ftp freenas.lab.local
Connected to freenas.lab.local.
220 ProFTPD 1.3.5a Server (freenas.lab.local FTP Server) [::ffff:172.16.10.17]
User (freenas.lab.local:(none)): nsxftpuser
331 Password required for nsxftpuser
Password:
230-Welcome to FreeNAS FTP Server
230 User nsxftpuser logged in
ftp>

So far so good, now let’s PUT a file to ensure we have write access to this location:

ftp> bin
200 Type set to I
ftp> put C:\Users\mike.LAB\Desktop\test.txt
200 PORT command successful
150 Opening BINARY mode data connection for test.txt
226 Transfer complete
ftp: 14 bytes sent in 0.00Seconds 14000.00Kbytes/sec.
ftp> dir
200 PORT command successful
150 Opening ASCII mode data connection for file list
-rw-r----- 1 nsxftpuser ftp 14 Sep 7 14:45 test.txt
226 Transfer complete
ftp: 67 bytes received in 0.00Seconds 22.33Kbytes/sec.

Success! The last thing we need to do is modify the NSX configuration slightly to use the new user account:

FreeNASNSXbackups-10

And sure enough, the backup was successful at 18:47 GMT:

FreeNASNSXbackups-11

If I look at the files from the FreeNAS SSH session, I can see both the encrypted backup binary and metadata properties file located in the user’s home directory:

[root@freenas] /mnt/vol1/dataset-ftp/nsxftpuser# ls -lha lab*
-rw-r-----  1 nsxftpuser  ftp   2.4M Sep  7 14:47 lab18_47_33_Thu07Sep2017
-rw-r-----  1 nsxftpuser  ftp   227B Sep  7 14:47 lab18_47_33_Thu07Sep2017.backupproperties

Scheduling Backups

Once we know NSX backups are functional, it’s a good idea to get them going on a schedule.

An important consideration to keep in mind when deciding when to schedule is when your vCenter backups are done. Because NSX relies heavily upon the state of the vCenter Server inventory and objects, it’s a good idea to try to schedule your backups at around the same time. That way, if you ever need to restore, you’ll have vCenter and NSX objects in sync as closely as possible.

FreeNASNSXbackups-12

In my lab, I have it backing up every night at midnight, but depending on how dynamic your environment is, you may want to do it more frequently.

Another important point to note is that NSX Manager doesn’t handle large numbers of backups very well in the backup directory. The UI will throw a warning once you get up to 100 backups and eventually you’ll get a slow or non-responsive UI in the Backup and Restore section. To get around this, you can manually archive older backups to another location outside of the FTP root directory or create a script to move older files to another location.

The only piece that I haven’t gotten to work with FreeNAS yet is SFTP encrypted backups using TLS. Once I get that going well, I’ll hopefully write up another post on the topic.

Thanks for reading! If you have any questions please leave a comment below.

Removing Stale IP Pool Assignments in NSX

NSX uses the concept of IP pools for IP address assignment for several components including controllers, VTEPs and Guest Introspection. These are normally configured during the initial deployment of NSX and it’s always a good idea to ensure you’ve got some headroom in the pool for future growth.

NSX usually does a good job of keeping track of IP Pool address allocation, but in some situations, stale entries may be wasting IPs. There are a few ways you could get yourself into this situation – most commonly this is due to the improper removal of objects. For example, if an ESXi host is removed from the vCenter inventory while still in an NSX prepared cluster, its VTEP IP address allocation will remain. NSX can’t release the allocation, because the VIBs were never uninstalled and it has no idea what the fate of the host was. If the allocation was released and someone deployed a new host while the old one was still powered on, you’d likely get IP conflicts.

Just this past week, I assisted two separate customers who ran into similar situations – one had a stale IP in their controller pool, and the other had stale IPs in their VTEP pool. Both had removed controllers or ESXi hosts using a non-standard method.

If you have a look in the NSX UI, you’ll notice that there is no way to add, modify or remove allocated IPs. You can only modify or expand the pool. Thankfully, there is a way to remove allocated IPs from a pool using an NSX REST API call.

To simulate a scenario where this can happen, I went ahead and improperly removed one of the NSX controllers and did some manual cleanup afterward. As you can see below, the third controller appears to have been removed successfully.

ippoolAPI-2

When I try to deploy the third controller again, I’m unable to because of a shortage of IPs in the pool:

ippoolAPI-1

If I look at the IP Pool called ‘Controller Pool’ in the grouping objects, I can see that there are only three IPs available and one of them belongs to the old controller than no longer exists:

ippoolAPI-3

So in order to get my third controller re-deployed, I’ll need to either remove the stale 172.16.10.45 entry or expand my pool to have a total of four or more addresses. If this were a production environment, expanding the pool may be a suitable workaround to get things running again quickly. If you are at all like me, simply having this remnant left behind would bother me and I’d want to get it cleaned up.

Releasing IPs Using REST API Calls

Now that we’ve confirmed the IP address we want to nuke from the pool, we can use some API calls to gather the required information and release the address. The API calls we are interested in can be found in the NSX 6.2 and 6.3 API guides. My lab is currently running 6.2.7, so I’ll be using calls found on page 110-114 in the NSX 6.2 API guide.

Before we begin, there are two key pieces of information we’ll need to do this successfully:

  1. The IP address that needs to be released.
  2. The moref identifier of the IP pool in question.

First, we’ll use an API call to query all IP pools on the NSX manager. This will provide an output that will include the moref identifier of the pool in question:

GET https://NSX-Manager-IP-Address/api/2.0/services/ipam/pools/scope/scopeID

As you can see above, the ‘scope ID’ is also required to run this GET call. In every instance I’ve seen, using globalroot-0 as the scopeID works just fine here.

ippoolAPI-4

The various IP pools will be separated by <ipamAddressPool> XML tags. You’ll want to identify the correct pool based on the IP range listed or by the text in the <name> field. The relevant controller pool was identified by the following section in the output in my example:

<ipamAddressPool>
<objectId>ipaddresspool-1</objectId>
<objectTypeName>IpAddressPool</objectTypeName>
<vsmUuid>4226CDEE-1DDA-9FF9-9E2A-8FDD64FACD35</vsmUuid>
<nodeId>fa4ecdff-db23-4799-af56-ae26362be8c7</nodeId>
<revision>1</revision>
<type>
<typeName>IpAddressPool</typeName>
</type>
<name>Controller Pool</name>
<scope>
<id>globalroot-0</id>
<objectTypeName>GlobalRoot</objectTypeName>
<name>Global</name>
</scope>
<clientHandle/>
<extendedAttributes/>
<isUniversal>false</isUniversal>
<universalRevision>0</universalRevision>
<totalAddressCount>3</totalAddressCount>
<usedAddressCount>3</usedAddressCount>
<usedPercentage>100</usedPercentage>
<prefixLength>24</prefixLength>
<gateway>172.16.10.1</gateway>
<dnsSuffix>lab.local</dnsSuffix>
<dnsServer1>172.16.10.10</dnsServer1>
<dnsServer2>172.16.10.11</dnsServer2>
<ipPoolType>ipv4</ipPoolType>
<ipRanges>
<ipRangeDto>
<id>iprange-1</id>
<startAddress>172.16.10.43</startAddress>
<endAddress>172.16.10.45</endAddress>
</ipRangeDto>
</ipRanges>
<subnetId>subnet-1</subnetId>
</ipamAddressPool>

As you can see above, the IP pool is identified by the moref identifier ipaddresspool-1.

As an optional next step, you may wish to view the IP addresses allocated within this pool. The following API call will obtain this information:

GET https://NSX-Manager-IP-Address/api/2.0/services/ipam/pools/poolId/ipaddresses

In my example, I used the following call:

GET https://nsxmanager.lab.local/api/2.0/services/ipam/pools/ipaddresspool-1/ipaddresses

Below is the output I received:

<allocatedIpAddresses>
<allocatedIpAddress>
<id>13</id>
<ipAddress>172.16.10.44</ipAddress>
<gateway>172.16.10.1</gateway>
<prefixLength>24</prefixLength>
<dnsServer1>172.16.10.10</dnsServer1>
<dnsServer2>172.16.10.11</dnsServer2>
<dnsSuffix>lab.local</dnsSuffix>
<subnetId>subnet-1</subnetId>
</allocatedIpAddress>
<allocatedIpAddress>
<id>14</id>
<ipAddress>172.16.10.43</ipAddress>
<gateway>172.16.10.1</gateway>
<prefixLength>24</prefixLength>
<dnsServer1>172.16.10.10</dnsServer1>
<dnsServer2>172.16.10.11</dnsServer2>
<dnsSuffix>lab.local</dnsSuffix>
<subnetId>subnet-1</subnetId>
</allocatedIpAddress>
<allocatedIpAddress>
<id>15</id>
<ipAddress>172.16.10.45</ipAddress>
<gateway>172.16.10.1</gateway>
<prefixLength>24</prefixLength>
<dnsServer1>172.16.10.10</dnsServer1>
<dnsServer2>172.16.10.11</dnsServer2>
<dnsSuffix>lab.local</dnsSuffix>
<subnetId>subnet-1</subnetId>
</allocatedIpAddress>
</allocatedIpAddresses>

Each allocated address in the pool will have its own <id> tag. I can see that 172.16.10.45 is indeed still there. Now let’s remove it using the following API call:

DELETE https://NSX-Manager-IP-Address/api/2.0/services/ipam/pools/poolId/ipaddresses/allocated-ip-address

In my example, the exact call would be:

DELETE https://nsxmanager.lab.local/api/2.0/services/ipam/pools/ipaddresspool-1/ipaddresses/172.16.10.45

ippoolAPI-5

If the call was successful, you should see a Boolean value of ‘true’ returned. Next you can validate again using the previous API call. In my case I used:

GET https://nsxmanager.lab.local/api/2.0/services/ipam/pools/ipaddresspool-1/ipaddresses

And got the following output:

<allocatedIpAddresses>
<allocatedIpAddress>
<id>13</id>
<ipAddress>172.16.10.44</ipAddress>
<gateway>172.16.10.1</gateway>
<prefixLength>24</prefixLength>
<dnsServer1>172.16.10.10</dnsServer1>
<dnsServer2>172.16.10.11</dnsServer2>
<dnsSuffix>lab.local</dnsSuffix>
<subnetId>subnet-1</subnetId>
</allocatedIpAddress>
<allocatedIpAddress>
<id>14</id>
<ipAddress>172.16.10.43</ipAddress>
<gateway>172.16.10.1</gateway>
<prefixLength>24</prefixLength>
<dnsServer1>172.16.10.10</dnsServer1>
<dnsServer2>172.16.10.11</dnsServer2>
<dnsSuffix>lab.local</dnsSuffix>
<subnetId>subnet-1</subnetId>
</allocatedIpAddress>
</allocatedIpAddresses>

As you can see above, the IP with an <id> tag of 15 has been removed. Next, I’ll confirm in the UI that the IP has indeed been released:

ippoolAPI-6

After a refresh of the vSphere Web Client view, the total used decreased to 2 for the Controller Pool and I could deploy my third controller successfully.

Although this process is straight forward if you are familiar with running NSX API calls, I do have to provide a word of caution. NSX will not stop you from releasing an IP if it is genuinely being used. Therefore, it’s important to make 100% sure that whatever object was using the stale IP is indeed off the network. Some basic ping tests are a good idea before proceeding.

Thanks for reading! If you have any questions, please feel free to leave a comment below.

Re-deploying NSX Controllers During Upgrades

I’ve had this question come up a lot lately and there seems to be some confusion around whether or not NSX controllers need to be redeployed after upgrading them. The short answer to this question is really “it depends”. There are actually three different scenarios where you may want or need to delete and re-deploy NSX controllers as part of the upgrade process. Today, I’ll walk through these situations and the proper process to delete and re-deploy your controller nodes.

The Normal Upgrade Process

Upgrading the NSX Control Cluster is a very straight-forward process. After clicking the upgrade link, an automated process begins to upgrade the controller code, and reboot each cluster member sequentially.

controller-redeploy-3

Once the ‘Upgrade Available’ link is clicked, you’ll see each of the three controllers download the upgrade bundle, upgrade and then reboot before NSX moves on to the next one.

controller-redeploy-7

Once NSX goes through its paces, it’s usually a good idea to ensure that the control-cluster join status is ‘Join complete’ and that all three controllers agree on the Cluster UUID.

nsx-controller # show control-cluster status
Type Status Since
--------------------------------------------------------------------------------
Join status: Join complete 07/24 13:38:32
Majority status: Connected to cluster majority 07/24 13:38:19
Restart status: This controller can be safely restarted 07/24 13:38:48
Cluster ID: f2849ee2-7fb6-4aca-abf4-2ca176337956
Node UUID: f2849ee2-7fb6-4aca-abf4-2ca176337956

Role Configured status Active status
--------------------------------------------------------------------------------
api_provider enabled activated
persistence_server enabled activated
switch_manager enabled activated
logical_manager enabled activated
directory_server enabled activated

Because the underling structure of the VM itself doesn’t change, this sort of in-place code upgrade and reboot is sufficient and has minimal impact.

Scenario 1 – E1000 vNIC Replacement

The first scenario where you may want to redeploy the controllers involves a virtual hardware change that was introduced in NSX 6.1.5. NSX controllers deployed in 6.1.5 use the VMXNET3 vNIC adapter, whereas older versions had legacy Intel E1000 emulated vNICs. This change wasn’t well publicized and surprisingly it isn’t even found in the NSX 6.1.5 release notes.

I’ve seen quite a few customers go through upgrade cycles from 6.0 or 6.1 all the way to more recent 6.2.x or 6.3.x releases while retaining E1000 vNICs on their controllers. Although the E1000 vNIC adapter is generally pretty stable, there is at least one documented issue where the adapter driver suffers a hang and the controller is no longer able to transmit or receive. This problem is described in VMware KB 2150747.

That said, I personally would not wait for a problem to occur and would recommend checking to ensure your controllers are using VMXNET3, and if not, go through the redeployment procedure I’ll outline later in this post. Aside from preventing the E1000 hang problem, you’ll also benefit from the other improvements VMXNET3 has to offer like better offloading and lower CPU utilization.

Unfortunately, finding out if your controllers have E1000 or VMXNET3 adapters can be a tad tricky. You’ll find that your controllers are locked down and can’t be edited in the vSphere Web Client or the legacy vSphere Client.

controller-redeploy-1

As seen above, the ‘Edit Settings’ option is greyed out.

controller-redeploy-2

The summary page also doesn’t tell us much, so the easiest way to get the adapter type is to check from the ESXi command line.

First, let’s SSH into a host where one of the controllers live and then find the full path to the VMX file:

[root@esx0:~] cd /vmfs/volumes
[root@esx0:/vmfs/volumes] find ./ -name NSX_Controller*.vmx
./58f77a6f-30961726-ac7e-002655e1b06c/NSX_Controller_078fcf78-9a0c-491d-95a0-02e8b5175935/NSX_Controller_078fcf78-9a0c-491d-95a0-02e8b5175935.vmx

Next, I will look for the relevant vNIC adapter settings in the VMX file using the full path obtained in the previous command output:

[root@esx0:/vmfs/volumes] cat ./58f77a6f-30961726-ac7e-002655e1b06c/NSX_Controller_c97459f1-3845-436f-8e03-60ad3cbed9e4/NSX_Controller_c97459f1-3845-436f-8e03-60ad3cbed9e4.vmx |grep -i ethernet0.virtualDev
ethernet0.virtualDev = "vmxnet3"

The key setting in the VMX that we are interested in is ethernet0.virtualDev. As seen above, the type is vmxnet3 on my controllers as they were created from a freshly deployed 6.2.5 environment. If you see e1000 here, your controllers were deployed from a 6.1.4 or older setup and have never been re-deployed.

Scenario 2 – Updating the Disk Partitioning Layout

The second scenario would be if your controllers were initially deployed in a version of NSX prior to 6.2.3. Since 6.2.3 was pulled shortly after release, 6.2.4 would be more relevant starting point.

A statement you’ll find in the NSX 6.2.4 release notes summarizes this change well:

“…New installations of NSX 6.2.3 or later will deploy NSX Controller appliances with updated disk partitions to provide extra cluster resiliency. In previous releases, log overflow on the controller disk might impact controller stability. In addition to adding log management enhancements to prevent overflows, the NSX Controller appliance has separate disk partitions for data and logs to safeguard against these events. If you upgrade to NSX 6.2.3 or later, the NSX Controller appliances will retain their original disk layout.”

Again, it’s possible you may never run into a problem due to the old partitioning layout, but it’s always wise to take advantage of ‘optional’ resiliency enhancements like this. This is especially true for such a critical component of the NSX control-plane.

Although there isn’t a supported way to enter the root shell on a controller appliance, the ‘show status’ command will provide you with the partitioning layout. Here is the layout on a newer 6.2.5 controller with the newer partitioning:

nsx-controller # show status
Version: 4.0.6 (Build 48886)

Current Time: Fri, 25 Aug 2017 15:01:17 +0000
Uptime: 32 days 1 hour 23 minutes 16 seconds

Load Average: 0.10, 0.10, 0.13
Memory Usage: 3926484 kB (Total), 267752 kB (Free)
Disk Usage:
Filesystem                      1K-blocks    Used Available Use% Mounted on
/dev/sda1                         6596784 1181520   5057120  19% /
udev                              1953420       4   1953416   1% /dev
/dev/mapper/nvp-var               6593712 2349576   3886148  38% /var
/dev/mapper/nvp-var+cloudnet+data 3776568  147908   3417104   5% /var/cloudnet/data

Essentially, there are now three separate partitions for data instead of just one. Files for everything were just lumped together along with the Linux OS in a single partition previously. If some runaway log files filled the partition, key services would be impacted. By separating everything out, the key controller services like the zookeeper clustering service will still be able to write to disk.

I don’t have access to a pre-6.2.3 setup at the moment, but you can tell if your controller still uses the old partitioning layout by the absence of two partitions in the ‘show status output’. Both /dev/mapper/nvp-var and /dev/mapper/nvp-var+cloudnet+data only exist on controllers using the new partitioning layout.

Because disk partitioning is a pretty low-level, there was really no way to incorporate this into the automated upgrade process. To get the new layout, you’ll need to delete and re-deploy the controller appliances.

Scenario 3 – Upgrading to NSX 6.3.3

NSX 6.3.3 introduces a major change to the NSX controllers, replacing the underlying Linux OS with VMware’s new distribution dubbed Photon OS. The virtual hardware also changes slightly in 6.3.3 as the Photon OS based controllers require larger VMDK disks. Because this changes the entire foundation of the VM and is mandatory – unlike the vNIC and partitioning changes mentioned earlier – there is no way to perform in-place code upgrades. Each of the controllers needs to be deleted and re-deployed.

Thankfully, because of the mandatory nature of this change, VMware modified the upgrade process in 6.3.3 to automatically delete and re-deploy controllers for you.

From the NSX 6.3.3 release notes:

“In NSX 6.3.3, the underlying operating system of the NSX Controller changes. This means that when you upgrade to NSX 6.3.3, instead of an in-place software upgrade, the existing controllers are deleted one at a time, and new Photon OS based controllers are deployed using the same IP addresses.”

That said, no manual intervention is required when upgrading to 6.3.3. Controllers will be deleted and re-deployed automatically as part of the upgrade process. For more information, see the NSX public docs on the subject.

Some Warnings and Cautions

Before I go through the process of destroying and re-creating controller nodes, I really want to preface by saying that this process is potentially risky and should only be done during a maintenance window. It’s also very important that the process be done correctly to ensure you don’t run into any major problems. Below are some common pitfalls and other recommendations:

  1. Never just delete or remove the controller appliances from the vCenter inventory. NSX keeps track of the controllers in its database and doesn’t react well to having appliances yanked from under it. They must be deleted properly.
  2. Never deploy more than three controllers thinking you can just do a ‘cut over’. I.e. Don’t deploy six controllers and then delete the three old ones. A one to one replacement must be done and we never want fewer than two functional controllers in the cluster, and never more than three.
  3. If a controller fails to delete using the normal supported method, there is a reason. Don’t force the deletion without speaking to VMware technical support first. A common reason I’ve seen for this is a mismatched moref identifier for the appliance VM. If the NSX database thinks a controller is vm-73, but the actual VM is vm-75, the delete will fail. Removing controllers from the vCenter Inventory and re-adding them will cause this type of mismatch.
  4. It’s very important to validate that the control cluster health is good before proceeding to the next controller for deletion/re-deployment. Do not skip these checks and be patient with this process. Unless you have two fully functional controllers up and running in the cluster, you won’t have full control-plane functionality and a risk data-plane outage.
  5. If something goes wrong, you’ll still be okay if you have two controllers working in the cluster. Don’t just proceed in the interest of ‘moving forward’ because there is a good chance the other two will behave similarly. Contact VMware support if there is every any doubt.

A Quick Note on Force Deletion

While trying to delete a controller, you’ll be greeted by a ‘Forcefully Delete’ option. When selected, this option nukes everything related to the controller from the NSX database and NSX doesn’t care whether the VM appliance is successfully removed or not. This option should never used unless advised by VMware support for repairing specific cluster problems. As mentioned in the previous section, if a regular delete fails, there is always a specific reason. Using ‘Forcefully Delete’ to work around these problems can leave remnants behind and potentially cause problems with the cluster.

The warning presented by the NSX UI when you try to Force Delete a cluster node:

“Forcefully deleting a controller may result into NSX Controller cluster going down and the rest of the controllers may get disconnected, thereby resulting in problems like no majority and data inconsistency. Many operations like adding logical components will not be possible. If you still choose to delete the controller, it is recommended to also delete the rest of the controllers and recreate them.”

It’s also worth mentioning that the only time you’d need to forcefully delete a controller in a normal workflow is when deleting the last of three controllers. NSX will only delete the very last controller using the force option. Because we’re only removing one at a time, this should not apply here.

Controller Re-deployment Process

Again, you won’t need to use this process if you are upgrading to NSX 6.3.3 or later because the deletion and re-creation of appliances is handled in an automated manner. If you’d like to take advantage of a VMXNET3 adapter and/or the new partitioning layout in newer versions of NSX, please read on.

The overall goal here is to replace the NSX control cluster members one at a time, keeping in mind that as long as two controller nodes are online and healthy, the control-plane continues to function. In theory, you shouldn’t suffer any kind of control-plane or data-plane outage using this process.

**Edit 11/15/2017: As you may be aware, there have been a few new bugs discovered, including one that impacts the deployment/re-deployment of NSX controllers in 6.3.3 and 6.3.4. Please be sure to have a look at my post on the subject as well as the VMware KB before proceeding. If you are running 6.3.3, do not delete your controllers until you’ve implemented the workaround or patched. If you still have the old 6.3.3 upgrade bundle, you may not be able to upgrade.

Step 1 – Data collection and preparation

Before proceeding, we’ll need to collect some information about our current controller deployment. In order to deploy a controller, the following information is required:

  1. The vSphere Cluster that your controllers will live in.
  2. The datastore you want to use for your controllers.
  3. The network portgroup (standard or distributed) that your controllers are in.
  4. If you used a specific naming convention for your controllers, be sure to note it down.
  5. And finally, the IP address pool that’s used for the controllers. Note that when deleting controllers using this method, an IP will be freed up from the pool so even with just three IPs in a pool, this process should work.

Be sure to get the above information from the vSphere Web Client before proceeding so that you don’t have to go looking for it during the process.

Step 2 – Validate the control-cluster health

Before you begin the process, it’s very important to ensure you have a functional control cluster with all nodes connected to the cluster majority. As tempting as it may be, do not skip this check.

controller-redeploy-4

Checking in the UI is a good first place to look for obvious signs of trouble, but I would not rely on this method alone. If everything is green in the UI, log into each of the three controllers via SSH and run the show control-cluster status command:

nsx-controller # show control-cluster status
Type Status Since
--------------------------------------------------------------------------------
Join status: Join complete 08/25 15:26:19
Majority status: Connected to cluster majority 08/25 15:30:45
Restart status: This controller can be safely restarted 08/25 15:31:09
Cluster ID: f2849ee2-7fb6-4aca-abf4-2ca176337956
Node UUID: 309611b3-2441-4a1a-a12b-a5537e999c23

Role Configured status Active status
--------------------------------------------------------------------------------
api_provider enabled activated
persistence_server enabled activated
switch_manager enabled activated
logical_manager enabled activated
directory_server enabled activated

There are several key things you’ll want to validate before proceeding.

  1. The Join status must read ‘Join complete’. No other status is acceptable.
  2. The Majority status must read ‘Connected to cluster majority
  3. The Restart status must read ‘This controller can be safely restarted’.
  4. Each controller node must have the same ‘Cluster ID’.

If all three controllers look good, you can proceed.

Step 3 – Delete the first controller

Once we’ve confirmed the control cluster health is good, we can delete the first controller from the NSX UI. It doesn’t matter which one you do first, but in my example, I’ll start with controller-3 and work my way backwards.

To delete, simply select the ‘Management’ tab of the Installation section in the NSX UI and click the little red ‘X’ icon above.

controller-redeploy-9

As mentioned earlier, we want to use the normal ‘Delete’ option. Do NOT use ‘Forcefully Delete’.

controller-redeploy-10

NSX will execute several tasks related to the controller VM. First, it will power off the VM appliance, it will then delete it and remove all references of the controller in the database. It’s not unusual for this process to take 10 minutes or longer.

Once the controller has disappeared from the NSX ‘Management’ tab, it’s very important to check that the appliance itself was actually deleted from the vCenter inventory.

controller-redeploy-11

Check for both the successful power off and deletion tasks in the recent tasks pane and also confirm the VM is no longer present in the inventory.

Finally, we’ll want to check the cluster health from the other two surviving nodes using the same show control-cluster status command we used earlier. Ensure that both controllers look healthy.

I’d also recommend ensuring that the cluster is now only comprised of two nodes from the NSX controller node’s perspective. Just because NSX manager says there are two doesn’t necessarily guarantee the other controllers do. You can check this using the show control-cluster startup-nodes command:

nsx-controller # show control-cluster startup-nodes
172.16.10.43, 172.16.10.44

As seen above, my control cluster confirms only two members.

Step 4 – Replace the Deleted Controller.

Once the first controller has been deleted successfully and we’ve confirmed the health of the control cluster, we can go ahead and deploy a new one.

controller-redeploy-12

The process should be very straight forward and is the same as what was done during the initial deployment of NSX. Keep in mind that the name you specify is simply a label and that the moref identifier of the new controller will change.

controller-redeploy-13

NSX will report the new controller in the ‘Deploying’ status for some time, but you can monitor the tasks and events from the vSphere Web Client:

controller-redeploy-14

You can also watch the console of the new controller to confirm that it’s finished joining the cluster and ready for logins. It will usually be sitting a ‘Fetching initial configuration data’ for some time before it’s ready:

controller-redeploy-15

Once it’s powered up and ready, you can log-in via CLI and ensure that the ‘show control-cluster status’ output looks healthy as described earlier and that there are three startup-nodes again:

nsx-controller # show control-cluster status
Type Status Since
--------------------------------------------------------------------------------
Join status: Join complete 08/25 17:47:16
Majority status: Connected to cluster majority 08/25 17:47:13
Restart status: This controller can be safely restarted 08/25 17:47:14
Cluster ID: f2849ee2-7fb6-4aca-abf4-2ca176337956
Node UUID: f9a2d207-bf57-4f23-b075-1eefc58bfc8d

Role Configured status Active status
--------------------------------------------------------------------------------
api_provider enabled activated
persistence_server enabled activated
switch_manager enabled activated
logical_manager enabled activated
directory_server enabled activated

nsx-controller # show control-cluster startup-nodes
172.16.10.43, 172.16.10.44, 172.16.10.45

As seen above, my new controller is online and healthy. Most importantly it agrees with the other two controllers on the ID of the cluster and number of startup nodes.

You could also do a ‘show status’ on the controller to confirm that it has the new partitioning layout at this time as discussed earlier.

Step 5 – Rinse and Repeat.

It’s extremely important to verify the cluster health before proceeding with the deletion of the next cluster node. Aside from the checks in the previous section, this would also be a good time to do some basic connectivity tests. Make sure your distributed routers are functional and that your guests connected to logical switches are working normally.

If you delete the next controller while the cluster is in a bad state, there is a good chance you’ll be down to a single node and will be operating in a ‘read-only’ state. In this condition, any VTEP, ARP or MAC table changes in the environment – like those triggered by vMotions, etc – would fail to propagate. This is definitely not a situation you’d want to be in.

Once you are sure it’s safe to proceed, simply repeat steps 3 and 4 above for the remaining two controllers.

Conclusion

So there you have it. The process can be a bit of a nail-biting experience in a production environment, but if you take the appropriate precautions everything should work without a hitch. The reward for your patience will be a more resilient control cluster with virtual hardware configured as VMware intended.

Thanks for reading! If you have any questions, please feel free to post below.

Building a Retro Gaming Rig – Part 1

Welcome to a new hardware build series where I’ll be sharing my experiences building a retro Intel gaming system. In part 1, I’ll be going over some of the hardware I picked out for this build and doing a bit of a photo shoot. I apologize in advance for the copious amounts of trivial history and nostalgic rambling in this post, so please feel free to skip the first few sections if you’d like to get straight to the gear. For those who appreciate the trip down memory lane, please read on! 🙂

Introduction

Twenty seventeen has been a real year of nostalgia for me when it comes to PC hardware. It all started earlier this summer when my Brother in-law was cleaning out his basement and gave me a couple of classic PC systems. One was a 1993 era DEC 486 and the other, a 1995 era NEC Pentium 90 system. These systems brought back so many good memories and were a real joy to use and restore. Although the first PC we had in my house growing up was a much older clone from the late 80s – possibly an 8088 or 286, I was really too young to appreciate it and preferred my NES/SNES at the time. I remember teaching myself how to use DOS with it but it wasn’t until 1994 that I could convince my parents that I needed a new computer – you know, for gaming educational purposes.  After doing some research and shopping around, I got a shiny new 486 DX2/66. It may not have been state of the art at the time, but with 8MB of RAM, a CDROM, 430MB Hard drive and a real Soundblaster 16, it ran like a dream. I spent hours upon hours on that machine and it wasn’t long before I had added a 14.4 modem and really got into the BBS scene as well. About a year later, my parents surprised me with a top of the line NEC Pentium 100 system with 16MB of RAM and the 486 replaced the old monochrome 8088 in the basement. Coincidentally it was almost identical to the Pentium 90 given to me by my brother in-law.

The Goal

Although these two retro rigs were a lot of fun to restore and use, I wanted to build a machine that was a bit more flexible, could be customized and used for not only demanding DOS/Windows 9x games but also most of the classics as well. The main issue with these two older machines is that they are both proprietary AT based systems with many BIOS quirks, compatibility issues and other limitations. Even the Pentium 90 with 24MB of RAM just wasn’t fast enough for newer DOS games like Quake. In short, I wanted a custom retro build that was fast, reliable and compatible – something with some nostalgic value and some pizzazz but also well suited for daily use.

Continue reading “Building a Retro Gaming Rig – Part 1”

Finding moref IDs for NSX API Calls

NSX was designed from the ground up to be very configurable via restful API calls. You can create, modify and remove objects and configuration using APIs. This also makes NSX very powerful with automation and other cloud management platforms and tools.

One of the most common questions I get from those getting started with API calls is on the unique identifiers – often referred to as ‘morefs’ or ‘managed object reference’ identifiers – used in these calls. The vCenter Server ‘Managed Object Browser’ is often used as the primary method to obtain these moref identifiers, but it can be a bit daunting to navigate for those unfamiliar with it. It also requires full administrative privileges to vCenter so may not always be an option for all users either. Another way to find them is to use PowerNSX – but again, you may not be comfortable with PowerCLI yet, or maybe it’s not deployed in the environment.

Not to worry – there are some surprisingly simple ways to obtain moref identifiers for all kinds of vCenter and NSX objects. In this post, I’ll show some of these tricks that I’ve picked up over time.

NSX Edges and DLRs

NSX edges and DLRs are the easiest objects to obtain morefs for. VMware very thoughtfully included a column in the vSphere Web Client NSX Edges view called ‘Id’.

nsx-moref1

Above, you can see that edge ‘esg-a1’ has a unique identifier called ‘edge-2’. This ID is indeed the moref.

Another way to get this information would be to use the NSX Manager Central CLI feature. From the manager CLI, the following command will get you a similar output:

nsxmanager.lab.local> show edge all
NOTE: CLI commands for Edge ServiceGateway(ESG) start with 'show edge'
 CLI commands for Distributed Logical Router(DLR) Control VM start with 'show edge'
 CLI commands for Distributed Logical Router(DLR) start with 'show logical-router'
 Edges with version >= 6.2 support Central CLI and are listed here
Legend:
Edge Size: Compact - C, Large - L, X-Large - X, Quad-Large - Q
Edge ID Name Size Version Status
edge-1 dlr-a1 C 6.2.5 GREEN
edge-2 esg-a1 C 6.2.5 GREEN
edge-3 esg-a2 C 6.2.5 GREEN

Virtual Machines and Appliances

In the previous section, we looked at the NSX Edge moref identifiers, but these appliances also exist as unique virtual machines from a vCenter Server perspective. Every virtual machine in the vCenter inventory – including controllers, DLRs, ESGs etc – can also be referred to as virtual machines with a vm-X moref identifier. For example, my DLR called dlr-a1 is edge-1 from an NSX perspective but actually exists as two DLR appliances in high availability mode. Having virtual machine moref identifiers allows us to uniquely identify each appliance.

To find out a virtual machine’s moref identifier, an easy method to use is to look at the URL in the vSphere Web Client. For example, I’ve gone to the ‘Hosts and Clusters’ view in my lab and selected the first of two dlr-a1 appliances:

nsx-moref2

At the end of the URL string in the address bar, we can see that the moref identifier is included. It’s somewhat stuffed in there in and not always noticeable, but knowing that the virtual machine moref always begins with ‘vm-‘ followed by a numerical value, we are able to pick it out. In the above example, the DLR appliance known as dlr-a1-0 has a virtual machine moref value of vm-675.

Looking at the second of the two appliances in the same way – the standby in the HA pair – I get a different virtual machine moref value of vm-677.

NSX Controllers

From an NSX perspective, NSX controllers also have a moref identifier prefixed by ‘controller-‘. Again, VMware thoughtfully included this information underneath the controller IP address in the ‘Controller Node’ column. This can be found in the Installation section of the NSX UI in the vSphere Web Client under the ‘Management’ tab.

nsx-moref3

One thing to be careful about is that the ‘Name’ column does not necessarily provide the moref identifier. VMware recently allowed the ability to name controllers with a friendly name in NSX 6.2. Just like ESGs and DLRs, NSX controllers also have virtual machine moref identifiers that are sometimes needed.

vSphere Clusters

As you have probably noticed, vSphere clusters are often the configuration delimiter for many things in NSX, including host preparation, firewall status etc. As such, many API calls will reference cluster objects. The cluster moref is not prefixed by ‘cluster-‘ as you might expect, but rather ‘domain-c‘. To determine the cluster moref, we can use the same trick that we used for finding the virtual machine IDs. As you’ll discover the URL address in the vSphere Web Client can tell us the moref for numerous objects.

nsx-moref4

In my lab above, cluster compute-a is domain-c121.

You can also obtain a list of NSX prepared clusters and their moref IDs using the NSX Central CLI. From an NSX manager CLI prompt:

nsxmanager.lab.local> show cluster all
No. Cluster Name Cluster Id Datacenter Name Firewall Status
1 compute-r domain-c641 lab Enabled
2 compute-vcp domain-c705 lab Not Ready
3 compute-a domain-c121 lab Enabled
4 management domain-c205 lab Not Ready

ESXi Hosts

There are a couple of different ways that you can obtain the moref for an ESXi host. As you’d expect, the moref is always prefixed by ‘host-‘. Just like for clusters and virtual machines, you can select the object in the vSphere Web Client and get the host moref from the address bar. Alternatively, there is another easy way to get the host moref from any NSX prepared host from the CLI.

When ESXi hosts are prepared, NSX pushes numerous RabbitMQ configuration variables that instruct it how to to communicate with NSX manager. One of these RabbitMQ parameters called /UserVars/RmqHostId actually includes the host moref.

[root@esx-a1:~] esxcfg-advcfg -g /UserVars/RmqHostId
Value of RmqHostId is host-223

As you can see, host esx-a1 has a moref of host-223.

Another option is to use the NSX Central CLI for viewing the contents of a cluster. Once you have the cluster ID – domain-c121 in my example – the following command can be used to view the hosts in the cluster and get the moref identifiers:

nsxmanager.lab.local> show cluster domain-c121
Datacenter: lab
Cluster: compute-a
No. Host Name Host Id Installation Status
1 esx-a1.lab.local host-223 Enabled
2 esx-a2.lab.local host-225 Enabled

Transport Zones

The moref IDs for transport zones aren’t required often, but if you ever need to find one, the NSX Central CLI can get you this information.

The below output will list out all logical switches, but will also correlate the transport zone name with a moref prefixed by ‘vdnscope-‘. Logical switch UUID values can also be identified using this command:

nsxmanager.lab.local> show logical-switch list all
NAME UUID VNI Trans Zone Name Trans Zone ID
Transit VXLAN1 210b616b-691a-470f-ad1e-1cc24b485d0d 5000 Primary TZ vdnscope-1
Blue Network 7a068fb5-9b17-4779-9b3d-0d81a439189f 5001 Primary TZ vdnscope-1
Green Network 22f01e70-18ae-4d3e-a683-be08d244e919 5002 Primary TZ vdnscope-1
Yellow Network 8510eedb-e6a8-41a0-bae0-79f3af8630be 5003 Primary TZ vdnscope-1
Red Network 4996f8f1-68c9-43de-9207-fbe038543133 5004 Primary TZ vdnscope-1
Purple Network bdd16f8e-805f-45df-a5c7-00a0ce319cc9 5005 Primary TZ vdnscope-1
Test Network A 72597655-ad88-4e32-9baf-c724d49c9a7c 5006 Primary TZ vdnscope-1

As you can see above, I have only one transport zone called ‘Primary TZ’ with a moref of vdnscope-1.

Datastores

Some API calls expect input including the moref for a specific datastore. An example would be trying to deploy a new ESG or controller – NSX needs to know where to store the VM. Once again, going to the datastores view in the vSphere Web Client allows us to select the specific datastore and the ‘datastore-‘ prefixed moref is visible in the address bar.

nsx-moref5

Above you can see my datastore called shared-ssd0 equates to moref datastore-621.

IP Pools

Some API calls involving the deployment of objects require the moref identifier of an IP address pool. Unfortunately, this moref can’t be found in the GUI, but with a couple of quick API calls can be obtained pretty easily.

First, we’ll use an API call to query all IP pools on the NSX manager. This will provide an output that will include the moref identifier of the pool in question:

GET https://NSX-Manager-IP-Address/api/2.0/services/ipam/pools/scope/scopeID

As you can see above, the ‘scope ID’ is also required to run this GET call. In every instance I’ve seen, using globalroot-0 as the scopeID works.

ippoolAPI-4

The various IP pools will be separated by <ipamAddressPool> XML tags. You’ll want to identify the correct pool based on the IP range listed or by the text in the <name> field. In my example, I want to find the moref for the Controller Pool, which is found in the section below:

<ipamAddressPool>
<objectId>ipaddresspool-1</objectId>
<objectTypeName>IpAddressPool</objectTypeName>
<vsmUuid>4226CDEE-1DDA-9FF9-9E2A-8FDD64FACD35</vsmUuid>
<nodeId>fa4ecdff-db23-4799-af56-ae26362be8c7</nodeId>
<revision>1</revision>
<type>
<typeName>IpAddressPool</typeName>
</type>
<name>Controller Pool</name>
<scope>
<id>globalroot-0</id>
<objectTypeName>GlobalRoot</objectTypeName>
<name>Global</name>
</scope>
...
<snip>

As you can see above, the IP pool is identified by the moref identifier ipaddresspool-1.

Conclusion

And there you have it. Much easier than navigating through the vCenter Managed Object Browser and doesn’t require full administrative privileges in vCenter. For the address bar trick, you’ll need to have a minimum of read-only privileges to the object you are trying to select, as well as read-only access to the NSX UI in order to view the Edge list and installation section. The NSX Central CLI commands do require that you log in to the NSX manager via CLI using an administrator account, but most individuals managing NSX would have this access.

If there are any other morefs or identifiers you are having difficulty locating, please leave a comment and I’d be happy to post some methods.