Tag Archives: ESG

ESG/DLR tmpfs partition fills in NSX 6.3.6 and 6.4.1

If you are running NSX 6.3.6 or 6.4.1, you should take a close look at VMware KB 57003. A newly discovered issue can result in the tmpfs partition of DLRs and ESGs from filling up, rendering the appliances unmanageable.

On a positive note, there should be no datapath impact because of a full tmpfs partition. You just won’t be able to push any configuration changes to the ESG or DLR in this state.
This occurs because of a file related to HA in /run that will slowly grow until it fills the partition. The file in question is ‘ha.cid.Out’ and contains HA diagnostic information. You can find it in the /run/vmware/vshield/cmdOut directory.

If you have a very stable environment, it’s quite possible that you’ll never run into this problem. The ha.cid.Out file is created and updated only after an HA event occurs – like a failover or split-brain recovery for example. Once the file is created, however, it receives regular updates and will inevitably grow.

Based on the rate in which the file grows, a compact size ESG or DLR has about a month after an HA event before this becomes a problem. Larger sized ESGs have more memory, and hence larger tmpfs partitions. Below is an estimate based on tmpfs partition size on each size of appliance:

All DLRs (256MB tmpfs): 4 weeks
Compact ESG (256MB tmpfs): 4 weeks
Large ESG (497MB tmpfs): 8 weeks
Quad Large ESG (1024MB tmpfs): 4 months
X-Large ESG (3.9GB tmpfs): >1 year

Unfortunately, it doesn’t appear that the ha.cid.Out file can be deleted or purged while the ESG/DLR is in operation. The file is locked for editing and the only safe way to recover is to reboot the appliance. Again, all of the features including routing and packet forwarding will continue to work just fine with a full tmpfs partition. You just won’t be able to make any changes.

Disabling ESG HA will prevent this from happening, but I’d argue that being protected by HA is more important than the potential for an ESG to become unmanageable.

You can monitor your ESG’s tmpfs partition using the show system storage CLI command:

esg-lb1.vswitchzero.net-0> show system storage
Filesystem      Size   Used   Avail   Use%   Mounted on
/dev/root       444M 366M 55M 88% /
tmpfs           497M 80K 497M 1% /run
/dev/sda2        43M 2.2M 38M 6% /var/db
/dev/sda3        27M 413K 25M 2% /var/dumpfiles
/dev/sda4        32M 1.1M 29M 4% /var/log

If you see it slowly creeping up in size at a regular interval, it would be a good idea to start planning for a maintenance window to reboot the appliance.

I can’t comment on release dates, but it’s very likely that this will be fixed next release of 6.4.x, which should out very soon. The 6.3.x fix for this may be further out, so a jump to 6.4.2 may be your best bet if this proves to a serious problem for you.

I hope this is helpful.

NSX Troubleshooting Scenario 11

Welcome to the eleventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“One of my ESXi hosts has a hardware problem. Ever since putting it into maintenance mode, I’m getting edge high availability alarms in the NSX dashboard. I think this may be a false alarm, because the two appliances are in the correct active and standby roles and not in split-brain. Why is this happening?”

A good question. This customer is using NSX 6.4.0, so the new HTML5 dashboard is what they are referring to here. Let’s see the dashboard alarms first hand.

tshoot11a-1

This is alarm code 130200, which indicates a failed HA heartbeat channel. This simply means that the two ESGs can’t talk to each other on the HA interface that was specified. Let’s have a look at edge-3, which is the ESG in question.

Continue reading