Troubleshooting – Page 2

NSX Troubleshooting Scenario 13

Welcome to the thirteenth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“After recovering from a storage outage, we’re unable to re-deploy any of our missing DLRs and ESGs. Help!”

With this type of a problem description, the first order of business is to find out EXACTLY what happened. After a lengthy discussion with the fictional customer, we were able to piece together the following sequence of events:

The SAN suffered a catastrophic failure.
All of the LUNs were continuously replicated to another SAN over the years, so these replicated LUNs were presented to the hosts in the compute-a cluster.
After a rescan, the VMFS volumes were re-signatured and the datastores and all files were again accessible.
All of the VMs on those datastores were manually added back to the vCenter Inventory except the DLRs and ESGs.
All DLRs and ESGs were deleted from the datastore so that they could be freshly re-deployed.

The customer did realize that in point number 5 above that any ESGs re-added to the inventory would no longer be valid because of mismatched UUIDs. Deleting these from disk and re-deploying was a good idea.

NSX is throwing many high and critical events because of the missing ESG and DLR appliances, as expected.

tshoot13a-1

There are six appliances in total, including three DLRs and three ESGs.

Continue reading “NSX Troubleshooting Scenario 13”

NSX Troubleshooting Scenario 12 – Solution

Welcome to the twelfth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall in the first half, our fictional customer was getting some unexpected behavior from a couple of firewall rules. Despite the rules being properly constructed, one VM called linux-a3 continued to be accessible via SSH.

tshoot12a-2 — The two rules in question – 1007 and 1008 – look to be constructed correctly.

We confirmed that the IP addresses for the machines in the security group where translated correctly by NSX and that the ruleset didn’t appear to be the problem. Let’s recap what we know:

VM linux-a2 seems to be working correctly and SSH traffic is blocked.
VM linux-a3 doesn’t seem to respect rule 1007 for some reason and remains accessible via SSH from everywhere.
Host esx-a3 where linux-a3 resides doesn’t appear to log any activity for rule 1007 or 1008 even though those rules are configured to log.
The two VMs are on different ESXi hosts (esx-a1 and esx-a3).
VMs linux-a2 and linux-a3 are in different dvPortgroups.

Given these statements, there are several things I’d want to check:

How can the two VMs have proper IP connectivity in VXLAN and VLAN porgroups as observed?
Is the DFW working at all on host esx-a3?
Did the last rule publication make it to host esx-a3 and does it match what we see in the UI?
Is the DFW (slot-2) dvfilter applied to linux-a3 correctly?

Continue reading “NSX Troubleshooting Scenario 12 – Solution”

NSX Troubleshooting Scenario 12

Welcome to the twelfth installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

For this scenario today, I’ve created some supplementary video content to go along this post:

The Scenario

As always, we’ll start with a brief problem statement:

“I am just getting started with the NSX distributed firewall and see that the rules are not behaving as they should be. I have two VMs, linux-a2 and linux-a3 that should allow SSH from only one specific jump box. The linux-a3 VM can be accessed via SSH from anywhere! Why is this happening?”

To get started with this scenario, we’ll most certainly need to look at how the DFW rules are constructed to get the desired behavior. The immense flexibility of the distributed firewall allows for dozens of different ways to achieve what is described.

Here are the two VMs in question:

tshoot12a-4 — The VM linux-a2 is currently on host esx-a1 with IP address 172.16.15.10. It’s sitting on a logical switch.

And linux-a3:

tshoot12a-5 — The VM linux-a3 is currently on host esx-a3 with IP address 172.16.15.11. It’s sitting on a VLAN backed dvPortgroup.

There are a couple of interesting observations above. The first is that both VMs have a security tag applied called ‘Linux-A VMs’. The other is a bit more of an oddity – one VM is in a distributed switch VLAN backed portgroup called dvpg-a-vlan15, and the other is in a VXLAN backed logical switch. Despite this, both VMs are in the same 172.16.15.0/24 subnet.

Continue reading “NSX Troubleshooting Scenario 12”

Jumbo Frames and VXLAN Performance

Using an 8900 MTU for better VXLAN throughput and lower packet rates.

VXLAN overlay technology is part of what makes software defined networking possible. By encapsulating full frames into UDP datagrams, L2 networks can be stretched across all manners of routed topologies. This breaks down the barriers of physical networking and builds the foundation for the software defined datacenter.

VXLAN or Virtual Extensible LAN is an IETF standard documented in RFC 7348. L2 over routed topologies is made possible by encapsulating entire L2 frames into UDP datagrams. About 50 bytes of outer header data is added to every L2 frame because of this, meaning that for every frame sent on a VXLAN network, both an encapsulation and de-encapsulation task must be performed. This is usually performed by ESXi hosts in software but can sometimes be offloaded to physical network adapters as well.

In a perfect world, this would be done without any performance impact whatsoever. The reality, however, is that software defined wizardry often does have a small performance penalty associated with it. This is unavoidable, but that doesn’t mean there isn’t anything that can be done to help to minimize this cost.

If you’ve been doing some performance testing, you’ve probably noticed that VMware doesn’t post statements like “You can expect X number of Gbps on a VXLAN network”. This is because there are simply too many variables to consider. Everything from NIC type, switches, drivers, firmware, offloading features, CPU count and frequency can play a role here. All these factors must be considered. From my personal experience, I can say that there is a range – albeit a somewhat wide one – of what I’d consider normal. On a modern 10Gbps system, you can generally expect more than 4Gbps but less than 7Gbps with a 1500 MTU. If your NIC supports VXLAN offloading, this can sometimes be higher than 8Gbps. I don’t think I’ve ever seen a system achieve line-rate throughput on a VXLAN backed network with a 1500 MTU regardless of the offloading features employed.

What if we can reduce the amount of encapsulation and de-encapsulation that is so taxing on our hypervisors? Today I’m going to take an in-depth look at just this – using an 8900 MTU to reduce packet rates and increase throughput. The results may surprise you!

Continue reading “Jumbo Frames and VXLAN Performance”

NSX Troubleshooting Scenario 11 – Solution

Welcome to the eleventh installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As you’ll recall in the first half, our fictional customer was seeing some HA heartbeat channel alarms in the new HTML5 NSX dashboard.

tshoot11a-1

After doing some digging, we were able to determine that the ESG had an interface configured for HA on VLAN 16 and that from the CLI, the edge really was complaining about being unable to reach its peer.

You probably noticed in the first half, that the HA interface doesn’t have an IP address configured. This may look odd, but it’s fine. Even if you did specify a custom /30 IP address for HA purposes, it would not show up as an interface IP address here. Rather, you’d need to look for one specified in the HA configuration settings here:

Continue reading “NSX Troubleshooting Scenario 11 – Solution”

NSX Troubleshooting Scenario 11

Welcome to the eleventh installment of my NSX troubleshooting series. What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

The Scenario

As always, we’ll start with a brief problem statement:

“One of my ESXi hosts has a hardware problem. Ever since putting it into maintenance mode, I’m getting edge high availability alarms in the NSX dashboard. I think this may be a false alarm, because the two appliances are in the correct active and standby roles and not in split-brain. Why is this happening?”

A good question. This customer is using NSX 6.4.0, so the new HTML5 dashboard is what they are referring to here. Let’s see the dashboard alarms first hand.

tshoot11a-1

This is alarm code 130200, which indicates a failed HA heartbeat channel. This simply means that the two ESGs can’t talk to each other on the HA interface that was specified. Let’s have a look at edge-3, which is the ESG in question.

Continue reading “NSX Troubleshooting Scenario 11”

NSX Troubleshooting Scenario 10 – Solution

Welcome to the tenth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As we saw in the first half, our fictional administrator was attempting to configure an ESG load balancer for both TCP and UDP port 514 traffic. Below is the high-level topology:

tshoot10a-1

One of the first things to keep in mind when troubleshooting the NSX load balancer is the mode in which it’s operating. In this case, we know the customer is using a one-armed load balancer. The tell-tale sign is that the ESG sits in the same VLAN as the pool members with a single interface. Also, the pool members do not have the ESG configured as their default gateway.

We also know based on the screenshots in the first half that the load balancer is not operating in ‘Transparent’ mode – so traffic to the pool members should appear as though it’s coming from the load balancer virtual IP, not from the actual syslog clients. The packet capture the customer did proves that this is actually not the case.

That said, how exactly does an NSX one-armed load balancer work?

As traffic comes in on one of the interfaces and ports configured as a ‘virtual server’, the load balancer will simply forward the traffic to one of the pool members based on the load balancing algorithm configured. In our case, it’s a simple ‘round robin’ rotation of the pool members per session/socket. But forwarding would imply that the syslog servers would see traffic coming from the originating source IP of the syslog client. This would cause a fundamental problem with asymmetry when the pool member needs to reply. When it does, the traffic would bypass the ESG and be sent directly back to the client. This would be fine with UDP, which is connection-less, but what about TCP?

Continue reading “NSX Troubleshooting Scenario 10 – Solution”

NSX Troubleshooting Scenario 10

Welcome to the tenth installment of my NSX troubleshooting series – a milestone number for the one-year anniversary of vswitchzero.com. I wasn’t sure how many of these I’d write, but I’ve gotten lots of positive feedback so if I can keep thinking of scenarios, I’ll keep going!

What I hope to do in these posts is share some of the common issues I run across from day to day. Each scenario will be a two-part post. The first will be an outline of the symptoms and problem statement along with bits of information from the environment. The second will be the solution, including the troubleshooting and investigation I did to get there.

I’ll try to include some questions as well for educational purposes in each post.

The Scenario

As always, we’ll start with a brief problem statement:

“I’m using an ESG load balancer to send syslog traffic to a pool of two Linux servers. I can only seem to get UDP syslog traffic to arrive at the pool members. TCP based syslog traffic doesn’t work. I’m using a one-armed load balancer. If I do a packet capture, all I see is the UDP traffic but it’s not coming from the load balancer”

Using the NSX load balancer services for syslog purposes is not at all uncommon. We see this frequently with products like Splunk as well as others. Since syslog traffic can be very heavy, this is a good use case.

When it comes to troubleshooting NSX load balancer issues, triple checking the configuration is key. In speaking with the customer, this is his desired outcome:

One-armed load balancer in VLAN 15.
No routing done by the edge. Default gateway configuration only and a single interface for simplicity.
Transparency is not required – the source IP can be the load balancer as the required source information is in the syslog data transmitted.
A mix of both TCP and UDP port 514 traffic is to be load balanced.

Here is a basic, high-level topology provided by the customer:

tshoot10a-1

The one armed load balancer called esg-lb1 is sitting in VLAN 15. It’s default gateway is the SVI interface of the physical switch (172.16.15.1). There is only one hop between the ESXi hosts – the syslog clients – and the ESG in VLAN 15. Because this is a one-armed topology, the syslog-a1 and syslog-a2 servers are using the same switch SVI as their default gateway.

Continue reading “NSX Troubleshooting Scenario 10”

Blank Error While Adding NSX DLR or ESG Interfaces

I recently deployed NSX 6.3.2 in my home lab to do some testing. After deploying a DLR, I went back in to add some additional interfaces and was greeted by a ‘blank’ or null error message. Having run into this problem before, I thought it may be a good idea to give some additional context to VMware KB 2151309.

dlrblankerror-1

As you can see above, there is no text associated with the error. There are no problems with the IP or mask I used, and it doesn’t seem clear why this would be failing.

You would expect to find more detail in the NSX Manager vsm.log file, but interestingly there is nothing there at all for this exception. That’s because this isn’t an NSX fault, but rather something in the vSphere Web Client.

Continue reading “Blank Error While Adding NSX DLR or ESG Interfaces”

NSX Troubleshooting Scenario 9 – Solution

Welcome to the ninth installment of a new series of NSX troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of scenario nine. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As we saw in the first half, our fictional administrator was unable to install the NSX VIBs on the cluster called compute-a:

We also saw that there were two different NSX licences added to vCenter. One called ‘Endpoint’ and the other ‘Enterprise’.

You can see that the ‘Usage’ for both licenses is currently “0 CPUs”, but that’s because it hasn’t been installed on any ESXi hosts yet to consume any. What’s most telling, however, is the small little grey exclamation mark on the license icon. If I hover over this, I get a message stating:

“The license is not assigned. To comply with the EULA, assign the license to at least one asset.”

Continue reading “NSX Troubleshooting Scenario 9 – Solution”