NSX-T Troubleshooting Scenario 2 – Solution

Welcome to the second installment of a new series of NSX-T troubleshooting scenarios. Thanks to everyone who took the time to comment on the first half of the scenario. Today I’ll be performing some troubleshooting and will show how I came to the solution.

Please see the first half for more detail on the problem symptoms and some scoping.

Getting Started

As we saw in the first half, our fictional customer was having northbound communication problems because the physical core router was not getting any of the NSX advertised routes:

vyos@router-core:~$ sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
B - BGP, > - selected route, * - FIB route

S>* 0.0.0.0/0 [1/0] via 172.16.1.12, eth0.1
C>* 10.99.99.0/27 is directly connected, eth0.2005
C>* 127.0.0.0/8 is directly connected, lo
C>* 172.16.1.0/24 is directly connected, eth0.1
C>* 172.16.11.0/24 is directly connected, eth0.11
C>* 172.16.76.0/24 is directly connected, eth0.76
C>* 172.16.98.0/24 is directly connected, eth0.98

Based on what we observed in the first half, we can make a few assertions:

  1. The T1 routers are advertising their routes just fine to the T0 (a total of 8 routes).
  2. The T0 router is peering with the core router successfully because we received BGP routes from the core router.
  3. The T0 router is configured to redistribute ‘NSX Connected’ and ‘Static’ routes.

Let’s just run through a couple of quick tests to confirm point one above and make sure that the T0 can communicate with the core router. From VRF 2 (the T0 SR), we’ll check the interface IP first:

edge-e1(tier0_sr)> get interfaces
Logical Router
UUID VRF LR-ID Name Type
3d5e5d06-8506-476e-b94e-42bee00ff1ce 2 2 SR-t0-router SERVICE_ROUTER_TIER0
interfaces
<snip>
interface : 9579f837-4d16-452d-b6db-e2f25865d3a4
ifuid : 285
name : VLAN2005
mode : lif
IP/Mask : 10.99.99.10/27
MAC : 00:50:56:95:4b:ed
LS port : ece998a4-6c0d-4c9e-82f8-f295da1fe6fe
urpf-mode : NONE
admin : up
op_state : up
MTU : 9000
<snip>

The netmask is a /27, which is correct. Let’s try to ping 10.99.99.9, which is the core router:

edge-e1(tier0_sr)> ping 10.99.99.9
PING 10.99.99.9 (10.99.99.9): 56 data bytes
64 bytes from 10.99.99.9: icmp_seq=0 ttl=64 time=3.288 ms

No problems there. We can also see that the routers are peering without issue:

edge-e1(tier0_sr)> get bgp neighbor

BGP neighbor: 10.99.99.9 Remote AS: 64512
BGP state: Established, up
BFD state: Not configured
Hold Time: 180s Keepalive Interval: 60s
Capabilities:
4Byte ASN: advertised and received
Route Refresh: advertised and received
Graceful Restart: None
Restart Remaining Time: 0
Address Family: IPv4 Unicast:advertised and received
Messages: 154 received, 174 sent
Minimum time between advertisements: 30s (default)
1 Connections established, 1 dropped
Local host: 10.99.99.10, Local port: 36012
Remote host: 10.99.99.9, Remote port: 179
Route Refresh: 0 received, 0 sent
For Address family: IPv4 Unicast:advertised and received
Prefixes: 6 received 0 sent 0 advertised

As expected, the peers are ‘Established’ – no issues there either. Notice the prefix count – 6 received and 0 sent. This confirms that the T0 really hasn’t sent any routes to 10.99.99.9 – could this be a redistribution issue?
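If you need to run the same check on more than one edge, the prefix counters are easy to pull out with a script. Here is a minimal Python sketch, assuming the neighbor output has been saved as text. The helper names prefix_counts and is_one_way are purely illustrative and not part of any NSX tooling.

```python
import re

# A few lines copied from the 'get bgp neighbor' output above.
NEIGHBOR_OUTPUT = """\
BGP neighbor: 10.99.99.9 Remote AS: 64512
BGP state: Established, up
Prefixes: 6 received 0 sent 0 advertised
"""

def prefix_counts(text):
    """Return (received, sent, advertised) from the 'Prefixes:' line, or None."""
    match = re.search(
        r"Prefixes:\s+(\d+)\s+received\s+(\d+)\s+sent\s+(\d+)\s+advertised", text
    )
    if match is None:
        return None
    return tuple(int(n) for n in match.groups())

def is_one_way(text):
    """True when the session is Established but zero prefixes were sent."""
    counts = prefix_counts(text)
    established = "BGP state: Established" in text
    return established and counts is not None and counts[1] == 0

print(prefix_counts(NEIGHBOR_OUTPUT))
print(is_one_way(NEIGHBOR_OUTPUT))
```

A healthy session that sends nothing is exactly the pattern seen here, so a check like this points at route advertisement rather than at peering.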

Let’s have another quick look at the routing table on the T0:

edge-e1(tier0_sr)> get route

Flags: c - connected, s - static, b - BGP, ns - nsx_static
nc - nsx_connected, rl - router_link, t0n: Tier0-NAT, t1n: Tier1-NAT
t1l: Tier1-LB VIP, t1s: Tier1-LB SNAT

Total number of routes: 17

b 0.0.0.0/0 [20/0] via 10.99.99.9
c 10.99.99.0/27 [0/0] via 10.99.99.10
rl 100.64.48.0/31 [0/0] via 169.254.0.1
rl 100.64.48.2/31 [0/0] via 169.254.0.1
c 169.254.0.0/28 [0/0] via 169.254.0.2
b 172.16.1.0/24 [20/0] via 10.99.99.9
b 172.16.11.0/24 [20/0] via 10.99.99.9
b 172.16.76.0/24 [20/0] via 10.99.99.9
b 172.16.98.0/24 [20/0] via 10.99.99.9
ns 172.18.9.0/24 [3/0] via 169.254.0.1
ns 172.18.10.0/24 [3/0] via 169.254.0.1
ns 172.18.11.0/24 [3/0] via 169.254.0.1
ns 172.18.12.0/24 [3/0] via 169.254.0.1
ns 172.18.17.0/24 [3/0] via 169.254.0.1
ns 172.18.18.0/24 [3/0] via 169.254.0.1
ns 172.18.19.0/24 [3/0] via 169.254.0.1
ns 172.18.20.0/24 [3/0] via 169.254.0.1

We saw previously that the T0 was configured to redistribute ‘NSX Connected’ routes just like the T1s were, but are these routes NSX connected from the perspective of the T0? From the T1 they are, but pay close attention to the flags before each route entry in the table. The flag ‘ns’ equates to ‘NSX Static’, not ‘NSX Connected’.

The terminology can be a bit confusing, but any routes learned by a T1 and redistributed to a T0 are considered ‘NSX Static’ routes. Remember – there is no routing protocol used between T1 and T0 routers. These are not ‘connected’ to the T0 – i.e. they need to be forwarded to the T1 first – and they are not true ‘Static’ routes either.
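To make the flag distinction concrete, here is a small Python sketch that tallies the T0 routing table by flag. It assumes the route lines from the output above have been pasted in as text, and count_by_flag is just an illustrative helper name.

```python
from collections import Counter

# Route lines copied from the T0 SR 'get route' output above.
ROUTE_TABLE = """\
b 0.0.0.0/0 [20/0] via 10.99.99.9
c 10.99.99.0/27 [0/0] via 10.99.99.10
rl 100.64.48.0/31 [0/0] via 169.254.0.1
rl 100.64.48.2/31 [0/0] via 169.254.0.1
c 169.254.0.0/28 [0/0] via 169.254.0.2
b 172.16.1.0/24 [20/0] via 10.99.99.9
b 172.16.11.0/24 [20/0] via 10.99.99.9
b 172.16.76.0/24 [20/0] via 10.99.99.9
b 172.16.98.0/24 [20/0] via 10.99.99.9
ns 172.18.9.0/24 [3/0] via 169.254.0.1
ns 172.18.10.0/24 [3/0] via 169.254.0.1
ns 172.18.11.0/24 [3/0] via 169.254.0.1
ns 172.18.12.0/24 [3/0] via 169.254.0.1
ns 172.18.17.0/24 [3/0] via 169.254.0.1
ns 172.18.18.0/24 [3/0] via 169.254.0.1
ns 172.18.19.0/24 [3/0] via 169.254.0.1
ns 172.18.20.0/24 [3/0] via 169.254.0.1
"""

def count_by_flag(table):
    """Count route entries per flag (first column of each route line)."""
    counts = Counter()
    for line in table.splitlines():
        fields = line.split()
        # A route line has a flag followed by a prefix in CIDR notation.
        if len(fields) >= 2 and "/" in fields[1]:
            counts[fields[0]] += 1
    return counts

counts = count_by_flag(ROUTE_TABLE)
print(counts["ns"], counts["nc"])  # 8 0
```

All eight T1 subnets land under ‘ns’ and none under ‘nc’, which is why redistributing only ‘NSX Connected’ and ‘Static’ on the T0 matches nothing.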

You may also be wondering why the next-hop for these ‘NSX Static’ routes is 169.254.0.1. These 169.254.x.x link-local addresses are associated with the ‘RouterLink’ between the T0 Service Router and the T0 Distributed Router instance. The SR has 169.254.0.2, and the DR is 169.254.0.1. Traffic will first be handed off to the T0 DR for routing to the T1s.

Logical Router
UUID VRF LR-ID Name Type
3d5e5d06-8506-476e-b94e-42bee00ff1ce 2 2 SR-t0-router SERVICE_ROUTER_TIER0
Interfaces
Interface : 0fcf82bf-1a52-4957-bc63-ce1d63c305ef
Ifuid : 302
Name : bp-sr0-port
Mode : lif
IP/Mask : 169.254.0.2/28;fe80::50:56ff:fe56:5300/64
MAC : 02:50:56:56:53:00
VNI : 71694
LS port : d2e97d4c-15ee-45c7-8af3-8807c808f93c
Urpf-mode : NONE
Admin : up
Op_state : down
MTU : 9000
<snip>
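As a quick sanity check on the addressing, Python's standard ipaddress module can confirm that a next-hop sits in the same subnet as the local interface. This is only a sketch using the addresses from the outputs above, and same_subnet is an illustrative helper name.

```python
import ipaddress

def same_subnet(interface_cidr, peer_ip):
    """True when peer_ip falls inside the network of interface_cidr."""
    interface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(peer_ip) in interface.network

# SR backplane port (169.254.0.2/28) and the DR next-hop (169.254.0.1).
print(same_subnet("169.254.0.2/28", "169.254.0.1"))
# SR uplink (10.99.99.10/27) and the core router (10.99.99.9).
print(same_subnet("10.99.99.10/27", "10.99.99.9"))
# The 169.254.0.0/16 range is link-local, so it is never routed upstream.
print(ipaddress.ip_address("169.254.0.1").is_link_local)
```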

Thankfully, this is a quick fix – all we need to do is ensure ‘NSX Static’ routes are also redistributed from the T0 router:

[Screenshot: nsxt-tshoot2b]
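The effect of this change can be pictured with a toy model. The Python sketch below is a deliberate simplification and not how NSX-T implements redistribution internally: it treats redistribution as a filter on the route flag, and the redistributed function is a hypothetical name used only for illustration.

```python
# Simplified model only: real redistribution involves more than a flag match.
T0_ROUTES = [
    ("b", "0.0.0.0/0"),
    ("c", "10.99.99.0/27"),
    ("ns", "172.18.9.0/24"),
    ("ns", "172.18.10.0/24"),
    ("ns", "172.18.11.0/24"),
    ("ns", "172.18.12.0/24"),
    ("ns", "172.18.17.0/24"),
    ("ns", "172.18.18.0/24"),
    ("ns", "172.18.19.0/24"),
    ("ns", "172.18.20.0/24"),
]

def redistributed(routes, enabled_flags):
    """Return the prefixes whose flag is in the enabled redistribution set."""
    return [prefix for flag, prefix in routes if flag in enabled_flags]

# Before the fix: only NSX Connected (nc) and Static (s) were selected.
before = redistributed(T0_ROUTES, {"nc", "s"})
# After the fix: NSX Static (ns) is selected as well.
after = redistributed(T0_ROUTES, {"nc", "s", "ns"})

print(len(before), len(after))  # 0 8
```

The zero in the ‘before’ case lines up with the ‘0 sent’ prefix counter seen on the BGP neighbor earlier.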

And now we can see all of the subnets attached to the T1 routers advertised to the core router:

vyos@router-core:~$ sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
B - BGP, > - selected route, * - FIB route

S>* 0.0.0.0/0 [1/0] via 172.16.1.12, eth0.1
C>* 10.99.99.0/27 is directly connected, eth0.2005
C>* 127.0.0.0/8 is directly connected, lo
C>* 172.16.1.0/24 is directly connected, eth0.1
C>* 172.16.11.0/24 is directly connected, eth0.11
C>* 172.16.76.0/24 is directly connected, eth0.76
C>* 172.16.98.0/24 is directly connected, eth0.98
B>* 172.18.9.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.10.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.11.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.12.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.17.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.18.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.19.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.20.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
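To confirm that nothing was missed, the ‘ns’ prefixes on the T0 can be compared against the BGP routes on the core router. The following Python sketch assumes both tables have been saved as text; the helper names are illustrative only.

```python
# 'ns' lines copied from the T0 SR routing table shown earlier.
T0_TABLE = """\
ns 172.18.9.0/24 [3/0] via 169.254.0.1
ns 172.18.10.0/24 [3/0] via 169.254.0.1
ns 172.18.11.0/24 [3/0] via 169.254.0.1
ns 172.18.12.0/24 [3/0] via 169.254.0.1
ns 172.18.17.0/24 [3/0] via 169.254.0.1
ns 172.18.18.0/24 [3/0] via 169.254.0.1
ns 172.18.19.0/24 [3/0] via 169.254.0.1
ns 172.18.20.0/24 [3/0] via 169.254.0.1
"""

# Lines copied from the core router output above.
CORE_TABLE = """\
S>* 0.0.0.0/0 [1/0] via 172.16.1.12, eth0.1
C>* 10.99.99.0/27 is directly connected, eth0.2005
B>* 172.18.9.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.10.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.11.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.12.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.17.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.18.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.19.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
B>* 172.18.20.0/24 [20/0] via 10.99.99.10, eth0.2005, 00:00:04
"""

def prefixes_with(table, wanted):
    """Collect the prefix (second column) of lines whose first column passes wanted()."""
    rows = (line.split() for line in table.splitlines())
    return {row[1] for row in rows if len(row) > 1 and wanted(row[0])}

t0_static = prefixes_with(T0_TABLE, lambda flag: flag == "ns")
core_bgp = prefixes_with(CORE_TABLE, lambda code: code.startswith("B"))

# Anything left here is on the T0 but never reached the core router.
missing = t0_static - core_bgp
print(sorted(missing))  # []
```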

Reader Feedback

Here’s some reader feedback on scenario 2. First, a great suggestion by James:

Also, Chris who commented in part 1 was right on:

“I would start looking at what the T0 IS announcing in bgp:
get bgp neighbor x.x.x.x advertised-routes
and conclude the NS static routes are not being announced. What makes sense, because you selected “static” and that are only the static routes seen from the T0. You should select NSX static. In the route table they show up as “ns” -> NSX connected.”

Thanks to everyone who commented!

Conclusion

There have been many changes in terminology used in NSX-T – especially for those versed in NSX-V – and understanding these changes can definitely help in deployment and troubleshooting.

I hope this scenario was helpful. If you have any questions or suggestions for future scenarios, please feel free to leave a comment below or reach out to me on Twitter (@vswitchzero).
