ECMP Path Determination in NSX

ECMP, or ‘equal-cost multi-pathing’, is a great routing feature that was introduced several years ago in NSX 6.1. By utilizing multiple egress paths, ECMP allows for better use of network bandwidth for northbound destinations in an NSX environment. As many as eight separate ESG appliances can be used with ECMP – ideally on dedicated ESXi hosts in an ‘edge cluster’.

Lab Setup

In my lab, I’ve configured a very simple topology with two ESXi hosts, each with an ESG appliance used for north/south traffic. The two ESGs are configured for ECMP operation:

ecmp-1

The diagram above is very high-level and doesn’t depict the underlying physical infrastructure or ESXi hosts, but should be enough for our purposes. BGP is used exclusively as the dynamic routing protocol in this environment.

Looking at the diagram, we can see that any VMs on the 172.17.1.0/24 network should have a DLR LIF as their default gateway. Because ECMP is being used, the DLR instance should in theory have an equal cost route to all northbound destinations via 172.17.0.10 and 172.17.0.11.

Let’s have a look at the DLR routing table:

dlr-a1.lab.local-0> sh ip route

Codes: O - OSPF derived, i - IS-IS derived, B - BGP derived,
C - connected, S - static, L1 - IS-IS level-1, L2 - IS-IS level-2,
IA - OSPF inter area, E1 - OSPF external type 1, E2 - OSPF external type 2,
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2

Total number of routes: 17

B 10.40.0.0/25 [200/0] via 172.17.0.10
B 10.40.0.0/25 [200/0] via 172.17.0.11
C 169.254.1.0/30 [0/0] via 169.254.1.1
B 172.16.10.0/24 [200/0] via 172.17.0.10
B 172.16.10.0/24 [200/0] via 172.17.0.11
B 172.16.11.0/24 [200/0] via 172.17.0.10
B 172.16.11.0/24 [200/0] via 172.17.0.11
B 172.16.12.0/24 [200/0] via 172.17.0.10
B 172.16.12.0/24 [200/0] via 172.17.0.11
B 172.16.13.0/24 [200/0] via 172.17.0.10
B 172.16.13.0/24 [200/0] via 172.17.0.11
B 172.16.14.0/24 [200/0] via 172.17.0.10
B 172.16.14.0/24 [200/0] via 172.17.0.11
B 172.16.76.0/24 [200/0] via 172.17.0.10
B 172.16.76.0/24 [200/0] via 172.17.0.11
C 172.17.0.0/26 [0/0] via 172.17.0.2
C 172.17.1.0/24 [0/0] via 172.17.1.1
C 172.17.2.0/24 [0/0] via 172.17.2.1
C 172.17.3.0/24 [0/0] via 172.17.3.1
C 172.17.4.0/24 [0/0] via 172.17.4.1
C 172.17.5.0/24 [0/0] via 172.17.5.1
B 172.19.7.0/24 [200/0] via 172.17.0.10
B 172.19.7.0/24 [200/0] via 172.17.0.11
B 172.19.8.0/24 [200/0] via 172.17.0.10
B 172.19.8.0/24 [200/0] via 172.17.0.11
B 172.31.255.0/26 [200/0] via 172.17.0.10
B 172.31.255.0/26 [200/0] via 172.17.0.11

As seen above, all of the BGP-learned routes from northbound locations – 172.19.7.0/24 included – appear twice, once via each ESG, with an identical cost. In theory, the DLR could use either of these two paths for northbound routing. But which path will actually be used?

Path Determination and Hashing

In order for ECMP to load balance across multiple L3 paths effectively, some type of load balancing algorithm is required. Many physical L3 switches and routers offer configurable load balancing algorithms, including more complex ones based on a 5-tuple hash that takes even source/destination TCP ports into consideration. The more criteria available for analysis by the algorithm, the more likely traffic will be well balanced.

NSX’s implementation of ECMP does not include a configurable algorithm, but rather keeps things simple by using a hash based on the source and destination IP address. This is very similar to the hashing used by static etherchannel bonds – IP hash as it’s called in vSphere – and is generally not very resource intensive to calculate. With a large number of source/destination IP address combinations, a good balance of traffic across all paths should be attainable.

A few years ago, I wrote an article for the VMware Support Insider blog on manually calculating the hash value and determining the uplink when using IP hash load balancing. The general concept used by NSX for ECMP calculations is pretty much the same. Rather than calculating an index value associated with an uplink in a NIC team, we calculate an index value associated with an entry in the routing table.
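VMware doesn’t publish the exact hash function NSX uses, but the general concept can be sketched in a few lines of Python. The XOR-fold below is purely illustrative – the real algorithm inside the kernel module almost certainly differs – but it demonstrates the key property: a deterministic index into the list of equal-cost next hops, derived only from the source and destination IP addresses.

```python
import ipaddress

def select_next_hop(src_ip, dst_ip, next_hops):
    """Pick an ECMP next hop from a source/destination IP pair.
    Illustrative sketch only: the actual NSX hash is not public."""
    # Combine both addresses into a single integer...
    h = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    # ...fold the high bits into the low bits to spread them out...
    h = (h >> 16) ^ (h & 0xFFFF)
    # ...and use the result as an index into the next-hop list.
    return next_hops[h % len(next_hops)]

esgs = ["172.17.0.10", "172.17.0.11"]
path = select_next_hop("172.17.1.100", "172.19.7.100", esgs)
# The same IP pair always hashes to the same index, so a given
# source/destination "flow" stays pinned to one ESG.
assert path == select_next_hop("172.17.1.100", "172.19.7.100", esgs)
```

Note that with only a handful of source/destination pairs, any simple hash like this can easily favour one path; a good balance only emerges with many distinct flows.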

Obviously, the simplest method of determining the path used would be a traceroute from the source machine. Let’s do this on the win-a1.lab.local virtual machine:

C:\Users\Administrator>tracert -d 172.19.7.100

Tracing route to 172.19.7.100 over a maximum of 30 hops

 1 <1 ms <1 ms <1 ms 172.17.1.1
 2 4 ms <1 ms <1 ms 172.17.0.10
 3 1 ms 1 ms 1 ms 172.31.255.3
 4 2 ms 1 ms <1 ms 10.40.0.7
 5 4 ms 1 ms 1 ms 172.19.7.100

Trace complete.

The first hop, 172.17.1.1, is the DLR LIF address – the default gateway of the VM. The second hop is 172.17.0.10, which is esg-a1. Clearly, the hashing algorithm picked the first of the two possible paths in this case. If I try a different northbound address – in this case 172.19.7.1, a northbound router interface – we see a different result:

C:\Users\Administrator>tracert -d 172.19.7.1

Tracing route to 172.19.7.1 over a maximum of 30 hops

 1 <1 ms <1 ms <1 ms 172.17.1.1
 2 <1 ms <1 ms <1 ms 172.17.0.11
 3 1 ms 1 ms 1 ms 172.31.255.3
 4 2 ms 1 ms 1 ms 172.19.7.1

Trace complete.

This destination uses esg-a2. If you repeat the traceroute, you’ll notice that as long as the source and destination IP addresses remain the same, the L3 path up to the ESGs also remains the same.

Traceroute is well and good, but what if you don’t have access to SSH/RDP into the guest? Or if you wanted to test several hypothetical IP combinations?

Using net-vdr to Determine Path

Thankfully, NSX includes a special net-vdr option to calculate the expected path. Let’s run it through its paces.

Keep in mind that because the ESXi hosts are actually doing the datapath routing, it’s there that we’ll need to do this – not on the DLR control VM appliance. Since my source VM win-a1 is on host esx-a1, I’ll SSH there. That said, it really doesn’t matter which ESXi host you use to check the path: because the DLR instance is the same across all configured ESXi hosts, the path selection is also the same.

First, we determine the name of the DLR instance using the net-vdr -I -l command:

[root@esx-a1:~] net-vdr -I -l

VDR Instance Information :
---------------------------

Vdr Name: default+edge-1
Vdr Id: 0x00001388
Number of Lifs: 6
Number of Routes: 16
State: Enabled
Controller IP: 172.16.10.43
Control Plane IP: 172.16.10.21
Control Plane Active: Yes
Num unique nexthops: 2
Generation Number: 0
Edge Active: No

In my case, I’ve got only one instance called default+edge-1. The ‘Vdr Name’ will include the tenant name, followed by a ‘+’ and then the edge-ID which is visible in the NSX UI.

Next, let’s take a look at the DLR routing table from the ESXi host’s perspective. Earlier we looked at this from the DLR control VM, but ultimately this data needs to make it to the ESXi host for routing to function. These BGP learned routes originated on the control VM, were sent to the NSX control cluster and then synchronized with ESXi via the netcpa agent.

[root@esx-a1:~] net-vdr -R -l default+edge-1

VDR default+edge-1 Route Table
Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
Legend: [H: Host], [F: Soft Flush] [!: Reject] [E: ECMP]

Destination GenMask Gateway Flags Ref Origin UpTime Interface
----------- ------- ------- ----- --- ------ ------ ---------
10.40.0.0 255.255.255.128 172.17.0.10 UGE 1 AUTO 1189684 138800000002
10.40.0.0 255.255.255.128 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.16.10.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.16.10.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.16.11.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.16.11.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.16.12.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.16.12.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.16.13.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.16.13.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.16.14.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.16.14.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.16.76.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.16.76.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.17.0.0 255.255.255.192 0.0.0.0 UCI 1 MANUAL 1189684 138800000002
172.17.1.0 255.255.255.0 0.0.0.0 UCI 1 MANUAL 1189684 13880000000a
172.17.2.0 255.255.255.0 0.0.0.0 UCI 1 MANUAL 1189684 13880000000b
172.17.3.0 255.255.255.0 0.0.0.0 UCI 1 MANUAL 1189684 13880000000c
172.17.4.0 255.255.255.0 0.0.0.0 UCI 1 MANUAL 1189684 13880000000d
172.17.5.0 255.255.255.0 0.0.0.0 UCI 1 MANUAL 1189684 13880000000e
172.19.7.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.19.7.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.19.8.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.19.8.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1189684 138800000002
172.31.255.0 255.255.255.192 172.17.0.10 UGE 1 AUTO 1189684 138800000002
172.31.255.0 255.255.255.192 172.17.0.11 UGE 1 AUTO 1189684 138800000002

The routing table looks similar to the one on the control VM, with a few exceptions. From this view, we don’t know where the routes originated – only whether they are connected interfaces or reachable via a gateway. Just as we saw on the control VM, the ESXi host also knows that each gateway route has two equal-cost paths.

Now let’s look at a particular net-vdr option called ‘resolve’:

[root@esx-a1:~] net-vdr --help
<snip>
--route -o resolve -i destIp [-M destMask] [-e srcIp] vdrName Resolve a route in a vdr instance
<snip>

Plugging in the same combination of source/destination IP I used in the first traceroute, I see that the net-vdr module agrees:

[root@esx-a1:~] net-vdr --route -o resolve -i 172.19.7.100 -e 172.17.1.100 default+edge-1

VDR default+edge-1 Route Table
Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
Legend: [H: Host], [F: Soft Flush] [!: Reject] [E: ECMP]

Destination GenMask Gateway Flags Ref Origin UpTime Interface
----------- ------- ------- ----- --- ------ ------ ---------
172.19.7.0 255.255.255.0 172.17.0.10 UGE 1 AUTO 1190003 138800000002

As seen above, the output of the command identifies the specific route in the routing table that will be used for that source/destination IP address pair. As we confirmed in the traceroute earlier, esg-a1 (172.17.0.10) is used.

We can repeat the command with a different source/destination IP pair to see if it selects esg-a2 (172.17.0.11):

[root@esx-a1:~] net-vdr --route -o resolve -i 172.19.7.100 -e 172.17.1.1 default+edge-1

VDR default+edge-1 Route Table
Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
Legend: [H: Host], [F: Soft Flush] [!: Reject] [E: ECMP]

Destination GenMask Gateway Flags Ref Origin UpTime Interface
----------- ------- ------- ----- --- ------ ------ ---------
172.19.7.0 255.255.255.0 172.17.0.11 UGE 1 AUTO 1190011 138800000002

And indeed it does.

What About Ingress Traffic?

NSX’s implementation of ECMP is applicable to egress traffic only. The path selection done northbound of the ESGs would be at the mercy of the physical router or L3 switch performing the calculation. That said, you’d definitely want ingress traffic to also be balanced for efficient utilization of ESGs in both directions.

Without an appropriate equal cost path configuration on the physical networking gear, you may find that all return traffic or ingress traffic uses only one of the available L3 paths.

In my case, I’m using a VyOS routing appliance just northbound of the edges called router-core.

vyos@router-core1:~$ sh ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
 I - ISIS, B - BGP, > - selected route, * - FIB route

C>* 10.40.0.0/25 is directly connected, eth0.40
C>* 127.0.0.0/8 is directly connected, lo
B>* 172.16.10.0/24 [20/1] via 172.31.255.1, eth0.20, 02w0d01h
B>* 172.16.11.0/24 [20/1] via 172.31.255.1, eth0.20, 02w0d01h
B>* 172.16.12.0/24 [20/1] via 172.31.255.1, eth0.20, 02w0d01h
B>* 172.16.13.0/24 [20/1] via 172.31.255.1, eth0.20, 02w0d01h
B>* 172.16.14.0/24 [20/1] via 172.31.255.1, eth0.20, 02w0d01h
B>* 172.16.76.0/24 [20/1] via 172.31.255.1, eth0.20, 02w0d01h
B>* 172.17.0.0/26 [20/0] via 172.31.255.10, eth0.20, 02w0d00h
 * via 172.31.255.11, eth0.20, 02w0d00h
B>* 172.17.1.0/24 [20/0] via 172.31.255.10, eth0.20, 02w0d00h
 * via 172.31.255.11, eth0.20, 02w0d00h
B>* 172.17.2.0/24 [20/0] via 172.31.255.10, eth0.20, 02w0d00h
 * via 172.31.255.11, eth0.20, 02w0d00h
B>* 172.17.3.0/24 [20/0] via 172.31.255.10, eth0.20, 02w0d00h
 * via 172.31.255.11, eth0.20, 02w0d00h
B>* 172.17.4.0/24 [20/0] via 172.31.255.10, eth0.20, 02w0d00h
 * via 172.31.255.11, eth0.20, 02w0d00h
B>* 172.17.5.0/24 [20/0] via 172.31.255.10, eth0.20, 02w0d00h
 * via 172.31.255.11, eth0.20, 02w0d00h
B>* 172.19.7.0/24 [20/1] via 10.40.0.7, eth0.40, 02w0d01h
B>* 172.19.8.0/24 [20/1] via 10.40.0.7, eth0.40, 01w6d20h
C>* 172.31.255.0/26 is directly connected, eth0.20

As you can see above, this router also allows multiple equal-cost paths – each of the southbound networks learned via BGP appears twice in the routing table. This was achieved by simply configuring BGP with a ‘maximum-paths’ value greater than ‘1’:

vyos@router-core1:~$ sh configuration
<snip>
protocols {
 bgp 64512 {
 maximum-paths {
 ebgp 4
 ibgp 4
 }

I’m honestly not sure which load balancing algorithm VyOS implements, but from an NSX perspective, it doesn’t really matter. It doesn’t have to match; it simply needs to balance traffic across each of the available L3 paths. As long as an ingress packet arrives at one of the ESGs, it’ll know how to route it southbound.

So there you have it. There is obviously a lot more to ECMP than what I discussed in this post, but hopefully this helps to clarify a bit about path selection.

Finding the NSX VIB Download URL

Although in most cases it’s not necessary to obtain the NSX VIBs, there are some situations where you’ll need them – most notably, when incorporating the VIBs into an image profile used for stateless auto-deploy ESXi hosts.

In older versions of NSX, including 6.0.x and 6.1.x, you could use a very simple URL format:

https://<nsxmanagerIP>/bin/vdn/vibs/<esxi-version>/vxlan.zip

For example, if you wanted the NSX VIBs for ESXi 5.5 hosts from an NSX manager with IP address 192.168.0.10, you’d use the following URL:

https://192.168.0.10/bin/vdn/vibs/5.5/vxlan.zip

This URL changed at some point – I expect with the introduction of NSX 6.2.x. VMware now uses a less predictable path that includes build numbers. To get the exact locations for your specific version of NSX, visit the following URL:

https://<nsxmanagerIP>/bin/vdn/nwfabric.properties

When visiting this URL, you’ll be greeted by several lines of text output, which includes the path to the VIBs for various versions of ESXi. Below is some sample output you’ll see with NSX 6.3.2:

# 5.5 VDN EAM Info
VDN_VIB_PATH.1=/bin/vdn/vibs-6.3.2/5.5-5534162/vxlan.zip
VDN_VIB_VERSION.1=5534162
VDN_HOST_PRODUCT_LINE.1=embeddedEsx
VDN_HOST_VERSION.1=5.5.*

# 6.0 VDN EAM Info
VDN_VIB_PATH.2=/bin/vdn/vibs-6.3.2/6.0-5534166/vxlan.zip
VDN_VIB_VERSION.2=5534166
VDN_HOST_PRODUCT_LINE.2=embeddedEsx
VDN_HOST_VERSION.2=6.0.*

# 6.5 VDN EAM Info
VDN_VIB_PATH.3=/bin/vdn/vibs-6.3.2/6.5-5534171/vxlan.zip
VDN_VIB_VERSION.3=5534171
VDN_HOST_PRODUCT_LINE.3=embeddedEsx
VDN_HOST_VERSION.3=6.5.*

# Single Version associated with all the VIBs pointed by above VDN_VIB_PATH(s)
VDN_VIB_VERSION=6.3.2.5672532

Legacy vib location. Used by code to discover avaialble legacy vibs.
LEGACY_VDN_VIB_PATH_FS=/common/em/components/vdn/vibs/legacy/
LEGACY_VDN_VIB_PATH_WEB_ROOT=/bin/vdn/vibs/legacy/

So as you can see above, the VIB path for ESXi 6.5 for example would be:

VDN_VIB_PATH.3=/bin/vdn/vibs-6.3.2/6.5-5534171/vxlan.zip

And to access this as a URL, you’d simply tack this path onto the normal NSX https location as follows:

https://<nsxmanagerIP>/bin/vdn/vibs-6.3.2/6.5-5534171/vxlan.zip
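If you need to do this lookup regularly – say, when rebuilding auto-deploy image profiles – it’s easy to script. The sketch below just parses the properties text into full URLs; the sample lines and the 192.168.0.10 manager address come from the examples above, and actually fetching the file is left out since the NSX manager typically presents a self-signed certificate.

```python
import re

def parse_vib_paths(properties_text, manager):
    """Map each VDN_VIB_PATH entry from nwfabric.properties output to a
    full download URL. 'manager' is the NSX manager IP or hostname."""
    urls = {}
    for key, path in re.findall(r"(VDN_VIB_PATH\.\d+)=(\S+)", properties_text):
        urls[key] = "https://" + manager + path
    return urls

# Sample lines as returned by https://<nsxmanagerIP>/bin/vdn/nwfabric.properties:
sample = """VDN_VIB_PATH.1=/bin/vdn/vibs-6.3.2/5.5-5534162/vxlan.zip
VDN_VIB_PATH.2=/bin/vdn/vibs-6.3.2/6.0-5534166/vxlan.zip
VDN_VIB_PATH.3=/bin/vdn/vibs-6.3.2/6.5-5534171/vxlan.zip"""

urls = parse_vib_paths(sample, "192.168.0.10")
# urls["VDN_VIB_PATH.3"] is now:
# "https://192.168.0.10/bin/vdn/vibs-6.3.2/6.5-5534171/vxlan.zip"
```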

The file itself is about 17MB in size in 6.3.2. Despite being named vxlan.zip, it actually contains two VIBs: the VSIP module for DFW and message bus purposes, as well as the VXLAN module for logical switching and routing.

NSX 6.2.8 Released!

It’s always an exciting time at VMware when a new NSX build goes ‘GA’. Yesterday (July 6th) marks the official release of NSX-V 6.2.8.

NSX 6.2.8 is a maintenance or patch release focused mainly on bug fixes, and there are quite a few in this one. You can head over to the release notes for the full list, but I’ll provide a few highlights below.


In my opinion, some of the most important fixes in 6.2.8 include the following:

Fixed Issue 1760940: NSX Manager High CPU triggered by many simultaneous vMotion tasks

This was a fairly common issue that we would see in larger deployments with large numbers of dynamic security groups. The most common trigger was putting a host into maintenance mode, which kicks off a large number of simultaneous vMotions. I’m happy to see that this one was finally fixed. Unfortunately, it doesn’t seem that this has yet been corrected in any of the 6.3.x releases, but I’m sure it will come. You can find more information in VMware KB 2150668.

Fixed Issue 1854519: VMs lose North to south connectivity after migration from a VLAN to a bridged VXLAN

This next one is not quite as common, but I’ve personally seen a couple of customers hit it. If you have a VM on a VLAN network and then move it to the VXLAN dvPortgroup associated with the bridge, connectivity is lost. This happens because a RARP doesn’t get sent to update the physical switch’s MAC table (VMware often uses RARP instead of GARP for this purpose). Most customers use the VLAN-backed half of the bridged network for physical devices rather than VMs, but there is no reason why this shouldn’t work.

Fixed Issue 1849037: NSX Manager API threads get exhausted when communication link with NSX Edge is broken

I think this one is pretty self-explanatory – if the manager can’t process API calls, a lot of its functionality is lost. There are a number of reasons an NSX Edge could lose its communication channel to the NSX manager, so this is definitely a good fix.

Fixed Issue 1813363: Multiple IP addresses on same vNIC causes delays in firewall publish operation

Another good fix that should help reduce NSX manager CPU utilization and improve scale. Multiple IP addresses on the same vNIC are fairly common to see.

Fixed Issue 1798537: DFW controller process on ESXi (vsfwd) may run out of memory

I’ve seen this issue a few times in very large micro-segmentation environments, and I’m very happy to see it fixed. This should certainly help improve stability and environment scale.

FreeNAS Power Consumption and ACPI

As I alluded to in part 4 and part 5 of my recent ‘FreeNAS 9.10 Lab Build’ series, I was able to achieve better power consumption figures after enabling deeper C-states. I first noticed this a year or two ago when I re-purposed a Shuttle SH67H3 cube PC for use with FreeNAS. I was familiar with the expected power draw of this system from using it with other operating systems in the past, but with FreeNAS installed, it seemed higher than it should have been.

After doing some digging on the subject, I came across a thread on the FreeNAS forums describing the default ACPI C-state used and information on how to modify it.

A Bit of Background on ACPI C-States

What are often referred to as ‘C-States’ are basically levels of CPU power savings defined by the ACPI (Advanced Configuration and Power Interface) standard. This standard defines states for other devices in the system as well – not just CPUs – but all states prefixed by a ‘C’ refer to CPU power states.

The states supported by a system will often vary depending on the age of the system and the CPU manufacturer, but most modern Intel based systems will support ACPI states C0, C1, C2 and C3. Some older systems only supported C1 and a lower power state called C1E.

C0 is essentially the state where the CPU is completely awake, operating at its full frequency and performance potential and actually executing instructions. The higher the ‘C’ value, the deeper into power savings or sleep modes the CPU can go.

All ACPI-compliant systems must also support a state called C1. In the C1 state, the CPU is basically idle and isn’t executing any instructions. The key requirement for C1 is that the CPU must be able to return to C0 and execute instructions immediately, without any latency. Because of this requirement, there are only so many power saving tweaks that the CPU can implement.

This is where the C2 state comes in, also known as ‘Stop-Clock’. In this ACPI idle state, additional power savings features can be used. This is where modern Intel processors shine. They can do all sorts of power saving wizardry, like turning off unused portions of the CPU or even entire idle CPU cores. Because there can be a very slight delay in returning to the C0 state when using these features, they cannot be implemented when limited to ACPI C1.

Enabling Deeper ACPI C-States

By default, FreeNAS has ACPI configured to use only the C1 power state. Presumably, this is to guarantee maximum performance and prevent any quirks with older CPUs switching between power states.

If maximum performance and performance consistency is desired over power savings – as is often the case in critical production environments – leaving the default at C1 is probably a wise choice. This choice also becomes much less important if your system is heavily utilized and spends very little time at idle anyway. But if you are like me, running a home lab, idle power consumption is an important consideration.

From an SSH or console prompt, you can determine the detected and supported C-States by querying the relevant ACPI sysctls:

[root@freenas] ~# sysctl -a | grep cx_
hw.acpi.cpu.cx_lowest: C1
dev.cpu.3.cx_usage: 100.00% 0.00% last 19us
dev.cpu.3.cx_lowest: C1
dev.cpu.3.cx_supported: C1/1/1 C2/3/96
dev.cpu.2.cx_usage: 100.00% 0.00% last 136us
dev.cpu.2.cx_lowest: C1
dev.cpu.2.cx_supported: C1/1/1 C2/3/96
dev.cpu.1.cx_usage: 100.00% 0.00% last 2006us
dev.cpu.1.cx_lowest: C1
dev.cpu.1.cx_supported: C1/1/1 C2/3/96
dev.cpu.0.cx_usage: 100.00% 0.00% last 2101us
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_supported: C1/1/1 C2/3/96

As you can see above, my Xeon X3430 supports the C1 and C2 states. The cx_usage value reports how much time the processor spends in each supported state, and cx_lowest reports the deepest state currently allowed.
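The cx_supported strings are compact but easy to pull apart. If I’m reading FreeBSD’s acpi_cpu output correctly, each entry is state/ACPI-type/latency, with the final field being the transition latency in microseconds – treat that interpretation as an assumption in this rough sketch:

```python
def parse_cx_supported(value):
    """Parse a dev.cpu.N.cx_supported value such as 'C1/1/1 C2/3/96'.
    Assumes each entry is <state>/<ACPI type>/<transition latency, us>."""
    states = []
    for entry in value.split():
        name, ctype, latency = entry.split("/")
        states.append({"state": name, "type": int(ctype),
                       "latency_us": int(latency)})
    return states

for s in parse_cx_supported("C1/1/1 C2/3/96"):
    print(s["state"], "latency:", s["latency_us"], "us")
# C1 latency: 1 us
# C2 latency: 96 us
```

The 96µs figure lines up nicely with the earlier discussion: C2 trades a tiny wake-up latency for much better power savings.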

In the C1 state, my system is idling at about 95W total power consumption.

I had originally suspected that Intel SpeedStep technology (a feature that provides dynamic frequency and voltage scaling at idle) was not functioning in ACPI C1, but that doesn’t seem to be the case. My CPU’s normal (non turbo-boost) frequency is 2.4GHz. If SpeedStep is functional, I’d expect it to use one of the following lower power states as defined in the following sysctl:

[root@freenas] ~# sysctl -a | grep dev.cpu.0.freq_levels
dev.cpu.0.freq_levels: 2395/95000 2394/95000 2261/78000 2128/63000 1995/57000 1862/46000 1729/36000 1596/32000 1463/25000 1330/19000 1197/17000

As seen above, the processor should be able to scale down to 1197MHz at idle. Even with the powerd daemon stopped, you can run powerd from the command line to watch the current CPU frequency and any changes as load increases. Using powerd with the -v (verbose) option, we can see that the processor frequency does indeed jump up and down in the ACPI C1 state and stabilizes at 1197MHz when idle:

[root@freenas] ~# powerd -v
<snip>
changing clock speed from 1463 MHz to 1330 MHz
load   4%, current freq 1330 MHz ( 9), wanted freq 1263 MHz
load   0%, current freq 1330 MHz ( 9), wanted freq 1223 MHz
load   0%, current freq 1330 MHz ( 9), wanted freq 1197 MHz
changing clock speed from 1330 MHz to 1197 MHz
load  26%, current freq 1197 MHz (10), wanted freq 1197 MHz
load   3%, current freq 1197 MHz (10), wanted freq 1197 MHz
load   0%, current freq 1197 MHz (10), wanted freq 1197 MHz
load   4%, current freq 1197 MHz (10), wanted freq 1197 MHz
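The freq_levels sysctl shown earlier is a list of frequency/power pairs – the second field appears to be the driver’s estimated power draw in milliwatts, though that’s my assumption. A small sketch to turn it into something usable:

```python
def parse_freq_levels(value):
    """Parse a dev.cpu.N.freq_levels value into (MHz, mW) tuples.
    The second field is assumed to be estimated power in milliwatts."""
    return [tuple(int(f) for f in level.split("/"))
            for level in value.split()]

levels = parse_freq_levels("2395/95000 2261/78000 1197/17000")
lowest_mhz = min(mhz for mhz, _ in levels)
# lowest_mhz is 1197 -- matching the idle frequency powerd settles on.
```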

You can change the lowest allowed ACPI state by using the following command. In this example, I will allow the system to use more advanced power saving features by setting cx_lowest to C2:

[root@freenas] ~# sysctl hw.acpi.cpu.cx_lowest=C2
hw.acpi.cpu.cx_lowest: C1 -> C2

After making this change, the system power consumption immediately dropped down to about 75W. That’s more than 20% – not bad!

Now if we repeat the previous command, we can see some different reporting:

[root@freenas] ~# sysctl -a | grep cx_
hw.acpi.cpu.cx_lowest: C2
dev.cpu.3.cx_usage: 0.00% 100.00% last 19695us
dev.cpu.3.cx_lowest: C2
dev.cpu.3.cx_supported: C1/1/1 C2/3/96
dev.cpu.2.cx_usage: 5.12% 94.87% last 23us
dev.cpu.2.cx_lowest: C2
dev.cpu.2.cx_supported: C1/1/1 C2/3/96
dev.cpu.1.cx_usage: 1.87% 98.12% last 255us
dev.cpu.1.cx_lowest: C2
dev.cpu.1.cx_supported: C1/1/1 C2/3/96
dev.cpu.0.cx_usage: 1.28% 98.71% last 1200us
dev.cpu.0.cx_lowest: C2
dev.cpu.0.cx_supported: C1/1/1 C2/3/96

Part of the reason the power savings were so significant is that the system spends over 95% of its idle time in the C2 state.

Making it Stick

One thing you’ll notice is that after a reboot, this change reverts to the default. Since this is a sysctl, you’d think it could simply be added as a system tunable in the UI. I tried this to no avail.

After doing some digging, I found a bug reported on this. It appears that this is a known problem due to other rc.d scripts interfering with the cx_lowest sysctl.

I took a look in the /etc/rc.d directory and found a script called power_profile. It does indeed appear to overwrite the lowest ACPI state at bootup:

[root@freenas] ~# cat /etc/rc.d/power_profile
<snip>
# Set the various sysctls based on the profile's values.
node="hw.acpi.cpu.cx_lowest"
highest_value="C1"
lowest_value="Cmax"
eval value=\$${profile}_cx_lowest
sysctl_set
<snip>

I could probably get around this issue by modifying the startup scripts, but the better solution is to simply add the required command to the list of post-init scripts. This can be done from the UI in the following location:

freenas-acpi-1

The key thing to ensure is that the command is run ‘post init’. When that is done, the setting sticks and is applied after the power_profile script runs.

Hopefully this could save you a few dollars on your power bill as well!

FreeNAS 9.10 Lab Build – Part 5

In part 4 of this series, I took a look at my new (used) Dell PowerEdge T110 and talked about the pros and cons of using this type of machine. Today, I’ll be installing the drives and completing the build.

freenas5-3

To begin, I installed my drives into the server’s normal 3.5″ mounting locations. I had a few challenges here, but I was very thankful that I kept the Dell-branded SATA cable that came with the PERC H200 card. It’s made specifically for this machine and keeps the wiring organized. In all of the custom builds I’ve done, it’s really tough to get clean SATA power wiring because of the close proximity of the drives. Because of this, there is often too much pressure on the connectors, and I’ve even had issues with them coming loose. This Dell breakout cable is very flexible and, thanks to the built-in power connector, makes for a very clean and secure wiring job.

Unfortunately, this meant I needn’t have bought the two SATA breakout cables that I discussed in part 1 of the series – but at least I’ve got extras if I ever want to add more drives to the system.

freenas5-2

To get the two 2.5″ SATA drives installed, I used a couple of Kingston-brand 2.5″-to-3.5″ adapters. They were a perfect fit for the drive caddies I bought off eBay, but they interfered with the Dell SATA breakout cable connector. To get around this, I just loosened the screws securing the SSDs to gain a couple of extra millimeters of clearance for the connector. Eventually, I hope to add a hot-swap enclosure for 2.5″ drives to one of the 5.25″ drive bays, but for now this will have to do.

Continue reading “FreeNAS 9.10 Lab Build – Part 5”

Refurbishing Old Keyboards

Today I’m going to shift gears a bit and take a look at some retro PC keyboards.

keyboard1-0

I’m currently in the process of refurbishing a couple of old PCs, including a classic DEC brand 486 DX2 and a newer 2001 era AMD Athlon system. A lot of people see these machines as trash, but for me there is a lot of nostalgia around systems like this and I see value in trying to restore and maintain them.

One of the first problems I ran into when trying to test these systems is that not all keyboards were created equal. Several connector standards have been used over the years, including AT, PS/2 and, more recently, USB. I have always kept a couple of USB-to-PS/2 adapters for occasions such as this, but to my surprise, three USB keyboards I tried were simply not functional when using the PS/2 adapter. As it turns out, there needs to be some intelligence in the keyboard itself to detect when it’s been plugged into a PS/2 port and fall back to that signalling standard. The only keyboard I could get to work with the adapter was my trusty Das Keyboard, which I was not willing to move away from my main system.

Continue reading “Refurbishing Old Keyboards”

FreeNAS 9.10 Lab Build – Part 4

In part 4 of this series, I’ll be taking a look at my new (used) Dell PowerEdge T110 server and sizing it up for use with FreeNAS 9.10.

On my daily perusal through my favorite eBay seller’s inventory, I came across a well-used and somewhat scratched-up Dell T110 tower server. It probably didn’t garner a lot of interest with its meager 1GB of RAM, lack of hard drives and rough appearance. Despite this, the seller’s asking price of $99 wasn’t bad when you consider that it had a Xeon X3430 quad-core processor and was tested and functional.

Since I already had a working PERC H200 card – an optional and supported controller in the Dell T110 – as well as a pair of 4GB 2Rx8 ECC DIMMs collecting dust, this box was suddenly appealing to me.

The next day, I saw the price drop to $75 and almost pulled the trigger until I saw the hefty shipping cost. I didn’t think it would last long at that price, so I made an offer of $63 to offset the shipping a bit and was pleased to find that the seller accepted.

It needed a lot of TLC when it arrived. I spent a fair amount of time removing caked-on dust from the fan assembly and also spent some quality time with a can of compressed air.

Continue reading “FreeNAS 9.10 Lab Build – Part 4”

FreeNAS 9.10 Lab Build – Part 3

In Part 2 of my FreeNAS 9.10 build series, I installed my newly flashed Dell PERC H200 LSI 9211-8i card into my primary management host to try out FreeNAS as a VM with PCI passthrough.

After opening the case to do some rewiring, I was totally shocked at how hot the little aluminum heatsink on the card was. This thing wasn’t just hot – it was scalding. Hot enough to invoke the subconscious, instinctive reaction to pull your finger away. If I had to guess, I’d say that heatsink was at least 90°C.

Although it was surprising, I had to remind myself that this adapter is not a consumer-grade card and does have an airflow requirement to run reliably. Dell’s H200 user guide says very little about cooling requirements, unfortunately, but some of the Dell PowerEdge servers that support the H200 classify it as a ‘25W TDP’ card. That may not sound like a lot of thermal output, but when you consider that most of it is concentrated in an area the size of a dime, it’s a lot to dissipate.

In most rackmount-type cases, a minimum amount of front-to-back airflow is usually directed across all of the PCI-Express slots, but in my large Phanteks Enthoo Pro, there happens to be very little airflow in this area with the default fan configuration.

After looking around online, I could see that this was not an uncommon problem among SAS2008-based adapters – including the popular IBM M1015 – and that in some circumstances the card could even overheat under heavy load. Considering how hot mine was while sitting idle, I can only imagine.

Even though I’ll only be using this card in a lab, I do want to ensure it runs reliably and lives a long life. It seemed that there were a few possible solutions I could pursue:

  1. Add a small 40mm fan to the heatsink.
  2. Replace the tiny heatsink with something larger.
  3. Find a way to increase the airflow in this area of the case.

The third option appealed to me most – mainly because I hate small fans. They usually have an annoying ‘whine’ to them as they need to spin at a higher RPM, and they can be pretty unreliable. I’d also expect the width of the fan to block the adjoining PCI-Express slot.

So after taking a look through my spare parts, I came across an old Noctua 92mm PWM fan collecting dust. Although I think Noctua’s marketing is a bit over the top, I have been using their fans for many years and can attest to their high quality and quiet operation.

After MacGyver’ing a couple of thumbscrews and metal brackets, I was able to get the fan into the perfect position. Also, because it’s a PWM modulated fan, it spins down with the rest of the system fans and is pretty much inaudible at under 1000RPM.

Even though it feels like there is barely any airflow coming from the Noctua NF-B9 fan at reduced RPM, it’s enough to dissipate the hot air from around the heatsink fins, and the heatsink is now only warm to the touch! It really did make a huge difference.

Problem solved. Hopefully whatever case I ultimately use for my FreeNAS build will not have these sorts of airflow dead spots, but at least there could be a simple solution.

FreeNAS 9.10 Lab Build Series:

Part 1 – Defining the requirements and flashing the Dell PERC H200 SAS card.
Part 2 – FreeNAS and VMware PCI passthrough testing.
Part 3 – Cooling the toasty Dell PERC H200.
Part 4 – A close look at the Dell PowerEdge T110.
Part 5 – Completing the hardware build.

Beacon Probing Deep Dive

Today I’ll be looking at a feature I’ve wanted to examine for some time – Beacon Probing. I hope to take a fresh look at this often misunderstood feature, explore the pros, cons, quirks and take a bit of a technical deep-dive into its inner workings.

According to the vSphere Networking Guide, we see that Beacon Probing is one of two available NIC failure detection mechanisms. Whenever we’re dealing with a team of two or more NICs, ESXi must be able to tell when a network link is no longer functional so that it can fail-over all VMs or kernel ports to the remaining NICs in the team.

Beacon Probing

Beacon probing takes network failure detection to the next level. As you’ve probably already guessed, it does not rely solely on NIC link-state to detect a failure. Let’s have a look at the definition of Beacon Probing in the vSphere 6.0 Networking Guide on page 92:

“[Beacon Probing] sends out and listens for beacon probes on all NICs in the team and uses this information, in addition to link status, to determine link failure.”

This statement sums up the feature very succinctly, but obviously there is a lot more going on behind the scenes. How do these beacons work? How often are they sent out? Are they broadcast or unicast frames? What do they look like? How do they work when multiple VLANs are trunked across a single link? What are the potential problems when using beacon probing?

Today, we’re going to answer these questions and hopefully give you a much better look at how beacon probing actually works.
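Before diving in, the core quorum idea can be sketched in a few lines of Python. This is purely illustrative – not VMware’s actual implementation – and the NIC names and data structures are hypothetical: each NIC in the team broadcasts beacon frames, every teammate should hear them, and a NIC that hears none of its teammates while the rest still hear each other can be singled out as failed.

```python
# Illustrative sketch of beacon probing's quorum logic (not VMware's
# actual implementation). 'heard' maps each NIC in the team to the set
# of teammates whose beacon frames it has recently received.
def probe_status(heard: dict[str, set[str]]) -> dict[str, str]:
    nics = set(heard)
    status = {}
    for nic in nics:
        peers = nics - {nic}
        # A NIC is considered failed if it hears none of its teammates.
        if peers and not (heard[nic] & peers):
            status[nic] = "failed"
        else:
            status[nic] = "ok"
    return status

# With three NICs the failed uplink is unambiguous: vmnic2 hears nobody,
# while vmnic0 and vmnic1 still hear each other.
status = probe_status({
    "vmnic0": {"vmnic1"},
    "vmnic1": {"vmnic0"},
    "vmnic2": set(),
})
print(status)
```

Note the catch with a two-NIC team: if the path between the NICs fails, each stops hearing the other and neither can be singled out – which is one reason VMware recommends at least three NICs in the team when using beacon probing.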

Continue reading “Beacon Probing Deep Dive”

FreeNAS 9.10 Lab Build – Part 2

In Part 1 of this series, I discussed building a proper FreeNAS server and prepared a Dell PERC H200 by flashing it to an LSI 9211-8i in IT mode. But while I was looking around for suitable hardware for the build, I decided to try something that I’ve wanted to do for a long time – PCI passthrough.

This would give me an opportunity to tinker with vt-d passthrough and put my freshly flashed Dell PERC H200 through its paces.

Why Not VMDK Disks?

As mentioned in Part 1 of this series, FreeNAS makes use of ZFS, which is much more than just a filesystem. It combines the functionality of a logical volume manager and an advanced filesystem providing a whole slew of features including redundancy and data integrity. For it to do this effectively – and safely – ZFS needs direct access to SATA or SAS drives. We want ZFS to manage all aspects of the drives and the storage pool and should remove all layers of abstraction between the FreeNAS OS and the drives themselves.

As you probably know, FreeNAS works well enough as a virtual machine for lab purposes. After all, ignoring what’s in between, ones and zeros still make it from FreeNAS to the disks. That said, using a virtual SCSI adapter and VMDK disks certainly does not qualify as ‘direct access’. In fact, the data path would be packed with layers of abstraction and would look something like this:

Physical Disk <-> SATA/SAS HBA <-> ESXi Hypervisor HBA driver <-> VMFS 5 Filesystem <-> VMDK virtual disk <-> Virtual SCSI Adapter <-> FreeNAS SCSI driver <-> FreeNAS/FreeBSD

In contrast, a physical FreeNAS server would look more like:

Physical Disk <-> SATA/SAS HBA <-> FreeNAS HBA driver <-> FreeNAS/FreeBSD


Continue reading “FreeNAS 9.10 Lab Build – Part 2”