In this post, I want to discuss how to verify Virtual Gateway forwarding behaviour on Broadcom based Juniper QFX switches.
The general assumption with EVPN Anycast Gateway is that gateway flows are load-balanced across all gateway devices. And whilst EVPN provides the mechanism to support this behaviour, there is a requirement for the forwarding hardware to also support it.
The mechanism for an EVPN device to load balance gateway flows is to install the virtual gateway ESI as a next-hop for the virtual gateway MAC address. However, Broadcom based QFX switches do not support this behaviour and can only install a single VTEP as a next-hop. This means that traffic flows heading towards the virtual gateway will only ever traverse a single gateway device. This behaviour is well documented, and there has been some talk of Broadcom working with vendors to improve gateway load-balancing with ESI functionality.
Now that we understand the characteristics, let’s look at the steps to verify forwarding behaviour on a Broadcom based QFX switch. Here we’ll look at how to identify which VTEP is being used to reach the virtual-gateway MAC address and how the underlay is transporting the traffic.
Lab Setup
The lab setup is a typical EVPN-VXLAN fabric with central routing on Core1 and Core2. The leaf switches are a mix of QFX5100 (Trident II) and QFX5110 (Trident II+) devices. I’ve used physical hardware for this lab as the behaviour of the vQFX is different. The vQFX emulates the Q5 PFE, thus supporting ESI for forwarding.
Step 1 – Identify Virtual Gateway MAC address
It’s good practice to manually set the virtual gateway MAC address and ESI for the IRB gateway. This makes verification and troubleshooting easier as you can quickly identify the values.
In this example, we will study a flow from S1-101, via LEAF1, to S3-102.
The Ethernet switching table on LEAF1 shows the VGW MAC is 00:00:5e:00:01:01 for VLAN101. We can also see the ESI value is 00:00:00:66:66:00:00:00:01:01.
lab@LEAF1-QFX5110> show ethernet-switching table vlan-id 101
MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
SE - statistics enabled, NM - non configured MAC, R - remote PE MAC, O - ovsdb MAC)
Ethernet switching table : 6 entries, 6 learned
Routing instance : default-switch
Vlan                MAC                 MAC      Logical          Active
name                address             flags    interface        source
VLAN101             00:00:5e:00:01:01   DR       esi.1775         00:00:00:66:66:00:00:00:01:01
VLAN101             64:64:9b:d6:68:20   D        vtep.32770       10.0.255.11
VLAN101             80:71:1f:c6:26:f0   D        vtep.32769       10.0.255.12
VLAN101             f8:c0:01:cb:ab:03   DL       ae0.0
VLAN101             f8:c0:01:cb:ab:05   D        vtep.32772       10.0.255.3
VLAN101             f8:c0:01:cb:ab:06   D        vtep.32773       10.0.255.4
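If you want to script this check rather than read the table by eye, a minimal Python sketch could look like the one below. It assumes the CLI output has already been captured as text. The names parse_switching_table and esi_entries are hypothetical helpers of my own, not part of Junos or any Juniper library, and the column order is taken from the output above.

```python
import re

# Hypothetical helpers for illustration only; not a Juniper API.
MAC_RE = re.compile(r"^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$")


def parse_switching_table(text):
    """Return one dict per MAC row found in 'show ethernet-switching table' text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed row layout: VLAN, MAC, flags, logical interface, [active source]
        if len(fields) >= 4 and MAC_RE.match(fields[1].lower()):
            rows.append({
                "vlan": fields[0],
                "mac": fields[1].lower(),
                "flags": fields[2],
                "interface": fields[3],
                "source": fields[4] if len(fields) > 4 else None,
            })
    return rows


def esi_entries(rows):
    """Keep only the rows whose logical interface is an ESI (esi.N)."""
    return [r for r in rows if r["interface"].startswith("esi.")]


# Three rows copied from the LEAF1 output above.
SAMPLE = """\
VLAN101 00:00:5e:00:01:01 DR esi.1775 00:00:00:66:66:00:00:00:01:01
VLAN101 64:64:9b:d6:68:20 D vtep.32770 10.0.255.11
VLAN101 f8:c0:01:cb:ab:03 DL ae0.0
"""

for row in esi_entries(parse_switching_table(SAMPLE)):
    print(row["vlan"], row["mac"], row["interface"], row["source"])
```

Filtering on the esi. prefix is just a convenient way to pick out the virtual gateway entry; matching on the manually configured VGW MAC would work equally well.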
Step 2 – Verify underlay forwarding table for Virtual Gateway MAC
The extensive output details the forwarding path used to reach the VGW MAC on LEAF1. In this instance, the path via CORE1 is used. However, this doesn’t necessarily mean that CORE1 is the active gateway. This can be quite misleading at times.
lab@LEAF1-QFX5110> show route forwarding-table family ethernet-switching destination 00:00:5e:00:01:01 extensive
Routing table: default-switch.bridge [Index 6]
Bridging domain: VLAN101.bridge [Index 3]
VPLS:
Enabled protocols: Bridging, ACKed by all peers,
Destination: 00:00:5e:00:01:01/48
Learn VLAN: 0 Route type: user
Route reference: 0 Route interface-index: 559
Multicast RPF nh index: 0
P2mpidx: 0
IFL generation: 136 Epoch: 0
Sequence Number: 1 Learn Mask: 0x4000000000000000030000000000000000000000
L2 Flags: control_dyn
Flags: sent to PFE
Nexthop:
Next-hop type: composite Index: 1747 Reference: 19
Next-hop type: indirect Index: 131075 Reference: 3
Nexthop: 10.0.66.0
Next-hop type: unicast Index: 1709 Reference: 9
Next-hop interface: ge-0/0/44.0
Step 3 – Verify PFE forwarding table for Virtual Gateway MAC
Next up, we check the PFE forwarding table on LEAF1. This output details the VTEP that has been installed to reach the VGW MAC address. Below we can see that vtep.32770 is used to reach VLAN101 VGW MAC 00:00:5e:00:01:01.
lab@LEAF1-QFX5110> request pfe execute command "show l2 manager mac-table" target fpc0
SENT: Ukern command: show l2 manager mac-table
route table name : default-switch.6
mac counters
maximum count
0 24
mac table information
mac address BD learn Entry entry hal hardware info
Index vlan Flags ifl ifl pfe mask ifl
----------------------------------------------------------------------------------
00:00:5e:00:01:01 3 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
64:64:9b:d6:68:20 3 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
80:71:1f:c6:26:f0 3 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
f8:c0:01:cb:ab:03 3 0 0x0814 ae0.0 ae0.0 0 0x1 ae0.0
f8:c0:01:cb:ab:05 3 0 0x0014 vtep.32772 vtep.32772 0 0x1 vtep.32772
f8:c0:01:cb:ab:06 3 0 0x0014 vtep.32773 vtep.32773 0 0x1 vtep.32773
00:00:5e:00:01:02 4 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
64:64:9b:d6:68:20 4 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
80:71:1f:c6:26:f0 4 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
f8:c0:01:cb:ab:03 4 0 0x0814 ae0.0 ae0.0 0 0x1 ae0.0
f8:c0:01:cb:ab:05 4 0 0x0014 vtep.32772 vtep.32772 0 0x1 vtep.32772
f8:c0:01:cb:ab:06 4 0 0x0014 vtep.32773 vtep.32773 0 0x1 vtep.32773
Note. If you’re using a QFX10K, vQFX, or an MX based device, then you will see an ESI value in place of vtep.32770.
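To pull the installed next-hop for a given MAC out of the PFE table, something like the following sketch could be used. It assumes the output has been captured as text and that each data row has nine whitespace-separated columns (MAC, BD index, learn VLAN, flags, entry ifl, hal ifl, pfe, mask, hardware ifl), which is my reading of the header above rather than a documented format. The function name pfe_next_hop is hypothetical.

```python
def pfe_next_hop(text, mac):
    """Map BD index -> entry ifl for one MAC in 'show l2 manager mac-table' text.

    Assumption: data rows have nine columns and the entry ifl is the fifth.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 9 and fields[0].lower() == mac.lower():
            result[int(fields[1])] = fields[4]
    return result


# Header plus a few rows copied from the LEAF1 output above.
SAMPLE = """\
mac address BD learn Entry entry hal hardware info
Index vlan Flags ifl ifl pfe mask ifl
00:00:5e:00:01:01 3 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
64:64:9b:d6:68:20 3 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
f8:c0:01:cb:ab:03 3 0 0x0814 ae0.0 ae0.0 0 0x1 ae0.0
"""

for bd, ifl in pfe_next_hop(SAMPLE, "00:00:5e:00:01:01").items():
    kind = "single VTEP" if ifl.startswith("vtep.") else "other"
    print(f"BD {bd}: VGW MAC points at {ifl} ({kind})")
```

On a platform that installs the ESI as the next-hop, I would expect the interface name to start with something other than vtep., so the same check could flag the difference.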
Step 4 – Verify Virtual Gateway
Lastly, we check which remote endpoint sits behind vtep.32770, the VTEP used to reach the VGW MAC 00:00:5e:00:01:01 for VLAN101 on LEAF1.
The output confirms that 10.0.255.11, which is CORE1, is used for all gateway flows.
lab@LEAF1-QFX5110> show interfaces vtep.32770
Logical interface vtep.32770 (Index 559) (SNMP ifIndex 579)
Flags: Up SNMP-Traps Encapsulation: ENET2
VXLAN Endpoint Type: Remote, VXLAN Endpoint Address: 10.0.255.11, L2 Routing Instance: default-switch, L3 Routing Instance: default
Input packets : 1310
Output packets: 3401
Protocol eth-switch, MTU: Unlimited
Flags: Trunk-Mode
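Mapping the VTEP interface back to its remote loopback can also be done with a small regular expression. This is a minimal sketch that assumes the show interfaces output is available as text in the format shown above; vtep_endpoint is a hypothetical helper name.

```python
import re

# Matches the "VXLAN Endpoint Type ... VXLAN Endpoint Address ..." line shown above.
ENDPOINT_RE = re.compile(
    r"VXLAN Endpoint Type:\s*(\w+),\s*VXLAN Endpoint Address:\s*(\d+\.\d+\.\d+\.\d+)"
)


def vtep_endpoint(text):
    """Return (endpoint_type, endpoint_address) or None if the line is absent."""
    match = ENDPOINT_RE.search(text)
    return (match.group(1), match.group(2)) if match else None


# Lines copied from the vtep.32770 output above.
SAMPLE = """\
Logical interface vtep.32770 (Index 559) (SNMP ifIndex 579)
Flags: Up SNMP-Traps Encapsulation: ENET2
VXLAN Endpoint Type: Remote, VXLAN Endpoint Address: 10.0.255.11, L2 Routing Instance: default-switch, L3 Routing Instance: default
"""

print(vtep_endpoint(SAMPLE))
```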
Based on the above output, the flow from S1-101 to S3-102 is sent from LEAF1 to CORE1 (10.0.255.11), which acts as the gateway for the flow.
Now let’s look at a scenario where the primary path to CORE1 (vtep.32770) is lost on LEAF1.
Disable ge-0/0/44 on LEAF1
{master:0}[edit]
lab@LEAF1-QFX5110# set interfaces ge-0/0/44 disable
Verify that the underlay forwarding table for the Virtual Gateway MAC has changed. Here we can see that the underlay is now using ge-0/0/46 to reach the VGW MAC.
lab@LEAF1-QFX5110> show route forwarding-table family ethernet-switching destination 00:00:5e:00:01:01 extensive
Routing table: default-switch.bridge [Index 6]
Bridging domain: VLAN101.bridge [Index 3]
VPLS:
Enabled protocols: Bridging, ACKed by all peers,
Destination: 00:00:5e:00:01:01/48
Learn VLAN: 0 Route type: user
Route reference: 0 Route interface-index: 559
Multicast RPF nh index: 0
P2mpidx: 0
IFL generation: 136 Epoch: 0
Sequence Number: 1 Learn Mask: 0x4000000000000000030000000000000000000000
L2 Flags: control_dyn
Flags: sent to PFE
Nexthop:
Next-hop type: composite Index: 1747 Reference: 19
Next-hop type: indirect Index: 131075 Reference: 3
Nexthop: 10.0.66.8
Next-hop type: unicast Index: 1756 Reference: 16
Next-hop interface: ge-0/0/46.0
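Comparing the underlay next-hop before and after the link failure is easy to automate. The sketch below assumes both forwarding-table outputs were saved as text; underlay_next_hop is a hypothetical helper, and only the two fields of interest are extracted.

```python
import re


def underlay_next_hop(text):
    """Return (next_hop_address, next_hop_interface) from forwarding-table text."""
    addr = re.search(r"Nexthop:[ \t]*(\d+\.\d+\.\d+\.\d+)", text)
    ifname = re.search(r"Next-hop interface:[ \t]*(\S+)", text)
    return (
        addr.group(1) if addr else None,
        ifname.group(1) if ifname else None,
    )


# Relevant lines copied from the two outputs above.
BEFORE = """\
Nexthop:
Next-hop type: composite Index: 1747 Reference: 19
Nexthop: 10.0.66.0
Next-hop interface: ge-0/0/44.0
"""

AFTER = """\
Nexthop:
Next-hop type: composite Index: 1747 Reference: 19
Nexthop: 10.0.66.8
Next-hop interface: ge-0/0/46.0
"""

before, after = underlay_next_hop(BEFORE), underlay_next_hop(AFTER)
if before != after:
    print(f"Underlay path changed: {before} -> {after}")
```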
Let’s now check the PFE forwarding table.
lab@LEAF1-QFX5110> request pfe execute command "show l2 manager mac-table" target fpc0
SENT: Ukern command: show l2 manager mac-table
route table name : default-switch.6
mac counters
maximum count
0 24
mac table information
mac address BD learn Entry entry hal hardware info
Index vlan Flags ifl ifl pfe mask ifl
----------------------------------------------------------------------------------
00:00:5e:00:01:01 3 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
64:64:9b:d6:68:20 3 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
80:71:1f:c6:26:f0 3 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
f8:c0:01:cb:ab:03 3 0 0x0814 ae0.0 ae0.0 0 0x1 ae0.0
f8:c0:01:cb:ab:05 3 0 0x0014 vtep.32772 vtep.32772 0 0x1 vtep.32772
f8:c0:01:cb:ab:06 3 0 0x0014 vtep.32773 vtep.32773 0 0x1 vtep.32773
00:00:5e:00:01:02 4 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
64:64:9b:d6:68:20 4 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
80:71:1f:c6:26:f0 4 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
f8:c0:01:cb:ab:03 4 0 0x0814 ae0.0 ae0.0 0 0x1 ae0.0
f8:c0:01:cb:ab:05 4 0 0x0014 vtep.32772 vtep.32772 0 0x1 vtep.32772
f8:c0:01:cb:ab:06 4 0 0x0014 vtep.32773 vtep.32773 0 0x1 vtep.32773
00:00:5e:00:01:03 5 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
Note that the VTEP used to reach the VGW MAC has not changed; vtep.32770 is still used. This means that traffic is now routed via CORE2 and across the core link to CORE1. Many of my customers are surprised by this behaviour, as it effectively introduces an element of suboptimal routing. However, generally speaking, this is not really an issue unless you are stretching layer 2 across multiple DCs, particularly when the selected gateway is located in a remote DC.
Note. Juniper has recently introduced functionality for EVPN route filtering in Junos 19.4R1. This can be used to filter remote DC VGW MACs etc.
Load-balancing Hash (Updated)
I often get asked how a Broadcom based QFX switch selects a VTEP for the gateway. Junos uses a hashing algorithm, based on a number of variables, to install a next-hop gateway on a per-VNI basis. I’m not sure if I can share the specific hashing detail here, due to NDA, but I’m checking.
I added a few more VNIs to the setup and observed some interesting behaviour. Initially, all the new VNIs selected the other Core for GW. I then rebooted the leaf switch following which all VNIs selected a single GW. I need to spend a little more time looking at the hashing algorithm to better understand what’s happening.
Before reboot:
lab@LEAF1-QFX5110> request pfe execute command "show l2 manager mac-table" target fpc0 | match 00:00:5e
00:00:5e:00:01:01 2 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
00:00:5e:00:01:02 3 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
00:00:5e:00:01:03 4 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
00:00:5e:00:01:04 5 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
00:00:5e:00:02:09 7 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:01 8 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:02 9 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:03 10 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:04 11 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:05 12 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:06 13 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:07 14 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:08 15 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
After reboot:
lab@LEAF1-QFX5110> request pfe execute command "show l2 manager mac-table" target fpc0 | match 00:00:5e
00:00:5e:00:01:01 2 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:01:02 3 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:01:03 4 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:01:04 5 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:09 7 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:01 8 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:02 9 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:03 10 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:04 11 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:05 12 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:06 13 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:07 14 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:08 15 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
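To put a number on the balance, the filtered PFE output can be counted per VTEP. This is a minimal sketch, assuming the same nine-column row layout as above and that all virtual gateway MACs share the 00:00:5e prefix used in this lab; gateway_distribution is a hypothetical helper, and the sample is a shortened stand-in for the full output.

```python
from collections import Counter


def gateway_distribution(text, prefix="00:00:5e"):
    """Count how many gateway MACs are installed against each entry ifl."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Assumption: nine columns per data row, entry ifl in the fifth column.
        if len(fields) == 9 and fields[0].lower().startswith(prefix):
            counts[fields[4]] += 1
    return counts


# Shortened sample in the same format as the "Before reboot" output.
SAMPLE = """\
00:00:5e:00:01:01 2 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
00:00:5e:00:01:02 3 0 0x0014 vtep.32770 vtep.32770 0 0x1 vtep.32770
00:00:5e:00:02:01 8 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:02 9 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
00:00:5e:00:02:03 10 0 0x0014 vtep.32769 vtep.32769 0 0x1 vtep.32769
"""

dist = gateway_distribution(SAMPLE)
total = sum(dist.values())
for ifl, count in dist.most_common():
    print(f"{ifl}: {count} of {total} gateway MACs")
```

Running this against the output before and after a reboot would make a shift like the one above, where every VNI ends up on a single gateway, stand out straight away.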
Summary
Whilst Broadcom based QFX switches can currently only install a single VTEP for gateway next-hop, this is on a per-VNI basis and you should expect to achieve a relatively good balance on flows. The gateway VTEP will not be impacted or changed in the event of an underlay condition such as a link failure, providing there is still a path to the GW VTEP. To change this behaviour you would need to ensure the GW VTEP loopback is removed from the underlay. This would force the overlay to drop and the GW would move to the other Core/Spine.
Load-balancing is done on a per-VNI basis (in my testing with QFX5200 as well). It is random whether spine 1 or 2 is selected for the first VTEP in the forwarding-table. Apologies if this is known and you were meaning some other hash mechanism :)!
Source: nce-156-evpn-virtual-gateway-vxlan.pdf
“NOTE: The Junos OS version used on the [QFX5100] devices in this configuration example load-balances anycast gateways per VNI. For a given VNI, the switch forwards traffic to a single VTEP”
Yeah I’ve seen that mentioned a lot but I only ever see the same VTEP selected for all VNIs. I’ve tried to find out the hashing algorithm but no luck so far. Perhaps I should scale the lab to a few hundred VNIs to be sure.
Ahh my lab is broken until I can get into the office – here is an old screenshot for validation:
https://imgur.com/a/6sJPVgN
The behaviour seen in the forwarding table alternates between every VNI. Perhaps the behaviour differs on the QFX5200 due to its Tomahawk chipset. Worth checking!
I’ve just added a bunch more VNIs to my setup and seen some interesting behaviour. All the new VLANs that I added have installed the second core for gateway. So the balance is around 70/30. Someone has sent me the hash info now so I’m going to do some more testing to see how that can be manipulated. Thanks for your input btw
Weird, different behaviour on QFX5200! Must be the difference in chipset/code. Maybe also a difference in the underlay protocol, as I had issues with eBGP vs OSPF, despite the fact that it shouldn’t really make a difference what is used in the underlay! Thanks for the interesting blog post, it was a good read
This is a great post, but it would be even better if you could also cover the distributed anycast gateway scenario (ToR gateway, same IP and same MAC)
Yes, perhaps this is something I could look at. Which TOR switch? QFX5110?
Definitely the 5120. If I remember correctly the 5100 and 5110 can’t do symmetric IRB
I should have some Trident 3 based hardware arriving soon for a new customer project. Once it arrives I’ll be able to get it in the lab for testing