EVPN-VXLAN | Virtual Gateway | QFX5k Forwarding | JUNOS

In this post, I want to discuss how to verify Virtual Gateway forwarding behaviour on Broadcom-based Juniper QFX switches.

The general assumption with EVPN Anycast Gateway is that gateway flows are load-balanced across all gateway devices. Whilst EVPN provides the mechanism to support this behaviour, the forwarding hardware must also support it.

For an EVPN device to load-balance gateway flows, it installs the virtual gateway ESI as the next-hop for the virtual gateway MAC address. However, Broadcom-based QFX switches do not support this behaviour and can only install a single VTEP as the next-hop. This means that traffic heading towards the virtual gateway only ever traverses a single gateway device. This behaviour is well documented, and there is some talk of Broadcom working with vendors to improve gateway load-balancing with ESI functionality.
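
As a rough illustration of the difference, here is a minimal Python sketch of the two behaviours. It is a toy model only: the CRC32 hash, the flow tuples and the function names are hypothetical and are not the algorithm used by Junos or the Broadcom ASIC. Treating 10.0.255.12 as CORE2 is also an assumption.

```python
import zlib

# Gateway VTEP loopbacks. 10.0.255.11 is CORE1 in the lab output; treating
# 10.0.255.12 as CORE2 is an assumption made for this sketch.
GATEWAY_VTEPS = ["10.0.255.11", "10.0.255.12"]


def next_hop_with_esi(flow, vteps):
    """ESI next-hop: each flow is hashed across all VTEPs behind the ESI.

    CRC32 is a stand-in hash, not the one real hardware uses.
    """
    key = "|".join(str(field) for field in flow).encode()
    return vteps[zlib.crc32(key) % len(vteps)]


def next_hop_single_vtep(flow, installed_vtep):
    """Single-VTEP next-hop: every flow follows the one installed VTEP."""
    return installed_vtep


# Hypothetical flows (src IP, dst IP, protocol, src port, dst port).
flows = [("10.1.101.10", "10.1.102.10", 6, 40000 + i, 443) for i in range(64)]

esi_gateways = {next_hop_with_esi(f, GATEWAY_VTEPS) for f in flows}
single_gateways = {next_hop_single_vtep(f, GATEWAY_VTEPS[0]) for f in flows}

print("ESI next-hop reaches:", sorted(esi_gateways))
print("Single VTEP reaches: ", sorted(single_gateways))
```

With a single installed VTEP the set of gateways reached can only ever contain one address, regardless of how many flows there are.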

Now that we understand the characteristics, let’s look at the steps to verify forwarding behaviour on a Broadcom-based QFX switch. Here we’ll look at how to identify which VTEP is being used to reach the virtual gateway MAC address and how the underlay is transporting the traffic.

Lab Setup

The lab setup is a typical EVPN-VXLAN fabric with central routing on Core1 and Core2. The leaf switches are a mix of QFX5100 (Trident II) and QFX5110 (Trident II+) devices. I’ve used physical hardware for this lab as the behaviour of the vQFX is different. The vQFX emulates the Q5 PFE, thus supporting ESI for forwarding.

LAB EVPN SETUP

Step 1 – Identify Virtual Gateway MAC address

It’s good practice to manually set the virtual gateway MAC address and ESI for the IRB gateway. This makes verification and troubleshooting easier as you can quickly identify the values.

In this example, we will study a flow from S1-101, via LEAF1, to S3-102.

The Ethernet switching table on LEAF1 shows the VGW MAC is 00:00:5e:00:01:01 for VLAN101. We can also see the ESI value is 00:00:00:66:66:00:00:00:01:01.

lab@LEAF1-QFX5110> show ethernet-switching table vlan-id 101    

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC, O - ovsdb MAC)


Ethernet switching table : 6 entries, 6 learned
Routing instance : default-switch
   Vlan                MAC                 MAC      Logical                Active
   name                address             flags    interface              source
   VLAN101             00:00:5e:00:01:01   DR       esi.1775               00:00:00:66:66:00:00:00:01:01 
   VLAN101             64:64:9b:d6:68:20   D        vtep.32770             10.0.255.11                   
   VLAN101             80:71:1f:c6:26:f0   D        vtep.32769             10.0.255.12                   
   VLAN101             f8:c0:01:cb:ab:03   DL       ae0.0                
   VLAN101             f8:c0:01:cb:ab:05   D        vtep.32772             10.0.255.3                    
   VLAN101             f8:c0:01:cb:ab:06   D        vtep.32773             10.0.255.4  
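
If you capture this output to a file, a small script can pick out the virtual gateway entry for you. The following is a minimal Python sketch, assuming the column layout shown above; the function names are hypothetical and the sample rows are copied from the LEAF1 output.

```python
import re

# Rows copied from the LEAF1 output above (trimmed to three entries).
SAMPLE = """\
   VLAN101             00:00:5e:00:01:01   DR       esi.1775               00:00:00:66:66:00:00:00:01:01
   VLAN101             64:64:9b:d6:68:20   D        vtep.32770             10.0.255.11
   VLAN101             f8:c0:01:cb:ab:03   DL       ae0.0
"""

MAC_RE = re.compile(r"^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$")


def parse_switching_table(text):
    """Return one dict per MAC row, skipping headers and blank lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not MAC_RE.match(fields[1]):
            continue
        entries.append({
            "vlan": fields[0],
            "mac": fields[1],
            "flags": fields[2],
            "interface": fields[3],
            # Locally learned MACs have no active source column.
            "source": fields[4] if len(fields) > 4 else None,
        })
    return entries


def find_virtual_gateways(entries):
    """Virtual gateway MACs are the ones pointing at an esi.* interface."""
    return [e for e in entries if e["interface"].startswith("esi.")]


for gw in find_virtual_gateways(parse_switching_table(SAMPLE)):
    print(gw["vlan"], gw["mac"], gw["source"])
```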

Step 2 – Verify underlay forwarding table for Virtual Gateway MAC

The extensive output details the forwarding path used to reach the VGW MAC on LEAF1. In this instance, the path via CORE1 is used. However, this doesn’t necessarily mean that CORE1 is the active gateway, which can be quite misleading at times.

lab@LEAF1-QFX5110> show route forwarding-table family ethernet-switching destination 00:00:5e:00:01:01 extensive    
Routing table: default-switch.bridge [Index 6] 
Bridging domain: VLAN101.bridge [Index 3] 
VPLS:
Enabled protocols: Bridging, ACKed by all peers, 
    
Destination:  00:00:5e:00:01:01/48
  Learn VLAN: 0                        Route type: user                  
  Route reference: 0                   Route interface-index: 559 
  Multicast RPF nh index: 0         
  P2mpidx: 0              
  IFL generation: 136                  Epoch: 0   
  Sequence Number: 1                   Learn Mask: 0x4000000000000000030000000000000000000000
  L2 Flags: control_dyn
  Flags: sent to PFE
  Nexthop:  
  Next-hop type: composite             Index: 1747     Reference: 19   
  Next-hop type: indirect              Index: 131075   Reference: 3    
  Nexthop: 10.0.66.0
  Next-hop type: unicast               Index: 1709     Reference: 9    
  Next-hop interface: ge-0/0/44.0

Step 3 – Verify PFE forwarding table for Virtual Gateway MAC

Next up, we check the PFE forwarding table on LEAF1. This output details the VTEP that has been installed to reach the VGW MAC address. Below we can see that vtep.32770 is used to reach VLAN101 VGW MAC 00:00:5e:00:01:01.

lab@LEAF1-QFX5110> request pfe execute command "show l2 manager mac-table" target fpc0 
SENT: Ukern command: show l2 manager mac-table

route table name   : default-switch.6
  mac counters
    maximum   count
    0           24
  mac table information
  mac address       BD     learn  Entry  entry      hal        hardware info
                    Index  vlan   Flags  ifl        ifl        pfe  mask  ifl
  ----------------------------------------------------------------------------------
  00:00:5e:00:01:01  3     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  64:64:9b:d6:68:20  3     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  80:71:1f:c6:26:f0  3     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  f8:c0:01:cb:ab:03  3     0     0x0814 ae0.0      ae0.0       0   0x1  ae0.0
  f8:c0:01:cb:ab:05  3     0     0x0014 vtep.32772 vtep.32772  0   0x1  vtep.32772
  f8:c0:01:cb:ab:06  3     0     0x0014 vtep.32773 vtep.32773  0   0x1  vtep.32773
  00:00:5e:00:01:02  4     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  64:64:9b:d6:68:20  4     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  80:71:1f:c6:26:f0  4     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  f8:c0:01:cb:ab:03  4     0     0x0814 ae0.0      ae0.0       0   0x1  ae0.0
  f8:c0:01:cb:ab:05  4     0     0x0014 vtep.32772 vtep.32772  0   0x1  vtep.32772
  f8:c0:01:cb:ab:06  4     0     0x0014 vtep.32773 vtep.32773  0   0x1  vtep.32773

Note. If you’re using a QFX10K, vQFX, or an MX based device, then you will see an ESI value in place of vtep.32770.
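
To pull just the gateway entries out of a saved copy of the PFE MAC table, something like the following Python sketch could be used. It assumes the nine-column row layout shown above and treats any MAC starting with 00:00:5e: as a virtual gateway MAC, which holds for this lab but is an assumption elsewhere; the function name is hypothetical.

```python
# Rows copied from the PFE MAC table above (trimmed).
SAMPLE = """\
  00:00:5e:00:01:01  3     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  64:64:9b:d6:68:20  3     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  f8:c0:01:cb:ab:03  3     0     0x0814 ae0.0      ae0.0       0   0x1  ae0.0
  00:00:5e:00:01:02  4     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
"""


def gateway_next_hops(text, mac_prefix="00:00:5e:"):
    """Map (BD index, gateway MAC) to the interface programmed in hardware."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly nine whitespace-separated columns.
        if len(fields) != 9 or not fields[0].startswith(mac_prefix):
            continue
        mac, bd_index, hardware_ifl = fields[0], int(fields[1]), fields[8]
        result[(bd_index, mac)] = hardware_ifl
    return result


for (bd_index, mac), ifl in sorted(gateway_next_hops(SAMPLE).items()):
    print(f"BD {bd_index}: {mac} -> {ifl}")
```

On a platform that installs the ESI, the last column would not start with vtep., so the same function also makes it easy to spot which behaviour you are dealing with.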

Step 4 – Verify Virtual Gateway

Lastly, we check which remote endpoint sits behind vtep.32770, the VTEP used to reach the VGW MAC 00:00:5e:00:01:01 for VLAN101 on LEAF1.

The output confirms that 10.0.255.11, which is CORE1, is used for all gateway flows.

lab@LEAF1-QFX5110> show interfaces vtep.32770 
  Logical interface vtep.32770 (Index 559) (SNMP ifIndex 579)
    Flags: Up SNMP-Traps Encapsulation: ENET2
    VXLAN Endpoint Type: Remote, VXLAN Endpoint Address: 10.0.255.11, L2 Routing Instance: default-switch, L3 Routing Instance: default
    Input packets : 1310
    Output packets: 3401
    Protocol eth-switch, MTU: Unlimited
      Flags: Trunk-Mode
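
The VTEP-to-loopback mapping can also be extracted programmatically. This is a minimal Python sketch, assuming the output format shown above; the function name is hypothetical.

```python
import re

# Output copied from LEAF1 above (trimmed).
SAMPLE = """\
  Logical interface vtep.32770 (Index 559) (SNMP ifIndex 579)
    Flags: Up SNMP-Traps Encapsulation: ENET2
    VXLAN Endpoint Type: Remote, VXLAN Endpoint Address: 10.0.255.11, L2 Routing Instance: default-switch, L3 Routing Instance: default
"""


def vtep_endpoint(text):
    """Return (vtep name, remote endpoint address), or None if not found."""
    name = re.search(r"Logical interface (vtep\.\d+)", text)
    addr = re.search(r"VXLAN Endpoint Address: (\d+\.\d+\.\d+\.\d+)", text)
    if not name or not addr:
        return None
    return name.group(1), addr.group(1)


print(vtep_endpoint(SAMPLE))
```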

Based on the above output, the flow from S1-101 to S3-102 looks like this:

PRIMARY FLOW

Now let’s look at a scenario where the primary path to CORE1 (VTEP.32770) is lost on LEAF1.

Disable ge-0/0/44 on LEAF1

{master:0}[edit]
lab@LEAF1-QFX5110# set interfaces ge-0/0/44 disable 

Verify that the underlay forwarding table entry for the Virtual Gateway MAC has changed. Here we can see that the underlay is now using ge-0/0/46 to reach the VGW MAC.

lab@LEAF1-QFX5110> show route forwarding-table family ethernet-switching destination 00:00:5e:00:01:01 extensive    
Routing table: default-switch.bridge [Index 6] 
Bridging domain: VLAN101.bridge [Index 3] 
VPLS:
Enabled protocols: Bridging, ACKed by all peers, 
    
Destination:  00:00:5e:00:01:01/48
  Learn VLAN: 0                        Route type: user                  
  Route reference: 0                   Route interface-index: 559 
  Multicast RPF nh index: 0         
  P2mpidx: 0              
  IFL generation: 136                  Epoch: 0   
  Sequence Number: 1                   Learn Mask: 0x4000000000000000030000000000000000000000
  L2 Flags: control_dyn
  Flags: sent to PFE
  Nexthop:  
  Next-hop type: composite             Index: 1747     Reference: 19   
  Next-hop type: indirect              Index: 131075   Reference: 3    
  Nexthop: 10.0.66.8
  Next-hop type: unicast               Index: 1756     Reference: 16   
  Next-hop interface: ge-0/0/46.0  
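
To compare the underlay next-hop before and after the failure, the relevant lines can be extracted with a short script. This is a minimal Python sketch, assuming the "extensive" output format shown above; the function name is hypothetical and the samples are trimmed copies of the two outputs.

```python
BEFORE = """\
  Nexthop:
  Next-hop type: composite             Index: 1747     Reference: 19
  Next-hop type: indirect              Index: 131075   Reference: 3
  Nexthop: 10.0.66.0
  Next-hop type: unicast               Index: 1709     Reference: 9
  Next-hop interface: ge-0/0/44.0
"""

AFTER = """\
  Nexthop:
  Next-hop type: composite             Index: 1747     Reference: 19
  Next-hop type: indirect              Index: 131075   Reference: 3
  Nexthop: 10.0.66.8
  Next-hop type: unicast               Index: 1756     Reference: 16
  Next-hop interface: ge-0/0/46.0
"""


def underlay_next_hop(text):
    """Return (next-hop address, egress interface) from the extensive output."""
    address = interface = None
    for line in text.splitlines():
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if key == "Nexthop" and value:  # skip the empty "Nexthop:" line
            address = value
        elif key == "Next-hop interface":
            interface = value
    return address, interface


print("before:", underlay_next_hop(BEFORE))
print("after: ", underlay_next_hop(AFTER))
```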

Let’s now check the PFE forwarding table.

lab@LEAF1-QFX5110> request pfe execute command "show l2 manager mac-table" target fpc0                              
SENT: Ukern command: show l2 manager mac-table

route table name   : default-switch.6
  mac counters
    maximum   count
    0           24
  mac table information
  mac address       BD     learn  Entry  entry      hal        hardware info
                    Index  vlan   Flags  ifl        ifl        pfe  mask  ifl
  ----------------------------------------------------------------------------------
  00:00:5e:00:01:01  3     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  64:64:9b:d6:68:20  3     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  80:71:1f:c6:26:f0  3     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  f8:c0:01:cb:ab:03  3     0     0x0814 ae0.0      ae0.0       0   0x1  ae0.0
  f8:c0:01:cb:ab:05  3     0     0x0014 vtep.32772 vtep.32772  0   0x1  vtep.32772
  f8:c0:01:cb:ab:06  3     0     0x0014 vtep.32773 vtep.32773  0   0x1  vtep.32773
  00:00:5e:00:01:02  4     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  64:64:9b:d6:68:20  4     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  80:71:1f:c6:26:f0  4     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  f8:c0:01:cb:ab:03  4     0     0x0814 ae0.0      ae0.0       0   0x1  ae0.0
  f8:c0:01:cb:ab:05  4     0     0x0014 vtep.32772 vtep.32772  0   0x1  vtep.32772
  f8:c0:01:cb:ab:06  4     0     0x0014 vtep.32773 vtep.32773  0   0x1  vtep.32773
  00:00:5e:00:01:03  5     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770

Note that the VTEP used to reach the VGW MAC has not changed: vtep.32770 is still used. This means that traffic is now routed via CORE2 and across the core link to CORE1. Many of my customers are surprised by this behaviour as it effectively introduces an element of suboptimal routing. However, generally speaking, this is not really an issue unless you are stretching layer 2 across multiple DCs, particularly when the selected gateway is located in a remote DC.

Note. Juniper has recently introduced functionality for EVPN route filtering in JUNOS 19.4R1, which can be used to filter remote DC VGW MACs, etc.

SECONDARY FLOW

Load-balancing Hash (Updated)

I often get asked how a Broadcom-based QFX switch selects a VTEP as the gateway. Junos uses a hashing algorithm, based on a number of variables, to install a next-hop gateway on a per-VNI basis. I’m not sure if I can share the specific hashing detail here, due to NDA, but I’m checking.

I added a few more VNIs to the setup and observed some interesting behaviour. Initially, all the new VNIs selected the other Core for GW. I then rebooted the leaf switch following which all VNIs selected a single GW. I need to spend a little more time looking at the hashing algorithm to better understand what’s happening.

Before reboot:

lab@LEAF1-QFX5110> request pfe execute command "show l2 manager mac-table" target fpc0 | match 00:00:5e    
  00:00:5e:00:01:01  2     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  00:00:5e:00:01:02  3     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  00:00:5e:00:01:03  4     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  00:00:5e:00:01:04  5     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  00:00:5e:00:02:09  7     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:01  8     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:02  9     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:03  10    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:04  11    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:05  12    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:06  13    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:07  14    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:08  15    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769

After reboot:

lab@LEAF1-QFX5110> request pfe execute command "show l2 manager mac-table" target fpc0 | match 00:00:5e    
  00:00:5e:00:01:01  2     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:01:02  3     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:01:03  4     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:01:04  5     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:09  7     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:01  8     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:02  9     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:03  10    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:04  11    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:05  12    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:06  13    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:07  14    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:08  15    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
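
To quantify the balance rather than eyeball it, the gateway rows can be tallied per VTEP. This is a minimal Python sketch using the "before reboot" rows above; the function name is hypothetical and the nine-column layout is assumed.

```python
from collections import Counter

# "Before reboot" rows copied from the output above.
BEFORE_REBOOT = """\
  00:00:5e:00:01:01  2     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  00:00:5e:00:01:02  3     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  00:00:5e:00:01:03  4     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  00:00:5e:00:01:04  5     0     0x0014 vtep.32770 vtep.32770  0   0x1  vtep.32770
  00:00:5e:00:02:09  7     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:01  8     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:02  9     0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:03  10    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:04  11    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:05  12    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:06  13    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:07  14    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
  00:00:5e:00:02:08  15    0     0x0014 vtep.32769 vtep.32769  0   0x1  vtep.32769
"""


def gateway_distribution(text, mac_prefix="00:00:5e:"):
    """Count how many gateway MACs are programmed against each VTEP."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 9 and fields[0].startswith(mac_prefix):
            counts[fields[8]] += 1
    return counts


counts = gateway_distribution(BEFORE_REBOOT)
total = sum(counts.values())
for vtep, n in sorted(counts.items()):
    print(f"{vtep}: {n} of {total} gateway MACs ({n / total:.0%})")
```

Running the same function against the "after reboot" output would show every gateway MAC on a single VTEP.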

Summary

Whilst Broadcom-based QFX switches can currently only install a single VTEP as the gateway next-hop, this is on a per-VNI basis and you should expect to achieve a relatively good balance of flows. The gateway VTEP will not change in the event of an underlay condition such as a link failure, provided there is still a path to the GW VTEP. To change this behaviour you would need to ensure the GW VTEP loopback is removed from the underlay. This would force the overlay to drop and the GW would move to the other Core/Spine.

9 thoughts on “EVPN-VXLAN | Virtual Gateway | QFX5k Forwarding | JUNOS”

  1. Load-balancing is done on a per-VNI basis (in my testing with QFX5200 as well). It is random whether spine 1 or 2 is selected for the first VTEP in the forwarding table. Apologies if this is known and you meant some other hash mechanism :)!

    Click to access nce-156-evpn-virtual-gateway-vxlan.pdf

    “NOTE:The Junos OS version used on the [QFX5100] devices in this configuration example load-balances anycast gateways per VNI. For a given VNI, the switch forwards traffic to a single VTEP”

    1. Yeah, I’ve seen that mentioned a lot but I only ever see the same VTEP selected for all VNIs. I’ve tried to find out the hashing algorithm but no luck so far. Perhaps I should scale the lab to a few hundred VNIs to be sure.

  2. Ahh my lab is broken until I can get into the office – here is an old screenshot for validation:
    https://imgur.com/a/6sJPVgN

    The behaviour seen in the forwarding table alternates between every VNI. Perhaps the behaviour differs on the QFX5200 due to its Tomahawk chipset. Worth checking!

    1. I’ve just added a bunch more VNIs to my setup and seen some interesting behaviour. All the new VLANs that I added have installed the second core as the gateway. So the balance is around 70/30. Someone has sent me the hash info now so I’m going to do some more testing to see how that can be manipulated. Thanks for your input btw 🙂

      1. Weird, different behaviour on QFX5200! Must be the difference in chipset/code. Maybe also a difference in the underlay protocol, as I had issues with eBGP vs OSPF, despite the fact that it shouldn’t really make a difference what is used in the underlay! Thanks for the interesting blog post, it was a good read 👍

  3. This is a great post, but it would be great if you could also cover the distributed anycast gateway scenario (ToR gateway, same IP and same MAC).

      1. I should have some Trident 3 based hardware arriving soon for a new customer project. Once it arrives I’ll be able to get it in the lab for testing 🙂
