ECMP bug and workaround

ECMP (equal-cost multipath routing) does not work for locally generated packets when the gateways are reached through separate interfaces and the router's own external addresses are not masqueraded.

Example scenario:
There are 2 gateways: 192.168.16.1 and 192.168.17.1.

  • Interface1: 192.168.16.2/24
  • Interface2: 192.168.17.2/24
  • Interface0: 192.168.0.1/24
  • Default route has multiple gateways: 192.168.16.1,192.168.17.1
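
A minimal sketch of this setup in RouterOS CLI syntax, assuming the interfaces are literally named Interface1, Interface2 and Interface0 as in the list above (on a real box they would have names such as ether1):

```
# addresses from the example scenario
/ip address add address=192.168.16.2/24 interface=Interface1
/ip address add address=192.168.17.2/24 interface=Interface2
/ip address add address=192.168.0.1/24 interface=Interface0
# ECMP default route: a single route entry listing both gateways
/ip route add dst-address=0.0.0.0/0 gateway=192.168.16.1,192.168.17.1
```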

Both gateways work when used separately, but when I configure the default route with multiple gateways, some things stop working:

  • DNS stops working (RouterOS can no longer resolve names), and telnet also fails (the /system telnet command cannot connect from the box).
  • On the other hand, ping and traceroute still work.
  • Traffic routed through the RouterOS box, but not generated by it, works.

After examining the traffic with torch and with a sniffer, I noticed that it was using the wrong IP address and interface.
That is, it was sending packets out of Interface2 with source address 192.168.16.2, which belongs to Interface1.
Moreover, when this happens connections are not balanced: they are almost always sent through the same interface with the same (wrong) source address.
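
The check described above could be reproduced roughly as follows. This is only a sketch: it assumes the interface is named Interface2 as in the scenario, and the exact torch parameters may differ between RouterOS versions.

```
# watch Interface2 for packets that carry Interface1's address as source
/tool torch interface=Interface2 src-address=192.168.16.2/32
```

If entries show up here while the router itself is resolving names or opening a telnet session, locally generated packets are leaving with the wrong source address.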

The workaround:
Masquerade either on the outgoing interfaces or on the external IP addresses:

  • option 1: out-interface=Interface1 action=masquerade, out-interface=Interface2 action=masquerade
  • option 2: src-address=192.168.16.2 action=masquerade, src-address=192.168.17.2 action=masquerade
    (I was masquerading only the client range, src-address=192.168.0.0/24, as set up by hotspot, but that is not enough.)
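
Written out as NAT rules, option 1 might look like the sketch below. The chain name srcnat and the interface names are assumptions taken from the scenario, not from a tested configuration. Option 2 is shown only as comments, written with the router's own addresses from the scenario (192.168.16.2 and 192.168.17.2), which is also an assumption.

```
# option 1: masquerade everything leaving through either uplink
/ip firewall nat add chain=srcnat out-interface=Interface1 action=masquerade
/ip firewall nat add chain=srcnat out-interface=Interface2 action=masquerade

# option 2 (alternative): masquerade packets sourced from the router's own external addresses
# /ip firewall nat add chain=srcnat src-address=192.168.16.2 action=masquerade
# /ip firewall nat add chain=srcnat src-address=192.168.17.2 action=masquerade
```

With either option, a locally generated packet that leaves through the "other" interface has its source address rewritten to match that interface, so replies return on the correct link.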

The example in the reference manual might not work, depending on how you masquerade (the example does not specify it).
This has been tested on RouterOS 2.9.51 and 3.14.

I also had this problem. Connections that go through the router also get TCP RST packets a lot of the time, causing winbox/ssh/telnet sessions to routers behind it to disconnect (sometimes a session cannot stay connected for more than 30 seconds). So I made sure OSPF has only one route to each destination by changing the path cost. Problem solved.
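
A rough sketch of that kind of change, assuming the backup link is the hypothetical Interface2 and that a cost of 20 is higher than the cost on the preferred link (the comment above does not say which interface or value was used):

```
# raise the OSPF cost on one link so the two paths are no longer equal-cost
/routing ospf interface add interface=Interface2 cost=20
```

With unequal costs OSPF installs only the cheaper path, so no ECMP route is created for that destination.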

Trying to load balance with Nth fails too; I’m guessing the root of the problem is the same. It works fine for a few packets, then starts dropping half of them.