MaxwellSilver
just joined
Topic Author
Posts: 15
Joined: Thu Oct 07, 2021 10:11 pm

OSPF, Wireguard, and multiple path problem

Tue Jun 06, 2023 12:19 am

Problem description:

Three remote sites are connected to Main office. Site1 and site3 connect to the Main Office via Wireguard. Site2 connects to Site1 and Site3 via radio links and has no direct link to the public internet. This is a ring topology in a remote area; multiple pathways to/from site are a design requirement to make sure data flows with maximum uptime.

Routing is done via OSPF. The link between Site1 and Main Office is fiber so OSPF cost=10. The Link between Site3 and Main Office is via cell modem so OSPF cost=500 to weigh it less favorably.

OSPF works as desired if a radio link between Site2 and one of the adjacent sites fails; meaning the alternative Wireguard path via WG-site3 becomes active with Site2 and/or Site3 remaining accessible via that link.

The problem occurs when the link between Site1 and the Main Office fails. The route via WG-site3 takes over for ~15 seconds and then the route via WG-site1 takes its place in the routing table for ~35 seconds before reverting once again to WG-site3. This pattern continues indefinitely, causing an unusable flapping situation.

Once the route via WG-site3 becomes active, WG-site1 believes it has re-established its connection since the remote side of that Wireguard connection is available via the route through WG-site3. This does not occur when using IPsec for VPN. I believe the issue hinges on the fact that Wireguard interfaces are considered to be in a running state regardless of whether there is actually a viable session with its peer.

It seems like there should be a workable solution, but as of yet, it has escaped me. I would like to continue using Wireguard since the performance is much better, especially over poor connections like cell modems. Configuration is included below. Thanks in advance for your thoughts and suggestions.
Main Office router config:

/interface wireguard add listen-port=13230 mtu=1420 name=WG-site1
/interface wireguard add listen-port=13229 mtu=1420 name=WG-site3
/interface wireguard peers add allowed-address=0.0.0.0/0 comment=site1 endpoint-port=13230 interface=WG-site1 persistent-keepalive=30s public-key=""
/interface wireguard peers add allowed-address=0.0.0.0/0 comment=site3 endpoint-port=13229 interface=WG-site3 persistent-keepalive=30s public-key=""

/routing ospf interface-template add area=backbone cost=10  disabled=no interfaces=WG-site1 networks=10.10.128.0/30 type=ptp
/routing ospf interface-template add area=backbone cost=500 disabled=no interfaces=WG-site3 networks=10.10.128.4/30 type=ptp

/ip address add address=10.10.128.1/30 interface=WG-site1 network=10.10.128.0/30
/ip address add address=10.10.128.5/30 interface=WG-site3 network=10.10.128.4/30
Site1 router config:
/interface wireguard add comment="WG to Main Office" listen-port=13230 mtu=1420 name=wireguard1
/interface wireguard peers add allowed-address=0.0.0.0/0 endpoint-address=192.168.10.1 endpoint-port=13230 interface=wireguard1 persistent-keepalive=30s public-key=""

/routing ospf interface-template add area=backbone-v2 cost=10 disabled=no interfaces=wireguard1 networks=10.10.128.0/30 type=ptp
/ip address add address=10.10.128.2/30 interface=wireguard1 network=10.10.128.0/30
Site3 router config:
/interface wireguard add comment="WG to Main Office" listen-port=13229 mtu=1420 name=wireguard1
/interface wireguard peers add allowed-address=0.0.0.0/0 endpoint-address=192.168.20.1 endpoint-port=13229 interface=wireguard1 persistent-keepalive=30s public-key=""

/routing ospf interface-template add area=backbone-v2 cost=500 disabled=no interfaces=wireguard1 networks=10.10.128.4/30 type=ptp
/ip address add address=10.10.128.6/30 interface=wireguard1 network=10.10.128.4/30
 
wiseroute
Member
Posts: 352
Joined: Sun Feb 05, 2023 11:06 am

Re: OSPF, Wireguard, and multiple path problem

Tue Jun 06, 2023 5:10 am

hello Maxwell,
MaxwellSilver wrote:
I believe the issue hinges on the fact that Wireguard interfaces are considered to be in a running state regardless of whether there is actually a viable session with its peer.

maybe you could try lowering the persistent-keepalive value between the peers:
/interface wireguard peers add allowed-address=0.0.0.0/0 comment=site1 endpoint-port=13230 interface=WG-site1 persistent-keepalive=30s public-key=""
lower it to a value that fits your ospf requirements.

and if that doesn't work well, as a workaround you could force ospf to change its interface members, i.e. not wait for the interface state to flap. you can trigger a netwatch script to enable/disable that specific wireguard interface in the ospf instance.

example:

when netwatch senses that the remote gateway doesn't respond to its pings, automatically take the wg interface out of ospf, and vice versa. just to force a new route.
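
A minimal sketch of that Netwatch idea for Site1, assuming RouterOS v7 syntax. The probe host (here the Main Office endpoint 192.168.10.1 taken from the Site1 config) and the comment tag wg-main are assumptions, and the OSPF interface-template for wireguard1 is assumed to have been given comment=wg-main beforehand. The probe only helps if the chosen host is unreachable over the backup path; if it answers over both paths, Netwatch never goes down.

```
# Hypothetical sketch, not taken from the thread.
# Assumes the wireguard1 OSPF template already carries comment=wg-main.
/tool netwatch add host=192.168.10.1 interval=10s timeout=2s \
    down-script="/routing ospf interface-template disable [find comment=wg-main]" \
    up-script="/routing ospf interface-template enable [find comment=wg-main]"
```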

+++ edit

another option
is to shorten the ospf hello interval on those specific links, to force ospf to reconverge faster.
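
A sketch of what shorter timers could look like on the Main Office router, assuming RouterOS v7, where hello-interval and dead-interval are interface-template properties. The values 2s and 8s are illustrative only, and this line would take the place of the original WG-site1 template. The peer router needs the same two values, since OSPF neighbors with mismatched hello or dead intervals do not form an adjacency.

```
# Illustrative timer values; the defaults are 10s hello and 40s dead.
/routing ospf interface-template add area=backbone cost=10 disabled=no interfaces=WG-site1 networks=10.10.128.0/30 type=ptp hello-interval=2s dead-interval=8s
```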

have a lab first, good luck 👍🏻
 
MaxwellSilver
just joined
Topic Author
Posts: 15
Joined: Thu Oct 07, 2021 10:11 pm

Re: OSPF, Wireguard, and multiple path problem

Tue Jun 06, 2023 7:57 pm

Hi Wiseroute, Thanks for the suggestions. I have tried removing the keep alive value completely, but that causes the sessions to drop completely. I will try a few settings in between to see if that remedies the problem.

The network described in this post is a lab network so I can experiment without negative consequences. The lab is, however, emulating a real world issue that we are experiencing so it will be good to have this issue resolved.

I'm not great at scripting with RouterOS, but I'll try that if massaging the keep-alive settings isn't fruitful.


I'm open to other suggestions and also curious if other MikroTik network folks have seen similar issues.
 
MaxwellSilver
just joined
Topic Author
Posts: 15
Joined: Thu Oct 07, 2021 10:11 pm

Re: OSPF, Wireguard, and multiple path problem

Thu Jun 08, 2023 11:57 pm

It was worth trying but lowering and/or removing keep alive times on the Wireguard peers didn't make any difference.

I've also tried via firewall rules to explicitly drop traffic from/to a wireguard network from using a non-wireguard interface. I had hoped this would cause the ospf neighbors to drop when the Wireguard connection wasn't functional, but it has zero effect. It didn't even increment the firewall rule with hits, even though it was the topmost filter rule.

Here's the rule at Site1:
0    chain=output action=drop dst-address=10.10.128.1 out-interface=!wireguard1 priority=0 log=yes log-prefix=""
I already have a watchdog timer set, but because the alternative route kicks in immediately, it hasn't initiated a reboot. Next, I'm going to try Wiseroute's idea of kicking off a script via the watchdog. Not sure how that gets set up, but it should be interesting to look into.
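
A likely reason the rule above never matched: packets addressed to 10.10.128.1 are always routed out wireguard1, because that address sits on the tunnel's connected subnet. What can leave by another interface is the encrypted UDP transport towards the peer endpoint. A hedged sketch of a rule aimed at that transport instead, where ether1 as Site1's public interface is a hypothetical name and the endpoint and port are taken from the Site1 config:

```
# Hypothetical sketch: drop WireGuard transport packets to the Main Office
# endpoint unless they leave through the (assumed) public interface ether1.
/ip firewall filter add chain=output action=drop protocol=udp dst-address=192.168.10.1 dst-port=13230 out-interface=!ether1 log=yes comment="drop WG transport leaving by the wrong path"
```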
 
wiseroute
Member
Posts: 352
Joined: Sun Feb 05, 2023 11:06 am

Re: OSPF, Wireguard, and multiple path problem

Fri Jun 09, 2023 9:26 am

hello Maxwell,

I'm sorry, didn't remember this thread.

ok. let us continue shall we?

your topology is a-b-c-d-a, am i correct?

a-b is dark fiber
a-d is lte. am i correct?

ab is 1, ad is 500. and when ab comes back online, you want traffic to move back from ad (500) to ab (1), correct?

your problem is only a *stale backup route* which ospf doesn't seem to withdraw when the original main link becomes active again.

MaxwellSilver wrote:
already have a watchdog timer set, but because the alternative route kicks in immediately

i was referring to the netwatch tool, not the watchdog timer :)

use netwatch on d to ping its gateway on router a via the interface towards router c (which is the original main path).

if it finds that d-a via c is good enough, there are 2 options (on router d):
1. use netwatch to shut down/disable the ospf interface towards the lte device, to force ospf to use the *original route*.
2. if that doesn't work either, the next solution is to plant a static default route to c with metric (distance) 110 using netwatch, after the first netwatch has detected the original route. this gives router d ecmp: ospf (if back to normal) and static, both at 110.

lab it first and good luck 👍🏻
 
wiseroute
Member
Posts: 352
Joined: Sun Feb 05, 2023 11:06 am

Re: OSPF, Wireguard, and multiple path problem

Fri Jun 09, 2023 6:40 pm

@ maxwell,

hmm, i did some lab work for you with an a-b-c-d topology, and i think your topology turns out to be a normal one? :)
running on v7.6

the only modified point was on router d/LTE:
/routing ospf instance
add disabled=no name=default redistribute=connected router-id=172.16.1.4
/routing ospf area
add disabled=no instance=default name=backbone
/routing ospf interface-template
add area=backbone cost=500 disabled=no interfaces=vda priority=64
add area=backbone disabled=no interfaces=vdc

---

this is to mimic your LTE interface (1mbps) :

[admin@d] > /queue/simple/print
Flags: X - disabled, I - invalid; D - dynamic
 0    name="queue1" target=vda parent=none packet-marks="" priority=8/8 queue=default-small/default-small limit-at=0/0
      max-limit=1024k/1024k burst-limit=0/0 burst-threshold=0/0 burst-time=0s/0s bucket-size=0.1/0.1
      
please note that interface vda (from router d to a, i.e. your LTE link) has cost=500 and prio=64, while the normal interfaces on the a-b-c routers have cost=1 and prio=128.

in normal operation, the traffic flow is d --> c --> b --> a.

a. lo0 1.1
b. lo0 1.2
c. lo0 1.3
d. lo0 1.4
[admin@c] > ip route/print where ospf
Flags: D - DYNAMIC; A - ACTIVE; o, y - COPY
Columns: DST-ADDRESS, GATEWAY, DISTANCE
    DST-ADDRESS    GATEWAY        DISTANCE
DAo 10.0.10.0/24   10.0.20.1%vcb       110
DAo 10.0.40.0/24   10.0.20.1%vcb       110
DAo 172.16.1.1/32  10.0.20.1%vcb       110
DAo 172.16.1.2/32  10.0.20.1%vcb       110
DAo 172.16.1.4/32  10.0.30.2%vcd       110

note that the routes to router a go via router b/vcb
---

[admin@c] > tool/traceroute address=172.16.1.1 src-address=172.16.1.3
Columns: ADDRESS, LOSS, SENT, LAST, AVG, BEST, WORST, STD-DEV
#  ADDRESS     LOSS  SENT  LAST   AVG   BEST  WORST  STD-DEV
1  10.0.20.1   0%      13  2.2ms  5.1   1.7   37.8   9.5 --> this via router b
2  172.16.1.1  0%      13  3ms    11.6  3     82.2   20.6 --> router a.

---

bandwidth-test in normal link : d --> c --> b --> a

[admin@c] > /tool/bandwidth-test address=172.16.1.1 direction=both user=admin password=admin
                    ;;; results can be limited by cpu, note that traffic generation/termination performance might not
                        be representative of forwarding performance
                status: running
              duration: 9s
            tx-current: 41.3Mbps
  tx-10-second-average: 35.5Mbps
      tx-total-average: 35.5Mbps
            rx-current: 31.7Mbps
  rx-10-second-average: 31.7Mbps
      rx-total-average: 31.7Mbps
          lost-packets: 102
           random-data: no
             direction: both
               tx-size: 1500
               rx-size: 1500
      connection-count: 20
        local-cpu-load: 90%
       remote-cpu-load: 95%

in case the link from a to b drops, the flow should go b --> c --> d --> a.
[admin@c] > ip route/print where ospf
Flags: D - DYNAMIC; A - ACTIVE; o, y - COPY
Columns: DST-ADDRESS, GATEWAY, DISTANCE
    DST-ADDRESS    GATEWAY        DISTANCE
DAo 10.0.10.0/24   10.0.20.1%vcb       110
DAo 10.0.40.0/24   10.0.30.2%vcd       110
DAo 172.16.1.1/32  10.0.30.2%vcd       110
DAo 172.16.1.2/32  10.0.20.1%vcb       110
DAo 172.16.1.4/32  10.0.30.2%vcd       110

----

note that the routes to a have now changed to router d/vcd

the failover :

   17 172.16.1.1                                 56  63 7ms639us
   18 172.16.1.1                                 56  63 21ms967us
   19 172.16.1.1                                 56  63 4ms981us
    sent=20 received=20 packet-loss=0% min-rtt=3ms964us avg-rtt=6ms986us max-rtt=21ms967us
  SEQ HOST                                     SIZE TTL TIME       STATUS
   20 172.16.1.1                                                   timeout
   21 172.16.1.1                                                   timeout
   22 172.16.1.1                                 56  63 8ms71us
   23 172.16.1.1                                 56  63 5ms826us
   24 172.16.1.1                                 56  63 10ms367us

only took 2 seconds.

---

new route LTE bandwidth-test :

[admin@c] > /tool/bandwidth-test address=172.16.1.1 direction=both user=admin password=admin
                    ;;; results can be limited by cpu, note that traffic generation/termination performance might not
                        be representative of forwarding performance
                status: running
              duration: 7s
            tx-current: 930.3kbps
  tx-10-second-average: 963.6kbps
      tx-total-average: 963.6kbps
            rx-current: 953.8kbps
  rx-10-second-average: 963.6kbps
      rx-total-average: 963.6kbps
          lost-packets: 64
           random-data: no
             direction: both
               tx-size: 1500
               rx-size: 1500
      connection-count: 20
        local-cpu-load: 13%
       remote-cpu-load: 8%

no netwatch nor any other things needed. just plain ospf.

hope this helps.
 
MaxwellSilver
just joined
Topic Author
Posts: 15
Joined: Thu Oct 07, 2021 10:11 pm

Re: OSPF, Wireguard, and multiple path problem

Thu Jul 13, 2023 8:47 pm

@wiseroute Thank you for taking the time to write up those labs. I've been out of the office and putting out fires (figuratively) for several weeks. Now I'm finally coming back around to this problem. Apologies for the lack of communication.

You have correctly understood the topology, direction of data flow, and fail-over behavior of my lab system. Thank you for the idea of setting router priority on router D. It's my understanding that the Priority setting in OSPF influences which router becomes the Designated Router. I went ahead and tested changing the router priority at site 'D' to 64, but I'm still having the same issue. I may have missed the point of your lab; if so, please accept my apology.

During the testing I've been doing, I've established that the keep alive timers are only needed if there is zero traffic across the Wireguard interfaces. The OSPF traffic is enough to keep the connection open. That simplifies things slightly.

Thanks for clarifying Netwatch vs Watchdog. I should have read your post more carefully. :? I'm working on the issue some more now and will attempt to launch a script via Netwatch to disable the B-A ospf interface when it's not available. I may also experiment with firewall rules to reject WG traffic going the 'wrong' way.
 
wiseroute
Member
Posts: 352
Joined: Sun Feb 05, 2023 11:06 am

Re: OSPF, Wireguard, and multiple path problem

Thu Jul 13, 2023 11:26 pm

@ Maxwell,

MaxwellSilver wrote:
went ahead and tested changing the router priority at site 'D' to 64, but I'm still having the same issue

no no. that 64 is the priority of the lte path, alongside the cost=500 you have set on it. not the router priority.

hope this helps.
 
MaxwellSilver
just joined
Topic Author
Posts: 15
Joined: Thu Oct 07, 2021 10:11 pm

Re: OSPF, Wireguard, and multiple path problem  [SOLVED]

Thu Jul 20, 2023 8:29 pm

This issue has been resolved. :D :D As is sometimes the case, the apparent problem wasn't really the issue. I'll be brief, but I hope this can help somebody else in the future.

The design requires that sites 1-3 from the original diagram are able to access the public internet without first going through the Main Office. Therefore, non-office bound traffic was being masqueraded on the public interfaces of router1 and router3. This had the unintended effect of allowing router1 to establish the Wireguard connection via the router3 public interface. Once the tunnel came up, it immediately failed and the cycle began again causing the flapping that was mentioned above.

The resolution is simple: I am now limiting the masqueraded traffic to the local client networks at each site. This has the effect of NOT allowing masquerade for the router interfaces connected to the radio links. Problem solved.
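
A sketch of what such a restriction might look like on Site1, assuming RouterOS v7. The thread does not list the actual NAT rules, so ether1 as the public interface, 192.168.101.0/24 as the local client network, and the shape of the earlier catch-all rule are all hypothetical.

```
# Assumed earlier rule (commented out): masquerades everything leaving the
# public interface, including traffic that arrived over the radio links.
#/ip firewall nat add chain=srcnat action=masquerade out-interface=ether1

# Restricted rule: only the local client network is masqueraded.
/ip firewall nat add chain=srcnat action=masquerade out-interface=ether1 src-address=192.168.101.0/24
```

With the source match in place, packets that originate from the neighbouring router across the radio link no longer get translated on their way out of the public interface.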

Silver lining is that I explored Netwatch. It's a powerful tool.
 
wiseroute
Member
Posts: 352
Joined: Sun Feb 05, 2023 11:06 am

Re: OSPF, Wireguard, and multiple path problem

Sun Jul 23, 2023 9:42 am

hello Maxwell,

glad to hear you have solved your network 👍🏻

MaxwellSilver wrote:
The design requires that sites 1-3 from the original diagram are able to access the public internet without first going through the Main Office. Therefore, non-office bound traffic was being masqueraded on the public interfaces of router1 and router3. This had the unintended effect of allowing router1 to establish the Wireguard connection via the router3 public interface. Once the tunnel came up, it immediately failed and the cycle began again causing the flapping that was mentioned above.

well, the basic principles will always be the same. whether you have physical or virtual tunnel interfaces on your routers, and whether you are on fully routed ip or behind nat, the ospf path cost and priority (read: ospf parameters) will follow the network design.

yes. netwatch is a nice tool to have 👍🏻
 
spippan
Member
Posts: 333
Joined: Wed Nov 12, 2014 1:00 pm
Location: Austria

Re: OSPF, Wireguard, and multiple path problem

Sat Aug 05, 2023 3:22 pm

MaxwellSilver wrote:
This issue has been resolved. :D :D As is sometimes the case, the apparent problem wasn't really the issue. I'll be brief, but I hope this can help somebody else in the future.

The design requires that sites 1-3 from the original diagram are able to access the public internet without first going through the Main Office. Therefore, non-office bound traffic was being masqueraded on the public interfaces of router1 and router3. This had the unintended effect of allowing router1 to establish the Wireguard connection via the router3 public interface. Once the tunnel came up, it immediately failed and the cycle began again causing the flapping that was mentioned above.

The resolution is simple: I am now limiting the masqueraded traffic to the local client networks at each site. This has the effect of NOT allowing masquerade for the router interfaces connected to the radio links. Problem solved.

Silver lining is that I explored Netwatch. It's a powerful tool.

;-D if the problem in a network is not DNS, most of the time it is NAT
joking aside - thanks for your clarification! a simple step but a VITAL one because it often gets forgotten
