MPLS and other stability issues

Hi Everyone:

I run a small AS as a (Mikrotik) training vehicle. The core consists of a CHR and an x86 machine as the main BGP routers in two data centers, with 2x transit and a few peering sessions. All external interfaces are on physical Mikrotik routers, connected via MPLS PW to the CHR/x86 core routers. There are also off-site routers connected via GRE/IPsec and L2TP/IPsec, and through MPLS PWs I can move any L2 interface to any location in the network, which has about 10 routers in total. All in all, it is a mix of professional-grade links (1G Ethernet between data centers, transit, peering) and consumer-grade DSL links carrying L2TP and GRE. The network allows for some experimentation, while of course the BGP advertisements and peerings should be stable.
The network was reliable for a few years; over the last 12 months I have been replacing ROS V6 with ROS V7. I am thankful to Mikrotik for supporting older routers with current firmware: it is not only generous, it also makes it possible to run a whole network on a single ROS version. Having said that, I continue to struggle with some stability issues:

1) BGP itself seems to be stable at 7.16, with session uptimes of 100 days or more. There are two external sessions with full tables (approx. 980k routes), plus peering, both IPv4 and IPv6. The setup is simple (full mesh / no RR).

2) MPLS headache: I was unable to get MPLS/LDP to work reliably in V7. A few weeks ago I replaced all MPLS PWs with EoIP, and since then the stability issues are gone. One issue might be that I have no intention of using MPLS for IP traffic (only for PWs): for this reason, I use LDP Advertise Filters (and also Accept Filters) so that only the router loopbacks (all in one /24) are propagated. While this seems to work at least initially, later on a router suddenly becomes unreachable. After debugging OSPF LSPs and routing tables along the whole path back and forth, it becomes apparent that the routes are correct. Turn off MPLS, and the router becomes reachable. It seems that MPLS directs some traffic into LSPs, but those LSPs lead nowhere. Contributing to the issue, in my opinion, is the rather dynamic nature of my network: L2TP tunnels come and go, causing relatively frequent (= every few days) topology changes. It seems to me that there are race conditions in the code, between processes or also between CPUs, with the result that things usually work but then statistically fail the 30th or 50th time. In one case OSPF also did not adapt to a topology change; for lack of time I had to reboot the CHR, and voilà, it worked. I am also aware of the tips here in the forum relating to MPLS: assign an exclusive label range to each router, and use the same hello intervals for LDP and OSPF. However, if this helps, then it confirms that there are issues, and such workarounds only lower the probability, i.e. one then has to restart half the network “less often” to make it operate again.

3) Testing: My real question is how I manage to run into stability issues when I only spend a few hours a month on Mikrotik. Shouldn’t Mikrotik catch such issues? Given the (sometimes) very few hours between when a ROS version is compiled and when it is published, I can only suspect that not much quality assurance is happening at Mikrotik. My suggestion would be that Mikrotik publicly documents the tests conducted, including the configurations. Such “reference networks” would help us users understand whether we are in uncharted territory or whether we are using a feature in a known way. It would also complement the documentation. And please use scripting to artificially create unstable networks in the lab; then let the lab run for weeks to understand whether the software is stable and whether memory leaks exist (you can get back to me for more details).

4) Please tell us what you know does not work. There are issues in Winbox that “everyone knows about”, for example. In aviation, an aircraft does not need to have all systems operational: regulations allow flying with some features inoperable, as long as minimum conditions are met (see Minimum Equipment List etc.). In the Mikrotik world, the pilots would at least like to know what is broken; then we could avoid those landings in fog at night when MPLS is still not stable. Point 4 here, you may have noticed, is the opposite of point 3 above (where the claim of “it works” should be backed by releasing the corresponding config).
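
To make the two ideas from point 2 concrete (filtering LDP down to the loopback /24, and giving every router its own label range), here is a minimal planning sketch in Python. It is not RouterOS configuration and not anything Mikrotik ships; the loopback range 10.255.255.0/24, the router names, and both function names are assumptions made up for illustration. The only fixed facts used are that MPLS labels are 20 bits wide and that labels 0 to 15 are reserved.

```python
import ipaddress

# Assumed loopback range; the post only says "all in a /24".
LOOPBACKS = ipaddress.ip_network("10.255.255.0/24")


def ldp_filter_allows(prefix: str) -> bool:
    """Return True only for prefixes that lie inside the loopback range."""
    net = ipaddress.ip_network(prefix)
    # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first.
    return net.version == LOOPBACKS.version and net.subnet_of(LOOPBACKS)


def label_ranges(routers, first=16, last=1_048_575):
    """Split the usable label space into equal, non-overlapping blocks.

    Returns a dict mapping router name -> (lowest label, highest label).
    """
    size = (last - first + 1) // len(routers)
    return {
        name: (first + i * size, first + (i + 1) * size - 1)
        for i, name in enumerate(routers)
    }


if __name__ == "__main__":
    # Hypothetical router names for a network of about 10 routers.
    plan = label_ranges([f"r{i}" for i in range(10)])
    for name, (low, high) in plan.items():
        print(f"{name}: labels {low}-{high}")
    for prefix in ("10.255.255.7/32", "192.0.2.0/24"):
        print(prefix, "advertise" if ldp_filter_allows(prefix) else "filter out")
```

The resulting ranges would still have to be entered on each router by hand or by script; the sketch only guarantees that the blocks do not overlap.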

While Mikrotiks are complex products, and while in my own experience 80% of the errors turned out to be my own fault in the end, I am still uneasy about some features of the platform. MPLS no way at the moment, BGP… hmm… I don't dare to touch it sometimes. IPsec… hmm… OSPF… hmm…

Any input from other users about the use of MPLS PWs and/or OSPF stability?

regards,
azg

I also had problems with MPLS in v7, more precisely with older equipment such as the CCR1036-12G-4S and RB2011/3011, in short, routers that did not ship with v7 from the factory. Routers that did ship with v7, such as the RB4011, had far fewer problems. But I still see some bugs: when traffic switches from route A to route B, the router drops the VPLS tunnels. In fact this can happen for almost any reason, even a change to unrelated settings somewhere else in the network, and it drops all the VPLS tunnels that pass through that router. I also had the problems you mentioned of losing access to a router, and sometimes I lost access to some networks that I use to reach assets in the POPs when some v7 router in the middle of the path went crazy and I had to restart it. It was not easy to understand this problem or to find out which router had gone crazy. MPLS needs improvements in v7, especially when there are 2 paths available to transport a VPLS tunnel.

We ditched MPLS/VPLS a while ago. We also had issues with FIB/LFIB synchronization, and with traffic that simply got lost in certain topology scenarios. We are now using EoIP and VXLAN instead.
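
For anyone unsure what a FIB/LFIB mismatch looks like in practice, here is a rough sketch in Python. It assumes you have somehow exported both tables into plain dictionaries; that data layout, the function name, and the sample addresses are all hypothetical and do not correspond to any RouterOS output format. The idea is only that a label entry is suspect when its next hop no longer matches the next hop of the IP route for the same prefix.

```python
def find_stale_label_entries(fib, lfib):
    """Return the prefixes whose label next hop disagrees with the IP route.

    fib:  dict mapping prefix -> next hop chosen by the routing table
    lfib: dict mapping prefix -> (next hop, outgoing label)
    A prefix that has a label entry but no IP route at all is also reported.
    """
    stale = []
    for prefix, (label_next_hop, _out_label) in lfib.items():
        if fib.get(prefix) != label_next_hop:
            stale.append(prefix)
    return sorted(stale)


if __name__ == "__main__":
    # Made-up sample data: the second loopback has moved to a new next hop
    # in the routing table, but the label entry still points at the old one.
    fib = {
        "10.255.255.1/32": "10.0.0.1",
        "10.255.255.2/32": "10.0.0.5",
    }
    lfib = {
        "10.255.255.1/32": ("10.0.0.1", 1001),
        "10.255.255.2/32": ("10.0.0.9", 2002),
    }
    print(find_stale_label_entries(fib, lfib))
```

Traffic for a prefix reported this way would be label-switched towards a neighbor the routing table no longer uses, which matches the symptom of correct routes but unreachable routers.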

I made a post a couple of months ago asking whether MPLS is finally stable on ROS yet, and received no replies.

Judging by the OP’s post in this thread, I guess not. LDP still gets out of sync and drops packets.
But is it just LDP, or is it MPLS itself that has bugs? Has anyone skipped LDP entirely and just run BGP-signaled VPLS with TE tunnels? Still issues, or stable?