I now have around 50 clients spread across 6 base stations with 3 Internet feeds using OSPF. Clients link in using station-wds so that I can run PPPoE servers on the routers adjacent to my Internet feeds. I have been monitoring the "port flaps" and can rule out some theories and also offer a possible short-term fix.
First off, from what I've read and from trying stuff myself, I can say that changing cables and ends is not a solution, although I would make sure you rule this out as the first step in diagnosing the problem. If you have multiple cables running up a tower to multiple RBs, then once you have taken care of your routing, swap the cables around to see whether the problem moves with the cable to a different RB or stays with the original RB. If it stays with the RB, it's the RB; if it moves with the cable, it's the cable and/or connectors.
I have port flaps on clients and base stations. One base station port flap in particular seems to have been cured by turning off Auto-Negotiation and forcing the port to 10Mbps. This is on an RB750 at the base connected to a Sextant at the top using my own flavour of PoE injector.
I had tried swapping cables, ports and ends but it kept happening. Now, with the port forced to 10Mbps, the link has stayed up for the last 5 days. Fortunately I use this link only for monitoring part of my network, although it can and will route all traffic from one of my Internet feeds through to another should the closest fail.
In this case I would say the problem is with the Sextant because the logs show it drops out for a second or 2, the port light on the RB750 also goes out, but the power stays up, so it's the Sextant that is reporting ether1 down.
While spending months monitoring this, I have deduced that when the link drops, OSPF goes into panic mode causing an OSPF storm across the network because this link is central in my network.
I have also noticed on a few occasions that when I access this Sextant through the Dude, the link fails either a few seconds before or just after I log in. Because this is a link into the centre of my network, my Dude never settles into a stable state with all links showing as up. I normally use only the ping and http services for clients, and the ping, router, mikrotik, telnet and dns services on my backhaul routers, and when it's not stable I get orange everywhere.
In cases where the link would drop rapidly, OSPF would drop the adjacency and cause a storm across the network. To stop this from happening, I would disable the OSPF instance on the router showing the ether1 down log for a few seconds, maybe 5, then enable it again. The descriptor would then announce the link exchange state, so any wrong state messages would disappear and OSPF would return to normal.
All my connected routers are using OSPF and my clients are mainly holiday Villas where they remove the electric key card when leaving thus switching off the electric, so at all times of day, OSPF is being updated with clients connecting and disconnecting. This in itself does not cause an OSPF storm, by which I mean where many adjacent routers cycle through: invalid sequence, discarding packet, MD5 auth failed, exchange and 2-way to init or down states causing all connected routers to continually update adjacency status to the point the network becomes slow due to the amount of OSPF traffic.
FYI I am using a bridged network with RSTP enabled and when there are no backhaul problems, it works perfectly. I have been at a client's site trying to diagnose this very problem, where they are connected in using an SXT connected to an indoor TP-Link and using PPPoE to dial in.
They connect to the first AC to reply, which is one hop from the Internet feed, so the route goes from client to PPPoE server (AC) to Internet. Whenever the port flaps once, the flap is short enough that the PPPoE session stays alive and the client sees no interference or down time.
If the port flaps successively, like 2 or 3 times within seconds of each other, then OSPF announces the link as down, but the PPPoE session stays up so long as OSPF sorts itself out before the PPPoE session becomes stale. During this time the client still has Internet connectivity.
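To make the "2 or 3 times within seconds of each other" idea concrete, here is a rough Python sketch that decides whether a list of link-down times counts as a burst. The function name, the 2-flap minimum and the 5-second window are my own assumptions for illustration, not values taken from RouterOS or OSPF.

```python
def is_flap_burst(down_times, min_flaps=2, window_seconds=5.0):
    """Return True if at least `min_flaps` link-down events fall within
    `window_seconds` of each other.

    `down_times` are event times in seconds (any common reference point).
    The default thresholds are illustrative assumptions only.
    """
    times = sorted(down_times)
    # Slide over every run of `min_flaps` consecutive events and check
    # whether the first and last of the run are close enough together.
    for i in range(len(times) - min_flaps + 1):
        if times[i + min_flaps - 1] - times[i] <= window_seconds:
            return True
    return False
```

A single isolated flap never counts as a burst here, which matches what I see: one flap leaves the PPPoE session and OSPF alone, and only the rapid repeats cause trouble.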
To prove this client's cable and connections were fine, I pulled a 1GB FTP test file across the connection in around 20 minutes without it dropping once, then after returning home and checking the logs, it had dropped out twice in the 10 minutes after I left. There's no rhyme or reason to the port flap, except I believe it's a software initiated drop and doesn't seem to be related to hardware, power supplies or PoE injectors.
ROS versions or firmware
In my opinion and testing none of these alter the port flap, it remains as random as it always is, although I would say that when lots of different traffic (Routing/UDP/TCP) is going across it, it is more likely to happen than at night for example when there's very little of any traffic. I even thought of temperature since at the moment we are having around 40-45 degrees C during the days, but still no pattern is emerging.
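For anyone who wants to check for a time-of-day pattern (traffic load, temperature) rather than eyeball the logs, a small Python sketch like this buckets flap timestamps by hour. The function name and the timestamp format are assumptions; adjust the format string to whatever your exported logs contain.

```python
from collections import Counter
from datetime import datetime

def flaps_per_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count link-down events per hour of day.

    `timestamps` is an iterable of strings in the (assumed) format `fmt`.
    Returns a list of 24 counts, index 0 = midnight to 00:59.
    """
    counts = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]
```

If the flaps really are random, the 24 counts should come out roughly in proportion to how busy the link is at each hour, with no sharp peak around the hottest part of the day.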
Run a 5 second script to search the logs and watch for the common messages denoting ether1 down and OSPF failure messages, then disable the OSPF instance for 5 seconds and re-enable it to make it autonomous for large networks where you can't monitor it 24/7.
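As a rough illustration of the logic only (not a RouterOS script), here is a Python sketch of one pass of such a watcher. The message patterns are assumptions based on the kind of messages described above, and `disable_ospf`, `enable_ospf` and `sleep` are hypothetical callables you would have to supply yourself; nothing here talks to a real router.

```python
import re

# Assumed message fragments; adjust to what your own logs actually show.
LINK_DOWN = re.compile(r"\bether1\b.*\blink down\b", re.IGNORECASE)
OSPF_TROUBLE = re.compile(
    r"invalid sequence|discarding packet|auth(entication)? failed",
    re.IGNORECASE,
)

def needs_ospf_bounce(lines):
    """True if the batch contains both an ether1 link-down message
    and at least one OSPF failure message."""
    saw_down = False
    saw_ospf = False
    for line in lines:
        if LINK_DOWN.search(line):
            saw_down = True
        if OSPF_TROUBLE.search(line):
            saw_ospf = True
    return saw_down and saw_ospf

def watch_once(lines, disable_ospf, enable_ospf, sleep, hold_seconds=5.0):
    """Check one batch of log lines; if it looks like a flap followed by
    OSPF trouble, disable OSPF, wait `hold_seconds`, then re-enable it.
    Returns True if a bounce was performed."""
    if not needs_ospf_bounce(lines):
        return False
    disable_ospf()
    sleep(hold_seconds)
    enable_ospf()
    return True
```

Calling `watch_once` every 5 seconds with the newest log lines would mimic what I currently do by hand. Requiring both a link-down and an OSPF failure message is a deliberate choice, so that a single harmless flap does not bounce OSPF needlessly.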
Allow MikroTik access to your most problematic router so they can check it while it's happening and perform whatever covert operations they want to find the cause. Supout etc.
MikroTik guys, you need to understand there are a lot of posts on this problem, so you cannot bury your heads in the sand by asking us to recreate it so you can diagnose it. If we knew how to recreate it, then we'd know how to cure it.
Choose someone like WirelessRudy and gain access to one of his problem routers and monitor it yourselves. We spend enough time trying to run businesses and keep our networks running and our customers happy to then spend more time waiting for our network to fail just so you can try and fix it.
I am sure this is a software problem as already suggested, and I think OSPF and possibly the ethernet negotiation scripts are "too sensitive" to small electrical changes on the ports.
I'm a mechanic/engineer/software programmer/general dogsbody by trade and I can do things with my hands better than most, but I can't be arsed to continue spending hours reading through pointless posts on this forum from people with exactly the same problem. You need to experience this for yourselves and the only way you can do that is to spend time monitoring it yourselves.
I'm sure WirelessRudy can give you access to a router you may only need to monitor for an hour or 2 before it happens, so when it does, you can just update this forum thread with solutions not more questions or requests for more information.
I'm getting fed up with feeling like, as so many others have said before, a "beta-tester for your software".
Your products and software have become the lifeblood of many people on this forum; it is their livelihood. A lot of them, like me, have spent all their money and time creating something based on your products. Please repay that devotion by devoting 2 hours to finally see what we are talking about, and don't keep asking us to recreate it.
As a mechanic I used to have customers coming to me with a problem that wouldn't happen when they turned up at the garage. I would then give them another car and use theirs for a day or two so it would happen while I was using it. This was sometimes the only way I could actually solve a problem, rather than telling them to come back if/when it happened again.
WirelessRudy is right by creating this thread a whole year ago, but you are no closer to a fix. Please, I cannot keep wasting my time reading this forum in the hope problems have actually been fixed, I need to know that for each thread that is opened, a solution is given within a few posts, not 100's and you guys need to understand that with your help, a LOT OF NETWORKING GURUS using your products would be on your side and help you to iron out problems, maybe even offer to be beta-testers to help you push your product forward.
So from now on, I will only check this forum thread for a solution, i.e. I will only look at the last post, which should be the solution, not yet another customer complaining about your software or lack of commitment to fix it, or another request for people to recreate it.
Gain access to a router that is doing it and monitor it until it happens, then fix it and post the solution.
Regards to you all,