New Ethernet port flap issue enquiry, PLS JOIN!

WirelessRudy · August 7, 2011, 12:49am

My network of 200+ routerboards running v5.6 still has several units that show Ethernet port flaps on a more or less regular basis.
It seems MT cannot really get a grip on this issue. They can’t reproduce it.

I have some ideas about where the issue might be found: ESD, or poor, below-standard cable connections.

The issue in my case is randomly spread over my network, but always between routerboards and 3rd party devices (laptops, PCs, Wi-Fi routers, switches), and the log clearly shows that the port flaps mainly occur during daytime.
I live in a dry climate with moderately high summer temps and a very strong day/night effect when it comes to the prevailing winds in the summer.
It looks like most of the flaps start when the wind starts to blow, some hours after the sun has started to climb the sky. Most flaps stop soon after the wind has died down as well…

99% of my CPE antennas are not earthed, since that is almost impossible to do, and in the pre-5.x era I never had an issue with that either.
In general I use standard non-shielded UTP cable with non-shielded plugs on both ends.
I use a whole range of powershots.

Basically I should investigate and make an inventory of which units in the network have the issue and which do not, and compare that with their physical connections to the client’s device and with what kind of device that is.
This is going to be a time-consuming task, and time is the one resource I am always short of! So I am asking my fellow users with the same issue to report whatever they can on this topic, with as much info as possible, to help find the cause.

MT will definitely read this topic and get their findings from it, and so can we.

So, PLEASE report your Ethernet port flaps! But make the report as complete as possible. Users who have a laptop or PC directly connected to an rb will also see an Ethernet port flap if that PC/laptop goes into hibernation. This might happen regularly, and in a quick overview of the log it might look like you have a port flap issue while in fact you have not…

Some reported that they had a unit with this issue where the flap was in fact also reported by the unit at the ´other end´ of the cable. But we need to be sure what the reason for this is. There can be several reasons that make Ethernet links drop.
On some 30% of my units I have Ethernet port flaps, down and up within the same second or within 2-3 secs at most, that I cannot attribute to client behaviour or to other actions or failures. Since clients are not complaining, and said they had not noticed anything when asked, I believe the ´port flaps´ are merely reporting errors of the OS.

My ´feeling´ is that small ESD discharges might trigger the ros reporting system to produce a port disable, while in reality the link is not really broken…
If this is the case then there are 2 solutions:

1. MT fixes the sensitivity of the ros reporting of Ethernet issues.
2. I just disable the reporting of Ethernet status messages in my logs. (And miss any real problems.)

Please help on this issue. I’ll bet the ones having this issue are as irritated by their logs filling up as I am!

Ivoshiee · August 7, 2011, 6:48am

I have that port flapping issue on my RB750G. I used ether1 for the outside connection (Cisco cable modem), and there the port flapping really does drop the connection and is not purely cosmetic. On some occasions I had over 1 hour of no Internet access. That may be some other issue as well, but even during that time there were numerous port flapping incidents.
A couple of days ago I moved the cable modem to ether3 and I haven’t seen any port flapping on that port. The client PC/laptop is now on ether1 and I still see port flapping there.
Note: With the cable modem on ether1 the line speed came up as 1000M all the time, but now ether1 occasionally reports 10M and then 1000M.

All my boxes are running ROS v5.5.

Edit: Cables are standard 1,8m variants, there should be no issues with these.

WirelessRudy · August 7, 2011, 10:28am

Thanks Ivoshiee. This is the kind of report we are looking for.
Reading your post, I realize that there were problems in the past with the Ether1/PoE port. Our port flap issue is also usually on the Ether1/PoE port, and we have seen other problems with it before. After the release of v4.x there was the issue that this port suddenly ´died´ for receive traffic, or died completely on running units for no reason. After a reboot it usually came back. (But I also had some 411’s where the port died after the first initial upload of a working system by means of a script. Some of these PoE ports could not be repaired by software upgrades or re-installs.) This was all about a year ago.

In my central unit (with a bunch of rb’s interconnected to each other, including 2 rb1000’s) the issue came up every so many days and was fixed by connecting all routerboards and housings to a daisy-chained earth connection. The problems have not returned since. The problem here was obviously either ESD or current ´leakage´ over the boards and the Ethernet connector.

Anyway, you also report that the port flap really means a disconnect of the physical line.
On the other hand, your disconnects last a very long time, whereas all my port flaps usually happen within the same second, or the port comes back within 2-3 secs at worst.
(All according to my log. Imho this makes me think even more that it is merely a reporting issue rather than a ´real´ issue. See my first post.)

Davis · August 7, 2011, 2:27pm

Maybe you can set up some monitoring software (e.g. The Dude) to monitor these flapping links by pinging every second or even more often, to see whether the link flapping is only a reporting issue or real flapping.
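
For anyone without The Dude at hand, the same once-per-second check can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `ping_once` and `watch` are made up here, and the `ping` flags are the Linux ones (`-c` for count, `-W` for timeout in seconds), so they may need adjusting on other systems.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host is answered.

    Assumes the Linux 'ping' command and its -c / -W flags.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(probe, samples, interval_s=1.0, sleep=time.sleep, now=datetime.now):
    """Call probe() once per interval and record every up/down transition.

    Returns a list of (timestamp, "up" or "down") tuples, one per change
    of state, including the initial state.
    """
    transitions = []
    last = None
    for _ in range(samples):
        up = probe()
        if up != last:
            transitions.append((now(), "up" if up else "down"))
            last = up
        sleep(interval_s)
    return transitions
```

A call such as `watch(lambda: ping_once("192.168.88.1"), samples=3600)` would watch a device for roughly an hour (the address is just a placeholder). If the router log shows flaps but this list shows no "down" entries at the same times, that would point towards a reporting issue rather than a real loss of link.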

Ivoshiee · August 8, 2011, 8:22am

I have had Netwatch scripts set up to monitor the cable modem and the network gateway. As soon as ether1 was reported as down, the message that the modem host was down came in as well. So that is a real problem for me and not just a reporting issue.

Ivoshiee · August 8, 2011, 8:27am

That long loss of connection is just one of a few such cases I experienced. Usually these connectionless gaps last 10…15 seconds. I haven’t checked whether all of them are reported as the interface being down, but I strongly suspect so.
Edit:
Snippet from the log:

11:43:45 interface,info ether1 link down
11:43:55 interface,info ether1 link up (speed 1000M, full duplex)
11:46:19 interface,info ether1 link down
11:46:21 interface,info ether1 link up (speed 10M, full duplex)
11:46:27 interface,info ether1 link down
11:46:31 interface,info ether1 link up (speed 1000M, full duplex)
11:47:23 interface,info ether1 link down
11:47:24 interface,info ether1 link up (speed 10M, full duplex)
11:47:25 interface,info ether1 link down
11:47:28 interface,info ether1 link up (speed 1000M, full duplex)
11:47:49 interface,info ether1 link down
11:47:52 interface,info ether1 link up (speed 1000M, full duplex)
19:14:46 interface,info ether1 link down
19:14:47 interface,info ether1 link up (speed 10M, full duplex)
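
To turn a log like the one above into flap durations, a small parser can pair each "link down" with the following "link up". This is a rough sketch, assuming the log lines look exactly like the ones pasted here (time of day only, no date prefix); the names `LINE_RE`, `parse_flaps` and `SAMPLE` are invented for the example.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   11:43:55 interface,info ether1 link up (speed 1000M, full duplex)
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) interface,info (?P<port>\S+) link "
    r"(?P<state>down|up)(?: \(speed (?P<speed>[^,]+), (?P<duplex>\w+) duplex\))?$"
)

SAMPLE = """\
11:43:45 interface,info ether1 link down
11:43:55 interface,info ether1 link up (speed 1000M, full duplex)
11:46:19 interface,info ether1 link down
11:46:21 interface,info ether1 link up (speed 10M, full duplex)
"""


def parse_flaps(log_text):
    """Pair each 'link down' with the next 'link up' on the same port.

    Returns a list of (port, down_time, seconds_down, speed_after_flap).
    """
    pending = {}  # port -> time of the last unmatched 'link down'
    flaps = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not a link up/down message
        t = datetime.strptime(m["time"], "%H:%M:%S")
        port = m["port"]
        if m["state"] == "down":
            pending[port] = t
        elif port in pending:
            down = pending.pop(port)
            delta = t - down
            if delta < timedelta(0):
                delta += timedelta(days=1)  # flap crossed midnight
            flaps.append(
                (port, down.strftime("%H:%M:%S"), int(delta.total_seconds()), m["speed"])
            )
    return flaps
```

Run on the first four log lines, `parse_flaps(SAMPLE)` gives one 10 second flap that comes back at 1000M and one 2 second flap that comes back at 10M, which makes it easy to count how often the port renegotiates at the lower speed.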

WirelessRudy · August 8, 2011, 10:19am

I called my worst affected client this morning; he has 30 to 40 Ethernet port flaps a day (during daytime, when his PC is ‘on’).
I asked him if he had noticed any problems with disconnects, but he can’t really remember any…
He uses Skype regularly but has never had a conversation broken off in the middle.

I took a close look at the log and see the ether1 link going down and up, but every time the connected speed actually changes from 10M to 100M or vice versa.
My CPE (generating these ether1 link messages) is connected directly by cable to his PC.
His PC goes into standby and later into hibernation, which would then switch off the Ethernet interface of the PC…
I can’t ping his PC though to see if the link really drops. Probably it is being blocked by Windows Firewall? (Does anybody know? Is this ‘default’ behaviour of Windows Firewall?)

The strange thing is though that almost all of his port flap log shows the link going down and coming up again within 2 secs at most. When it is then up, it sometimes stays up for several minutes, up to half an hour, before going down again and coming back up within the same 2 secs but at another speed.

Ivoshiee · August 9, 2011, 2:25pm

I am supposed to have a permanent IP from my ISP, but I cannot set it statically and the router must request it over DHCP. My reported long Internet loss is likely due to the use of DHCP on that interface. The port flapping happened at a bad moment, so the cable modem lost its MAC/IP address registration, but the Mikrotik didn’t release that IP address, and Internet access resumed only after a DHCP renew, when the DHCP client lease had run down to 0 (lease expiration time is set to 2 hours).

ohara · August 9, 2011, 7:58pm

Rudy,
based on your experience, does port flapping also occur while using passive PoE splitters to power the RB in an outdoor enclosure?

I have deployed two 5.6 RB433 recently. One is on a rooftop with a 50 meter FTP cable run down the building without being fixed to the facade. We were planning to fix the cable the next day, but the router had been turned on anyway. The installation underwent a stress test during the night while a severe storm passed over the area. The port was flapping for 15 minutes. Once the storm had passed, the port flapping stopped. I am wondering if this was caused by the cable being smashed against the wall by the wind.

Another RB was installed on a balcony where the cable was put under a synthetic carpet. I was noticing port flaps during daytime and I came to the conclusion that the customer might have been causing the port flaps by simply walking on the balcony.

I feel that port flaps are caused by the cable being exposed to external factors, whereas the cable should always be at a complete standstill. This may be layman’s thinking, but I would still appreciate your feedback on that.

WirelessRudy · August 9, 2011, 10:48pm

We use only these. But maybe it is an idea: one of these days I can test an affected unit with a power splitter/take-off. That way the power will reach the board not via the UTP cable but via the jack. Maybe that helps… we’ll see.

I have deployed two 5.6 RB433 recently. One is on a rooftop with a 50 meter FTP cable run down the building without being fixed to the facade. We were planning to fix the cable the next day, but the router had been turned on anyway. The installation underwent a stress test during the night while a severe storm passed over the area. The port was flapping for 15 minutes. Once the storm had passed, the port flapping stopped. I am wondering if this was caused by the cable being smashed against the wall by the wind.

Another RB was installed on a balcony where the cable was put under a synthetic carpet. I was noticing port flaps during daytime and I came to the conclusion that the customer might have been causing the port flaps by simply walking on the balcony.

I feel that port flaps are caused by the cable being exposed to external factors, whereas the cable should always be at a complete standstill. This may be layman’s thinking, but I would still appreciate your feedback on that.

All my cables are well attached to walls and roofs and masts/poles. But that is more out of tidiness and good conduct. I don’t think physical movement of the UTP cable would be an issue as long as the force is not driving the twisted wires apart or squeezing the copper core through the insulation against another wire. A synthetic carpet, on the other hand, could cause static electricity that affects the data transport in the UTP cable. But would the flapping have been caused by someone walking over it?? Well, everything is possible. But I think that chance is relatively small…

ohara · August 10, 2011, 6:11am

Hi,

I would appreciate it if you could post test results. I ordered some splitters yesterday but won’t be able to conduct any tests within the next few days.

Skaught · August 24, 2011, 10:55pm

We are seeing these issues and we are the NOC for 4 different MT based ISPs in different locations across North America.

The router drops its link down to 10mbit randomly and also refuses to link at all if we hard code both ends to 100mbit full.

We have hundreds of rb’s using ethernet only that show these issues. They are connected to a mix of other RB radios, ubnt radios, network management devices, Fibre transceivers and client devices. We see the issues all over the place in a roughly random pattern. The issue is random enough that if you only had a hundred MTs in use, you may not even notice it. We see it constantly as we have so many devices deployed.

The only variable that is the same throughout is MT. These networks have all been built by different people thousands of miles apart. We thought it was issues with bad cables so we TDR’d as many as we could, and they all certified with flying colours. These are armoured cables that are properly bonded and surge suppressed if they run outside.

Our crews are Journeymen Telecom Electricians with years of experience who build to and exceed code in all workmanship. This is not an installation or ESD problem. This is not a reporting problem, the other devices report the slow link, flapping and issues. Speeds also slow to a crawl when the link drops to 10mbit.

We have the latest firmware, opened a ticket, and have sent in .rif files. No solution in sight.

elgo · August 25, 2011, 11:38am

post deleted, not related.

n21roadie · August 25, 2011, 12:54pm

@Rudy have you tried powering the boards over a two-core cable plugged directly into the DC jack, as a test?
Or put a 12v battery next to the unit for a test.

ohara · August 25, 2011, 8:11pm

just noticed that none of the port flapping messages appears in the remote kiwi syslog even though remote logging for ‘interface’ is set.

Makes kind of sense - if the connection is down then no data about it will be sent to the remote syslog. This makes troubleshooting much more difficult as one has to go into winbox logs for spotting port flaps.

Ivoshiee · August 29, 2011, 9:02pm

I’ve monitored my RB750G and the issue is more likely to affect the ether1 (and ether2) ports. I see several port down/up and speed variation messages a day on those ports, but on any higher-numbered port it is much less frequent, if it happens at all.

Note: No PoE is in use on ether1, and the ISP modem is still about 1,8 meters away on the same CAT5 cable.

jfartak · August 30, 2011, 7:32am

I can confirm too that the RB411 (v5.6) has the same issue. Nothing is written to the MK’s log when this occurs, but on the switch to which the board is connected (either via a PoE or a non-PoE port), something like this appears in the log (entries for the port the MK is connected to):

[474] 14:40:02 2011-06-27
“STA topology change notification.”
level: 6, module: 5, function: 1, and event no: 1

[473] 14:40:00 2011-06-27
“STP port state: MSTID 0, Eth 1/8 becomes non-forwarding.”
level: 6, module: 5, function: 1, and event no: 1

[472] 14:39:58 2011-06-27
“STP port state: MSTID 0, Eth 1/8 becomes forwarding.”
level: 6, module: 5, function: 1, and event no: 1

[471] 14:39:55 2011-06-27
“Unit 1, Port 8 link-up 100M FD notification.”
level: 6, module: 5, function: 1, and event no: 1

[470] 14:39:54 2011-06-27
“STP port state: MSTID 0, Eth 1/8 becomes non-forwarding.”
level: 6, module: 5, function: 1, and event no: 1

[469] 14:39:54 2011-06-27
“Unit 1, Port 8 link-down notification.”
level: 6, module: 5, function: 1, and event no: 1

[468] 14:39:29 2011-06-27
“STP port state: MSTID 0, Eth 1/8 becomes forwarding.”
level: 6, module: 5, function: 1, and event no: 1

[467] 14:39:27 2011-06-27
“Unit 1, Port 8 link-up 100M FD notification.”
level: 6, module: 5, function: 1, and event no: 1

[466] 14:39:22 2011-06-27
“Unit 1, Port 8 link-down notification.”
level: 6, module: 5, function: 1, and event no: 1

This is weird, and if you’re using any network protocols (RSTP/OSPF…) it can generate “unreasoned” events in the network. Furthermore, in such a scenario the IP watchdog (if configured) can trigger a reboot (when the link goes up and RSTP makes the port reachable and forwarding).

What is even worse, this also happens on several of our RB433/AH boards running 5.6. This is another switch log example (MK’s switch port 18):

[324] 08:04:24 2011-08-30
“Unit 1, Port 18 link-up notification.”
level: 6, module: 5, function: 1, and event no.: 1

[323] 08:04:18 2011-08-30
“Unit 1, Port 18 link-down notification.”
level: 6, module: 5, function: 1, and event no.: 1

[322] 08:04:06 2011-08-30
“Unit 1, Port 18 link-up notification.”
level: 6, module: 5, function: 1, and event no.: 1

[321] 08:03:41 2011-08-30
“Unit 1, Port 18 link-down notification.”
level: 6, module: 5, function: 1, and event no.: 1

[320] 08:03:32 2011-08-30
“Unit 1, Port 18 link-up notification.”
level: 6, module: 5, function: 1, and event no.: 1

Nothing was found in the MK log, but the switch (another switch model, at another site) logged it. As you can see, some “outages” are quite long, so we suspect that such flaps are causing random reboots when the IP watchdog is configured.
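
To see how long these “outages” actually are, the switch entries can be paired up with a short script. This is only a sketch, assuming the three-line entry layout shown above with the newest entry first; the names `HEADER_RE`, `LINK_RE`, `link_outages` and `SAMPLE` are made up for the example and are not part of any switch or MikroTik tooling.

```python
import re
from datetime import datetime

# Entry header, e.g. "[323] 08:04:18 2011-08-30"
HEADER_RE = re.compile(r"^\[(\d+)\] (\d{2}:\d{2}:\d{2} \d{4}-\d{2}-\d{2})")
# Message line, e.g. "Unit 1, Port 18 link-down notification."
LINK_RE = re.compile(r"Port (\d+) link-(up|down)")

SAMPLE = """\
[324] 08:04:24 2011-08-30
"Unit 1, Port 18 link-up notification."
[323] 08:04:18 2011-08-30
"Unit 1, Port 18 link-down notification."
[322] 08:04:06 2011-08-30
"Unit 1, Port 18 link-up notification."
[321] 08:03:41 2011-08-30
"Unit 1, Port 18 link-down notification."
[320] 08:03:32 2011-08-30
"Unit 1, Port 18 link-up notification."
"""


def link_outages(log_text):
    """Return (port, seconds_down) for every link-down/link-up pair."""
    events = []
    seq = stamp = None
    for line in log_text.splitlines():
        line = line.strip()
        header = HEADER_RE.match(line)
        if header:
            seq = int(header.group(1))
            stamp = datetime.strptime(header.group(2), "%H:%M:%S %Y-%m-%d")
            continue
        link = LINK_RE.search(line)
        if link and stamp is not None:
            events.append((seq, stamp, int(link.group(1)), link.group(2)))
    events.sort()  # the switch lists newest first; sort by entry number
    down_at = {}  # port -> time of the last unmatched link-down
    outages = []
    for _, when, port, state in events:
        if state == "down":
            down_at[port] = when
        elif port in down_at:
            seconds = int((when - down_at.pop(port)).total_seconds())
            outages.append((port, seconds))
    return outages
```

For the port 18 entries this gives outages of 25 and 6 seconds. STP lines and other messages are simply skipped, so the same function should also cope with the longer log from the first switch.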

WirelessRudy · August 30, 2011, 11:24am

jfartak; maybe a strange question; you do have the logging topic “interface” enabled in the routerboards?
If not the log won’t show Ethernet interface status in ROS.

jfartak · August 30, 2011, 12:24pm

You’re right - the topic “interface” was not included, so it makes sense that nothing was visible in the log. So I set up logging on some of the most problematic boards and I will check them. However, the fact that the topic was not included made the problem harder to uncover, because only the switch logs showed what happened.

I would give you Karma - but don’t know how to do it. I registered here today; up to now I was just a reader, not a writer. However, this “flap problem” made me do it.

Really, we also have about 200 RB4xx boards in our network, and as time went on we encountered more and more random reboots (due to the IP watchdog - the event was logged in the MK log) and RSTP topology changes (where RSTP is used) when there shouldn’t have been any (under normal conditions - no link overloads, attacks, power interruptions etc.). And in the switch logs we found mysterious flapping of the ethernet interface, even though the cabling (factory-made patch cables) was measured and in some cases replaced, without success. The boards were powered by AC/DC adaptors, not by PoE.
So I started to look around the net to see whether anybody else has a similar problem, and found quite a few posts (even in our Czech ISP MK forums).

normis · August 30, 2011, 12:25pm

I would give you Karma - but don’t know how to do it.

click on the “plus” button under his name on the left.

I registered here today; up to now I was just a reader, not a writer. However, this “flap problem” made me do it.

did you also contact support? this is a community support. developers don’t read everything.
