Interface(s) don't flow unless manually taken up/down

I have a couple of CRS305s and a strange issue where interfaces sometimes (in particular on boot/reboot) don’t pass traffic until they are toggled down and then back up. As a workaround I’m going to write a simple script that does this when RX or TX packets stay at 0 or 1 per second for some period of time, but it would be nice to understand why this is happening.

Seems similar to this issue, but the setting described there is only available on x86:
http://gregsowell.com/?p=2361

See the attached screenshot showing stats for interface Host3: there are RX packets but no TX packets. I believe that means the switch is receiving packets from the host but not sending anything back, correct?
Screenshot 2020-08-20 at 10.29.52 AM.png
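
For reference, the same counters can be read from the terminal. A minimal sketch, assuming the port is named host3 as in the script further down (adjust to your own interface name):

```
# cumulative RX/TX packet and byte counters for one interface
/interface print stats where name="host3"
# one-shot sample of the current RX/TX rates (packets and bits per second)
/interface monitor-traffic host3 once
```

The counters are from the switch's point of view: RX is what the switch receives from the host, TX is what it sends to the host.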
The other side of the link is a VMware ESXi host. In VMware I can set failure detection to “link status only” or “beacon probing”, but beacon probing comes with many caveats about whether it will work, and it did not seem to work the last time I tried. The other settings I can choose on the ESXi side are “load balancing” (with a choice of 3 routing options) or an explicit failover order. I don’t think these have any effect, because whatever the problem is does not get detected as a failed link (until I manually reset the link).
https://docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.networking.doc/GUID-D34B1ADD-B8A7-43CD-AA7E-2832A0F7EE76.html
https://kb.vmware.com/s/article/1005577
https://blogs.vmware.com/vsphere/2008/12/using-beaconing-to-detect-link-failures-or-beaconing-demystified.html

This may or may not be related to these being 10GbE connections, which I don’t think auto-negotiate. Also, I am running MSTP between these two switches plus one Netgear switch (but can MSTP be the issue if most traffic never leaves the switch and stays between the host1/host2/host3 ports?)

Below is a network diagram. I have two MST IDs with a couple of VLANs in each, and they seem to be working fine except that link failure detection is not really working. If I disable an interface from a RUNNING/OK state, the traffic moves over to the other CRS switch as intended. But nothing works in this “undetected failed” state, where I guess ESXi and/or the CRS believe the link is OK when it is not. Something is blocking traffic somewhere, and that seems to get resolved by resetting the interface.
server-topology.png
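
One way to rule MSTP in or out would be to look at the bridge port state while a port is stuck. A sketch only, assuming the port is named host3 and is a member of a bridge:

```
# show the spanning-tree role and forwarding state the bridge currently reports for this port
/interface bridge port monitor [find interface="host3"] once
```

If the port reports as forwarding while traffic is still not flowing, spanning tree is probably not what is blocking it.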

Been continuing research on this and found a number of things that seem possibly relevant.

This rather old post from 2013 talks about autonegotiation mismatches and the problems they can cause. That seems plausible for my situation because I have not been able to get autonegotiation settings to stick on the VMware ESXi side. So I have disabled autonegotiation for now to see if that helps.
https://herdingpackets.net/2013/03/21/disabling-gigabit-link-negotiation-on-fiber-interfaces/

I can’t quite determine whether autonegotiation should be enabled, or even whether it is supported, on these 10G DAC links. It seems to matter whether the link is 10GBASE-T or another 10GbE medium. Some standards say it is required, while lots of people online say it is best practice to hard-code the settings and not use it on 10G. But as the above link says, a mismatch seems like a definite problem, so I have disabled it since it shows as disabled in my VMware environment.
https://networkn3rd.wordpress.com/2013/04/15/10g-auto-negotiation/
http://noahdavids.org/self_published/gigabit-AN.html
https://forums.juniper.net/t5/Ethernet-Switching/10Gb-SFP-Autonegotiation/td-p/314645
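
For comparison, this is roughly how the setting looks on the RouterOS side. A sketch only, assuming RouterOS v6 property names and a port named host3; the accepted speed values differ between RouterOS versions, so check what your release offers before applying it:

```
# show link status, negotiated rate and auto-negotiation state of the port
/interface ethernet monitor host3 once
# force 10G full duplex with auto-negotiation off (v6-style values, an assumption)
/interface ethernet set [find name="host3"] auto-negotiation=no speed=10Gbps full-duplex=yes
```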

Then somehow I stumbled on “Unidirectional Link Detection” (Cisco UDLD), which seems to describe a problem that might manifest with symptoms similar to what I am seeing. Is there a similar protocol or implementation on MikroTik?
https://en.wikipedia.org/wiki/Unidirectional_Link_Detection

Finally, I wonder if it could matter that I had flow control disabled?
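
To test this, flow control can be changed per port. A sketch, assuming a port named host3; whether forcing it on is the right choice for a DAC link to ESXi is not something confirmed here:

```
# show the current settings of the port, including rx-flow-control and tx-flow-control
/interface ethernet print detail where name="host3"
# force flow control on in both directions
/interface ethernet set [find name="host3"] rx-flow-control=on tx-flow-control=on
```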

Here is a script that I found on the forums here and modified to reset what I’m calling a “blackhole” condition, because I came across that term in my research above and it sounded cool. I have not confirmed whether my condition is the same as what is described elsewhere under that name.

The email feature from the source script is just commented out here, and I have not cleaned up the script since roughly modifying it.

Usage: add your interface names to the ifaceList array on the first line. Don’t modify the =0 part. Schedule it to run as desired (I have mine set to run 3 and 8 min after boot, but I think I stagger it on the redundant, cross-connected switch so they are not fighting each other). I also let it run every 15 or 20 min (even though I don’t really think I need to) and it doesn’t seem to have any unwanted effects.
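
A minimal sketch of that schedule, assuming the script is saved under /system script with the hypothetical name ifacecheck (the scheduler entry names are made up as well):

```
# run once, 3 minutes after boot
/system scheduler add name=ifacecheck-boot-3m start-time=startup on-event=":delay 180s; /system script run ifacecheck"
# run once, 8 minutes after boot
/system scheduler add name=ifacecheck-boot-8m start-time=startup on-event=":delay 480s; /system script run ifacecheck"
# run every 20 minutes
/system scheduler add name=ifacecheck-periodic interval=20m on-event="/system script run ifacecheck"
```

Using different delays on the second switch would keep the two from resetting the cross-connect at the same moment.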

The script will…
(1) First check whether each interface SHOULD be working (status showing “link-ok”) and, if so, set its entry in ifaceList to 1.
(2) Then go through the interfaces again looking for any with under 2 pps RX or TX, and mark those with 911 in ifaceList (911 being the emergency telephone number in the US). My error condition results in one side of a link flapping between 0 and 1 pps, and it never seems to be more than that.
(3) Reset those interfaces with a delay.
(4) Check whether that fixed it, and if not go to 3 and repeat a couple of times.


:local ifaceList {"host1"="0";"host2"="0";"host3"="0";"switch1"="0"};
:local iname;
:local monitor;
:local logMsg "ifacecheck: [info] [Port PPS Check] (tx/rx): ";
:local speedRX;
:local speedTX;
:local targetSpeed 2;
:local cycleNumber 3;
:local downtime 3;
:local sleepBetween 5;
:local problems 0;
:local actions 0;
:local trying false;


# Define variables for sending email
#:local mailServerName "PUT_YOUR_MAIL_SERVER_NAME_HERE";
#:local mailServerIp [:resolve $mailServerName];
#:local mailServerPort PUT_YOUR_MAIL_SERVER_PORT_HERE;
#:local mailFrom "PUT_MAIL_FROM_HERE";
#:local mailTo "PUT_MAIL_TO_HERE";
#:local mailSubject "WRITE_YOUR_MAIL_SUBJECT_HERE";
#:local mailUser "PUT_YOUR_MAIL_USER_HERE";
#:local mailPass "PUT_YOUR_MAIL_PASSWORD_HERE";
#:local mailBody "PUT_MAIL_BODY_HERE";

# define sendMail function (commented out together with the variables above)
#:global sendMail do={
#    /tool e-mail send server=$mailServerIp port=$mailServerPort from=$mailFrom to=$mailTo subject=$mailSubject body=$mailBody user=$mailUser password=$mailPass;
#}

# INITIAL CHECK ALL INTERFACES

# loopCounter is declared outside the foreach so the remediation loop further down can see it
:local loopCounter 0;

:foreach ethName,failCount in=$ifaceList do={

    # verify iface is enabled/showing OK before running checks
    :set $monitor [/interface ethernet monitor $ethName as-value once];
    :if (($monitor->"status") = "link-ok") do={
        :set ($ifaceList->$ethName) "1";
    }

    # sample the current packet rates
    :set $monitor [/interface monitor-traffic $ethName as-value once];
    :set $speedRX ($monitor->"rx-packets-per-second");
    :set $speedTX ($monitor->"tx-packets-per-second");

    # CHECK THE INTERFACE: link is up but RX or TX is below the target pps
    :if ((($speedRX < $targetSpeed) || ($speedTX < $targetSpeed)) && (($ifaceList->$ethName) = "1")) do={
        :log warning "ifacecheck: <41> critical Port $ethName.supafab.com current pps (tx/rx): $speedTX/$speedRX, target pps >= $targetSpeed";
        :set ($ifaceList->$ethName) "911";
        :set $problems ($problems + 1);
    }
    :set $logMsg ($logMsg . "$ethName.supafab.com: $speedTX/$speedRX; ");
    # end of single interface check
}

:log info $logMsg;

# REMEDIATION FOR FAILED (if any)

:while (($loopCounter < ($cycleNumber - 1)) && ($problems > 0)) do={

    :set $loopCounter ($loopCounter + 1);
    :if (!$trying) do={
        :log error "ifacecheck: Starting interface reset procedure >>> ";
    }

    # DISABLE ANY BAD

    :foreach ethName,failCount in=$ifaceList do={

        #:log debug "ifacecheck: the array value for $ethName is ";
        #:log debug ($ifaceList->$ethName);

        :if (($ifaceList->$ethName) = "911") do={
            :log warning "ifacecheck: [critical] disabling bad interface $ethName.supafab.com - try $loopCounter";
            /interface ethernet disable $ethName;
            :set $actions ($actions + 1);
        }
    }

    # end disable

    # WAIT

    #:log debug "ifacecheck: [debug] pausing for $downtime before re-enable";
    :delay $downtime;

    # RE-ENABLE

    :foreach ethName,failCount in=$ifaceList do={
        :if (($ifaceList->$ethName) = "911") do={
            :log warning "ifacecheck: warning - re-enabling interface $ethName.supafab.com - end of try $loopCounter";
            /interface ethernet enable $ethName;
        }
    }

    # end re-enable

    # give the link time to come back up before sampling again
    :delay ($downtime * 3);

    # RECHECK

    :set $logMsg "ifacecheck: [info] [Port PPS Check] (tx/rx): ";

    # run through ifaces again
    :foreach ethName,ethStatus in=$ifaceList do={

        # if this interface was reset then do stuff
        :if ($ethStatus = "911") do={
            :set $monitor [/interface monitor-traffic $ethName as-value once];
            :set $speedRX ($monitor->"rx-packets-per-second");
            :set $speedTX ($monitor->"tx-packets-per-second");

            :log debug "ifacecheck: recheck $ethName.supafab.com RX: $speedRX pps TX: $speedTX pps";
            :set $logMsg ($logMsg . "$ethName.supafab.com: $speedTX/$speedRX; ");

            # DID WE FIX IT? (both directions back at or above the target pps)
            :if (($speedRX >= $targetSpeed) && ($speedTX >= $targetSpeed)) do={
                :log info "ifacecheck: Interface target pps $targetSpeed restored for $ethName.supafab.com";
                :set ($ifaceList->$ethName) "1";
                :set $problems ($problems - 1);
            } else={
                :log warning "ifacecheck: [critical] $ethName.supafab.com is still bad in check loop $loopCounter of $cycleNumber";
            }
            # end of did we fix it if else
        }
        # end of if this interface was reset
    }
    # end of foreach recheck loop

    :set $trying true;

    :log info $logMsg;
    :log debug "ifacecheck: end of remediation round $loopCounter";

    :delay $sleepBetween;

}

# end big remediation loop

# EMAIL NOTIFICATION

:if ($actions > 0) do={
    :log info "ifacecheck: [info] resets performed - residual problems: $problems; reset attempts: $actions";

    # email is commented out above; uncomment this call once sendMail is defined
    #$sendMail;

} else={
    :log debug "ifacecheck: debug - woo hooo we checked and there were no blackhole interfaces!";
}