I have a crs328-4c-20s-4s+ that is having the following weird issue:
ISSUE:
When I plug an SFP into SFP port 5, I get “PSU1 entered state FAIL” and “PSU2 entered state FAIL”. Note that to replicate this I don’t need to actually connect a cable to a device/client; just plugging in a fiber SFP produces this behavior. This ONLY impacts the operation of sfp5 (or is the hardware on sfp5 bad and just causing the log message?), which auto-negotiates to 1000 Mbit half duplex and goes down every couple of minutes. I also notice fluctuation in TX power and sometimes a complete absence of it (the TX power field is just empty). RX power does not have the same issue.
ATTEMPTS TO RESOLVE:
So far I have 1) rebooted the switch, 2) updated the firmware to 6.49.10, 3) updated the routerboard firmware to 6.49.10, 4) changed the power cables (without cutting off power: first changed one and then the other), and 5) tried multiple fiber SFPs from multiple vendors. Nothing resolves the problem, neither the log message (PSU says state FAIL) nor the sfp5 malfunction. The only thing I have NOT tried is to power off the switch completely and power it back ON (I will have to be here physically during dead time to do that).
WHEN DOES THE MESSAGE GO AWAY:
The log message about the PSU only goes away if I unplug the SFP. “PSU1/2 returned to state OK”
WEIRD BEHAVIOR:
The issue does not occur when plugging in an rj45 sfp, although I have not left it plugged in for long (maybe it happens later). The port auto-negotiates to full duplex and does not show the error message.
Configuration-wise it’s the same as all other ports on the switch so it could not be a config issue. You got any ideas as to what exactly is happening? Any suggestions as to how to resolve it?
Thanks for your time guys!
It seems that there’s a single I2C bus in your device and that both the power supplies’ management and the SFPs’ DDM interfaces connect to that bus. And if some device hogs that I2C bus, then polling the status of other devices times out. It’s hard to tell why SFP5 seems to be a problem. It could be a (manufacturing?) defect of the SFP5 cage which only breaks things when a module is inserted … or is it the module itself at fault (then the problems would start when inserting that module into any of the SFP cages) … or is it a software (ROS) bug, triggered by some weird condition.
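To make the shared-bus idea a bit more concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the device names, the polling order, the timings, and the rule that one overlong response wedges the bus for the rest of the cycle. It is not how the switch firmware is known to work; it only shows why a single misbehaving participant could make unrelated devices such as the PSUs report FAIL.

```python
# Toy model only: the polling order, timings, and "wedged bus" rule are
# assumptions made for illustration, not MikroTik's actual firmware logic.
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    response_ms: int  # how long the device holds the bus when polled


def poll_cycle(devices, timeout_ms=50):
    """Poll devices in order on one shared bus and return name -> status."""
    status = {}
    bus_wedged = False
    for dev in devices:
        # Once one device overruns the timeout, the bus is treated as stuck,
        # so every device polled after it in this cycle fails as well.
        if bus_wedged or dev.response_ms > timeout_ms:
            status[dev.name] = "FAIL"
            bus_wedged = True
        else:
            status[dev.name] = "OK"
    return status


healthy = [Device("sfp5-ddm", 5), Device("PSU1", 5), Device("PSU2", 5)]
hogging = [Device("sfp5-ddm", 500), Device("PSU1", 5), Device("PSU2", 5)]

print(poll_cycle(healthy))
print(poll_cycle(hogging))
```

In this model the PSUs themselves are fine in both runs; they only show FAIL in the second one because the slow DDM read ahead of them never lets their poll complete.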
Try to narrow down the conditions under which the problem occurs … and if you feel it’s a software bug, take a supout file (while the problem is active) and send it to MT support.
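One possible way to narrow things down is to line up the PSU log messages with the times a module was inserted or removed. The Python sketch below pairs each “entered state FAIL” with the next “returned to state OK” for the same PSU. The message wording is taken from the posts above; the assumption that each saved log line starts with an HH:MM:SS timestamp, the sample lines, and the function name `psu_fail_windows` are hypothetical, so adjust the pattern to whatever your log export actually looks like.

```python
import re
from datetime import datetime

# Assumed line shape: "HH:MM:SS <anything> PSUn entered state FAIL".
EVENT = re.compile(
    r"^(\d{2}:\d{2}:\d{2}).*?(PSU\d+) (entered state FAIL|returned to state OK)"
)


def psu_fail_windows(lines):
    """Return (psu, start_time, seconds) for each completed FAIL..OK window."""
    open_since = {}  # psu -> time string of the first unmatched FAIL
    windows = []
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue  # not a PSU state line
        stamp, psu, state = m.groups()
        if state.endswith("FAIL"):
            open_since.setdefault(psu, stamp)
        elif psu in open_since:
            start = open_since.pop(psu)
            delta = datetime.strptime(stamp, "%H:%M:%S") - datetime.strptime(
                start, "%H:%M:%S"
            )
            windows.append((psu, start, int(delta.total_seconds())))
    return windows


# Hypothetical sample lines, not copied from a real switch.
sample = [
    "10:00:05 PSU1 entered state FAIL",
    "10:00:05 PSU2 entered state FAIL",
    "10:01:00 (unrelated line)",
    "10:03:35 PSU1 returned to state OK",
    "10:03:36 PSU2 returned to state OK",
]

for psu, start, seconds in psu_fail_windows(sample):
    print(psu, "failed at", start, "for", seconds, "s")
```

If every window starts right after a module goes into one particular cage and ends when it is pulled, that points at the cage or module rather than the power supplies. The sketch does not handle windows that cross midnight.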
Well, I have tried a lot of things. I can tell you for sure that the module is not at fault, since I have tried multiple sfp modules from multiple vendors and get the same result. Plus, I have also tried swapping a working module (from sfp10 specifically) with the one I was using in sfp5, and again the known working module (from sfp10) exhibited the same behavior in sfp5. sfp10 worked fine with the module that was previously in sfp5.
I can see your idea of a single bus holding some water, considering that on another crs328-4c-20s-4s+ (we got plenty of those) a faulty module brought down the entire switch (only the power leds were working and not even the serial port produced any output!) until it was removed and the switch rebooted. But that does not seem to be the case here (see above). What makes me go bonkers is that an rj45 module works fine (no problem at all), unless those work differently than fiber sfp modules in terms of their electronic circuitry and it is normal that one causes the issue and the other one does not. Anyway, I’m not well-versed in circuitry (had to google to understand your suggestion about the i2c bus).
What are we left with then? A software bug that has occurred in just one of more than ten identical switches? Any suggestions as to how to narrow down the problem even further?
By shuffling SFP modules around you more or less proved it’s likely a matter of a faulty SFP cage. I’ve no idea if it can be (easily) repaired. So you should avoid using it (yeah, I know). Or try to find a module that doesn’t trip the problem and is useful to you; my hunch is that a non-DDM module might be fine.
Seems like you’re on to something here. The only non-DDM modules I got available rn are those s-rj01 from mikrotik and those are working fine. If I understand correctly there must be a problem with the bus connecting to sfp5 and relating to the DDM controller in the optical sfp module. Anyway, I will attribute the issue to some defect in the cage and just not use it. Thanks so much for giving me your insights, mkx!
Update: Damn it, issue is happening again on another switch. Think I will take the supout path. Anybody from Mikrotik got any idea what might be wrong? The same problem has occurred with firmware 6.49.8 and 6.49.10. I suspect it’s sfp14 this time, although I’m not sure because the switch is not physically near me.