Unfortunately, it looks like there’s something in the way the PCIe card reboots that upsets HP servers.
When I reboot the CCR, the server INSTANTLY hangs, with errors like “Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000013, Status 0xFE200000’000C110A, Address 0x00000000’80500000, Misc 0x44FC3816’00402086)” and “PCI Bus Error (Slot 2, Bus 0, Device 2, Function 0)”
Has anyone managed to get these cards working with enterprise-ish servers at all? I can understand that a normal PC without all the failsafe/monitoring would probably not notice, or even mind, that a PCI card vanishes, but these bigger servers obviously do!
This SPECIFIC host is an HP DL360 G9, with 2 x E5-2687W v3’s, but I suspect that almost all servers are going to have a similar problem.
Or perhaps not. If rebooting this device does make it (momentarily) vanish from the PCIe bus, then that will cause problems for hardware and OSes that cannot handle hot-plug events on PCIe. This is a fairly recent development (i.e. less than 10 years old) and not all hardware engineered around that time supports it … which is true for OS kernels as well. I would expect the latest hardware (e.g. HPE Gen11 servers) and OSes (e.g. Linux kernels 5.x …) to handle this without issues. Older servers (Gen9 is possibly affected) and older OSes (e.g. Linux kernel 4.x and older, Windows Server 2016 and lower) don’t expect this to happen and freak out.
Mind that a proper CCR2004-PCIe reboot may be harder on the PCIe host than a chip reset on a more ordinary PCIe card …
It causes a crash even if the machine is still in POST. It’s pretty clear that the ILO-equipped devices are far more picky about how PCIe devices are hot-plugged than a normal dumb machine. I plugged it into a generic Dell PC and rebooting the CCR didn’t bother the PC at all.
I don’t have a Dell server with a DRAC, so I can’t check with that, but two HP servers both crashed with similar errors when the 2004 was rebooted.
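On a Linux host, one thing that could be tried is telling the kernel about the disappearance in advance, through the standard sysfs `remove` and `rescan` files. A minimal sketch follows, with assumptions stated loudly: the helper names are made up for illustration, the address `0000:03:00.0` is a hypothetical placeholder (check `lspci` for the real one, and the card may expose more than one function), and it is untested whether this avoids the crash at all. If the platform firmware raises the error by itself, it probably will not help.

```python
from pathlib import Path


def pci_address(domain: int, bus: int, device: int, function: int) -> str:
    """Format a PCI address the way Linux sysfs names it, e.g. 0000:03:00.0."""
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"


def detach(address: str, sysfs: Path = Path("/sys")) -> None:
    """Ask the kernel to remove the device (run as root, before rebooting the card)."""
    (sysfs / "bus/pci/devices" / address / "remove").write_text("1")


def rescan(sysfs: Path = Path("/sys")) -> None:
    """Ask the kernel to re-enumerate the PCI bus (run once the card is back up)."""
    (sysfs / "bus/pci/rescan").write_text("1")


# Hypothetical address, for illustration only.
CARD = pci_address(0, 3, 0, 0)
```

The intended order would be `detach(CARD)`, reboot the card from RouterOS, wait for it to come back, then `rescan()`.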
There’s obviously something that the HPs don’t like about them unplugging and replugging themselves - even though I can literally PHYSICALLY unplug a normal NIC while the machine is on.
Another possibility would be a power surge following a full board reset … it could momentarily overload the PCIe bus power capabilities. I wasn’t able to find a good reference on that, but off the top of my head the PCIe power budget is somewhere around 30-50W, possibly depending on PCIe version and the particular mainboard implementation.
Typical max PCIe slot power is 75W and it’s quite common for GPUs or things like AI/crypto accelerators to use all of it… continuously.
The Mikrotik card should not be consuming that much power, and any brief spikes should be smoothed out by the capacitors on the board.
So I’d say a hardware cause is far less likely than some software issue in ILO or BIOS, especially if it keeps happening only on a range of HP servers and not with other vendors…
When I first read about this card I thought “wow! that is interesting! that could be the ideal router for a 1U server in a colocation datacenter!”.
But issues like what you mention here (and also the reverse: what will happen to the card when the host is rebooted? will it remain running?), plus the lack of drivers for VMware ESXi, have made me put it aside as “unfortunately, not so interesting”.
What we need is a card that can function almost independently from the host, with only the data path via the PCI bus. So we can tie a cable from the ethernet port to the ILO/DRAC port, and have a 10Gbit or faster link to the ISP. And then be able to set up a VPN to manage the host via ILO and directly on the network interface.
If there is ever a revision 2 of this card, adding optional external power (a jack, PoE, or an internal connector) would help. That way it would be possible to run it at all times, no matter what state the PC is in. Or even use it as a “router on a stick”, like some users tried (and fried) using a mining riser.
As for drivers, that’s still a problem. Too bad it requires modified ones so running any unsupported OS is just out of the question…
I’m not sure whether that was my PicoPSU or the riser. I measured the outputs of the PicoPSU after the fuse blew and they were OK. So I tend to think the riser had issues; a capacitor exploded on it, so I can’t test that any more.
I would not use a plastic case, but something made of sheet metal for fire protection (it should run 24 hours a day for many years). Of course it’s doable, but it takes time and effort to manufacture.
Cooling should be considered as well. I don’t have the time for a custom-made one.
So to me it is an issue. Not an unsolvable one, I could even take one of my MiniITX cases.
They are however bulky compared to an RB5009. Taking care of low-noise cooling + custom case + power would also add to the costs.
Besides, my most important use case requires OpenBSD compatibility, which is not given, and there was no information on whether MT will support driver development.
It seems to me that the cards are perfect for that case. You don’t need a host for the cards and you could probably get multi-slot PCIe riser boards of higher quality. Besides, I’m sure you could just use any server case, powering the whole thing with a beefy standard server PSU.
The missing drivers are not needed in this case, while you get the best price/performance for a router from MT.
Great idea! The only problem I see is the availability of the cards…
I don’t see a need to hack a case for this card; in that situation I would just buy a router in a case.
My use-case is in colocation environments where you rent only a single 1U slot and thus there is no space for an external router, only your 1U server.
With this card you would be able to have an external router to protect the management interfaces of your server (ILO/DRAC, RDP, SSH etc) using a VPN, and have routing functionality in general.
The card would be placed in a slot of the server, the ethernet port connected to the ILO/DRAC port, and the SFP used to connect to the ISP network.
I did see a need to hack a case, because my use case is a remote location (“spoke” - meaning my family’s flat) equipped with an x86 OpenSense box, which should have been prepared for 10G interfaces.
At the time of purchase it was no big additional cost to buy a CCR2004 card instead of a dual SFP+ Intel card.
It turned out that the OpenBSD drivers for the CCR2004 are unusable.
My next idea would have been to use my CCR2004 as a standalone router instead of an RB5009, but there were some issues…
I just installed two of these cards in 2 separate servers. When I reboot the Mikrotik router, the server running ESXi crashes and hangs on a purple screen. The one running Linux reboots. I believe it is a bug, and maybe the reboot command is what causes the server crash. I was going to use them in all of my servers but I cannot go on with this problem. I have to shut down the server safely, but then the Mikrotik card also shuts down. Crazy chicken and egg problem.
Yeah, it is clear that this card, which seemed very attractive for co-located servers requiring a router, is not usable in practice.
E.g. it would also have to keep running when the system is in STANDBY state, so you can power off the server via ILO/DRAC and still be connected to send a power-on command later.
So, it all was just a dream.
Probably the card is doing some strange PCI init thing, which causes the machine to crash or forces a reboot.
If a soft restart (like a watchdog reboot) can be survived by the host, and only a firmware upgrade would need a cold restart,
that would be something I can live with.
Until then, I can only ask whether someone builds a 2U chassis (like those for crypto mining) where I can put some of those cards and which does nothing but power the PCI bus with redundant power supplies?
I’ve been running this card in an HP XL420/Apollo 4200 Gen9 on fw 7.8 for a while, without any crashes. Without further research, I deployed 7.12 and 7.16 in quick succession and found out about the reboot bug right after, but since then I have experienced 2 random crashes with the same
Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000013, Status 0xBE200000'000C110A, Address 0x00000000'80500000, Misc 0x54FC3816'00402086)
message, without any manual reboots of the card.
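For anyone comparing this status value with the one from the reboot crash earlier in the thread (0xFE… there, 0xBE… here), here is a small sketch that splits the 64-bit word into the architectural fields of IA32_MCi_STATUS as the Intel SDM defines them. The function name is made up for illustration, and it makes no attempt to interpret the bank number or the model-specific code, which would need the processor-specific tables.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, machine-check chapter).
FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # overflow: another error arrived before this one was cleared
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    fields["mca_code"] = status & 0xFFFF             # bits 15:0
    fields["model_code"] = (status >> 16) & 0xFFFF   # bits 31:16
    return fields


reboot_crash = decode_status(0xFE200000_000C110A)  # value from the card-reboot crash
random_crash = decode_status(0xBE200000_000C110A)  # value from the random crashes

# List the fields that differ between the two events.
differing = [k for k in reboot_crash if reboot_crash[k] != random_crash[k]]
print(differing)  # ['OVER']
```

So the two events carry the same error code (0x110A) and differ only in the overflow flag, which is set for the reboot crash and clear for the random ones.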
For now, I have downgraded the system package to 7.12.1 while leaving the routerboard firmware on 7.16.1, and will report back on any further crashes.