> /tool/profile cpu=all duration=10s
Columns: NAME, CPU, USAGE
NAME          CPU   USAGE
snmp            0    0.5%
ethernet        0    2.5%
console         0      0%
firewall        0      5%
networking      0   11.5%
logging         0      0%
management      0   12.5%
wireless        0      3%
encrypting      0      5%
routing         0   21.5%
ssl             0      1%
profiling       0      0%
bridging        0      0%
unclassified    0      4%
cpu0                66.5%
snmp            1      0%
ethernet        1      5%
firewall        1     10%
networking      1   30.5%
management      1    1.5%
wireless        1    3.5%
encrypting      1   14.5%
routing         1     13%
ssl             1      2%
bridging        1    3.5%
unclassified    1      8%
cpu1                91.5%
One of the culprits was a 10ms keepalive time on BGP sessions. For some reason crossfig, or whatever it's called, decided that keepalive=1s in 6.x means keepalive=10ms in 7.x (which is impossible to set by hand). Not only that: the routing engine decided to obey, and happily spammed keepalives 100 times per second. Setting it back to 1s and restarting the sessions reduced CPU usage by 20%.

Do you have fast-track enabled?
Also, there is no route-cache on rOS v7.x.x like rOS v6 had.
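The keepalive fix mentioned earlier in the thread would look roughly like this in the v7 CLI. Treat the paths and the disable/enable step as assumptions, not verified syntax; check the converted values on your own box first:

```routeros
# Inspect what the 6.x -> 7.x conversion produced for each peer
/routing/bgp/connection print detail

# Put the timers back to sane values (hold time conventionally 3x keepalive)
/routing/bgp/connection set [find] keepalive-time=1s hold-time=3s

# New timers are negotiated at session establishment, so bounce the
# connection (assumed here via disable/enable) to apply them
/routing/bgp/connection disable [find]
/routing/bgp/connection enable [find]
```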
To be frank, I have no idea... this info was what I got from support when I opened a ticket for high CPU usage on my RB4011, before the fasttrack issue was fixed.

BTW, if there's no route-cache on v7, then what does "/ip settings set route-cache=yes" do?
Maybe there wasn't, but now there is? Who knows.
Thanks for clearing this up. Please figure out why keepalive-time=1s gets converted to 10ms when upgrading from 6 to 7. I've seen this a few times already, but didn't investigate until now.

The route-cache setting is doing nothing; it will be removed in the future.
1 second is an unreasonably low keepalive time. You would normally not set the keepalive time, but rather set the hold time, and the keepalive time will be 1/3 of that.
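The hold-time/keepalive relationship described above is easy to state in code. This is a sketch of the RFC 4271 convention, not RouterOS internals:

```python
def keepalive_from_hold(hold_time_s: float) -> float:
    """BGP convention: keepalives go out at one third of the hold time,
    so up to two consecutive keepalives can be lost before the peer
    declares the session dead."""
    return hold_time_s / 3.0

print(keepalive_from_hold(3.0))   # minimum hold time -> 1.0 s keepalive
print(keepalive_from_hold(90.0))  # a common default  -> 30.0 s keepalive
```

This is also why a 1s keepalive only appears when someone has already pushed the hold time down to its 3s floor.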
The only thing that's unreasonable is converting 1000 milliseconds to 10 milliseconds when upgrading from 6.49.6 to 7.2.1. When BFD arrives one day, I'll use it to get response times faster than 3 seconds; until then I guess I'll have to live with a 3-second lag until redundancy kicks in after the usual fiber vs. rat or fiber vs. runaway excavator incident.
Indeed, 3s is the lowest hold time and it would result in a 1s keepalive time, but in cases where you want fast BGP response to link-down I think it is better to use BFD.
(unfortunately BFD does not yet work in v7 but it is "promised" to arrive soon)
Increasing the frequency of keepalives does not make BGP converge faster; the hold time is what controls that. The only reason to set very frequent keepalives is when the latency or packet drop on the working link is so high that you need to send 10 keepalives to make sure at least one of them reaches the destination within 10 seconds.

That would not work, as BGP runs over TCP, not UDP. Keepalives are inserted above TCP, and it is the TCP retry mechanism that governs sending them to the other side. A well-implemented TCP would not even try to send the newly added data before the (re)transmission timers for the existing data kicked in (or an ACK is received).
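The convergence point above can be illustrated with a toy hold-timer check (a sketch of RFC 4271 hold-timer semantics, not router code): no matter how often keepalives were flowing before the failure, the peer is only declared down once the hold time expires.

```python
def peer_is_down(last_msg_at: float, now: float, hold_time: float) -> bool:
    """BGP hold-timer logic: the session drops only when hold_time passes
    with no KEEPALIVE or UPDATE received, regardless of how frequently
    keepalives were being sent."""
    return (now - last_msg_at) >= hold_time

# Link fails at t=0 with a 3 s hold time. Even if keepalives were sent
# 100 times per second before the failure, detection still takes 3 s:
print(peer_is_down(last_msg_at=0.0, now=2.9, hold_time=3.0))  # False
print(peer_is_down(last_msg_at=0.0, now=3.0, hold_time=3.0))  # True
```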