v7.2.2 [stable] and v7.2.3 [stable] are released!

Paternot · May 26, 2022, 2:38pm

The routing cache was removed from the Linux kernel a long time ago because of security issues. This only shows up in version 7.x because RoS 6.x runs a really ancient kernel.

No, it is NOT a fake result. What DOES happen is this:

Route caching works with a limited number of routes. Exceed this limit, and it has no benefits.
A router used in a SoHo environment will probably get faster routing with route cache, since it will use a small number of routes.
A router used in an ISP environment probably won’t benefit from it - since it will see a huge number of routes.

Some people say this is a fake result, since it would give a greater speed with Speedtest and similar services than with “real” usage. I disagree: what counts as “real” usage changes with every case - and SoHo routing is a huge use case for Mikrotik.

But, yes: RoS 7.x uses more CPU than 6.x, due to the removal of the routing cache. And it will NOT come back. We will have to adapt, and specify our routers accordingly.

system · May 26, 2022, 2:46pm

Thanks a lot for taking the time to explain. Your comment should be pinned somewhere, for whoever asks himself the same questions as me tomorrow after having bought his router.

tangent · May 26, 2022, 3:17pm

I can’t find information on the size of this route cache, when it existed. I wonder how much those 10-year-old conclusions hold up in today’s Internet, where a given web page might result in a hundred hits, many to servers in disparate data centers. Or, take the case of CDNs, where simply “flipping channels” in your OTT app of choice might pull from multiple data centers for the same reason. Are we talking about a cache of a dozen routes? Hundreds? Thousands?

I don’t think the existence proof of RouterOS 6 tells the whole story. According to the slides for a presentation by one of the people who did the work to remove the route cache, an effort which finished nearly a decade ago, they had a goal of only a 10% speed hit for their udpflood test case, which seems even more artificial than Internet speed tests.

Regardless, the linked slide deck shows why they thought they had to remove it.

system · May 26, 2022, 3:47pm

just found the explanation from mikrotik support here : http://forum.mikrotik.com/t/ccr2004-high-cpu-usage-ros7/152163/3

By the way, now I don’t understand why V7 says: route cache: yes
Maybe it’s not related, but it’s confusing.

tangent · May 26, 2022, 3:59pm

Presumably that’s the “…completely different route lookup algorithm…” in ROS 7 that raimondsp mentioned in the post you linked.

winap · May 26, 2022, 4:23pm

Paternot:
Thank you so much!

mkx · May 26, 2022, 4:29pm

I found a nice article explaining some intrinsics of route caching … about dimensioning it says:

As a rule of thumb, two millions entries eat about 500 MiB of memory on a 64-bit system. You should be able to compute the average memory usage and the maximum memory usage from the values of net.ipv4.route.max_size, rhash_entries and net.ipv4.route.gc_elasticity. For example, if the route cache hash table has 262,144 buckets, the maximum allowed number of entries in the cache is 4,194,304 and net.ipv4.route.gc_elasticity is set to 8, the memory usage will be 500 MiB on average and 1 GiB max. If this is too much, you will need to lower some values.

so route cache size actually depends on available memory unless set otherwise.

I guess memory usage on 32-bit system is a bit less (but still more than half), but we’re still talking about a few hundred kB on modest home router and a few MB on a small-business router (and much more on ISP router which likely runs out of memory sooner rather than later making route cache less efficient).
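
To make the rule of thumb concrete, here is a rough Python sketch of the arithmetic. It only scales the quoted figure (about 2 million entries for about 500 MiB on 64-bit) linearly. The names `BYTES_PER_ENTRY` and `cache_memory_bytes` are made up for illustration, and the real per-entry cost depends on the kernel build.

```python
# Rule of thumb quoted above: about 2 million cache entries use about
# 500 MiB on a 64-bit system. Everything derived here is an estimate.
RULE_ENTRIES = 2_000_000
RULE_BYTES = 500 * 1024 * 1024

BYTES_PER_ENTRY = RULE_BYTES / RULE_ENTRIES  # roughly 262 bytes per entry


def cache_memory_bytes(entries: int) -> float:
    """Estimated memory for a route cache holding `entries` lookup results."""
    return entries * BYTES_PER_ENTRY


# 4,194,304 is the maximum number of entries from the quoted example.
for entries in (2_000, 10_000, 4_194_304):
    mib = cache_memory_bytes(entries) / (1024 * 1024)
    print(f"{entries:>9} entries ~ {mib:8.2f} MiB")
```

With these assumptions a few thousand entries stay well under a few MiB, while the 4,194,304-entry maximum from the quote lands at about 1 GiB, which matches the "1 GiB max" in the article.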

tangent · May 26, 2022, 4:34pm

So ~2000 routes with 0.5 MB of RAM. That’s usable for home scenarios. The proper argument therefore is security in a hostile environment rather than ineffectiveness.

mkx · May 26, 2022, 5:00pm

I guess route caching can be (relatively) ineffective in both extreme cases: in end-host case where anything but hosts from own subnet is behind single gateway … and in core routing business with large number of gateways where number of route cache entries easily rises beyond 2G.

pe1chl · May 26, 2022, 5:27pm

Wait a moment, people! The “route cache” and its size has nothing to do with the maximum route table size. And the effect has nothing to do with the number of routes.
The “route cache” does not cache routes, rather it caches route lookup results.
When the router has to send a packet to 1.2.3.4 it looks in the route table and finds “you need to send that to gateway 5.6.7.8 on interface ether1”.
When it has to send a packet to 192.168.88.10 it looks and finds “you need to send that to bridge1”.
That is the kind of things that route cache stored. The result of lookup actions. So when the next packet goes to 1.2.3.4 it does not have to look in the route table again.
(which in terms of CPU activity unfortunately is more complicated than you may think, because “more specific routes have more priority than less specific routes”, so you need to check many things before you can assume the default route is the correct one)

Of course this gets more effective when you have a more complex route table, but also it gets more effective when you have only a small number of addresses involved in traffic at a given time. The route cache can store only a couple of entries, or else the action “see if we have this address in our cache” will take similar time as “lookup in the route table”. So maybe it caches 5-10 entries.
When you are doing a speedtest, the router only has to route 2 addresses and all lookups are cache hits.
But when you are visiting a modern website or when multiple users are making multiple connections (e.g. your television, computer, mobile phone etc are all having a connection open), there will be more cache misses than hits, and the extra time spent to check if the address is in the cache is just wasted.

So in a test scenario it showed good results, but in practice it did not work. That is why it was removed, and why it won’t return.

holvoetn · May 26, 2022, 5:47pm

Hence my use of the term ‘fake results’

Jotne · May 26, 2022, 6:08pm

ppp,error,critical: 217.67..: Encryption got out of sync - disabling

At what severity level is this message?
Error or Critical

From Syslog Severity level
https://en.wikipedia.org/wiki/Syslog

0	Emergency
1	Alert
2	Critical
3	Error
4	Warning
5	Notice
6	Informational
7	Debug

Please Mikrotik fix the mess of information tagging.

http://forum.mikrotik.com/t/logging-prefix-is-a-mess-sup-105353-sup-144261-waiting-for-mt-to-support-rfc-5424/111067/1
I made this post 2017 and sent support information about this and was told that they will look at it, but nothing has changed.
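
As an illustration of why the tagging is ambiguous, here is a small Python sketch of what an external log handler might do today. The mapping from topic names to syslog severities and the "most severe topic wins" rule are assumptions made for this example, not documented RouterOS behaviour. `TOPIC_SEVERITY` and `split_log_line` are hypothetical names.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Syslog severity levels, as in the table above."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7


# Assumption: topics whose names look like a severity keyword.
TOPIC_SEVERITY = {
    "critical": Severity.CRITICAL,
    "error": Severity.ERROR,
    "warning": Severity.WARNING,
    "info": Severity.INFORMATIONAL,
    "debug": Severity.DEBUG,
}


def split_log_line(line: str):
    """Split 'topic,topic: message' and guess one severity from the topics."""
    topics_part, _, message = line.partition(": ")
    topics = topics_part.split(",")
    found = [TOPIC_SEVERITY[t] for t in topics if t in TOPIC_SEVERITY]
    # Assumed rule: the most severe (numerically lowest) topic wins.
    severity = min(found) if found else None
    return topics, severity, message


topics, severity, message = split_log_line(
    "ppp,error,critical: 217.67..: Encryption got out of sync - disabling"
)
print(topics, severity, message)
```

The handler has to guess because the line carries two severity-like topics at once. A single explicit severity field per message would make this guesswork unnecessary.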

pe1chl · May 26, 2022, 7:54pm

Likely the problem is that a complete overhaul of the logging system would make some people (who have invested time in handling the mess as it is) angry…
As I also wrote in the other thread it should be changed in such a way that you can filter individual messages in the logging topic filtering, both internal to the router and in external handler programs.
But that means existing matching is going to break.

Paternot · May 26, 2022, 8:31pm

It worked with far more than a few cached routes. Take a look at this link:
https://access.redhat.com/solutions/71333

He is complaining of a little more than 5% cache miss - out of a table with 7 million entries.

Yes, probably a small router wouldn’t have enough CPU (let alone memory) to do this - but it only needs to cache about 10k hashes in order to make a huge difference in a SoHo environment.

Jotne · May 26, 2022, 8:33pm

One of them is me, since I have to rewrite the Splunk Mikrotik logging.
But it’s worth it.

pe1chl · May 26, 2022, 8:57pm

Remember it was not MikroTik or me who deemed the cache ineffective and removed it. It happened in the Linux kernel development team, and MikroTik has to live with its removal.
There can be other optimizations. E.g. in a SoHo environment probably half of the traffic is routed using the default route, which is the most expensive route of all to look up.
But it would be possible to have a “bitmap” that has 1 bit corresponding to every possible IPv4 address in 2 megabytes of RAM, with the bits in that bitmap indicating “is this address to be routed using the default route yes/no”. Most of the current routers can easily spare 2 megabytes RAM, and a costly lookup for half of the packet routings would be replaced by a single index and bit test.
When the router is the really basic home setup with only TWO routes in the table (one to the local bridge and the other being the default route to internet), the bit test could even decide between those two cases and replace all routing table lookup.

But apparently it is not worth it.

Paternot · May 27, 2022, 12:36am

I’m not advocating to get it back - it had serious problems. I’m just pointing out that the results of its use weren’t fake - they were just faster for some use cases and slower for others.

mkx · May 27, 2022, 5:27am

One bit per address means 2³² bits which equals 2²⁴ bytes which is 16MB actually. Which is a very considerable portion of RAM installed in Mikrotik’s low-end devices. And searching through that amount of RAM using slow CPU is probably way slower than lookup in route table, SOHO devices will have only a few routes there. Route cache searches were done using hash value for a reason.

And we haven’t started with IPv6 yet, performance of it is slowly becoming mainstream news lately.

pe1chl · May 27, 2022, 8:21am

Ok, now we both made a calculation mistake… it would actually require 512MB, or 448MB when only considering the routable portion of the address space.
So less feasible than I thought. When it would be a bit per /24 subnet it would be the 2MB that I calculated.
However, the advantage of the method is that it does not require “searching”. It only requires indexing and bit testing. That is why it is fast.
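
A rough Python sketch of the bit-per-/24 variant, to show both the arithmetic and the "index and bit test" step. Two assumptions are mine: "routable" is taken to mean everything below 224.0.0.0 (7/8 of the address space), and a set bit means "a more specific route may apply, do the full lookup". `mark_not_default` and `uses_default_route` are hypothetical helper names.

```python
import ipaddress

# Sizes of the variants discussed above, in MiB.
PER_ADDRESS_MIB = 2**32 // 8 // 2**20    # one bit per IPv4 address
ROUTABLE_MIB = PER_ADDRESS_MIB * 7 // 8  # assumption: without 224.0.0.0/3
PER_SLASH24_MIB = 2**24 // 8 // 2**20    # one bit per /24 network

# One bit per /24. Bit set = "do a full route table lookup here".
bitmap = bytearray(2**24 // 8)


def mark_not_default(network: str) -> None:
    """Set the bit of every /24 touched by a non-default route."""
    net = ipaddress.ip_network(network)
    first = int(net.network_address) >> 8
    last = int(net.broadcast_address) >> 8
    for index in range(first, last + 1):
        bitmap[index >> 3] |= 1 << (index & 7)


def uses_default_route(address: str) -> bool:
    """One index and one bit test, no searching."""
    index = int(ipaddress.ip_address(address)) >> 8
    return ((bitmap[index >> 3] >> (index & 7)) & 1) == 0


mark_not_default("192.168.88.0/24")
mark_not_default("192.168.100.0/28")

print(PER_ADDRESS_MIB, ROUTABLE_MIB, PER_SLASH24_MIB)
print(uses_default_route("1.2.3.4"), uses_default_route("192.168.88.10"))
```

Note that a route smaller than /24 (the /28 here) marks its whole /24 as "needs a full lookup". That is conservative but safe: the bitmap only short-circuits addresses that can match nothing but the default route.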

For others: the issue with route table lookup is that your route table may contain different routes like this:
0.0.0.0/0 (the default route)
192.168.88.0/24 (the local network route when using MikroTik default config)
192.168.100.0/28 (maybe you have a small space for VPN connections)
etc

To look up an address in this table you cannot consider the table as one big array where you are looking for a number. You will effectively have to do up to 33 lookups.
First check if a /32 route exists that matches. If not, check if there is a /31 route. If not, check if there is a /30 route.
Etc etc. until you either find a matching route or you end at “check if there is a /0 route” which always matches.

Now of course the data structure for storing the routing table is optimized for fast lookup by dividing it in 33 different structures that each have some form of hashing.
So you walk along those 33 different structures (from a head pointer) and often most of them will be empty (e.g. there is no route with /1 or /2 in your table), so they can be quickly skipped without running expensive hash and compare operations.
But when there are several routes with different subnet sizes, it can still be an “expensive” operation. In my home router there are 272 active routes of many different subnet sizes (32,29,28,27,24,22,16,10,9,8 and 0). But that is not the typical home user situation of course. However, in an internet router there are many many more of them.
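
The lookup described above can be sketched in Python roughly like this. It is only an illustration of the "33 structures, longest prefix first" idea using plain dicts, not the kernel's actual data structure. `tables`, `add_route` and `lookup` are made-up names, and the next-hop strings (including "vpn") are placeholders.

```python
import ipaddress

# tables[plen] maps a masked network address (as int) to a next hop.
# One hash table per prefix length, /0 through /32.
tables = [dict() for _ in range(33)]


def add_route(prefix: str, next_hop: str) -> None:
    net = ipaddress.ip_network(prefix)
    tables[net.prefixlen][int(net.network_address)] = next_hop


def lookup(address: str):
    """Longest-prefix match: try /32 first, then /31, and so on down to /0."""
    addr = int(ipaddress.ip_address(address))
    for plen in range(32, -1, -1):
        table = tables[plen]
        if not table:
            # No routes of this size: skipped without hashing anything.
            continue
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        hop = table.get(addr & mask)
        if hop is not None:
            return plen, hop
    return None  # only reached when there is no default route


add_route("0.0.0.0/0", "gateway 5.6.7.8 on ether1")
add_route("192.168.88.0/24", "bridge1")
add_route("192.168.100.0/28", "vpn")

for dst in ("1.2.3.4", "192.168.88.10", "192.168.100.5", "192.168.100.200"):
    print(dst, "->", lookup(dst))
```

An address that only matches the default route, such as 1.2.3.4, has to fall through every populated prefix length before it reaches /0, which is why the default route is the most expensive case.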

The principle of the route cache was that a separate routing table was created with only /32 routes. Every time a lookup in the normal table succeeds, an entry in the cache table was created with the /32 address that was actually looked up, and containing the information from the /nn entry in the normal table where it was found. And some expiry timer.
Now for further routing always this table was searched first. But because it only has /32 routes it does not require the 33 iterations and the lookup will be faster.
Within a single subnet size the route table data structure uses a form of hashing to quickly find a matching entry so it does not have to do a linear search that becomes slower proportionally to the number of items in the cache. My estimation of 5-10 entries was far too low and based on another system that had a small cache that was linearly searched, indeed now I remember that in Linux it was much larger.
Apparently there still were problems with it. Also, I remember that it wasn’t automatically flushed when the routing table was changed, so when you e.g. manually added routes you needed to do a “ip route flush table cache” command or else the cache would return invalid results until its entries timed out by themselves. Probably not convenient when running an automatic routing protocol.
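
Here is a minimal Python sketch of that principle: a table of /32 results in front of a slower full lookup, with an expiry timer and a manual flush. It is an assumption-laden toy, not the old kernel code. `ROUTES`, `CACHE_TTL`, `cached_lookup` and `flush_cache` are invented names, and the 30 second timeout is arbitrary.

```python
import ipaddress
import time

# Hypothetical route table, reusing the example results from this thread.
ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"): "gateway 5.6.7.8 on ether1",
    ipaddress.ip_network("192.168.88.0/24"): "bridge1",
}

CACHE_TTL = 30.0  # seconds, arbitrary value for illustration
cache = {}        # /32 address (int) -> (lookup result, expiry time)
stats = {"hit": 0, "miss": 0}


def slow_lookup(addr: int) -> str:
    """Full longest-prefix match over the route table."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in ROUTES if ip in net]
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]


def cached_lookup(address: str, now=None) -> str:
    """Check the /32 cache first, fall back to the full lookup on a miss."""
    now = time.monotonic() if now is None else now
    addr = int(ipaddress.ip_address(address))
    entry = cache.get(addr)
    if entry is not None and entry[1] > now:
        stats["hit"] += 1
        return entry[0]
    stats["miss"] += 1
    result = slow_lookup(addr)
    cache[addr] = (result, now + CACHE_TTL)
    return result


def flush_cache() -> None:
    """What a manual cache flush has to do after the route table changes."""
    cache.clear()


# Speedtest-like traffic: only two destinations, so nearly all lookups hit.
for i in range(1000):
    cached_lookup("1.2.3.4" if i % 2 else "192.168.88.10")
print(stats)
```

The staleness problem is easy to reproduce with this toy: add a more specific route to `ROUTES` and `cached_lookup` keeps returning the old result for already cached addresses until the entry expires or `flush_cache()` is called.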

Finally, sometimes in Linux features are changed or dropped because the maintainer or a small committee does not see the value of it. Often they do not consider broader use cases than they have in their own world. This is clearly visible in the optimization of several core parts of the system (e.g. systemd, network manager) towards the use-case of a laptop that has to boot quickly, roam along different fixed and wireless networks that have DHCP, and has only a single active network connection.
Personally I do not care at all how fast the system boots, and I have more use for systems with several network connections, services running only on a part of them, and with a predictable startup order and -synchronization. But I have to live with the constant demolishing of the working solutions for that and replacement with facilities that work in the world of the writer but are inconvenient for me.
The route cache is only a minor example of such changes.

Larsa · May 27, 2022, 9:48am

Cisco and many other network tech companies (besides MT) have contributed a significant amount of work to Linux development especially to the network components mainly because they depend on certain features to work properly in their own solutions. Thus I’m pretty sure they had plenty of different use cases in mind when developing and testing the new routing components.

I think the risk is very small that a single individual could decide which parts are kept, especially for a key component such as routing, and given Linus’ (& co.) exceptional need for control of the kernel.