I’ve been testing RouterOS for about a month and there are just too many bugs.
1. Ping
Pinging a host on the same network from the Mikrotik always reports packet loss, but pinging the Mikrotik from that host reports none.
Same result on Mikrotik 4.10, 4.11 and 5.0rc1. The ICMP rate limit sysctl was set to 0.
From Mikrotik to a server
ping count=100 interval=10ms xxx.xxx.xxx.xxx
[omitted]
100 packets transmitted, 88 packets received, 12% packet loss
round-trip min/avg/max = 0/1.0/8 ms
From the server to Mikrotik
ping -q -A -c1000 xxx.xxx.xxx.xxx
PING xxx.xxx.xxx.xxx (xxx.xxx.xxx.xxx) 56(84) bytes of data.
2. Incorrect millisecond conversion
Creating a mangle rule in Winbox with a “Dst. Limit” criterion and expire set to 5000ms shows up as 50s, not 5s, when exported from the CLI.
Same result on Mikrotik 4.10, 4.11, 5.0rc1.
3. PPPoE - incorrect radius accounting counters
When a PPPoE user connects and disconnects frequently, the previous session’s Acct-Input-Octets/Acct-Output-Octets values are often not reset and get carried over to the next session. Raw accounting packets can be provided if needed. Tested on 4.10 and 4.11.
4. Mangle chain
Mangle rules do not work at all under version 5.0rc1.
Is anyone else seeing any of the above issues?
Lastly, is there a RAW table in RouterOS? I could not find one in the Wiki or in any of the documentation.
Not all IPs require conntrack, and under some circumstances, such as a DDoS attack, setting NOTRACK for certain IPs can ease CPU utilisation.
Finding 4 bugs (there are certainly more) in something as complex as ROS is quite a small number really.
Ping is a very blunt tool, and if it doesn’t work perfectly, I personally would not put it at the top of my ‘to fix’ list. Having said that, ICMP echo/reply is very easy to code, so I can’t imagine what’s wrong there.
Incorrect millisecond conversion - as you know about this, it’s easy to work around, and would also not be a Top Priority to fix.
PPPoE - incorrect radius accounting counters - that is worth reporting as it’s a more important thing to fix. No point having Accounts if the data is wrong.
Mangle chain … 5.0rc1 - Release Candidate - now you’ve reported it, hopefully it will get fixed.
Did you report all of these findings to support before posting to the forum?
I was more frustrated with the ping issue than the radius accounting issue. How many network admins assume ping could be reporting inaccurate results? Ping is the first tool that I use for any network issue. Except in some firewalled environments, it’s still a very useful diagnostic tool, and packet loss usually indicates that there is an issue such as congestion, a duplex mismatch etc. If such a basic tool goes wrong, even a basic network setup can take substantial time.
Incorrect conversion - I was not sure whether Winbox or the CLI was wrong, because the netfilter hashlimit module accepts a millisecond value, not seconds.
I’ve just reported the millisecond conversion and radius accounting issues.
Regarding the ping issue, I noticed you used a 10ms interval, which in RouterOS means any reply that takes longer than 10ms is counted as loss. Such replies also don’t show up in the max column, which I notice was 8, so it’s possible that 12% of the packets had a response time longer than 10ms rather than actually being lost.
Try 20ms or 30ms and see what happens. Look at both the loss and the maximum response time, then go from there to figure out whether the host just isn’t always responding promptly or whether it’s RouterOS. Some devices do not consider responding to ICMP echoes a top priority.
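To make the interval effect concrete, here is a minimal Python sketch of the behaviour described above. It is not RouterOS code: the RTT values are made up to match the 88/100 result in the first post, and the rule that a reply arriving exactly at the interval still counts as received is an assumption.

```python
def summarize(rtts_ms, interval_ms):
    """Summarize a ping run where a reply only counts if it arrives
    within interval_ms (assumed model of the behaviour described above)."""
    received = [r for r in rtts_ms if r <= interval_ms]
    sent = len(rtts_ms)
    return {
        "sent": sent,
        "received": len(received),
        "loss_pct": 100 * (sent - len(received)) // sent,
        # Late replies never reach the max column either.
        "max_ms": max(received) if received else None,
    }


# Hypothetical RTTs: 88 fast replies and 12 that take 15 ms.
rtts = [1] * 87 + [8] + [15] * 12

print(summarize(rtts, 10))  # the 15 ms replies are counted as lost
print(summarize(rtts, 20))  # same replies, longer interval, no loss
```

With these made-up numbers the same set of replies gives 12% loss and max 8 at a 10ms interval, and 0% loss and max 15 at 20ms.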
I haven’t noticed issues 2, 3, and 4. I would consider 2 minor; 3 is major and I will have to investigate that one to see if we’re seeing it; 4 would be major for those relying on it, but we don’t use mangle much here.
Yes, it seems ICMP replies with an RTT longer than the interval are discarded. 20ms shows no packet loss. Thanks!!
Here are the details for the radius issue: basically, the previous session’s in/out octet counters are carried over to the next session. So each time this user gets disconnected for some reason, the usage doubles, plus the new usage. This continues until a manual disconnection from the CLI. Has anyone had a similar experience?
DOH! Sorry to have missed such an obvious thing as the 10ms interval.
The Radius accounting records might not be any problem at all, depending on how the server side maths are done.
The accounting is done based on the ‘username’ in the database, and also on the difference between the last Accounting record and the one that has just arrived, unless an Accounting Stop and then an Accounting Start request are made.
So, if the session stops at say 1,000,000 bytes, the client disconnects, reconnects, and the byte counter is still at 1,000,000 bytes for the next accounting record, then the difference will be zero - no extra bytes registered.
If this happened 50 times, the difference would still be zero ‘new’ bytes.
The next proper accounting record might be 1,100,000 which would mean 100,000 ‘new’ bytes to account for.
I might be wrong, but it seems less of an issue if this is how something like Usermanager does it.
Now that I think about it, it would be more of a disaster if the Accounting Stop and Accounting Start were missed and the byte counters were zero - that would signify either a rollover of Acct-Gigawords + Acct-Bytes or a ‘info total garbage’ situation.
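The difference-based maths above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Usermanager or any particular RADIUS server is known to work. The function names are made up; the one firm fact used is that the Gigawords attributes count how many times the 32-bit octet counter has wrapped (RFC 2869).

```python
GIGAWORD = 2 ** 32  # one wrap of the 32-bit Acct-*-Octets counter


def total_octets(gigawords, octets):
    """Combine a Gigawords value and an Octets value into one byte count."""
    return gigawords * GIGAWORD + octets


def billable_deltas(totals):
    """Return the 'new' bytes between consecutive accounting records.

    Assumes difference-based accounting with no Stop/Start in between.
    A counter that goes backwards is ambiguous in that case, so this
    sketch refuses to guess and raises instead.
    """
    deltas = []
    for prev, curr in zip(totals, totals[1:]):
        if curr < prev:
            raise ValueError(f"counter went backwards: {prev} -> {curr}")
        deltas.append(curr - prev)
    return deltas


# The example from the post: the counter repeats, then moves on.
print(billable_deltas([1_000_000, 1_000_000, 1_100_000]))
```

For the numbers in the post this gives zero new bytes for the repeated record and 100,000 for the next one. A server that stores each session’s final counters as absolute values instead would count the carried-over 1,000,000 again.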
I just took a quick look at some accounting records for PPPoE and they seem to be starting over from 0 for each connection just fine, from a 5.0beta6 router.
Are these from PPPoE connections or some other type? Also, it appears you’re using freeradius. Double-check to make sure it’s not configured to accumulate instead of recording absolute values.
You could packet-capture, or enable radius logging to see what the router is actually sending.
Re-reading the previous posts, is this only happening sometimes (when a PPPoE user disconnects/reconnects rapidly)?
Yes, PPPoE only. No, it’s not configured to accumulate. As posted earlier, below is a radius accounting packet sent by the mikrotik router, it clearly shows that the radius server is not the problem here.
It only happens to a few sessions per day, usually with a gap of less than 5 seconds between sessions. I’ve tried to replicate it but had no success. Changes to the Interim-Update interval and one-session-per-host made no difference.
It seems that under some specific circumstances, PPPoE sessions do not get disconnected properly and the counters get carried over to the next session.
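Since the problem is hard to replicate on demand, one way to find the affected sessions is to scan the accounting data for the pattern described here. The Python below is a rough heuristic sketch only: the Session record and its field names are hypothetical, not a freeradius schema, and the 5 second gap is taken from the observation above. A hit is a candidate for manual review, not proof of carry-over.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical record layout; adapt to your own accounting table.
    user: str
    start: float       # session start, seconds since epoch
    stop: float        # session stop, seconds since epoch
    input_octets: int  # final Acct-Input-Octets reported for the session


def suspect_sessions(sessions, max_gap_s=5.0):
    """Flag sessions that began within max_gap_s of the same user's
    previous stop and report at least as many octets as that session."""
    last_by_user = {}
    suspects = []
    for s in sorted(sessions, key=lambda s: s.start):
        prev = last_by_user.get(s.user)
        if (prev is not None
                and s.start - prev.stop <= max_gap_s
                and s.input_octets >= prev.input_octets):
            suspects.append(s)
        last_by_user[s.user] = s
    return suspects


# Made-up data: b reconnects 2 s after a and reports more than a did.
a = Session("alice", 0, 100, 1_000_000)
b = Session("alice", 102, 200, 2_100_000)
c = Session("alice", 1000, 1100, 3_000_000)
d = Session("bob", 0, 50, 500)
print(suspect_sessions([c, a, d, b]))
```

Only session b is flagged in this example; c reports more octets too, but it started long after the previous session ended.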
Much harder when there is more than 1 programmer on it, cos everyone has a different style and way of thinking.
Give them some time, because each problem you found was in a different area of code.
I sure as shi’ite couldn’t find, then fix (100% sure) all of those in 1 month.
Doh! Then they gotta be tested before being released.
More than 1 month basically.
Once again, you’re brilliant in that you have Hard Data to help pinpoint where the code is failing.
Without that, it is almost impossible to find where the problem is.
Apparently they can’t either… Which leaves me with a choice of an OpenVPN server that clobbers itself, IPsec/L2TP with non-functional NAT-T, or SSTP, which leaks memory so badly that I have to reboot the router every few days or risk it running out of memory. On top of that: an SSH server that crashes all the time; idle timeouts that don’t work properly, because any traffic destined to go through the interface is counted as traffic even when the connection is actually down and not passing anything; an RB1100 with several ports that won’t work with 4 different devices that I have tried; and responses from support saying that it is fixed in 5.0rc1 when the supouts I send demonstrating the problem clearly identify themselves as 5.0rc1.
Crossposting to other topics will not help at all. Please write to support with a link to this thread, a short and accurate description of the problem, and a supout.rif file.
I don’t believe that you have an unresolved ticket with support about this for a year. If so, let me know the ticket number and I will check why nobody has solved it for you.