We have been having an issue where the interface’s graphing is, for lack of a better term, not smooth.
Please see what I mean in the image below:
We are also finding that access to the router via Winbox is very sluggish, although the latency to the router is consistently in the 23ms region.
The router in question is a CCR1016-12S-1S+ running RouterOS 6.42.6.
CPU load on the router sits around 15-20% at most, and the router has been up for 25 days (and change) at this point.
The routing table is around 251k routes and the aggregate traffic running through the router is anywhere from 500Mbps to around 1Gbps.
As far as we can tell, all the above has no impact on the overall user experience for our clients that are passing through the router.
It is, however, affecting our graphing, in that we can’t properly see what our 95th percentile usage is for billing purposes.
I know this is quite weird, and I have waited some months before finally taking the plunge and posting it on here.
Any assistance that you good folks can give would be appreciated.
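For anyone unfamiliar with the 95th percentile figure mentioned above, here is a minimal sketch of how it is commonly derived from a series of rate samples. It uses the nearest-rank method and assumes the samples are already in Mbps. The function name and the sample values are made up for illustration and are not taken from our NMS.

```python
def percentile_95(samples_mbps):
    """Return the 95th percentile of rate samples using the nearest-rank method."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical samples: 100 readings from 1 to 100 Mbps.
samples = list(range(1, 101))
print(percentile_95(samples))
```

Because the result depends on having a complete, sane set of samples, gaps or bogus spikes in the graph data can skew it.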
Gaps may be due to congestion between Winbox and the device. If you are graphing with SNMP it shouldn’t matter; these gaps won’t affect that. How are you doing the graphing for billing purposes?
There definitely isn’t congestion between Winbox and the device, but just to make sure we tested it from a connection that is fibred to it (via a 3rd party, but with plenty of capacity).
Below are the screenshots from both locations for comparison (please ignore the packet loss on the second pic, we had 2 PPPoE sessions with the same local address):
As for the graphing via SNMP, that is also a wreck, with peaks of up to 90Gbps on a 1Gbps interface.
As for billing, fortunately we use IP>Accounting, which is unaffected.
We are using LibreNMS, and it is on its default configuration. It works perfectly fine for over 100 other devices, some doing much more traffic via worse connections.
Default configuration doesn’t really say too much to me - what SNMP version is it using, etc.
The graph gaps may be due to a single core being overloaded, the one processing your Winbox session. You can’t do very much about that, unless you remove the other thing that is causing the load, but it shouldn’t be affecting the SNMP graphing at all. SNMP graphing uses differences in interface counter values over time to measure speed, so every 5 minutes (or whatever the polling interval is) it grabs the byte counter and works out how many bytes passed over that period. With SNMP v1 you would end up with many issues.
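To make the counter-difference idea concrete, here is a rough sketch of the calculation an NMS typically performs. It is not LibreNMS code: the function name and the sample counter values are made up, and it assumes at most one counter wrap between polls.

```python
def rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Average bits per second between two polls of an octet counter."""
    modulus = 1 << counter_bits
    # Modular subtraction handles a single wrap between the two polls.
    delta = (curr_octets - prev_octets) % modulus
    return delta * 8 / interval_s


# Two polls of a 64-bit octet counter taken 300 s apart (hypothetical values).
print(rate_bps(10_000_000_000, 47_500_000_000, 300))  # 1000000000.0

# The same traffic seen through a 32-bit counter, which wrapped several
# times within the interval; the computed rate comes out far too low.
wrapped = 37_500_000_000 % (1 << 32)
print(rate_bps(0, wrapped, 300, counter_bits=32))
```

Note that if a counter ever goes backwards for a reason other than a wrap (a reset, for example), this naive formula treats it as a wrap and produces an absurdly large value. Whether that is what is happening here is only a guess.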
Sorry, I misunderstood your question. We are using SNMP v2. Here is an export:
/snmp community
set [ find default=yes ] addresses=0.0.0.0/0
add addresses=192.168.1.1/32,192.168.0.0/24 name=CompanyName
/snmp
set contact=contact@companyname.co.za enabled=yes location="IDC Location" trap-community=CompanyName trap-version=2
I changed the “addresses” from our public IPs to private IPs, but the subnet sizes are the same.
“name” and “trap-community” were also changed, as well as “contact” and “location”, but I can confirm that the config (excluding location) is identical to our other routers, which are working fine.
As for the thought that a single core is being overloaded, should that not be visible via the Profile tool?
On the off chance that you meant my PC’s CPU core being overloaded, I included that as well, please see below.
Picked up something now which may or may not help identify the issue.
I opened up the fasttrack connection mangle rule to check its traffic graph, and it looks much better overall.
What was weird was that when I opened any interface, the graph went bad.
The export from your router does not tell me what SNMP version is being used. Typically you set the SNMP version for polling purposes in the NMS on a per device basis. I see from your export that you are sending v2 traps to your NMS, but you are not graphing traps, and with that configuration your NMS could use either SNMP v1 or v2 and I don’t know which it is using. Most NMS systems I have used use SNMP v1 by default for each device unless you change it to v2 per device.
I was not referring to your PC, but the router itself.
There is that one CPU at 50% load. I would monitor Profile for a couple minutes and look carefully to see if it is jumping up to >90% every so often. The profiler does not update often enough to catch really short CPU spikes. They may be happening often enough to cause problems for your Winbox but not often enough to appear in profiler unless it happens to measure at the moment it is happening.
Took a quick video of it and posted it on YouTube. Tried to replace the office noise with some music and I think I made a mess of it, so just turn off the volume to save your ears.
Following this post, we have a 1016 on .26 firmware as well. Our graphs are mostly full on an interface but have some blank spaces in the graph even though it is the main trunk port with constant traffic. Let us know if your firmware upgrades help.
Sorry mate, but did you even read the thread? I don’t want to turn away people that are trying to help, but I also want to avoid it turning into an advertisement board for other programs.
Sorry for not getting back to you. I expect you are just running into some issue where the same process on the router that is reading the info for the graph is maxed out because it is doing something else (maybe your use of the traffic accounting feature is somehow overloading those parts of the router so that it no longer shows the graphs properly).
I’m not exactly sure whether there is an easy way of fixing it, short of switching off functions that you probably need, otherwise you wouldn’t have them on.
The SNMP graphs should be unaffected as long as your NMS is set to graph the router with SNMP v2c and not SNMP v1. I’ve seen very overloaded routers and the SNMP graphing is always fine, unless the NMS uses SNMP v1, and then you can get crazy looking spikes.
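The usual reason v1 misbehaves on busy links is that it cannot carry the 64-bit interface counters, so the NMS falls back to 32-bit ones that wrap quickly. Here is a back-of-the-envelope sketch of how quickly, using the traffic levels mentioned in this thread and an assumed 5-minute polling interval. The function name is made up for illustration.

```python
def seconds_to_wrap(counter_bits, rate_bps):
    """Seconds until an octet counter of the given width wraps at a steady rate."""
    bytes_per_second = rate_bps / 8
    return (1 << counter_bits) / bytes_per_second


POLL_INTERVAL_S = 300  # assumed 5-minute polling interval

for rate in (500e6, 1e9):
    t32 = seconds_to_wrap(32, rate)
    print(f"{rate / 1e6:.0f} Mbps: 32-bit counter wraps every {t32:.1f} s, "
          f"about {POLL_INTERVAL_S / t32:.1f} wraps per poll")

# A 64-bit counter at 1 Gbps takes thousands of years to wrap.
years64 = seconds_to_wrap(64, 1e9) / (365 * 24 * 3600)
print(f"64-bit counter at 1 Gbps: roughly {years64:.0f} years to wrap")
```

At 1Gbps a 32-bit counter wraps roughly every 34 seconds, so with several wraps per polling interval the NMS has no way of recovering the true byte count.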
As previously mentioned, we are using LibreNMS, which is using SNMP V2.
Below is a screenshot of the NMS config for this routerboard. It is the same as all other routerboards we have on the NMS.
As for the concern about IP>Accounting, we use that across the network and are not seeing this anywhere else.
We have routerboards with 200+ PPPoE sessions doing queues and all the fun stuff that comes with being PPPoE concentrators, running IP>Accounting, that are not having this issue, even though some of them are less powerful RBs, such as an RB1100AHx2 or RB3011.
As a test I tried switching off IP>Accounting for a few minutes, on the routerboard where we are seeing the issue, and it made no difference.
Lastly, as of this post, the routerboard in question is running 6.43.1.
Update - We replaced the CCR this morning with a brand new one, and updated to RouterOS 6.43.4 in the process.
We are still seeing the exact same thing.
We replaced our 1016 almost two weeks ago with a brand new one. I forgot about this thread I had posted in and went and checked the graph on the interface that we had gaps on in the past. I have been watching the graph for about ten minutes now and there are no longer any gaps in the graph. We are still running 6.42.6 firmware but will be updating to the long-term 6.42.9 firmware next week.