Running ROS 3.24 on RB1000U, doing VLAN routing of IP traffic, most other features disabled, have a few simple queues, that’s about it.
Today I saw something that is a first in nearly a year (since I put this configuration into production): CPU usage briefly spiked to nearly 100%, and latency became severe, with something like 50% packet loss. You can see the very short spikes on the graph below - it happened twice:
Unfortunately, by the time I determined that this was the cause of the latency, rather than some other network issue, and was able to log in to Winbox, the incident had passed and CPU usage was already settling back to pre-spike levels. The second time it happened, I was away from the office, so I never had the chance.
There are about five dozen client VLANs being routed by the 1000U via a connected L2 switch. I figured this spike was due to some near-Gbps burst of traffic, but in fact there was only a moderate spike on just one client graph, corresponding exactly with each incident:
Now, this router was stress-tested by a client a few months ago who was pushing traffic well in excess of 500Mbps, and it handled it fine, no latency. The config of the router has changed very little since - a few client VLANs have been dropped, and a few new ones have been added. The overall VLAN and IP address count is down, and the traffic is down.
So here’s the question: why did each of these 25Mbps - 30Mbps spikes on this one client VLAN bring the RB1000 to its knees? It’s routinely routing about 50Mbps - 80Mbps in each direction with CPU load steady at around 25%. Unfortunately I don’t have any data on the traffic from the incidents. I’ve emailed the client and asked him to check his logs; if he produces anything, I’ll post it here. But maybe someone out there has been down this road before, or has an idea?
On a related note, can someone suggest a good way to capture and log traffic data from the RB1000 in a way that will offer good forensics if this happens again? It seems from reading this forum that trafr is based on a very old Linux kernel and is essentially broken on anything of recent vintage. Also, the sniffer itself seems kind of broken, at least from Winbox - I tried running it with the interface set to just one VLAN, but even after applying that change, it captures data from every interface anyway!
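From the terminal, the sniffer can at least be pointed at a capture file for later analysis in Wireshark. Something along these lines should work - the property names below are from memory for 3.x (check them with `/tool sniffer print`), and `vlan57` and the file limit are just placeholders:

```
/tool sniffer set interface=vlan57 file-name=incident file-limit=4096
/tool sniffer start
# wait for the event, then:
/tool sniffer stop
```

The resulting file can then be downloaded from the router’s Files list and opened in Wireshark.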
It might have been a very small burst of bandwidth, but it may have been made up of very small packets. Unless you have a graph of the packet rates as well, you will not be able to determine this. I have seen commercial, big-name firewalls brought to their knees with just 10Mb of traffic which, when examined, turned out to be in excess of 30,000 packets per second.
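For a sense of scale, a quick back-of-the-envelope using the figures from that firewall anecdote:

```python
# Average packet size implied by 10 Mbps carried in 30,000 packets/sec.
bits_per_second = 10_000_000
packets_per_second = 30_000
avg_packet_bytes = bits_per_second / 8 / packets_per_second
print(round(avg_packet_bytes, 1))  # prints 41.7 - tiny packets indeed
```

That is barely bigger than a bare TCP ACK, which is exactly the kind of traffic that hammers the CPU while barely registering on a bandwidth graph.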
Your RRD/MRTG graph has averaged out the peak of the bandwidth burst, and it was actually much higher than you can see. RRD normally uses 5-minute averaging, so very quick spikes in traffic will not show up well. For example:
Suppose traffic spikes to 20Mb, but the burst lasts only 75 seconds of a 5-minute window. An RRD graph will record an average of just 5Mb for that window, even though it spiked to 20Mb. There’s also a chance that it did record a higher spike but can’t display it due to the small pixel dimensions of your graph.
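The averaging effect is easy to reproduce - assuming, purely for illustration, a 20Mb burst lasting 75 seconds of the 300-second window:

```python
# A 20 Mbps burst lasting 75 s inside one 300 s (5-minute) RRD step.
WINDOW = 300                                          # seconds per RRD step
rate = [20 if t < 75 else 0 for t in range(WINDOW)]   # Mbps, sampled each second
print(max(rate), sum(rate) / WINDOW)  # prints: 20 5.0 - true peak vs. what RRD stores
```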
You can see this in the two graphs below. The first is from our Cacti system, which uses RRD. When you cut it down to a one-hour window, you can clearly see the 5-minute slots and a peak of 9Mb.
The graph for the same port from The Dude, however, tells a different story. As it polls every few seconds, it saw the true peak in bandwidth: a spike of almost 60Mb.
In summary: start graphing your packet rates in addition to bandwidth rates, and if possible set up a copy of The Dude someplace and monitor your RB1000 with it for a while. This will tell you whether it’s packet rates or bandwidth rates causing the issue.
It occurs to me after writing this that you’re probably using the built-in graphing, which in turn uses RRD, so you may have no idea what I’m talking about XD
I had suspected a high rate of very small packets. I understand what you are saying about the sampling frequency and the possibility of missing short extreme bursts in the graph. However, the client who was implicated in this is on a 100Mbps uplink, so that would be the limit of his burst, and as pointed out before, the router has proven itself capable of handling hundreds of Mbps… with more typical packet sizes, that is.
I am using ROS’s built-in graphing for the CPU usage, and MRTG for the client VLANs. I don’t think either one has the ability to graph packet rates - is this something that The Dude can do? Also, I’ve noted that The Dude can run on an RB1000. Has anyone done this? Does it run alongside ROS, and how does one access the various Dude displays and screens?
You can run it on the RB1000, but I found it made an unacceptable amount of writes to the RB’s internal flash memory and really needs to be run on an external CF card - and in any case, The Dude cannot graph packet rates. MRTG can, though I am unsure of the exact configuration. According to the MRTG docs, the following is an example of a packet-rate graph config:
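(The `r1-pps` label and the values below are illustrative; the `.11` index must match the ifIndex of the interface you want to graph, and `ifInUcastPkts`/`ifOutUcastPkts` are standard MIB-II names that MRTG’s SNMP library resolves to OIDs for you.)

```
Target[r1-pps]: ifInUcastPkts.11&ifOutUcastPkts.11:public@r1.0
MaxBytes[r1-pps]: 125000
Title[r1-pps]: Packet rates - r1
PageTop[r1-pps]: <h1>Packets per second - r1</h1>
Options[r1-pps]: growright,nopercent
YLegend[r1-pps]: packets/sec
ShortLegend[r1-pps]: p/s
LegendI[r1-pps]: pps in
LegendO[r1-pps]: pps out
```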
Thanks for looking into that. I adapted that template a bit for my device, then pasted it into the MRTG config file for the router and gave it a run. I get this error:
SNMP Error:
Received SNMP response with error code
error status: noSuchName
index 1 (OID: 1.3.6.1.2.1.2.2.1.11.11)
SNMPv1_Session (remote host: "r1.0" [XXX.XXX.XXX.XXX].161)
community: "public"
request ID: 1868700253
PDU bufsize: 8000 bytes
timeout: 2s
retries: 5
backoff: 1)
at /usr/local/mrtg-2/bin/../lib/mrtg2/SNMP_util.pm line 490
SNMPGET Problem for .1.3.6.1.2.1.2.2.1.11.11 .1.3.6.1.2.1.2.2.1.17.11 sysUptime sysName on public@r1.0::::::v4only
at /usr/local/mrtg-2/bin/mrtg line 2150
So I guess the above OIDs are not quite right for the RB1000. I ran snmpwalk against the router, then grep’d for strings that contain ‘pkts’ and came up with thirteen counters each for unicast packets in and out - which of these would be the right one to query to get the total pps values for the RB1000 as a whole? I’ve set it up to use the .1 counter for each direction, simply because those had the largest value stored in their registers, and it’s graphing. But it would be great to get confirmation that these are the right ones.
Each of those numbers (or indexes) corresponds to a different interface. Check your snmpwalk output for “ifDescr”, which should tell you which index matches up with which interface - e.g. from a Cisco 24-port switch:
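You can do the same mapping from the command line - hostname and community here are taken from your MRTG error output, and the OIDs are standard MIB-II:

```
# ifDescr: the trailing number on each returned OID is the ifIndex
snmpwalk -v1 -c public r1.0 1.3.6.1.2.1.2.2.1.2
# unicast packet counters in/out for a given ifIndex (here 11):
snmpget -v1 -c public r1.0 1.3.6.1.2.1.2.2.1.11.11 1.3.6.1.2.1.2.2.1.17.11
```

If `.11.11` returns noSuchName, as in your error, that ifIndex simply doesn’t exist on the router - pick one that appears in the ifDescr walk.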
Your 1-4 probably correspond to the RB’s built-in gigabit ports. The rest are probably your VLANs, bridges, or whatever other virtual interfaces you have created.