Dude, where'd my network go?!

Fri Feb 13, 2009 4:55 am

Over the past several weeks, I've had a problem on my network. A wireless link connects a site that holds my Dude server to a central site. Looks about like this:

Dude--Cisco3550--bridge1--bridge2--Cisco1841--Mikrotik493AH FW-->RouterOS/PC BGP GW

As I've turned up more monitoring (SNMP for most devices, some OIDs in the appearances, a number of charts that pull from existing data sources, etc.), I've found that traffic in one direction (outbound as seen from The Dude) has slowed to a crawl. On a link that has solid 15mbps performance, The Dude is somehow killing 10mbps or more!!

With The Dude turned on, I get (based on the diagram above):
<<<<----- 15mbps from internet to The Dude
----->>>> 3-5mbps from The Dude to the internet

Turn off The Dude, it's steady 15 in either direction. No indications that bandwidth is clogged up (neither in The Dude's graphs nor on the router interfaces themselves). No device shows more than 6mbps in either direction sustained.

The Dude is polling pings and CPU utilization on maybe 40 Mikrotik devices, and graphing bandwidth utilization via SNMP from about 8 Cisco devices and a couple of Mikrotik devices. A fair amount of stuff, but not obscene... though the 30-second intervals on most polls probably aren't helping anything...
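
For a sense of scale, here's a back-of-the-envelope sketch of how much polling that adds up to. The device counts and the 30-second interval are from above; the rest is assumption on my part: "a couple" taken as 2, two counters (in/out octets) per graphed device, one request per probe, and a guessed 200 bytes per request plus reply.

```python
# Rough estimate of the polling load described above.
# Device counts and the 30 s interval come from the post; one request
# per probe and 200 bytes per exchange are assumptions, not measurements.

def polls_per_second(probes, interval_s):
    """Average number of poll requests sent per second."""
    return probes / interval_s

def poll_kbps(probes, interval_s, bytes_per_exchange):
    """Average bandwidth in kbit/s for request plus reply."""
    return polls_per_second(probes, interval_s) * bytes_per_exchange * 8 / 1000

# 40 Mikrotiks x (ping + CPU), plus (8 Cisco + 2 Mikrotik) x 2 octet counters
probes = 40 * 2 + (8 + 2) * 2
rate = polls_per_second(probes, 30)
kbps = poll_kbps(probes, 30, 200)  # 200 bytes per exchange is a guess
print(f"{probes} probes: {rate:.2f} polls/s, about {kbps:.1f} kbit/s")
```

Under those assumptions the polling itself is only a few kbit/s, nowhere near 10mbps, which is why the slowdown is so hard to explain by raw volume alone.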

All traffic on the network ends up going thru that Cisco 1841 and the 493 firewall. Traffic from other locations (going across wireless links other than the one pictured above) is also affected, so the problem really seems to be either the 1841 or the 493, but I haven't been able to isolate which.
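
One way to compare the two boxes is to read the SNMP octet counters (e.g. ifInOctets / ifOutOctets) on each side of each device over the same window and turn the deltas into rates. A minimal sketch of that arithmetic, assuming 32-bit counters and at most one wrap between readings; the sample readings are made up for illustration:

```python
# Convert two readings of an SNMP octet counter into Mbit/s.
# Assumes a 32-bit counter that wraps at most once between readings.

COUNTER32_MOD = 2 ** 32

def rate_mbps(first, second, seconds, modulus=COUNTER32_MOD):
    """Average rate in Mbit/s between two octet-counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = (second - first) % modulus  # modulo absorbs a single counter wrap
    return delta * 8 / seconds / 1_000_000

# Hypothetical readings taken 30 s apart on one interface
print(rate_mbps(1_000_000, 57_250_000, 30))
```

If the rate going into a device is noticeably higher than the rate coming out the other side during the same window, that device would be the one to look at more closely.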

Anyone seen a problem like this? It's as if the 1841 or the 493 is getting overloaded, but in what way? Not bandwidth, processor never exceeds 20% on either, plenty of memory on both devices... Cannot find any significant evidence of errors of any sort!

Throughout numerous Dude upgrades, major map rebuilding, and such, I've had serious concerns about data corruption... things disappear, charts are virtually unusable... devices have been deleted, added, deleted, added again... Is it possible that The Dude is just sending out a bunch of corrupted packets or something? I cannot imagine that it's overloading the packets/second capability of either the 1841 or the 493, but that's the best thought I have.
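
For reference on the packets/second angle, the same bandwidth means very different packet rates depending on packet size. A quick sketch of the conversion for the 15mbps link; the packet sizes are just illustrative, and layer-2 overhead is ignored:

```python
# Packets per second needed to fill a link at a given packet size.
# Ignores layer-2 framing overhead; the packet sizes are illustrative only.

def packets_per_second(mbps, packet_bytes):
    """Packet rate that carries `mbps` Mbit/s using `packet_bytes`-byte packets."""
    return mbps * 1_000_000 / (packet_bytes * 8)

for size in (64, 512, 1500):
    print(f"{size:>5} bytes: {packets_per_second(15, size):,.0f} packets/s")
```

So a stream of small packets can push the packet rate up by an order of magnitude or more without the bandwidth graphs looking any different.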

Seems like the problem started right around the time I upgraded from 3.0rc3 to 3.1, though it's difficult to say for sure.

I'm going to do several things to try to address the problem: move The Dude to the central site (to make bandwidth utilization more efficient), cut polling times back to 5 minutes, cut back on some of the SNMP monitoring, eliminate CPU monitoring, maybe kill some charts.... Hate to do it, but I think I'm also going to start from scratch and completely rebuild from a fresh new copy.

I'd really like to know where my network goes when the Dude is running, though... Cannot explain the missing 10mbps!!

