We are having an issue with MPLS Traffic Engineering where the router intermittently pegs the CPU at 100%. It stays at 100% until we disable our Traffic Eng Interfaces, at which point it returns to normal immediately. As an example, today we added six TE interfaces using static traffic paths (no CSPF) and statically routed subnets onto them (either directly or with ECMP by duplicating the gateway). About three hours later the CPU maxed out, and the profiler showed 99% of the CPU being used by “Networking”. The issue has recurred intermittently, anywhere from a few hours to a day after enabling the TE interfaces.
We tried to generate a supout.rif while the issue was ongoing, but after a few minutes it was still stuck at 1% and we needed to get our clients back online. The router is a CCR1072 running RouterOS v6.44.2, which ordinarily sits at around 30-35% CPU with 12.5 GB of 15.8 GB RAM free. It is routing about 4 Gbit/s and 500 kpps aggregate. It has 9 queue tree rules and no simple queues.
In terms of firewall rules: the filter rules are fairly standard port and SSH brute-force blocking. There are a large number of srcnat rules (~9500), scripted for CGNAT, but they use per-/24 jump rules so that only a few of them are evaluated for most traffic. Almost all traffic is checked against a single ~3400-entry address list in a mangle rule, which marks traffic for the queue tree traffic shaping. There is also a TZSP sniff rule to sample TCP ACK frames for latency analysis; disabling this rule has not affected the CPU issue. There is also a rule to clamp TCP MSS for traffic matching a src address list (with only two entries). Traffic Flow is enabled for NetFlow monitoring.
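To make the jump-rule layout concrete, here is a rough Python model of how many srcnat rules get evaluated for one source address. It is only a sketch: the prefix range, the number of /24s (37) and the rules per chain (256) are made-up values chosen to land near the ~9500 total, and `rules_checked` is a hypothetical helper, not anything from the router.

```python
import ipaddress

def rules_checked(src_ip, jump_prefixes, rules_per_chain):
    """Count rules evaluated for src_ip, assuming in-order, first-match evaluation."""
    addr = ipaddress.ip_address(src_ip)
    checked = 0
    for prefix in jump_prefixes:
        checked += 1  # every jump rule up to the matching one is tested
        if addr in prefix:
            # Worst case: the matching NAT rule is the last one in the per-/24 chain.
            return checked + rules_per_chain
    return checked  # no jump matched, so only the jump rules were evaluated

# Hypothetical layout: 37 /24s with 256 NAT rules each (assumed numbers).
jump_prefixes = [ipaddress.ip_network(f"100.64.{i}.0/24") for i in range(37)]
rules_per_chain = 256
total_rules = len(jump_prefixes) * (1 + rules_per_chain)

worst_case = rules_checked("100.64.36.200", jump_prefixes, rules_per_chain)
unmatched = rules_checked("10.0.0.1", jump_prefixes, rules_per_chain)
print(f"total rules: {total_rules}, worst case checked: {worst_case}, unmatched: {unmatched}")
```

Under these assumed numbers, even the worst case touches a few hundred rules rather than all ~9500, which is why we do not expect the NAT table by itself to explain the CPU spike.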
The MPLS config is MPLS/TE only (enabled via the OSPF instance), with three local “Interfaces” entries enabled with default settings (1G bandwidth). The six “Traffic Eng Interface” entries are set to use an MTU of 1496 and a Bandwidth of 1, with a primary path defined and no secondary path. The rest of the options are defaults. There are a large number of tunnel paths on the router (~1000) left over from a scripted mechanism, but only six were in use by TE interfaces, all with strict static hops and CSPF disabled. LDP is not enabled.
Of the total traffic, only about 500 Mbit/s was being routed to the TE interfaces.
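For a sense of scale, the figures above work out as follows. This is just back-of-the-envelope arithmetic in Python; the TE packet rate is an estimate that assumes TE traffic has the same average packet size as the aggregate, which we have not measured.

```python
# Figures quoted in this post (approximate).
total_bps = 4_000_000_000   # ~4 Gbit/s aggregate
total_pps = 500_000         # ~500 kpps aggregate
te_bps = 500_000_000        # ~500 Mbit/s routed onto the TE interfaces

# Average packet size across all traffic, in bytes.
avg_packet_bytes = total_bps / total_pps / 8

# Fraction of aggregate throughput carried by the TE interfaces.
te_share = te_bps / total_bps

# Estimated TE packet rate, ASSUMING the same average packet size as the aggregate.
te_pps_estimate = te_share * total_pps

print(f"average packet size: {avg_packet_bytes:.0f} bytes")
print(f"TE share of traffic: {te_share:.1%}")
print(f"estimated TE packet rate: {te_pps_estimate:.0f} pps")
```

So roughly an eighth of the traffic is on the TE interfaces, yet enabling them is what takes the CPU from 30-35% to 100%.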
BGP is enabled only for the upstream interface, and the router does not hold the full Internet routing table (only local routes, ~700 total). OSPF is used for the local network, and no other routing protocols are enabled.
Anything we can do to track down the cause of this issue would be very helpful. Thanks!