Hello,
Posting again: yesterday I mistakenly posted this in the Forwarding forum (114 views in 20 h). Users can’t move topics; I reported it to the moderators, but they didn’t move it.
I have strange issues with an RB750UP (5.21). It has quite a long uptime (over 130 days) and has been fairly stable, but from time to time it went to high CPU usage. Earlier I was able to leave Winbox running for a few days at a time; today I wasn’t able to connect with Winbox, PuTTY, or MAC-telnet.
It’s not overloaded: traffic (routing + NAT) is at most 20 Mbps. I recently tried some queuing methods to get more throughput. It works better now (peaks at 20 Mbps, earlier at most 12) and could be done better, but that is not the most important thing now. There are about 100 simple queues (SQ) in groups/tree/time, with maybe 20 active at most.
As I wrote earlier, it sometimes reached 100% CPU (maybe at peak transfer). Tools/Profile always pointed to a FLASH/storage problem (flash at 100% in the profile). I didn’t dig into the source of the problem; I just closed Winbox for some time to reduce the load. It was self-healing.
TODAY I got 100% CPU again, with flash at 100%. I started looking for the source of the problem and searching the forum and Google.
Someone wrote it can be a WEB-PROXY or HOT-SPOT problem. I have tested Hotspot, but it is not in use on this router. I set up WEB-PROXY a few weeks ago for a payment reminder: it’s a firewall/address-list based redirection to a customised webproxy/error.html page. It worked OK for those few weeks. TO BE SURE, I DISABLED WEB-PROXY.
Someone wrote it can be an SMTP problem. I used it a few times for testing and it’s not important now, so I DISABLED SMTP.
To reduce flash usage further I reduced the GRAPHING settings and the number of queue rules, and changed the save-period settings from 5 min to 24 h.
I had already reduced the LOGGING rules earlier; now I have switched off ALL LOGGING, even remote.
In the firewall I DISABLED the logging rules I had tested earlier; they are not important at this time either. Two log rules for the non-pay address list are still active, but in practice they are not used.
I’m using OSPF/MPLS, DNS cache (DNS at most 1% in profiling), NAT, and DHCP. Everything worked almost fine for the whole uptime; there were some problems with MPLS, but on another RB at another site.
RIGHT NOW I have other, much WORSE ISSUES:
- it’s PARTIALLY WORKING: routing, OSPF/MPLS, and user traffic look OK (as seen from another router)
- a few days ago Winbox stopped UPDATING the ether-1 (gw) traffic
- Winbox shows 100% CPU in the status bar, but not on the “load icon” (between hide passwords and un-/secure mode)
- the profile shows MANAGEMENT eating CPU, but it behaves like idle (100 minus the sum of the others: firewall, queuing); flash isn’t listed or shows up only for a short time
- Winbox (when it works at all) doesn’t show data in many windows (interfaces, ARP, queues, firewall, DHCP), so I cannot see the firewall rules causing the problem
- in Winbox I don’t see the IP/ARP entries and I HAVE ERRORS IN THE LOG: “dhcp1: failed to add arp-entry for IP 192.168.1.124: std failure: timeout (13)” every minute, for a few expired hosts that are marked as “offered” in the leases
- cannot log in via SSH, and cannot EVEN open a NEW TERMINAL WINDOW in Winbox!!! (the PuTTY session times out or, with luck, sometimes authorizes the user/password but hangs later, after or during the welcome text screen; Winbox’s terminal window hangs the same way, JUST BEFORE the PROMPT line). Active Users entries appear and disappear correctly during login attempts, a console task appears in profiling, and many times SSH sessions timed out or were dropped
- as I cannot start any terminal I can’t export the config; files/BACKUP is created normally, in reasonable time
- I can log in via SFTP
- supout files were created; some took a few minutes (8-10 min), and some attempts failed
- 6 MB free RAM, 38 MB free NAND/flash
Maybe some internal process tries to write some data and hangs:
- what can stop the console’s session-creation process just before the prompt?
- why does the SFTP session work??
- what blocks DHCP from adding the IP/ARP entry?
- why do other processes work OK? Do they not write/log, or not use the affected core/std functions? WHICH ONES?!
I’m not familiar with the system (Linux?) internals, but to me IT ALL LOOKS like a general, low-level/core flash/storage/IO-related problem.
It affects too many tasks/processes/features/services, and there are so many unresolved (undebuggable?) issues on the forum: HS, web-proxy, SMTP, MNDP, PPP, User Manager, firewall, DHCP. What is the common element that can cause such effects? There must be another solution than “switch off feature xxx”; that is not a real solution, not at all. Such workarounds help or they don’t, permanently or only for some period. Updating is no solution either: the issues exist in many different versions, mostly 5.x, but also earlier and even 6.x.
Something is wrong deeper in the system. Do we need more debug assertions in core/std? There must be a way to test for and indicate the cause of these problems. Maybe we need an additional package to test/profile/check the performance and execution time of core/std functions? Maybe creating a supout runs some tests, but it looks like they are not sufficient and cannot detect the real cause of the problems: there are too many of them, of too many kinds. This is something that affects the stability of the whole platform; it’s no longer a rock-solid foundation.
Problem AGAIN:
- flash usage started appearing again (reason? higher firewall %?),
- cannot create supout files,
- flash usage jumps around 2-30%, up to 40%, during a supout creation attempt
- management usage still behaves like idle: 100-sum(firewall+queuing+flash+the rest)
- I can log in via SFTP BUT CANNOT SEE ANY CONTENT!!!
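To make the “behaves like idle” observation concrete, here is a minimal sketch of the arithmetic. The percentages and the function name are hypothetical, chosen only for illustration; they are not readings taken from this router.

```python
def residual_share(profile):
    """Return 100 minus the sum of all listed CPU shares (in percent)."""
    return 100 - sum(profile.values())


# Hypothetical profiler readings in percent (illustrative, not measured).
sample = {"firewall": 12, "queuing": 8, "flash": 30, "the rest": 5}

# If "management" always equals this residual, it is acting as a
# catch-all for unaccounted CPU time, the way "idle" normally does.
management = residual_share(sample)
print(management)  # 45
```

That is the pattern I see: whatever the other entries add up to, management fills the remainder up to 100%.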
I haven’t restarted it yet (it’s remote and in production); I’m afraid it won’t start properly again. I will replace it later, but I want to know what’s going on and what the reason is.
LATER:
After the restart it worked quite OK, but there was no content in Winbox/FILES (can’t back up the config), so I replaced it. I guess I should Netinstall it?
Best Regards