Well, maybe because English is not my native language I was not clear. When I say "tens", I mean 23 one time, 45 the next, 30 the next, and so on. (In Spanish the word is "decenas", and I always thought the English word "tens" meant the same.)
Javier, it is very weird that only 10 connections are closed.
Perhaps you can enable pppoe,debug logs and find out why all 10 clients were disconnected simultaneously.
Otherwise it is very hard to guess what the reason could be.
Just curious, do you have RADIUS server for these 300 clients?
Tue Dec 8 00:10:31 2009
    Packet-Type = Access-Accept
    Mikrotik-Rate-Limit = "128k/256k 130k/258k 129k/257k 3/3 8 32k/32k"
    Framed-Routing = None
    Framed-Protocol = PPP
    Service-Type = Framed-User
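For anyone reading that Mikrotik-Rate-Limit value, here is a minimal Python sketch that splits it into named parts. It assumes the field order given in the MikroTik documentation (rate, burst rate, burst threshold, burst time, priority, minimum rate, each as rx/tx from the router's point of view) and that a missing tx value falls back to the rx value. The function name `parse_rate_limit` and the dictionary keys are made up for illustration.

```python
# Field order assumed from the MikroTik docs; the names are illustrative only.
FIELDS = ("rate", "burst_rate", "burst_threshold", "burst_time", "priority", "min_rate")


def parse_rate_limit(value: str) -> dict:
    """Split a Mikrotik-Rate-Limit string into named rx/tx parts."""
    parsed = {}
    for name, token in zip(FIELDS, value.split()):
        if name == "priority":
            # Priority is a single integer, not an rx/tx pair.
            parsed[name] = int(token)
            continue
        rx, _, tx = token.partition("/")
        # Assumption: when tx is omitted, the rx value applies to both directions.
        parsed[name] = {"rx": rx, "tx": tx or rx}
    return parsed


if __name__ == "__main__":
    limits = parse_rate_limit("128k/256k 130k/258k 129k/257k 3/3 8 32k/32k")
    for name, val in limits.items():
        print(name, val)
```

Trailing fields are optional, so a shorter string such as "1M" simply yields fewer keys.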
add authentication=pap default-profile=PPPoE disabled=no interface=PPPoEVLAN20 \
    keepalive-timeout=10 max-mru=1480 max-mtu=1480 max-sessions=0 \
    mrru=disabled one-session-per-host=yes service-name=XXXX
add authentication=pap default-profile=PPPoE disabled=no interface=VLAN10 \
    keepalive-timeout=10 max-mru=1480 max-mtu=1480 max-sessions=0 \
    mrru=disabled one-session-per-host=yes service-name=XXXX
set default change-tcp-mss=yes comment="" name=default only-one=default \
    use-compression=default use-encryption=default use-vj-compression=default
add change-tcp-mss=yes comment="" dns-server=10.120.0.2,10.120.0.3,10.120.128.2 \
    local-address=200.xxx.xxx.xxx name=PPPoE only-one=default \
    remote-address=POOL-PPPoE use-compression=default use-encryption=default \
    use-vj-compression=default wins-server=127.0.0.1
#1 was relatively easy to do.
1) Replace all your dynamic change-mss rules with one global change-mss rule.
2) Check that you use the latest winbox loader (clear the cache after upgrading).
3) Think about switching from dynamic simple queues to a dynamic address-list and a queue tree with PCQ.
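On point 1, the value for a single global change-mss rule can be worked out from the MTU. A rough Python sketch, assuming plain IPv4 and TCP headers with no options (20 bytes each) and taking max-mtu=1480 from the PPPoE server config posted earlier; the helper name `clamp_mss` is made up:

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
TCP_HEADER_BYTES = 20   # assumption: TCP header without options


def clamp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


if __name__ == "__main__":
    # max-mtu=1480 is the value from the posted PPPoE server settings.
    print(clamp_mss(1480))
```

With max-mtu=1480 this gives 1440, which would be the new-mss value for one mangle rule with action=change-mss instead of one dynamic rule per session.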
What is that? Where do you set it up?
Edit: My PPPoEs ask for interim updates, might this help?
interim-update - defines time interval between communications with the router. If this time will exceed, RADIUS server will assume that this connection is down. This value is suggested to be not less than 3 minutes
# max_requests: The maximum number of requests which the server keeps
# track of. This should be 256 multiplied by the number of clients.
# e.g. With 4 clients, this number should be 1024.
#
# If this number is too low, then when the server becomes busy,
# it will not respond to any new requests, until the 'cleanup_delay'
# time has passed, and it has removed the old requests.
#
# If this number is set too high, then the server will use a bit more
# memory for no real benefit.
#
# If you aren't sure what it should be set to, it's better to set it
# too high than too low. Setting it to 1000 per client is probably
# the highest it should be.
#
# Useful range of values: 256 to infinity
#
max_requests = 1024
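The sizing rule in that comment is simple arithmetic. A small Python sketch, assuming "clients" means RADIUS clients (the NAS devices such as the PPPoE router, as listed in the FreeRADIUS client configuration) and not the PPPoE subscribers; the function name `max_requests_range` is made up:

```python
from typing import Tuple


def max_requests_range(nas_clients: int) -> Tuple[int, int]:
    """Return (suggested, upper) max_requests values for a number of RADIUS clients.

    256 per client is the suggested value and 1000 per client the
    "probably the highest it should be" figure from the config comment.
    """
    if nas_clients < 1:
        raise ValueError("need at least one RADIUS client")
    return 256 * nas_clients, 1000 * nas_clients


if __name__ == "__main__":
    suggested, upper = max_requests_range(4)
    print(suggested, upper)
```

For 4 clients this matches the 1024 in the comment, so a single router talking to the server would already be covered by the posted setting.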
I have to mention: we do NOT use interim-updates and suffer the same problem.
So the main question is:
Does PPPoE server on MikroTik drop connections if there are timeouts on interim-updates...?
Has anyone been able to check their logs to see if they recognize any similarities before the "crash" of the PPPoE servers? In mine, I notice that my VLAN interfaces all switch to the UP state. I don't see anything going to the DOWN state prior to this, though, so I have no idea why the state changed to UP. Is anyone else logging to a syslog server who can confirm this?
since a long time already. RB1100RB1000 discontinued?!
Email support. When they answer, you will see a ticket number in the subject of the email, like 2010101966000161.
Same problem. I first noticed the problem at about 20 sessions and it persists until now, at 215 sessions. The problem does not affect PPTP sessions connected through the WAN port. What are the "ticket numbers"?
I hate reviving old threads from years past, but this one IMHO is worth keeping alive. We have the same issue with 1300 PPPoE sessions on a CCR1702. We are able to reliably reproduce this:
1. Drop a number of customers by:
a) Rebooting a downstream switch
b) Rebooting a PtP AirFiber serving a downstream switch
c) Pulling one of the ports on the bridge serving PPPoE on the CCR
2. We will see traffic drop according to the segment lost.
3. When the disconnect completes, traffic resumes.
4. About 2 minutes after traffic resumes, ALL traffic stops at the CCR, and PPPoE sessions start dropping - sometimes it's all of them, sometimes only a portion.
CCR remains accessible during these events, but no amount of CPU profiling has pointed to anything specific. Mikrotik support ended up shrugging and said "our hardware won't support your current configuration" without further details. The interface is a 10Gbps fiber, so this is not a "you're choking your 1G link".
I think this problem is embedded deeply in the core of the operating system, and thus has not been fixed during years of development, upgrades and fixes.
At this point, we are looking at alternative vendors, at a loss of thousands of dollars to Mikrotik (we are a credible ISP in Eastern Africa with some 15.000 customers... and plans for growth to 200.000 customers).
Thanks for your input and suggestions - we are definitely contemplating the x86 metal + dedicated PPPoE stack as an option.
On the connection tracking disabled - how would you handle dynamic rate limiting without it? We use a simple queue for each CPE session, assigned based on RADIUS response (and the service level set on the customer DB). We also (in some cases) use mangle rules to direct traffic where we have more than one upstream link (e.g. two parallel 1Gbps fibers).