We have a large, 500-unit MDU that we provide DIA to with 1 public IP. We have noticed some tenants experiencing disconnections from certain services like Discord and some gaming servers, and we suspect it could be too many connections for 1 public IP. How would you handle this issue if you only had access to 1 public IP?
Probably NAT port exhaustion. Do you have IPv6 available?
IPv6 is not an option for us right now, unfortunately. What if I was able to get more public IPv4 addresses?
Yeah, it's probably related to the number of connections that you're tracking.
So, first of all, tell us what device you are using. For a large conntrack table you will need a device with quite a bit of memory (1 GB minimum) and a higher-class CPU.
The next step is to look at the numbers provided by /ip/firewall/connection/tracking/print. It would be best to post the entire output. This shows how many connections you have, how close you are to the limit, and whether the timeouts are reasonable.
Third, adjust the timeouts to something reasonable.
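For illustration, lowering the established-TCP timeout is usually the biggest single win. The values below are only a sketch (2h is a common starting point, not an official recommendation):

```
/ip firewall connection tracking
# default tcp-established-timeout is 1d; idle entries linger far too long
set tcp-established-timeout=2h
# UDP timeouts are already short by default; only raise them if something breaks
set udp-timeout=30s udp-stream-timeout=3m
```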
Fourth: in my humble experience, most of these sorts of conntrack-exhaustion problems are due to some users having a device that makes lots of connections. They may be infected with some sort of malware/trojan, etc. I've even seen phones do this.
It's basically impossible to deal with these one-by-one; however, there's a handy-dandy connlimit feature in the MikroTik firewall, which can be used to limit the number of connections per tenant to something reasonable, like 10k. This last step usually solves the rest of the problems.
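A minimal sketch of such a rule, assuming each client device gets its own address. The 10k figure and the /32 netmask are illustrative; use the unit subnet's prefix length instead of 32 to cap a whole unit:

```
/ip firewall filter
# drop new TCP connections from any single source address once it already
# holds ~10k tracked connections (syntax: connection-limit=count,netmask)
add chain=forward protocol=tcp tcp-flags=syn connection-limit=10000,32 action=drop comment="per-source conntrack cap"
```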
All said, 500 units is not that many, and I would venture to say that it's probably possible to handle them with one external IP.
If it turns out not to be possible, MikroTik routers can be configured to use any number of additional addresses. But obtain an additional IP as the last step, because if you don't fix the root cause, which is probably malware, the additional address won't help.
I'd also set up queueing, if you haven't already. Dropped interactive connections are also a symptom of absent or misconfigured bandwidth allocation.
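As a hedged example of what queueing could look like here, a PCQ setup that shares the uplink fairly per tenant address; all names, subnets, and rates below are placeholders for whatever the site actually uses:

```
/queue type
add name=pcq-down kind=pcq pcq-classifier=dst-address pcq-rate=100M
add name=pcq-up kind=pcq pcq-classifier=src-address pcq-rate=100M
/queue simple
# one parent queue over the whole tenant range; PCQ then divides it per address
add name=all-tenants target=10.1.0.0/16 max-limit=950M/950M queue=pcq-up/pcq-down
```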
[user@CCR2004] /ip/firewall/connection/tracking> print
enabled: auto
active-ipv4: yes
active-ipv6: no
tcp-syn-sent-timeout: 5s
tcp-syn-received-timeout: 5s
tcp-established-timeout: 1d
tcp-fin-wait-timeout: 10s
tcp-close-wait-timeout: 10s
tcp-last-ack-timeout: 10s
tcp-time-wait-timeout: 10s
tcp-close-timeout: 10s
tcp-max-retrans-timeout: 5m
tcp-unacked-timeout: 5m
loose-tcp-tracking: yes
liberal-tcp-tracking: no
udp-timeout: 30s
udp-stream-timeout: 3m
icmp-timeout: 10s
generic-timeout: 10m
max-entries: 1048576
total-entries: 12647
total-ip4-entries: 12647
total-ip6-entries: 0
Well, the number of entries is low. Almost too low. Was this captured during normal, steady-state operation?
Anyway, your problem is not conntrack. So it's most likely improper or absent queueing.
You’re servicing 500 units with a CCR2004? Really hope you aren’t charging them a monthly fee, considering how anemic that thing is.
That aside, you need to track connection, CPU, memory, and interface usage continuously, so that the next time someone complains you have at-the-time numbers to reference. Since a CCR2004, according to the docs, is limited to roughly 1.1 million packets per second with 25 IP filter (firewall) rules enabled, that means in the best case it can handle ~14 Gbps, and in the worst case 636 Mbps (depending on packet sizes). It would only take a handful of power users among the 500 units you're servicing to blow through that bandwidth (depending on the speeds you're offering them, obviously).
I think you're being a little judgmental and irrational. The story usually goes something like this: all the property can get is a slightly premium service from the local ISP. It is maybe 1 Gbps, maybe even 2.5 Gbps. In many places some service, however anemic, is necessary. At least it's better than none at all. The price is usually included in whatever other fee, i.e. gratis.
As long as the router can handle whatever bandwidth they manage to scrape together, it's capable. The only reason to choose something a bit beefier is because of the number of clients/devices/connections.
EDIT: So if I'm right about the circumstances (my crystal ball doesn't need to be sent for calibration again), then the CCR2004 is a totally fine choice. Maybe even a bit too generous.
A 500-unit building is pretty sizable, and would provide enough revenue per month to convince a larger ISP to provide significantly higher speeds than 1-2 Gbps in much of the world. It’s far more likely OP works for a cut-rate shop / bargain-bin “ISP” (i.e., a middleman), but until they actually respond with hard info, it’s really anyone’s guess.
We'll see. That is if OP is forthcoming. I'm speculating as well.
In many places in the world, especially if you're in some non-central location, it's totally usual that there is a huge price gap between a "residential" endpoint (perhaps charged at a higher rate, owing to the much higher use) and anything better, because that's what the infrastructure is there for, e.g. it's delivered over DOCSIS. While the entire DOCSIS network has way higher total bandwidth, there simply isn't any sort of terminal device available that would be compatible and deliver e.g. 10 Gbps. To have a proper fiber connection, the entire network path between the ISP's POP and the given location would have to be excavated, because while the current coax connection works, much of its path is in bad condition. (In case it breaks, they're willing to dig up the one spot.)
In these situations you either a) make do with 1-4 of these residential connections, b) pay through the nose for laying fiber, or c) invite the ISP in, and then pay for each unit separately.
If this is some sort of place where people don't really expect broadband (youth hostel, cheap rental, etc.), contrary to what many would assume, a few hundred people can easily get by with a shared gigabit.
But you're totally right: only OP knows the story.
I’ll give you some more background on our setup. We provide building-wide WiFi and use PPSK.
Each unit is on its own VLAN with its own DHCP server, and tenants connect to the WiFi using their PPSK.
So our router has a VLAN, and runs a DHCP server, for every unit in the building.
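For one unit, that design presumably looks something like the sketch below (interface names, the VLAN ID, and the addressing are made up for illustration):

```
/interface vlan
add name=vlan-unit101 vlan-id=101 interface=bridge1
/ip address
add address=10.1.101.1/24 interface=vlan-unit101
/ip pool
add name=pool-unit101 ranges=10.1.101.10-10.1.101.254
/ip dhcp-server
add name=dhcp-unit101 interface=vlan-unit101 address-pool=pool-unit101
/ip dhcp-server network
add address=10.1.101.0/24 gateway=10.1.101.1 dns-server=10.1.101.1
```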
hmm
if you use 1 public IP, and a port range from 10,000 to 60,000
that is 50,000 available ports
and 500 deterministic CG-NAT users
"experiencing disconnections from certain services like Discord"
so that is only 100 ports per user
Port-per-user only tells any story when you view it as CG-NAT, and you want to allocate port ranges explicitly and separately. Neither of these is likely true. OP seems to be running some sort of "extended LAN".
He currently has 12.6k connections altogether. Just for your benefit, I've looked at my home lab's external conntrack usage, and I'm tracking 8.6k.
According to your arithmetic, my setup is on the brink of impossible; yet it works fine. Even beyond 20k connections it works fine. Yes, I only have one public IP.
I had probably 100 or so customers behind a CCR1009 and ran into the same problem. Looking back, I was exhausting some of the NAT connections (connection tracking or conntrack). Moving to a newer, bigger router helped some things, but ultimately yes, having more IP addresses available was the real fix. I now NAT approximately 100-150 subscribers per IP address.
As I type this, it looks like someone else already did the math. I’ll expand it a little bit.
If you do more than he suggested, and take 65,535 total ports, subtract the first 1024, and divide the result by 500, you get roughly 129 ports per unit. Most CGNAT systems advise 500-1500 per subscriber. (The number of devices in the home/business is a better factor to use; small and medium businesses for example are likely to need more ports than a single-family home.)
To have 500 ports per unit, you could make my math work with 4 IPs. If your upstream provider can route a /28 to you, that would give you 5-8 usable addresses (depending on how you configure things), with room for 322K-516K ports, roughly 645-1032 per dwelling.
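The arithmetic above can be sanity-checked straight from a RouterOS terminal (integer division, so results are truncated):

```
:put ((65535 - 1024) / 500)        # ports per unit with one IP -> 129
:put ((4 * (65535 - 1024)) / 500)  # with four IPs -> 516
```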
Some CGNAT guides will recommend specifically assigning a block of TCP/UDP ports from external IPs to internal IPs. Unless needed for legal reasons, that wastes ports, and you risk starving the handful of heavy users who max out their allotment.
Personally I map a group of internal IP subnets to a single IP, and that has worked well for the past six years.
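A sketch of that style of NAT, assuming two internal /18s and two documentation-range external addresses (all values hypothetical):

```
/ip firewall nat
# each block of internal subnets shares one external address
add chain=srcnat src-address=10.1.0.0/18 out-interface=wan1 action=src-nat to-addresses=203.0.113.10
add chain=srcnat src-address=10.1.64.0/18 out-interface=wan1 action=src-nat to-addresses=203.0.113.11
```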
When the CCR2004 (12SFP+ version) first came out, I used it as a border router for my ISP network. I don’t recall how many customers we had, probably 200-300 homes.
While it is spec’d to handle 10-20 Gbps of bandwidth, it seemed to peak around 60% CPU and ~3 Gbps with the minimal number of filters I had on it. Most of the day it held steady around 30% utilization at 1.5 Gbps, so we weren’t necessarily maxing out the CPU, but throughput just wouldn’t budge (it had a 10G uplink).
I replaced it with a 2116 (saw immediate improvement) and shelved the 2004’s for a while. I put them on the bench last year and did what I could to try to get to 10 and 20Gbps. The real challenge, at least for that particular model, is the PIPE thing. The CPU is less involved somehow if you put uplink connections on the SFP28 ports and downlinks on the SFP+ ports. If you try to go from SFP+ to SFP+, it taxes the CPU more.
In the end, I think we were running into PPS limits on that thing. With a 1-2Gbps connection, which this MDU likely has, it should be fine.
"my arithmetic" ... it is common sense
since we have no info on how OP NATed 500 users to 1 public IP, my writing is a wild guess
but IF it is deterministic, where 1 user has 100 ports pre-allocated, this ONE user could easily exhaust all 100 preallocated ports
and it has nothing to do with the TOTAL number of connections
My comment was regarding the "if".
But OP is not willing to clarify, so we will probably remain in the dark.
I do not have much experience with this sort of IP mapping; does anyone know of a good guide I can look at for this? Sounds like I should try using more public IPs and see if that helps.
Our router currently just uses the default masquerade NAT rule.
The whole point of this discussion was that it's much better to use the default masquerade rule. Don't look at changing that.
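For reference, the default rule in question is typically just the following (the interface name is an assumption):

```
/ip firewall nat
add chain=srcnat out-interface=wan1 action=masquerade
```

Masquerade automatically uses whatever address the outgoing interface currently holds, which is why it is the safe default for a single-IP setup.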
I still stand by my recommendation that with this user count, implementing queueing is not optional. It's mandatory. This is what will probably help in your situation.
I still stand by my recommendation to first fix your problem, which is very likely to be the lack of queueing, and then, if you find it necessary, obtain more IP addresses. Again, if you only obtain additional addresses, that is unlikely to help you.
EDIT: It would be helpful if you could re-run the tracking/print command while the system is under usual load. I have a suspicion that your previously posted output was not taken in that situation.
I will run that command again as soon as I can when the system is under load.
AFAIK, neither connection tracking nor NAT requires that the local port be unique per connection. NAT and connection tracking use the quadruple "src-address, src-port, dst-address, dst-port" to uniquely identify each connection, meaning that (assuming source NAT) the pair "src-address, src-port" can be used many times, as long as the "dst-address, dst-port" pair is different. So it's fine to use the same src-port on the WAN side to connect to different remote servers (e.g. google.com and speedtest.net), even if the remote ports are the same (e.g. 443).
Yes, it is customary (for userland applications at least) to open a new unique port when creating outgoing connections, but IMO that's mainly to do with the fact that many apps create multiple connections towards the same server and service (so dst-address and dst-port are always the same) ... and on an end-user machine there are usually ample free ports.
So yes, such a "CG-NAT" server would have a limit of around 50,000 concurrent connections to the same remote server and service, but that's pretty unlikely to happen.