A few weeks ago, I upgraded my main backbone of tower routers from 6.18 and 6.27 to 6.32.3. Everything looked okay; we were passing traffic at about the same level as before. Yay!
We started getting a few complaints from end users all over the network. There weren't many, and with wireless links you can almost always find wireless issues to explain problems. The complaints were mostly about upload rates being way down. These users normally had 500Kbps to 3Mbps upload rates, and the limiting factor was their queue tree doing what I told it to do for their service plans. The complainers were seeing best rates around 128Kbps.
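For reference, the per-plan upload limits come from queue trees along these lines (the mark names, addresses, and rates below are hypothetical placeholders, and the actual setup depends on your mangle rules and WAN interface):

Code: Select all
/ip firewall mangle add chain=forward src-address=10.50.0.25 \
    action=mark-packet new-packet-mark=cust25-up passthrough=no \
    comment="example: mark one customer's upload traffic"
/queue tree add name=cust25-upload parent=ether1-wan \
    packet-mark=cust25-up max-limit=3M \
    comment="example: cap that customer's upload at 3Mbps"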
A few days ago I started thinking that it had to be some underlying problem. It was just affecting too much of the network.
So, from a solid server in the colo with good GigE connections to the main router, a CCR1036, I started running mtr with big packets and short intervals. I tried using /tool traceroute, but it counts packets that return late as lost, rather than only those that are actually lost. Running /tool traceroute at more than 20 packets per second would produce 80% packet loss statistics, and my network didn't have 80% packet loss. The network actually worked.
I would see it run fine for several seconds then there would be around 19 packets dropped at one or more of the hops. There would be latencies of 500+ms. The average packet loss after 20,000 to 100,000 passes was around 2%.
Code: Select all
sudo mtr -o "LSD NABWVM" -i 0.02 -s 1472 ap1.towerX
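For comparison, the RouterOS-side test looked roughly like this (the count and timeout values here are illustrative, not the exact ones I used):

Code: Select all
/tool traceroute address=ap1.towerX size=1472 count=50 timeout=1s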
So, I looked at the wireless links. I thought I saw a correlation between MCS rate changes and the dropped packets, so I worked on the wireless links for a few nights until they quit shifting MCS and looked really stable. I still had 2% packet loss as measured by mtr. /tool traceroute showed similar packet loss from the remote end of the path coming back; it just wasn't as quickly visible because of /tool traceroute's limitations. Because of that, I was disinclined to believe I was seeing an mtr anomaly.
Finally, I noticed that I was getting packet loss on the hop from the server to the CCR. I had no explanation for that.
I decided to take a stab in the dark and go back to 6.30.4. When the router came back up, I had no packet loss. I still see occasional 500+ms pings, but they are less frequent. So I went ahead and put 6.30.4 on all the CCRs along the path to one of my problem towers. When field techs tested the next day, upload speed tests were much improved. Customers are getting their plan-limited speeds.
I did notice one anomaly on downgrade. The routers have a public /30 and an RFC1918 /29 on each interface. The 1918 space is solely for management of the wireless bridge gear. OSPF runs on the public /30 space. After the downgrade, I was getting ping and traceroute replies from the 1918 IPs rather than the public IP to which the packet had been sent.
If I disable the RFC1918 address, the router starts replying from the correct IP. If I then re-enable the RFC1918 address, the router continues to reply from the correct public /30 IP. Unfortunately, after a reboot it goes back to replying from the RFC1918 address. I hypothesize that the RFC1918 address was added to the configuration first, years ago, and that 6.30.4 is for some reason using it as the default source IP on the interface.
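The per-boot workaround is just to bounce the RFC1918 address on each affected interface (the address and interface names below are hypothetical placeholders):

Code: Select all
/ip address print where interface=ether2
/ip address disable [find address="10.128.1.1/29"]
/ip address enable [find address="10.128.1.1/29"]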
Code: Select all
                       Packets               Pings
 Host                Loss%  Snt Drop   Last   Avg  Best  Wrst StDev  Javg
 1. cololangw         0.0%  319    0    0.2   0.3   0.2  10.9   0.8   0.3
 2. 10.128.1.36       0.0%  319    0    0.7   0.7   0.6   9.4   0.7   0.2
 3. towerApub         0.0%  319    0    1.2   1.2   1.0  14.2   0.9   0.3
 4. towerbpub         0.0%  319    0    3.4   3.1   1.8  16.6   1.1   0.8
 5. 10.128.251.4      0.0%  319    0   14.1  14.8  11.2  57.7   3.6   2.7
 6. 10.128.251.130    0.0%  318    0   13.4  15.4  10.4  52.5   3.3   2.8