I’ve been plagued with this issue “feb/16 16:21:34 dhcp,warning vlan8-lan offering lease without success” and have yet to figure out what the root cause is. I have built and rebuilt 3 MT devices, a CRS125, a RB2011 and most recently an RB3011. The problem has shown up in the current version of ROS v6.38.1 as well as v6.37. The configuration is very simple.
I’ve tried every incarnation of broadcast, non-broadcast, bootp options, etc. I have reset and rebuilt the config from scratch on each device. I have tried plugging in directly to the device, via a switch, and via an MT access point. There is no common thread that I can find. The clients range from iOS, MacOS X, Linux, and Windows 7 to embedded devices like APC SmartUPS cards and printers (likely busybox Linux). MacOS X seems to work the best, but even that is inconsistent. The problem seems to worsen over time, working fine for a while and then getting progressively worse. The network is dual-stacked with IPv6; it also had a small set of static mappings, but I removed them when the problem began to surface again.
Right now I have moved to an external dhcp server running isc-dhcpd and it works just fine with no other adjustments to the switch or the mikrotik device.
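For reference, the external server needed nothing exotic. A minimal dhcpd.conf along these lines is enough to serve a single segment (the subnet, range, and router addresses here are placeholders, not my actual values):

```
# /etc/dhcp/dhcpd.conf -- minimal single-subnet sketch (placeholder addresses)
authoritative;
default-lease-time 14400;      # 4 hours
max-lease-time 86400;

subnet 192.168.88.0 netmask 255.255.255.0 {
  range 192.168.88.100 192.168.88.199;
  option routers 192.168.88.1;
  option domain-name-servers 192.168.88.1;
}
```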
Any ideas, thoughts, rotten fruit to throw?
The hundreds of times I’ve seen this, it’s always been because of a faulty cable on the remote side that’s asking for DHCP. Maybe in your case the response is not making it back to the requestor for other reasons, but usually it’s because the cable is faulty.
That’s a good thought. I typically start my troubleshooting at layer 1; I’ve replaced the cabling (twice) during the process of seeing this and cable-tested each time. Stats on the interfaces to the RB look reasonable.
I’ve also got clients experiencing this across L2 devices, wireless, and direct connections to the RB. I don’t think that all of the cabling could be bad, as it is a mix of structured cabling, fiber, and direct patch cables. It feels like a bug to me, though. The CLI and web interface for the leases section go bananas when trying to display in this state: leases showing multiple times, disappearing and reappearing, etc. This is the same behavior across ROS versions and hardware platforms, and it seems to come and go. I would also expect the same behavior with the external dhcp server plugged into the same equipment, which does not occur.
I’ve encountered this problem at a number of subscriber installations, and never heard a good explanation of why it was happening. The only time I ever actually solved it was when it started occurring at a new installation and then several days later the service went out – I found that indeed, the outdoor cable had been chewed by pack rats. At another site where this occurs, we used a cable that had been installed by the homeowner, so that would be consistent with this diagnosis. However, at a third subscriber, we ended up replacing the entire outdoor cable for completely unrelated reasons (he wanted it rerouted), and the problem did not go away. Perhaps it’s due to a bad patch cable inside his house beyond our wiring.
In any event, it might be worth just running the built-in ethernet test for a significant period and seeing if it comes up with transient faults.
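If I remember the RouterOS CLI correctly, the built-in per-port cable test can be run from a terminal like this (the interface name is just an example; on v6 it reports per-pair state and an approximate distance to a fault):

```
# run the built-in TDR cable test on one port; repeat periodically
# to catch transient faults (interface name is a placeholder)
/interface ethernet cable-test ether2
```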
I’ve tested these cables for continuity as well as attenuation and xtalk, they all test fine. In addition, there are clients attached via fiber (fiber also cleaned, scoped, and verified). I am not convinced that this is a physical issue since I have literally replaced everything in the path, with the exception of a handful of structured cable runs connecting random remote hosts. With brand new cables, tested and verified, plugged directly into the RB I see the same behavior.
It’s moot at this point: I’ve lost all confidence in the dhcpd implementation and moved everything to an external ISC dhcpd server, and I have integrated DHCPv6 for the managed bit as well, which was on the roadmap anyway.
I’m curious: have you replaced the Routerboard? A bad or intermittent Ethernet port jack might create the same failure mode as an intermittent cable.
My gut feeling is that this is somehow hardware- or unit-related. DHCP server works perfectly at all but a couple of my sites, and when I see this message, it’s always at the same few sites.
Yes, I have seen this behavior with 3 different routerboards in this environment over the last 18 or so months: a CRS, an RB2011 and an RB3011. The entire infrastructure has been replaced at this point with the exception of the structured cabling, which isn’t in the critical path (and has been removed from the environment for extended periods for testing). I have even replaced a handful of it when it was convenient; anything in a chase or conduit has been tested and verified as good, then replaced, tested and verified again for good measure. I have also done testing and replacement of 3 different layer-2 devices, one of which was new in box.
When I drop in an ISC based dhcpd server the problem disappears - even with zero other changes, this has been tested and verified at every change when the RB dhcpd fails. I’ve been extremely systematic and methodical during the testing, no more than one change at a time, collect data, etc.
The environment is pretty simple, too. Like I said, I’m resolved to just use the ISC server at this point. It’s a known quantity, runs on the local resolver and I have code that works with it. It’s not quite as clean but I can deal with it since it buys me managed dhcpdv6 and I’ve lost confidence in the MT implementation.
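For what it’s worth, the “managed” DHCPv6 setup amounts to two small pieces: have the router set the M flag in its RAs, and give ISC dhcpd a subnet6 declaration. A sketch with placeholder prefixes; the RouterOS parameter name is from memory, so verify it against your version:

```
# RouterOS side: set the Managed flag in router advertisements
# (parameter name from memory; check /ipv6 nd print on your version)
/ipv6 nd set [find] managed-address-configuration=yes

# ISC side, e.g. /etc/dhcp/dhcpd6.conf served with `dhcpd -6`
# (placeholder prefix and range)
subnet6 2001:db8:0:8::/64 {
  range6 2001:db8:0:8::1000 2001:db8:0:8::1fff;
}
```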
Thanks – this is a valuable data point. You should send a report outlining these findings to support@mikrotik.com. I’m fairly certain they have no idea what is going on here, as we have never received an explanation for or an acknowledgement of this syndrome.
I have had issues with this in the past, and what I learned is that sometimes the lease time may be a problem. I don’t remember which model exactly, but some HP switches refused to obtain an IP address if the lease time was less than 2h, and one printer required the lease time to be at most 1d. Hope this helps.
That’s interesting information. It wouldn’t be applicable to my own experiences because we use a lease time of four hours (so we can have a reasonably close indication at any time of how many devices are likely to be connected in the household).
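For comparison, the lease time discussed here is a one-line setting on either implementation (the server name is a placeholder):

```
# RouterOS DHCP server -- 4-hour lease (server name is an example)
/ip dhcp-server set [find name="dhcp-lan"] lease-time=4h

# ISC dhcpd equivalent (value in seconds)
default-lease-time 14400;
```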
One other curiosity we have noticed is that the devices generating this message have been preponderantly Apple computers. But our sample size is low, plus our business runs exclusively on Apples and has never experienced this problem locally.
I mostly saw this with embedded devices, predominantly Linux-based, but there is no exclusivity. I saw it on some Apple iOS devices, some Chromebooks, some MacOS laptops and desktops, and one Windows 7 VM. There were a number of static DHCP entries, and those are the ones that had most of the issues, but again, not exclusively. It was a mind-numbingly frustrating issue, especially since IPv6 worked completely flawlessly. I’m a wide-area guy, so dealing with end systems frustrates me pretty quickly; ultimately I ended up killing the MT dhcp service, set up an ISC server running in a Linux LXC container, and was done with it. I may try again since upgrading to 6.38.5, just to see if that changes anything.
Maybe in a wide area network there are different issues. When the 4-packet DHCP exchange does not complete due to packet loss, you will have problems. But in my case there is good connectivity (an indoor WiFi network) and only Apple devices are affected. We also have Chromebooks, Microsoft telephones, Android telephones (Samsung), etc., and they all work fine.
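The 4-packet exchange mentioned above (DISCOVER, OFFER, REQUEST, ACK) is easy to pick out of a capture once you know the fixed BOOTP layout: 236 bytes of header, the 4-byte magic cookie, then options, with option 53 carrying the message type. A minimal sketch in Python that builds and decodes payloads only (nothing is sent on the wire; the field values are illustrative):

```python
import struct

# DHCP message types from RFC 2131/2132, option 53
DHCP_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 5: "ACK"}
MAGIC_COOKIE = b"\x63\x82\x53\x63"

def build_dhcp(msg_type, xid, mac):
    """Build a minimal BOOTP/DHCP payload (no IP/UDP headers)."""
    op = 1 if msg_type in (1, 3) else 2          # BOOTREQUEST vs BOOTREPLY
    fixed = struct.pack(
        "!4BI2H4s4s4s4s16s64s128s",
        op, 1, 6, 0,                             # op, htype=ethernet, hlen, hops
        xid, 0, 0x8000,                          # xid, secs, flags: broadcast bit
        b"\0" * 4, b"\0" * 4, b"\0" * 4, b"\0" * 4,  # ciaddr..giaddr
        mac.ljust(16, b"\0"),                    # chaddr (padded to 16 bytes)
        b"\0" * 64, b"\0" * 128)                 # sname, file
    options = MAGIC_COOKIE + bytes([53, 1, msg_type, 255])  # type TLV + end
    return fixed + options

def msg_type_of(payload):
    """Extract option 53 (message type) from a DHCP payload."""
    opts = payload[240:]   # 236-byte fixed header + 4-byte cookie
    i = 0
    while i < len(opts) and opts[i] != 255:      # 255 = end option
        if opts[i] == 0:                         # pad option, no length byte
            i += 1
            continue
        code, ln = opts[i], opts[i + 1]
        if code == 53:
            return DHCP_TYPES.get(opts[i + 2], "UNKNOWN")
        i += 2 + ln
    return None
```

The broadcast flag (0x8000) in the header is the bit the broadcast/non-broadcast experiments earlier in the thread toggle: it asks the server to broadcast the OFFER instead of unicasting it.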
Packet loss is a problem everywhere, but I am unconvinced that it is the issue since I replaced everything including fiber, twisted pair, ROS devices, patch cables and structured cabling. Dropping in a stand alone dhcp server solved the issue immediately and permanently. I also saw no evidence of packet loss in the tcpdump packet traces I took.
Another data point: upgrading to 6.38.5 seems to cause this behavior to occur pretty reliably for me on multiple platforms. Downgrading to the bugfix release 6.37.5 fixes the issue.
Glad you could resolve your issue. Still, if downgrading to the bugfix release resolved it, it would be nice if you could let support know, so they can look into it and avoid carrying a possible bug into upcoming releases.
Oh no, I wasn’t clear. It just made it easy to reproduce in that I can update to that version and cause the behavior. It didn’t solve it for a number of locations. I still maintain that the dhcp server isn’t up to par. I’m moving almost everything to ISC at this point, which I know well and does quite a bit more. Downside is that it requires external resources to run and adds operational overhead.
Or maybe it is your configuration that is not OK. I use the MikroTik DHCP server on several networks (often 2-4 networks on a single router) without any problem other than the occasional trouble with Apple clients.
Sure, anything’s possible. However, I should note that the data doesn’t support that, since I had it occurring in multiple locations with vastly different configurations. In addition, I tore out the config on the deployment that sparked this post and rebuilt it with nothing more than a pool and a server on one segment, and still saw the issue. Given that I have been building large, complex networks for 20 years, I would have expected to see this before and in other environments if it was in fact one of my configurations causing the issue. Additionally, it only occurs with this particular dhcp implementation, and I have read and re-read the guides to validate my configurations, thinking the same thing: maybe it is my config. I’m pretty confident that it is not.