Still fighting with Ecobee (and losing)

I continue to have Ecobee thermostat disconnects.

I am pretty sure this is NOT a Mikrotik problem, but I am hoping people here can help me figure this out.

The disconnections happen at random times. Sometimes weeks go by without a disconnect; sometimes only days.

The only way to get the Ecobee reconnected after a disconnect is to power cycle it (pull it off the wall and then put it back).

The environment is a Ubiquiti LR6 AP connected to a hEX.

The Ecobees at this one location (8 thermostats in one house) all use a dedicated wifi network (SSID) named ECOBEE.

The ECOBEE network has a DTIM of 4, no minimum data rates, 2.4ghz only, no multicast enhancement, no PMF.

Rebooting the AP or the hEX does not result in reconnection.

The only clue I have is that if I packet sniff on the hEX (filtered by the disconnected Ecobees), I get many of the following:

0000: ff ff ff ff ff ff 44 61  32 b3 7a d1 00 06 00 01  ......Da 2.z.....
0010: af 81 01 02 00 00 00 00  00 00 00 00 00 00 00 00  ........ ........
0020: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ........ ........
0030: 00 00 00 00 00 00 00 00  00 00 00 00              ........ ....

I put this in a packet decoder and it means:

ecobee_b3:7a:d1 → Broadcast LLC I, DSAP NULL LSAP Individual, SSAP NULL LSAP Response

So, the Ecobees appear to be sending packets, and the AP appears to be receiving those packets and putting them on the wire to the hEX.

But that is the extent. I am not well-versed enough in wifi protocol, so I am reluctant to use terms like “connected” but it seems like a connection (used loosely) is not actually established.

Anyone have any idea what is failing?

Do I really have to ask for the config again… not your first rodeo :wink:
I have ecobees, no issues, main router is ccr1009, but not MT aps.

Just another taste of blissfully ignorant youth! :laughing:

Thanks anav, but this is not an MT config issue.

I have a number of other locations with Ecobees and they are fine.

Various forums are full of people reporting, for years, horrible difficulties with Ecobees.

Just hoping the great minds here understand the decoded packet and why the connection doesn’t progress past it.

I’ve also been thinking about getting some Ecobees. Do you use them standalone, or together with something like Home Assistant or another system?

I use them with Home Assistant and have been (generally) happy with them. The combination of Ecobee and HA is extremely powerful.

But, like so many (all?) other technical devices, a little luck and a lot of patience is required.

Here are some pics of how I use them (I’m sure there are other ways also).
Screenshot 2025-02-07 060111.png
Screenshot 2025-02-07 060650.png
Screenshot 2025-02-07 060545.png
Screenshot 2025-02-07 060145.png
Screenshot 2025-02-07 060130.png
Screenshot 2025-02-07 060758.png
Screenshot 2025-02-07 060736.png

Ahh, I don't use them with Home Assistant. They connect to the internet and I use my app to control them.

And that also works well.

But lots of people have these random connection problems. As I do, but only at this one location.

Maybe the S in IOT does not stand for Security, but for Stability. :laughing:

Hi,
I know nothing about these, but my (only) guess is that it might have something to do with DHCP.
Perhaps if it fails to get an IP it eventually stops asking, or something.

Check none of the delay options are configured on the dhcp server.

Set up a script/netwatch to periodically ping/arp ping one or more of the devices. Maybe emailing when one fails.
Does the failure have any relationship to the last DHCP accepted time for that device, or to any DHCP requests by the device?

I assume no DHCP options are present trying to push devices onto an alternate vlan (like you can with Phones)

You clearly know far more than I do about DHCP server options.

It very well could be some sort of DHCP timeout, on the Ecobee side, that then prevents it from asking again.

The DHCP server in question is the hEX at the site, that the AP is wired to.

I could set the IP of each ecobee (PITA, but well-designed experiment to rule out a DHCP-related problem). It just means going around to all 8 tstats at a location 100 miles away.

I also thought of adding a Mikrotik AP for use just by the Ecobees – this would eliminate the Ubiquiti AP (which is “controlled” by a remote UDM controller) as the problem.

My bet, however, is that the frames being received from the Ecobee hold the answer – or at least the clue.

Someone, somewhere, must have seen a situation where a device repeatedly sends out those frames and does not get an answer – and, from that information, knows what the problem is.

The frame I am referring to is:

ecobee_b3:7a:d1 → Broadcast LLC I, DSAP NULL LSAP Individual, SSAP NULL LSAP Response

Is DHCP snooping enabled on any of the Ubiquiti gear? It might cause this.
(Apparently it may be a default in some of their gear)

Good idea, but no, DHCP snooping is not enabled.
Screenshot 2025-02-12 185214.png

A few months ago I ran into my ecobee 3 lite deciding to not connect to my wapAC (or AX later).

After trying many, many things I ended up using different SSIDs for 2.4 and 5 GHz. (Writing this here, maybe using an ACL and blocking the ecobee from the 5 GHz radio would have worked…).

I never figured out the exact reason why it wasn’t working anymore, but I think either the ecobee or routerOS has a mixup and the client doesn’t get registered properly. It is indeed very odd that even a reboot of both devices sometimes doesn’t fix it.

Side note: Other times my ecobee dropped out was with a furnace filter that was too clogged… It seems to cause a power dip and the ecobee resets.

I’m also using the ecobee local only through home assistant.

I implemented a separate, dedicated 2.4ghz wifi network for use exclusively by the Ecobee tstats in a house and it improved things tremendously, but not entirely.

Next on the troubleshooting list is to remove DHCP from the Ecobee’s environment – that is, assigning static (local, private) IP addresses to each Ecobee.

From all I read, the Ecobee wifi implementation is just not great. (Yes, many, many people use it without difficulty.)

I’m quite confident the problem is not Mikrotik’s (or any other specific manufacturer’s) implementation. However, the tweaking of the less-well-known parameters of each manufacturer’s wifi implementation probably plays a role in the Ecobee’s reliability.

I think normal, expected hiccups such as power spikes, brown and black outs, momentary wifi disconnections/drops (due to any number of reasons), band or AP roaming/changing, etc., are not handled (or, recovered from) well. And, we therefore have to control as many of these “events” as possible.

I have on my (long-term) to-do list using automated Shelly switches to power-cycle (reboot) the ecobees when their connection drops (for some length of time).

Resurrecting this old thread that pertains to an ongoing issue.

I hope this is helpful to others.

Ecobee WiFi Disconnect Research

Environment

16 ecobee smart thermostats are deployed across 6 locations.

The locations use a mix of MikroTik routers and access points (hAP ax3, hEX, RB5009, CRS326, Cubes, wAPs, NetMetal, cAP ax) as well as Ubiquiti UniFi equipment (UDM, LR6, AC Mesh) for networking.

All ecobees connect via 2.4GHz WiFi and maintain persistent TLS connections to the ecobee.com cloud (161.38.184.18:8190) for remote control and monitoring.

Home Assistant integrates all thermostats via the ecobee cloud API.

For years, ecobees across multiple properties have been randomly losing their WiFi connections and failing to reconnect without a physical power cycle. The disconnects appeared random -- different locations, different times, no obvious pattern. Extensive troubleshooting had ruled out WiFi configuration (after what feels like every possible combination of every single setting), signal strength, and AP compatibility.

I continued to investigate because the burden of physically power cycling the thermostats, as well as randomly losing connectivity, is substantial.

In the ongoing attempt to understand what is causing this, I built a continuous packet-capture pipeline that catches and saves the 60 minutes of network traffic preceding each disconnect.

The discovery is fascinating.

Root Cause: TCP SYN Scan Crashes Ecobee WiFi SoC

Ecobee thermostats use an embedded WiFi System-in-Package for 2.4GHz 802.11 b/g/n connectivity. The ecobee4 generation is confirmed to use a Qualcomm Atheros-based WiFi radio; the specific chip is unverified but likely an AR4100 or QCA4002 based on the feature profile (2.4GHz only, SPI interface, no BLE, no WPA3). Newer ecobee models (Smart Thermostat Premium, ECB601) use a MediaTek WiFi/Bluetooth IC instead.

The exact chip matters less than the class of device: all are low-power embedded IoT WiFi SoCs with minimal RAM (typically 128-512KB), lightweight TCP stacks, and designs optimized for a handful of concurrent connections.

A full-port SYN scan sending 65,535 packets at 1,500/second is orders of magnitude beyond the design envelope of any such chip, regardless of manufacturer.

When this WiFi SoC receives a high-rate TCP SYN scan (port scan), its IP stack crashes. The device enters a state we can call "wedged" wherein:

  • The WiFi radio stays associated to the AP (still transmitting LLC/XID frames)
  • The IP stack is completely dead (no ARP replies, no TCP, no UDP)
  • The ecobee.com cloud continues to report "connected" using stale cached data
  • HA may continue showing temperature readings from cached cloud data
  • The device will NOT recover on its own -- requires a physical power cycle

The Incident

On 2026-05-03, my monthly scheduled (and on-demand) from-the-inside-of-the-lan full nmap scan ran: nmap -sS -p- -T4 --min-rate 1500

This scans all 65,535 TCP ports per host at 1,500+ SYN packets/second. Four ecobees crashed within minutes of being scanned.

The pcap archives show the exact SYN flood hitting each ecobee immediately before it went silent. Ecobee-1 capture shows the transition in detail: normal TLS traffic to peer ecobees on port 1201, then XID LLC frames begin appearing interspersed with IP traffic, then a burst of mDNS re-discovery queries for all peer ecobees, then pure LLC XID frames only.

Ecobee Failure Signature in Packet Captures

A healthy ecobee produces:

  • TCP traffic to 161.38.184.18:8190 (ecobee.com cloud, persistent TLS session)
  • TCP traffic to peer ecobees on port 1201 (local mesh, TLS encrypted)
  • Periodic mDNS announcements (_ecobee._tcp.local)
  • ARP replies

A wedged ecobee produces:

  • LLC XID frames only (Basic Format, Type 1 LLC, Class I, Window Size 1)

  • Every 6-30 seconds, irregular intervals

  • No IP traffic of any kind

  • No ARP replies

  • WiFi association maintained (AP still sees the client)
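
For anyone who wants to check a capture for this signature by hand, here is a minimal Python sketch that decodes the 802.2 LLC header of the frame posted at the top of this thread. The function name and the returned field names are made up for illustration and are not part of the pipeline; the byte layout assumed is the standard LLC XID basic format.

```python
import struct

# First 20 bytes copied from the hex dump earlier in this thread; the rest of
# the 60-byte frame is zero padding.
FRAME = bytes.fromhex(
    "ffffffffffff" "446132b37ad1" "0006" "0001af" "810102"
) + bytes(40)


def parse_llc_xid(frame: bytes):
    """Return a dict describing an 802.2 LLC XID frame, or None if it is not one."""
    if len(frame) < 20:
        return None
    dst, src = frame[0:6], frame[6:12]
    (length,) = struct.unpack("!H", frame[12:14])
    if length > 1500:            # larger values are EtherTypes (IP, ARP, ...), not 802.3 lengths
        return None
    dsap, ssap, control = frame[14], frame[15], frame[16]
    if control & 0xEF != 0xAF:   # XID is 0xAF, or 0xBF with the poll/final bit set
        return None
    xid_format, xid_types, window_byte = frame[17], frame[18], frame[19]
    return {
        "src": ":".join(f"{b:02x}" for b in src),
        "broadcast": dst == b"\xff" * 6,
        "dsap": dsap & 0xFE,                  # 0x00 = NULL LSAP
        "ssap": ssap & 0xFE,                  # 0x00 = NULL LSAP
        "is_response": bool(ssap & 0x01),     # low bit of SSAP = command/response
        "basic_format": xid_format == 0x81,
        "types": xid_types,                   # 0x01 = Type 1 LLC (Class I)
        "window": window_byte >> 1,           # window size sits in the upper 7 bits
    }


print(parse_llc_xid(FRAME))
```

A frame with an EtherType in the length field (for example 0x0800 for IPv4) returns None, so the same function can be used to separate "LLC XID only" traffic from ordinary IP traffic.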

Ecobee Hardware Details

  • Local mesh: TCP port 1201 between peer ecobees on same subnet

  • mDNS service: _ecobee._tcp.local

How to Determine if an Ecobee is Connected

Do NOT rely on:

Home Assistant state: HA gets ecobee status from the ecobee.com cloud API. The cloud caches connection state and is unreliable in BOTH directions:

  • May report "connected" with temperature readings for hours after an ecobee has actually crashed. On 2026-05-03, HA reported two different ecobees as connected with live temperatures when both had been in the wedged LLC-only state.
  • May briefly report "unavailable" for ecobees that are actually connected. On 2026-05-04, HA triggered ha_unavailable alerts for two ecobees while both were confirmed connected via conntrack. These are cloud API glitches.

The HA listener is still useful as ONE trigger for archiving pcap data, but its state should never be used as the sole determination of ecobee connectivity.

ICMP ping: Ecobees may not respond to ICMP ping even when fully connected. This is normal behavior for the ecobee WiFi SoC, not an indication of a problem.

Reliable methods:

MikroTik ARP table (most reliable across all sites):

/ip/arp/print where mac-address~"44:61:32"

Interpretation:

  • Status "reachable" or "stale" WITH a MAC address = ecobee has working IP stack
  • Status "failed" or entry missing/no MAC = ecobee IP stack is dead, needs power cycle
  • ARP entries expire naturally, so absence alone is not proof of failure -- but "failed" status IS proof (the MT actively tried to reach the ecobee and got no response)

MikroTik connection tracking (definitive when visible):

/ip/firewall/connection/print where dst-address~"161.38.184.18"

Interpretation:

  • An "established" TCP connection to 161.38.184.18:8190 = ecobee is definitely connected to the cloud right now. The flags "SAC" (Seen-reply, Assured, Confirmed) prove bidirectional traffic.
  • HOWEVER: on MTs with fasttrack/hw-offload enabled, established connections are offloaded to hardware and do NOT appear in the connection table. Empty conntrack on these devices is inconclusive.
  • On some devices, the hardware switch chip means traffic is routed through CPU, so conntrack entries ARE visible.
  • On some devices, conntrack entries are visible (confirmed working).

CRITICAL: Check conntrack on the ROUTING device, not the AP.
Some devices might be bridge/AP only -- they do not route or NAT ecobee traffic -- so conntrack needs to be checked on the actual gateway/NAT device.

/ip/firewall/connection/print where dst-address~"161.38.184.18" and src-address~"<ecobee-address>"

The same hardware-switch-bypass principle applies to the TZSP sniffer on these devices: it only sees mDNS (multicast) from the ecobee, not unicast cloud traffic, because the switch chip forwards unicast frames to the upstream port without involving the CPU. This means a pcap capture from such a device's sniffer shows only mDNS even when the ecobee is fully connected -- do not interpret mDNS-only pcap traffic as a sign of failure on these devices.

Summary decision tree:

  1. Check ARP -- if "failed" or missing with no MAC -> DEAD, power cycle
  2. Check conntrack -- if "established" to 161.38.184.18 -> ALIVE
  3. If conntrack empty and fasttrack enabled -> check ARP only
  4. HA status -> use as supplementary info only, never as sole determination
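
The decision tree can be written down as a small function. This is only a sketch: the function name, its arguments, and the three return labels are my own for illustration and are not the contents of ecobee_status.sh. One deliberate choice: a missing ARP entry is reported as INCONCLUSIVE rather than DEAD, because ARP entries expire naturally and absence alone is not proof.

```python
from typing import Optional


def classify_ecobee(arp_status: Optional[str], has_mac: bool,
                    conntrack_established: bool) -> str:
    """Combine ARP and conntrack observations into ALIVE / DEAD / INCONCLUSIVE.

    arp_status is the RouterOS ARP status string ("reachable", "stale",
    "failed", ...) or None when there is no entry at all.
    """
    # 1. The router actively tried to resolve the ecobee and got no answer.
    if arp_status == "failed":
        return "DEAD"
    # 2. An established connection to the cloud is definitive when visible.
    if conntrack_established:
        return "ALIVE"
    # 3. Conntrack may be empty because of fasttrack, so fall back to ARP only.
    if arp_status in ("reachable", "stale") and has_mac:
        return "ALIVE"
    # 4. Nothing conclusive; HA state is supplementary and not consulted here.
    return "INCONCLUSIVE"


print(classify_ecobee("failed", False, False))
print(classify_ecobee("stale", True, False))
print(classify_ecobee(None, False, False))
```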

Protections Against Future SYN Scan Crashes

Layer 1: MikroTik Firewall (defense in depth)

Firewall rules on each ecobee-serving MT rate-limit inbound SYN packets to 5/second with a burst of 10. At 300 SYNs per minute, this is 150-300x more than an ecobee needs in normal operation (1-2 new TCP connections per minute) but kills any port scan dead.

Each MT has:

  • Address list "ecobees" containing the local ecobee IP(s)
  • Forward chain rule: accept SYN to ecobees at 5/sec burst 10
  • Forward chain rule: drop excess SYNs to ecobees
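
To get a feel for what "5/sec burst 10" does to a scan, here is a rough Python simulation. It assumes the limiter behaves like a plain token bucket (refill at 5 tokens/second, capacity 10); that is my simplification for illustration, not a statement about how RouterOS implements it. The function name and parameters are hypothetical.

```python
def syns_accepted(n_packets: int, interval_s: float,
                  limit_per_s: float = 5.0, burst: int = 10) -> int:
    """Count how many evenly spaced SYNs a token bucket lets through."""
    tokens = float(burst)            # bucket starts full
    accepted = 0
    for _ in range(n_packets):
        if tokens >= 1.0:            # a token is available: accept this SYN
            tokens -= 1.0
            accepted += 1
        # Refill for the gap until the next packet, never above the burst size.
        tokens = min(float(burst), tokens + limit_per_s * interval_s)
    return accepted


normal = syns_accepted(20, 30.0)          # 2 new connections/minute for 10 minutes
scan = syns_accepted(65535, 1 / 1500)     # one full-port scan of one host at 1,500/s
print(f"normal: {normal}/20 accepted, scan: {scan}/65535 accepted")
```

Under these assumptions every normal connection passes, while only a couple of hundred of the 65,535 scan SYNs get through, spread over the roughly 44 seconds (65,535 / 1,500) that scanning one host takes.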

Layer 2: Scan Script Exclusion

The scheduled scan should explicitly exclude all ecobee devices by IP.

Layer 3: Reduced Scan Aggressiveness

The monthly scan was changed from:

  • -T4, --min-rate 1500 (extremely aggressive, ~4 hours)

To:

  • -T3, --max-rate 300, --scan-delay 500ms (gentle, ~8-12 hours)
  • Scheduled 10pm UTC on the 1st of each month (was 3am)
  • Cron on dos-checker: 0 22 1 * * ...

This protects ALL devices on the network, not just ecobees. The original rate was aggressive enough to destabilize MikroTik SSH services across multiple sites.

Packet Capture Pipeline

Architecture

The pipeline runs on a linux LXC in Proxmox (named: "dos-checker") and captures ecobee WiFi traffic from two types of sources:

TZSP (MikroTik /tool/sniffer): The MT filters traffic by ecobee MAC and streams raw Ethernet frames via TZSP (UDP/37008) to dos-checker. Used for sites where the MT device is the WiFi AP. (Except where the MT device is a simple bridge.)

UniFi SSH tcpdump: An app on dos-checker SSHs into the UniFi APs and pipes the tcpdump output back.

In a mixed environment (UniFi APs and an MT router), the MT router might not be able to sniff Ethernet traffic because its hardware switch chip forwards frames in silicon, bypassing the CPU entirely.

Rolling Buffer

Each source writes 60-second pcap files, keeping 60 files = 60 minutes of history. On disconnect, the archive script copies the entire buffer plus creates a MAC-filtered merged pcap for quick analysis.
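
The rolling part is simple to sketch. The snippet below assumes the pcap files have names that sort chronologically (for example a timestamp or counter in the filename); that naming and the function name are assumptions for illustration, not the actual receiver code.

```python
from pathlib import Path


def prune_rolling_buffer(directory: Path, keep: int = 60) -> list:
    """Delete all but the newest `keep` pcap files; return the removed names."""
    # Assumption: filenames sort in capture order, so the oldest come first.
    files = sorted(directory.glob("*.pcap"), key=lambda p: p.name)
    stale = files[:-keep] if keep > 0 else files
    for path in stale:
        path.unlink()
    return [p.name for p in stale]
```

Called after each 60-second rotation, this keeps one directory per source at 60 files, i.e. 60 minutes of history.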

Disconnect Detection (two independent triggers + SYN flood alert)

  1. HA WebSocket listener: Subscribes to state_changed events. When any ecobee entity transitions to "unavailable", triggers an archive.

  2. Activity monitor: Scans pcap files every 60 seconds, counting IP frames per ecobee MAC. If a previously-active MAC goes silent for >15 minutes, triggers an archive. This catches the "wedged but RF-active" state that HA and the cloud miss.

  3. SYN flood detector: Integrated into the activity monitor. Each scan cycle checks for TCP SYN packets destined to ecobee MACs with more than 10 unique destination ports. Normal ecobee traffic hits only 2-3 ports (8190 cloud, 1201 mesh). A port scan hits hundreds. If detected, fires an immediate email alert BEFORE the ecobee crashes. This catches scans that bypass the MT firewall (e.g., from the same L2 subnet as the ecobee).
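
The two pcap-based triggers boil down to very little logic. Here is a hedged sketch that works on already-parsed records instead of pcap files; the function names, the input shapes, and the example MAC/ports are illustrative assumptions, not the contents of activity_monitor.py.

```python
from collections import defaultdict

SYN_PORT_THRESHOLD = 10      # more than 10 unique destination ports looks like a scan
SILENCE_SECONDS = 15 * 60    # no IP frames for more than 15 minutes looks wedged


def detect_syn_scans(syn_packets, threshold=SYN_PORT_THRESHOLD):
    """syn_packets: iterable of (dst_mac, dst_port) for TCP SYNs seen in one cycle."""
    ports = defaultdict(set)
    for dst_mac, dst_port in syn_packets:
        ports[dst_mac].add(dst_port)
    return {mac for mac, seen in ports.items() if len(seen) > threshold}


def detect_silent(last_ip_seen, now, limit=SILENCE_SECONDS):
    """last_ip_seen: dict of MAC -> timestamp of the last IP frame from that MAC."""
    return {mac for mac, ts in last_ip_seen.items() if now - ts > limit}


mac = "44:61:32:b3:7a:d1"
normal = [(mac, 8190), (mac, 1201)] * 5
scan = [(mac, port) for port in range(1, 200)]
print(detect_syn_scans(normal), detect_syn_scans(scan))
```

Counting unique ports rather than raw SYNs is what keeps a burst of legitimate reconnects to the same two or three ports from raising an alert.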

Archive Output

Each archive contains:

  • metadata.txt (entity, MAC, trigger reason, timestamp)
  • source_/ecobee_.pcap (raw per-source pcap files, full 60-min buffer)
  • filtered_MACADDR.pcap (merged, MAC-filtered pcap for quick review)

Archives are saved under /root/drop/ecobee_telemetry/archive/.

Key Files on dos-checker

File                                                          Purpose
/root/ecobee_telemetry/tzsp_receiver.py                       TZSP UDP listener, writes rotating pcaps
/root/ecobee_telemetry/ha_disconnect_listener.py              HA WebSocket, triggers archive
/root/ecobee_telemetry/activity_monitor.py                    IP silence detector, triggers archive
/root/ecobee_telemetry/archive_pcap.py                        Copies buffer + builds filtered pcap
/root/ecobee_telemetry/unifi_pcap_streamer.sh                 SSH tcpdump to UniFi APs
/root/ecobee_telemetry/health_monitor.sh                      15-min cron, checks everything
/root/ecobee_telemetry/apply_ecobee_fw.sh                     Applies MT firewall rules
/root/ecobee_telemetry/setup_mt_sniffer.sh                    Configures MT TZSP sniffer
/root/drop/ecobee_telemetry/rolling/                          Live pcap buffers (8 subdirs)
/root/drop/ecobee_telemetry/archive/                          Disconnect archives
/root/drop/ecobee_telemetry/disconnect_log/disconnects.jsonl  Event log
/root/drop/ecobee_telemetry/health_monitor.log                Health check log

Services (all enabled, Restart=always)

ecobee-tzsp-receiver
ecobee-ha-listener
ecobee-activity-monitor
ecobee-unifi-371-90
ecobee-unifi-371-93
ecobee-unifi-355-shop
ecobee-unifi-355-white

Monitoring Commands

Definitive per-ecobee status:
bash /root/ecobee_status.sh

This script checks each ecobee via the most reliable method available for its site: conntrack, ARP (where fasttrack hides conntrack), or pcap IP frame analysis.

Check for new disconnects:
tail -f /root/drop/ecobee_telemetry/disconnect_log/disconnects.jsonl

Check pipeline health:
tail -50 /root/drop/ecobee_telemetry/health_monitor.log

Check all service status (on dos-checker):
systemctl is-active ecobee-{tzsp-receiver,ha-listener,activity-monitor,unifi-371-90,unifi-371-93,unifi-355-shop,unifi-355-white}

Hard-Won Lessons

  • mergecap snaplen mismatch: TZSP pcaps have snaplen 65535, UniFi have 2048. mergecap produces pcapng with mixed snapshot lengths. tcpdump 4.99.3 silently fails to read these (0 packets). Fix: filter each source file individually with tcpdump, then merge only the filtered results.

  • Certain MT hardware switch chips: The RB750Gr3 forwards Ethernet in silicon, so /tool/sniffer cannot be used for local traffic on it. Capture at the UniFi APs instead.

  • UniFi SSH rate limiting: BusyBox sshd on UniFi APs blocks source IPs that reconnect too rapidly. Fix: exponential backoff in the streamer (10s->60s cap).

  • tcpdump -Z privilege drop: Local tcpdump drops to user 'tcpdump' after first file rotation, losing write access to /root paths. Fix: -Z root flag.

  • Activity monitor CPU: Original per-MAC scan ran ~1,200 tcpdump invocations per 60s cycle (15 MACs x ~80 files). Fix: single-pass scan per file, count all MACs in output. Reduced to ~80 invocations, 1.4s scan time.

  • Duplicate logging: systemd StandardOutput=append and Python manual file writes to the same log file caused every line to appear twice. Fix: removed manual file writes from Python, let systemd handle it.

  • RouterOS password-authentication=yes-if-no-key: If an SSH key is imported for a user, password auth is completely disabled for that user. If key auth then fails (network issues, key mismatch), you're locked out entirely. Fix: set password-authentication=yes on all MTs so password always works as fallback.

  • Bridge, not a router: For MT devices that are configured as bridges (and not routers), ecobee traffic to the cloud is handled at the router/gateway. The bridge does not route or NAT, so check cloud connectivity at the gateway/router. The TZSP sniffer on a bridge-only device will only see mDNS, not cloud traffic.

  • Ecobee post-reboot mDNS-only state: After a power cycle, an ecobee may spend time in mDNS discovery mode (_ecobee._tcp.local queries) before establishing its cloud connection. This is normal boot behavior, not a failure. Do not confuse mDNS-only traffic with a wedged state -- wedged ecobees produce LLC/XID frames, not mDNS.

  • HA unavailable false alarms: The HA WebSocket listener fires on every "unavailable" transition, but the ecobee cloud API occasionally reports brief unavailable states for ecobees that are actually connected. These generate disconnect log entries and email alerts that are false positives. Always verify via conntrack/ARP before concluding an ecobee is actually down.
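
For the UniFi SSH rate-limiting lesson, the backoff schedule is the only non-obvious part. The sketch below shows one way to produce it in Python; the doubling factor is an assumption (only the 10s start and 60s cap come from the notes above), and the real streamer is a shell script, so treat this as an illustration of the idea rather than the actual code.

```python
def backoff_delays(initial: float = 10.0, cap: float = 60.0, factor: float = 2.0):
    """Yield reconnect delays in seconds: initial, then multiplied by factor, capped."""
    delay = initial
    while True:
        yield delay
        delay = min(cap, delay * factor)


delays = backoff_delays()
print([next(delays) for _ in range(5)])   # [10.0, 20.0, 40.0, 60.0, 60.0]
```

After a session that stays up for a while, create a fresh generator so the next failure starts again at the initial delay.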