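My home server went down early in the morning, so I investigated; Claude Opus 4.7 found a NetworkManager bug in RHEL 10.1, but the place to report it is so inconvenient that I lost my temper and am just releasing it here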

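Well, the title already explains everything, but here goes: I tried to open this blog on my iPad from bed and it would not load. I got up and checked Grafana, which showed Down. And yet the site was reachable from the Mac Studio over Ethernet, which was baffling. I logged in to the secondary DNS server, and neither curl https://rio.st/ nor dig @rio.st rio.st could reach it.

The home server has two network paths, a 10G NIC and Wi-Fi. The Wi-Fi is enabled only so the server can sync to the NTP server built into Starlink.

So I started digging with Claude Opus 4.7, and we arrived at the conclusion below.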

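Conclusion

NetworkManager's commit path, nm_platform_ip_route_sync(), does not look at RTNH_F_LINKDOWN/RTNH_F_DEAD in its semantic comparison. As a result it treats a leftover route that the kernel has marked dead as equal to the configured route, hits continue;, and skips re-adding it. RHEL 10.1 aside, the bug also reproduces upstream.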

Summary

NetworkManager silently fails to install the IPv4 default route after
nmcli con up on a wired profile when the underlying driver
(observed: Marvell/Aquantia AQC107, atlantic) asserts a transient
linkdown during the UP transition. NM logs
"policy: set <conn> as default for IPv4 routing and DNS", and its
internal l3cfg state shows the route as "nm-configured, in-platform",
but the kernel has no matching default route in ip route output.
Outbound IPv4 from the host then fails with "Network is unreachable".
Inbound traffic still works via on-link routes, which hides the
regression from naive reachability checks.

Affected versions

  • NetworkManager-1.54.0-3.el10_1.x86_64 (RHEL 10.1, current).
  • The buggy code path (nm_platform_ip_route_sync() in
    src/libnm-platform/nm-platform.c) is unchanged in upstream
    NetworkManager main (1.58 dev) and in 1.56.x, so all releases
    from at least 1.54 through 1.58 are expected to be affected on
    hardware that exhibits transient carrier flaps during link UP.
  • Earlier versions were not specifically tested but are likely
    affected too: the responsible code (semantical route comparison
    ignoring RTNH_F_LINKDOWN) predates 1.54.

Affected hardware / driver

  • NIC: Aquantia AQC107 (10GbE)
  • Driver: atlantic, kernel 6.12.0-124.52.1.el10_1.x86_64,
    firmware-version 1.5.44
  • The specific driver matters only in that it produces the transient
    linkdown. The NM-side bug is generic: any driver that briefly
    asserts linkdown during a UP transition on an interface whose
    NM-configured connection has a default route can trigger this.

Severity rationale

When this hits, the host can no longer make outbound IPv4
connections. On a server whose role includes mail submission to
external relays, reaching the secondary DNS at 160.251.47.244,
blackbox HTTP probing of public service URLs, and outbound
Let's Encrypt ACME requests, the outage is total for IPv4 outbound
until manual recovery. Inbound traffic continues to work, so external
reachability checks that only verify "the site responds" do not catch
it; the operator typically notices via Grafana / blackbox / ACME
failures, often hours later.

Steps to reproduce

  1. RHEL 10.1 host with NetworkManager 1.54.0-3.el10_1 and an
    atlantic-driven NIC.
  2. NetworkManager keyfile profile "Wired connection 2" with
    ipv4.method=manual, an explicit ipv4.gateway=<lan-gw>, and
    ipv4.route-metric=-1 (auto, the default). The bug is also
    reproducible with ipv4.route-metric=100 explicitly set.
  3. Run:
   sudo nmcli general logging level TRACE \
       domains PLATFORM,IP4,DEVICE,CORE,DISPATCH
   sudo nmcli con up "Wired connection 2"
  4. Immediately check:
   ip -4 route show default
   ip -4 route get 1.1.1.1
   curl -4 -m 5 https://api.ipify.org/

Actual result

$ ip -4 route show default
(empty)
$ ip -4 route get 1.1.1.1
RTNETLINK answers: Network is unreachable
$ curl -4 -m 5 https://api.ipify.org/
curl: (7) Failed to connect to api.ipify.org port 443 after 4 ms:
        Could not connect to server

The NetworkManager journal (TRACE) shows the activation completing
successfully, the policy decision setting the wired connection as
default for IPv4 routing, the platform layer first attempting to
delete an old default with rt-src rt-static rtm_flags linkdown and
getting failure 3 (No such process) (ESRCH), l3cfg pruning the
route as a zombie, and then NM concluding that the route is
in-platform again — without ever emitting an RTM_NEWROUTE for
0.0.0.0/0.

Representative trace excerpt (timestamps and obfuscated pointers
trimmed):

device (enp2s0): state change: ip-config -> ip-check
policy: set 'Wired connection 2' (enp2s0) as default for IPv4 routing and DNS
l3cfg: obj-state: now zombie:
        [ip4-route, type unicast 0.0.0.0/0 via 192.168.1.1 dev 3
         metric 100 mss 0 rt-src user],
        zombie[5], nm-configured, in-platform
platform-linux: event-notification: RTM_NEWROUTE flags multi,dump_filtered:
        type unicast 0.0.0.0/0 via 192.168.1.1 dev 3 metric 100 mss 0
        rt-src rt-boot scope global
l3cfg: obj-state: zombie pruned during reapply: [...]
platform: (enp2s0) ip4-route: delete type unicast 0.0.0.0/0 via 192.168.1.1
        dev 3 metric 100 mss 0 rt-src rt-static rtm_flags linkdown scope global
platform-linux: do-delete-ip4-route[...rtm_flags linkdown...]:
        failure 3 (No such process), meaning the object was already removed
platform: (enp2s0) ip4-route: delete type unicast 0.0.0.0/0 via 192.168.1.1
        dev 3 metric 100 mss 0 rt-src rt-boot scope global
platform-linux: do-delete-ip4-route[...rt-src rt-boot...]:
        failure 3 (No such process), meaning the object was already removed
l3cfg: obj-state: prune zombie: [...rt-src user...]
        zombie[3], nm-configured, in-platform
l3cfg: obj-state: update: [...0.0.0.0/0 via 192.168.1.1 dev 3 metric 100...]
        nm-configured, in-platform (static: 1, dynamic: 0)
l3cfg:    route4[0]: [DEFAULT] type unicast 0.0.0.0/0 via 192.168.1.1
                     dev 3 metric 100 mss 0 rt-src user
platform-linux: do-add-ip4-route[type unicast 192.168.1.1/32 dev 3
                                  metric 100 ...]: success
policy: set 'Wired connection 2' (enp2s0) as default for IPv4 routing and DNS

Note that the only do-add-ip4-route[...]: success after the policy
decision is the 192.168.1.1/32 on-link route to the gateway. There
is no do-add-ip4-route[type unicast 0.0.0.0/0 ...] line. NM
considers the default route done.
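
The missing add can also be checked mechanically instead of by eye. The following is a minimal sketch, assuming the TRACE log lines have the do-add-ip4-route[...]: success shape shown in the excerpt above; count_default_route_adds is a hypothetical helper name, not part of NetworkManager.

```shell
#!/bin/sh
# Count successful IPv4 default-route adds in a NetworkManager TRACE log
# read from stdin. The pattern is taken from the trace excerpt in this
# report; count_default_route_adds is a hypothetical helper name.
count_default_route_adds() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the helper usable under "set -e".
    grep -c 'do-add-ip4-route\[type unicast 0\.0\.0\.0/0 .*: success' || true
}

# Example (run by hand after reproducing the failure):
#   journalctl -u NetworkManager -b --no-pager | count_default_route_adds
```

A count of 0 after an activation that logged the "policy: set ... as default" line matches the failure described here; a healthy activation should show at least one such add.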

Expected result

After nmcli con up "Wired connection 2" returns success, the
kernel routing table contains an effective IPv4 default route via
192.168.1.1 on enp2s0, and outbound IPv4 from the host works.

Root cause analysis

nm_platform_ip_route_sync() in
src/libnm-platform/nm-platform.c iterates over the configured
routes and, for each one, looks up the corresponding entry in the
NM platform cache:

plat_entry = nm_platform_lookup_entry(self, NMP_CACHE_ID_TYPE_OBJECT_TYPE, conf_o);
if (plat_entry) {
    const NMPObject *plat_o = plat_entry->obj;

    if (vt->route_cmp(NMP_OBJECT_CAST_IPX_ROUTE(conf_o),
                      NMP_OBJECT_CAST_IPX_ROUTE(plat_o),
                      NM_PLATFORM_IP_ROUTE_CMP_TYPE_SEMANTICALLY) == 0)
        continue;
    ...
}

The SEMANTICALLY comparator in nm_platform_ip4_route_cmp()
compares only the bits RTM_F_CLONED | RTNH_F_ONLINK of
r_rtm_flags:

case NM_PLATFORM_IP_ROUTE_CMP_TYPE_SEMANTICALLY:
    ...
    NM_CMP_DIRECT(a->r_rtm_flags & (RTM_F_CLONED | RTNH_F_ONLINK),
                  b->r_rtm_flags & (RTM_F_CLONED | RTNH_F_ONLINK));

RTNH_F_LINKDOWN and RTNH_F_DEAD are not part of this mask, and
intentionally so for the comparator's main use cases (those bits are
kernel-side runtime state, not route identity). However, in the
route_sync() skip path the consequence is that an NM-configured
route compares semantically equal to a kernel cache entry that the
kernel has marked dead.

When atlantic (or any driver that briefly asserts linkdown on
UP) causes the kernel to set RTNH_F_LINKDOWN on the previously
installed default route, that dead entry stays in both the kernel's
FIB cache and NM's platform cache. On the next commit,
route_cmp(SEMANTICALLY) == 0 triggers the continue;, and NM
never re-adds the route. The kernel does not use a dead-flagged
nexthop for forwarding, so the host loses outbound IPv4.

The earlier do-delete-ip4-route calls in the trace return ESRCH
because by the time NM tries to delete the dead remnant, the kernel
has already cleaned it up internally during the link transition; the
RTM_NEWROUTE that should follow is the one NM never emits.

This is not a kernel/driver bug. The kernel's behaviour
(marking nexthops dead on linkdown, and not using them) is correct
and load-bearing. The bug is in NetworkManager's commit path
believing that the dead entry satisfies the configured route.
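
While the link transition is in progress, the flagged route may be visible from user space, because iproute2 prints nexthop flags such as linkdown and dead as keywords in its route listing. The following is a minimal sketch for spotting such default routes, assuming the usual ip -4 route show text format; flagged_default_routes is a hypothetical helper name, and since the window is transient the helper may well print nothing on a given run.

```shell
#!/bin/sh
# Print IPv4 default routes that carry the "linkdown" or "dead" keyword.
# Reads "ip -4 route show" style text on stdin. flagged_default_routes is
# a hypothetical helper name; the keyword placement is an assumption about
# iproute2's text output, not something taken from the trace in this report.
flagged_default_routes() {
    # Match lines that start with "default" and contain the flag as a
    # whole word; "|| true" keeps the exit status 0 when nothing matches.
    grep -E '^default( .*)? (linkdown|dead)( |$)' || true
}

# Example (run by hand during or right after nmcli con up):
#   ip -4 route show | flagged_default_routes
```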

Suggested fix

In nm_platform_ip_route_sync(), when the configured and platform
routes compare semantically equal, additionally check the platform
copy's r_rtm_flags. If RTNH_F_LINKDOWN or RTNH_F_DEAD is set,
do not continue; — fall through to the existing
delete-then-NMP_NLM_FLAG_APPEND path so the route is re-added with
fresh, non-dead nexthop state. The existing EEXIST handling in
the nm_platform_ip_route_add() follow-up already covers the
"another agent reinstalled it between delete and add" race.

The semantical comparator itself should not be changed. Adding
RTNH_F_LINKDOWN / RTNH_F_DEAD to the SEMANTICALLY mask would
ripple into route hashing and other comparator users that
intentionally treat those flags as transient.

A patch implementing this fix is attached as
0001-platform-re-add-routes-that-kernel-marked-RTNH_F_LIN.patch
(see also
upstream-patches/NetworkManager/0001-platform-re-add-routes-that-kernel-marked-RTNH_F_LIN.patch
in the local repo). The patch is ~22 lines, fully contained in
nm_platform_ip_route_sync(), and emits a single TRACE log line
when it forces a re-add, so the new behaviour is observable in the
existing logging facility without further changes.

Operational mitigation in use

Until a fixed NetworkManager package is available, the affected host
runs a small Ansible-managed watchdog under
new_server/ansible-playbooks/network/:

  • /usr/local/sbin/ensure-default-route.sh: idempotent helper that
    re-adds "default via <lan-gw> dev <iface> metric 100" only when
    no IPv4 default route is present. Never overrides an existing
    default route.
  • ensure-default-route.service: oneshot at boot.
  • ensure-default-route.timer: OnCalendar=*:0/5,
    Persistent=true, AccuracySec=30s. Re-checks every 5 minutes
    so a missing default route is repaired within 5 minutes
    regardless of cause.

This watchdog also makes the bug diagnosable: the helper logs to
syslog with the tag ensure-default-route, so any time the kernel
default route disappears between nmcli invocations, an entry
appears in the journal.

The watchdog is intended to be retired once a fixed NetworkManager
package is available and verified.
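
For readers who want something similar, the helper's behaviour can be sketched roughly as follows. This is not the actual script from the playbook, only an illustration of the idempotent "add only when missing" logic described above. The function name ensure_default_route, the IP_CMD override (there so the logic can be exercised with a fake ip), and the example gateway and interface are all assumptions.

```shell
#!/bin/sh
# Sketch of an idempotent default-route watchdog (not the original script).
# IP_CMD can be pointed at a stand-in for "ip" when testing the logic.
IP_CMD="${IP_CMD:-ip}"

ensure_default_route() {
    gateway="$1"
    iface="$2"

    # If any IPv4 default route exists, leave it alone: never override.
    if [ -n "$("$IP_CMD" -4 route show default)" ]; then
        return 0
    fi

    # Record the repair under the tag named in this report, so every
    # disappearance of the default route leaves a journal entry.
    logger -t ensure-default-route \
        "no IPv4 default route; adding via $gateway dev $iface" \
        2>/dev/null || true

    "$IP_CMD" -4 route add default via "$gateway" dev "$iface" metric 100
}

# Example invocation (commented out so sourcing this file has no effect):
#   ensure_default_route 192.168.1.1 enp2s0
```

Checking for any default route, rather than for this specific one, is what keeps the helper from fighting NetworkManager or another agent that has already installed a working default.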

Additional information

  • I am willing to test scratch builds. The reproduction is
    deterministic on the affected hardware: nmcli con up
    "Wired connection 2" reproduces the failure on every invocation.
  • I have a pre-bug TRACE capture and a post-failure capture from
    the same host for direct comparison; happy to attach as
    separate Jira issue attachments on request, redacted of
    unrelated user data.
  • The Wired connection 2 profile in question is plain manual IPv4
    with two addresses on the same /24 (192.168.1.3/24 primary and
    192.168.1.4/24 secondary, both on enp2s0). The host also
    carries an unrelated fujita_starlink Wi-Fi profile with
    ipv4.never-default=yes used solely to reach a vendor NTP server
    on a directly attached Wi-Fi network. Two obsolete NM keyfile
    profiles (ens18, fujita_wifi6) that referenced absent
    interfaces / SSIDs were removed during diagnosis as a
    housekeeping step; the bug reproduces with or without them
    present.
  • The bug is independent of ignore_routes_with_linkdown. On this
    host:
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.enp2s0.ignore_routes_with_linkdown = 0

    i.e. the kernel default. This sysctl does not control whether
    RTNH_F_LINKDOWN is set on a route; it only controls whether the
    kernel ignores such routes when performing FIB lookups. The NM
    commit-path bug fires regardless.

  • I checked upstream NEWS for NetworkManager 1.54, 1.56, and 1.58
    and could not find a release note describing a fix for this
    specific code path.