feat(hardening): CPU/IO slice isolation for background services #12

Open
s8n wants to merge 1 commit from feat/cpu-io-slice-isolation into feat/memory-pressure-tuning

1 commit

Author SHA1 Message Date
veilor-org
c6f65f0831 feat(hardening): CPU/IO slice isolation for background services
Some checks failed
Lint / Kickstart syntax (pull_request) Has been cancelled
Lint / Shell scripts (pull_request) Has been cancelled
Lint / No personal/onyx leaks (pull_request) Has been cancelled
Companion to the memory-pressure tuning (7d2b94b). Memory was only
half the "expensive laptop typing like a Chromebook" story — once
zram-only OOM thrash was solved, a second symptom class emerged:
post-boot CPU/IO contention on machines with high core counts.

Live incident on a 24-thread Ryzen AI 9 HX 370 / 30 GiB workstation,
2026-05-13: ~16 min after login, load avg 6.5, typing in konsole and
the address bar lagged 100s of ms. RAM/swap uncontended (8 GiB/30 GiB
used, zero swap), so the memory tuning was holding. PSI showed
cpu some=0.34 — pure scheduler contention.

Root cause: every Fedora unit ships with CPUWeight=[not set] which
maps to weight=100. Under contention the kernel splits CPU evenly
between every leaf cgroup. With the post-boot storm running
concurrently (plasma-discover ~80%, packagekitd ~33%, fwupd ~20%,
dnf-makecache firing) the compositor (kwin_wayland, plasmashell) was
losing scheduling fights against package metadata.

Three fixes shipped together:

1. system-bg.slice — CPUWeight=20, IOWeight=50, MemoryHigh=4G. Five
   service drop-ins assign packagekit, fwupd, fwupd-refresh,
   dnf-makecache, dnf5-automatic into it with Nice=10 and
   IOSchedulingClass=idle. Proportional, not a hard cap — idle
   systems still get full speed.

2. user-.slice.d/10-boost.conf — CPUWeight=300, IOWeight=200 on every
   logged-in user session. Combined with above gives a 15:1
   interactive:background ratio under contention.

3. Boot-storm sources defused: skel autostart shadow disables the
   discover update notifier auto-launch; dnf-makecache.timer
   OnBootSec=20min pushes metadata refresh past peak session
   bring-up.

One opt-in artifact: skel user-bg.slice (CPUWeight=30) for anyone
installing Syncthing, rclone, or a file indexer — drop a
Slice=user-bg.slice drop-in on the service to inherit the same
protection at the user level.

Verified live before opening this PR: load dropped 6.53 -> 3.55
within minutes of applying; cgroup placement confirmed via
systemd-cgls.

Follow-up filed in CHANGELOG (not in this PR): tuned-adm
"onyx-performance" profile silently falls back to balanced, and
EPP regresses to balance_performance on AC. Needs separate branch.
2026-05-13 10:15:35 +01:00