feat(hardening): CPU/IO slice isolation for background services #12

Open
s8n wants to merge 1 commit from feat/cpu-io-slice-isolation into feat/memory-pressure-tuning

Companion to 7d2b94b (memory-pressure tuning). Found live on a 24-thread Ryzen AI 9 HX 370 / 30 GiB workstation 2026-05-13: load avg climbed to 6.5 ~16 min after login, typing in konsole / address bar lagged 100s of ms. RAM/swap uncontended — pure CPU contention (PSI cpu some=0.34).
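For anyone wanting to reproduce the PSI reading, here is a minimal sketch that parses the kernel's pressure file format. The sample line is illustrative only: the incident note says `some=0.34` without naming the averaging window, so mapping it to `avg10` is an assumption, and the other numbers are placeholders.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text (the format used by /proc/pressure/cpu).

    Each line looks like: "some avg10=0.34 avg60=0.21 avg300=0.10 total=123456"
    Returns {"some": {"avg10": 0.34, ...}, "full": {...}}.
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_cpu_pressure(path: str = "/proc/pressure/cpu") -> Dict[str, Dict[str, float]]:
    """Read live CPU pressure; requires a kernel built with PSI support."""
    with open(path) as fh:
        return parse_psi(fh.read())


# Illustrative sample, not captured from the incident machine.
SAMPLE = (
    "some avg10=0.34 avg60=0.21 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
sample_pressure = parse_psi(SAMPLE)
```

On a live system, `read_cpu_pressure()["some"]["avg10"]` gives the share of the last 10 seconds in which at least one task was stalled waiting for CPU.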

Root cause: every Fedora unit ships with CPUWeight=[not set] → defaults to 100. Under contention the scheduler divides CPU among sibling cgroups in proportion to weight, so with equal weights kwin_wayland and plasmashell had no priority and lost scheduling fights to packagekitd + plasma-discover --mode update + fwupd-refresh + dnf-makecache running concurrently.

Three fixes:

  1. system-bg.slice: CPUWeight=20, IOWeight=50, MemoryHigh=4G. Five service drop-ins assign packagekit, fwupd, fwupd-refresh, dnf-makecache, dnf5-automatic to it with Nice=10, IOSchedulingClass=idle.
  2. user-.slice.d/10-boost.conf: CPUWeight=300, IOWeight=200 for every logged-in session. Net 15:1 interactive:background weight ratio (300:20) under contention.
  3. Boot-storm sources defused: skel autostart shadow hides the Discover notifier (no GUI fires at session start); dnf-makecache.timer OnBootSec=20min pushes refresh past peak bring-up.
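The weight arithmetic behind the first two fixes can be sketched as below. This is a simplification that treats the session and the background slice as two direct competitors; cgroup v2 actually distributes weight among siblings at each level of the hierarchy, so real shares depend on what else is runnable. The names `session` and `background` are labels for illustration only.

```python
from typing import Dict


def cpu_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Proportional share of CPU for fully contended sibling cgroups.

    Each cgroup's share is its weight divided by the sum of all weights.
    Weights only matter under contention; an idle system gives any
    runnable cgroup the whole CPU.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Before: both sides at the default weight of 100.
before = cpu_shares({"session": 100, "background": 100})

# After: CPUWeight=300 on the session, CPUWeight=20 on system-bg.slice.
after = cpu_shares({"session": 300, "background": 20})

ratio = after["session"] / after["background"]  # 300 / 20 = 15.0
```

Under this two-party model the session goes from 50% to 93.75% of contended CPU time, and the background slice from 50% to 6.25%.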

Opt-in artifact for users adding cloud-sync tools: skel user-bg.slice (CPUWeight=30). Add a drop-in containing Slice=user-bg.slice to a Syncthing / rclone / file-indexer user service to place it in that slice.
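A minimal sketch of generating such a drop-in, assuming the service runs as a systemd user unit. The file name `10-slice.conf` and the helper names are hypothetical and not part of this PR; the directory layout follows the standard per-user unit drop-in location.

```python
from pathlib import Path


def slice_dropin(slice_name: str = "user-bg.slice") -> str:
    """Drop-in body that moves a service into the given slice."""
    return f"[Service]\nSlice={slice_name}\n"


def dropin_path(unit: str, config_home: Path, filename: str = "10-slice.conf") -> Path:
    """Per-user drop-in location: <config_home>/systemd/user/<unit>.d/<filename>.

    config_home is normally ~/.config; filename is a hypothetical choice.
    """
    return config_home / "systemd" / "user" / f"{unit}.d" / filename


def install_dropin(unit: str, config_home: Path) -> Path:
    """Write the drop-in and return its path. Does not reload systemd."""
    path = dropin_path(unit, config_home)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(slice_dropin())
    return path
```

After writing the file, `systemctl --user daemon-reload` followed by a restart of the unit is needed before the new slice assignment takes effect.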

Verified live before opening this PR:

  • Load dropped 6.53 → 3.55 within minutes
  • systemd-cgls: packagekit.service lives under /system.slice/system-bg.slice/, syncthing.service under /user.slice/.../user-bg.slice/
  • After full reboot: zero contention regression, slices persistent
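The placement check could also be scripted rather than read off `systemd-cgls`. A rough sketch, assuming a cgroup v2 (unified hierarchy) system where `/proc/<pid>/cgroup` contains a single `0::<path>` line; the function names are illustrative.

```python
from typing import Optional


def unified_cgroup_path(proc_cgroup_text: str) -> Optional[str]:
    """Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup.

    The unified hierarchy entry has hierarchy ID 0 and an empty
    controller list, e.g. "0::/system.slice/foo.service".
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    return None


def in_slice(cgroup_path: str, slice_name: str) -> bool:
    """True if slice_name appears as a component of the cgroup path."""
    return slice_name in cgroup_path.strip("/").split("/")


def pid_in_slice(pid: int, slice_name: str) -> bool:
    """Check a live process; returns False if no unified entry is found."""
    with open(f"/proc/{pid}/cgroup") as fh:
        path = unified_cgroup_path(fh.read())
    return path is not None and in_slice(path, slice_name)


# Path taken from the systemd-cgls observation above.
example = unified_cgroup_path("0::/system.slice/system-bg.slice/packagekit.service\n")
```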

Follow-ups documented in CHANGELOG (not in this PR):

  • tuned-adm profile onyx-performance silently falls back to balanced (no errors logged). Needs CI smoke-test that asserts tuned-adm active matches request.
  • EPP regresses to balance_performance despite system on AC + charging. Manual EPP=performance + throughput-performance profile restored snappy input. Long-term: charging-aware tuned hook.
  • KWin OpenGL GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT flood on hybrid NVIDIA RTX 4070 + AMD Radeon 890M. Cleared on session restart. Possibly KWin 6.6.4 + nvidia 580.159.03 specific.
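For the first follow-up, a CI smoke test might look roughly like the sketch below. It assumes `tuned-adm active` prints a line beginning with `Current active profile:`; that prefix should be confirmed against the tuned version in the image before relying on it. The function names are hypothetical.

```python
import subprocess
from typing import Optional

# Assumed output prefix of `tuned-adm active`; verify on the target image.
PREFIX = "Current active profile:"


def parse_active_profile(output: str) -> Optional[str]:
    """Return the profile name from `tuned-adm active` output, or None."""
    for line in output.splitlines():
        if line.startswith(PREFIX):
            return line[len(PREFIX):].strip()
    return None


def check_profile(requested: str) -> None:
    """Exit non-zero if the active tuned profile differs from the request."""
    result = subprocess.run(
        ["tuned-adm", "active"], capture_output=True, text=True, check=True
    )
    active = parse_active_profile(result.stdout)
    if active != requested:
        raise SystemExit(f"tuned profile mismatch: wanted {requested!r}, got {active!r}")
```

Calling `check_profile("onyx-performance")` right after the profile is applied would turn the silent fallback to `balanced` into a failing check.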

Base branch is feat/memory-pressure-tuning (the immediate parent). Net diff for this PR is exactly the slice additions.

s8n added 1 commit 2026-05-13 11:12:45 +01:00
feat(hardening): CPU/IO slice isolation for background services
Some checks failed
Lint / Kickstart syntax (pull_request) Has been cancelled
Lint / Shell scripts (pull_request) Has been cancelled
Lint / No personal/onyx leaks (pull_request) Has been cancelled
c6f65f0831
Companion to the memory-pressure tuning (7d2b94b). Memory was only
half the "expensive laptop typing like a Chromebook" story — once
zram-only OOM thrash was solved, a second symptom class emerged:
post-boot CPU/IO contention on machines with high core counts.

Live incident on a 24-thread Ryzen AI 9 HX 370 / 30 GiB workstation,
2026-05-13: ~16 min after login, load avg 6.5, typing in konsole and
the address bar lagged 100s of ms. RAM/swap uncontended (8 GiB/30 GiB
used, zero swap), so the memory tuning was holding. PSI showed
cpu some=0.34 — pure scheduler contention.

Root cause: every Fedora unit ships with CPUWeight=[not set] which
maps to weight=100. Under contention the kernel divides CPU among
sibling cgroups in proportion to weight, so equal weights mean equal
shares. With the post-boot storm running concurrently (plasma-discover
~80%, packagekitd ~33%, fwupd ~20%, dnf-makecache firing) the
compositor and shell (kwin_wayland, plasmashell) were losing
scheduling fights against package metadata refreshes.

Three fixes shipped together:

1. system-bg.slice — CPUWeight=20, IOWeight=50, MemoryHigh=4G. Five
   service drop-ins assign packagekit, fwupd, fwupd-refresh,
   dnf-makecache, dnf5-automatic into it with Nice=10 and
   IOSchedulingClass=idle. Proportional, not a hard cap — idle
   systems still get full speed.

2. user-.slice.d/10-boost.conf — CPUWeight=300, IOWeight=200 on every
   logged-in user session. Combined with above gives a 15:1
   interactive:background ratio under contention.

3. Boot-storm sources defused: skel autostart shadow disables the
   discover update notifier auto-launch; dnf-makecache.timer
   OnBootSec=20min pushes metadata refresh past peak session
   bring-up.

One opt-in artifact: skel user-bg.slice (CPUWeight=30) for anyone
installing Syncthing, rclone, or a file indexer — drop a
Slice=user-bg.slice drop-in on the service to inherit the same
protection at the user level.

Verified live before opening this PR: load dropped 6.53 -> 3.55
within minutes of applying; cgroup placement confirmed via
systemd-cgls.

Follow-up filed in CHANGELOG (not in this PR): tuned-adm
"onyx-performance" profile silently falls back to balanced, and
EPP regresses to balance_performance on AC. Needs separate branch.
Reference: veilor-org/veilor-os#12