From 7d2b94b5bec16b952075d3959e74edd1eb9f67b5 Mon Sep 17 00:00:00 2001
From: veilor-org
Date: Tue, 12 May 2026 10:17:00 +0100
Subject: [PATCH] feat(hardening): add memory-pressure tuning for zram-only stack
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

veilor-os runs zram-only swap (THREAT-MODEL.md: no key leak from disk
swap). With kernel defaults that policy bites: once zram fills there is
no overflow tier, the kernel waits until total exhaustion to trigger
OOM, then picks a victim by oom_score and frequently kills plasmashell
or the foreground terminal instead of the leaking browser tab. Mouse
locks for minutes during the thrash window.

Three co-dependent layers:

1. systemd-oomd enabled: PSI-based pre-OOM killer fires at cgroup
   boundaries before the kernel reaper. Fedora's systemd-oomd-defaults
   ships sane thresholds for user.slice; installed in kickstart and
   layered in bluebuild containerfile, enabled in both unit-toggle
   blocks.

2. zram bumped 8 GiB lzo-rle (Fedora default) -> 16 GiB zstd. zram-size
   is the uncompressed capacity; at zstd's typical ~3:1 a full 16 GiB
   device occupies ~5.3 GiB of RAM, at negligible CPU cost on any
   post-2018 x86_64. 8 GiB filled in practice on 32+ GiB laptops
   running Chromium + LSP + chat clients.

3. /etc/sysctl.d/95-memory-pressure.conf:
   - vm.swappiness=180 (zram is RAM-fast, swap early; default 60
     assumes HDD)
   - vm.watermark_scale_factor=125 (kswapd reclaim starts at ~1.25%
     headroom vs default 0.1%; ~400 MiB head start on 32 GiB)
   - vm.page-cluster=0 (no read-ahead; pointless on RAM-backed swap,
     wastes decompress)

Without any one of the three the system still wedges briefly: oomd
without zram tuning waits for PSI to climb; zram tuning without oomd
gets victim selection wrong.

Verified by new test/boot-checklist.md "Memory pressure" section.
Inline rationale headers in both overlay files so the why survives doc
drift.

Trigger event: onyx (Fedora 43, not veilor-os) thrashed 2026-05-11;
same defaults shipped to veilor-os, fixed here too.
---
 bluebuild/recipe.yml                         |  2 +
 docs/HARDENING.md                            | 29 +++++++++++++
 kickstart/veilor-os.ks                       | 31 +++++++++++++-
 overlay/etc/sysctl.d/95-memory-pressure.conf | 45 ++++++++++++++++++++
 overlay/etc/systemd/zram-generator.conf      | 19 +++++++++
 test/boot-checklist.md                       | 12 ++++++
 6 files changed, 136 insertions(+), 2 deletions(-)
 create mode 100644 overlay/etc/sysctl.d/95-memory-pressure.conf
 create mode 100644 overlay/etc/systemd/zram-generator.conf

diff --git a/bluebuild/recipe.yml b/bluebuild/recipe.yml
index 7fd1287..aff9cc0 100644
--- a/bluebuild/recipe.yml
+++ b/bluebuild/recipe.yml
@@ -126,6 +126,7 @@ modules:
           tailscale \
           yggdrasil \
           zram-generator \
+          systemd-oomd-defaults \
           jq \
           vim-enhanced \
           tmux \
@@ -152,6 +153,7 @@ modules:
           systemctl enable veilor-modules-lock.service 2>/dev/null || true ; \
           systemctl enable veilor-postinstall.service 2>/dev/null || true ; \
           systemctl enable veilor-doctor.timer 2>/dev/null || true ; \
+          systemctl enable systemd-oomd.service 2>/dev/null || true ; \
           } ; \
           rpm-ostree cleanup -m ; \
           ostree container commit
diff --git a/docs/HARDENING.md b/docs/HARDENING.md
index 6427cac..406ea42 100644
--- a/docs/HARDENING.md
+++ b/docs/HARDENING.md
@@ -188,6 +188,35 @@ Splunk via HEC bridge.
 ## What's *not* enabled by default
 
 - **Disk swap** — replaced by zram (RAM-only, no key leak risk).
+
+## Memory pressure
+
+veilor-os runs **zram-only swap** (see THREAT-MODEL.md — keeps cleartext
+session keys out of any persistent allocation that would survive
+suspend-to-disk or a yanked drive). That stance has a sharp edge: once
+zram fills, there is no overflow tier. With stock kernel defaults the
+result is a multi-minute thrash — input compositor frozen, mouse stuck,
+keyboard ignored — followed by a kernel OOM kill that picks the wrong
+victim (often `plasmashell` or the foreground terminal) because the
+runaway browser tab has a lower oom_score than the long-lived session
+process. The user's desktop dies; the leaking app survives.
+
+Three layers of mitigation ship by default:
+
+| Layer | File | What it does | Failure mode if absent |
+|-------|------|--------------|------------------------|
+| **systemd-oomd** | enabled in `kickstart/veilor-os.ks` `%post` and in `bluebuild/recipe.yml` unit-toggle RUN | PSI-based pre-OOM killer — picks a cgroup under sustained memory pressure and terminates it *before* the kernel's global reaper fires. Reads PSI (pressure stall information), kills at the cgroup boundary so siblings survive. | Kernel waits until total exhaustion. Picks by oom_score → plasmashell / terminal die, browser tab keeps leaking. Mouse locks during the wait. |
+| **zram-generator** override | `overlay/etc/systemd/zram-generator.conf` (and matching `%post` write) | 16 GiB of swap capacity compressed with `zstd` (typical ~3:1, so a full device occupies ~5.3 GiB of RAM). Replaces Fedora default 8 GiB / lzo-rle. | 8 GiB fills under sustained pressure on 32+ GiB laptops running Chromium + LSP + chat. No overflow (no disk swap) → straight to OOM. |
+| **vm.* sysctl** | `overlay/etc/sysctl.d/95-memory-pressure.conf` | `swappiness=180` (use zram early — it's RAM-fast), `watermark_scale_factor=125` (kswapd starts reclaim at ~1.25 % headroom vs default 0.1 %), `page-cluster=0` (no read-ahead — pointless on RAM-backed swap, wastes decompress cycles). | Defaults `60 / 10 / 3` assume slow HDD swap. Kernel refuses to swap until allocations stall in direct-reclaim → thrash window before either oomd or kernel OOM acts. |
+
+All three are co-dependent: oomd without zram tuning still wedges
+briefly waiting for PSI to climb; zram tuning without oomd still gets
+kernel-OOM victim selection wrong. Verified by `test/boot-checklist.md`
+"Memory pressure" section.
+
+Layer rationale logged in `overlay/etc/sysctl.d/95-memory-pressure.conf`
+and `overlay/etc/systemd/zram-generator.conf` headers — kept inline so
+the *why* survives even if this doc is deleted.
 - **Bluetooth** — disabled. Enable with `systemctl enable --now bluetooth`.
 - **Printing** — CUPS removed. Reinstall if needed: `dnf install cups`.
 - **Snapd, Flatpak** — not installed (Flatpak optional add-on).
diff --git a/kickstart/veilor-os.ks b/kickstart/veilor-os.ks
index 53a7bc1..db33741 100644
--- a/kickstart/veilor-os.ks
+++ b/kickstart/veilor-os.ks
@@ -271,14 +271,41 @@ sed -i \
 plymouth-set-default-theme details 2>/dev/null || true
 [ -f /boot/grub2/grub.cfg ] && grub2-mkconfig -o /boot/grub2/grub.cfg 2>/dev/null || true
 
-# zram swap (no disk swap; keys never leak to platter)
+# zram swap (no disk swap; keys never leak to platter).
+#
+# Sizing: 16 GiB of swap (zstd ~3:1 → ~5.3 GiB of RAM when full). Default 8G
+# filled under sustained pressure on 32+ GiB laptops running browsers +
+# LSP + chat → kernel OOM (no disk-swap fallback per threat model). See
+# overlay/etc/systemd/zram-generator.conf and docs/HARDENING.md "Memory
+# pressure" for full rationale.
 dnf install -y zram-generator || true
 cat > /etc/systemd/zram-generator.conf << 'EOF'
 [zram0]
-zram-size = min(ram, 8192)
+zram-size = min(ram, 16384)
 compression-algorithm = zstd
 EOF
 
+# Memory-pressure sysctl tuning for zram-only stack. Default vm.swappiness
+# assumes a slow disk; on zram the kernel must be told to swap early
+# (180) and reclaim early (watermark_scale_factor=125) so it never gets
+# cornered into kernel-OOM. page-cluster=0 disables read-ahead which is
+# pointless on RAM-backed swap. See overlay/etc/sysctl.d/95-memory-pressure.conf
+# and docs/HARDENING.md "Memory pressure" for the rationale + failure mode.
+cat > /etc/sysctl.d/95-memory-pressure.conf << 'EOF'
+vm.swappiness = 180
+vm.watermark_scale_factor = 125
+vm.page-cluster = 0
+EOF
+
+# systemd-oomd: userspace OOM killer that uses PSI (pressure stall info)
+# to pick a victim cgroup BEFORE the kernel's global OOM reaper fires.
+# Without oomd the kernel waits until total exhaustion then picks by
+# oom_score, often killing plasmashell or the active terminal instead of
+# the runaway browser tab. Fedora ships systemd-oomd-defaults with sane
+# thresholds for user.slice cgroups.
+dnf install -y systemd-oomd-defaults || true
+systemctl enable systemd-oomd.service || true
+
 # Patch anaconda's transaction_progress.py inside the live rootfs so that
 # when the user clicks "Install", a non-fatal RPM 6.0 *scriptlet* warning
 # does not get escalated to "An error occurred during the transaction"
diff --git a/overlay/etc/sysctl.d/95-memory-pressure.conf b/overlay/etc/sysctl.d/95-memory-pressure.conf
new file mode 100644
index 0000000..e2f3205
--- /dev/null
+++ b/overlay/etc/sysctl.d/95-memory-pressure.conf
@@ -0,0 +1,45 @@
+# veilor-os — memory-pressure tuning for zram-only swap
+#
+# Rationale: veilor-os ships zram swap with NO disk swap (see THREAT-MODEL.md
+# §"Lost or stolen laptop"). The kernel's default vm.* knobs assume a slow
+# spinning disk and refuse to swap until physical RAM is nearly exhausted.
+# Under a zram-only stack that policy is wrong on two axes:
+#
+#   1. zram is RAM-fast — there is no penalty for swapping early, only a
+#      small CPU cost for zstd compress/decompress.
+#   2. Once zram fills, there is no overflow (no disk swap by design), so
+#      the kernel falls through to OOM. With default knobs the OOM trigger
+#      is slow and reactive: by the time it fires, the system has spent
+#      minutes in thrash (compositor/input frozen, mouse stuck) and the
+#      kernel picks a victim by oom_score which is often plasmashell or
+#      the terminal — i.e. the user's session goes down, not the runaway.
+#
+# What these knobs do:
+#
+#   vm.swappiness = 180
+#     Tell the kernel to prefer evicting anonymous pages to (zram) swap
+#     over reclaiming file-backed pages. Fedora's zram-generator upstream
+#     recommends 180 for zram-only systems. Default 60 is tuned for HDD
+#     swap and leaves zram unused until too late.
+#
+#   vm.watermark_scale_factor = 125
+#     Start kswapd reclaim earlier (~1.25% of RAM headroom vs default
+#     0.1%). On a 32 GiB box that's ~400 MiB head start before allocations
+#     would otherwise stall in direct-reclaim. Trades a tiny amount of
+#     usable RAM for much smoother latency under bursty allocators
+#     (Chromium/Electron tab spawns, language server warm-up).
+#
+#   vm.page-cluster = 0
+#     Read one page per swap-in instead of the default 8. Read-ahead is a
+#     win on rotational media because seeks dominate; on zram the seek
+#     cost is zero and grabbing 7 extra pages just wastes decompress
+#     cycles and CPU cache. Setting to 0 is the documented zram tuning.
+#
+# Companion: systemd-oomd is enabled in the same change so PSI-based
+# pre-OOM kills land on the right cgroup before the kernel OOM reaper
+# fires. Without it, even with these knobs the system can still wedge
+# briefly while the kernel waits for the global watermark.
+
+vm.swappiness = 180
+vm.watermark_scale_factor = 125
+vm.page-cluster = 0
diff --git a/overlay/etc/systemd/zram-generator.conf b/overlay/etc/systemd/zram-generator.conf
new file mode 100644
index 0000000..f7079b9
--- /dev/null
+++ b/overlay/etc/systemd/zram-generator.conf
@@ -0,0 +1,19 @@
+# veilor-os — zram swap override
+#
+# Replaces the Fedora default config (which would otherwise set
+# zram-size = min(ram, 8192) with whatever compression algorithm
+# zram-generator picked, historically lzo-rle).
+#
+# Sizing rationale: 16 GiB of swap capacity (typical 3:1 with zstd → ~5.3 GiB
+# of RAM when full). Default 8 GiB filled under sustained pressure on modern
+# 32+ GiB laptops running browsers + LSP + chat clients, leaving the
+# kernel with no swap headroom and triggering OOM (since veilor-os has
+# no disk swap fallback — see THREAT-MODEL.md "no key leak risk").
+#
+# Algorithm: zstd. lzo-rle is faster but ratio ~2:1; zstd is ~3:1 with
+# negligible CPU cost on any post-2018 x86_64. The smaller RAM footprint of
+# a full device is worth more than the microseconds of compress time.
+
+[zram0]
+zram-size = min(ram, 16384)
+compression-algorithm = zstd
diff --git a/test/boot-checklist.md b/test/boot-checklist.md
index 2621858..38ecb35 100644
--- a/test/boot-checklist.md
+++ b/test/boot-checklist.md
@@ -91,6 +91,18 @@ before the build is considered green.
 - [ ] `lsblk -f` shows LUKS2 on the main partition
 - [ ] `cryptsetup luksDump /dev/...` shows argon2id, aes-xts-plain64
 - [ ] `swapon` shows `zram` device, no disk swap
+- [ ] `zramctl` shows `ALGORITHM=zstd` and `DISKSIZE=16G` (= 16 GiB,
+      not Fedora's 8 GiB default — see `overlay/etc/systemd/zram-generator.conf`)
+
+## Memory pressure
+
+- [ ] `systemctl is-active systemd-oomd` → `active` (PSI-based pre-OOM
+      killer; without it the kernel waits until total RAM exhaustion
+      then often kills plasmashell or the active terminal instead of
+      the runaway tab)
+- [ ] `sysctl vm.swappiness vm.watermark_scale_factor vm.page-cluster`
+      shows `180 / 125 / 0` (default `60 / 10 / 3` is wrong for
+      zram-only — kernel refuses to swap until exhausted, then thrashes)
 
 ## SELinux module
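
Reviewer note (not part of the diff): the figures quoted in the rationale comments can be re-derived with integer arithmetic. The sketch below is illustrative only. The function names are hypothetical, the 3:1 ratio is an assumption rather than a measurement, and the headroom figure treats watermark_scale_factor as a plain fraction of total RAM (units of 1/10000), ignoring the kernel's per-zone accounting.

```shell
#!/usr/bin/env bash
# Sketch: re-derive the numbers quoted in the sysctl and zram rationale.
# All helper names are hypothetical; values are whole MiB (integer division).

# Approximate kswapd headroom: watermark_scale_factor is in units of 1/10000 of RAM.
watermark_headroom_mib() {
    local ram_mib=$1 scale_factor=$2
    echo $(( ram_mib * scale_factor / 10000 ))
}

# Pages read per swap-in: 2^page-cluster.
swapin_pages() {
    local page_cluster=$1
    echo $(( 1 << page_cluster ))
}

# Approximate RAM occupied by a full zram device at an ASSUMED compression ratio.
zram_full_footprint_mib() {
    local disksize_mib=$1 ratio=$2
    echo $(( disksize_mib / ratio ))
}

echo "headroom on 32 GiB, factor 125: $(watermark_headroom_mib 32768 125) MiB"
echo "headroom on 32 GiB, factor 10:  $(watermark_headroom_mib 32768 10) MiB"
echo "pages per swap-in, page-cluster=3: $(swapin_pages 3)"
echo "pages per swap-in, page-cluster=0: $(swapin_pages 0)"
echo "RAM held by a full 16 GiB zram at 3:1: $(zram_full_footprint_mib 16384 3) MiB"
```

The script touches no system state, so it can be run on any machine with bash; on a live system the actual values still need to be checked with `sysctl` and `zramctl` as the boot checklist describes.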