feat(hardening): add memory-pressure tuning for zram-only stack

veilor-os runs zram-only swap (THREAT-MODEL.md — no key leak from
disk swap). With kernel defaults that policy bites: once zram fills
there is no overflow tier, the kernel waits until total exhaustion
to trigger OOM, then picks a victim by oom_score and frequently
kills plasmashell or the foreground terminal instead of the leaking
browser tab. Mouse locks for minutes during the thrash window.

Three co-dependent layers:

1. systemd-oomd enabled: the PSI-based pre-OOM killer acts at cgroup
   boundaries before the kernel OOM killer does. Fedora's
   systemd-oomd-defaults package ships sane thresholds for user.slice;
   it is installed in the kickstart %post and layered in the bluebuild
   recipe, and the service is enabled in both unit-toggle blocks.

2. zram bumped 8 GiB lzo-rle (Fedora default) -> 16 GiB zstd. zram-size
   is the uncompressed capacity of the device, so at zstd's ~3:1 a
   full 16 GiB device costs roughly 5-6 GiB of RAM, at negligible CPU
   cost on any post-2018 x86_64. 8 GiB filled in practice on 32+ GiB
   laptops running Chromium + LSP + chat clients.

3. /etc/sysctl.d/95-memory-pressure.conf:
   - vm.swappiness=180 (zram is RAM-fast, swap early; default 60
     assumes HDD)
   - vm.watermark_scale_factor=125 (kswapd reclaim starts at ~1.25%
     headroom vs default 0.1%; ~400 MiB head start on 32 GiB)
   - vm.page-cluster=0 (no read-ahead; pointless on RAM-backed swap,
     wastes decompression cycles)

Without any one of the three the system still wedges briefly: oomd
without zram tuning waits for PSI to climb; zram tuning without oomd
gets victim selection wrong.

Verified by new test/boot-checklist.md "Memory pressure" section.
Inline rationale headers in both overlay files so the why survives
doc drift. Trigger event: onyx (Fedora 43, not veilor-os) thrashed
2026-05-11; same defaults shipped to veilor-os, fixed here too.
veilor-org 2026-05-12 10:17:00 +01:00
parent 505b5f0006
commit 7d2b94b5be
6 changed files with 136 additions and 2 deletions

@ -126,6 +126,7 @@ modules:
tailscale \
yggdrasil \
zram-generator \
systemd-oomd-defaults \
jq \
vim-enhanced \
tmux \
@ -152,6 +153,7 @@ modules:
systemctl enable veilor-modules-lock.service 2>/dev/null || true ; \
systemctl enable veilor-postinstall.service 2>/dev/null || true ; \
systemctl enable veilor-doctor.timer 2>/dev/null || true ; \
systemctl enable systemd-oomd.service 2>/dev/null || true ; \
} ; \
rpm-ostree cleanup -m ; \
ostree container commit

@ -188,6 +188,35 @@ Splunk via HEC bridge.
## What's *not* enabled by default
- **Disk swap** — replaced by zram (RAM-only, no key leak risk).
## Memory pressure
veilor-os runs **zram-only swap** (see THREAT-MODEL.md — keeps cleartext
session keys out of any persistent allocation that would survive
suspend-to-disk or a yanked drive). That stance has a sharp edge: once
zram fills, there is no overflow tier. With stock kernel defaults the
result is a multi-minute thrash — input compositor frozen, mouse stuck,
keyboard ignored — followed by a kernel OOM kill that picks the wrong
victim (often `plasmashell` or the foreground terminal) because the
runaway browser tab has a lower oom_score than the long-lived session
process. The user's desktop dies; the leaking app survives.
Three layers of mitigation ship by default:
| Layer | File | What it does | Failure mode if absent |
|-------|------|--------------|------------------------|
| **systemd-oomd** | enabled in `kickstart/veilor-os.ks` `%post` and in `bluebuild/recipe.yml` unit-toggle RUN | PSI-based pre-OOM killer: picks a cgroup under sustained memory pressure and terminates it *before* the kernel's global OOM killer fires. Reads PSI data (per-cgroup `memory.pressure`, the cgroup counterpart of `/proc/pressure/memory`) and kills at the cgroup boundary so siblings survive. | Kernel waits until total exhaustion. Picks by oom_score → plasmashell / terminal die, browser tab keeps leaking. Mouse locks during the wait. |
| **zram-generator** override | `overlay/etc/systemd/zram-generator.conf` (and matching `%post` write) | 16 GiB zram device compressed with `zstd`. `zram-size` is the uncompressed capacity, so at ~3:1 a full device costs roughly 5-6 GiB of RAM. Replaces Fedora default 8 GiB / lzo-rle. | 8 GiB fills under sustained pressure on 32+ GiB laptops running Chromium + LSP + chat. No overflow (no disk swap) → straight to OOM. |
| **vm.* sysctl** | `overlay/etc/sysctl.d/95-memory-pressure.conf` | `swappiness=180` (use zram early, since it is RAM-fast), `watermark_scale_factor=125` (kswapd starts reclaim at ~1.25 % headroom vs default 0.1 %), `page-cluster=0` (no read-ahead, which is pointless on RAM-backed swap and wastes decompression cycles). | Defaults `60 / 10 / 3` assume slow HDD swap. Kernel refuses to swap until allocations stall in direct-reclaim → thrash window before either oomd or kernel OOM acts. |
All three are co-dependent: oomd without zram tuning still wedges
briefly waiting for PSI to climb; zram tuning without oomd still gets
kernel-OOM victim selection wrong. Verified by `test/boot-checklist.md`
"Memory pressure" section.
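
The headroom percentages quoted for `watermark_scale_factor` follow from its unit, which is 1/10000 of available memory. A minimal sketch of the arithmetic, assuming a 32 GiB host and treating all RAM as one pool (the kernel applies the factor per node, so real figures differ slightly); `headroom_mib` is a hypothetical helper, not something shipped in veilor-os:

```shell
#!/bin/sh
# headroom_mib RAM_GIB FACTOR
# Approximate reclaim headroom in MiB implied by vm.watermark_scale_factor.
# Hypothetical helper for illustration; integer division rounds down.
headroom_mib() {
    ram_gib=$1
    factor=$2
    # GiB -> MiB, then scale by factor/10000.
    echo $(( ram_gib * 1024 * factor / 10000 ))
}

headroom_mib 32 125   # veilor-os setting, prints 409
headroom_mib 32 10    # kernel default, prints 32
```

The first result is the "~400 MiB head start" cited for a 32 GiB machine.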
Layer rationale logged in `overlay/etc/sysctl.d/95-memory-pressure.conf`
and `overlay/etc/systemd/zram-generator.conf` headers — kept inline so
the *why* survives even if this doc is deleted.
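
To watch the signal that the PSI-based layer reacts to, the kernel's pressure file can be read directly. A rough sketch, assuming a kernel with PSI enabled; `psi_full_avg10` is a hypothetical helper name, and the thresholds oomd actually applies come from its own configuration, not from this script:

```shell
#!/bin/sh
# psi_full_avg10 [FILE]
# Print the 10-second "full" stall average (a percentage) from a PSI file.
# FILE defaults to the system-wide memory pressure file.
# Hypothetical helper for illustration only.
psi_full_avg10() {
    psi_file=${1:-/proc/pressure/memory}
    # Lines look like: "full avg10=0.00 avg60=0.00 avg300=0.00 total=0"
    awk '$1 == "full" { split($2, kv, "="); print kv[2] }' "$psi_file"
}

# Usage on a live system:
#   psi_full_avg10
```

A value that stays high for many seconds means tasks are stalled on memory, which is the condition a pre-OOM killer is meant to catch.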
- **Bluetooth** — disabled. Enable with `systemctl enable --now bluetooth`.
- **Printing** — CUPS removed. Reinstall if needed: `dnf install cups`.
- **Snapd, Flatpak** — not installed (Flatpak optional add-on).

@ -271,14 +271,41 @@ sed -i \
plymouth-set-default-theme details 2>/dev/null || true
[ -f /boot/grub2/grub.cfg ] && grub2-mkconfig -o /boot/grub2/grub.cfg 2>/dev/null || true
-# zram swap (no disk swap; keys never leak to platter)
+# zram swap (no disk swap; keys never leak to platter).
#
# Sizing: 16 GiB device, zstd. zram-size is the uncompressed capacity, so
# at ~3:1 a full device costs roughly 5-6 GiB of RAM. The default 8G
# filled under sustained pressure on 32+ GiB laptops running browsers +
# LSP + chat → kernel OOM (no disk-swap fallback per threat model). See
# overlay/etc/systemd/zram-generator.conf and docs/HARDENING.md "Memory
# pressure" for full rationale.
dnf install -y zram-generator || true
cat > /etc/systemd/zram-generator.conf << 'EOF'
[zram0]
-zram-size = min(ram, 8192)
+zram-size = min(ram, 16384)
compression-algorithm = zstd
EOF
# Memory-pressure sysctl tuning for zram-only stack. Default vm.swappiness
# assumes a slow disk; on zram the kernel must be told to swap early
# (180) and reclaim early (watermark_scale_factor=125) so it never gets
# cornered into kernel-OOM. page-cluster=0 disables read-ahead which is
# pointless on RAM-backed swap. See overlay/etc/sysctl.d/95-memory-pressure.conf
# and docs/HARDENING.md "Memory pressure" for the rationale + failure mode.
cat > /etc/sysctl.d/95-memory-pressure.conf << 'EOF'
vm.swappiness = 180
vm.watermark_scale_factor = 125
vm.page-cluster = 0
EOF
# systemd-oomd: userspace OOM killer that uses PSI (pressure stall info)
# to pick a victim cgroup BEFORE the kernel's global OOM reaper fires.
# Without oomd the kernel waits until total exhaustion then picks by
# oom_score, often killing plasmashell or the active terminal instead of
# the runaway browser tab. Fedora ships systemd-oomd-defaults with sane
# thresholds for user.slice cgroups.
dnf install -y systemd-oomd-defaults || true
systemctl enable systemd-oomd.service || true
# Patch anaconda's transaction_progress.py inside the live rootfs so that
# when the user clicks "Install", a non-fatal RPM 6.0 *scriptlet* warning
# does not get escalated to "An error occurred during the transaction"

@ -0,0 +1,45 @@
# veilor-os — memory-pressure tuning for zram-only swap
#
# Rationale: veilor-os ships zram swap with NO disk swap (see THREAT-MODEL.md
# §"Lost or stolen laptop"). The kernel's default vm.* knobs assume a slow
# spinning disk and refuse to swap until physical RAM is nearly exhausted.
# Under a zram-only stack that policy is wrong on two axes:
#
# 1. zram is RAM-fast — there is no penalty for swapping early, only a
# small CPU cost for zstd compress/decompress.
# 2. Once zram fills, there is no overflow (no disk swap by design), so
# the kernel falls through to OOM. With default knobs the OOM trigger
# is slow and reactive: by the time it fires, the system has spent
# minutes in thrash (compositor/input frozen, mouse stuck) and the
# kernel picks a victim by oom_score which is often plasmashell or
# the terminal — i.e. the user's session goes down, not the runaway.
#
# What these knobs do:
#
# vm.swappiness = 180
# Tell the kernel to prefer evicting anonymous pages to (zram) swap
# over reclaiming file-backed pages. Fedora's zram-generator upstream
# recommends 180 for zram-only systems. Default 60 is tuned for HDD
# swap and leaves zram unused until too late.
#
# vm.watermark_scale_factor = 125
# Start kswapd reclaim earlier (~1.25% of RAM headroom vs default
# 0.1%). On a 32 GiB box that's ~400 MiB head start before allocations
# would otherwise stall in direct-reclaim. Trades a tiny amount of
# usable RAM for much smoother latency under bursty allocators
# (Chromium/Electron tab spawns, language server warm-up).
#
# vm.page-cluster = 0
# Read one page per swap-in instead of the default 8. Read-ahead is a
# win on rotational media because seeks dominate; on zram the seek
# cost is zero and grabbing 7 extra pages just wastes decompress
# cycles and CPU cache. Setting to 0 is the documented zram tuning.
#
# Companion: systemd-oomd is enabled in the same change so PSI-based
# pre-OOM kills land on the right cgroup before the kernel OOM reaper
# fires. Without it, even with these knobs the system can still wedge
# briefly while the kernel waits for the global watermark.
vm.swappiness = 180
vm.watermark_scale_factor = 125
vm.page-cluster = 0

@ -0,0 +1,19 @@
# veilor-os — zram swap override
#
# Replaces the Fedora default config (which would otherwise set
# zram-size = min(ram, 8192) with whatever compression algorithm
# zram-generator picked, historically lzo-rle).
#
# Sizing rationale: 16 GiB device. zram-size is the uncompressed capacity,
# so at a typical 3:1 with zstd a full device costs roughly 5-6 GiB of
# RAM. Default 8 GiB filled under sustained pressure on modern
# 32+ GiB laptops running browsers + LSP + chat clients, leaving the
# kernel with no swap headroom and triggering OOM (since veilor-os has
# no disk swap fallback; see THREAT-MODEL.md "no key leak risk").
#
# Algorithm: zstd. lzo-rle is faster but ratio ~2:1; zstd is ~3:1 with
# negligible CPU cost on any post-2018 x86_64. Storing the same swapped
# data in about a third less RAM is worth more than the microseconds of
# compress time.
[zram0]
zram-size = min(ram, 16384)
compression-algorithm = zstd

@ -91,6 +91,18 @@ before the build is considered green.
- [ ] `lsblk -f` shows LUKS2 on the main partition
- [ ] `cryptsetup luksDump /dev/...` shows argon2id, aes-xts-plain64
- [ ] `swapon` shows `zram` device, no disk swap
- [ ] `zramctl` shows `ALGORITHM=zstd` and `DISKSIZE=16G` on hosts with
at least 16 GiB RAM (`min(ram, 16384)` caps smaller hosts at RAM size;
Fedora's default is 8 GiB, see `overlay/etc/systemd/zram-generator.conf`)
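
The compression ratio the sizing relies on can be measured on a running system instead of assumed. A minimal sketch, assuming the device is `zram0`; `zram_ratio` is a hypothetical helper that takes one `mm_stat` line, whose first two fields are the uncompressed and compressed byte counts:

```shell
#!/bin/sh
# zram_ratio "MM_STAT_LINE"
# Print uncompressed/compressed ratio from a zram mm_stat line.
# Field 1 = orig_data_size, field 2 = compr_data_size (bytes).
# Hypothetical helper for illustration; prints "n/a" for an empty device.
zram_ratio() {
    echo "$1" | awk '{
        if ($2 > 0) printf "%.2f\n", $1 / $2
        else        print "n/a"
    }'
}

# Usage on a live system (device name is an assumption):
#   zram_ratio "$(cat /sys/block/zram0/mm_stat)"
```

A ratio well below 3 under a real workload would mean a full device costs more RAM than the sizing rationale assumes.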
## Memory pressure
- [ ] `systemctl is-active systemd-oomd` → `active` (PSI-based pre-OOM
killer; without it the kernel waits until total RAM exhaustion
then often kills plasmashell or the active terminal instead of
the runaway tab)
- [ ] `sysctl vm.swappiness vm.watermark_scale_factor vm.page-cluster`
shows `180 / 125 / 0` (default `60 / 10 / 3` is wrong for
zram-only — kernel refuses to swap until exhausted, then thrashes)
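
The sysctl item above can be scripted so the checklist is not eyeballed. A sketch under stated assumptions: `check_mem_tuning` is a hypothetical helper, not part of the repository, and it takes an optional directory argument only so that it can be pointed at a fixture instead of the live `/proc/sys/vm`:

```shell
#!/bin/sh
# check_mem_tuning [VM_DIR]
# Compare the three vm.* knobs against the values this change ships.
# VM_DIR defaults to /proc/sys/vm. Returns non-zero on any mismatch.
# Hypothetical helper for illustration only.
check_mem_tuning() {
    vm_dir=${1:-/proc/sys/vm}
    status=0
    for pair in swappiness=180 watermark_scale_factor=125 page-cluster=0; do
        key=${pair%%=*}    # text before the first "="
        want=${pair#*=}    # text after the first "="
        got=$(cat "$vm_dir/$key" 2>/dev/null)
        if [ "$got" = "$want" ]; then
            echo "ok   vm.$key = $got"
        else
            echo "FAIL vm.$key = ${got:-unreadable} (want $want)"
            status=1
        fi
    done
    return "$status"
}

# Usage on a booted image:
#   check_mem_tuning && echo "memory-pressure sysctls match"
```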
## SELinux module