feat(hardening): add memory-pressure tuning for zram-only stack

veilor-os runs zram-only swap (see THREAT-MODEL.md: no key leak from
disk swap). With kernel defaults that policy bites: once zram fills
there is no overflow tier, so the kernel waits until total exhaustion
to trigger OOM, then picks a victim by oom_score and frequently
kills plasmashell or the foreground terminal instead of the leaking
browser tab. The mouse locks up for minutes during the thrash window.

Three co-dependent layers:
1. systemd-oomd enabled: a PSI-based pre-OOM killer that fires at cgroup
   boundaries before the kernel reaper. Fedora's systemd-oomd-defaults
   ships sane thresholds for user.slice; it is installed in the kickstart,
   layered in the bluebuild containerfile, and enabled in both
   unit-toggle blocks.
2. zram bumped from 8 GiB lzo-rle (Fedora default) to 16 GiB zstd.
   zram-size is the uncompressed capacity, so at zstd's ~3:1 a full
   16 GiB device costs roughly 5.3 GiB of RAM, at negligible CPU cost on
   any post-2018 x86_64. 8 GiB filled in practice on 32+ GiB laptops
   running Chromium + LSP + chat clients.
3. /etc/sysctl.d/95-memory-pressure.conf:
   - vm.swappiness=180 (zram is RAM-fast, so swap early; the default
     of 60 assumes HDD swap)
   - vm.watermark_scale_factor=125 (kswapd reclaim starts at ~1.25%
     headroom vs the default 0.1%; ~400 MiB head start on 32 GiB)
   - vm.page-cluster=0 (no read-ahead; it is pointless on RAM-backed
     swap and wastes decompress cycles)
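
The figures quoted for these knobs can be reproduced with a few lines of shell arithmetic. This is a rough sketch and not part of the change: the function names are made up for illustration, the watermark calculation treats the whole machine as one zone (the kernel applies the factor per zone), and the zram line assumes the ~3:1 zstd ratio quoted in the message. Note that zram-size is the uncompressed capacity of the device, so the ratio determines how much RAM a full device consumes rather than how much it can hold.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; none of these names exist in veilor-os.

# vm.watermark_scale_factor is expressed in fractions of 10,000 of memory,
# so 125 is 1.25% and the default 10 is 0.1%. Result is in MiB, truncated.
watermark_headroom_mib() {
    ram_gib=$1
    scale_factor=$2
    echo $(( ram_gib * 1024 * scale_factor / 10000 ))
}

# vm.page-cluster is a power-of-two exponent: pages read per swap-in = 2^n.
swapin_pages() {
    echo $(( 1 << $1 ))
}

# RAM consumed by a completely full zram device, assuming an integer
# compression ratio (3 stands for the ~3:1 quoted for zstd). MiB, truncated.
zram_full_ram_cost_mib() {
    zram_gib=$1
    ratio=$2
    echo $(( zram_gib * 1024 / ratio ))
}

echo "headroom at 125 on 32 GiB: $(watermark_headroom_mib 32 125) MiB"
echo "headroom at 10 on 32 GiB:  $(watermark_headroom_mib 32 10) MiB"
echo "pages per swap-in at 3:    $(swapin_pages 3)"
echo "pages per swap-in at 0:    $(swapin_pages 0)"
echo "RAM for full 16 GiB zram:  $(zram_full_ram_cost_mib 16 3) MiB"
```

The integer division truncates, which is why the 32 GiB case prints 409 MiB where the message rounds to "~400 MiB".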

Without any one of the three, the system still wedges briefly: oomd
without zram tuning waits for PSI to climb; zram tuning without oomd
gets victim selection wrong.
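
To see what "waiting for PSI to climb" looks like, the pressure figures can be read directly. A minimal sketch, assuming the standard PSI text format that the kernel exposes in /proc/pressure/memory (and per cgroup in memory.pressure). psi_avg10 is a hypothetical helper, and the sample values are invented so the snippet runs without a live system.

```shell
#!/bin/sh
# psi_avg10 KIND: read PSI text on stdin and print the 10-second average
# from the line whose first field is KIND ("some" or "full").
psi_avg10() {
    awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }'
}

# Invented sample in the PSI file format (not measured data).
sample='some avg10=12.50 avg60=4.10 avg300=1.02 total=123456
full avg10=3.25 avg60=1.00 avg300=0.20 total=65432'

printf '%s\n' "$sample" | psi_avg10 some
printf '%s\n' "$sample" | psi_avg10 full
```

On a running system the same helper could be fed the real file, for example `psi_avg10 full < /proc/pressure/memory`.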

Verified by the new "Memory pressure" section in test/boot-checklist.md.
Both overlay files carry inline rationale headers so the why survives
doc drift. Trigger event: onyx (Fedora 43, not veilor-os) thrashed on
2026-05-11; the same defaults shipped in veilor-os, so they are fixed
here too.
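
The sysctl item of that checklist can also be scripted. A rough sketch, assuming the `key = value` lines that `sysctl` prints; check_sysctl_values is a hypothetical helper and is not something shipped in test/boot-checklist.md.

```shell
#!/bin/sh
# check_sysctl_values: read "key = value" lines on stdin and compare them
# with the values set by this commit. Prints OK, or one line per problem,
# and returns non-zero if anything is missing or different.
check_sysctl_values() {
    awk -F' *= *' '
        BEGIN {
            bad = 0
            want["vm.swappiness"] = 180
            want["vm.watermark_scale_factor"] = 125
            want["vm.page-cluster"] = 0
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 + 0 != want[$1]) {
                printf "MISMATCH %s: got %s, want %s\n", $1, $2, want[$1]
                bad = 1
            }
        }
        END {
            for (key in want) {
                if (!(key in seen)) {
                    printf "MISSING %s\n", key
                    bad = 1
                }
            }
            if (!bad) print "OK"
            exit bad
        }
    '
}

# Fixed input so the sketch runs anywhere; these are the tuned values.
printf 'vm.swappiness = 180\nvm.watermark_scale_factor = 125\nvm.page-cluster = 0\n' |
    check_sysctl_values
```

On an installed system the input would come from the real command: `sysctl vm.swappiness vm.watermark_scale_factor vm.page-cluster | check_sysctl_values`.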

parent 505b5f0006
commit 7d2b94b5be
6 changed files with 136 additions and 2 deletions

@@ -126,6 +126,7 @@ modules:
 tailscale \
 yggdrasil \
 zram-generator \
+systemd-oomd-defaults \
 jq \
 vim-enhanced \
 tmux \
@@ -152,6 +153,7 @@ modules:
 systemctl enable veilor-modules-lock.service 2>/dev/null || true ; \
 systemctl enable veilor-postinstall.service 2>/dev/null || true ; \
 systemctl enable veilor-doctor.timer 2>/dev/null || true ; \
+systemctl enable systemd-oomd.service 2>/dev/null || true ; \
 } ; \
 rpm-ostree cleanup -m ; \
 ostree container commit

@@ -188,6 +188,35 @@ Splunk via HEC bridge.
 ## What's *not* enabled by default
 
 - **Disk swap**: replaced by zram (RAM-only, no key leak risk).
+
+## Memory pressure
+
+veilor-os runs **zram-only swap** (see THREAT-MODEL.md; this keeps cleartext
+session keys out of any persistent allocation that would survive
+suspend-to-disk or a yanked drive). That stance has a sharp edge: once
+zram fills, there is no overflow tier. With stock kernel defaults the
+result is a multi-minute thrash (input compositor frozen, mouse stuck,
+keyboard ignored) followed by a kernel OOM kill that picks the wrong
+victim (often `plasmashell` or the foreground terminal) because the
+runaway browser tab has a lower oom_score than the long-lived session
+process. The user's desktop dies; the leaking app survives.
+
+Three layers of mitigation ship by default:
+
+| Layer | File | What it does | Failure mode if absent |
+|-------|------|--------------|------------------------|
+| **systemd-oomd** | enabled in `kickstart/veilor-os.ks` `%post` and in `bluebuild/recipe.yml` unit-toggle RUN | PSI-based pre-OOM killer: when memory pressure on a monitored cgroup, or swap use, crosses its threshold, it terminates a cgroup *before* the kernel's global reaper fires. Reads per-cgroup PSI (`memory.pressure`), kills at the cgroup boundary so siblings survive. | Kernel waits until total exhaustion. Picks by oom_score → plasmashell / terminal die, browser tab keeps leaking. Mouse locks during the wait. |
+| **zram-generator** override | `overlay/etc/systemd/zram-generator.conf` (and matching `%post` write) | Up to 16 GiB of zram swap compressed with `zstd` (~3:1, so a full device costs roughly 5.3 GiB of RAM). Replaces Fedora default 8 GiB / lzo-rle. | 8 GiB fills under sustained pressure on 32+ GiB laptops running Chromium + LSP + chat. No overflow (no disk swap) → straight to OOM. |
+| **vm.* sysctl** | `overlay/etc/sysctl.d/95-memory-pressure.conf` | `swappiness=180` (use zram early; it's RAM-fast), `watermark_scale_factor=125` (kswapd starts reclaim at ~1.25% headroom vs default 0.1%), `page-cluster=0` (no read-ahead, which is pointless on RAM-backed swap and wastes decompress cycles). | Defaults `60 / 10 / 3` assume slow HDD swap. Kernel refuses to swap until allocations stall in direct-reclaim → thrash window before either oomd or kernel OOM acts. |
+
+All three are co-dependent: oomd without zram tuning still wedges
+briefly waiting for PSI to climb; zram tuning without oomd still gets
+kernel-OOM victim selection wrong. Verified by `test/boot-checklist.md`
+"Memory pressure" section.
+
+Layer rationale logged in `overlay/etc/sysctl.d/95-memory-pressure.conf`
+and `overlay/etc/systemd/zram-generator.conf` headers, kept inline so
+the *why* survives even if this doc is deleted.
 - **Bluetooth**: disabled. Enable with `systemctl enable --now bluetooth`.
 - **Printing**: CUPS removed. Reinstall if needed: `dnf install cups`.
 - **Snapd, Flatpak**: not installed (Flatpak optional add-on).

@@ -271,14 +271,41 @@ sed -i \
 plymouth-set-default-theme details 2>/dev/null || true
 [ -f /boot/grub2/grub.cfg ] && grub2-mkconfig -o /boot/grub2/grub.cfg 2>/dev/null || true
 
-# zram swap (no disk swap; keys never leak to platter)
+# zram swap (no disk swap; keys never leak to platter).
+#
+# Sizing: up to 16 GiB of swap (zstd ~3:1, so ~5.3 GiB of RAM when full).
+# The default 8G filled under sustained pressure on 32+ GiB laptops
+# running browsers + LSP + chat → kernel OOM (no disk-swap fallback per
+# threat model). See overlay/etc/systemd/zram-generator.conf and
+# docs/HARDENING.md "Memory pressure" for full rationale.
 dnf install -y zram-generator || true
 cat > /etc/systemd/zram-generator.conf << 'EOF'
 [zram0]
-zram-size = min(ram, 8192)
+zram-size = min(ram, 16384)
 compression-algorithm = zstd
 EOF
 
+# Memory-pressure sysctl tuning for zram-only stack. Default vm.swappiness
+# assumes a slow disk; on zram the kernel must be told to swap early
+# (180) and reclaim early (watermark_scale_factor=125) so it never gets
+# cornered into kernel-OOM. page-cluster=0 disables read-ahead which is
+# pointless on RAM-backed swap. See overlay/etc/sysctl.d/95-memory-pressure.conf
+# and docs/HARDENING.md "Memory pressure" for the rationale + failure mode.
+cat > /etc/sysctl.d/95-memory-pressure.conf << 'EOF'
+vm.swappiness = 180
+vm.watermark_scale_factor = 125
+vm.page-cluster = 0
+EOF
+
+# systemd-oomd: userspace OOM killer that uses PSI (pressure stall info)
+# to pick a victim cgroup BEFORE the kernel's global OOM reaper fires.
+# Without oomd the kernel waits until total exhaustion then picks by
+# oom_score, often killing plasmashell or the active terminal instead of
+# the runaway browser tab. Fedora ships systemd-oomd-defaults with sane
+# thresholds for user.slice cgroups.
+dnf install -y systemd-oomd-defaults || true
+systemctl enable systemd-oomd.service || true
+
 # Patch anaconda's transaction_progress.py inside the live rootfs so that
 # when the user clicks "Install", a non-fatal RPM 6.0 *scriptlet* warning
 # does not get escalated to "An error occurred during the transaction"

45
overlay/etc/sysctl.d/95-memory-pressure.conf
Normal file
@@ -0,0 +1,45 @@
+# veilor-os: memory-pressure tuning for zram-only swap
+#
+# Rationale: veilor-os ships zram swap with NO disk swap (see THREAT-MODEL.md
+# §"Lost or stolen laptop"). The kernel's default vm.* knobs assume a slow
+# spinning disk and refuse to swap until physical RAM is nearly exhausted.
+# Under a zram-only stack that policy is wrong on two axes:
+#
+#   1. zram is RAM-fast: there is no penalty for swapping early, only a
+#      small CPU cost for zstd compress/decompress.
+#   2. Once zram fills, there is no overflow (no disk swap by design), so
+#      the kernel falls through to OOM. With default knobs the OOM trigger
+#      is slow and reactive: by the time it fires, the system has spent
+#      minutes in thrash (compositor/input frozen, mouse stuck) and the
+#      kernel picks a victim by oom_score which is often plasmashell or
+#      the terminal, i.e. the user's session goes down, not the runaway.
+#
+# What these knobs do:
+#
+#   vm.swappiness = 180
+#     Tell the kernel to prefer evicting anonymous pages to (zram) swap
+#     over reclaiming file-backed pages. Fedora's zram-generator upstream
+#     recommends 180 for zram-only systems. Default 60 is tuned for HDD
+#     swap and leaves zram unused until too late.
+#
+#   vm.watermark_scale_factor = 125
+#     Start kswapd reclaim earlier (~1.25% of RAM headroom vs default
+#     0.1%). On a 32 GiB box that's ~400 MiB head start before allocations
+#     would otherwise stall in direct-reclaim. Trades a tiny amount of
+#     usable RAM for much smoother latency under bursty allocators
+#     (Chromium/Electron tab spawns, language server warm-up).
+#
+#   vm.page-cluster = 0
+#     Read one page per swap-in instead of the default 8. Read-ahead is a
+#     win on rotational media because seeks dominate; on zram the seek
+#     cost is zero and grabbing 7 extra pages just wastes decompress
+#     cycles and CPU cache. Setting to 0 is the documented zram tuning.
+#
+# Companion: systemd-oomd is enabled in the same change so PSI-based
+# pre-OOM kills land on the right cgroup before the kernel OOM reaper
+# fires. Without it, even with these knobs the system can still wedge
+# briefly while the kernel waits for the global watermark.
+
+vm.swappiness = 180
+vm.watermark_scale_factor = 125
+vm.page-cluster = 0

19
overlay/etc/systemd/zram-generator.conf
Normal file
@@ -0,0 +1,19 @@
+# veilor-os: zram swap override
+#
+# Replaces the Fedora default config (which would otherwise set
+# zram-size = min(ram, 8192) with whatever compression algorithm
+# zram-generator picked, historically lzo-rle).
+#
+# Sizing rationale: up to 16 GiB of swap; at a typical 3:1 with zstd a
+# full device costs ~5.3 GiB of RAM. The default 8 GiB filled under
+# sustained pressure on modern 32+ GiB laptops running browsers + LSP +
+# chat clients, leaving the kernel with no swap headroom and triggering
+# OOM (no disk swap fallback; see THREAT-MODEL.md "no key leak risk").
+#
+# Algorithm: zstd. lzo-rle is faster but ratio ~2:1; zstd is ~3:1 with
+# negligible CPU cost on any post-2018 x86_64. Storing the same pages in
+# about a third less RAM is worth more than the microseconds of compress time.
+
+[zram0]
+zram-size = min(ram, 16384)
+compression-algorithm = zstd

@@ -91,6 +91,18 @@ before the build is considered green.
 - [ ] `lsblk -f` shows LUKS2 on the main partition
 - [ ] `cryptsetup luksDump /dev/...` shows argon2id, aes-xts-plain64
 - [ ] `swapon` shows `zram` device, no disk swap
+- [ ] `zramctl` shows `ALGORITHM=zstd` and `DISKSIZE=16G` (= 16 GiB,
+      not Fedora's 8 GiB default; see `overlay/etc/systemd/zram-generator.conf`)
+
+## Memory pressure
+
+- [ ] `systemctl is-active systemd-oomd` → `active` (PSI-based pre-OOM
+      killer; without it the kernel waits until total RAM exhaustion
+      then often kills plasmashell or the active terminal instead of
+      the runaway tab)
+- [ ] `sysctl vm.swappiness vm.watermark_scale_factor vm.page-cluster`
+      shows `180 / 125 / 0` (default `60 / 10 / 3` is wrong for
+      zram-only: kernel refuses to swap until exhausted, then thrashes)
 
 ## SELinux module
 