veilor-os/CHANGELOG.md
veilor-org c6f65f0831
feat(hardening): CPU/IO slice isolation for background services
Companion to the memory-pressure tuning (7d2b94b). Memory was only
half the "expensive laptop typing like a Chromebook" story — once
zram-only OOM thrash was solved, a second symptom class emerged:
post-boot CPU/IO contention on machines with high core counts.

Live incident on a 24-thread Ryzen AI 9 HX 370 / 30 GiB workstation,
2026-05-13: ~16 min after login, load avg 6.5, typing in konsole and
the address bar lagged 100s of ms. RAM/swap uncontended (8 GiB/30 GiB
used, zero swap), so the memory tuning was holding. PSI showed
cpu some=0.34 — pure scheduler contention.

Root cause: every Fedora unit ships with CPUWeight=[not set] which
maps to weight=100. Under contention the kernel splits CPU evenly
between every leaf cgroup. With the post-boot storm running
concurrently (plasma-discover ~80%, packagekitd ~33%, fwupd ~20%,
dnf-makecache firing) the compositor (kwin_wayland, plasmashell) was
losing scheduling fights against package metadata.

Three fixes shipped together:

1. system-bg.slice — CPUWeight=20, IOWeight=50, MemoryHigh=4G. Five
   service drop-ins assign packagekit, fwupd, fwupd-refresh,
   dnf-makecache, dnf5-automatic into it with Nice=10 and
   IOSchedulingClass=idle. Proportional, not a hard cap — idle
   systems still get full speed.

2. user-.slice.d/10-boost.conf — CPUWeight=300, IOWeight=200 on every
   logged-in user session. Combined with above gives a 15:1
   interactive:background ratio under contention.

3. Boot-storm sources defused: skel autostart shadow disables the
   discover update notifier auto-launch; dnf-makecache.timer
   OnBootSec=20min pushes metadata refresh past peak session
   bring-up.

One opt-in artifact: skel user-bg.slice (CPUWeight=30) for anyone
installing Syncthing, rclone, or a file indexer — drop a
Slice=user-bg.slice drop-in on the service to inherit the same
protection at the user level.

Verified live before opening this PR: load dropped 6.53 -> 3.55
within minutes of applying; cgroup placement confirmed via
systemd-cgls.

Follow-up filed in CHANGELOG (not in this PR): tuned-adm
"onyx-performance" profile silently falls back to balanced, and
EPP regresses to balance_performance on AC. Needs separate branch.
2026-05-13 10:15:35 +01:00


Changelog

All notable changes to veilor-os are documented here.

The format follows Keep a Changelog, and this project loosely follows Semantic Versioning during the pre-1.0 phase.

Each release section records the bug found and the fix applied so future maintainers can see why a change exists, not just what it changes.

[Unreleased]

Hardening: CPU/IO slice isolation for background services

Companion to the memory-pressure tuning (see prior entry). Memory was only half the story — once OOM thrash was solved, a second class of "why is my expensive laptop typing like a Chromebook" symptom emerged: post-boot CPU/IO contention.

Bug found

Live incident on a 24-thread Ryzen AI 9 HX 370 / 30 GiB workstation, 2026-05-13: ~16 minutes after login, load avg climbed to ~6.5, typing in konsole and the address bar lagged by hundreds of ms. RAM and swap were uncontended (8 GiB used / 30 GiB total, zero swap), so the memory-pressure work was holding. PSI showed cpu some=0.34 — pure scheduler contention.

Root cause: every Fedora unit ships with CPUWeight=[not set] (defaults to 100), so under contention the kernel's fair CPU scheduler splits CPU time evenly between every leaf cgroup. With the post-boot storm running concurrently:

  • plasma-discover (KDE update GUI, autostarted via /etc/xdg/autostart/org.kde.discover.notifier.desktop) — ~80 % CPU doing repo metadata refresh
  • packagekitd (the discover backend) — ~33 %
  • fwupd + fwupd-refresh — ~20 %
  • dnf-makecache.timer firing in the same window
  • kwin_wayland (~33 %) and plasmashell (~19 %) competing on equal footing with all of the above

The compositor lost scheduling fights against package metadata, hence the typing lag. zram-only swap and vm.swappiness=180 are correct for this stack but do nothing for a CPU-bound storm.
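
The PSI reading in the incident report can be pulled out of /proc/pressure/cpu programmatically. The following is a minimal sketch, not the tooling used during the incident: the changelog quotes cpu some=0.34 without naming the averaging window, so the sample line below assumes avg10, and all other sample values are illustrative.

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like: "some avg10=0.34 avg60=0.21 avg300=0.09 total=123456".
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


# Illustrative sample only; not captured from the incident workstation.
SAMPLE = (
    "some avg10=0.34 avg60=0.21 avg300=0.09 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

pressure = parse_psi(SAMPLE)
print("cpu some avg10:", pressure["some"]["avg10"])
```

On a live system the same function could be fed the text read from /proc/pressure/cpu, /proc/pressure/io, or /proc/pressure/memory, which would make it possible to tell a CPU-bound storm apart from a memory-bound one.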

Fix applied

Two new slices in overlay/etc/systemd/system/:

  1. system-bg.slice: CPUWeight=20, IOWeight=50, MemoryHigh=4G. Drop-ins assign packagekit.service, fwupd.service, fwupd-refresh.service, dnf-makecache.service, and dnf5-automatic.service into it with Nice=10 and IOSchedulingClass=idle.
  2. user-.slice.d/10-boost.conf: CPUWeight=300, IOWeight=200 on every logged-in user session. Combined with the slice above, this gives a 15:1 interactive:background CPU ratio under contention. Idle systems still get full speed; weights are proportional, not hard caps.
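
The 15:1 figure is the ratio of the two configured weights. A small worked example of the arithmetic follows, under a simplifying assumption: it treats the user session and system-bg.slice as if they competed directly as siblings. In the real cgroup tree each slice competes with its own siblings under its parent, so the effective split also depends on the parents' weights.

```python
def cpu_shares(weights):
    """Return each cgroup's fraction of CPU time when all are runnable.

    Weights are proportional: share = weight / sum of competing weights.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Weights taken from the changelog entry; the sibling model is a simplification.
weights = {"user session": 300, "system-bg.slice": 20}
shares = cpu_shares(weights)
ratio = weights["user session"] / weights["system-bg.slice"]

print(f"nominal ratio: {ratio:.0f}:1")
for name, share in shares.items():
    print(f"{name}: {share:.2%} under full contention")
```

Because the split only applies when both sides are runnable, a background job on an otherwise idle machine is not slowed down by its low weight.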

Two boot-storm sources defused:

  • overlay/etc/skel/.config/autostart/org.kde.discover.notifier.desktop shadows the system autostart with Hidden=true. Updates still flow via dnf5-automatic.timer; users can launch Discover manually. No GUI fires at session start.
  • dnf-makecache.timer.d/10-delay.conf pushes OnBootSec=20min so metadata refresh lands past peak session bring-up.

One opt-in artifact for users:

  • overlay/etc/skel/.config/systemd/user/user-bg.slice (CPUWeight=30, IOWeight=50, MemoryHigh=3G). Veilor-os does not ship sync tools by default, but anyone installing Syncthing / rclone / a file indexer can drop a Slice=user-bg.slice drop-in on the service and inherit the same protection at the user level.

Verified live (post-incident workstation, before opening the PR):

slice              CPUWeight  IOWeight  MemoryHigh
system-bg.slice    20         50        4G
user-1000.slice    500        500       infinity
user-bg.slice      30         50        3G

Cgroup placement was confirmed via systemd-cgls: packagekit.service under /system.slice/system-bg.slice/, syncthing.service under /user.slice/user-1000.slice/.../user-bg.slice/. Load dropped from 6.53 to 3.55 within minutes of applying, and typing in the compositor recovered immediately on the next contention event.
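
Placement can also be checked per process rather than by reading the systemd-cgls tree. The sketch below is an alternative check, not what was run on the workstation: it assumes a cgroup v2 (unified) hierarchy, where /proc/<pid>/cgroup contains a single "0::<path>" line, and the sample path mirrors the placement described above.

```python
def cgroup_path(proc_cgroup_text):
    """Return the unified (cgroup v2) path from /proc/<pid>/cgroup contents."""
    for line in proc_cgroup_text.splitlines():
        hierarchy, controllers, path = line.split(":", 2)
        if hierarchy == "0" and controllers == "":
            return path
    return None


def in_slice(path, slice_name):
    """True if slice_name is one of the components of the cgroup path."""
    return slice_name in path.strip("/").split("/")


# Sample mirroring the placement reported in the changelog.
SAMPLE = "0::/system.slice/system-bg.slice/packagekit.service\n"

path = cgroup_path(SAMPLE)
print(path, "->", in_slice(path, "system-bg.slice"))
```

Matching on whole path components avoids a false positive where one slice name happens to be a substring of another.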

Follow-up surfaced during this work (not in this PR)

While debugging "still feels laggy after slice fix" on the same workstation, found two power-profile bugs worth a separate investigation:

  1. tuned-adm active reported balanced despite the system being on AC + charging. EPP was balance_performance and all 24 cores sat pinned at scaling_min_freq (605 MHz) — typing latency was the CPU refusing to ramp on short bursts, even with no contention. Manually setting EPP to performance and switching to the stock throughput-performance profile restored snappy input.
  2. tuned-adm profile onyx-performance (shipped via overlay/etc/tuned/profiles/) silently fell back to balanced instead of activating. No errors in journalctl -u tuned. The profile config or its tuned.conf script likely has a bad exit somewhere; needs reproduction in CI and a test that asserts tuned-adm active matches what was requested.

Both are tracked for a follow-up branch — out of scope here because this PR only covers cgroup/slice isolation. Filing now so it does not get lost.
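
The proposed test for the second bug could look roughly like the sketch below. It assumes that tuned-adm active prints a line of the form "Current active profile: <name>", and it works on captured output so that the comparison logic can be exercised without tuned installed. The function names are hypothetical.

```python
import re


def active_profile(output):
    """Extract the profile name from `tuned-adm active` output, or None."""
    match = re.search(
        r"^Current active profile:\s*(\S.*?)\s*$", output, re.MULTILINE
    )
    return match.group(1) if match else None


def assert_profile(requested, output):
    """Fail loudly if tuned reports a different profile than requested."""
    active = active_profile(output)
    if active != requested:
        raise AssertionError(
            f"requested {requested!r}, but tuned reports {active!r}"
        )


# The silent fallback described above would be caught like this.
captured = "Current active profile: balanced\n"
try:
    assert_profile("onyx-performance", captured)
    fallback_detected = False
except AssertionError as exc:
    fallback_detected = True
    print("caught:", exc)
```

In CI the captured string would come from running tuned-adm active after requesting the profile, so a silent fallback turns into a failed job instead of a laggy desktop.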

v0.7 BlueBuild OCI spike (active — v0.7-bluebuild-spike)

CI plumbing landed (~13 fixes) to unblock the first green BlueBuild run on the self-hosted Forgejo runner. Build still red as of 2026-05-08; OCI artifact + installer ISO pending green run.

Forgejo runner + build-image plumbing

  • Forgejo runner upgraded to v6.4.0 with userns-remap=default. Buildah needs --userns=host to undo the remap inside the job; added to every bluebuild build invocation.
  • Custom build image veilor-build:43 (fedora:43 + nodejs + buildah deps). Replaces the upstream BlueBuild image, which lacked Forgejo-runner-friendly tooling.
  • Workflow now runs-on: nullstone (single self-hosted runner, no nested docker).
  • Build timeout bumped 60 min → 360 min to absorb first-time secureblue base pulls on a cold runner.

Signing + registry auth

  • cosign v2.4.1 installed from upstream binary (no Fedora RPM yet for v2.4.x).
  • GHCR PAT login added so the BlueBuild step can pull ghcr.io/secureblue/kinoite-main-hardened (anonymous pulls are rate-limited).
  • cosign keypair signing — keyless OIDC fails on Forgejo (no Sigstore Fulcio integration), so we ship a static keypair under the repo and sign with cosign sign --key. Public key checked in for verification.

BlueBuild recipe pivots

  • Base image switched to ghcr.io/secureblue/kinoite-main-hardened (the actual published image). Prior reference to securecore-kinoite-hardened-userns was a planning-phase guess and did not exist.
  • Module type pivots, driven by buildah-privileged + bind-mounted helper scripts hitting chmod permission blockers:
    • type: files → type: copy (files module's chmod step failed under bind-mount).
    • type: script + type: systemd → type: containerfile RUN (single layer, no helper-script bind-mount).

Installer ISO — pivoted

  • livemedia-creator → bootc-image-builder. livemedia-creator does not support the ostreecontainer install method (only ostreesetup/url/nfs), so the v0.7 path required the swap. Build pending OCI artifact.

Docs

  • This CHANGELOG entry.
  • ROADMAP refresh — v0.5.0 marked done, v0.7 OCI marked in-flight, installer-iso pivot recorded, USB install-log persistence default-on promise documented, v1.0 ship criteria carried over.

Infra (out-of-tree, recorded for traceability)

  • 2026-05-08 — Headscale OIDC 403 fixed by adding 172.20.0.0/24 (docker proxy bridge gateway) to the no-guest@file Traefik middleware allowlist on nullstone. Unblocks tag:guest provisioning for veilor-os clients.
  • All GitHub remotes removed from veilor-os local clones, six worktrees, and sibling projects (auth-limbo, minecraft-launcher, minecraft-server, infra). GH push-mirrors disabled. Forgejo-only since 2026-05-05.

Planned (deferred / parking)

  • v0.3 polish — Plymouth black theme, SDDM theme, Konsole profile, wallpaper SVG. Re-enable init_on_alloc=1 init_on_free=1 post-install via veilor-firstboot so live boot stays fast but installed system keeps the memory hygiene.
  • USBGuard auto-snapshot on first boot.
  • veilor-firstboot UX improvements (cleaner banner, better error paths).

[0.5.0] — 2026-05-06

Tag: v0.5.0, the final kickstart-path release.

The hardened-Fedora-43 kickstart line ships. Future work moves to the v0.7 BlueBuild OCI spike; the kickstart retires at v1.0.

Added

  • First green Forgejo-CI ISO build (~2.7 GB live ISO, EFI + BIOS bootable). Released as ci-latest artifact at git.s8n.ru/veilor-org/veilor-os/releases/tag/ci-latest.
  • gum TUI installer wrapping Anaconda — single LUKS prompt, locale locked to en_US.UTF-8, admin-password first-boot flow.
  • LUKS2 argon2id + btrfs subvols install via Anaconda, written through /etc/kernel/cmdline so BLS entries carry the cmdline veilor needs.
  • 3-mode veilor-power CLI (save | mid | perf) with AC/battery udev auto-switching, lifted into the overlay.
  • KDE black theme + Fira Code system font, branded /etc/os-release, GRUB rebrand, plymouth detail-text boot.
  • Hardening: SELinux enforcing, USBGuard default-block, fail2ban + auditd, firewalld drop zone, NTS chrony, DNS-over-TLS, locked root.
  • Self-hosted Forgejo CI on nullstone replaces the GitHub Actions build pipeline.

Fixed (delta from v0.2.5 → v0.5.0 — 35+ failure classes)

The full v0.5.x grind is documented per-release in commit messages (v0.5.21 → v0.5.32). Headline fixes:

  • --location=none skipped CollectKernelArgumentsTask. Anaconda shipped BLS entries with empty cmdline. Fix: write /etc/kernel/cmdline directly + /etc/default/grub + grubby + explicit kernel-install add. (v0.5.31)
  • transaction_progress.py install scroll masked real failures when patched too broadly. Narrowed the patch to only suppress Configuring xxx.x86_64. (v0.5.28 → v0.5.29)
  • Locale dialog raced anaconda startup. Lock to en_US.UTF-8, defer locale choice to veilor-postinstall (v0.7 scope). (v0.5.28)
  • fbcon=nodefer + GRUB rebrand + ASCII gum cursor make the install flow legible on linux fbcon. (v0.5.27)
  • rd.luks.uuid injected via grubby --update-kernel=ALL in chroot %post — earlier releases relied on Anaconda which silently dropped it. (v0.5.23, v0.5.27)
  • 9-agent research wave identified the v0.5.32 blocker map; 7 blockers shipped in one bundle.
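
The empty-cmdline failure in the first fix above is easy to miss because the install itself succeeds. One way to catch a regression is to inspect the generated BLS entries for a non-empty options line. This is a sketch of such a check, not something the changelog says the project runs; the sample entry, kernel version, and UUID are hypothetical placeholders.

```python
def bls_options(entry_text):
    """Return the kernel arguments from a BLS entry, or "" if none are set."""
    for line in entry_text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] == "options":
            return parts[1].strip() if len(parts) > 1 else ""
    return ""


# Hypothetical entries; real ones live under /boot/loader/entries/*.conf.
GOOD_ENTRY = """\
title Example Linux (hypothetical)
version 0.0.0-example
linux /vmlinuz-0.0.0-example
initrd /initramfs-0.0.0-example.img
options rd.luks.uuid=luks-00000000-0000-0000-0000-000000000000 lockdown=integrity
"""
BROKEN_ENTRY = GOOD_ENTRY.rsplit("options", 1)[0] + "options\n"

for name, entry in (("good", GOOD_ENTRY), ("broken", BROKEN_ENTRY)):
    args = bls_options(entry)
    print(name, "->", "ok" if "rd.luks.uuid=" in args else "MISSING rd.luks.uuid")
```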

Notes

  • Treat v0.5.0 as the portfolio anchor for the kickstart path. v0.5.32-rc was the last test-run; v0.5.0 was tagged on 2026-05-06 as the freeze point.
  • v0.6 was cancelled the same day (folded into v0.7). See docs/ROADMAP.md strategy-pivot section.

[0.2.5] — 2026-05-01

Commit: 8515bdb

Fixed

  • Live boot took 5+ minutes on KVM. Dracut sat at the parse-livenet stage for what looked like a hang. Root cause: init_on_alloc=1 and init_on_free=1 zero every memory page on allocation and free. In a virtualised guest with paravirtual memory, this multiplied the early-boot cost by ~5x. Removed both flags from the live kernel cmdline.

Notes

  • The two memory-hygiene flags will be re-added on the installed system via veilor-firstboot in v0.3 — the cost on bare metal is negligible, the live-ISO penalty is the only place it bites.
  • Live cmdline retained: lockdown=integrity slab_nomerge randomize_kstack_offset=on vsyscall=none.

[0.2.4] — 2026-05-01

Commit: a23ce63

Fixed

  • VM booted but stalled at dracut "parse-livenet" looking for a label that never matched. Root cause: an upstream bug in livecd-tools, where imgcreate/live.py::__get_efi_image_stanza() writes the EFI grub stanza as root=live:LABEL=... for dracut. Dracut on live ISOs expects live:CDLABEL=... for ISO9660 volume labels; LABEL= matches partition labels, which a live ISO doesn't have.
  • Patched live.py in-place inside the CI build container before invoking livecd-creator. With the patched stanza, the VM booted cleanly to the SDDM login prompt.

Changed

  • CI workflow now seds the patch into the installed live.py and asserts the patch landed before continuing the build.
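
The patch-then-assert step can be illustrated as follows. The CI workflow does this with sed; the Python below is an equivalent sketch of the same idea, and the one-line sample is a hypothetical stand-in for the stanza template rather than the real contents of live.py.

```python
OLD = "root=live:LABEL="
NEW = "root=live:CDLABEL="


def patch_live_py(source):
    """Rewrite the EFI stanza prefix and verify that the rewrite landed."""
    if NEW in source and OLD not in source:
        return source  # already patched, nothing to do
    if OLD not in source:
        raise ValueError("expected pattern not found; live.py may have changed upstream")
    patched = source.replace(OLD, NEW)
    if OLD in patched or NEW not in patched:
        raise RuntimeError("patch did not land")
    return patched


# Hypothetical stand-in for the stanza template, not the real file contents.
SAMPLE = "linux /images/pxeboot/vmlinuz root=live:LABEL=%(fslabel)s rd.live.image\n"

patched = patch_live_py(SAMPLE)
print(patched, end="")
```

Failing when the pattern is absent matters as much as the replacement itself: if upstream changes or fixes the stanza, the build stops instead of silently shipping an unpatched ISO.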

Notes

  • Bug also affects livemedia-creator --make-iso --no-virt and any other consumer of imgcreate.LiveImageCreator. Worth filing upstream once we have a clean repro recipe.

[0.2.3] — 2026-05-01

Commit: ef54a24

Added

  • Manual useradd admin invocation in chroot %post. livecd-creator does not run an installer phase, so the kickstart user directive is silently ignored. Without this, the booted live system has no admin account at all, and SDDM falls back to "no users" — login impossible.

Fixed

  • /etc/os-release was still pointing at stock Fedora. Even with the overlay tree successfully copied, kde-theme-apply.sh was resolving /etc/os-release.d/veilor from the wrong path (the build host's repo, not the overlay's installed location).
  • Rewired the symlink chain cleanly: /etc/os-release → ../usr/lib/os-release, with the override file written to /usr/lib/os-release directly during %post.
  • Branding now reflects veilor-os in /etc/os-release, hostnamectl, and the SDDM session menu.

Notes

  • The user --name=admin directive stays in the kickstart for documentation and for any future livemedia-creator-based installer ISO that does honour it.

[0.2.2] — 2026-05-01

Commit: 3408841

Fixed

  • Overlay was partially copied — boot worked but veilor-power, KDE theme, custom scripts were all missing. Found via offline debugfs inspection of the v0.2.1 rootfs: tuned profiles, sshd hardening, sudoers entries, and systemd units were present, but /usr/share/veilor-os/{assets,scripts} was empty.
  • Root cause: %post --nochroot ran with set -eu. When the first cp of a non-essential overlay file returned non-zero, the script aborted, leaving the assets/scripts copy step un-executed. None of the chroot %post scripts could then find what they needed and they silently no-op'd.

Changed

  • %post --nochroot now uses set +e around cp/mkdir so a partial-permissions error on one tree doesn't kill the whole copy.
  • Added /var/log/veilor-nochroot.log — every action in %post --nochroot now traces with timestamps. Future debugging is one journalctl --boot away.
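
The copy-and-keep-going behaviour with a timestamped trace can be sketched like this. The actual %post --nochroot script is shell (set +e around cp/mkdir); the Python version below only illustrates the pattern, and the directory names in the demo are made up.

```python
import shutil
import tempfile
import time
from pathlib import Path


def copy_overlay(sources, dest_root, log_path):
    """Copy each source tree into dest_root; log and continue on failure."""
    failed = []
    with open(log_path, "a") as log:
        for src in map(Path, sources):
            target = Path(dest_root) / src.name
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            try:
                shutil.copytree(src, target, dirs_exist_ok=True)
                log.write(f"{stamp} OK   {src} -> {target}\n")
            except OSError as exc:  # shutil.Error is an OSError subclass
                failed.append(str(src))
                log.write(f"{stamp} FAIL {src}: {exc}\n")
    return failed


# Demo in a throwaway directory: "assets" is missing, "scripts" exists.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "scripts").mkdir()
    (root / "scripts" / "hello.sh").write_text("echo hi\n")
    (root / "dest").mkdir()
    log_file = root / "nochroot.log"

    failed = copy_overlay([root / "assets", root / "scripts"], root / "dest", log_file)
    copied_ok = (root / "dest" / "scripts" / "hello.sh").is_file()
    log_text = log_file.read_text()

print("failed:", len(failed), "| scripts copied:", copied_ok)
```

Returning the list of failures keeps the loose error handling bounded: the caller can still decide to abort if an essential tree is in that list.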

Notes

  • The looser error handling is intentional but bounded — only the overlay copy uses set +e. Hardening scripts that follow run with strict mode.

[0.2.1] — 2026-05-01

Commit: 9c6136f

Fixed

  • ISO booted, but it was effectively bare Fedora KDE. No hardening, no theme, no veilor-power, no /etc/os-release override. Confirmed by mounting v0.2.0 with debugfs: /etc/os-release symlinked to ../usr/lib/os-release (Fedora's default), no /usr/share/veilor-os, no overlay files anywhere.
  • Root cause: %post --nochroot hardcoded /mnt/sysimage as the destination. /mnt/sysimage is the livemedia-creator install root. We had switched the build pipeline to livecd-creator, which exposes the destination as $INSTALL_ROOT — a different path inside its tmpfs sandbox.
  • Switched the copy target to $INSTALL_ROOT.

Notes

  • Partial overlay landed in v0.2.1 (tuned, sshd, sddm.conf) — but /usr/share/veilor-os/{assets,scripts} was still missing because set -eu aborted partway through the cp tree. That fix is in v0.2.2.
  • Lesson learned: tooling-specific environment variables matter. $INSTALL_ROOT is the portable answer; /mnt/sysimage is a livemedia-creator-only convention.
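
The lesson can be expressed as a tiny resolver that prefers the tool-provided variable. This is a sketch under one assumption that is not in the changelog: falling back to /mnt/sysimage when $INSTALL_ROOT is unset, so the same logic would still work under livemedia-creator. The function name and sample path are hypothetical.

```python
import os

LMC_ROOT = "/mnt/sysimage"  # livemedia-creator convention (assumed fallback)


def resolve_install_root(env=None):
    """Prefer $INSTALL_ROOT; fall back to the livemedia-creator path."""
    env = os.environ if env is None else env
    return env.get("INSTALL_ROOT") or LMC_ROOT


# Hypothetical value of the kind livecd-creator might export.
livecd_env = {"INSTALL_ROOT": "/var/tmp/imgcreate-example/install_root"}

print(resolve_install_root(livecd_env))
print(resolve_install_root({}))
```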

[0.2.0] — 2026-04-30

Commit: 7c4a94d (tagged release)

Added

  • First green ISO. Reproducible build pipeline lands.
  • GitHub Actions workflow build-iso.yml produces a UEFI+BIOS-bootable live ISO from kickstart/veilor-os.ks.
  • CI: kickstart syntax linting (ksvalidator) gate.
  • Kickstart based on Fedora 43, KDE Plasma minimal, hardening packages selected (fail2ban, usbguard, tuned, audit, firewalld).
  • Overlay tree authored: tuned profiles, sshd hardening, sysctl drop-in, sudoers, udev rules, KDE theme assets, Fira Code font.
  • 3-mode power profiles: veilor-power save | mid | perf with AC/battery udev auto-switching.

Notes — known limitations of v0.2.0

  • The overlay never actually applied to the installed system. The %post --nochroot copy step targeted /mnt/sysimage (livemedia-creator's install root) but the build pipeline had moved to livecd-creator, which uses $INSTALL_ROOT. Result: the ISO boots and presents a working KDE Plasma desktop, but it is in practice stock Fedora 43 KDE with no veilor-os hardening, branding, theme, or power scripts applied.
  • v0.2.0 is best understood as a build-pipeline milestone — the ISO format, EFI/BIOS bootability, partitioning, and squashfs build all work end-to-end. The userspace customisation layer was wired but not delivering. Treat v0.2.0 as proof-of-build, not as a feature-complete release.
  • See v0.2.5 for the first feature-complete ISO that actually ships veilor-os hardening and branding into the running system.

Build pipeline path to green

For posterity, the issues resolved between v0.1 (scaffold) and v0.2.0 (first green ISO):

  • pcre2 / selinux-policy version skew on stock Fedora 43 base — worked around with a pinned fix-repo for the local build only; CI uses dnf upgrade --refresh to sidestep entirely.
  • KDE Plasma hard-deps (cups, geoclue2, ModemManager, PackageKit) — kept at the package level, masked at the daemon level.
  • %post --nochroot source path — multi-path detection added so the overlay can be sourced from /work (CI) or /run/install/repo (virt) or kickstart-relative (no-virt).
  • livemedia-creator --make-iso --no-virt produced a squashfs but no EFI/BOOT image. Switched to livecd-creator (livecd-tools) which is purpose-built for live ISOs and handles EFI grafting.
  • Tmpdir on /tmp exhausted the GitHub Actions tmpfs cap (16 GB vs ~30 GB working set). Moved to /var/lmc on the runner's host ext4.
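
The multi-path detection from the list above amounts to trying candidate roots in order and taking the first that contains the overlay. A minimal sketch follows; the real logic lives in the kickstart's shell %post, and the "overlay" subdirectory name and demo paths are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def find_overlay(candidates, marker="overlay"):
    """Return the first candidate root containing a `marker` directory, else None."""
    for candidate in map(Path, candidates):
        if (candidate / marker).is_dir():
            return candidate / marker
    return None


with tempfile.TemporaryDirectory() as tmp:
    ci_root = Path(tmp) / "work"    # stands in for /work (CI), absent here
    virt_root = Path(tmp) / "repo"  # stands in for /run/install/repo (virt)
    (virt_root / "overlay").mkdir(parents=True)

    found = find_overlay([ci_root, virt_root])
    found_name = found.parent.name if found else None
    missing = find_overlay([ci_root])

print("overlay found under:", found_name)
```

Returning None instead of a default path forces the caller to handle the case where no source exists, which is the situation that previously produced a bare ISO without any error.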

[0.1.0] — 2026-04-29

Commit: 1822005

Added

  • Initial repo scaffold: kickstart/, build/, overlay/, scripts/, assets/, docs/, test/.
  • Kickstart skeleton (Fedora 43 KDE base, single-prompt LUKS install, hardened bootloader cmdline, locked root, blank-password admin with chage -d 0 to force first-boot reset).
  • Hardening scripts ported and rebranded from operator's reference system: base hardening, kernel hardening, custom SELinux policy module veilor-systemd.
  • KDE theme: BreezeBlackPure base + grey accent (#686B6F).
  • Fira Code chosen as system font (Fedora fira-code-fonts, SIL OFL 1.1).
  • Test harness: VM runner (test/run-vm.sh) with QEMU + OVMF for fast iteration, with SECBOOT=1 and FRESH=1 modes.
  • Documentation: BUILD.md, INSTALL.md, HARDENING.md, POWER.md, boot-checklist.md.

Notes

  • v0.1 was scaffold-only — no green ISO yet. Build pipeline iterated through ~22 distinct toolchain issues before producing v0.2.0.
  • All onyx references stripped from shipped artifacts; comments refer to "reference system" only.