veilor-os/CHANGELOG.md
veilor-org c6f65f0831
feat(hardening): CPU/IO slice isolation for background services
Companion to the memory-pressure tuning (7d2b94b). Memory was only
half the "expensive laptop typing like a Chromebook" story — once
zram-only OOM thrash was solved, a second symptom class emerged:
post-boot CPU/IO contention on machines with high core counts.

Live incident on a 24-thread Ryzen AI 9 HX 370 / 30 GiB workstation,
2026-05-13: ~16 min after login, load avg 6.5, typing in konsole and
the address bar lagged 100s of ms. RAM/swap uncontended (8 GiB/30 GiB
used, zero swap), so the memory tuning was holding. PSI showed
cpu some=0.34 — pure scheduler contention.

Root cause: every Fedora unit ships with CPUWeight=[not set] which
maps to weight=100. Under contention the kernel splits CPU evenly
between every leaf cgroup. With the post-boot storm running
concurrently (plasma-discover ~80%, packagekitd ~33%, fwupd ~20%,
dnf-makecache firing) the compositor (kwin_wayland, plasmashell) was
losing scheduling fights against package metadata.

Three fixes shipped together:

1. system-bg.slice — CPUWeight=20, IOWeight=50, MemoryHigh=4G. Five
   service drop-ins assign packagekit, fwupd, fwupd-refresh,
   dnf-makecache, dnf5-automatic into it with Nice=10 and
   IOSchedulingClass=idle. Proportional, not a hard cap — idle
   systems still get full speed.

2. user-.slice.d/10-boost.conf — CPUWeight=300, IOWeight=200 on every
   logged-in user session. Combined with above gives a 15:1
   interactive:background ratio under contention.

3. Boot-storm sources defused: skel autostart shadow disables the
   discover update notifier auto-launch; dnf-makecache.timer
   OnBootSec=20min pushes metadata refresh past peak session
   bring-up.

One opt-in artifact: skel user-bg.slice (CPUWeight=30) for anyone
installing Syncthing, rclone, or a file indexer — drop a
Slice=user-bg.slice drop-in on the service to inherit the same
protection at the user level.

Verified live before opening this PR: load dropped 6.53 -> 3.55
within minutes of applying; cgroup placement confirmed via
systemd-cgls.

Follow-up filed in CHANGELOG (not in this PR): tuned-adm
"onyx-performance" profile silently falls back to balanced, and
EPP regresses to balance_performance on AC. Needs separate branch.
2026-05-13 10:15:35 +01:00


# Changelog
All notable changes to veilor-os are documented here.
The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project loosely follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
during the pre-1.0 phase.
Each release section records the **bug found** and the **fix applied** so
future maintainers can see why a change exists, not just what it changes.
## [Unreleased]
### Hardening: CPU/IO slice isolation for background services
Companion to the memory-pressure tuning (see prior entry). Memory was
only half the story — once OOM thrash was solved, a second class of
"why is my expensive laptop typing like a Chromebook" symptom emerged:
post-boot CPU/IO contention.
#### Bug found
Live incident on a 24-thread Ryzen AI 9 HX 370 / 30 GiB workstation,
2026-05-13: ~16 minutes after login, load avg climbed to ~6.5, typing
in konsole and the address bar lagged by hundreds of ms. RAM and swap
were uncontended (8 GiB used / 30 GiB total, zero swap), so the
memory-pressure work was holding. PSI showed `cpu some=0.34` — pure
scheduler contention.
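For reference, PSI readings like the one above come from `/proc/pressure/cpu`. A minimal sketch of pulling a value off the `some` line follows; `psi_some` is an illustrative helper name, not something shipped in veilor-os, and the text does not say which averaging window (`avg10`, `avg60`, or `avg300`) the 0.34 figure was read from.

```shell
# psi_some FIELD [FILE]
# Print one field (avg10, avg60 or avg300) from the "some" line of a PSI file.
# FILE defaults to the kernel's CPU pressure file. psi_some is a hypothetical
# helper name used only for this sketch.
psi_some() {
    awk -v key="$1" '
        $1 == "some" {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }' "${2:-/proc/pressure/cpu}"
}

# Example: percentage of the last 10 s in which at least one task waited for CPU.
# psi_some avg10
```

The optional second argument makes it possible to run the same parser against a saved copy of the file captured during an incident.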
Root cause: every Fedora unit ships with `CPUWeight=[not set]`
(defaults to 100), so under contention the kernel's CPU scheduler
splits CPU time evenly between sibling cgroups. With the post-boot
storm running concurrently:
- `plasma-discover` (KDE update GUI, autostarted via
`/etc/xdg/autostart/org.kde.discover.notifier.desktop`) — ~80 % CPU
doing repo metadata refresh
- `packagekitd` (the discover backend) — ~33 %
- `fwupd` + `fwupd-refresh` — ~20 %
- `dnf-makecache.timer` firing in the same window
- `kwin_wayland` (~33 %) and `plasmashell` (~19 %) competing on equal
footing with all of the above
The compositor lost scheduling fights against package metadata, hence
the typing lag. zram-only swap and `vm.swappiness=180` are correct for
this stack but do nothing for a CPU-bound storm.
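A back-of-the-envelope sketch of why equal weights hurt, assuming every cgroup listed is a sibling under the same parent and is CPU-bound at the same moment. `cpu_share` is an illustrative helper, not part of veilor-os; real shares also depend on where each unit sits in the cgroup tree.

```shell
# cpu_share MINE OTHER...
# Integer percentage of CPU time a cgroup with weight MINE receives when it and
# every OTHER sibling are runnable at once (its weight / sum of all weights).
# The body runs in a subshell so its variables stay local.
cpu_share() (
    mine=$1
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    echo $((100 * mine / total))
)

# Five busy siblings, all at the default weight of 100: an even five-way split.
cpu_share 100 100 100 100 100    # prints 20
# The same contest with this cgroup lowered to weight 20.
cpu_share 20 100 100 100 100     # prints 4 (2000 / 420, truncated)
```

Under this simplified model a compositor at the default weight gets no preference over a metadata refresh, which matches the symptom described above.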
#### Fix applied
Two slice-level changes in `overlay/etc/systemd/system/`:
1. **`system-bg.slice`** — `CPUWeight=20`, `IOWeight=50`,
`MemoryHigh=4G`. Drop-ins assign `packagekit.service`,
`fwupd.service`, `fwupd-refresh.service`, `dnf-makecache.service`,
and `dnf5-automatic.service` into it with `Nice=10` and
`IOSchedulingClass=idle`.
2. **`user-.slice.d/10-boost.conf`** — `CPUWeight=300`,
`IOWeight=200` on every logged-in user's slice (`user-UID.slice`).
Combined with the above, this gives a nominal **15:1** (300:20)
interactive:background CPU weight ratio under contention. Weights
are proportional shares, not hard caps, so background work still
gets full speed on an idle system.
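The unit files described above might look roughly like the following sketch. Only the directive values come from this entry; the drop-in file name `10-slice.conf`, the helper name `write_bg_units`, and the choice to show just the packagekit drop-in are assumptions made for illustration.

```shell
# write_bg_units DEST
# Write the slice, one service drop-in, and the user-slice boost under DEST,
# which stands in for overlay/etc/systemd/system. The drop-in file name
# 10-slice.conf is an assumption; the directive values come from the text.
write_bg_units() {
    dest=$1
    mkdir -p "$dest/packagekit.service.d" "$dest/user-.slice.d"

    cat > "$dest/system-bg.slice" <<'EOF'
[Slice]
CPUWeight=20
IOWeight=50
MemoryHigh=4G
EOF

    cat > "$dest/packagekit.service.d/10-slice.conf" <<'EOF'
[Service]
Slice=system-bg.slice
Nice=10
IOSchedulingClass=idle
EOF

    cat > "$dest/user-.slice.d/10-boost.conf" <<'EOF'
[Slice]
CPUWeight=300
IOWeight=200
EOF
}
```

On a running system, `systemctl daemon-reload` followed by `systemctl show -p CPUWeight system-bg.slice` is one way to check that the values were picked up.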
Two boot-storm sources defused:
- `overlay/etc/skel/.config/autostart/org.kde.discover.notifier.desktop`
shadows the system autostart with `Hidden=true`. Updates still flow
via `dnf5-automatic.timer`; users can launch Discover manually. No
GUI fires at session start.
- `dnf-makecache.timer.d/10-delay.conf` pushes `OnBootSec=20min` so
metadata refresh lands past peak session bring-up.
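A sketch of the two override files named above, assuming the timer drop-in lives under `etc/systemd/system/` in the overlay (the text gives only its directory and file name). `write_boot_storm_overrides` is a hypothetical helper used to keep the example self-contained.

```shell
# write_boot_storm_overrides ROOT
# ROOT stands in for the overlay root. File names follow the text; the timer
# drop-in's parent directory (etc/systemd/system) is an assumption.
write_boot_storm_overrides() {
    root=$1
    mkdir -p "$root/etc/skel/.config/autostart" \
             "$root/etc/systemd/system/dnf-makecache.timer.d"

    # A per-user autostart entry with the same file name takes precedence over
    # the one in /etc/xdg/autostart; Hidden=true tells the session to skip it.
    cat > "$root/etc/skel/.config/autostart/org.kde.discover.notifier.desktop" <<'EOF'
[Desktop Entry]
Hidden=true
EOF

    # Timer directives such as OnBootSec= are additive: a drop-in adds a trigger
    # rather than replacing the stock one unless the timer list is first reset
    # with an empty assignment. Whether the shipped drop-in does that is not
    # stated in the text, so only the documented line is shown here.
    cat > "$root/etc/systemd/system/dnf-makecache.timer.d/10-delay.conf" <<'EOF'
[Timer]
OnBootSec=20min
EOF
}
```

Files under `/etc/skel` are copied only when a home directory is created, so existing accounts would need the autostart shadow placed in `~/.config/autostart` by hand.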
One opt-in artifact for users:
- `overlay/etc/skel/.config/systemd/user/user-bg.slice`
(`CPUWeight=30`, `IOWeight=50`, `MemoryHigh=3G`). Veilor-os does not
ship sync tools by default, but anyone installing Syncthing /
rclone / a file indexer can drop a `Slice=user-bg.slice` drop-in
on the service and inherit the same protection at the user level.
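For the opt-in case, a user-level setup could look like the sketch below. The slice values come from this entry; `syncthing.service` as the target, the drop-in name `10-slice.conf`, and the helper `write_user_bg_units` are illustrative assumptions.

```shell
# write_user_bg_units CONFIG_HOME
# CONFIG_HOME stands in for ~/.config. syncthing.service and the drop-in name
# 10-slice.conf are illustrative choices, not files shipped by veilor-os.
write_user_bg_units() {
    unit_dir=$1/systemd/user
    mkdir -p "$unit_dir/syncthing.service.d"

    cat > "$unit_dir/user-bg.slice" <<'EOF'
[Slice]
CPUWeight=30
IOWeight=50
MemoryHigh=3G
EOF

    cat > "$unit_dir/syncthing.service.d/10-slice.conf" <<'EOF'
[Service]
Slice=user-bg.slice
EOF
}
```

After writing the files, `systemctl --user daemon-reload` and a restart of the affected service would be needed before the service moves into the slice.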
Verified live (post-incident workstation, before opening the PR):
```
slice             CPUWeight  IOWeight  MemoryHigh
system-bg.slice   20         50        4G
user-1000.slice   500        500       infinity
user-bg.slice     30         50        3G
```
cgroup placement confirmed via `systemd-cgls`: `packagekit.service`
under `/system.slice/system-bg.slice/`, `syncthing.service` under
`/user.slice/user-1000.slice/.../user-bg.slice/`. Load dropped from
6.53 → 3.55 within minutes of applying, and typing in the compositor
recovered immediately on the next contention event.
#### Follow-up surfaced during this work (not in this PR)
While debugging "still feels laggy after slice fix" on the same
workstation, found two power-profile bugs worth a separate
investigation:
1. `tuned-adm active` reported `balanced` despite the system being on
AC + charging. EPP was `balance_performance` and all 24 threads sat
pinned at `scaling_min_freq` (605 MHz) — typing latency was the
CPU refusing to ramp on short bursts, even with no contention.
Manually setting EPP to `performance` and switching to the stock
`throughput-performance` profile restored snappy input.
2. `tuned-adm profile onyx-performance` (shipped via
`overlay/etc/tuned/profiles/`) **silently fell back to `balanced`**
instead of activating. No errors in `journalctl -u tuned`. The
profile config or its `tuned.conf` script likely has a bad exit
somewhere; needs reproduction in CI and a test that asserts
`tuned-adm active` matches what was requested.
Both are tracked for a follow-up branch — out of scope here because
this PR only covers cgroup/slice isolation. Filing now so it does not
get lost.
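The test proposed in item 2 could be sketched along these lines. It assumes `tuned-adm active` prints a line of the form `Current active profile: NAME`; both function names are hypothetical.

```shell
# active_profile_from OUTPUT
# Extract the profile name from a "Current active profile: NAME" line.
# Prints nothing if no such line is present.
active_profile_from() {
    printf '%s\n' "$1" | sed -n 's/^Current active profile: //p'
}

# assert_tuned_profile WANTED
# Request a profile, then fail unless tuned reports that same profile as
# active. This catches a silent fallback to another profile.
assert_tuned_profile() {
    wanted=$1
    tuned-adm profile "$wanted" || return 1
    got=$(active_profile_from "$(tuned-adm active)")
    if [ "$got" != "$wanted" ]; then
        echo "tuned: requested '$wanted' but active profile is '$got'" >&2
        return 1
    fi
}

# Example, on a machine with tuned installed:
# assert_tuned_profile throughput-performance
```

Keeping the parsing in its own function means the string handling can be checked in CI without a running tuned daemon.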
### v0.7 BlueBuild OCI spike (active — `v0.7-bluebuild-spike`)
CI plumbing landed (~13 fixes) to unblock the first green BlueBuild
run on the self-hosted Forgejo runner. **Build still red** as of
2026-05-08; OCI artifact + installer ISO pending green run.
#### Forgejo runner + build-image plumbing
- Forgejo runner upgraded to **v6.4.0** with `userns-remap=default`.
Buildah needs `--userns=host` to undo the remap inside the job; added
to every `bluebuild build` invocation.
- Custom build image **`veilor-build:43`** (fedora:43 + nodejs +
buildah deps). Replaces the upstream BlueBuild image, which lacked
Forgejo-runner-friendly tooling.
- Workflow now **`runs-on: nullstone`** (single self-hosted runner,
no nested docker).
- Build timeout bumped **60 min → 360 min** to absorb first-time
secureblue base pulls on a cold runner.
#### Signing + registry auth
- **cosign v2.4.1** installed from upstream binary (no Fedora RPM yet
for v2.4.x).
- **GHCR PAT login** added so the BlueBuild step can pull
`ghcr.io/secureblue/kinoite-main-hardened` (anonymous pulls are
rate-limited).
- **cosign keypair signing** — keyless OIDC fails on Forgejo (no
Sigstore Fulcio integration), so we ship a static keypair under
the repo and sign with `cosign sign --key`. Public key checked in
for verification.
#### BlueBuild recipe pivots
- Base image switched to **`ghcr.io/secureblue/kinoite-main-hardened`**
(the actual published image). Prior reference to
`securecore-kinoite-hardened-userns` was a planning-phase guess and
did not exist.
- Module type pivots, driven by privileged buildah plus bind-mounted
helper scripts hitting `chmod` "not permitted" blockers:
  - `type: files` → **`type: copy`** (files module's chmod step
    failed under bind-mount).
  - `type: script` + `type: systemd` → **`type: containerfile` RUN**
    (single layer, no helper-script bind-mount).
#### Installer ISO — pivoted
- **livemedia-creator → bootc-image-builder.** livemedia-creator does
not support the `ostreecontainer` install method (only
`ostreesetup`/`url`/`nfs`), so the v0.7 path required the swap.
Build pending OCI artifact.
#### Docs
- This CHANGELOG entry.
- ROADMAP refresh — v0.5.0 marked done, v0.7 OCI marked in-flight,
installer-iso pivot recorded, USB install-log persistence default-on
promise documented, v1.0 ship criteria carried over.
### Infra (out-of-tree, recorded for traceability)
- **2026-05-08** — Headscale OIDC 403 fixed by adding
`172.20.0.0/24` (docker proxy bridge gateway) to the
`no-guest@file` Traefik middleware allowlist on nullstone.
Unblocks `tag:guest` provisioning for veilor-os clients.
- **All GitHub remotes removed** from veilor-os local clones, six
worktrees, and sibling projects (auth-limbo, minecraft-launcher,
minecraft-server, infra). GH push-mirrors disabled. Forgejo-only
since 2026-05-05.
### Planned (deferred / parking)
- v0.3 polish — Plymouth black theme, SDDM theme, Konsole profile,
wallpaper SVG. Re-enable `init_on_alloc=1 init_on_free=1` post-install
via `veilor-firstboot` so live boot stays fast but installed system
keeps the memory hygiene.
- USBGuard auto-snapshot on first boot.
- veilor-firstboot UX improvements (cleaner banner, better error paths).
---
## [0.5.0] — 2026-05-06
**Tag:** `v0.5.0`, the **final kickstart-path release**.
The hardened-Fedora-43 kickstart line ships. Future work moves to
the v0.7 BlueBuild OCI spike; the kickstart retires at v1.0.
### Added
- First green Forgejo-CI ISO build (~2.7 GB live ISO, EFI + BIOS
bootable). Released as `ci-latest` artifact at
`git.s8n.ru/veilor-org/veilor-os/releases/tag/ci-latest`.
- **gum TUI installer** wrapping Anaconda — single LUKS prompt,
locale locked to `en_US.UTF-8`, admin-password first-boot flow.
- **LUKS2 argon2id + btrfs subvols** install via Anaconda, written
through `/etc/kernel/cmdline` so BLS entries carry the cmdline
veilor needs.
- **3-mode `veilor-power` CLI** (`save | mid | perf`) with AC/battery
udev auto-switching, lifted into the overlay.
- **KDE black theme** + Fira Code system font, branded
`/etc/os-release`, GRUB rebrand, plymouth detail-text boot.
- Hardening: SELinux enforcing, USBGuard default-block, fail2ban +
auditd, firewalld drop zone, NTS chrony, DNS-over-TLS, locked
root.
- Self-hosted **Forgejo CI** on nullstone replaces the GitHub
Actions build pipeline.
### Fixed (delta from v0.2.5 → v0.5.0 — 35+ failure classes)
The full v0.5.x grind is documented per-release in commit messages
(v0.5.21 through v0.5.32). Headline fixes:
- **`--location=none` skipped `CollectKernelArgumentsTask`.** Anaconda
shipped BLS entries with empty cmdline. Fix: write
`/etc/kernel/cmdline` directly + `/etc/default/grub` + grubby +
explicit `kernel-install add`. (v0.5.31)
- **`transaction_progress.py` install scroll** masked real failures
when patched too broadly. Narrowed the patch to only suppress
`Configuring xxx.x86_64`. (v0.5.28 → v0.5.29)
- **Locale dialog raced anaconda startup.** Lock to en_US.UTF-8,
defer locale choice to `veilor-postinstall` (v0.7 scope). (v0.5.28)
- **`fbcon=nodefer`** + GRUB rebrand + ASCII gum cursor make the
install flow legible on linux fbcon. (v0.5.27)
- **`rd.luks.uuid`** injected via `grubby --update-kernel=ALL` in
chroot `%post` — earlier releases relied on Anaconda which silently
dropped it. (v0.5.23, v0.5.27)
- **9-agent research wave** identified the v0.5.32 blocker map; 7
blockers shipped in one bundle.
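The `/etc/kernel/cmdline` part of the first fix above can be sketched as follows. `write_kernel_cmdline` is a hypothetical helper, and the argument values in the commented usage are placeholders rather than the real veilor-os cmdline.

```shell
# write_kernel_cmdline ROOT ARG...
# Write the given kernel arguments as a single space-separated line to
# ROOT/etc/kernel/cmdline, the file kernel-install consults when it generates
# BLS entries.
write_kernel_cmdline() {
    root=$1
    shift
    mkdir -p "$root/etc/kernel"
    printf '%s\n' "$*" > "$root/etc/kernel/cmdline"
}

# Inside the chroot %post, ROOT would be "/" and the LUKS UUID would come from
# the installed system. The value below is a placeholder, not a real UUID:
# write_kernel_cmdline / rd.luks.uuid=luks-PLACEHOLDER slab_nomerge
# grubby --update-kernel=ALL --args="rd.luks.uuid=luks-PLACEHOLDER"
```

Taking the root as a parameter keeps the helper testable against a scratch directory before it is pointed at a real install root.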
### Notes
- Treat v0.5.0 as the **portfolio anchor** for the kickstart path.
v0.5.32-rc was the last test-run; v0.5.0 was tagged on
2026-05-06 as the freeze point.
- v0.6 was **cancelled** the same day (folded into v0.7). See
`docs/ROADMAP.md` strategy-pivot section.
---
## [0.2.5] — 2026-05-01
**Commit:** `8515bdb`
### Fixed
- **Live boot took 5+ minutes on KVM.** Dracut sat at the parse-livenet
stage for what looked like a hang. Root cause: `init_on_alloc=1`
and `init_on_free=1` zero every memory page on allocation and free.
In a virtualised guest with paravirtual memory, this multiplied the
early-boot cost by ~5x. Removed both flags from the *live* kernel
cmdline.
### Notes
- The two memory-hygiene flags will be re-added on the **installed**
system via `veilor-firstboot` in v0.3 — the cost on bare metal is
negligible, the live-ISO penalty is the only place it bites.
- Live cmdline retained: `lockdown=integrity slab_nomerge
randomize_kstack_offset=on vsyscall=none`.
---
## [0.2.4] — 2026-05-01
**Commit:** `a23ce63`
### Fixed
- **VM booted but stalled at dracut "parse-livenet" looking for a label
that never matched.** Root cause: an upstream bug in
`livecd-tools`: `imgcreate/live.py::__get_efi_image_stanza()` writes
the EFI grub stanza as `root=live:LABEL=...` for dracut. Dracut on
live ISOs expects `live:CDLABEL=...` for ISO9660 volume labels;
`LABEL=` matches partition labels which a live ISO doesn't have.
- Patched `live.py` in-place inside the CI build container before
invoking `livecd-creator`. With the patched stanza, the VM booted
cleanly to the SDDM login prompt.
### Changed
- CI workflow now `sed`s the patch into the installed `live.py` and
asserts the patch landed before continuing the build.
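A sketch of a patch-then-assert step of this kind. The exact `sed` expression used in CI is not given here, so the one below is an assumption based on the `LABEL=` and `CDLABEL=` strings quoted above; `patch_live_label` is a hypothetical name.

```shell
# patch_live_label FILE
# Rewrite the LABEL= live-root stanza to CDLABEL= and fail if the result does
# not contain the patched form. The expression is an assumption derived from
# the strings quoted in this entry, not the project's actual CI command.
patch_live_label() {
    file=$1
    tmp=$file.patched
    sed 's/root=live:LABEL=/root=live:CDLABEL=/g' "$file" > "$tmp" || return 1
    # Write back through the original path so its permissions are preserved.
    cat "$tmp" > "$file" && rm -f "$tmp" || return 1

    # Assert: the old form is gone and the new form is present.
    if grep -q 'root=live:LABEL=' "$file" || ! grep -q 'root=live:CDLABEL=' "$file"; then
        echo "live.py patch did not land in $file" >&2
        return 1
    fi
}
```

Because the assertion also fails when the target string was never present, an upstream change to `live.py` would stop the build instead of silently producing an unpatched ISO.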
### Notes
- Bug also affects `livemedia-creator --make-iso --no-virt` and any
other consumer of `imgcreate.LiveImageCreator`. Worth filing
upstream once we have a clean repro recipe.
---
## [0.2.3] — 2026-05-01
**Commit:** `ef54a24`
### Added
- Manual `useradd admin` invocation in chroot `%post`. `livecd-creator`
does not run an installer phase, so the kickstart `user` directive
is silently ignored. Without this, the booted live system has no
admin account at all, and SDDM falls back to "no users" — login
impossible.
### Fixed
- **`/etc/os-release` was still pointing at stock Fedora.** Even with
the overlay tree successfully copied, `kde-theme-apply.sh` was
resolving `/etc/os-release.d/veilor` from the wrong path (the build
host's repo, not the overlay's installed location).
- Rewired the symlink chain cleanly: `/etc/os-release →
../usr/lib/os-release`, with the override file written to
`/usr/lib/os-release` directly during `%post`.
- Branding now reflects veilor-os in `/etc/os-release`,
`hostnamectl`, and the SDDM session menu.
### Notes
- The `user --name=admin` directive stays in the kickstart for
documentation and for any future `livemedia-creator`-based
installer ISO that *does* honour it.
---
## [0.2.2] — 2026-05-01
**Commit:** `3408841`
### Fixed
- **Overlay was partially copied — boot worked but veilor-power, KDE
theme, custom scripts were all missing.** Found via offline debugfs
inspection of the v0.2.1 rootfs: tuned profiles, sshd hardening,
sudoers entries, and systemd units were present, but
`/usr/share/veilor-os/{assets,scripts}` was empty.
- Root cause: `%post --nochroot` ran with `set -eu`. When the first
`cp` of a non-essential overlay file returned non-zero, the script
aborted, leaving the assets/scripts copy step un-executed. None of
the chroot `%post` scripts could then find what they needed and they
silently no-op'd.
### Changed
- `%post --nochroot` now uses `set +e` around `cp`/`mkdir` so a
partial-permissions error on one tree doesn't kill the whole copy.
- Added `/var/log/veilor-nochroot.log`: every action in
`%post --nochroot` is now traced with timestamps, so future debugging
starts with reading one log file.
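The bounded `set +e` pattern with a timestamped trace could be sketched like this. The tree names (`etc`, `usr`), the log line format, and both function names are assumptions for illustration.

```shell
# log_line LOG MESSAGE...
# Append a timestamped line to the trace log.
log_line() {
    log_file=$1
    shift
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" >> "$log_file"
}

# copy_overlay SRC DST LOG
# Copy each top-level overlay tree, recording the exit status of every copy.
# Runs in a subshell so the strict-mode toggling does not leak to the caller.
copy_overlay() (
    set -eu
    src=$1 dst=$2 log=$3
    for tree in etc usr; do
        set +e                 # a failed copy must not abort the loop
        mkdir -p "$dst/$tree" && cp -a "$src/$tree/." "$dst/$tree/" 2>>"$log"
        rc=$?
        set -e                 # back to strict mode for everything else
        log_line "$log" "copy $tree rc=$rc"
    done
)
```

A non-zero `rc` in the log marks exactly which tree failed, while the remaining trees are still copied.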
### Notes
- The looser error handling is intentional but bounded — only the
overlay copy uses `set +e`. Hardening scripts that follow run with
strict mode.
---
## [0.2.1] — 2026-05-01
**Commit:** `9c6136f`
### Fixed
- **ISO booted, but it was effectively bare Fedora KDE.** No
hardening, no theme, no `veilor-power`, no `/etc/os-release`
override. Confirmed by mounting v0.2.0 with debugfs:
`/etc/os-release` symlinked to `../usr/lib/os-release` (Fedora's
default), no `/usr/share/veilor-os`, no overlay files anywhere.
- Root cause: `%post --nochroot` hardcoded `/mnt/sysimage` as the
destination. `/mnt/sysimage` is the **livemedia-creator** install
root. We had switched the build pipeline to **livecd-creator**,
which exposes the destination as `$INSTALL_ROOT` — a different path
inside its tmpfs sandbox.
- Switched the copy target to `$INSTALL_ROOT`.
### Notes
- Partial overlay landed in v0.2.1 (tuned, sshd, sddm.conf) — but
`/usr/share/veilor-os/{assets,scripts}` was still missing because
`set -eu` aborted partway through the cp tree. That fix is in v0.2.2.
- Lesson learned: tooling-specific environment variables matter.
`$INSTALL_ROOT` is the portable answer; `/mnt/sysimage` is a
livemedia-creator-only convention.
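One way to make the copy target explicit is to resolve it once and fail loudly if nothing matches. This is a sketch under assumptions: the fallback to `/mnt/sysimage` and the helper name `resolve_install_root` are illustrative, not what the kickstart is stated to do.

```shell
# resolve_install_root
# Print the install root to copy the overlay into. Prefers $INSTALL_ROOT
# (set by livecd-creator) and falls back to /mnt/sysimage if that directory
# exists. The fallback is an assumption added for this sketch.
resolve_install_root() {
    if [ -n "${INSTALL_ROOT:-}" ] && [ -d "$INSTALL_ROOT" ]; then
        printf '%s\n' "$INSTALL_ROOT"
    elif [ -d /mnt/sysimage ]; then
        printf '%s\n' /mnt/sysimage
    else
        echo "no install root found" >&2
        return 1
    fi
}

# Example inside %post --nochroot:
# target=$(resolve_install_root) || exit 1
```

Failing when neither location exists would have turned the v0.2.0 silent no-op into a build error.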
---
## [0.2.0] — 2026-04-30
**Commit:** `7c4a94d` (tagged release)
### Added
- First green ISO. Reproducible build pipeline lands.
- GitHub Actions workflow `build-iso.yml` produces a UEFI+BIOS-bootable
live ISO from `kickstart/veilor-os.ks`.
- CI: kickstart syntax linting (`ksvalidator`) gate.
- Kickstart based on Fedora 43, KDE Plasma minimal, hardening
packages selected (`fail2ban`, `usbguard`, `tuned`, `audit`,
`firewalld`).
- Overlay tree authored: tuned profiles, sshd hardening, sysctl
drop-in, sudoers, udev rules, KDE theme assets, Fira Code font.
- 3-mode power profiles: `veilor-power save | mid | perf` with
AC/battery udev auto-switching.
### Notes — known limitations of v0.2.0
- **The overlay never actually applied to the installed system.**
The `%post --nochroot` copy step targeted `/mnt/sysimage`
(livemedia-creator's install root) but the build pipeline had moved
to livecd-creator, which uses `$INSTALL_ROOT`. Result: the ISO
*boots* and presents a working KDE Plasma desktop, but it is in
practice **stock Fedora 43 KDE** with no veilor-os hardening,
branding, theme, or power scripts applied.
- v0.2.0 is best understood as a **build-pipeline milestone** — the
ISO format, EFI/BIOS bootability, partitioning, and squashfs build
all work end-to-end. The userspace customisation layer was wired
but not delivering. Treat v0.2.0 as proof-of-build, not as a
feature-complete release.
- See **v0.2.5** for the first feature-complete ISO that actually
ships veilor-os hardening and branding into the running system.
### Build pipeline path to green
For posterity, the issues resolved between v0.1 (scaffold) and v0.2.0
(first green ISO):
- pcre2 / selinux-policy version skew on stock Fedora 43 base —
worked around with a pinned `fix-repo` for the local build only;
CI uses `dnf upgrade --refresh` to sidestep entirely.
- KDE Plasma hard-deps (cups, geoclue2, ModemManager, PackageKit) —
kept at the package level, masked at the daemon level.
- `%post --nochroot` source path — multi-path detection added so the
overlay can be sourced from `/work` (CI) or `/run/install/repo`
(virt) or kickstart-relative (no-virt).
- `livemedia-creator --make-iso --no-virt` produced a squashfs but
no EFI/BOOT image. Switched to `livecd-creator` (`livecd-tools`)
which is purpose-built for live ISOs and handles EFI grafting.
- Tmpdir on `/tmp` exhausted the GitHub Actions tmpfs cap (16 GB
vs ~30 GB working set). Moved to `/var/lmc` on the runner's host
ext4.
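The multi-path source detection mentioned in the list above might be sketched as a first-match search. The helper name `find_overlay_source` and the `KS_DIR` variable are hypothetical; the candidate paths in the commented usage are the ones named in the list.

```shell
# find_overlay_source CANDIDATE...
# Print the first candidate directory that contains an overlay/ subdirectory,
# or fail with a message listing everything that was tried.
find_overlay_source() {
    for base in "$@"; do
        if [ -d "$base/overlay" ]; then
            printf '%s\n' "$base/overlay"
            return 0
        fi
    done
    echo "overlay source not found in: $*" >&2
    return 1
}

# Candidate order from the text: CI checkout, virt install repo, then a path
# relative to the kickstart (KS_DIR is a hypothetical variable for that case).
# find_overlay_source /work /run/install/repo "${KS_DIR:-.}/.."
```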
---
## [0.1.0] — 2026-04-29
**Commit:** `1822005`
### Added
- Initial repo scaffold: `kickstart/`, `build/`, `overlay/`, `scripts/`,
`assets/`, `docs/`, `test/`.
- Kickstart skeleton (Fedora 43 KDE base, single-prompt LUKS install,
hardened bootloader cmdline, locked root, blank-password admin with
`chage -d 0` to force first-boot reset).
- Hardening scripts ported and rebranded from operator's reference
system: base hardening, kernel hardening, custom SELinux policy
module `veilor-systemd`.
- KDE theme: BreezeBlackPure base + grey accent (`#686B6F`).
- Fira Code chosen as system font (Fedora `fira-code-fonts`,
SIL OFL 1.1).
- Test harness: VM runner (`test/run-vm.sh`) with QEMU + OVMF for
fast iteration, with `SECBOOT=1` and `FRESH=1` modes.
- Documentation: `BUILD.md`, `INSTALL.md`, `HARDENING.md`,
`POWER.md`, `boot-checklist.md`.
### Notes
- v0.1 was scaffold-only — no green ISO yet. Build pipeline iterated
through ~22 distinct toolchain issues before producing v0.2.0.
- All `onyx` references stripped from shipped artifacts; comments
refer to "reference system" only.