legacy-arrflix/playbooks/subtitles/STOPGAP-SUBS.md
s8n 7eb5f346fd
subs: add source-priority tier ladder, accept original-release bitmap as tier 2
Original-release bitmap subs (PGS, VobSub, dvd_subtitle) are first-class,
not stop-gaps. They're the canonical studio render — bitmap encoding is
just a format choice, not a quality compromise. OCR'd or AI-rebuilt
sidecars introduce transcription error that the source doesn't have.

STYLE.md changes:
- New "Source priority" section with 4 tiers: original text > original
  bitmap > trusted text rips > WhisperX rebuild.
- "What lands on disk" loosened: at least one English stream (embedded
  OR sidecar), keep embedded codec as-is, sidecar still .srt.
- New "OCR bitmap -> text" section documenting pgsrip recipe as an
  optional UX-nicety augmentation, not a correctness fix.
- "Why these rules" now explains why original > pretty (esp. for older
  shows like Futurama S1-3 / early Archer where the master is the only
  authoritative source and upscale artifacts already dominate).

STOPGAP-SUBS.md: header note clarifying bitmap-from-disc is NOT a
stop-gap; lists Lilo & Stitch (2002) and Archer (2009) S02 as examples
of correct-as-shipped library entries.
2026-05-10 21:22:32 +01:00


# Stop-gap subs — pending Whisper cross-ref
Shows whose current subtitles ship from a path that explicitly violates
[`STYLE.md`](STYLE.md). Quality is "acceptable, not great": by the owner's
own estimate, about 85 % of the way there. When v4 WhisperX (ROADMAP H5)
lands on the friend RTX 4080 node, **regenerate every show on this list**
with proper-noun-prompted transcription and replace the sidecars in place.
Keep this file as the v4 worklist.

**NOT a stop-gap** (do NOT log here): embedded original-release bitmap
subs (PGS, VobSub, `dvd_subtitle`). Per [`STYLE.md`](STYLE.md) tier 2,
those are first-class — they're the original studio render and ship
as-is. Examples currently in library that are correct, not stop-gap:
- Lilo & Stitch (2002) — 2× embedded English PGS
- Archer (2009) S02 — 3× embedded DVD-bitmap (eng/spa/fre)

Optional `pgsrip` OCR sidecar for those is a UX nicety, not a
correctness fix — see STYLE.md "OCR bitmap → text".
## Active stop-gaps
| Show | Eps subbed | Source path | Why stop-gap | Owner verdict | Logged |
|---|---|---|---|---|---|
| Sassy the Sasquatch (2022) | S01 5/5 | v3.5 YouTube auto-CC | lowercase, no punctuation, names mangled (`Sassy → sasha`), profanity = `[ __ ]` | "85 % the way there, acceptable, fine" — keep until v4 | 2026-05-10 |
## When more Big Lez universe shows ship via v3.5
Same channel hosts these — when subbed via the v3.5 yt-dlp path, append
to the table above:
- The Donny & Clarence Show (2024)
- The Big Lez Saga (2022)
- The Mike Nolan Show (2016) — but **try the YT "COMPLETE SEASON | SUBTITLES"
upload first** for hand-typed CCs before falling back to auto-CC
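
The "hand-typed first, auto-CC as fallback" order above can be sketched as follows. This is not the actual v3.5 script, which is not shown here: the function names are hypothetical, and `out_dir` is assumed to be an empty scratch directory so that any `.srt` found in it came from the current attempt.

```python
import subprocess
from pathlib import Path


def ytdlp_sub_args(url: str, out_dir: str, auto: bool) -> list[str]:
    """Build a yt-dlp argv that fetches English subtitles only (no video)."""
    return [
        "yt-dlp",
        "--skip-download",
        "--write-auto-subs" if auto else "--write-subs",
        "--sub-langs", "en",
        "--convert-subs", "srt",
        "-P", out_dir,
        url,
    ]


def fetch_subs(url: str, out_dir: str, run=subprocess.run) -> str:
    """Try uploader-provided CCs first, then auto-CC. Return which kind landed."""
    for kind, auto in (("manual", False), ("auto", True)):
        run(ytdlp_sub_args(url, out_dir, auto), check=False)
        # Assumes out_dir started empty, so any .srt is from this attempt.
        if any(Path(out_dir).glob("*.srt")):
            return kind
    return "none"
```

A return value of `"auto"` means the episode came from the auto-CC path and the show should be appended to the table.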
## v4 WhisperX rebuild plan
When the friend node (`100.64.0.3`, per memory `project_friend_gpu.md`) is
back online:
1. Install WhisperX on the node (CUDA 12 + cuDNN 9 + faster-whisper +
pyannote VAD).
2. For each show in the table above, write
`playbooks/subtitles/prompts/<show>.yaml` with the recurring proper
nouns the YT auto-CC mangled.
3. Run `lib/sub-whisperx-fetch.py` (TBD, ROADMAP H5) per show. Each
episode: pull mkv → ffmpeg extract 16k mono wav → WhisperX large-v3
with `--initial_prompt` from the yaml → SRT → SSH push to nullstone
with library filename, **overwriting the v3.5 sidecar in place**.
4. Tick off the row from the table; move it to a "Cleared via v4" archive
section below this one (kept as record).
5. Run a library scan, then verify Jellyfin still reports exactly 1
external eng sub stream per ep (no dupes from v3.5 + v4 stacking).
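
A rough sketch of the per-episode commands in step 3 and an on-disk pre-check for step 5. `lib/sub-whisperx-fetch.py` is still TBD, so everything here is an assumption: the function names are hypothetical, the proper nouns are passed as a plain list because the prompt yaml schema is not defined yet, and the sidecar check assumes the `<episode>.<lang>.srt` naming convention. Verify the WhisperX flags against the installed version before relying on them.

```python
from pathlib import Path


def ffmpeg_wav_args(mkv: str, wav: str) -> list[str]:
    """Build an ffmpeg argv that extracts audio as 16 kHz mono PCM WAV."""
    return ["ffmpeg", "-y", "-i", mkv, "-vn",
            "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", wav]


def whisperx_args(wav: str, out_dir: str, proper_nouns: list[str]) -> list[str]:
    """Build a WhisperX argv that writes an SRT, primed with the show's names."""
    return ["whisperx", wav,
            "--model", "large-v3",
            "--language", "en",
            "--output_format", "srt",
            "--output_dir", out_dir,
            "--initial_prompt", ", ".join(proper_nouns)]


def english_sidecars(episode: Path) -> list[Path]:
    """List English .srt sidecars sitting next to an episode file."""
    prefix = episode.stem + "."
    found = []
    for path in episode.parent.glob("*.srt"):
        if not path.name.startswith(prefix):
            continue
        # Parts between the episode stem and ".srt", e.g. ["en"] or ["eng", "sdh"].
        parts = path.name[len(prefix):].split(".")[:-1]
        if {"en", "eng"} & {part.lower() for part in parts}:
            found.append(path)
    return sorted(found)
```

If `english_sidecars` returns more than one path for an episode, the v3.5 sidecar was probably not overwritten in place, and Jellyfin will likely list two external English streams.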
## Cleared via v4 (archive)
(empty — populate as v4 rebuilds land)