legacy-arrflix/playbooks/subtitles/STOPGAP-SUBS.md
s8n 7eb5f346fd
subs: add source-priority tier ladder, accept original-release bitmap as tier 2
Original-release bitmap subs (PGS, VobSub, dvd_subtitle) are first-class,
not stop-gaps. They're the canonical studio render — bitmap encoding is
just a format choice, not a quality compromise. OCR'd or AI-rebuilt
sidecars introduce transcription error that the source doesn't have.

STYLE.md changes:
- New "Source priority" section with 4 tiers: original text > original
  bitmap > trusted text rips > WhisperX rebuild.
- "What lands on disk" loosened: at least one English stream (embedded
  OR sidecar), keep embedded codec as-is, sidecar still .srt.
- New "OCR bitmap -> text" section documenting pgsrip recipe as an
  optional UX-nicety augmentation, not a correctness fix.
- "Why these rules" now explains why original > pretty (esp. for older
  shows like Futurama S1-3 / early Archer where the master is the only
  authoritative source and upscale artifacts already dominate).

STOPGAP-SUBS.md: header note clarifying bitmap-from-disc is NOT a
stop-gap; lists Lilo & Stitch (2002) and Archer (2009) S02 as examples
of correct-as-shipped library entries.
2026-05-10 21:22:32 +01:00


Stop-gap subs — pending Whisper cross-ref

Shows whose current subtitles ship from a path that explicitly violates STYLE.md. Quality is "acceptable, not great" (~85 %). When v4 WhisperX (ROADMAP H5) lands on the friend RTX 4080 node, regenerate the subtitles for every show on this list with proper-noun-prompted transcription and replace the sidecars in place. Keep this file as the v4 worklist.

NOT a stop-gap (do NOT log here): embedded original-release bitmap subs (PGS, VobSub, dvd_subtitle). Per STYLE.md tier 2, those are first-class — they're the original studio render and ship as-is. Examples currently in library that are correct, not stop-gap:

  • Lilo & Stitch (2002) — 2× embedded English PGS
  • Archer (2009) S02 — 3× embedded DVD-bitmap (eng/spa/fre)

Optional pgsrip OCR sidecar for those is a UX nicety, not a correctness fix — see STYLE.md "OCR bitmap → text".
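To check whether a file already carries an embedded English bitmap stream before logging it here, something like the following could work. It is a minimal sketch, not a script from this repo: the function names are hypothetical, and it assumes `ffprobe` is on PATH and that streams carry a `language` tag. It can only tell that a stream is bitmap and English; whether it is the original-release render still has to be judged by hand.

```python
import json
import subprocess

# Codec names as ffprobe reports them for bitmap subtitle streams.
# VobSub tracks show up as dvd_subtitle.
BITMAP_CODECS = {"hdmv_pgs_subtitle", "dvd_subtitle"}


def probe_subtitle_streams(path):
    """Return ffprobe's list of subtitle streams for one media file."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "s",
         "-show_entries", "stream=index,codec_name:stream_tags=language",
         "-of", "json", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out).get("streams", [])


def embedded_english_bitmap(streams):
    """Filter a probed stream list down to English bitmap streams."""
    return [
        s for s in streams
        if s.get("codec_name") in BITMAP_CODECS
        and s.get("tags", {}).get("language") == "eng"
    ]
```

A non-empty result from `embedded_english_bitmap(probe_subtitle_streams(path))` suggests the file is a tier 2 candidate rather than a stop-gap.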

Active stop-gaps

| Show | Eps subbed | Source path | Why stop-gap | Owner verdict | Logged |
|---|---|---|---|---|---|
| Sassy the Sasquatch (2022) S01 | 5/5 | v3.5 YouTube auto-CC | lowercase, no punctuation, names mangled (Sassy → sasha), profanity = [ __ ] | "85 % the way there, acceptable, fine"; keep until v4 | 2026-05-10 |
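The auto-CC hallmarks listed under "Why stop-gap" (all lowercase, no punctuation, `[ __ ]` censor marks) can be detected mechanically in a sidecar. The sketch below is a rough heuristic with hypothetical names, not an existing tool in this repo. It cannot detect mangled names, and it drops any cue line that consists only of digits because it cannot tell those apart from SRT cue numbers.

```python
import re

CENSOR_MARK = re.compile(r"\[\s*__\s*\]")  # YouTube's profanity placeholder


def cue_text_lines(srt_text):
    """Return the spoken-text lines of an SRT, skipping cue numbers and timings."""
    lines = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or "-->" in line:
            continue
        lines.append(line)
    return lines


def auto_cc_hallmarks(srt_text):
    """Report which auto-CC hallmarks appear in the given SRT text."""
    text = " ".join(cue_text_lines(srt_text))
    letters = [c for c in text if c.isalpha()]
    return {
        "all_lowercase": bool(letters) and not any(c.isupper() for c in letters),
        "no_punctuation": bool(text) and not re.search(r"[.,!?]", text),
        "censor_marks": len(CENSOR_MARK.findall(text)),
    }
```

A sidecar that trips all three checks is a likely candidate for a row in the table; the owner verdict still decides.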

When more Big Lez universe shows ship via v3.5

The same channel hosts these shows. When one is subbed via the v3.5 yt-dlp path, append it to the table above:

  • The Donny & Clarence Show (2024)
  • The Big Lez Saga (2022)
  • The Mike Nolan Show (2016) — but try the YT "COMPLETE SEASON | SUBTITLES" upload first for hand-typed CCs before falling back to auto-CC
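The "hand-typed CCs first, auto-CC as fallback" rule can be expressed against yt-dlp's info dict, which lists uploader-provided tracks under `subtitles` and machine-generated ones under `automatic_captions`. This is a sketch with hypothetical function names, not the v3.5 script itself. It assumes English tracks are keyed `en` or `en-…`, and a track under `subtitles` is only presumed to be hand-typed.

```python
def pick_caption_source(info, lang_prefix="en"):
    """Return (kind, lang_key), preferring manual tracks over auto-CC.

    kind is "manual", "auto", or None when no matching track exists.
    """
    for kind, field in (("manual", "subtitles"), ("auto", "automatic_captions")):
        tracks = info.get(field) or {}
        for key in sorted(tracks):
            if key == lang_prefix or key.startswith(lang_prefix + "-"):
                return kind, key
    return None, None


def fetch_info(url):
    """Fetch video metadata without downloading media (needs the yt-dlp package)."""
    import yt_dlp  # third-party; imported lazily so the picker works without it

    with yt_dlp.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        return ydl.extract_info(url, download=False)
```

If `pick_caption_source(fetch_info(url))` returns `"auto"`, the resulting sidecar belongs in the stop-gap table.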

v4 WhisperX rebuild plan

When the friend node (100.64.0.3, per memory project_friend_gpu.md) is back online:

  1. Install WhisperX on the node (CUDA 12 + cuDNN 9 + faster-whisper + pyannote VAD).
  2. For each show in the table above, write playbooks/subtitles/prompts/<show>.yaml with the recurring proper nouns the YT auto-CC mangled.
  3. Run lib/sub-whisperx-fetch.py (TBD, ROADMAP H5) per show. Each episode: pull mkv → ffmpeg extract 16k mono wav → WhisperX large-v3 with --initial_prompt from the yaml → SRT → SSH push to nullstone with library filename, overwriting the v3.5 sidecar in place.
  4. Tick off the row from the table; move it to a "Cleared via v4" archive section below this one (kept as record).
  5. Library scan; verify Jellyfin still reports 1 external eng sub stream per ep (no dupes from v3.5 + v4 stacking).
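Until lib/sub-whisperx-fetch.py exists, the per-episode flow in step 3 might look roughly like the sketch below. Everything here is an assumption rather than the planned implementation: the function names are hypothetical, the mkv is taken to be already on the node, the proper nouns are passed in as a list because the prompt YAML layout is not yet defined, and the remote sidecar path is supplied by the caller. It relies on the `whisperx` CLI writing `<audio basename>.srt` into `--output_dir`.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path


def ffmpeg_extract_cmd(mkv, wav):
    """Build the ffmpeg call: 16 kHz mono 16-bit PCM WAV, video dropped."""
    return ["ffmpeg", "-y", "-i", str(mkv), "-vn",
            "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", str(wav)]


def whisperx_cmd(wav, out_dir, proper_nouns):
    """Build the WhisperX call, seeding the decoder with the show's proper nouns."""
    return ["whisperx", str(wav), "--model", "large-v3", "--language", "en",
            "--output_format", "srt", "--output_dir", str(out_dir),
            "--initial_prompt", ", ".join(proper_nouns)]


def push_cmd(host, remote_srt):
    """Build an ssh call that writes stdin to remote_srt, replacing any old file."""
    return ["ssh", host, "cat > " + shlex.quote(str(remote_srt))]


def rebuild_episode(mkv, proper_nouns, remote_srt, host="nullstone"):
    """Extract audio, transcribe, and overwrite the v3.5 sidecar in place."""
    with tempfile.TemporaryDirectory() as tmp:
        wav = Path(tmp) / (Path(mkv).stem + ".wav")
        subprocess.run(ffmpeg_extract_cmd(mkv, wav), check=True)
        subprocess.run(whisperx_cmd(wav, tmp, proper_nouns), check=True)
        srt = wav.with_suffix(".srt")  # WhisperX names output after the audio file
        with open(srt, "rb") as fh:
            subprocess.run(push_cmd(host, remote_srt), stdin=fh, check=True)
```

Writing to the exact path of the existing sidecar replaces it rather than adding a second file, which is what keeps step 5 at one external eng stream per episode.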

Cleared via v4 (archive)

(empty — populate as v4 rebuilds land)