Original-release bitmap subs (PGS, VobSub, dvd_subtitle) are first-class, not stop-gaps. They're the canonical studio render — bitmap encoding is just a format choice, not a quality compromise. OCR'd or AI-rebuilt sidecars introduce transcription error that the source doesn't have.

STYLE.md changes:
- New "Source priority" section with 4 tiers: original text > original bitmap > trusted text rips > WhisperX rebuild.
- "What lands on disk" loosened: at least one English stream (embedded OR sidecar), keep embedded codec as-is, sidecar still .srt.
- New "OCR bitmap -> text" section documenting pgsrip recipe as an optional UX-nicety augmentation, not a correctness fix.
- "Why these rules" now explains why original > pretty (esp. for older shows like Futurama S1-3 / early Archer where the master is the only authoritative source and upscale artifacts already dominate).

STOPGAP-SUBS.md: header note clarifying bitmap-from-disc is NOT a stop-gap; lists Lilo & Stitch (2002) and Archer (2009) S02 as examples of correct-as-shipped library entries.
# Stop-gap subs — pending Whisper cross-ref

Shows whose current subtitles ship from a path that explicitly violates
[`STYLE.md`](STYLE.md). Quality is "acceptable, not great" (~85 %). When
v4 WhisperX (ROADMAP H5) lands on the friend RTX 4080 node, **regenerate
every show on this list** with proper-noun-prompted transcription and
replace the sidecars in place. Keep this file as the v4 worklist.

**NOT a stop-gap** (do NOT log here): embedded original-release bitmap
subs (PGS, VobSub, `dvd_subtitle`). Per [`STYLE.md`](STYLE.md) tier 2,
those are first-class — they're the original studio render and ship
as-is. Examples currently in library that are correct, not stop-gap:
- Lilo & Stitch (2002) — 2× embedded English PGS
- Archer (2009) S02 — 3× embedded DVD-bitmap (eng/spa/fre)

Optional `pgsrip` OCR sidecar for those is a UX nicety, not a
correctness fix — see STYLE.md "OCR bitmap → text".
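One way to check whether a file falls under the "NOT a stop-gap" rule is to look at the codec names `ffprobe` reports for its subtitle streams. The sketch below is an illustration, not part of the existing tooling: the helper names (`ffprobe_cmd`, `embedded_bitmap_langs`, `ships_as_is`) are hypothetical, and it assumes PGS shows up as `hdmv_pgs_subtitle` and VobSub as `dvd_subtitle`, which is how ffmpeg names them.

```python
import json

# Codec names ffprobe reports for the bitmap formats named above
# (PGS -> hdmv_pgs_subtitle, VobSub -> dvd_subtitle).
BITMAP_CODECS = {"hdmv_pgs_subtitle", "dvd_subtitle"}


def ffprobe_cmd(path):
    """Argument list asking ffprobe for subtitle streams only, as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "s",
        "-show_entries", "stream=index,codec_name:stream_tags=language",
        "-of", "json", path,
    ]


def embedded_bitmap_langs(ffprobe_json):
    """Languages of embedded bitmap subtitle streams, in stream order."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [
        s.get("tags", {}).get("language", "und")
        for s in streams
        if s.get("codec_name") in BITMAP_CODECS
    ]


def ships_as_is(ffprobe_json):
    """Hypothetical rule: an embedded English bitmap stream means 'do not log here'."""
    return "eng" in embedded_bitmap_langs(ffprobe_json)
```

Feeding the output of `subprocess.run(ffprobe_cmd(path), capture_output=True, text=True).stdout` into `ships_as_is` would give a quick yes/no per file; whether the real library tags English as `eng` everywhere is an assumption worth verifying first.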

## Active stop-gaps

| Show | Eps subbed | Source path | Why stop-gap | Owner verdict | Logged |
|---|---|---|---|---|---|
| Sassy the Sasquatch (2022) | S01 5/5 | v3.5 YouTube auto-CC | lowercase, no punctuation, names mangled (`Sassy → sasha`), profanity = `[ __ ]` | "85 % the way there, acceptable, fine" — keep until v4 | 2026-05-10 |
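The symptoms in the "Why stop-gap" column are mechanical enough to spot automatically. The following is a rough sketch, assuming the cue texts have already been pulled out of the `.srt`; the function name and symptom labels are made up for illustration. Mangled names are not covered, since detecting them needs a per-show list of proper nouns.

```python
import re

CENSOR = "[ __ ]"  # literal token from the "Why stop-gap" column


def auto_cc_symptoms(cue_texts):
    """Return the set of auto-CC symptoms found across a list of cue strings."""
    text = " ".join(cue_texts)
    letters = [c for c in text if c.isalpha()]
    symptoms = set()
    if CENSOR in text:
        symptoms.add("censored-profanity")
    # All letters lowercase: no sentence case, no capitalised names.
    if letters and not any(c.isupper() for c in letters):
        symptoms.add("all-lowercase")
    # No sentence punctuation anywhere (ignoring the censor token itself).
    if text.strip() and not re.search(r"[.,!?]", text.replace(CENSOR, "")):
        symptoms.add("no-punctuation")
    return symptoms
```

A sidecar that trips all three checks is a strong candidate for a new row in the table; a hand-typed CC track should normally trip none.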

## When more Big Lez universe shows ship via v3.5

Same channel hosts these — when subbed via the v3.5 yt-dlp path, append
to the table above:

- The Donny & Clarence Show (2024)
- The Big Lez Saga (2022)
- The Mike Nolan Show (2016) — but **try the YT "COMPLETE SEASON | SUBTITLES"
  upload first** for hand-typed CCs before falling back to auto-CC
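The "hand-typed first, auto-CC as fallback" order could be expressed as two yt-dlp invocations tried in sequence. This is a sketch under assumptions: the exact flags the v3.5 path uses are not recorded here, so the helper names and the choice of `--convert-subs srt` are illustrative. `--write-subs` requests uploader-provided subtitles, while `--write-auto-subs` requests YouTube's auto-generated ones.

```python
def subtitle_fetch_cmd(url, auto=False, lang="en"):
    """yt-dlp argument list that downloads subtitles only, converted to SRT."""
    kind = "--write-auto-subs" if auto else "--write-subs"
    return [
        "yt-dlp", "--skip-download",
        kind, "--sub-langs", lang,
        "--convert-subs", "srt",
        url,
    ]


def fetch_plan(url):
    """Commands in preference order: hand-typed CCs, then auto-CC."""
    return [subtitle_fetch_cmd(url, auto=False), subtitle_fetch_cmd(url, auto=True)]
```

The caller would run the first command, check whether an `.srt` appeared, and only then fall back to the second. A show that needed the fallback belongs in the table above.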

## v4 WhisperX rebuild plan

When the friend node (`100.64.0.3`, per memory `project_friend_gpu.md`) is
back online:

1. Install WhisperX on the node (CUDA 12 + cuDNN 9 + faster-whisper +
   pyannote VAD).
2. For each show in the table above, write
   `playbooks/subtitles/prompts/<show>.yaml` with the recurring proper
   nouns the YT auto-CC mangled.
3. Run `lib/sub-whisperx-fetch.py` (TBD, ROADMAP H5) per show. Each
   episode: pull mkv → ffmpeg extract 16k mono wav → WhisperX large-v3
   with `--initial_prompt` from the yaml → SRT → SSH push to nullstone
   with library filename, **overwriting the v3.5 sidecar in place**.
4. Tick off the row from the table; move it to a "Cleared via v4" archive
   section below this one (kept as record).
5. Library scan; verify Jellyfin still reports 1 external eng sub stream
   per ep (no dupes from v3.5 + v4 stacking).
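Since `lib/sub-whisperx-fetch.py` is still TBD, here is a minimal sketch of what steps 3 and 5 might look like as command builders plus a duplicate-sidecar check. Everything here is an assumption rather than the planned implementation: the function names are hypothetical, the WhisperX flags should be checked against the installed version's `--help`, and the sidecar check assumes files are named `<video stem>.<lang>.srt` next to the `.mkv`.

```python
from pathlib import PurePosixPath


def extract_wav_cmd(mkv, wav):
    """ffmpeg arguments: drop video, downmix to mono, resample to 16 kHz PCM."""
    return ["ffmpeg", "-y", "-i", mkv, "-vn",
            "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", wav]


def whisperx_cmd(wav, prompt, out_dir):
    """WhisperX CLI arguments for a prompted large-v3 run that writes SRT."""
    return ["whisperx", wav,
            "--model", "large-v3", "--language", "en",
            "--initial_prompt", prompt,
            "--output_format", "srt", "--output_dir", out_dir]


def stacked_sidecars(filenames):
    """Map video stem -> its .srt sidecars, for episodes that have more than one."""
    videos = [PurePosixPath(n).stem for n in filenames if n.lower().endswith(".mkv")]
    dupes = {}
    for stem in videos:
        subs = sorted(
            n for n in filenames
            if n.lower().endswith(".srt") and n.startswith(stem + ".")
        )
        if len(subs) > 1:
            dupes[stem] = subs
    return dupes
```

Running `stacked_sidecars` over a season directory listing before the library scan gives an early warning for the v3.5 + v4 stacking that step 5 guards against; an empty dict means every episode has at most one sidecar.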

## Cleared via v4 (archive)

(empty — populate as v4 rebuilds land)