subs: add source-priority tier ladder, accept original-release bitmap as tier 2

Original-release bitmap subs (PGS, VobSub, dvd_subtitle) are first-class,
not stop-gaps. They're the canonical studio render — bitmap encoding is
just a format choice, not a quality compromise. OCR'd or AI-rebuilt
sidecars introduce transcription error that the source doesn't have.

STYLE.md changes:
- New "Source priority" section with 4 tiers: original text > original
  bitmap > trusted text rips > WhisperX rebuild.
- "What lands on disk" loosened: at least one English stream (embedded
  OR sidecar), keep embedded codec as-is, sidecar still .srt.
- New "OCR bitmap -> text" section documenting pgsrip recipe as an
  optional UX-nicety augmentation, not a correctness fix.
- "Why these rules" now explains why original > pretty (esp. for older
  shows like Futurama S1-3 / early Archer where the master is the only
  authoritative source and upscale artifacts already dominate).

STOPGAP-SUBS.md: header note clarifying bitmap-from-disc is NOT a
stop-gap; lists Lilo & Stitch (2002) and Archer (2009) S02 as examples
of correct-as-shipped library entries.
s8n 2026-05-10 21:22:32 +01:00
parent 5b80cfd095
commit 7eb5f346fd
2 changed files with 81 additions and 10 deletions

STOPGAP-SUBS.md

@@ -6,6 +6,16 @@ v4 WhisperX (ROADMAP H5) lands on the friend RTX 4080 node, **regenerate
every show on this list** with proper-noun-prompted transcription and
replace the sidecars in place. Keep this file as the v4 worklist.
**NOT a stop-gap** (do NOT log here): embedded original-release bitmap
subs (PGS, VobSub, `dvd_subtitle`). Per [`STYLE.md`](STYLE.md) tier 2,
those are first-class — they're the original studio render and ship
as-is. Examples currently in library that are correct, not stop-gap:
- Lilo & Stitch (2002) — 2× embedded English PGS
- Archer (2009) S02 — 3× embedded DVD-bitmap (eng/spa/fre)
Optional `pgsrip` OCR sidecar for those is a UX nicety, not a
correctness fix — see STYLE.md "OCR bitmap → text".
## Active stop-gaps
| Show | Eps subbed | Source path | Why stop-gap | Owner verdict | Logged |

STYLE.md

@@ -3,17 +3,41 @@
The bar every fetch should hit. If a recipe step would violate any of these,
stop and ask before proceeding.
## Source priority (highest → lowest)
Accuracy beats format. Use this tier ladder before reaching for OCR/AI:
1. **Original release text subs** (`.srt`/`.ass` from disc/streamer rip,
embedded or sidecar). Ground truth — ship as-is.
2. **Original release bitmap subs** (PGS, VobSub, `dvd_subtitle`,
embedded). **Acceptable in their native form** — they ARE the original
words from the source master, just rendered as images. Jellyfin server
burns them in for clients that can't render natively. Optionally
OCR-extract a `.srt` sidecar alongside (do NOT replace the embedded
stream) when client-side styling, search, or mobile rendering matters.
3. **Trusted text rips** from OpenSubtitles (verified uploads, hash-match
or high-download-count + frame-rate-match).
4. **WhisperX rebuild** with `--initial_prompt` proper-nouns yaml — only
when no original exists (e.g. user-uploaded YT content with auto-CC).
Logged in [`STOPGAP-SUBS.md`](STOPGAP-SUBS.md) until cleared.
Tier 1 and 2 are first-class. Tier 3 is a fallback. Tier 4 is a stop-gap.
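The ladder above can be sketched as a small lookup. This is a minimal illustration, not part of the repo: the function name `sub_tier` and the origin labels (`original`, `opensubtitles`, `whisperx`) are assumptions. Codec names follow what `ffprobe` reports (`hdmv_pgs_subtitle` for PGS, `dvd_subtitle` for VobSub), with the short aliases used in this file accepted as well.

```bash
# Hypothetical helper: print the tier (1-4) for a subtitle source.
# $1 = origin label (assumed vocabulary): original | opensubtitles | whisperx
# $2 = codec name, only consulted when origin is "original"
sub_tier() {
  case "$1" in
    original)
      case "$2" in
        subrip|ass)                              echo 1 ;;  # original text
        hdmv_pgs_subtitle|pgs|dvd_subtitle|vobsub) echo 2 ;;  # original bitmap
        *) echo "unknown codec: $2" >&2; return 1 ;;
      esac ;;
    opensubtitles) echo 3 ;;  # trusted text rip (fallback)
    whisperx)      echo 4 ;;  # rebuild (stop-gap, log in STOPGAP-SUBS.md)
    *) echo "unknown origin: $1" >&2; return 1 ;;
  esac
}
```

A caller could log to STOPGAP-SUBS.md only when the printed tier is 4.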
## What lands on disk
- **Exactly one** English subtitle file per episode.
- Filename: `<videobasename>.eng.srt` — no language-region tags (`en-US`),
no flag stack on regular subs (no `.sdh`, no `.forced`, no `.cc` unless
there genuinely is no plain-English option).
- Format: `.srt` (SubRip text). Skip `.ass`, `.ssa`, `.vtt`, `.sup`, `.idx`
unless the source has nothing else; convert with `ffmpeg -map 0:s:0 -c:s srt`
in that case.
- Encoding: UTF-8. Re-encode with `iconv` if a sidecar comes back as cp1252
/ windows-1250.
- **At least one** English subtitle stream per episode (embedded OR
sidecar — not both required).
- Sidecar filename when used: `<videobasename>.eng.srt` — no
language-region tags (`en-US`), no flag stack on regular subs (no
`.sdh`, no `.forced`, no `.cc` unless there genuinely is no
plain-English option).
- Sidecar format: `.srt` (SubRip text). For embedded: keep the original
codec (`subrip`, `ass`, `pgs`, `vobsub`, `dvd_subtitle`) — do NOT
re-mux just to convert format. Convert only when extracting to disk:
text codecs via `ffmpeg -map 0:s:0 -c:s srt`, bitmap codecs via OCR
(see "OCR bitmap → text" below).
- Encoding (sidecar): UTF-8. Re-encode with `iconv` if a sidecar comes
back as cp1252 / windows-1250.
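A rough sketch of how the "at least one English stream" rule could be checked per episode. The helper names (`sidecar_path`, `has_english_stream`, `meets_disk_rule`) are hypothetical, and the sketch assumes `ffprobe` prints one `index,codec,language` line per subtitle stream with the CSV writer used here.

```bash
# Sidecar path for a video: strip the last extension, append .eng.srt.
sidecar_path() {
  printf '%s.eng.srt\n' "${1%.*}"
}

# Reads "index,codec,language" lines on stdin; succeeds if any line is tagged eng.
has_english_stream() {
  awk -F, '$3 == "eng" { found = 1 } END { exit !found }'
}

# Succeeds if the episode has an .eng.srt sidecar OR an embedded English stream.
# Usage: meets_disk_rule /path/to/episode.mkv
meets_disk_rule() {
  [ -f "$(sidecar_path "$1")" ] && return 0
  ffprobe -v error -select_streams s \
    -show_entries stream=index,codec_name:stream_tags=language \
    -of csv=p=0 "$1" | has_english_stream
}
```

Streams with no language tag are treated as non-English here, so an untagged English track would need its tag fixed before it counts.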
## What gets picked
@@ -40,12 +64,41 @@ After fetch, **eyeball-verify one sample episode per show** plays in sync
## What doesn't ship
- Multiple language tracks per episode (no German/French alternatives —
English-only library).
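English-only library). Non-English embedded streams stay on import. Hide
them via `mkvpropedit` track flags, or remux without them via `mkvmerge`
(`mkvpropedit` cannot remove tracks), only if user complains about client
picker clutter; do NOT silently strip them on import.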
- Director's commentary, behind-the-scenes, song-only subs.
- Subs that cover only a partial runtime (the partial-cover heuristic isn't
scripted yet; spot-check duration vs episode runtime if a srt looks short).
- "All-episodes-in-one" mega-packs treated as a single episode's sidecar.
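The partial-cover spot-check mentioned above is not scripted in the repo; one possible shape is sketched below. The function names and the 80 % default threshold are assumptions for illustration. The idea is to compare the end time of the last cue against the episode runtime.

```bash
# Print the end time, in whole seconds, of the last cue in an .srt file.
srt_last_end_seconds() {
  awk '/-->/ { last = $3 }                 # timing line: START --> END
       END {
         if (last == "") exit 1            # no cues found
         split(last, t, /[:,.]/)           # HH:MM:SS,mmm
         print t[1] * 3600 + t[2] * 60 + t[3]
       }' "$1"
}

# Succeeds when the last cue ends at or after pct% of the runtime.
# $1 = srt file, $2 = runtime in seconds, $3 = threshold percent (default 80, assumed)
covers_runtime() {
  last_end=$(srt_last_end_seconds "$1") || return 1
  awk -v e="$last_end" -v r="$2" -v p="${3:-80}" \
    'BEGIN { exit !(e * 100 >= r * p) }'
}
```

The runtime can be read with `ffprobe -v error -show_entries format=duration -of csv=p=0 episode.mkv`. Episodes with long silent endings may fall under the threshold legitimately, so a failure is a prompt to eyeball the file, not a verdict.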
## OCR bitmap → text (optional, tier-2 augmentation)
Embedded PGS/VobSub/`dvd_subtitle` are acceptable as-is (tier 2). OCR
becomes worthwhile when: (a) client repeatedly transcodes due to bitmap
burn-in (CPU pressure on nullstone — no GPU transcode available), (b)
user wants to restyle font/size on a specific show, (c) mobile client
renders bitmap subs poorly.
Recipe (`pgsrip`, batch-friendly, Tesseract-backed):
```bash
pip install pgsrip
# PGS: extract embedded to .sup
ffmpeg -i input.mkv -map 0:s:0 -c copy subs.sup
pgsrip --language eng subs.sup # -> subs.srt
# VobSub/dvd_subtitle: extract to .idx + .sub
mkvextract input.mkv tracks 2:subs.idx
pgsrip subs.idx # -> subs.srt
```
OCR accuracy ~90–95 % raw, ~95–98 % after Subtitle Edit cleanup. Source
words are correct (it's transcription of original render, not Whisper
hallucination) — only font recognition fights you. Resulting `.srt`
ships as sidecar **alongside** the embedded bitmap stream, not as a
replacement.
## How the UI presents subs
The detail-page subtitle dropdown is shimmed via
@@ -62,6 +115,14 @@ Revert: `bin/revert-sub-label-shim.sh`.
- Boutique-release-group quality bar from
[`README.md`](../../README.md): "every show and film is the best version
I could put together."
- **Original-release subs > pretty format.** The DVD/BD/streamer master
is the canonical script — bitmap or text, those are the words the
studio shipped. An OCR'd or AI-rebuilt sidecar is a derivative that
introduces error (font confusion, mistranscription); the original
doesn't. Especially true for older shows (Futurama S1–S3, Archer
early seasons) where the master is the only authoritative source and
upscale artifacts already dominate the visual experience — bitmap
subs match the source vibe.
- One-language library = one stream per ep = no need to expose codec or
source in UI.
- SDH/CC adds `[door slams]`, `[music]` etc. — distracting on first watch