legacy-arrflix/playbooks/subtitles/STYLE.md
s8n 7eb5f346fd
Some checks are pending
secret-scan / gitleaks (HEAD + history) (push) Waiting to run
secret-scan / detect-secrets (entropy + cross-tool) (push) Waiting to run
secret-scan / summary (push) Blocked by required conditions
subs: add source-priority tier ladder, accept original-release bitmap as tier 2
Original-release bitmap subs (PGS, VobSub, dvd_subtitle) are first-class,
not stop-gaps. They're the canonical studio render — bitmap encoding is
just a format choice, not a quality compromise. OCR'd or AI-rebuilt
sidecars introduce transcription error that the source doesn't have.

STYLE.md changes:
- New "Source priority" section with 4 tiers: original text > original
  bitmap > trusted text rips > WhisperX rebuild.
- "What lands on disk" loosened: at least one English stream (embedded
  OR sidecar), keep embedded codec as-is, sidecar still .srt.
- New "OCR bitmap -> text" section documenting pgsrip recipe as an
  optional UX-nicety augmentation, not a correctness fix.
- "Why these rules" now explains why original > pretty (esp. for older
  shows like Futurama S1-3 / early Archer where the master is the only
  authoritative source and upscale artifacts already dominate).

STOPGAP-SUBS.md: header note clarifying bitmap-from-disc is NOT a
stop-gap; lists Lilo & Stitch (2002) and Archer (2009) S02 as examples
of correct-as-shipped library entries.
2026-05-10 21:22:32 +01:00

134 lines
6.2 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Subtitle USER-G style — ARRFLIX
The bar every fetch should hit. If a recipe step would violate any of these,
stop and ask before proceeding.
## Source priority (highest → lowest)
Accuracy beats format. Use this tier ladder before reaching for OCR/AI:
1. **Original release text subs** (`.srt`/`.ass` from disc/streamer rip,
embedded or sidecar). Ground truth — ship as-is.
2. **Original release bitmap subs** (PGS, VobSub, `dvd_subtitle`,
embedded). **Acceptable in their native form** — they ARE the original
words from the source master, just rendered as images. Jellyfin server
burns them in for clients that can't render natively. Optionally
OCR-extract a `.srt` sidecar alongside (do NOT replace the embedded
stream) when client-side styling, search, or mobile rendering matters.
3. **Trusted text rips** from OpenSubtitles (verified uploads, hash-match
or high-download-count + frame-rate-match).
4. **WhisperX rebuild** with `--initial_prompt` proper-nouns yaml — only
when no original exists (e.g. user-uploaded YT content with auto-CC).
Logged in [`STOPGAP-SUBS.md`](STOPGAP-SUBS.md) until cleared.
Tier 1 and 2 are first-class. Tier 3 is a fallback. Tier 4 is a stop-gap.
## What lands on disk
- **At least one** English subtitle stream per episode (embedded OR
sidecar — not both required).
- Sidecar filename when used: `<videobasename>.eng.srt` — no
language-region tags (`en-US`), no flag stack on regular subs (no
`.sdh`, no `.forced`, no `.cc` unless there genuinely is no
plain-English option).
- Sidecar format: `.srt` (SubRip text). For embedded: keep the original
codec (`subrip`, `ass`, `pgs`, `vobsub`, `dvd_subtitle`) — do NOT
re-mux just to convert format. Convert only when extracting to disk:
text codecs via `ffmpeg -map 0:s:0 -c:s srt`, bitmap codecs via OCR
(see "OCR bitmap → text" below).
- Encoding (sidecar): UTF-8. Re-encode with `iconv` if a sidecar comes
back as cp1252 / windows-1250.
## What gets picked
In order:
1. **English** language — `eng` / `en`. Never auto-pick `en-US`/`en-GB`
variants over plain `en`; treat them equivalent for matching.
2. **No SDH / Hearing Impaired** — drop any sub flagged `hearing_impaired`,
`sdh`, `cc`. Only fall back to SDH if no plain-English option exists.
3. **No machine / AI translation** — drop `machine_translated`,
`ai_translated`. Hand-authored subs only.
4. **No forced subtitles** — drop `foreign_parts_only` / `Forced` unless the
episode has English audio with foreign-language scenes that need
translation (rare for US shows).
5. **Frame-rate match** — prefer entries whose declared fps matches the
source video (typically 23.976 for our masters). Treat `0.0` as unknown
and fall through to step 6.
6. **Highest download count** within the surviving candidates — proxies for
"the version everyone agreed was best."
After fetch, **eyeball-verify one sample episode per show** plays in sync
(±1 s on a known dialogue line) before declaring the show done.
## What doesn't ship
- Multiple language tracks per episode (no German/French alternatives —
English-only library). Drop non-English embedded streams via
`mkvpropedit` only if user complains about client picker clutter; do
NOT silently strip them on import.
- Director's commentary, behind-the-scenes, song-only subs.
- Subs that cover only a partial runtime (the partial-cover heuristic isn't
scripted yet; spot-check duration vs episode runtime if a srt looks short).
- "All-episodes-in-one" mega-packs treated as a single episode's sidecar.
## OCR bitmap → text (optional, tier-2 augmentation)
Embedded PGS/VobSub/`dvd_subtitle` are acceptable as-is (tier 2). OCR
becomes worthwhile when: (a) client repeatedly transcodes due to bitmap
burn-in (CPU pressure on nullstone — no GPU transcode available), (b)
user wants to restyle font/size on a specific show, (c) mobile client
renders bitmap subs poorly.
Recipe (`pgsrip`, batch-friendly, Tesseract-backed):
```bash
pip install pgsrip
# PGS: extract embedded to .sup
ffmpeg -i input.mkv -map 0:s:0 -c copy subs.sup
pgsrip --language eng subs.sup # -> subs.srt
# VobSub/dvd_subtitle: extract to .idx + .sub
mkvextract input.mkv tracks 2:subs.idx
pgsrip subs.idx # -> subs.srt
```
OCR accuracy ~9095 % raw, ~9598 % after Subtitle Edit cleanup. Source
words are correct (it's transcription of original render, not Whisper
hallucination) — only font recognition fights you. Resulting `.srt`
ships as sidecar **alongside** the embedded bitmap stream, not as a
replacement.
## How the UI presents subs
The detail-page subtitle dropdown is shimmed via
`web-overrides/index.html` (markers `SUB-LABEL-SHIM-BEGIN/END`). Stock
Jellyfin shows e.g. `English - SUBRIP - External - Default`; the shim
collapses to `English`, with `(Forced)` / `(SDH)` / `(Hearing Impaired)`
suffixes only when those flags actually apply. `Default` is dropped — it's
redundant when there's only one stream per language.
Revert: `bin/revert-sub-label-shim.sh`.
## Why these rules
- Boutique-release-group quality bar from
[`README.md`](../../README.md): "every show and film is the best version
I could put together."
- **Original-release subs > pretty format.** The DVD/BD/streamer master
is the canonical script — bitmap or text, those are the words the
studio shipped. An OCR'd or AI-rebuilt sidecar is a derivative that
introduces error (font confusion, mistranscription); the original
doesn't. Especially true for older shows (Futurama S1S3, Archer
early seasons) where the master is the only authoritative source and
upscale artifacts already dominate the visual experience — bitmap
subs match the source vibe.
- One-language library = one stream per ep = no need to expose codec or
source in UI.
- SDH/CC adds `[door slams]`, `[music]` etc. — distracting on first watch
and not what someone reaches for unless they specifically need it.
- Machine / AI translations are inconsistent and often wrong on slang or
show-specific terms (esp. animated comedies).
- Frame-rate-matched subs sync without manual offset on first try; mismatch
generally still works on NTSC (29.97 vs 23.976 are the same elapsed time)
but hash-match or fps-match removes that gamble.