legacy-arrflix/playbooks/subtitles/STYLE.md
s8n 7eb5f346fd
Some checks are pending
secret-scan / gitleaks (HEAD + history) (push) Waiting to run
secret-scan / detect-secrets (entropy + cross-tool) (push) Waiting to run
secret-scan / summary (push) Blocked by required conditions
subs: add source-priority tier ladder, accept original-release bitmap as tier 2
Original-release bitmap subs (PGS, VobSub, dvd_subtitle) are first-class,
not stop-gaps. They're the canonical studio render — bitmap encoding is
just a format choice, not a quality compromise. OCR'd or AI-rebuilt
sidecars introduce transcription error that the source doesn't have.

STYLE.md changes:
- New "Source priority" section with 4 tiers: original text > original
  bitmap > trusted text rips > WhisperX rebuild.
- "What lands on disk" loosened: at least one English stream (embedded
  OR sidecar), keep embedded codec as-is, sidecar still .srt.
- New "OCR bitmap -> text" section documenting pgsrip recipe as an
  optional UX-nicety augmentation, not a correctness fix.
- "Why these rules" now explains why original > pretty (esp. for older
  shows like Futurama S1-3 / early Archer where the master is the only
  authoritative source and upscale artifacts already dominate).

STOPGAP-SUBS.md: header note clarifying bitmap-from-disc is NOT a
stop-gap; lists Lilo & Stitch (2002) and Archer (2009) S02 as examples
of correct-as-shipped library entries.
2026-05-10 21:22:32 +01:00

6.2 KiB
Raw Blame History

Subtitle USER-G style — ARRFLIX

The bar every fetch should hit. If a recipe step would violate any of these, stop and ask before proceeding.

Source priority (highest → lowest)

Accuracy beats format. Use this tier ladder before reaching for OCR/AI:

  1. Original release text subs (.srt/.ass from disc/streamer rip, embedded or sidecar). Ground truth — ship as-is.
  2. Original release bitmap subs (PGS, VobSub, dvd_subtitle, embedded). Acceptable in their native form — they ARE the original words from the source master, just rendered as images. Jellyfin server burns them in for clients that can't render natively. Optionally OCR-extract a .srt sidecar alongside (do NOT replace the embedded stream) when client-side styling, search, or mobile rendering matters.
  3. Trusted text rips from OpenSubtitles (verified uploads, hash-match or high-download-count + frame-rate-match).
  4. WhisperX rebuild with --initial_prompt proper-nouns yaml — only when no original exists (e.g. user-uploaded YT content with auto-CC). Logged in STOPGAP-SUBS.md until cleared.

Tier 1 and 2 are first-class. Tier 3 is a fallback. Tier 4 is a stop-gap.

What lands on disk

  • At least one English subtitle stream per episode (embedded OR sidecar — not both required).
  • Sidecar filename when used: <videobasename>.eng.srt — no language-region tags (en-US), no flag stack on regular subs (no .sdh, no .forced, no .cc unless there genuinely is no plain-English option).
  • Sidecar format: .srt (SubRip text). For embedded: keep the original codec (subrip, ass, pgs, vobsub, dvd_subtitle) — do NOT re-mux just to convert format. Convert only when extracting to disk: text codecs via ffmpeg -map 0:s:0 -c:s srt, bitmap codecs via OCR (see "OCR bitmap → text" below).
  • Encoding (sidecar): UTF-8. Re-encode with iconv if a sidecar comes back as cp1252 / windows-1250.

What gets picked

In order:

  1. English language — eng / en. Never auto-pick en-US/en-GB variants over plain en; treat them equivalent for matching.
  2. No SDH / Hearing Impaired — drop any sub flagged hearing_impaired, sdh, cc. Only fall back to SDH if no plain-English option exists.
  3. No machine / AI translation — drop machine_translated, ai_translated. Hand-authored subs only.
  4. No forced subtitles — drop foreign_parts_only / Forced unless the episode has English audio with foreign-language scenes that need translation (rare for US shows).
  5. Frame-rate match — prefer entries whose declared fps matches the source video (typically 23.976 for our masters). Treat 0.0 as unknown and fall through to step 6.
  6. Highest download count within the surviving candidates — proxies for "the version everyone agreed was best."

After fetch, eyeball-verify one sample episode per show plays in sync (±1 s on a known dialogue line) before declaring the show done.

What doesn't ship

  • Multiple language tracks per episode (no German/French alternatives — English-only library). Drop non-English embedded streams via mkvpropedit only if user complains about client picker clutter; do NOT silently strip them on import.
  • Director's commentary, behind-the-scenes, song-only subs.
  • Subs that cover only a partial runtime (the partial-cover heuristic isn't scripted yet; spot-check duration vs episode runtime if a srt looks short).
  • "All-episodes-in-one" mega-packs treated as a single episode's sidecar.

OCR bitmap → text (optional, tier-2 augmentation)

Embedded PGS/VobSub/dvd_subtitle are acceptable as-is (tier 2). OCR becomes worthwhile when: (a) client repeatedly transcodes due to bitmap burn-in (CPU pressure on nullstone — no GPU transcode available), (b) user wants to restyle font/size on a specific show, (c) mobile client renders bitmap subs poorly.

Recipe (pgsrip, batch-friendly, Tesseract-backed):

pip install pgsrip
# PGS: extract embedded to .sup
ffmpeg -i input.mkv -map 0:s:0 -c copy subs.sup
pgsrip --language eng subs.sup        # -> subs.srt

# VobSub/dvd_subtitle: extract to .idx + .sub
mkvextract input.mkv tracks 2:subs.idx
pgsrip subs.idx                       # -> subs.srt

OCR accuracy ~9095 % raw, ~9598 % after Subtitle Edit cleanup. Source words are correct (it's transcription of original render, not Whisper hallucination) — only font recognition fights you. Resulting .srt ships as sidecar alongside the embedded bitmap stream, not as a replacement.

How the UI presents subs

The detail-page subtitle dropdown is shimmed via web-overrides/index.html (markers SUB-LABEL-SHIM-BEGIN/END). Stock Jellyfin shows e.g. English - SUBRIP - External - Default; the shim collapses to English, with (Forced) / (SDH) / (Hearing Impaired) suffixes only when those flags actually apply. Default is dropped — it's redundant when there's only one stream per language.

Revert: bin/revert-sub-label-shim.sh.

Why these rules

  • Boutique-release-group quality bar from README.md: "every show and film is the best version I could put together."
  • Original-release subs > pretty format. The DVD/BD/streamer master is the canonical script — bitmap or text, those are the words the studio shipped. An OCR'd or AI-rebuilt sidecar is a derivative that introduces error (font confusion, mistranscription); the original doesn't. Especially true for older shows (Futurama S1S3, Archer early seasons) where the master is the only authoritative source and upscale artifacts already dominate the visual experience — bitmap subs match the source vibe.
  • One-language library = one stream per ep = no need to expose codec or source in UI.
  • SDH/CC adds [door slams], [music] etc. — distracting on first watch and not what someone reaches for unless they specifically need it.
  • Machine / AI translations are inconsistent and often wrong on slang or show-specific terms (esp. animated comedies).
  • Frame-rate-matched subs sync without manual offset on first try; mismatch generally still works on NTSC (29.97 vs 23.976 are the same elapsed time) but hash-match or fps-match removes that gamble.