Original-release bitmap subs (PGS, VobSub, dvd_subtitle) are first-class, not stop-gaps. They're the canonical studio render; bitmap encoding is just a format choice, not a quality compromise. OCR'd or AI-rebuilt sidecars introduce transcription error that the source doesn't have.

STYLE.md changes:
- New "Source priority" section with 4 tiers: original text > original bitmap > trusted text rips > WhisperX rebuild.
- "What lands on disk" loosened: at least one English stream (embedded OR sidecar), keep embedded codec as-is, sidecar still .srt.
- New "OCR bitmap -> text" section documenting pgsrip recipe as an optional UX-nicety augmentation, not a correctness fix.
- "Why these rules" now explains why original > pretty (esp. for older shows like Futurama S1-3 / early Archer where the master is the only authoritative source and upscale artifacts already dominate).

STOPGAP-SUBS.md: header note clarifying bitmap-from-disc is NOT a stop-gap; lists Lilo & Stitch (2002) and Archer (2009) S02 as examples of correct-as-shipped library entries.
134 lines
6.2 KiB
Markdown
# Subtitle USER-G style — ARRFLIX

The bar every fetch should hit. If a recipe step would violate any of these,
stop and ask before proceeding.

## Source priority (highest → lowest)

Accuracy beats format. Use this tier ladder before reaching for OCR/AI:

1. **Original release text subs** (`.srt`/`.ass` from disc/streamer rip,
   embedded or sidecar). Ground truth; ship as-is.
2. **Original release bitmap subs** (PGS, VobSub, `dvd_subtitle`,
   embedded). **Acceptable in their native form**: they ARE the original
   words from the source master, just rendered as images. The Jellyfin
   server burns them in for clients that can't render them natively.
   Optionally OCR-extract a `.srt` sidecar alongside (do NOT replace the
   embedded stream) when client-side styling, search, or mobile rendering
   matters.
3. **Trusted text rips** from OpenSubtitles (verified uploads, hash-match
   or high-download-count + frame-rate-match).
4. **WhisperX rebuild** with an `--initial_prompt` proper-nouns YAML, only
   when no original exists (e.g. user-uploaded YT content with auto-CC).
   Logged in [`STOPGAP-SUBS.md`](STOPGAP-SUBS.md) until cleared.

Tiers 1 and 2 are first-class. Tier 3 is a fallback. Tier 4 is a stop-gap.
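
To decide whether an embedded original-release stream is a tier-1 (text) or tier-2 (bitmap) candidate, it can help to classify the codec that `ffprobe` reports. The sketch below is illustrative only: the helper names `sub_kind` and `list_subs` are hypothetical, and it assumes `ffprobe`'s own codec names (PGS shows up as `hdmv_pgs_subtitle`, VobSub as `dvd_subtitle`). It says nothing about provenance; whether a stream really is the original release still has to be judged by hand.

```bash
# Map an ffprobe codec_name to "text", "bitmap", or "unknown".
# (hypothetical helper; the codec list is a common subset, not exhaustive)
sub_kind() {
  case "$1" in
    subrip|ass|ssa|mov_text|webvtt)              echo text ;;
    hdmv_pgs_subtitle|dvd_subtitle|dvb_subtitle) echo bitmap ;;
    *)                                           echo unknown ;;
  esac
}

# List embedded subtitle streams as "index,codec,language,kind".
# Requires ffprobe; a stream with no language tag is reported as "und".
list_subs() {
  ffprobe -v error -select_streams s \
    -show_entries stream=index,codec_name:stream_tags=language \
    -of csv=p=0 "$1" |
  while IFS=, read -r idx codec lang; do
    printf '%s,%s,%s,%s\n' "$idx" "$codec" "${lang:-und}" "$(sub_kind "$codec")"
  done
}
```

Usage would be along the lines of `list_subs episode.mkv`, then reading the last column to see whether a text or bitmap stream is already present before reaching for tier 3 or 4.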

## What lands on disk

- **At least one** English subtitle stream per episode (embedded OR
  sidecar; both are not required).
- Sidecar filename when used: `<videobasename>.eng.srt`. No
  language-region tags (`en-US`), no flag stack on regular subs (no
  `.sdh`, no `.forced`, no `.cc` unless there genuinely is no
  plain-English option).
- Sidecar format: `.srt` (SubRip text). For embedded: keep the original
  codec (`subrip`, `ass`, `pgs`, `vobsub`, `dvd_subtitle`); do NOT
  re-mux just to convert format. Convert only when extracting to disk:
  text codecs via `ffmpeg -i <video> -map 0:s:0 -c:s srt <out>.srt`,
  bitmap codecs via OCR (see "OCR bitmap → text" below).
- Encoding (sidecar): UTF-8. Re-encode with `iconv` if a sidecar comes
  back as cp1252 / windows-1250.
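
The naming and encoding rules above can be sketched as two small shell helpers. Both names (`sidecar_name`, `to_utf8`) are hypothetical, and the cp1252 default is an assumption: pass the real source encoding as the second argument if the file came back as windows-1250 or something else.

```bash
# Build the sidecar path for a video file: strip the extension, add .eng.srt.
# Assumes the video path ends in a file extension.
sidecar_name() {
  printf '%s.eng.srt\n' "${1%.*}"
}

# Re-encode a sidecar to UTF-8 in place, but only if it is not already
# valid UTF-8. Source encoding defaults to CP1252 (an assumption).
to_utf8() {
  local f=$1 from=${2:-CP1252}
  if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
    return 0                       # already valid UTF-8, leave untouched
  fi
  iconv -f "$from" -t UTF-8 "$f" >"$f.tmp" && mv "$f.tmp" "$f"
}
```

For example, `to_utf8 "$(sidecar_name 'Show S01E01.mkv')"` would target `Show S01E01.eng.srt`. Checking for valid UTF-8 first avoids double-encoding a file that was already fine.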

## What gets picked

In order:

1. **English** language: `eng` / `en`. Never auto-pick `en-US`/`en-GB`
   variants over plain `en`; treat them as equivalent for matching.
2. **No SDH / Hearing Impaired**: drop any sub flagged `hearing_impaired`,
   `sdh`, `cc`. Only fall back to SDH if no plain-English option exists.
3. **No machine / AI translation**: drop `machine_translated`,
   `ai_translated`. Hand-authored subs only.
4. **No forced subtitles**: drop `foreign_parts_only` / `Forced` unless the
   episode has English audio with foreign-language scenes that need
   translation (rare for US shows).
5. **Frame-rate match**: prefer entries whose declared fps matches the
   source video (typically 23.976 for our masters). Treat `0.0` as unknown
   and fall through to step 6.
6. **Highest download count** within the surviving candidates, as a proxy
   for "the version everyone agreed was best."

After fetch, **eyeball-verify that one sample episode per show** plays in
sync (±1 s on a known dialogue line) before declaring the show done.
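
The pick order above can be sketched as a filter-then-sort over a candidate list. This is a minimal illustration, not the fetch script itself: the function name `pick_sub` and the tab-separated input layout (id, language, SDH flag, machine/AI flag, forced flag, fps, download count) are assumptions made for the example. It always drops forced subs, so the rare foreign-scenes exception in step 4 still needs a manual override.

```bash
# Read candidates on stdin as TSV (hypothetical layout):
#   id  lang  sdh(0/1)  machine_or_ai(0/1)  forced(0/1)  fps  downloads
# Print the id of the best candidate for the given target fps.
pick_sub() {
  local fps=${1:-23.976}
  awk -F'\t' -v fps="$fps" '
    {
      lang = tolower($2); sub(/-.*/, "", lang)   # en-US / en-GB count as en
      if (lang != "en" && lang != "eng") next    # step 1: English only
      if ($4 == 1 || $5 == 1) next               # steps 3-4: no machine/AI, no forced
      fps_ok = ($6 + 0 == fps + 0) ? 1 : 0       # step 5: 0.0 (unknown) never matches
      # sort key: plain before SDH (step 2), fps match, then downloads (step 6)
      printf "%d\t%d\t%d\t%s\n", $3, fps_ok, $7, $1
    }' |
  sort -t "$(printf '\t')" -k1,1n -k2,2nr -k3,3nr |
  head -n 1 | cut -f4
}
```

Sorting SDH entries last, rather than deleting them, is what gives the "only fall back to SDH if no plain-English option exists" behaviour: an SDH candidate wins only when nothing else survives the filters.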

## What doesn't ship

- Multiple language tracks per episode (no German/French alternatives;
  this is an English-only library). Drop non-English embedded streams via
  a `mkvmerge` remux (`mkvpropedit` can only edit track flags, not remove
  tracks), and only if the user complains about client picker clutter; do
  NOT silently strip them on import.
- Director's commentary, behind-the-scenes, song-only subs.
- Subs that cover only a partial runtime (the partial-cover heuristic isn't
  scripted yet; spot-check duration vs episode runtime if an `.srt` looks
  short).
- "All-episodes-in-one" mega-packs treated as a single episode's sidecar.
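
Since the partial-cover heuristic isn't scripted yet, here is one possible shape for the manual spot-check: compare the last cue's end time against the video duration. All three helper names are hypothetical, and the 0.8 ratio is an arbitrary placeholder rather than a tuned threshold (end credits alone can leave a legitimate gap).

```bash
# Print the latest cue end time in an .srt file, in whole seconds.
srt_last_end() {
  awk '/-->/ {
         split($3, t, /[:,]/)                 # HH:MM:SS,mmm -> t[1..4]
         s = t[1] * 3600 + t[2] * 60 + t[3]
         if (s > last) last = s
       }
       END { print last + 0 }' "$1"
}

# Print the container duration of a video in seconds (requires ffprobe).
video_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Succeed if the subs reach at least min_ratio (default 0.8) of the runtime.
covers_runtime() {
  awk -v s="$1" -v r="$2" -v m="${3:-0.8}" 'BEGIN { exit !(s >= m * r) }'
}
```

A check could then read `covers_runtime "$(srt_last_end ep.eng.srt)" "$(video_duration ep.mkv)" || echo "looks partial"`. It only catches subs that stop early; a file that starts late or has a hole in the middle would still need eyes on it.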

## OCR bitmap → text (optional, tier-2 augmentation)

Embedded PGS/VobSub/`dvd_subtitle` are acceptable as-is (tier 2). OCR
becomes worthwhile when: (a) a client repeatedly transcodes due to bitmap
burn-in (CPU pressure on nullstone, which has no GPU transcode available),
(b) the user wants to restyle font/size on a specific show, or (c) a mobile
client renders bitmap subs poorly.

Recipe (`pgsrip`, batch-friendly, Tesseract-backed):

```bash
pip install pgsrip                 # Tesseract itself must be installed separately

# PGS: extract the first embedded subtitle stream to .sup, then OCR it
ffmpeg -i input.mkv -map 0:s:0 -c copy subs.sup
pgsrip --language eng subs.sup     # -> subs.srt

# VobSub/dvd_subtitle: extract to .idx + .sub
# (track ID 2 is an example; list the real IDs with `mkvmerge -i input.mkv`)
mkvextract input.mkv tracks 2:subs.idx
pgsrip subs.idx                    # -> subs.srt
# NOTE: pgsrip is documented as a PGS ripper; confirm it accepts VobSub
# .idx input before relying on this step.
```

OCR accuracy is roughly 90–95% raw and 95–98% after Subtitle Edit cleanup.
The source words are correct (it's a transcription of the original render,
not a Whisper hallucination); only font recognition fights you. The
resulting `.srt` ships as a sidecar **alongside** the embedded bitmap
stream, not as a replacement.

## How the UI presents subs

The detail-page subtitle dropdown is shimmed via
`web-overrides/index.html` (markers `SUB-LABEL-SHIM-BEGIN/END`). Stock
Jellyfin shows e.g. `English - SUBRIP - External - Default`; the shim
collapses to `English`, with `(Forced)` / `(SDH)` / `(Hearing Impaired)`
suffixes only when those flags actually apply. `Default` is dropped because
it's redundant when there's only one stream per language.

Revert: `bin/revert-sub-label-shim.sh`.

## Why these rules

- Boutique-release-group quality bar from
  [`README.md`](../../README.md): "every show and film is the best version
  I could put together."
- **Original-release subs > pretty format.** The DVD/BD/streamer master
  is the canonical script: bitmap or text, those are the words the
  studio shipped. An OCR'd or AI-rebuilt sidecar is a derivative that
  introduces error (font confusion, mistranscription); the original
  doesn't. This is especially true for older shows (Futurama S1–S3, early
  Archer seasons) where the master is the only authoritative source and
  upscale artifacts already dominate the visual experience, so bitmap
  subs match the source vibe.
- One-language library = one stream per ep = no need to expose codec or
  source in the UI.
- SDH/CC adds `[door slams]`, `[music]` etc., which is distracting on first
  watch and not what someone reaches for unless they specifically need it.
- Machine / AI translations are inconsistent and often wrong on slang or
  show-specific terms (esp. animated comedies).
- Frame-rate-matched subs sync without manual offset on first try; a
  mismatch generally still works on NTSC (29.97 vs 23.976 are the same
  elapsed time), but hash-match or fps-match removes that gamble.