# Subtitle process — changelog

## v1 — 2026-05-09

Initial recipe. Drafted while running on American Dad. Distilled from doc
03-subtitles.md (Futurama work) and the actual AD run.

Approach: Jellyfin RemoteSearch/Subtitles/eng → pick best non-HI/non-MT match
via Python filter → POST download → docker cp metadata cache → media folder →
delete cache dupes → validation refresh.

Scope: works on shows whose library season/episode numbering matches
OpenSubtitles' indexed numbering. Verified passing on AD S01 (7/7 episodes).

### Known break — added 2026-05-09 same day

After S01 passed, S02 returned 0 results for every episode probed (E01, E02,
E08, E13). Quota was fine (13 downloads remaining). Cause:

> Jellyfin metadata for American Dad uses **Hulu/DSP season ordering**
> (S1=7, S2=16, S3=19, S4=16). OpenSubtitles indexes by **Fox original-airing
> order** where S1 has 23 episodes. The plugin queries OS by
> `(parent_imdb_id, season_number, episode_number)`. For library S02E01
> "Bullocks to Stan" the plugin sends `S=2,E=1` but OS catalogues that
> episode as `S=1,E=8`. Result: 0 hits.

Each library episode has its own correct per-episode IMDB id (e.g.
`tt0511631` for "Bullocks to Stan") which would resolve directly via OS REST
`imdb_id=` parameter, but the plugin doesn't expose that path.

## v2 — 2026-05-09

Approach **A** chosen: direct OpenSubtitles REST API, per-episode `imdb_id`
lookup, bypass the Jellyfin plugin entirely. New helper at
`lib/sub-rest-fetch.py`.

- API key file: `~/.config/arrflix-opensubtitles-api.txt` (mode 600)
- Account: `Caveman5` (free tier, 20 downloads/day)
- Saves sidecars directly to nullstone media folder via `ssh ... cat >`
- No more docker-cp from `/config/metadata/library` cache (plugin path)

Recipe upgrade:

- Step 4 swaps `lib/sub-fetch.sh` → `lib/sub-rest-fetch.py` for shows with
  non-standard season ordering.
- Picker logic identical: filter HI/MT/AI/Forced (renamed `foreign_parts_only`
  in OS REST), prefer 23.976fps, sort by `download_count` desc.

### v2 known quirks

- **OpenSubtitles `/download` endpoint rejects urllib**: consistent HTTP 503
  via Python `urllib.request`, HTTP 200 via `curl` with same headers/body.
  `_curl()` shim added; all OS API calls go through it. **Each 503 still
  consumes 1 download-quota slot**, so this had to be fixed before retrying
  large batches.
- `download_count` of `0` and `fps` of `0.0` appear on some catalogue entries;
  treat as informational, not exclusionary.
- Some hits have `file_name` mismatching the `imdb_id` searched (OS metadata
  drift). Recipe Step 6 visual-sync check is the catch.

### v2 known limits

- Free-tier 20/day still in force (REST and plugin share the counter).
- Recipe Step 6 (sync verification) is still manual: no automated check that
  the picked .srt actually aligns with audio.

## v3 — 2026-05-09

Approach **Addic7ed via subliminal** added as a quota-free fallback. New
helper at `lib/sub-a7d-fetch.py`. Runs alongside v2; pick whichever fits.

- `subliminal` Python lib drives `addic7ed` provider, anonymous
- OS REST is still consulted (search-only, no quota cost) to translate library
  Hulu numbering to the show's primary catalogue numbering, since Addic7ed and
  OS feature_details appear to align for at least the test show (American Dad)
- Sidecar written direct to nullstone via `ssh ... cat >`

### v3 picker / matching

- subliminal returns ordered candidates by match score; takes first
- "!" in series name breaks subliminal's matcher; recipe strips it before
  building the synthetic filename for `Video.fromname()`
- Synthetic filename pattern: `Series.Name.Year.SXXEYY.HDTV.x264.mkv`

### v3 known quirks

- Some episodes return 0 hits at addic7ed for the OS-feat-details S/E we
  pass, likely cases where addic7ed indexes by Fox airing order while OS uses
  DVD-compressed (or vice versa).
  On American Dad, ~9 of 58 episodes missed via this path. Fall back to v2 OS
  REST when quota allows.
- One episode (`Black Mystery Month`) had a hit but downloaded empty content:
  addic7ed-side cataloguing error or temp 0-byte upload.
- Per-show coverage varies: Addic7ed has near-complete English on broadcast US
  shows but spotty for animated specials and obscure titles.

### v3 known limits

- English coverage best; non-English near-empty
- Anonymous downloads work but heavy bursts may trigger Addic7ed's bot
  detection and short IP throttle (~1 hour). The script makes no effort at
  jittering / backoff
- No automated sync-quality check; recipe Step 6 still manual

## v3.5 — 2026-05-10 (stop-gap path for niche YouTube-distributed shows)

For shows that distribute on YouTube and have no community subs anywhere
(verified by parallel research agents covering OS REST / OS legacy / Addic7ed /
SubDL / SubSource / Podnapisi for 5 niche shows), pull the YouTube auto-CC
track via yt-dlp and clean it.

- New helper: `lib/sub-yt-fetch.sh` (yt-dlp wrapper) + `lib/yt-clean.py`
  (rolling-window VTT → flat SRT cleaner)
- First applied to **Sassy the Sasquatch (2022)**, S01 5/5 episodes
- Reusable for the rest of the Big Lez universe (same channel hosts Donny &
  Clarence, Mike Nolan, Big Lez Saga)

### v3.5 known limits — explicitly violates STYLE.md "best quality"

- Lowercase, no punctuation, no sentence segmentation
- Proper-noun mishears (Sassy → "sasha", Big Lez → "Big Less")
- Profanity censored as `[ __ ]` by YouTube's ASR
- Will be replaced wholesale by v4 WhisperX (see ROADMAP H5)

### v3.5 also discovered

- **OpenSubtitles VIP would not have helped.** Verified: VIP is download-cap
  relief and ad removal, not coverage unlock. Same catalog as free.
- **Mike Nolan special-case**: a YouTube upload titled "MIKE NOLAN SHOW |
  COMPLETE SEASON | SUBTITLES" (Oct 2025) carries hand-typed CCs. When subbing
  Mike Nolan, prefer ripping that single upload over the per-episode auto-CC
  playlist path.
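
For reference, the "rolling-window VTT → flat SRT" cleanup in v3.5 could look roughly like the sketch below. This is not the contents of `lib/yt-clean.py`. It assumes the usual YouTube auto-CC layout, where each cue repeats the previous cue's line and carries inline `<00:00:01.000><c>` timing tags. The function names (`parse_cues`, `flatten`, `to_srt`, `vtt_to_srt`) are hypothetical.

```python
import re

# Cue timing line, e.g. "00:00:02.010 --> 00:00:04.000 align:start position:0%"
CUE_TIME = re.compile(
    r"(\d{2}:\d{2}:\d{2})\.(\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2})\.(\d{3})"
)
# Inline word-timing and <c> tags emitted by YouTube's auto-captions
INLINE_TAG = re.compile(r"<[^>]+>")


def parse_cues(vtt_text):
    """Return [(start, end, [text lines])] with inline tags stripped.

    Assumption: no cue identifiers or NOTE blocks between cues; header
    lines before the first cue are ignored.
    """
    cues = []
    current = None
    for raw in vtt_text.splitlines():
        match = CUE_TIME.match(raw)
        if match:
            start_hms, start_ms, end_hms, end_ms = match.groups()
            # SRT uses a comma before the milliseconds
            current = (f"{start_hms},{start_ms}", f"{end_hms},{end_ms}", [])
            cues.append(current)
        elif current is not None:
            text = INLINE_TAG.sub("", raw).strip()
            if text:
                current[2].append(text)
    return cues


def flatten(cues):
    """Keep only the lines a cue adds on top of the previous cue."""
    entries = []
    previous_lines = []
    for start, end, lines in cues:
        fresh = [line for line in lines if line not in previous_lines]
        if lines:
            previous_lines = lines
        if fresh:
            entries.append((start, end, " ".join(fresh)))
    return entries


def to_srt(entries):
    """Render numbered SRT blocks separated by blank lines."""
    blocks = [
        f"{index}\n{start} --> {end}\n{text}\n"
        for index, (start, end, text) in enumerate(entries, start=1)
    ]
    return "\n".join(blocks)


def vtt_to_srt(vtt_text):
    return to_srt(flatten(parse_cues(vtt_text)))
```

The de-duplication is deliberately naive: a line that is legitimately spoken twice in a row would be dropped, and nothing here addresses the lowercase, punctuation, or mishear limits listed above.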
## v4 — planned (see ROADMAP H5)

Path: **WhisperX large-v3 on friend RTX 4080 node** (`100.64.0.3`).

- Replaces v3.5 stop-gap with full-quality auto-transcription
- Per-show proper-noun prompt at `processes/subtitles/prompts/.yaml`
- New helper: `lib/sub-whisperx-fetch.py` (TBD)
- Expected WER: 4–6% on noisy / animated dialogue (vs ~12% YT auto-CC)
- Restores STYLE.md "one clean English sub per ep" bar for niche shows
- Cloud fallback: ElevenLabs Scribe v2 (~$0.40/hr, ~2.2% WER) for any
  episode WhisperX still misses
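
The WER figures above are word error rates: the word-level edit distance between a reference transcript and the machine output, divided by the number of reference words. A minimal sketch for spot-checking a transcript against a hand-corrected reference follows. The normalisation (lowercase, punctuation dropped) is an assumption, not something the roadmap specifies, and `normalize` / `word_error_rate` are hypothetical names.

```python
import re


def normalize(text):
    """Lowercase and split into words, dropping punctuation (assumed normalisation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference word count."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Word-level Levenshtein distance, keeping only one DP row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Using the v3.5 mishears in a made-up line, `word_error_rate("Sassy met Big Lez today", "sasha met big less today")` gives 0.4: two substitutions over five reference words.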