legacy-arrflix/processes/subtitles/CHANGELOG.md
s8n eb71cf6beb processes/subtitles: v3.5 YouTube auto-CC stop-gap + Sassy 5/5
Adds lib/sub-yt-fetch.sh (yt-dlp wrapper) and lib/yt-clean.py (collapses
YouTube's rolling-window auto-caption VTT into a flat SRT). For shows
distributed YouTube-first that have no community subs anywhere -- verified
via three parallel research agents covering OpenSubtitles REST, OS legacy,
Addic7ed, SubDL, SubSource, and Podnapisi for the 5 niche shows in the
library, plus a price-vs-coverage analysis of OpenSubtitles VIP.

Findings: OS VIP would not have helped on the niche shows (it is
download-cap relief, not coverage unlock; same catalog as free). All 4
Jarrad Wright shows in the library (Sassy, Big Lez Saga, Donny &
Clarence, Mike Nolan) live on the same channel and have only YouTube
auto-CC available. v3.5 ships those, explicitly violating STYLE.md
'best quality' as a tracked stop-gap.

Sassy the Sasquatch S01 5/5 episodes subbed with cleaned auto-CC. Mike
Nolan special-case noted: a 'COMPLETE SEASON | SUBTITLES' YT upload from
Oct 2025 carries hand-typed CCs and should be preferred over per-episode
auto-CC when subbing that show.

ROADMAP H5 added: v4 WhisperX large-v3 on the friend RTX 4080 node will
regenerate the v3.5 stop-gap with proper-noun-prompted transcription
(~4-6% WER vs ~12% YT auto-CC) and restore the STYLE.md quality bar.
H1 OpenSubtitles credentials marked done (was completed 2026-05-09).
2026-05-10 01:05:07 +01:00


Subtitle process — changelog

v1 — 2026-05-09

Initial recipe. Drafted while running on American Dad. Distilled from doc 03-subtitles.md (Futurama work) and the actual AD run.

Approach: Jellyfin RemoteSearch/Subtitles/eng → pick best non-HI/non-MT match via Python filter → POST download → docker cp metadata cache → media folder → delete cache dupes → validation refresh.

Scope: works on shows whose library season/episode numbering matches OpenSubtitles' indexed numbering. Verified passing on AD S01 (7/7 episodes).

Known break — added 2026-05-09 same day

After S01 passed, S02 returned 0 results for every episode probed (E01, E02, E08, E13). Quota was fine (13 downloads remaining). Cause:

Jellyfin metadata for American Dad uses Hulu/DSP season ordering (S1=7, S2=16, S3=19, S4=16). OpenSubtitles indexes by Fox original-airing order where S1 has 23 episodes. The plugin queries OS by (parent_imdb_id, season_number, episode_number). For library S02E01 "Bullocks to Stan" the plugin sends S=2,E=1 but OS catalogues that episode as S=1,E=8. Result: 0 hits.

Each library episode has its own correct per-episode IMDB id (e.g. tt0511631 for "Bullocks to Stan") which would resolve directly via OS REST imdb_id= parameter, but the plugin doesn't expose that path.
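The mismatch can be reproduced with a little arithmetic. The sketch below is illustrative only: `remap_episode` is a hypothetical helper, not part of the recipe, and it assumes both orderings list episodes in the same sequence and differ only in where the season boundaries fall. Only the S1 count (23) is known for the airing order, so anything past that raises.

```python
def remap_episode(season, episode, src_sizes, dst_sizes):
    """Map (season, episode) between two season layouts via the absolute index.

    Assumes both layouts keep the same episode sequence and only cut the
    seasons at different points.
    """
    if not 1 <= season <= len(src_sizes):
        raise ValueError("season outside the source layout")
    if not 1 <= episode <= src_sizes[season - 1]:
        raise ValueError("episode outside the source layout")

    # Absolute position of the episode counted from the first episode.
    remaining = sum(src_sizes[: season - 1]) + episode

    for dst_season, size in enumerate(dst_sizes, start=1):
        if remaining <= size:
            return dst_season, remaining
        remaining -= size
    raise ValueError("episode outside the destination layout")


LIBRARY_SIZES = [7, 16, 19, 16]  # Hulu/DSP ordering as stored in Jellyfin
AIRING_SIZES = [23]              # only the S1 count is stated in this changelog

# Library S02E01 "Bullocks to Stan" lands on S1E8 in the airing order.
print(remap_episode(2, 1, LIBRARY_SIZES, AIRING_SIZES))  # (1, 8)
```

A table like this is fragile compared with the per-episode IMDB id, which sidesteps season numbering altogether.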

v2 — 2026-05-09

Approach A chosen: direct OpenSubtitles REST API, per-episode imdb_id lookup, bypass the Jellyfin plugin entirely. New helper at lib/sub-rest-fetch.py.

  • API key file: ~/.config/arrflix-opensubtitles-api.txt (mode 600)
  • Account: Caveman5 (free tier, 20 downloads/day)
  • Saves sidecars directly to nullstone media folder via ssh ... cat >
  • No more docker-cp from /config/metadata/library cache (plugin path)

Recipe upgrade:

  • Step 4 swaps lib/sub-fetch.sh for lib/sub-rest-fetch.py on shows with non-standard season ordering.
  • Picker logic identical: filter HI/MT/AI/Forced (renamed foreign_parts_only in OS REST), prefer 23.976fps, sort by download_count desc.
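A minimal sketch of that picker, not the actual code in lib/sub-rest-fetch.py. It assumes each search result is a dict with an `attributes` mapping carrying `hearing_impaired`, `machine_translated`, `ai_translated`, `foreign_parts_only`, `fps` and `download_count`; those field names should be checked against a live OS REST response. The function name `pick_subtitle` and the `release` values in the sample data are made up for illustration.

```python
def pick_subtitle(results, preferred_fps=23.976):
    """Return the attributes of the best candidate, or None if all are filtered."""

    def usable(attrs):
        # Drop hearing-impaired, machine/AI-translated and forced-only tracks.
        return not (
            attrs.get("hearing_impaired")
            or attrs.get("machine_translated")
            or attrs.get("ai_translated")
            or attrs.get("foreign_parts_only")
        )

    candidates = [r["attributes"] for r in results if usable(r["attributes"])]
    if not candidates:
        return None

    def rank(attrs):
        # An fps match outranks popularity; download_count breaks ties.
        fps_match = abs((attrs.get("fps") or 0.0) - preferred_fps) < 0.01
        return (fps_match, attrs.get("download_count") or 0)

    return max(candidates, key=rank)


sample = [
    {"attributes": {"release": "a", "hearing_impaired": True, "fps": 23.976, "download_count": 900}},
    {"attributes": {"release": "b", "fps": 25.0, "download_count": 500}},
    {"attributes": {"release": "c", "fps": 23.976, "download_count": 10}},
]
print(pick_subtitle(sample)["release"])  # c
```

Ranking with a tuple keeps the fps preference strict: a 23.976 fps track with few downloads still beats a popular 25 fps one.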

v2 known quirks

  • OpenSubtitles /download endpoint rejects urllib — consistent HTTP 503 via Python urllib.request, HTTP 200 via curl with same headers/body. _curl() shim added; all OS API calls go through it. Each 503 still consumes 1 download-quota slot, so this had to be fixed before retrying large batches.
  • download_count of 0 and fps of 0.0 appear on some catalogue entries; treat as informational, not exclusionary.
  • Some hits have file_name mismatching the imdb_id searched (OS metadata drift). Recipe Step 6 visual-sync check is the catch.
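The curl workaround from the first quirk could look roughly like the sketch below. This is an assumption about its shape, not the `_curl()` that ships in the helper: `build_curl_argv` is a hypothetical name, and the headers and URL are whatever the caller already sends. It only relies on standard curl options (`--request`, `--header`, `--data`, `--write-out`).

```python
import json
import subprocess


def build_curl_argv(method, url, headers, body=None):
    """Build the curl command line for one API call (hypothetical helper)."""
    argv = ["curl", "--silent", "--show-error", "--request", method, url]
    for name, value in headers.items():
        argv += ["--header", f"{name}: {value}"]
    if body is not None:
        argv += ["--data", json.dumps(body)]
    # Ask curl to print the HTTP status on its own final line of stdout.
    argv += ["--write-out", "\n%{http_code}"]
    return argv


def _curl(method, url, headers, body=None, timeout=30):
    """Run curl and return (status_code, response_text)."""
    proc = subprocess.run(
        build_curl_argv(method, url, headers, body),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raises if curl itself fails (DNS, TLS, ...)
    )
    payload, _, status = proc.stdout.rpartition("\n")
    return int(status), payload
```

Returning the status code separately lets the caller stop a batch on the first 503 instead of burning further quota slots.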

v2 known limits

  • Free-tier 20/day still in force (REST and plugin share the counter).
  • Recipe Step 6 (sync verification) is still manual — no automated check that the picked .srt actually aligns with audio.

v3 — 2026-05-09

Approach: Addic7ed via subliminal, added as a quota-free fallback. New helper at lib/sub-a7d-fetch.py. It runs alongside v2; pick whichever fits.

  • subliminal Python lib drives addic7ed provider, anonymous
  • OS REST is still consulted (search-only, no quota cost) to translate library Hulu numbering to the show's primary catalogue numbering, since Addic7ed and OS feature_details appear to align for at least the test show (American Dad)
  • Sidecar written direct to nullstone via ssh ... cat >

v3 picker / matching

  • subliminal returns candidates ordered by match score; the helper takes the first
  • "!" in series name breaks subliminal's matcher; recipe strips it before building the synthetic filename for Video.fromname()
  • Synthetic filename pattern: Series.Name.Year.SXXEYY.HDTV.x264.mkv
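The filename construction can be sketched as follows. `synthetic_name` is a hypothetical name, and collapsing every non-alphanumeric run into a dot is an assumption that goes beyond the "!" stripping described above; the year in the example is American Dad's premiere year, supplied here purely for illustration.

```python
import re


def synthetic_name(series, year, season, episode):
    """Build a Series.Name.Year.SXXEYY.HDTV.x264.mkv style filename."""
    cleaned = series.replace("!", "")  # "!" breaks subliminal's matcher
    # Assumption: collapse spaces and other punctuation into single dots.
    dotted = re.sub(r"[^A-Za-z0-9]+", ".", cleaned).strip(".")
    return f"{dotted}.{year}.S{season:02d}E{episode:02d}.HDTV.x264.mkv"


print(synthetic_name("American Dad!", 2005, 1, 8))
# American.Dad.2005.S01E08.HDTV.x264.mkv
```

The resulting string is what gets passed to `Video.fromname()`; no file with that name needs to exist.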

v3 known quirks

  • Some episodes return 0 hits at addic7ed for the OS-feat-details S/E we pass — likely cases where addic7ed indexes by Fox airing order while OS uses DVD-compressed (or vice versa). On American Dad, ~9 of 58 episodes missed via this path. Fall back to v2 OS REST when quota allows.
  • One episode (Black Mystery Month) had a hit but downloaded empty content — addic7ed-side cataloguing error or temp 0-byte upload.
  • Per-show coverage varies: Addic7ed has near-complete English on broadcast US shows but spotty for animated specials and obscure titles.

v3 known limits

  • English coverage best; non-English near-empty
  • Anonymous downloads work, but heavy bursts may trigger Addic7ed's bot detection and a short IP throttle (~1 hour). The script makes no attempt at jitter or backoff.
  • No automated sync-quality check; recipe Step 6 still manual

v3.5 — 2026-05-10 (stop-gap path for niche YouTube-distributed shows)

For shows that distribute on YouTube and have no community subs anywhere (verified by parallel research agents covering OS REST / OS legacy / Addic7ed / SubDL / SubSource / Podnapisi for 5 niche shows), pull the YouTube auto-CC track via yt-dlp and clean it.
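For orientation, a yt-dlp call that fetches only the auto-generated English captions might be assembled like this. It is a sketch of the idea, not the contents of lib/sub-yt-fetch.sh: the function name, the placeholder URL and the output template are assumptions, while the flags themselves are standard yt-dlp options.

```python
import subprocess


def yt_dlp_autosub_argv(url, out_template, lang="en"):
    """Command line that downloads auto-captions as VTT and skips the video."""
    return [
        "yt-dlp",
        "--skip-download",        # captions only, no media
        "--write-auto-subs",      # auto-generated track, not uploaded subs
        "--sub-langs", lang,
        "--sub-format", "vtt",
        "--output", out_template,
        url,
    ]


def fetch_autosub(url, out_template, lang="en"):
    """Run yt-dlp; raises CalledProcessError if the download fails."""
    subprocess.run(yt_dlp_autosub_argv(url, out_template, lang), check=True)


# Placeholder URL, shown only to illustrate the argument order.
print(" ".join(yt_dlp_autosub_argv("https://www.youtube.com/watch?v=VIDEO_ID", "%(title)s.%(ext)s")))
```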

  • New helper: lib/sub-yt-fetch.sh (yt-dlp wrapper) + lib/yt-clean.py (rolling-window VTT → flat SRT cleaner)
  • First applied to Sassy the Sasquatch (2022), S01 5/5 episodes
  • Reusable for the rest of the Big Lez universe (same channel hosts Donny & Clarence, Mike Nolan, Big Lez Saga)
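The rolling-window collapse can be illustrated with a short sketch. It is not lib/yt-clean.py; it assumes the usual shape of YouTube auto-caption VTT, where each cue repeats the previous line and appends a new one wrapped in inline timing tags, and it keeps only lines that have not just been emitted. The function name `vtt_to_flat_srt` is hypothetical.

```python
import re

CUE_RE = re.compile(r"(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})")
TAG_RE = re.compile(r"<[^>]+>")  # inline <00:00:01.000> and <c>...</c> tags


def vtt_to_flat_srt(vtt_text):
    """Collapse rolling-window VTT cues into non-repeating SRT blocks."""
    cues = []  # (start, end, [text lines])
    for raw in vtt_text.splitlines():
        m = CUE_RE.match(raw)
        if m:
            # SRT uses a comma before the milliseconds.
            cues.append((f"{m[1]},{m[2]}", f"{m[3]},{m[4]}", []))
        elif cues:  # ignore the WEBVTT header before the first cue
            line = TAG_RE.sub("", raw).strip()
            if line:
                cues[-1][2].append(line)

    blocks, last_line = [], None
    for start, end, lines in cues:
        fresh = []
        for line in lines:
            if line != last_line:  # skip the carried-over line
                fresh.append(line)
                last_line = line
        if fresh:  # cues that only repeat old text are dropped
            text = " ".join(fresh)
            blocks.append(f"{len(blocks) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)
```

Comparing only against the last emitted line is deliberately conservative: a phrase that is genuinely repeated later in the episode still gets its own entry.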

v3.5 known limits — explicitly violates STYLE.md "best quality"

  • Lowercase, no punctuation, no sentence segmentation
  • Proper-noun mishears (Sassy → "sasha", Big Lez → "Big Less")
  • Profanity censored as [ __ ] by YouTube's ASR
  • Will be replaced wholesale by v4 WhisperX (see ROADMAP H5)

v3.5 also discovered

  • OpenSubtitles VIP would not have helped. Verified: VIP is download-cap relief and ad removal, not coverage unlock. Same catalog as free.
  • Mike Nolan special-case: a YouTube upload titled "MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES" (Oct 2025) carries hand-typed CCs. When subbing Mike Nolan, prefer ripping that single upload over the per-episode auto-CC playlist path.

v4 — planned (see ROADMAP H5)

Path: WhisperX large-v3 on friend RTX 4080 node (100.64.0.3).

  • Replaces v3.5 stop-gap with full-quality auto-transcription
  • Per-show proper-noun prompt at processes/subtitles/prompts/<show>.yaml
  • New helper: lib/sub-whisperx-fetch.py (TBD)
  • Expected WER: ~4-6% on noisy / animated dialogue (vs ~12% YT auto-CC)
  • Restores STYLE.md "one clean English sub per ep" bar for niche shows
  • Cloud fallback: ElevenLabs Scribe v2 (~$0.40/hr, ~2.2% WER) for any episode WhisperX still misses
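For reference, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A small sketch of how it could be measured against a hand-checked transcript is below; the function name is hypothetical, and the sample strings reuse the mishears noted for v3.5 rather than real transcript data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Two of four words are misheard, so WER is 0.5.
print(word_error_rate("sassy and big lez", "sasha and big less"))  # 0.5
```

Both transcripts should be normalised the same way (case, punctuation) before comparison, otherwise formatting differences count as errors.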