legacy-arrflix/processes/subtitles/CHANGELOG.md
s8n eb71cf6beb processes/subtitles: v3.5 YouTube auto-CC stop-gap + Sassy 5/5
Adds lib/sub-yt-fetch.sh (yt-dlp wrapper) and lib/yt-clean.py (collapses
YouTube's rolling-window auto-caption VTT into a flat SRT). This targets shows
distributed YouTube-first that have no community subs anywhere, as verified
by three parallel research agents covering OpenSubtitles REST, OS legacy,
Addic7ed, SubDL, SubSource, and Podnapisi for the 5 niche shows in the
library, plus a price-vs-coverage analysis of OpenSubtitles VIP.

Findings: OS VIP would not have helped on the niche shows (it is
download-cap relief, not coverage unlock; same catalog as free). All 4
Jarrad Wright shows in the library (Sassy, Big Lez Saga, Donny &
Clarence, Mike Nolan) live on the same channel and have only YouTube
auto-CC available. v3.5 ships those, explicitly violating STYLE.md
'best quality' as a tracked stop-gap.

Sassy the Sasquatch S01 5/5 episodes subbed with cleaned auto-CC. Mike
Nolan special-case noted: a 'COMPLETE SEASON | SUBTITLES' YT upload from
Oct 2025 carries hand-typed CCs and should be preferred over per-episode
auto-CC when subbing that show.

ROADMAP H5 added: v4 WhisperX large-v3 on the friend RTX 4080 node will
regenerate the v3.5 stop-gap with proper-noun-prompted transcription
(~4-6% WER vs ~12% YT auto-CC) and restore the STYLE.md quality bar.
H1 OpenSubtitles credentials marked done (was completed 2026-05-09).
2026-05-10 01:05:07 +01:00


# Subtitle process — changelog
## v1 — 2026-05-09
Initial recipe. Drafted while running on American Dad. Distilled from doc
03-subtitles.md (Futurama work) and the actual AD run.

Approach: Jellyfin RemoteSearch/Subtitles/eng → pick best non-HI/non-MT match
via Python filter → POST download → docker cp metadata cache → media folder →
delete cache dupes → validation refresh.

Scope: works on shows whose library season/episode numbering matches
OpenSubtitles' indexed numbering. Verified passing on AD S01 (7/7 episodes).
### Known break — added 2026-05-09 same day
After S01 passed, S02 returned 0 results for every episode probed (E01, E02,
E08, E13). Quota was fine (13 downloads remaining). Cause:
> Jellyfin metadata for American Dad uses **Hulu/DSP season ordering**
> (S1=7, S2=16, S3=19, S4=16). OpenSubtitles indexes by **Fox original-airing
> order** where S1 has 23 episodes. The plugin queries OS by
> `(parent_imdb_id, season_number, episode_number)`. For library S02E01
> "Bullocks to Stan" the plugin sends `S=2,E=1` but OS catalogues that
> episode as `S=1,E=8`. Result: 0 hits.

Each library episode has its own correct per-episode IMDB id (e.g.
`tt0511631` for "Bullocks to Stan") which would resolve directly via OS REST
`imdb_id=` parameter, but the plugin doesn't expose that path.
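
The numbering mismatch above is plain offset arithmetic, which the sketch below works through. It assumes both orderings list episodes in the same sequence and only cut the seasons differently; `translate_episode` and its parameters are hypothetical names, and only the season lengths quoted above are used (Fox lengths beyond S1 are not given here, so later episodes raise an error rather than guess).

```python
def translate_episode(season, episode, library_lengths, catalogue_lengths):
    """Map a library (season, episode) to the catalogue's (season, episode).

    Assumes both orderings contain the same episodes in the same sequence
    and differ only in where the season boundaries fall.
    """
    if not 1 <= season <= len(library_lengths):
        raise ValueError("season outside the library ordering")
    if not 1 <= episode <= library_lengths[season - 1]:
        raise ValueError("episode outside the library season")

    # Absolute position of the episode, counted from the start of the show.
    remaining = sum(library_lengths[: season - 1]) + episode

    # Walk the catalogue seasons until the absolute position fits.
    for cat_season, length in enumerate(catalogue_lengths, start=1):
        if remaining <= length:
            return cat_season, remaining
        remaining -= length
    raise ValueError("catalogue season lengths do not cover this episode")


HULU_LENGTHS = [7, 16, 19, 16]  # library ordering quoted above
FOX_LENGTHS = [23]              # only S1 is stated; later seasons unknown here

print(translate_episode(2, 1, HULU_LENGTHS, FOX_LENGTHS))
```

For library S02E01 this yields the `S=1,E=8` pair that OS expects. It only illustrates why the plugin query misses; it is not a replacement for a per-episode `imdb_id` lookup.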
## v2 — 2026-05-09
Approach **A** chosen: direct OpenSubtitles REST API, per-episode `imdb_id`
lookup, bypass the Jellyfin plugin entirely. New helper at
`lib/sub-rest-fetch.py`.
- API key file: `~/.config/arrflix-opensubtitles-api.txt` (mode 600)
- Account: `Caveman5` (free tier, 20 downloads/day)
- Saves sidecars directly to nullstone media folder via `ssh ... cat >`
- No more docker-cp from `/config/metadata/library` cache (plugin path)

Recipe upgrade:
- Step 4 swaps `lib/sub-fetch.sh` → `lib/sub-rest-fetch.py` for shows with
non-standard season ordering.
- Picker logic identical: filter HI/MT/AI/Forced (renamed
`foreign_parts_only` in OS REST), prefer 23.976fps, sort by
`download_count` desc.
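
A minimal sketch of the picker rule, assuming each OS REST search hit carries an `attributes` object with the boolean flags and numeric fields named below. The field names are an assumption about the response shape, and `pick_subtitle` is a hypothetical name, not the helper's actual function.

```python
# Flags that disqualify a candidate outright (HI / MT / AI / Forced).
EXCLUDE_FLAGS = (
    "hearing_impaired",
    "machine_translated",
    "ai_translated",
    "foreign_parts_only",
)


def pick_subtitle(results, preferred_fps=23.976):
    """Return the attributes dict of the best candidate, or None."""
    candidates = [
        hit["attributes"]
        for hit in results
        if not any(hit.get("attributes", {}).get(flag) for flag in EXCLUDE_FLAGS)
    ]
    if not candidates:
        return None

    def rank(attrs):
        # A missing or 0.0 fps simply earns no preference; it does not exclude.
        fps_match = abs(float(attrs.get("fps") or 0.0) - preferred_fps) < 0.01
        return (fps_match, int(attrs.get("download_count") or 0))

    # fps match first, then download_count descending.
    return max(candidates, key=rank)
```

Ranking with a tuple keeps the fps preference strictly ahead of popularity, so a well-downloaded 25 fps file never beats a 23.976 fps one.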
### v2 known quirks
- **OpenSubtitles `/download` endpoint rejects urllib** — consistent HTTP 503
via Python `urllib.request`, HTTP 200 via `curl` with same headers/body.
`_curl()` shim added; all OS API calls go through it. **Each 503 still
consumes 1 download-quota slot**, so this had to be fixed before retrying
large batches.
- `download_count` of `0` and `fps` of `0.0` appear on some catalogue
entries; treat as informational, not exclusionary.
- Some hits have `file_name` mismatching the `imdb_id` searched (OS metadata
drift). Recipe Step 6 visual-sync check is the catch.
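
The real `_curl()` shim is not reproduced here; the following is a rough sketch of how such a shim could shell out to `curl` instead of using `urllib`. The base URL, header set, and the names `build_curl_argv` and `curl_json` are assumptions for illustration. Because a failed `/download` still costs a quota slot, the sketch returns the status code and leaves any retry decision to the caller.

```python
import json
import subprocess

API_BASE = "https://api.opensubtitles.com/api/v1"  # assumed base URL


def build_curl_argv(method, path, api_key, body=None, user_agent="arrflix-subs v0"):
    """Build the curl command line for one API call (no network access here)."""
    argv = [
        "curl", "--silent", "--show-error",
        "--request", method,
        "--header", f"Api-Key: {api_key}",
        "--header", f"User-Agent: {user_agent}",  # hypothetical UA string
        "--header", "Accept: application/json",
        # Append the HTTP status on its own final line so it can be split off.
        "--write-out", "\n%{http_code}",
    ]
    if body is not None:
        argv += ["--header", "Content-Type: application/json",
                 "--data", json.dumps(body)]
    argv.append(API_BASE + path)
    return argv


def curl_json(method, path, api_key, body=None):
    """Run curl and return (status_code, raw_body); the caller decides on retries."""
    proc = subprocess.run(
        build_curl_argv(method, path, api_key, body),
        capture_output=True, text=True, check=True,
    )
    payload, _, status = proc.stdout.rpartition("\n")
    return int(status), payload
```

Keeping the argv builder separate from the subprocess call makes the command easy to inspect or log before spending a download on it.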
### v2 known limits
- Free-tier 20/day still in force (REST and plugin share the counter).
- Recipe Step 6 (sync verification) is still manual — no automated check
that the picked .srt actually aligns with audio.
## v3 — 2026-05-09
Approach **Addic7ed via subliminal** added as a quota-free fallback. New
helper at `lib/sub-a7d-fetch.py`. Runs alongside v2; pick whichever fits.
- `subliminal` Python lib drives `addic7ed` provider, anonymous
- OS REST is still consulted (search-only, no quota cost) to translate
library Hulu numbering to the show's primary catalogue numbering, since
Addic7ed and OS feature_details appear to align for at least the test
show (American Dad)
- Sidecar written direct to nullstone via `ssh ... cat >`
### v3 picker / matching
- subliminal returns candidates ordered by match score; the recipe takes the first
- "!" in series name breaks subliminal's matcher; recipe strips it before
building the synthetic filename for `Video.fromname()`
- Synthetic filename pattern: `Series.Name.Year.SXXEYY.HDTV.x264.mkv`
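
As an illustration of the filename rule, the sketch below builds the synthetic name from series metadata. `synthetic_name` is a hypothetical helper, and the whitespace handling is an assumption; only the `!` stripping and the overall pattern come from the notes above.

```python
def synthetic_name(series, year, season, episode):
    """Build a release-style filename that subliminal's matcher can parse."""
    # "!" breaks the matcher, so drop it before tokenising the title.
    tokens = series.replace("!", "").split()
    return "{}.{}.S{:02d}E{:02d}.HDTV.x264.mkv".format(
        ".".join(tokens), year, season, episode
    )


# The result would then be handed to subliminal, e.g.:
#   video = subliminal.Video.fromname(synthetic_name("American Dad!", 2005, 1, 8))
print(synthetic_name("American Dad!", 2005, 1, 8))
```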
### v3 known quirks
- Some episodes return 0 hits at addic7ed for the OS-feat-details S/E we
pass — likely cases where addic7ed indexes by Fox airing order while OS
uses DVD-compressed (or vice versa). On American Dad, ~9 of 58 episodes
missed via this path. Fall back to v2 OS REST when quota allows.
- One episode (`Black Mystery Month`) had a hit but downloaded empty
content — addic7ed-side cataloguing error or temp 0-byte upload.
- Per-show coverage varies: Addic7ed has near-complete English on broadcast
US shows but spotty for animated specials and obscure titles.
### v3 known limits
- English coverage best; non-English near-empty
- Anonymous downloads work but heavy bursts may trigger Addic7ed's
bot detection and short IP throttle (~1 hour). The script makes no
effort at jittering / backoff
- No automated sync-quality check; recipe Step 6 still manual
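
The script does not do this today, but a jittered backoff of the kind the throttling note points at could look roughly like the sketch below. The base and cap values are placeholders, not tuned against Addic7ed's actual limits, and `jittered_delays` is a hypothetical name.

```python
import random


def jittered_delays(attempts, base=2.0, cap=120.0, rng=None):
    """Exponential backoff with full jitter: one delay (seconds) per attempt."""
    rng = rng or random.Random()
    # Attempt i waits a random time in [0, min(cap, base * 2**i)].
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]


# Example: sleep for each delay between successive downloads.
for delay in jittered_delays(3):
    print(f"would sleep {delay:.1f}s before the next request")
```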
## v3.5 — 2026-05-10 (stop-gap path for niche YouTube-distributed shows)
For shows that distribute on YouTube and have no community subs anywhere
(verified by parallel research agents covering OS REST / OS legacy /
Addic7ed / SubDL / SubSource / Podnapisi for 5 niche shows), pull the
YouTube auto-CC track via yt-dlp and clean it.
- New helper: `lib/sub-yt-fetch.sh` (yt-dlp wrapper) + `lib/yt-clean.py`
(rolling-window VTT → flat SRT cleaner)
- First applied to **Sassy the Sasquatch (2022)**, S01 5/5 episodes
- Reusable for the rest of the Big Lez universe (same channel hosts
Donny & Clarence, Mike Nolan, Big Lez Saga)
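
`lib/yt-clean.py` itself is not shown here. The sketch below is one plausible way to collapse a rolling-window VTT, assuming YouTube's usual layout where each cue repeats the previous line and adds a new one carrying inline `<00:00:00.000><c>` timing tags. The function name and the dedupe rule (drop any line equal to the last emitted line) are assumptions.

```python
import re

CUE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")
TAG_RE = re.compile(r"<[^>]+>")  # inline timestamps and <c>...</c> wrappers


def vtt_to_flat_srt(vtt_text):
    """Collapse rolling-window auto-caption cues into one SRT entry per new line."""
    entries = []
    last_line = None
    cue = None  # (start, end) of the cue currently being read
    for raw in vtt_text.splitlines():
        match = CUE_RE.match(raw)
        if match:
            cue = match.groups()
            continue
        if cue is None:
            continue  # still in the WEBVTT header
        text = TAG_RE.sub("", raw).strip()
        # Skip blanks and the carried-over copy of the previous line.
        if not text or text == last_line:
            continue
        entries.append((cue[0], cue[1], text))
        last_line = text

    blocks = [
        f"{index}\n{start.replace('.', ',')} --> {end.replace('.', ',')}\n{text}\n"
        for index, (start, end, text) in enumerate(entries, start=1)
    ]
    return "\n".join(blocks)
```

One trade-off of this rule: a line of dialogue that is genuinely spoken twice in a row is also dropped, which a fuller cleaner would need to tell apart by timing.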
### v3.5 known limits — explicitly violates STYLE.md "best quality"
- Lowercase, no punctuation, no sentence segmentation
- Proper-noun mishears (Sassy → "sasha", Big Lez → "Big Less")
- Profanity censored as `[ __ ]` by YouTube's ASR
- Will be replaced wholesale by v4 WhisperX (see ROADMAP H5)
### v3.5 also discovered
- **OpenSubtitles VIP would not have helped.** Verified: VIP is download-cap
relief and ad removal, not coverage unlock. Same catalog as free.
- **Mike Nolan special-case**: a YouTube upload titled
"MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES" (Oct 2025) carries
hand-typed CCs. When subbing Mike Nolan, prefer ripping that single
upload over the per-episode auto-CC playlist path.
## v4 — planned (see ROADMAP H5)
Path: **WhisperX large-v3 on friend RTX 4080 node** (`100.64.0.3`).
- Replaces v3.5 stop-gap with full-quality auto-transcription
- Per-show proper-noun prompt at `processes/subtitles/prompts/<show>.yaml`
- New helper: `lib/sub-whisperx-fetch.py` (TBD)
- Expected WER: ~4-6% on noisy / animated dialogue (vs ~12% YT auto-CC)
- Restores STYLE.md "one clean English sub per ep" bar for niche shows
- Cloud fallback: ElevenLabs Scribe v2 (~$0.40/hr, ~2.2% WER) for any
episode WhisperX still misses
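
For reference, the WER figures quoted here are word-level edit distance divided by reference length. The sketch below shows that calculation under simple assumptions (whitespace tokenisation, text already lowercased); `wer` is a hypothetical helper and not part of the planned tooling.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Two proper-noun mishears in a four-word line.
print(wer("sassy and big lez", "sasha and big less"))
```

On short lines a couple of proper-noun mishears dominate the score, which is the motivation for the per-show proper-noun prompt.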