legacy-arrflix/processes/subtitles/README.md
s8n eb71cf6beb processes/subtitles: v3.5 YouTube auto-CC stop-gap + Sassy 5/5
Adds lib/sub-yt-fetch.sh (yt-dlp wrapper) and lib/yt-clean.py (collapses
YouTube's rolling-window auto-caption VTT into a flat SRT). For shows
distributed YouTube-first that have no community subs anywhere -- verified
via three parallel research agents covering OpenSubtitles REST, OS legacy,
Addic7ed, SubDL, SubSource, and Podnapisi for the 5 niche shows in the
library, plus a price-vs-coverage analysis of OpenSubtitles VIP.

Findings: OS VIP would not have helped on the niche shows (it is
download-cap relief, not coverage unlock; same catalog as free). All 4
Jarrad Wright shows in the library (Sassy, Big Lez Saga, Donny &
Clarence, Mike Nolan) live on the same channel and have only YouTube
auto-CC available. v3.5 ships those, explicitly violating STYLE.md
'best quality' as a tracked stop-gap.

Sassy the Sasquatch S01 5/5 episodes subbed with cleaned auto-CC. Mike
Nolan special-case noted: a 'COMPLETE SEASON | SUBTITLES' YT upload from
Oct 2025 carries hand-typed CCs and should be preferred over per-episode
auto-CC when subbing that show.

ROADMAP H5 added: v4 WhisperX large-v3 on the friend RTX 4080 node will
regenerate the v3.5 stop-gap with proper-noun-prompted transcription
(~4-6% WER vs ~12% YT auto-CC) and restore the STYLE.md quality bar.
H1 OpenSubtitles credentials marked done (was completed 2026-05-09).
2026-05-10 01:05:07 +01:00


Subtitle acquisition process — v1

Last updated: 2026-05-10

Status: v3.5, four fetch paths (plugin / OS REST / Addic7ed / YouTube auto-CC). American Dad 49/58 + Sassy 5/5. v4 WhisperX planned (ROADMAP H5).

This recipe is written for Claude Code to execute. Each step lists the exact command, what to verify, and what to do on failure. Background reference for how Jellyfin and the OpenSubtitles plugin work together lives in docs/03-subtitles.md.

Read STYLE.md first. Every fetch must hit the bar set there: one English .srt per episode, plain (no SDH / no MT / no AI / no Forced), best-quality release. The picker logic in v1/v2/v3 mirrors that bar; if a step would violate it, stop and ask before downloading.


Prereqs (verify before running)

| Check | How |
| --- | --- |
| OpenSubtitles plugin v20 installed + Active | `docker exec jellyfin ls /config/plugins` |
| Plugin creds saved (Caveman5) | `docker exec jellyfin grep -E 'Username\|CredentialsInvalid' /config/plugins/configurations/Jellyfin.Plugin.OpenSubtitles.xml`; expect `Caveman5` and `false` |
| TV library has `SaveSubtitlesWithMedia=true`, `SubtitleDownloadLanguages=["eng"]`, `RequirePerfectSubtitleMatch=false` | `curl -s -H "X-Emby-Token: $TOK" http://localhost:8096/Library/VirtualFolders` |
| Free-tier quota remaining today (≥ episode count, else plan multi-day) | `docker logs --tail 200 jellyfin 2>&1 \| grep "Remaining downloads" \| tail -1` (free = 20/day, resets 00:00 UTC) |
| Source files have audio language tag | `ffprobe` sample episode |

If any prereq fails, stop. Fix it before running the recipe.
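
The credentials check can also be scripted instead of eyeballing the grep output. A minimal sketch, assuming the plugin XML holds `Username` and `CredentialsInvalid` elements (the two names the grep above looks for); `check_plugin_creds` is a hypothetical helper, not part of lib/.

```python
import xml.etree.ElementTree as ET


def check_plugin_creds(xml_text, expected_user):
    """Return a list of problems found in the plugin configuration XML.

    Assumption: the file contains <Username> and <CredentialsInvalid>
    elements somewhere under the root, as the grep in the prereq table implies.
    """
    root = ET.fromstring(xml_text)
    problems = []

    # Take the first matching element wherever it sits in the tree.
    username = next((e.text for e in root.iter("Username")), None)
    if username != expected_user:
        problems.append(f"Username is {username!r}, expected {expected_user!r}")

    invalid = next((e.text for e in root.iter("CredentialsInvalid")), None)
    if (invalid or "").strip().lower() != "false":
        problems.append(f"CredentialsInvalid is {invalid!r}, expected 'false'")

    return problems
```

An empty list means the prereq passes; anything else is a reason to stop before running the recipe.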


Step 1 — Probe the source

Pick one episode of the target show. Run ffprobe on it:

ssh user@192.168.0.100 'docker exec jellyfin /usr/lib/jellyfin-ffmpeg/ffprobe -hide_banner "<path-to-mkv>" 2>&1 | grep -E "Stream|Duration"'

Record in the run log:

  • video codec + resolution + frame rate
  • audio language tag(s)
  • whether any subtitle streams are embedded
  • container

Decide based on probe:

| Probe result | Branch |
| --- | --- |
| English audio, no embedded subs | "simple" path (this recipe) |
| Foreign-dub audio, no embedded subs | "foreign-dub" path (deferred to v?) |
| Embedded English subs already present | skip (Jellyfin will use them) |
| Embedded PGS/VobSub bitmap subs | "OCR" path (deferred to v?) |
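
The branch decision can be sketched in code. This assumes the usual `Stream #0:1(eng): Audio: aac ...` line shape from ffprobe and the codec names `hdmv_pgs_subtitle` (PGS) and `dvd_subtitle` (VobSub). The precedence when several rows apply, and the extra "unknown" result for untagged audio, are assumptions of this sketch rather than rules from the table.

```python
import re

# Matches e.g. "Stream #0:1(eng): Audio: aac" or "Stream #0:0: Video: h264".
STREAM_RE = re.compile(
    r"Stream #\d+:\d+(?:\[[^\]]*\])?(?:\((\w+)\))?: (Video|Audio|Subtitle): (\w+)"
)
BITMAP_CODECS = {"hdmv_pgs_subtitle", "dvd_subtitle"}


def choose_branch(ffprobe_output):
    """Map ffprobe stream lines to one of the branches in the probe table."""
    audio_langs, subs = [], []
    for lang, kind, codec in STREAM_RE.findall(ffprobe_output):
        if kind == "Audio":
            audio_langs.append(lang or "und")
        elif kind == "Subtitle":
            subs.append((lang or "und", codec))

    # Assumed precedence: usable English text subs win, then bitmap subs.
    if any(l == "eng" and c not in BITMAP_CODECS for l, c in subs):
        return "skip"
    if any(c in BITMAP_CODECS for _, c in subs):
        return "ocr"
    if "eng" in audio_langs:
        return "simple"
    if any(l != "und" for l in audio_langs):
        return "foreign-dub"
    return "unknown"  # no audio language tag: the prereq check failed
```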

Step 2 — Resolve series + episode IDs

TOK=<jellyfin-admin-token>
SERIES_NAME='American Dad'
ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \
  'http://localhost:8096/Items?searchTerm=${SERIES_NAME// /+}&IncludeItemTypes=Series&Recursive=true&Limit=3'" \
  | python3 -c "import json,sys; [print(x['Id'],x['Name']) for x in json.load(sys.stdin).get('Items',[])]"

Record series Id. Then list episodes:

SERIES=<series-id>
ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \
  'http://localhost:8096/Items?ParentId=$SERIES&IncludeItemTypes=Episode&Recursive=true&Fields=Path,ParentIndexNumber,IndexNumber'" \
  | python3 -c "import json,sys; [print(e['Id'],'S%02dE%02d'%(e['ParentIndexNumber'],e['IndexNumber']),e['Name']) for e in json.load(sys.stdin)['Items']]"
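
The one-liner above raises `KeyError` on any item that lacks `ParentIndexNumber` or `IndexNumber`. Whether the library contains such items is an assumption (unnumbered specials would be one case). A more tolerant sketch with a hypothetical `episode_labels` helper that consumes the same JSON:

```python
import json


def episode_labels(items_json):
    """Return (id, 'SxxEyy', name) tuples from a Jellyfin /Items response.

    Items missing a season or episode number get the label 'S??E??'
    instead of raising, so they can be reviewed by hand.
    """
    rows = []
    for ep in json.loads(items_json).get("Items", []):
        season = ep.get("ParentIndexNumber")
        number = ep.get("IndexNumber")
        if season is None or number is None:
            label = "S??E??"
        else:
            label = "S%02dE%02d" % (season, number)
        rows.append((ep["Id"], label, ep.get("Name", "")))
    return rows
```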

Step 3 — Pick fetch path

Four live paths plus the planned v4, ordered cheapest-quota-cost-first:

| Path | Cost / day cap | Coverage | Tool |
| --- | --- | --- | --- |
| v3 Addic7ed | free, no daily cap (anon) | English-only; near-complete on broadcast US shows; spotty on animated specials / niche titles | lib/sub-a7d-fetch.py |
| v2 OS REST | 20 / day on free OS account | best overall coverage; survives any S/E numbering quirk via per-ep imdb_id | lib/sub-rest-fetch.py |
| v1 plugin | counts against same OS 20/day | only works when library numbering matches OS catalogue (e.g. fails on American Dad past S01E07) | lib/sub-fetch.sh |
| v3.5 YouTube auto-CC | free, ratelimited | only for shows distributed YouTube-first (no community subs anywhere); produces lowercase, no-punctuation, name-mangled subs; stop-gap, violates STYLE.md | lib/sub-yt-fetch.sh + lib/yt-clean.py |
| v4 WhisperX (planned) | local CPU/GPU time | full-quality auto-transcription, restores STYLE.md bar for niche shows | TBD lib/sub-whisperx-fetch.py (ROADMAP H5) |

Default: try v3 first to spare quota; fall back to v2 for episodes v3 misses or for non-English needs. v1 stays for shows where simple plugin auto-fetch is enough. v3.5 is the stop-gap when nothing exists on community providers; v4 replaces v3.5 once the GPU node is set up.
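
The default order amounts to a per-episode fallback chain. An illustrative sketch only: the real tools are CLI scripts run per season, and `fetch_with_fallback` plus its callables are hypothetical stand-ins for them.

```python
def fetch_with_fallback(episodes, fetchers):
    """Try each fetcher in order per episode; record which one succeeded.

    `fetchers` is an ordered list of (name, callable) pairs, where the
    callable takes an episode key and returns True when a sub was saved.
    Episodes that every fetcher misses map to None.
    """
    result = {}
    for ep in episodes:
        result[ep] = None
        for name, fetch in fetchers:
            if fetch(ep):
                result[ep] = name
                break  # stop at the first path that delivers
    return result
```

Listing v3 before v2 means an OS download is only spent on episodes Addic7ed could not supply.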

Quick check whether v1 plugin will suffice (skip the rest if yes):

  1. Pick the first episode of season 2 in the library.
  2. Run curl -s -H "X-Emby-Token: $TOK" "http://localhost:8096/Items/$EP/RemoteSearch/Subtitles/eng" (read-only), with $EP set to that episode's Id. Use double quotes so the shell expands $TOK and $EP.
  3. If results > 0 — v1 works.
  4. If results == 0 but the show exists on opensubtitles.com, the cause is a numbering mismatch (e.g. American Dad: the library uses Hulu numbering, where S1 has 7 episodes; OS numbers the seasons differently). Use v3, then v2 for misses.
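
The quick check reduces to a two-input decision. A minimal sketch with a hypothetical `pick_paths` function; the list above does not cover the case where the show is absent from opensubtitles.com, so the sketch returns an empty list there instead of guessing.

```python
def pick_paths(remote_result_count, show_on_opensubtitles):
    """Return the fetch paths to try, in order, after the v1 quick check.

    remote_result_count: number of results from the RemoteSearch call.
    show_on_opensubtitles: True if the show was found by hand on
    opensubtitles.com.
    """
    if remote_result_count > 0:
        return ["v1"]          # plugin numbering matches, v1 is enough
    if show_on_opensubtitles:
        return ["v3", "v2"]    # numbering mismatch: v3 first, v2 for misses
    return []                  # not covered by the quick check: review manually
```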

Step 4 — Fetch subs per episode

v3 — Addic7ed (default, free)

JELLYFIN_TOKEN=<admin-token> \
OPENSUBTITLES_API_KEY=$HOME/.config/arrflix-opensubtitles-api.txt \
processes/subtitles/lib/sub-a7d-fetch.py <series-id> --season N [--start E] [--end E]

Pre-flight with DRY_RUN=1. The OS REST key is used only for search (quota-free) to translate library S/E to the show's catalogue numbering.

v2 — OpenSubtitles REST (fallback for v3 misses)

JELLYFIN_TOKEN=<admin-token> \
OPENSUBTITLES_API_KEY=$HOME/.config/arrflix-opensubtitles-api.txt \
OPENSUBTITLES_USER=Caveman5 \
OPENSUBTITLES_PASS=<password> \
processes/subtitles/lib/sub-rest-fetch.py <series-id> --season N [--start E] [--end E]

20 / day cap, resets at 00:00 UTC.

v1 — Jellyfin plugin (when library numbering matches OS)

lib/sub-fetch.sh — see header for env. Counts against the same 20/day cap.

Verify after each batch

ssh user@192.168.0.100 'ls "<media-dir>/" | grep -c "\.eng\.srt$"'

Step 5 — Library scan + de-dup (v1 only)

If you used the v1 plugin path, the metadata-cache copy and the media-folder sidecar both register as subtitle streams in Jellyfin (counted twice). Delete the cache copies:

ssh user@192.168.0.100 'docker exec jellyfin bash -c "find /config/metadata/library -path \"*<show-name>*S0[1-9]E*.eng.srt\" -delete -print"'

v2 writes directly to the media folder so there is no cache copy to clean.

Trigger a validation-only refresh so Jellyfin sees the new sidecars:

ssh user@192.168.0.100 "docker exec jellyfin curl -s -X POST -H 'X-Emby-Token: $TOK' \
  'http://localhost:8096/Items/$SERIES/Refresh?MetadataRefreshMode=ValidationOnly&Recursive=true'"

Confirm one episode has exactly 1 external eng sub stream:

ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \
  'http://localhost:8096/Items/<sample-ep-id>?Fields=MediaStreams'" \
  | python3 -c "import json,sys; subs=[s for s in json.load(sys.stdin).get('MediaStreams',[]) if s['Type']=='Subtitle']; print(len(subs),'sub streams')"
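
The one-liner above counts every subtitle stream, embedded ones included. A stricter sketch that counts only external English streams, assuming the `IsExternal` and `Language` fields are populated on each `MediaStreams` entry; `external_eng_subs` is a hypothetical helper.

```python
import json


def external_eng_subs(item_json):
    """Return the external English subtitle streams of one Jellyfin item."""
    item = json.loads(item_json)
    return [
        s for s in item.get("MediaStreams", [])
        if s.get("Type") == "Subtitle"
        and s.get("IsExternal")          # sidecar file, not embedded
        and s.get("Language") == "eng"
    ]
```

The check passes when `len(external_eng_subs(...))` is exactly 1; a result of 2 after a v1 run usually means the cache copy was not deleted.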

Step 6 — Quality gate

For the run to pass:

  • Coverage: every episode has a matching <base>.eng.srt sidecar
  • Sync sample: at least one episode of each season is opened in Jellyfin web and subs visually align with audio (±1 s) on a known dialogue line
  • Flag check: no .sdh.srt, .forced.srt, or .hi.srt files (machine pick should have filtered)
  • Stream count: Jellyfin shows exactly 1 external eng sub per episode

If any check fails, log it in runs/<show>.md under "breakage" and propose the recipe amendment in CHANGELOG.md.
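
The coverage and flag checks can be run against one season folder in a single pass. A sketch under stated assumptions: videos are `.mkv` or `.mp4`, sidecars follow the `<base>.eng.srt` naming used above, and `check_season_dir` is a hypothetical helper rather than an existing script.

```python
from pathlib import Path

VIDEO_EXTS = {".mkv", ".mp4"}                    # assumed container set
BANNED_SUFFIXES = (".sdh.srt", ".forced.srt", ".hi.srt")


def check_season_dir(media_dir):
    """Return (missing, flagged) file-name lists for one season folder.

    missing: video files without a matching <base>.eng.srt sidecar.
    flagged: subtitle files whose name ends in a banned suffix.
    """
    media_dir = Path(media_dir)
    missing, flagged = [], []
    for f in sorted(media_dir.iterdir()):
        name = f.name.lower()
        if f.suffix.lower() in VIDEO_EXTS:
            if not (media_dir / f"{f.stem}.eng.srt").exists():
                missing.append(f.name)
        elif name.endswith(BANNED_SUFFIXES):
            flagged.append(f.name)
    return missing, flagged
```

Both lists must be empty for the gate to pass. The sync sample and stream-count checks still need Jellyfin itself.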


Quota hygiene

Free OpenSubtitles.com account = 20 downloads / day, resets 00:00 UTC. Plan large series across multiple days, or switch to VIP (~$3/mo, unlimited).
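
Planning a large series across days is simple arithmetic on the 20/day cap. A sketch with a hypothetical `days_needed` helper; it counts today as day 1 even when no quota is left.

```python
import math


def days_needed(episodes, remaining_today, daily_cap=20):
    """Calendar days (including today) to fetch `episodes` subs via OS.

    remaining_today: quota left today, from the "Remaining downloads" log line.
    daily_cap: free-tier cap, 20 per day per this recipe.
    """
    if episodes <= remaining_today:
        return 1
    left_after_today = episodes - remaining_today
    return 1 + math.ceil(left_after_today / daily_cap)
```

For example, 58 episodes with a full quota of 20 gives `days_needed(58, 20) == 3`.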

Quota check:

ssh user@192.168.0.100 'docker logs --tail 200 jellyfin 2>&1 | grep "Remaining downloads" | tail -1'

When quota hits 0 the API returns 0 results, indistinguishable from a real miss. Always check quota before declaring a "no subs" failure.
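
The quota-versus-miss distinction can be made mechanical by reading the last "Remaining downloads" line before classifying an empty result. A sketch that assumes the number follows that phrase on the same log line; the exact plugin log format is not confirmed here, and both function names are hypothetical.

```python
import re

# Assumed shape: "... Remaining downloads: 7" (number on the same line).
REMAINING_RE = re.compile(r"Remaining downloads[^\d\n]*(\d+)")


def remaining_downloads(log_text):
    """Return the most recent remaining-download count, or None if absent."""
    matches = REMAINING_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def classify_empty_result(log_text):
    """Label a zero-result search using the latest quota reading."""
    remaining = remaining_downloads(log_text)
    if remaining is None:
        return "unknown"            # no quota line in the log window
    return "quota-exhausted" if remaining == 0 else "real-miss"
```

Only a "real-miss" result justifies logging a "no subs" failure; "unknown" means widening the `--tail` window and checking again.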