Adds lib/audit-coverage.py: queries Jellyfin live for every series, every episode, and every movie; classifies each by whether the English subtitle comes from a sidecar, embedded stream, or doesn't exist; renders a Markdown report with one-char-per-episode bars for visual scanning. Output file is processes/subtitles/COVERAGE.md, regenerated on demand. v2 sub-rest-fetch.py and v3 sub-a7d-fetch.py now invoke the audit at end of a successful run, so the committed coverage file stays in sync with library state without manual intervention. v3.5 yt-fetch path skips the auto-call since it doesn't speak to Jellyfin directly; run audit manually after copying YT sidecars to nullstone. README.md surfaces the audit at the top so anyone landing in the recipe folder sees current state before starting a run.
# Subtitle acquisition process — v1

Last updated: 2026-05-10
Status: **v3.5**, with four fetch paths (plugin / OS REST / Addic7ed / YouTube auto-CC). American Dad 49/58 + Sassy 5/5. v4 WhisperX planned (ROADMAP H5).

This recipe is written for Claude Code to execute. Each step lists the exact
command, what to verify, and what to do on failure. Background reference for
how Jellyfin and the OpenSubtitles plugin work together lives in
[`docs/03-subtitles.md`](../../docs/03-subtitles.md).

> **Current state:** [`COVERAGE.md`](COVERAGE.md) is the live audit
> (per-show + per-movie). Regenerate at any time:
>
> ```bash
> JELLYFIN_TOKEN=<admin-token> processes/subtitles/lib/audit-coverage.py
> ```
>
> Run after every fetch batch so the committed file stays accurate.
>
> **Read [`STYLE.md`](STYLE.md) first.** Every fetch must hit the
> bar set there: one English `.srt` per episode, plain (no SDH / no MT / no
> AI / no Forced), best-quality release. The picker logic in v1/v2/v3
> mirrors that bar; if a step would violate it, stop and ask before
> downloading.
>
> Stop-gap exception: when the only available source is the v3.5 YouTube
> auto-CC path (lowercase, censored, mangled names), ship the sub but
> **add the show to [`STOPGAP-SUBS.md`](STOPGAP-SUBS.md)** so v4 WhisperX
> picks it up later.

---

## Prereqs (verify before running)

| Check | How |
|---|---|
| OpenSubtitles plugin v20 installed + Active | `docker exec jellyfin ls /config/plugins \| grep -i opensub` |
| Plugin creds saved (`Caveman5`) | `docker exec jellyfin grep -E 'Username\|CredentialsInvalid' /config/plugins/configurations/Jellyfin.Plugin.OpenSubtitles.xml`; expect `Caveman5` and `false` |
| TV library has `SaveSubtitlesWithMedia=true`, `SubtitleDownloadLanguages=["eng"]`, `RequirePerfectSubtitleMatch=false` | `curl -s -H "X-Emby-Token: $TOK" http://localhost:8096/Library/VirtualFolders` |
| Free-tier quota remaining today (≥ episode count, else plan multi-day) | `docker logs --tail 200 jellyfin 2>&1 \| grep "Remaining downloads" \| tail -1` (free = 20/day, resets 00:00 UTC) |
| Source files have audio language tag | run `ffprobe` on a sample episode (see Step 1) |

If any prereq fails, stop. Fix it before running the recipe.

---

## Step 1 — Probe the source

Pick one episode of the target show. Run `ffprobe` on it:

```bash
# "Input" keeps the line that names the container, which the run log asks for.
ssh user@192.168.0.100 'docker exec jellyfin /usr/lib/jellyfin-ffmpeg/ffprobe -hide_banner "<path-to-mkv>" 2>&1 | grep -E "Input|Stream|Duration"'
```
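
The probe-to-branch decision in this step can also be scripted. The sketch below is a minimal, hypothetical helper: the name `pick_branch` and its return labels are not part of the recipe's `lib/` scripts. It assumes the probe is run with `ffprobe -v error -print_format json -show_streams` rather than the grep filter, and it assumes text-based English subs take precedence over bitmap ones, which the recipe does not spell out.

```python
# ffmpeg codec names for the bitmap subtitle formats (PGS and VobSub).
BITMAP_CODECS = {"hdmv_pgs_subtitle", "dvd_subtitle"}
ENGLISH = {"eng", "en"}


def _lang(stream):
    """Lower-cased language tag of one ffprobe stream, or '' if untagged."""
    return (stream.get("tags") or {}).get("language", "").lower()


def pick_branch(probe):
    """Map parsed `ffprobe -show_streams` JSON to a branch label.

    Labels ("skip", "ocr", "simple", "foreign-dub", "unknown") are
    illustrative names for the branches of the probe table.
    """
    streams = probe.get("streams", [])
    subs = [s for s in streams if s.get("codec_type") == "subtitle"]
    audio = [s for s in streams if s.get("codec_type") == "audio"]

    # Assumption: a text-based English sub wins over any bitmap sub.
    if any(_lang(s) in ENGLISH and s.get("codec_name") not in BITMAP_CODECS
           for s in subs):
        return "skip"
    if any(s.get("codec_name") in BITMAP_CODECS for s in subs):
        return "ocr"
    if subs:
        return "unknown"  # embedded non-English text subs: not in the table

    audio_langs = {_lang(a) for a in audio} - {"", "und"}
    if audio_langs & ENGLISH:
        return "simple"
    if audio_langs:
        return "foreign-dub"
    return "unknown"  # no audio language tag: the prereq check failed
```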

Record in the run log:

- video codec + resolution + frame rate
- audio language tag(s)
- whether any subtitle streams are embedded
- container

Decide based on probe:

| Probe result | Branch |
|---|---|
| English audio, no embedded subs | "simple" path (this recipe) |
| Foreign-dub audio, no embedded subs | "foreign-dub" path (deferred to v?) |
| Embedded English subs already present | skip; Jellyfin will use them |
| Embedded PGS/VobSub bitmap subs | "OCR" path (deferred to v?) |

---

## Step 2 — Resolve series + episode IDs

```bash
TOK=<jellyfin-admin-token>
SERIES_NAME='American Dad'
# Search by name; spaces in the name become "+" in the query string.
ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \
  'http://localhost:8096/Items?searchTerm=${SERIES_NAME// /+}&IncludeItemTypes=Series&Recursive=true&Limit=3'" \
  | python3 -c "import json,sys; [print(x['Id'],x['Name']) for x in json.load(sys.stdin).get('Items',[])]"
```

Record the series `Id`, then list its episodes:

```bash
SERIES=<series-id>
# Print one line per episode: Id, SxxEyy, name.
ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \
  'http://localhost:8096/Items?ParentId=$SERIES&IncludeItemTypes=Episode&Recursive=true&Fields=Path,ParentIndexNumber,IndexNumber'" \
  | python3 -c "import json,sys; [print(e['Id'],'S%02dE%02d'%(e['ParentIndexNumber'],e['IndexNumber']),e['Name']) for e in json.load(sys.stdin)['Items']]"
```
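
The `${SERIES_NAME// /+}` substitution only handles spaces, and the episode one-liner raises if an item lacks a season or episode number. As a hedged alternative, the sketch below builds the same search URL with full percent-encoding and formats the label defensively. The helper names `series_search_url` and `episode_label` are hypothetical, and `BASE` simply repeats the host and port used in the curl commands.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8096"  # same endpoint as the curl commands


def series_search_url(name, limit=3):
    """Build the /Items series search URL with a safely encoded name."""
    query = urlencode({
        "searchTerm": name,          # urlencode escapes &, ', spaces, etc.
        "IncludeItemTypes": "Series",
        "Recursive": "true",
        "Limit": limit,
    })
    return f"{BASE}/Items?{query}"


def episode_label(ep):
    """Format SxxEyy, tolerating items without season/episode numbers."""
    season = ep.get("ParentIndexNumber")
    number = ep.get("IndexNumber")
    if season is None or number is None:
        return "S??E??"
    return "S%02dE%02d" % (season, number)


if __name__ == "__main__":
    print(series_search_url("Law & Order"))
    print(episode_label({"ParentIndexNumber": 1, "IndexNumber": 7}))
```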
---
## Step 3 — Pick fetch path

Four paths are in use, ordered cheapest-quota-cost-first; the planned v4 is listed last:

| Path | Cost / day cap | Coverage | Tool |
|---|---|---|---|
| **v3 Addic7ed** | free, no daily cap (anon) | English-only; near-complete on broadcast US shows; spotty on animated specials / niche titles | `lib/sub-a7d-fetch.py` |
| **v2 OS REST** | 20 / day on free OS account | best overall coverage; survives any S/E numbering quirk via per-ep `imdb_id` | `lib/sub-rest-fetch.py` |
| **v1 plugin** | counts against same OS 20/day | only works when library numbering matches OS catalogue (e.g. fails on American Dad past S01E07) | `lib/sub-fetch.sh` |
| **v3.5 YouTube auto-CC** | free, rate-limited only | for shows distributed YouTube-first (no community subs anywhere); produces lowercase, no-punctuation, name-mangled subs; **stop-gap, violates STYLE.md** | `lib/sub-yt-fetch.sh` + `lib/yt-clean.py` |
| **v4 WhisperX (planned)** | local CPU/GPU time | full-quality auto-transcription, restores STYLE.md bar for niche shows | TBD `lib/sub-whisperx-fetch.py` (ROADMAP H5) |

Default: try **v3** first to spare quota; fall back to **v2** for episodes
v3 misses or for non-English needs. **v1** stays for shows where simple
plugin auto-fetch is enough. **v3.5** is the stop-gap when nothing exists
on community providers; **v4** replaces v3.5 once the GPU node is set up.

Quick check whether the v1 plugin will suffice (skip the rest if yes):

1. Pick the first episode of season 2 in the library.
2. Run `curl -s -H "X-Emby-Token: $TOK" "http://localhost:8096/Items/$EP/RemoteSearch/Subtitles/eng"` (read-only). Use double quotes so the shell expands `$TOK` and `$EP`.
3. If results > 0, v1 works.
4. If results == 0 but the show exists on opensubtitles.com, the numbering is mismatched (e.g. American Dad: the library uses Hulu numbering, where S1 has 7 episodes; OS numbers it differently). Use **v3**, then **v2** for misses.
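
The default ordering (free path first, quota path only for misses) amounts to a small fallback loop. The sketch below is illustrative only: `fetch_with_fallback` and the two fetcher callables are hypothetical stand-ins, not the interfaces of `sub-a7d-fetch.py` or `sub-rest-fetch.py`, and it assumes that only a successful download consumes quota.

```python
def fetch_with_fallback(episodes, free_fetch, quota_fetch, quota_left):
    """Try the free path per episode; spend quota only on its misses.

    free_fetch / quota_fetch: hypothetical callables taking an episode
    label and returning True when a subtitle was saved.
    quota_left: downloads still available today on the capped path.
    """
    result = {"free": [], "quota": [], "deferred": [], "missed": []}
    for ep in episodes:
        if free_fetch(ep):
            result["free"].append(ep)
        elif quota_left <= 0:
            # Out of quota: do not record a miss, retry after the reset.
            result["deferred"].append(ep)
        elif quota_fetch(ep):
            quota_left -= 1  # assumption: only downloads count
            result["quota"].append(ep)
        else:
            result["missed"].append(ep)
    return result
```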
---
## Step 4 — Fetch subs per episode
### v3 — Addic7ed (default, free)

```bash
JELLYFIN_TOKEN=<admin-token> \
OPENSUBTITLES_API_KEY=$HOME/.config/arrflix-opensubtitles-api.txt \
processes/subtitles/lib/sub-a7d-fetch.py <series-id> --season N [--start E] [--end E]
```

Run a pre-flight pass with `DRY_RUN=1` first. The OS REST key is used only for
search, which does not consume download quota, to translate library S/E numbers
to the show's catalogue numbering.

### v2 — OpenSubtitles REST (fallback for v3 misses)

```bash
JELLYFIN_TOKEN=<admin-token> \
OPENSUBTITLES_API_KEY=$HOME/.config/arrflix-opensubtitles-api.txt \
OPENSUBTITLES_USER=Caveman5 \
OPENSUBTITLES_PASS=<password> \
processes/subtitles/lib/sub-rest-fetch.py <series-id> --season N [--start E] [--end E]
```

20 / day cap, resets at 00:00 UTC.

### v1 — Jellyfin plugin (when library numbering matches OS)

Run `lib/sub-fetch.sh`; see the script header for the required environment
variables. It counts against the same 20/day cap.

### Verify after each batch

```bash
# Count the English sidecars; the anchored pattern avoids matching other names.
ssh user@192.168.0.100 'ls "<media-dir>/" | grep -c "\.eng\.srt$"'
```
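
A count alone does not say which episodes are still uncovered. As a rough sketch, the helper below takes the file names from an `ls` of the media directory and reports videos without a matching `<base>.eng.srt`, plus any sidecar variants that the Step 6 quality gate rejects. The name `audit_listing` and the `VIDEO_EXTS` set are assumptions made for illustration.

```python
from pathlib import PurePath

VIDEO_EXTS = {".mkv", ".mp4", ".avi"}               # assumption: extend as needed
REJECTED = (".sdh.srt", ".forced.srt", ".hi.srt")   # variants named in Step 6


def audit_listing(names):
    """Return (videos missing an .eng.srt sidecar, rejected sidecar files)."""
    names = list(names)
    lower = {n.lower() for n in names}
    missing = sorted(
        n for n in names
        if PurePath(n).suffix.lower() in VIDEO_EXTS
        # stem drops only the final extension, so dotted release names work
        and (PurePath(n).stem + ".eng.srt").lower() not in lower
    )
    flagged = sorted(n for n in names if n.lower().endswith(REJECTED))
    return missing, flagged


if __name__ == "__main__":
    import sys
    # Usage sketch: ssh user@host 'ls "<media-dir>/"' | python3 this_file.py
    missing, flagged = audit_listing(line.rstrip("\n") for line in sys.stdin)
    print("missing sidecar:", *missing, sep="\n  ")
    print("rejected variants:", *flagged, sep="\n  ")
```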
---
## Step 5 — Library scan + de-dup (v1 only)

If you used the v1 plugin path, the metadata-cache copy and the media-folder
sidecar both register as subtitle streams in Jellyfin (counted twice).
Delete the cache copies (the `S0[1-9]` pattern only covers seasons 1-9):

```bash
ssh user@192.168.0.100 'docker exec jellyfin bash -c "find /config/metadata/library -path \"*<show-name>*S0[1-9]E*.eng.srt\" -delete -print"'
```

v2 writes directly to the media folder so there is no cache copy to clean.

Trigger a validation-only refresh so Jellyfin sees the new sidecars:

```bash
ssh user@192.168.0.100 "docker exec jellyfin curl -s -X POST -H 'X-Emby-Token: $TOK' \
  'http://localhost:8096/Items/$SERIES/Refresh?MetadataRefreshMode=ValidationOnly&Recursive=true'"
```

Confirm one episode has exactly 1 external eng sub stream:

```bash
# Count only external English subtitle streams, not every subtitle stream.
ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \
  'http://localhost:8096/Items/<sample-ep-id>?Fields=MediaStreams'" \
  | python3 -c "import json,sys; subs=[s for s in json.load(sys.stdin).get('MediaStreams',[]) if s['Type']=='Subtitle' and s.get('IsExternal') and s.get('Language')=='eng']; print(len(subs),'external eng sub streams')"
```
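
The same external-versus-embedded distinction can be applied to every episode to summarise a season at a glance. This is a minimal sketch and not the code of `lib/audit-coverage.py`: the function names and the bar glyphs are invented for illustration, and it relies on the `Type`, `Language`, and `IsExternal` fields of Jellyfin's `MediaStreams` entries.

```python
def classify_english_subs(media_streams):
    """Return 'sidecar', 'embedded', or 'missing' for one item's streams."""
    eng = [s for s in media_streams
           if s.get("Type") == "Subtitle" and s.get("Language") == "eng"]
    if any(s.get("IsExternal") for s in eng):
        return "sidecar"
    return "embedded" if eng else "missing"


def coverage_bar(states, glyphs=None):
    """Render one character per episode (glyph choice is hypothetical)."""
    glyphs = glyphs or {"sidecar": "#", "embedded": "=", "missing": "."}
    return "".join(glyphs[state] for state in states)


if __name__ == "__main__":
    season = [
        [{"Type": "Subtitle", "Language": "eng", "IsExternal": True}],
        [{"Type": "Subtitle", "Language": "eng", "IsExternal": False}],
        [{"Type": "Audio", "Language": "eng"}],
    ]
    print(coverage_bar(classify_english_subs(ep) for ep in season))
```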
---
## Step 6 — Quality gate

For the run to pass:

- [ ] **Coverage**: every episode has a matching `<base>.eng.srt` sidecar
- [ ] **Sync sample**: at least one episode of each season is opened in
      Jellyfin web and subs visually align with audio (±1 s) on a known dialogue
      line
- [ ] **Flag check**: no `.sdh.srt`, `.forced.srt`, or `.hi.srt` files
      (the automated picker should have filtered these out)
- [ ] **Stream count**: Jellyfin shows exactly 1 external eng sub per episode

If any check fails, log it in `runs/<show>.md` under "breakage" and propose
the recipe amendment in `CHANGELOG.md`.

---

## Quota hygiene

A free OpenSubtitles.com account allows 20 downloads / day, resetting at 00:00 UTC.
Plan large series across multiple days, or switch to VIP (~$3/mo, unlimited).

Quota check:

```bash
ssh user@192.168.0.100 'docker logs --tail 200 jellyfin 2>&1 | grep "Remaining downloads" | tail -1'
```
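
Multi-day planning from that figure is simple arithmetic. The sketch below is a hypothetical helper, not part of `lib/`: `parse_remaining` assumes the number follows the phrase "Remaining downloads" on the same log line (the exact log format is not confirmed here), and `days_needed` counts today as day one.

```python
import math
import re

DAILY_CAP = 20  # free OpenSubtitles.com tier, as stated in this section


def parse_remaining(log_line):
    """Extract the remaining-downloads number, or None if absent.

    Assumption: the count is the first run of digits after the phrase.
    """
    match = re.search(r"Remaining downloads\D*(\d+)", log_line)
    return int(match.group(1)) if match else None


def days_needed(episodes, remaining_today=DAILY_CAP, cap=DAILY_CAP):
    """Calendar days (including today) to download `episodes` subtitles."""
    if episodes <= 0:
        return 0
    remaining_today = max(0, min(remaining_today, cap))
    left_after_today = max(episodes - remaining_today, 0)
    return 1 + math.ceil(left_after_today / cap)


if __name__ == "__main__":
    # A 58-episode show with a full quota today.
    print(days_needed(58))
```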

When quota hits 0, the API returns 0 results, which is indistinguishable from a
real miss. Always check quota before declaring a "no subs" failure.