processes/subtitles: v3.5 YouTube auto-CC stop-gap + Sassy 5/5

Adds lib/sub-yt-fetch.sh (yt-dlp wrapper) and lib/yt-clean.py (collapses
YouTube's rolling-window auto-caption VTT into a flat SRT). Intended for
shows distributed YouTube-first that have no community subs anywhere.
That gap was verified by three parallel research agents covering
OpenSubtitles REST, OS legacy, Addic7ed, SubDL, SubSource, and Podnapisi
for the 5 niche shows in the library, plus a price-vs-coverage analysis
of OpenSubtitles VIP.

Findings: OS VIP would not have helped on the niche shows (it is
download-cap relief, not coverage unlock; same catalog as free). All 4
Jarrad Wright shows in the library (Sassy, Big Lez Saga, Donny &
Clarence, Mike Nolan) live on the same channel and have only YouTube
auto-CC available. v3.5 ships those, explicitly violating STYLE.md
'best quality' as a tracked stop-gap.

Sassy the Sasquatch S01 5/5 episodes subbed with cleaned auto-CC. Mike
Nolan special-case noted: a 'COMPLETE SEASON | SUBTITLES' YT upload from
Oct 2025 carries hand-typed CCs and should be preferred over per-episode
auto-CC when subbing that show.

ROADMAP H5 added: v4 WhisperX large-v3 on the friend RTX 4080 node will
regenerate the v3.5 stop-gap with proper-noun-prompted transcription
(~4-6% WER vs ~12% YT auto-CC) and restore the STYLE.md quality bar.
H1 OpenSubtitles credentials marked done (was completed 2026-05-09).
s8n 2026-05-10 01:05:03 +01:00
parent d9d6bdba64
commit eb71cf6beb
7 changed files with 274 additions and 6 deletions


@@ -24,10 +24,11 @@ Last revised: **2026-05-08**
| # | Item | Effort | Blocker |
|---|---|---|---|
| H1 | OpenSubtitles credentials (auth fixes log spam too — doc 13 win 2) | S | **owner signs up at opensubtitles.com** |
| H1 | ~~OpenSubtitles credentials~~ — done 2026-05-09; `Caveman5` saved + free API key at `~/.config/arrflix-opensubtitles-api.txt` | — | done |
| H2 | GPU transcode (nvidia driver kernel module + container toolkit + SecureBoot signing) | L | **owner sudo + reboot** |
| H3 | Apply `bin/force-english-all-users.sh` (German Play button breaks UX for non-English browsers) | S | none — owner runs |
| H4 | Backup `/home/docker/jellyfin/config/` off-host (no automated backup yet) | M | strategy decision |
| H5 | **v4 subtitle path: WhisperX large-v3 on friend RTX 4080 node**. Regenerate Sassy + Big Lez Saga + Donny & Clarence + Mike Nolan with proper-noun prompts (replaces v3.5 YT auto-CC stop-gap). New helper at `processes/subtitles/lib/sub-whisperx-fetch.py`. WhisperX install on 100.64.0.3 (per memory `project_friend_gpu.md`, currently offline 2d); per-show prompt yaml at `processes/subtitles/prompts/<show>.yaml` (recurring proper nouns). Expected ~4-6 % WER vs ~12 % for YT auto-CC; restores STYLE.md "best quality" bar. See `processes/subtitles/runs/sassy-the-sasquatch.md` for context. | M | friend node back online + WhisperX setup |
## 🟨 Open — Medium value


@@ -21,4 +21,4 @@ amendment for a full sweep.
| Process | Status | Last touched |
|---|---|---|
| [`subtitles/`](subtitles/) | v3 — Addic7ed (free, no daily cap) added as primary, OS REST as fallback. AD 49/58 subbed; remaining 9 land via OS REST after quota reset | 2026-05-09 |
| [`subtitles/`](subtitles/) | v3.5 — YouTube auto-CC added as stop-gap for shows with no community subs anywhere (verified via 3-agent research run). AD 49/58 + Sassy 5/5. v4 WhisperX planned (ROADMAP H5) | 2026-05-10 |


@@ -101,3 +101,44 @@ helper at `lib/sub-a7d-fetch.py`. Runs alongside v2; pick whichever fits.
bot detection and short IP throttle (~1 hour). The script makes no
effort at jittering / backoff
- No automated sync-quality check; recipe Step 6 still manual
## v3.5 — 2026-05-10 (stop-gap path for niche YouTube-distributed shows)
For shows that distribute on YouTube and have no community subs anywhere
(verified by parallel research agents covering OS REST / OS legacy /
Addic7ed / SubDL / SubSource / Podnapisi for 5 niche shows), pull the
YouTube auto-CC track via yt-dlp and clean it.
- New helper: `lib/sub-yt-fetch.sh` (yt-dlp wrapper) + `lib/yt-clean.py`
(rolling-window VTT → flat SRT cleaner)
- First applied to **Sassy the Sasquatch (2022)**, S01 5/5 episodes
- Reusable for the rest of the Big Lez universe (same channel hosts
Donny & Clarence, Mike Nolan, Big Lez Saga)
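
The rolling-window collapse can be illustrated on a tiny hand-made sample. This is a minimal sketch of the idea rather than the shipped `lib/yt-clean.py`: the `collapse` name and the `(start, end, lines)` cue shape are hypothetical, and it assumes inline timing tags have already been stripped.

```python
def collapse(cues):
    """Keep only the newest (bottom) line of each cue and merge repeats.

    `cues` is a list of (start, end, lines) tuples; this shape is an
    assumption made for the illustration.
    """
    merged = []
    for start, end, lines in cues:
        visible = [ln.strip() for ln in lines if ln.strip()]
        if not visible:
            continue
        newest = visible[-1]  # rolling window: fresh text sits at the bottom
        if merged and merged[-1][2] == newest:
            # Bridge cue repeating the previous text: extend, do not duplicate.
            merged[-1] = (merged[-1][0], end, newest)
        else:
            merged.append((start, end, newest))
    return merged


# Hand-made sample mimicking the auto-CC pattern: a spoken cue, a 10 ms
# bridge cue repeating it, then a cue that carries the stale line on top.
sample = [
    ("00:00:00.000", "00:00:02.000", [" ", "hello world"]),
    ("00:00:02.000", "00:00:02.010", ["hello world", " "]),
    ("00:00:02.010", "00:00:04.000", ["hello world", "next line"]),
]
flat = collapse(sample)
for start, end, text in flat:
    print(start, "-->", end, text)
```

Three raw cues come out as two flat ones: the bridge cue only extends the end time of the first, and the stale top line of the third is dropped.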
### v3.5 known limits — explicitly violates STYLE.md "best quality"
- Lowercase, no punctuation, no sentence segmentation
- Proper-noun mishears (Sassy → "sasha", Big Lez → "Big Less")
- Profanity censored as `[ __ ]` by YouTube's ASR
- Will be replaced wholesale by v4 WhisperX (see ROADMAP H5)
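
These limits can be counted per file, which may help when comparing a v3.5 sub against a later v4 rebuild. The sketch below is illustrative only: `audit_srt_text`, its report keys, and the sample dialogue are made up for this example and are not part of the recipe.

```python
import re

CENSOR = "[ __ ]"  # placeholder YouTube's ASR emits for censored profanity


def audit_srt_text(srt_text):
    """Count v3.5 quality symptoms in SRT text (illustrative helper)."""
    text_lines = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and timing lines.
        if not line or line.isdigit() or "-->" in line:
            continue
        text_lines.append(line)
    return {
        "text_lines": len(text_lines),
        "censored": sum(line.count(CENSOR) for line in text_lines),
        "no_capitals": sum(1 for line in text_lines if line == line.lower()),
        "no_punctuation": sum(
            1 for line in text_lines
            if not re.search(r"[.,!?]", line.replace(CENSOR, ""))
        ),
    }


# Made-up sample in the style of cleaned auto-CC output.
sample = """1
00:00:00,000 --> 00:00:02,010
what the [ __ ] is that

2
00:00:02,010 --> 00:00:04,000
it's sasha mate
"""
report = audit_srt_text(sample)
print(report)
```

Proper-noun mishears are not detected here; that would need a per-show list of expected names, which this sketch does not assume exists.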
### v3.5 also discovered
- **OpenSubtitles VIP would not have helped.** Verified: VIP is download-cap
relief and ad removal, not coverage unlock. Same catalog as free.
- **Mike Nolan special-case**: a YouTube upload titled
"MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES" (Oct 2025) carries
hand-typed CCs. When subbing Mike Nolan, prefer ripping that single
upload over the per-episode auto-CC playlist path.
## v4 — planned (see ROADMAP H5)
Path: **WhisperX large-v3 on friend RTX 4080 node** (`100.64.0.3`).
- Replaces v3.5 stop-gap with full-quality auto-transcription
- Per-show proper-noun prompt at `processes/subtitles/prompts/<show>.yaml`
- New helper: `lib/sub-whisperx-fetch.py` (TBD)
- Expected WER: ~4-6% on noisy / animated dialogue (vs ~12% YT auto-CC)
- Restores STYLE.md "one clean English sub per ep" bar for niche shows
- Cloud fallback: ElevenLabs Scribe v2 (~$0.40/hr, ~2.2% WER) for any
episode WhisperX still misses
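
For reference, the WER figures above are word-level edit distance divided by the number of reference words. A minimal sketch of that metric follows; the function name and the sample sentences are invented for illustration, and real evaluations usually normalise punctuation and casing more carefully than this.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic-programming Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# Made-up reference line and an auto-CC style hypothesis with two mishears.
reference = "sassy went to see big lez about the mushrooms today"
hypothesis = "sasha went to see big less about the mushrooms today"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.0%}")
```

Here two substituted words out of ten give a WER of 20 %, which shows how quickly proper-noun mishears alone inflate the figure on name-heavy dialogue.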


@@ -1,7 +1,7 @@
# Subtitle acquisition process — v1
Last updated: 2026-05-09
Status: **v3** — three fetch paths (plugin / OS REST / Addic7ed). American Dad 49/58 subbed; remaining 9 land via OS REST after quota reset.
Last updated: 2026-05-10
Status: **v3.5** — four fetch paths (plugin / OS REST / Addic7ed / YouTube auto-CC). American Dad 49/58 + Sassy 5/5. v4 WhisperX planned (ROADMAP H5).
This recipe is written for Claude Code to execute. Each step lists the exact
command, what to verify, and what to do on failure. Background reference for
@@ -79,17 +79,20 @@ ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \
## Step 3 — Pick fetch path
Three paths, ordered cheapest-quota-cost-first:
Four paths, ordered cheapest-quota-cost-first:
| Path | Cost / day cap | Coverage | Tool |
|---|---|---|---|
| **v3 Addic7ed** | free, no daily cap (anon) | English-only; near-complete on broadcast US shows; spotty on animated specials / niche titles | `lib/sub-a7d-fetch.py` |
| **v2 OS REST** | 20 / day on free OS account | best overall coverage; survives any S/E numbering quirk via per-ep `imdb_id` | `lib/sub-rest-fetch.py` |
| **v1 plugin** | counts against same OS 20/day | only works when library numbering matches OS catalogue (e.g. fails on American Dad past S01E07) | `lib/sub-fetch.sh` |
| **v3.5 YouTube auto-CC** | free, ratelimited only | for shows distributed YouTube-first (no community subs anywhere); produces lowercase, no-punctuation, name-mangled subs — **stop-gap, violates STYLE.md** | `lib/sub-yt-fetch.sh` + `lib/yt-clean.py` |
| **v4 WhisperX (planned)** | local CPU/GPU time | full-quality auto-transcription, restores STYLE.md bar for niche shows | TBD `lib/sub-whisperx-fetch.py` (ROADMAP H5) |
Default: try **v3** first to spare quota; fall back to **v2** for episodes
v3 misses or for non-English needs. **v1** stays for shows where simple
plugin auto-fetch is enough.
plugin auto-fetch is enough. **v3.5** is the stop-gap when nothing exists
on community providers; **v4** replaces v3.5 once the GPU node is set up.
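
The default order described above can be written down as a small decision function. This is only a sketch of the prose rule: the function name and its boolean inputs are hypothetical, and in practice each flag would come from a manual or scripted probe of the provider.

```python
def pick_fetch_path(*, plugin_suffices, english_only,
                    on_addic7ed, on_opensubtitles, on_youtube):
    """Return the recipe path to try for a show, or None if nothing fits.

    All inputs are assumed to be known up front; that is a simplification.
    """
    if plugin_suffices:
        return "v1"    # simple plugin auto-fetch is enough
    if english_only and on_addic7ed:
        return "v3"    # free, no daily cap: spares OS quota
    if on_opensubtitles:
        return "v2"    # covers v3 misses and non-English needs
    if on_youtube:
        return "v3.5"  # stop-gap: nothing on community providers
    return None


# A YouTube-first niche show with no community subs anywhere.
choice = pick_fetch_path(plugin_suffices=False, english_only=True,
                         on_addic7ed=False, on_opensubtitles=False,
                         on_youtube=True)
print(choice)
```

Once v4 is available it would take the place of the `"v3.5"` branch.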
Quick check whether v1 plugin will suffice (skip the rest if yes):


@@ -0,0 +1,68 @@
#!/usr/bin/env bash
# Subtitle fetcher v3.5 — YouTube auto-captions via yt-dlp + cleaner.
#
# For shows that distribute on YouTube and have no community subs anywhere
# else (e.g. Big Lez Show universe: Sassy the Sasquatch, Donny & Clarence,
# Mike Nolan, Big Lez Saga). yt-dlp pulls the en-orig auto-CC track, the
# rolling-window VTT goes through yt-clean.py to deduplicate into a flat
# SRT, and the result is dropped on nullstone with the library filename.
#
# Quality caveats (per processes/subtitles/STYLE.md fallback policy):
# - lowercase, no punctuation
# - YouTube ASR mishears proper nouns (e.g. "Sassy" → "sasha")
# - profanity is censored as "[ __ ]"
# - capitalisation / sentence segmentation is absent
#
# These subs ship as a stop-gap. v4 (WhisperX large-v3 on the 4080 friend
# node) replaces them with full-quality transcriptions; see ROADMAP.
#
# Usage:
# sub-yt-fetch.sh <playlist-or-channel-url> <out-dir> <name-template>
#
# Example (Sassy):
# sub-yt-fetch.sh \
# 'https://www.youtube.com/playlist?list=PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj' \
# /tmp/sassy-yt \
# 'Sassy the Sasquatch (2022) - S01E%(playlist_index)02d - %(title)s'
#
# After fetch: rename / copy each .en.srt to nullstone with the canonical
# library filename (`<videobasename>.eng.srt`). For now this is manual —
# automate when the next show comes through.
set -euo pipefail

PLAYLIST="${1:?playlist or channel URL required}"
OUTDIR="${2:?output directory required}"
NAMETMPL="${3:-S%(playlist_index)02d - %(title)s}"

mkdir -p "$OUTDIR"

if ! command -v yt-dlp >/dev/null; then
    echo "ERROR: yt-dlp not installed (pip install yt-dlp)" >&2
    exit 1
fi

# The cleaner is run via python3 below, so it only has to exist (no exec bit
# needed). Check before downloading so a missing cleaner fails fast.
CLEANER="$(dirname "$0")/yt-clean.py"
if [[ ! -f "$CLEANER" ]]; then
    echo "ERROR: $CLEANER not found" >&2
    exit 2
fi

# Pull raw VTT auto-CC, no video, en-orig only (matches en bytewise but is the
# canonical track to request).
yt-dlp --skip-download --write-auto-subs --sub-langs "en-orig" \
    --sub-format vtt \
    --sleep-requests 1 --sleep-subtitles 2 \
    -o "$OUTDIR/${NAMETMPL}-raw.%(ext)s" \
    "$PLAYLIST"

# Convert each raw VTT to clean SRT
shopt -s nullglob
for vtt in "$OUTDIR"/*-raw.en-orig.vtt; do
    out="${vtt%-raw.en-orig.vtt}.en.srt"
    python3 "$CLEANER" "$vtt" "$out"
    echo "OK $out"
done

echo
echo "next: copy each .en.srt to nullstone with library filename, then library scan."


@@ -0,0 +1,56 @@
#!/usr/bin/env python3
"""Clean YouTube auto-caption VTT into a flat SRT with no rolling-window dupes."""
import pathlib
import re
import sys

TIMING = re.compile(r'(\d{2}:\d{2}:\d{2}[.,]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[.,]\d{3})')


def parse_vtt(text):
    """Yield (start, end, line) tuples, dropping inline timing tags and empty lines."""
    blocks = re.split(r'\n\n+', text.strip())
    for b in blocks:
        if 'WEBVTT' in b or b.startswith('Kind:') or b.startswith('Language:'):
            continue
        m = TIMING.search(b)
        if not m:
            continue
        start, end = m.group(1), m.group(2)
        # Strip cue settings and inline <00:..><c>...</c> tags
        body = b[m.end():].strip()
        body = re.sub(r'<\d{2}:\d{2}:\d{2}\.\d{3}>', '', body)
        body = re.sub(r'</?c[^>]*>', '', body)
        body = re.sub(r'align:\S+|position:\S+', '', body).strip()
        # Last non-empty line is "new" content (rolling window puts the freshly spoken line at bottom)
        lines = [ln.strip() for ln in body.split('\n') if ln.strip()]
        if not lines:
            continue
        yield start, end, lines[-1]


def to_srt_time(t):
    """Convert a VTT timestamp (dot before ms) to SRT form (comma before ms)."""
    return t.replace('.', ',')


def to_seconds(t):
    """Convert an SRT timestamp HH:MM:SS,mmm to seconds as a float."""
    h, m, rest = t.split(':')
    sec, ms = rest.split(',')
    return int(h) * 3600 + int(m) * 60 + int(sec) + int(ms) / 1000


def merge(events):
    """Drop the 10ms 'gap' cues and merge consecutive identical text."""
    out = []
    for s, e, txt in events:
        # Skip the bridge cue with same text already on top
        if out and out[-1][2] == txt:
            out[-1] = (out[-1][0], to_srt_time(e), txt)  # extend
            continue
        out.append((to_srt_time(s), to_srt_time(e), txt))
    # Second pass to drop micro-cues (shorter than 50ms)
    return [(s, e, txt) for s, e, txt in out if to_seconds(e) - to_seconds(s) >= 0.05]


def write_srt(events, path):
    # Explicit UTF-8 so output does not depend on the host locale
    with open(path, 'w', encoding='utf-8') as f:
        for i, (s, e, txt) in enumerate(events, 1):
            f.write(f"{i}\n{s} --> {e}\n{txt}\n\n")


if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit(f"usage: {sys.argv[0]} <in.vtt> <out.srt>")
    vtt = pathlib.Path(sys.argv[1]).read_text(encoding='utf-8')
    events = merge(parse_vtt(vtt))
    write_srt(events, sys.argv[2])
    print(f"wrote {len(events)} cues -> {sys.argv[2]}")


@@ -0,0 +1,99 @@
# Subtitle run — `Sassy the Sasquatch (2022)`
Recipe version: v3.5 — YouTube auto-CC via yt-dlp + cleaner (v4 WhisperX planned, see ROADMAP)
Run date: 2026-05-10
Operator: Claude Code @ onyx session, ai-lab cwd
## Source
| Field | Value |
|---|---|
| Episodes | 5 (S01 only) |
| Container | mkv |
| Video | AV1 Main, 1920×1080, 29.97 fps |
| Audio | `eng` Opus stereo (default) |
| Embedded subs | none (only font / cover-art attachments) |
| Existing sidecars | none |
| Runtime | ~11:20 per episode |
| Distribution | YouTube (THE BIG LEZ SHOW OFFICIAL channel, creator: Jarrad Wright) |
Niche-show indie animation. Same channel hosts Donny & Clarence Show, Mike
Nolan Show, Big Lez Saga — all four shows in our library are Jarrad Wright
productions distributed YouTube-first.
## Series + library context
- Series Id: `b2d1afd8a4a30c59adb42ccaf47376c2`
- Library: `767bffe4f11c93ef34b805451a696a4e` (TV Shows, `/media/tv`)
- IMDB series: `tt21209936`
- TVDB series: `421839`
- Per-episode IMDB ids: only S01E01 (`tt21215354`) — rest blank in TVDB
## Coverage probe — paid + free providers
Three parallel research agents (2026-05-10) checked every realistic source
before falling back to YouTube:
| Provider | Hits |
|---|---|
| OpenSubtitles.com REST (`parent_imdb_id=21209936`) | 1 — `SASSY THE SASQUATCH.Web-DL.1080p.en` S01E01, **HI-flagged** |
| OpenSubtitles.org legacy XML-RPC | 0 (account login 401 anyway) |
| Addic7ed | 0 |
| SubDL | 0 (`subtitles_count: 0`) |
| SubSource (Subscene successor) | 0 |
| Podnapisi | 0 |
| OS VIP upgrade | **would not unlock anything** — VIP is download-cap relief, not coverage. Same catalog as free. |
Conclusion: nothing exists outside YouTube. Buying VIP would not help; the
honest path is auto-generated subs.
## Outcome
| Season | Eps | Subs fetched | Quality | Notes |
|---|---|---|---|---|
| S01 | 5 | 5 / 5 | YT auto-CC stop-gap (lowercase, no punctuation, names mangled) | Cleaned via `lib/yt-clean.py`. v4 WhisperX rebuild planned |
Net: **5 / 5 (100 %)** — but at the lowest tier of the USER-G quality bar.
## Pipeline used
1. `yt-dlp --skip-download --write-auto-subs --sub-langs en-orig` against
the official Sassy playlist (`PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj`) →
raw VTT per episode in `/tmp/sassy-research/`.
2. `lib/yt-clean.py` collapses the rolling-window VTT (each cue carries 2-3
stale lines plus the freshly-spoken bottom line) into deduplicated SRT.
3. Copy each cleaned `.srt` to nullstone over SSH (`cat` redirect), using
the library filename:
`/home/user/media/tv/Sassy the Sasquatch (2022)/Season 01/<base>.eng.srt`.
4. Validation-only library refresh; verified all 5 eps show exactly 1
external eng sub stream.
Reusable pipeline now lives at `lib/sub-yt-fetch.sh` (wrapper) +
`lib/yt-clean.py` (cleaner). Same one-liner handles Donny & Clarence,
Mike Nolan, Big Lez Saga (all on the same channel).
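
Step 3 is still manual. One way it could be automated is to pair each cleaned SRT with its video through the shared `SxxEyy` tag. The sketch below only plans the names and copies nothing; `plan_sidecar_names` and the sample titles are hypothetical, and it assumes the playlist index used by the wrapper matches the library's episode numbering.

```python
import re
from pathlib import PurePosixPath

EP_TAG = re.compile(r"S(\d{2})E(\d{2})", re.IGNORECASE)


def plan_sidecar_names(video_names, srt_names):
    """Map each cleaned SRT to `<video basename>.eng.srt` via its SxxEyy tag."""
    videos = {}
    for name in video_names:
        m = EP_TAG.search(name)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            videos[key] = PurePosixPath(name).stem
    plan = {}
    for name in srt_names:
        m = EP_TAG.search(name)
        key = (int(m.group(1)), int(m.group(2))) if m else None
        if key in videos:
            plan[name] = videos[key] + ".eng.srt"
    return plan  # SRTs without a matching video are left out


# Placeholder episode titles; real ones come from the library and playlist.
videos = [
    "Sassy the Sasquatch (2022) - S01E01 - Example A.mkv",
    "Sassy the Sasquatch (2022) - S01E02 - Example B.mkv",
]
srts = [
    "Sassy the Sasquatch (2022) - S01E01 - example a upload.en.srt",
    "Sassy the Sasquatch (2022) - S01E03 - unmatched.en.srt",
]
plan = plan_sidecar_names(videos, srts)
for src, dst in plan.items():
    print(src, "->", dst)
```

Any SRT missing from the plan signals a numbering mismatch worth checking by hand before copying.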
## Known quality issues
- **Lowercase, no punctuation** — YT ASR output verbatim
- **Proper-noun mishears**: "Sassy" → `sasha`, "Big Lez" → `Big Less`
- **Profanity censored as `[ __ ]`** — passthrough from YT
- **Sentence segmentation absent** — cues split on word boundaries
These violate STYLE.md "best quality" and "clean" rules. Documented as
explicit stop-gap; v4 WhisperX rebuild restores quality bar.
## Mike Nolan special-case (deferred)
A YouTube upload titled "MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES"
posted Oct 2025 carries hand-typed CC tracks. When subbing Mike Nolan,
prefer that single video (rip CC tracks) over the per-episode auto-CC
playlist path. Note added to v4 roadmap.
## Followups
- [ ] visually verify one Sassy episode plays in sync (recipe §6) — YT
auto-cap timing is usually tight but worth a sanity check
- [ ] when v4 WhisperX lands, regenerate Sassy + Donny & Clarence + Big
Lez Saga + Mike Nolan in one batch on the 4080 friend node
- [ ] for Mike Nolan, try the "COMPLETE SEASON | SUBTITLES" YT upload
before falling back to Whisper