processes/subtitles: v3.5 YouTube auto-CC stop-gap + Sassy 5/5
Adds lib/sub-yt-fetch.sh (yt-dlp wrapper) and lib/yt-clean.py (collapses YouTube's rolling-window auto-caption VTT into a flat SRT). This path is for shows distributed YouTube-first that have no community subs anywhere. That gap was verified by three parallel research agents covering OpenSubtitles REST, OS legacy, Addic7ed, SubDL, SubSource, and Podnapisi for the 5 niche shows in the library, plus a price-vs-coverage analysis of OpenSubtitles VIP.

Findings: OS VIP would not have helped on the niche shows (it is download-cap relief, not a coverage unlock; same catalog as free). All 4 Jarrad Wright shows in the library (Sassy, Big Lez Saga, Donny & Clarence, Mike Nolan) live on the same channel and have only YouTube auto-CC available. v3.5 ships those, explicitly violating STYLE.md 'best quality' as a tracked stop-gap.

Sassy the Sasquatch S01: 5/5 episodes subbed with cleaned auto-CC.

Mike Nolan special case: a 'COMPLETE SEASON | SUBTITLES' YT upload from Oct 2025 carries hand-typed CCs and should be preferred over per-episode auto-CC when subbing that show.

ROADMAP H5 added: v4 WhisperX large-v3 on the friend RTX 4080 node will regenerate the v3.5 stop-gap with proper-noun-prompted transcription (~4-6% WER vs ~12% for YT auto-CC) and restore the STYLE.md quality bar. H1 OpenSubtitles credentials marked done (completed 2026-05-09).
parent d9d6bdba64
commit eb71cf6beb
7 changed files with 274 additions and 6 deletions
@@ -24,10 +24,11 @@ Last revised: **2026-05-08**

| # | Item | Effort | Blocker |
|---|---|---|---|
| H1 | OpenSubtitles credentials (auth fixes log spam too — doc 13 win 2) | S | **owner signs up at opensubtitles.com** |
| H1 | ~~OpenSubtitles credentials~~ — done 2026-05-09; `Caveman5` saved + free API key at `~/.config/arrflix-opensubtitles-api.txt` | — | done |
| H2 | GPU transcode (nvidia driver kernel module + container toolkit + SecureBoot signing) | L | **owner sudo + reboot** |
| H3 | Apply `bin/force-english-all-users.sh` (German Play button breaks UX for non-English browsers) | S | none — owner runs |
| H4 | Backup `/home/docker/jellyfin/config/` off-host (no automated backup yet) | M | strategy decision |
| H5 | **v4 subtitle path: WhisperX large-v3 on friend RTX 4080 node**. Regenerate Sassy + Big Lez Saga + Donny & Clarence + Mike Nolan with proper-noun prompts (replaces v3.5 YT auto-CC stop-gap). New helper at `processes/subtitles/lib/sub-whisperx-fetch.py`. WhisperX install on 100.64.0.3 (per memory `project_friend_gpu.md`, currently offline 2d); per-show prompt yaml at `processes/subtitles/prompts/<show>.yaml` (recurring proper nouns). Expected 4–6 % WER vs ~12 % for YT auto-CC; restores STYLE.md "best quality" bar. See `processes/subtitles/runs/sassy-the-sasquatch.md` for context. | M | friend node back online + WhisperX setup |

## 🟨 Open — Medium value

@@ -21,4 +21,4 @@ amendment for a full sweep.

| Process | Status | Last touched |
|---|---|---|
| [`subtitles/`](subtitles/) | v3 — Addic7ed (free, no daily cap) added as primary, OS REST as fallback. AD 49/58 subbed; remaining 9 land via OS REST after quota reset | 2026-05-09 |
| [`subtitles/`](subtitles/) | v3.5 — YouTube auto-CC added as stop-gap for shows with no community subs anywhere (verified via 3-agent research run). AD 49/58 + Sassy 5/5. v4 WhisperX planned (ROADMAP H5) | 2026-05-10 |

@@ -101,3 +101,44 @@ helper at `lib/sub-a7d-fetch.py`. Runs alongside v2; pick whichever fits.
  bot detection and short IP throttle (~1 hour). The script makes no
  effort at jittering / backoff
- No automated sync-quality check; recipe Step 6 still manual

## v3.5 — 2026-05-10 (stop-gap path for niche YouTube-distributed shows)

For shows that distribute on YouTube and have no community subs anywhere
(verified by parallel research agents covering OS REST / OS legacy /
Addic7ed / SubDL / SubSource / Podnapisi for 5 niche shows), pull the
YouTube auto-CC track via yt-dlp and clean it.

- New helper: `lib/sub-yt-fetch.sh` (yt-dlp wrapper) + `lib/yt-clean.py`
  (rolling-window VTT → flat SRT cleaner)
- First applied to **Sassy the Sasquatch (2022)**, S01 5/5 episodes
- Reusable for the rest of the Big Lez universe (same channel hosts
  Donny & Clarence, Mike Nolan, Big Lez Saga)
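
To make the rolling-window problem concrete, here is a minimal sketch (not the shipped `lib/yt-clean.py`) of why keeping only the bottom line of each cue and merging consecutive repeats yields a flat transcript. The cue texts and timings are invented for illustration, and the `collapse` helper is a hypothetical name.

```python
# Toy model of YouTube's rolling-window auto-captions: each cue repeats the
# previous line on top and adds the newly spoken line at the bottom.
# Cue texts and timings are made up for illustration.
cues = [
    ("00:00:01,000", "00:00:03,000", ["g'day mate"]),
    ("00:00:03,000", "00:00:03,010", ["g'day mate"]),               # short bridge cue
    ("00:00:03,010", "00:00:05,000", ["g'day mate", "how you going"]),
    ("00:00:05,000", "00:00:05,010", ["how you going"]),            # short bridge cue
    ("00:00:05,010", "00:00:07,000", ["how you going", "not bad"]),
]


def collapse(cues):
    """Keep the bottom line of each cue and merge consecutive repeats."""
    flat = []
    for start, end, lines in cues:
        text = lines[-1]  # the freshly spoken line sits at the bottom
        if flat and flat[-1][2] == text:
            # Same text as the previous cue: extend its end time instead.
            flat[-1] = (flat[-1][0], end, text)
        else:
            flat.append((start, end, text))
    return flat


for start, end, text in collapse(cues):
    print(f"{start} --> {end}  {text}")
```

Five overlapping cues reduce to three, one per spoken line. The real cleaner additionally strips inline timing tags and drops cues shorter than 50 ms.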

### v3.5 known limits — explicitly violates STYLE.md "best quality"

- Lowercase, no punctuation, no sentence segmentation
- Proper-noun mishears (Sassy → "sasha", Big Lez → "Big Less")
- Profanity censored as `[ __ ]` by YouTube's ASR
- Will be replaced wholesale by v4 WhisperX (see ROADMAP H5)

### v3.5 also discovered

- **OpenSubtitles VIP would not have helped.** Verified: VIP is download-cap
  relief and ad removal, not coverage unlock. Same catalog as free.
- **Mike Nolan special-case**: a YouTube upload titled
  "MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES" (Oct 2025) carries
  hand-typed CCs. When subbing Mike Nolan, prefer ripping that single
  upload over the per-episode auto-CC playlist path.

## v4 — planned (see ROADMAP H5)

Path: **WhisperX large-v3 on friend RTX 4080 node** (`100.64.0.3`).

- Replaces v3.5 stop-gap with full-quality auto-transcription
- Per-show proper-noun prompt at `processes/subtitles/prompts/<show>.yaml`
- New helper: `lib/sub-whisperx-fetch.py` (TBD)
- Expected WER: 4–6% on noisy / animated dialogue (vs ~12% YT auto-CC)
- Restores STYLE.md "one clean English sub per ep" bar for niche shows
- Cloud fallback: ElevenLabs Scribe v2 (~$0.40/hr, ~2.2% WER) for any
  episode WhisperX still misses
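
The WER figures quoted above are word error rates: the word-level edit distance between a reference transcript and the generated one, divided by the number of reference words. A minimal sketch for spot-checking a cue against a hand-typed reference might look like this. The example strings are invented, and the roadmap numbers were not derived from this code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Both strings are lowercased first because YouTube auto-CC is all
    lowercase; punctuation is NOT stripped in this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# Invented example: one proper-noun mishear in a six-word line.
print(wer("sassy went down to the shops", "sasha went down to the shops"))
```

A single misheard name in a short line already costs one sixth of the words, which is why proper-noun prompting matters for these shows.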

@@ -1,7 +1,7 @@
# Subtitle acquisition process — v1

Last updated: 2026-05-09
Status: **v3** — three fetch paths (plugin / OS REST / Addic7ed). American Dad 49/58 subbed; remaining 9 land via OS REST after quota reset.
Last updated: 2026-05-10
Status: **v3.5** — four fetch paths (plugin / OS REST / Addic7ed / YouTube auto-CC). American Dad 49/58 + Sassy 5/5. v4 WhisperX planned (ROADMAP H5).

This recipe is written for Claude Code to execute. Each step lists the exact
command, what to verify, and what to do on failure. Background reference for

@@ -79,17 +79,20 @@ ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \

## Step 3 — Pick fetch path

Three paths, ordered cheapest-quota-cost-first:
Four paths, ordered cheapest-quota-cost-first:

| Path | Cost / day cap | Coverage | Tool |
|---|---|---|---|
| **v3 Addic7ed** | free, no daily cap (anon) | English-only; near-complete on broadcast US shows; spotty on animated specials / niche titles | `lib/sub-a7d-fetch.py` |
| **v2 OS REST** | 20 / day on free OS account | best overall coverage; survives any S/E numbering quirk via per-ep `imdb_id` | `lib/sub-rest-fetch.py` |
| **v1 plugin** | counts against same OS 20/day | only works when library numbering matches OS catalogue (e.g. fails on American Dad past S01E07) | `lib/sub-fetch.sh` |
| **v3.5 YouTube auto-CC** | free, ratelimited only | for shows distributed YouTube-first (no community subs anywhere); produces lowercase, no-punctuation, name-mangled subs — **stop-gap, violates STYLE.md** | `lib/sub-yt-fetch.sh` + `lib/yt-clean.py` |
| **v4 WhisperX (planned)** | local CPU/GPU time | full-quality auto-transcription, restores STYLE.md bar for niche shows | TBD `lib/sub-whisperx-fetch.py` (ROADMAP H5) |

Default: try **v3** first to spare quota; fall back to **v2** for episodes
v3 misses or for non-English needs. **v1** stays for shows where simple
plugin auto-fetch is enough.
plugin auto-fetch is enough. **v3.5** is the stop-gap when nothing exists
on community providers; **v4** replaces v3.5 once the GPU node is set up.
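
The default ordering described above amounts to a first-hit fallback chain. The sketch below illustrates that idea only: the `Fetcher` signature, `fetch_with_fallback`, and the stand-in fetchers are all hypothetical and do not reflect how the `lib/` helpers are actually invoked. The v1 plugin path is left out because it is chosen per show, not per episode.

```python
from typing import Callable, Optional

# Hypothetical fetcher signature: takes an episode label and returns the path
# of the subtitle it saved, or None when the provider has nothing.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(episode: str, paths: list[tuple[str, Fetcher]]) -> tuple[str, str]:
    """Try each (name, fetcher) in order; return (name, path) for the first hit."""
    for name, fetch in paths:
        result = fetch(episode)
        if result is not None:
            return name, result
    raise LookupError(f"no subtitle source produced a result for {episode}")


# Stand-in fetchers so the sketch runs; real ones would wrap the lib/ helpers.
def addic7ed(ep: str) -> Optional[str]:
    return None  # pretend Addic7ed has no match


def os_rest(ep: str) -> Optional[str]:
    return None  # pretend OS REST has no match


def yt_auto_cc(ep: str) -> Optional[str]:
    return f"/tmp/{ep}.en.srt"  # pretend the YouTube stop-gap succeeded


order = [
    ("v3 Addic7ed", addic7ed),
    ("v2 OS REST", os_rest),
    ("v3.5 YouTube auto-CC", yt_auto_cc),
]
print(fetch_with_fallback("S01E01", order))
```

Keeping the order in a plain list makes it easy to slot v4 in ahead of v3.5 once the GPU node is available.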

Quick check whether v1 plugin will suffice (skip the rest if yes):

68
processes/subtitles/lib/sub-yt-fetch.sh
Executable file
@@ -0,0 +1,68 @@
#!/usr/bin/env bash
# Subtitle fetcher v3.5 — YouTube auto-captions via yt-dlp + cleaner.
#
# For shows that distribute on YouTube and have no community subs anywhere
# else (e.g. Big Lez Show universe: Sassy the Sasquatch, Donny & Clarence,
# Mike Nolan, Big Lez Saga). yt-dlp pulls the en-orig auto-CC track, the
# rolling-window VTT goes through yt-clean.py to deduplicate into a flat
# SRT, and the result is dropped on nullstone with the library filename.
#
# Quality caveats (per processes/subtitles/STYLE.md fallback policy):
#   - lowercase, no punctuation
#   - YouTube ASR mishears proper nouns (e.g. "Sassy" → "sasha")
#   - profanity is censored as "[ __ ]"
#   - capitalisation / sentence segmentation is absent
#
# These subs ship as a stop-gap. v4 (WhisperX large-v3 on the 4080 friend
# node) replaces them with full-quality transcriptions; see ROADMAP.
#
# Usage:
#   sub-yt-fetch.sh <playlist-or-channel-url> <out-dir> <name-template>
#
# Example (Sassy):
#   sub-yt-fetch.sh \
#     'https://www.youtube.com/playlist?list=PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj' \
#     /tmp/sassy-yt \
#     'Sassy the Sasquatch (2022) - S01E%(playlist_index)02d - %(title)s'
#
# After fetch: rename / copy each .en.srt to nullstone with the canonical
# library filename (`<videobasename>.eng.srt`). For now this is manual —
# automate when the next show comes through.

set -euo pipefail

PLAYLIST="${1:?playlist or channel URL required}"
OUTDIR="${2:?output directory required}"
NAMETMPL="${3:-S%(playlist_index)02d - %(title)s}"

mkdir -p "$OUTDIR"

if ! command -v yt-dlp >/dev/null; then
  echo "ERROR: yt-dlp not installed (pip install yt-dlp)" >&2
  exit 1
fi

# Pull raw VTT auto-CC, no video, en-orig only (matches en bytewise but is the
# canonical track to request).
yt-dlp --skip-download --write-auto-subs --sub-langs "en-orig" \
  --sub-format vtt \
  --sleep-requests 1 --sleep-subtitles 2 \
  -o "$OUTDIR/${NAMETMPL}-raw.%(ext)s" \
  "$PLAYLIST"

CLEANER="$(dirname "$0")/yt-clean.py"
if [[ ! -x "$CLEANER" ]]; then
  echo "ERROR: $CLEANER not found / not executable" >&2
  exit 2
fi

# Convert each raw VTT to clean SRT
shopt -s nullglob
for vtt in "$OUTDIR"/*-raw.en-orig.vtt; do
  out="${vtt%-raw.en-orig.vtt}.en.srt"
  python3 "$CLEANER" "$vtt" "$out"
  echo "OK  $out"
done

echo
echo "next: copy each .en.srt to nullstone with library filename, then library scan."
56
processes/subtitles/lib/yt-clean.py
Executable file
@@ -0,0 +1,56 @@
#!/usr/bin/env python3
"""Clean YouTube auto-caption VTT into a flat SRT with no rolling-window dupes."""
import re, sys, pathlib

def parse_vtt(text):
    """Yield (start, end, line) tuples, dropping inline timing tags and empty lines."""
    blocks = re.split(r'\n\n+', text.strip())
    for b in blocks:
        if 'WEBVTT' in b or b.startswith('Kind:') or b.startswith('Language:'):
            continue
        m = re.search(r'(\d{2}:\d{2}:\d{2}[.,]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[.,]\d{3})', b)
        if not m: continue
        start, end = m.group(1), m.group(2)
        # Strip cue settings and inline <00:..><c>...</c> tags
        body = b[m.end():].strip()
        body = re.sub(r'<\d{2}:\d{2}:\d{2}\.\d{3}>', '', body)
        body = re.sub(r'</?c[^>]*>', '', body)
        body = re.sub(r'align:\S+|position:\S+', '', body).strip()
        # Last non-empty line is "new" content (rolling window puts the freshly spoken line at bottom)
        lines = [ln.strip() for ln in body.split('\n') if ln.strip()]
        if not lines: continue
        yield start, end, lines[-1]

def to_srt_time(t):
    return t.replace('.', ',')

def merge(events):
    """Drop the 10ms 'gap' cues and merge consecutive identical text."""
    out = []
    for s, e, txt in events:
        # Skip the bridge cue with same text already on top
        if out and out[-1][2] == txt:
            out[-1] = (out[-1][0], to_srt_time(e), txt)  # extend
            continue
        out.append([to_srt_time(s), to_srt_time(e), txt])
    # second pass to drop micro-cues
    final = []
    for s, e, txt in out:
        sh, sm, ssms = s.split(':'); ssec, sms = ssms.split(',')
        eh, em, esms = e.split(':'); esec, ems = esms.split(',')
        sm_total = int(sh)*3600+int(sm)*60+int(ssec)+int(sms)/1000
        em_total = int(eh)*3600+int(em)*60+int(esec)+int(ems)/1000
        if em_total - sm_total < 0.05: continue  # 50ms bridge cue
        final.append((s, e, txt))
    return final

def write_srt(events, path):
    with open(path, 'w') as f:
        for i, (s, e, txt) in enumerate(events, 1):
            f.write(f"{i}\n{s} --> {e}\n{txt}\n\n")

if __name__ == '__main__':
    vtt = pathlib.Path(sys.argv[1]).read_text()
    events = list(merge(parse_vtt(vtt)))
    write_srt(events, sys.argv[2])
    print(f"wrote {len(events)} cues -> {sys.argv[2]}")
99
processes/subtitles/runs/sassy-the-sasquatch.md
Normal file
@@ -0,0 +1,99 @@
# Subtitle run — `Sassy the Sasquatch (2022)`

Recipe version: v3.5 — YouTube auto-CC via yt-dlp + cleaner (v4 WhisperX planned, see ROADMAP)
Run date: 2026-05-10
Operator: Claude Code @ onyx session, ai-lab cwd

## Source

| Field | Value |
|---|---|
| Episodes | 5 (S01 only) |
| Container | mkv |
| Video | AV1 Main, 1920×1080, 29.97 fps |
| Audio | `eng` Opus stereo (default) |
| Embedded subs | none (only font / cover-art attachments) |
| Existing sidecars | none |
| Runtime | ~11:20 per episode |
| Distribution | YouTube (THE BIG LEZ SHOW OFFICIAL channel, creator: Jarrad Wright) |

Niche-show indie animation. Same channel hosts Donny & Clarence Show, Mike
Nolan Show, Big Lez Saga — all four shows in our library are Jarrad Wright
productions distributed YouTube-first.

## Series + library context

- Series Id: `b2d1afd8a4a30c59adb42ccaf47376c2`
- Library: `767bffe4f11c93ef34b805451a696a4e` (TV Shows, `/media/tv`)
- IMDB series: `tt21209936`
- TVDB series: `421839`
- Per-episode IMDB ids: only S01E01 (`tt21215354`) — rest blank in TVDB

## Coverage probe — paid + free providers

Three parallel research agents (2026-05-10) checked every realistic source
before falling back to YouTube:

| Provider | Hits |
|---|---|
| OpenSubtitles.com REST (`parent_imdb_id=21209936`) | 1 — `SASSY THE SASQUATCH.Web-DL.1080p.en` S01E01, **HI-flagged** |
| OpenSubtitles.org legacy XML-RPC | 0 (account login 401 anyway) |
| Addic7ed | 0 |
| SubDL | 0 (`subtitles_count: 0`) |
| SubSource (Subscene successor) | 0 |
| Podnapisi | 0 |
| OS VIP upgrade | **would not unlock anything** — VIP is download-cap relief, not coverage. Same catalog as free. |

Conclusion: nothing exists outside YouTube. Buying VIP would not help; the
honest path is auto-generated subs.

## Outcome

| Season | Eps | Subs fetched | Quality | Notes |
|---|---|---|---|---|
| S01 | 5 | 5 / 5 | YT auto-CC stop-gap (lowercase, no punctuation, names mangled) | Cleaned via `lib/yt-clean.py`. v4 WhisperX rebuild planned |

Net: **5 / 5 (100 %)** — but at the lowest tier of the USER-G quality bar.

## Pipeline used

1. `yt-dlp --skip-download --write-auto-subs --sub-langs en-orig` against
   the official Sassy playlist (`PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj`) →
   raw VTT per episode in `/tmp/sassy-research/`.
2. `lib/yt-clean.py` collapses the rolling-window VTT (each cue carries 2-3
   stale lines plus the freshly-spoken bottom line) into deduplicated SRT.
3. SSH cat redirect each cleaned `.srt` to nullstone at
   `/home/user/media/tv/Sassy the Sasquatch (2022)/Season 01/<base>.eng.srt`
   with library filename.
4. Validation-only library refresh; verified all 5 eps show exactly 1
   external eng sub stream.

Reusable pipeline now lives at `lib/sub-yt-fetch.sh` (wrapper) +
`lib/yt-clean.py` (cleaner). Same one-liner handles Donny & Clarence,
Mike Nolan, Big Lez Saga (all on the same channel).
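
Step 3 (renaming each cleaned `.en.srt` to the library's `<base>.eng.srt`) was done by hand in this run. One way it could be automated is to match files on their `SxxEyy` tag, as in the sketch below. The `plan_renames` helper and the file names are hypothetical, and the sketch assumes the playlist order matches the library's episode numbering, which should be checked per show.

```python
import re
from pathlib import Path

# Matches episode tags such as S01E03 (case-insensitive).
EP_TAG = re.compile(r"S(\d{2})E(\d{2})", re.IGNORECASE)


def plan_renames(srt_names, video_names):
    """Map each cleaned .en.srt to '<video basename>.eng.srt' via its SxxEyy tag."""
    videos = {}
    for v in video_names:
        m = EP_TAG.search(v)
        if m:
            videos[m.group(0).upper()] = Path(v).stem  # basename without .mkv
    plan = {}
    for s in srt_names:
        m = EP_TAG.search(s)
        if m and m.group(0).upper() in videos:
            plan[s] = videos[m.group(0).upper()] + ".eng.srt"
    return plan  # subtitles with no matching video are left out


# File names below are invented for illustration.
srts = ["Sassy the Sasquatch (2022) - S01E01 - Some Title.en.srt"]
videos = ["Sassy the Sasquatch (2022) - S01E01.mkv"]
print(plan_renames(srts, videos))
```

The function only builds a rename plan; copying the files to nullstone stays a separate, reviewable step.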

## Quality known issues

- **Lowercase, no punctuation** — YT ASR output verbatim
- **Proper-noun mishears**: "Sassy" → `sasha`, "Big Lez" → `Big Less`
- **Profanity censored as `[ __ ]`** — passthrough from YT
- **Sentence segmentation absent** — cues split on word boundaries

These violate STYLE.md "best quality" and "clean" rules. Documented as
explicit stop-gap; v4 WhisperX rebuild restores quality bar.
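
To track how bad a given episode is before the v4 rebuild, the issues listed above can be counted mechanically. This is a rough sketch only: the `audit` helper is hypothetical, the mishear table holds just the two examples from this run, and a hit on "sasha" may be a false positive where that name is genuinely spoken.

```python
import re

# Known mishears taken from the list above (lowercased); extend per show.
MISHEARS = {"sasha": "Sassy", "big less": "Big Lez"}
CENSOR = "[ __ ]"  # YouTube's profanity placeholder


def audit(lines):
    """Count the known v3.5 quality issues in a list of subtitle text lines."""
    text = "\n".join(lines)
    lowered = text.lower()
    return {
        "censored": text.count(CENSOR),
        "mishears": {
            wrong: len(re.findall(rf"\b{re.escape(wrong)}\b", lowered))
            for wrong in MISHEARS
        },
        # A line equal to its own lowercase form contains no capital letters.
        "lines_without_capitals": sum(1 for ln in lines if ln == ln.lower()),
    }


# Invented sample lines in the style of the auto-CC output.
sample = ["sasha went to see big less", "what the [ __ ] is that"]
print(audit(sample))
```

The counts give a per-episode baseline that the WhisperX output can later be compared against.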

## Mike Nolan special-case (deferred)

A YouTube upload titled "MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES"
posted Oct 2025 carries hand-typed CC tracks. When subbing Mike Nolan,
prefer that single video (rip CC tracks) over the per-episode auto-CC
playlist path. Note added to v4 roadmap.

## Followups

- [ ] visually verify one Sassy episode plays in sync (recipe §6) — YT
  auto-cap timing is usually tight but worth a sanity check
- [ ] when v4 WhisperX lands, regenerate Sassy + Donny & Clarence + Big
  Lez Saga + Mike Nolan in one batch on the 4080 friend node
- [ ] for Mike Nolan, try the "COMPLETE SEASON | SUBTITLES" YT upload
  before falling back to Whisper