processes/subtitles: v3.5 YouTube auto-CC stop-gap + Sassy 5/5
Adds lib/sub-yt-fetch.sh (yt-dlp wrapper) and lib/yt-clean.py (collapses YouTube's rolling-window auto-caption VTT into a flat SRT) for shows distributed YouTube-first that have no community subs anywhere. The gap was verified via three parallel research agents covering OpenSubtitles REST, OS legacy, Addic7ed, SubDL, SubSource, and Podnapisi for the 5 niche shows in the library, plus a price-vs-coverage analysis of OpenSubtitles VIP.

Findings: OS VIP would not have helped on the niche shows (it is download-cap relief, not coverage unlock; same catalog as free). All 4 Jarrad Wright shows in the library (Sassy, Big Lez Saga, Donny & Clarence, Mike Nolan) live on the same channel and have only YouTube auto-CC available. v3.5 ships those, explicitly violating STYLE.md 'best quality' as a tracked stop-gap. Sassy the Sasquatch S01: 5/5 episodes subbed with cleaned auto-CC.

Mike Nolan special-case noted: a 'COMPLETE SEASON | SUBTITLES' YT upload from Oct 2025 carries hand-typed CCs and should be preferred over per-episode auto-CC when subbing that show.

ROADMAP H5 added: v4 WhisperX large-v3 on the friend RTX 4080 node will regenerate the v3.5 stop-gap with proper-noun-prompted transcription (~4-6% WER vs ~12% for YT auto-CC) and restore the STYLE.md quality bar. H1 OpenSubtitles credentials marked done (completed 2026-05-09).
parent d9d6bdba64
commit eb71cf6beb
7 changed files with 274 additions and 6 deletions
@@ -24,10 +24,11 @@ Last revised: **2026-05-08**
 
 | # | Item | Effort | Blocker |
 |---|---|---|---|
-| H1 | OpenSubtitles credentials (auth fixes log spam too — doc 13 win 2) | S | **owner signs up at opensubtitles.com** |
+| H1 | ~~OpenSubtitles credentials~~ — done 2026-05-09; `Caveman5` saved + free API key at `~/.config/arrflix-opensubtitles-api.txt` | — | done |
 | H2 | GPU transcode (nvidia driver kernel module + container toolkit + SecureBoot signing) | L | **owner sudo + reboot** |
 | H3 | Apply `bin/force-english-all-users.sh` (German Play button breaks UX for non-English browsers) | S | none — owner runs |
 | H4 | Backup `/home/docker/jellyfin/config/` off-host (no automated backup yet) | M | strategy decision |
+| H5 | **v4 subtitle path: WhisperX large-v3 on friend RTX 4080 node**. Regenerate Sassy + Big Lez Saga + Donny & Clarence + Mike Nolan with proper-noun prompts (replaces v3.5 YT auto-CC stop-gap). New helper at `processes/subtitles/lib/sub-whisperx-fetch.py`. WhisperX install on 100.64.0.3 (per memory `project_friend_gpu.md`, currently offline 2d); per-show prompt yaml at `processes/subtitles/prompts/<show>.yaml` (recurring proper nouns). Expected 4–6 % WER vs ~12 % for YT auto-CC; restores STYLE.md "best quality" bar. See `processes/subtitles/runs/sassy-the-sasquatch.md` for context. | M | friend node back online + WhisperX setup |
 
 ## 🟨 Open — Medium value
 
@@ -21,4 +21,4 @@ amendment for a full sweep.
 
 | Process | Status | Last touched |
 |---|---|---|
-| [`subtitles/`](subtitles/) | v3 — Addic7ed (free, no daily cap) added as primary, OS REST as fallback. AD 49/58 subbed; remaining 9 land via OS REST after quota reset | 2026-05-09 |
+| [`subtitles/`](subtitles/) | v3.5 — YouTube auto-CC added as stop-gap for shows with no community subs anywhere (verified via 3-agent research run). AD 49/58 + Sassy 5/5. v4 WhisperX planned (ROADMAP H5) | 2026-05-10 |
@@ -101,3 +101,44 @@ helper at `lib/sub-a7d-fetch.py`. Runs alongside v2; pick whichever fits.
   bot detection and short IP throttle (~1 hour). The script makes no
   effort at jittering / backoff
 - No automated sync-quality check; recipe Step 6 still manual
+
+## v3.5 — 2026-05-10 (stop-gap path for niche YouTube-distributed shows)
+
+For shows that distribute on YouTube and have no community subs anywhere
+(verified by parallel research agents covering OS REST / OS legacy /
+Addic7ed / SubDL / SubSource / Podnapisi for 5 niche shows), pull the
+YouTube auto-CC track via yt-dlp and clean it.
+
+- New helper: `lib/sub-yt-fetch.sh` (yt-dlp wrapper) + `lib/yt-clean.py`
+  (rolling-window VTT → flat SRT cleaner)
+- First applied to **Sassy the Sasquatch (2022)**, S01 5/5 episodes
+- Reusable for the rest of the Big Lez universe (same channel hosts
+  Donny & Clarence, Mike Nolan, Big Lez Saga)
+
+### v3.5 known limits — explicitly violates STYLE.md "best quality"
+
+- Lowercase, no punctuation, no sentence segmentation
+- Proper-noun mishears (Sassy → "sasha", Big Lez → "Big Less")
+- Profanity censored as `[ __ ]` by YouTube's ASR
+- Will be replaced wholesale by v4 WhisperX (see ROADMAP H5)
+
+### v3.5 also discovered
+
+- **OpenSubtitles VIP would not have helped.** Verified: VIP is download-cap
+  relief and ad removal, not coverage unlock. Same catalog as free.
+- **Mike Nolan special-case**: a YouTube upload titled
+  "MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES" (Oct 2025) carries
+  hand-typed CCs. When subbing Mike Nolan, prefer ripping that single
+  upload over the per-episode auto-CC playlist path.
+
+## v4 — planned (see ROADMAP H5)
+
+Path: **WhisperX large-v3 on friend RTX 4080 node** (`100.64.0.3`).
+
+- Replaces v3.5 stop-gap with full-quality auto-transcription
+- Per-show proper-noun prompt at `processes/subtitles/prompts/<show>.yaml`
+- New helper: `lib/sub-whisperx-fetch.py` (TBD)
+- Expected WER: 4–6% on noisy / animated dialogue (vs ~12% YT auto-CC)
+- Restores STYLE.md "one clean English sub per ep" bar for niche shows
+- Cloud fallback: ElevenLabs Scribe v2 (~$0.40/hr, ~2.2% WER) for any
+  episode WhisperX still misses
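The WER figures quoted in this hunk (4–6% expected for WhisperX against ~12% for YouTube auto-CC) are word error rates: word-level substitutions, deletions and insertions divided by the number of reference words. A minimal sketch of that calculation follows, assuming whitespace tokenisation and case-insensitive comparison. The function name `wer` and the sample sentences are illustrative and not part of this commit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Two of four words are misheard, mirroring the mishears listed above.
score = wer("sassy meets big lez", "sasha meets big less")
```

Here `score` is 0.5, because two of the four reference words are substituted. Measuring real episodes this way would need a hand-corrected reference transcript, which this commit does not include.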
@@ -1,7 +1,7 @@
 # Subtitle acquisition process — v1
 
-Last updated: 2026-05-09
-Status: **v3** — three fetch paths (plugin / OS REST / Addic7ed). American Dad 49/58 subbed; remaining 9 land via OS REST after quota reset.
+Last updated: 2026-05-10
+Status: **v3.5** — four fetch paths (plugin / OS REST / Addic7ed / YouTube auto-CC). American Dad 49/58 + Sassy 5/5. v4 WhisperX planned (ROADMAP H5).
 
 This recipe is written for Claude Code to execute. Each step lists the exact
 command, what to verify, and what to do on failure. Background reference for
@@ -79,17 +79,20 @@ ssh user@192.168.0.100 "docker exec jellyfin curl -s -H 'X-Emby-Token: $TOK' \
 
 ## Step 3 — Pick fetch path
 
-Three paths, ordered cheapest-quota-cost-first:
+Four paths, ordered cheapest-quota-cost-first:
 
 | Path | Cost / day cap | Coverage | Tool |
 |---|---|---|---|
 | **v3 Addic7ed** | free, no daily cap (anon) | English-only; near-complete on broadcast US shows; spotty on animated specials / niche titles | `lib/sub-a7d-fetch.py` |
 | **v2 OS REST** | 20 / day on free OS account | best overall coverage; survives any S/E numbering quirk via per-ep `imdb_id` | `lib/sub-rest-fetch.py` |
 | **v1 plugin** | counts against same OS 20/day | only works when library numbering matches OS catalogue (e.g. fails on American Dad past S01E07) | `lib/sub-fetch.sh` |
+| **v3.5 YouTube auto-CC** | free, ratelimited only | for shows distributed YouTube-first (no community subs anywhere); produces lowercase, no-punctuation, name-mangled subs — **stop-gap, violates STYLE.md** | `lib/sub-yt-fetch.sh` + `lib/yt-clean.py` |
+| **v4 WhisperX (planned)** | local CPU/GPU time | full-quality auto-transcription, restores STYLE.md bar for niche shows | TBD `lib/sub-whisperx-fetch.py` (ROADMAP H5) |
 
 Default: try **v3** first to spare quota; fall back to **v2** for episodes
 v3 misses or for non-English needs. **v1** stays for shows where simple
-plugin auto-fetch is enough.
+plugin auto-fetch is enough. **v3.5** is the stop-gap when nothing exists
+on community providers; **v4** replaces v3.5 once the GPU node is set up.
 
 Quick check whether v1 plugin will suffice (skip the rest if yes):
 
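The default order in Step 3 reads as a fallback chain: try the quota-free community path, then OS REST, and only then the YouTube auto-CC stop-gap. A hedged sketch of that chain follows, assuming each path can be wrapped as a callable that returns `True` when it produced a subtitle. `pick_path` and the three stub fetchers are hypothetical stand-ins and do not exist in the repository; real wrappers would shell out to the `lib/` helpers named in the table.

```python
from typing import Callable, List, Optional, Tuple

# A fetcher takes an episode label and reports whether it found a subtitle.
Fetcher = Callable[[str], bool]


def pick_path(episode: str, paths: List[Tuple[str, Fetcher]]) -> Optional[str]:
    """Try each fetch path in order; return the name of the first success."""
    for name, fetch in paths:
        if fetch(episode):
            return name
    return None


# Hypothetical stubs standing in for lib/sub-a7d-fetch.py,
# lib/sub-rest-fetch.py and lib/sub-yt-fetch.sh respectively.
def addic7ed(episode: str) -> bool:
    return False  # no community sub found


def os_rest(episode: str) -> bool:
    return False  # no usable hit


def yt_auto_cc(episode: str) -> bool:
    return True   # auto-CC track exists on YouTube


ORDER = [
    ("v3 Addic7ed", addic7ed),
    ("v2 OS REST", os_rest),
    ("v3.5 YouTube auto-CC", yt_auto_cc),
]
chosen = pick_path("Sassy the Sasquatch S01E01", ORDER)
```

With these stubs `chosen` is `"v3.5 YouTube auto-CC"`. The v1 plugin is left out of the chain because the recipe treats it as a separate quick check rather than a fallback.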
68
processes/subtitles/lib/sub-yt-fetch.sh
Executable file
@@ -0,0 +1,68 @@
+#!/usr/bin/env bash
+# Subtitle fetcher v3.5 — YouTube auto-captions via yt-dlp + cleaner.
+#
+# For shows that distribute on YouTube and have no community subs anywhere
+# else (e.g. Big Lez Show universe: Sassy the Sasquatch, Donny & Clarence,
+# Mike Nolan, Big Lez Saga). yt-dlp pulls the en-orig auto-CC track, the
+# rolling-window VTT goes through yt-clean.py to deduplicate into a flat
+# SRT, and the result is dropped on nullstone with the library filename.
+#
+# Quality caveats (per processes/subtitles/STYLE.md fallback policy):
+# - lowercase, no punctuation
+# - YouTube ASR mishears proper nouns (e.g. "Sassy" → "sasha")
+# - profanity is censored as "[ __ ]"
+# - capitalisation / sentence segmentation is absent
+#
+# These subs ship as a stop-gap. v4 (WhisperX large-v3 on the 4080 friend
+# node) replaces them with full-quality transcriptions; see ROADMAP.
+#
+# Usage:
+# sub-yt-fetch.sh <playlist-or-channel-url> <out-dir> <name-template>
+#
+# Example (Sassy):
+# sub-yt-fetch.sh \
+# 'https://www.youtube.com/playlist?list=PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj' \
+# /tmp/sassy-yt \
+# 'Sassy the Sasquatch (2022) - S01E%(playlist_index)02d - %(title)s'
+#
+# After fetch: rename / copy each .en.srt to nullstone with the canonical
+# library filename (`<videobasename>.eng.srt`). For now this is manual —
+# automate when the next show comes through.
+
+set -euo pipefail
+
+PLAYLIST="${1:?playlist or channel URL required}"
+OUTDIR="${2:?output directory required}"
+NAMETMPL="${3:-S%(playlist_index)02d - %(title)s}"
+
+mkdir -p "$OUTDIR"
+
+if ! command -v yt-dlp >/dev/null; then
+  echo "ERROR: yt-dlp not installed (pip install yt-dlp)" >&2
+  exit 1
+fi
+
+# Pull raw VTT auto-CC, no video, en-orig only (matches en bytewise but is the
+# canonical track to request).
+yt-dlp --skip-download --write-auto-subs --sub-langs "en-orig" \
+  --sub-format vtt \
+  --sleep-requests 1 --sleep-subtitles 2 \
+  -o "$OUTDIR/${NAMETMPL}-raw.%(ext)s" \
+  "$PLAYLIST"
+
+CLEANER="$(dirname "$0")/yt-clean.py"
+if [[ ! -x "$CLEANER" ]]; then
+  echo "ERROR: $CLEANER not found / not executable" >&2
+  exit 2
+fi
+
+# Convert each raw VTT to clean SRT
+shopt -s nullglob
+for vtt in "$OUTDIR"/*-raw.en-orig.vtt; do
+  out="${vtt%-raw.en-orig.vtt}.en.srt"
+  python3 "$CLEANER" "$vtt" "$out"
+  echo "OK $out"
+done
+
+echo
+echo "next: copy each .en.srt to nullstone with library filename, then library scan."
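The script leaves the final rename to `<videobasename>.eng.srt` as a manual step. One way that step might be automated is to pair each fetched `.en.srt` with a library video through a shared `SxxEyy` token. The sketch below only plans the renames and touches no files. It assumes both sides carry such a token (as with the Sassy example template, but not with the script's default template), and `plan_renames` plus the sample filenames are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, List

# Season/episode token such as "S01E03"; matched case-insensitively.
EP_TOKEN = re.compile(r"S(\d{2})E(\d{2})", re.IGNORECASE)


def plan_renames(subs: List[str], videos: List[str]) -> Dict[str, str]:
    """Map each fetched sub filename to '<videobasename>.eng.srt'."""
    by_episode = {}
    for video in videos:
        m = EP_TOKEN.search(video)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            by_episode[key] = PurePosixPath(video).stem  # drop the extension
    plan = {}
    for sub in subs:
        m = EP_TOKEN.search(sub)
        if not m:
            continue  # no episode token: leave for manual handling
        key = (int(m.group(1)), int(m.group(2)))
        if key in by_episode:
            plan[sub] = by_episode[key] + ".eng.srt"
    return plan


# Illustrative filenames only; real titles will differ.
plan = plan_renames(
    ["Sassy the Sasquatch (2022) - S01E01 - some yt title.en.srt"],
    ["Sassy the Sasquatch (2022) - S01E01 - Episode 1.mkv"],
)
```

Subs without a matching video are simply absent from the plan, so a caller can diff the plan against the input list to spot episodes that still need manual attention.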
56
processes/subtitles/lib/yt-clean.py
Executable file
@@ -0,0 +1,56 @@
+#!/usr/bin/env python3
+"""Clean YouTube auto-caption VTT into a flat SRT with no rolling-window dupes."""
+import re, sys, pathlib
+
+def parse_vtt(text):
+    """Yield (start, end, line) tuples, dropping inline timing tags and empty lines."""
+    blocks = re.split(r'\n\n+', text.strip())
+    for b in blocks:
+        if 'WEBVTT' in b or b.startswith('Kind:') or b.startswith('Language:'):
+            continue
+        m = re.search(r'(\d{2}:\d{2}:\d{2}[.,]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[.,]\d{3})', b)
+        if not m: continue
+        start, end = m.group(1), m.group(2)
+        # Strip cue settings and inline <00:..><c>...</c> tags
+        body = b[m.end():].strip()
+        body = re.sub(r'<\d{2}:\d{2}:\d{2}\.\d{3}>', '', body)
+        body = re.sub(r'</?c[^>]*>', '', body)
+        body = re.sub(r'align:\S+|position:\S+', '', body).strip()
+        # Last non-empty line is "new" content (rolling window puts the freshly spoken line at bottom)
+        lines = [ln.strip() for ln in body.split('\n') if ln.strip()]
+        if not lines: continue
+        yield start, end, lines[-1]
+
+def to_srt_time(t):
+    return t.replace('.', ',')
+
+def merge(events):
+    """Drop the 10ms 'gap' cues and merge consecutive identical text."""
+    out = []
+    for s, e, txt in events:
+        # Skip the bridge cue with same text already on top
+        if out and out[-1][2] == txt:
+            out[-1] = (out[-1][0], to_srt_time(e), txt) # extend
+            continue
+        out.append([to_srt_time(s), to_srt_time(e), txt])
+    # second pass to drop micro-cues
+    final = []
+    for s, e, txt in out:
+        sh, sm, ssms = s.split(':'); ssec, sms = ssms.split(',')
+        eh, em, esms = e.split(':'); esec, ems = esms.split(',')
+        sm_total = int(sh)*3600+int(sm)*60+int(ssec)+int(sms)/1000
+        em_total = int(eh)*3600+int(em)*60+int(esec)+int(ems)/1000
+        if em_total - sm_total < 0.05: continue # 50ms bridge cue
+        final.append((s, e, txt))
+    return final
+
+def write_srt(events, path):
+    with open(path, 'w') as f:
+        for i, (s, e, txt) in enumerate(events, 1):
+            f.write(f"{i}\n{s} --> {e}\n{txt}\n\n")
+
+if __name__ == '__main__':
+    vtt = pathlib.Path(sys.argv[1]).read_text()
+    events = list(merge(parse_vtt(vtt)))
+    write_srt(events, sys.argv[2])
+    print(f"wrote {len(events)} cues -> {sys.argv[2]}")
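To make the rolling-window problem concrete: each auto-caption cue repeats the previously spoken line above the fresh one, and a roughly 10 ms bridge cue holding the same text follows it. The sketch below is a simplified, self-contained restatement of the cleaner's idea (keep the last non-empty line of each cue, then merge consecutive duplicates), not a copy of `yt-clean.py`. The sample VTT is hand-written for illustration rather than real channel output, and `collapse` is a hypothetical name.

```python
import re

# Hand-written sample imitating the rolling-window layout (an assumption,
# not captured from YouTube): cue 2 is the short bridge cue, cue 3 repeats
# the stale line above the freshly spoken one.
SAMPLE_VTT = """WEBVTT
Kind: captions
Language: en

00:00:00.160 --> 00:00:02.389 align:start position:0%
hello<00:00:00.400><c> there</c>

00:00:02.389 --> 00:00:02.399 align:start position:0%
hello there

00:00:02.399 --> 00:00:04.500 align:start position:0%
hello there
how<00:00:02.800><c> are</c><00:00:03.100><c> you</c>
"""

TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")


def collapse(vtt):
    """Return (start, end, text) cues with rolling-window repeats merged."""
    cues = []
    for block in re.split(r"\n\n+", vtt.strip()):
        m = TIMING.search(block)
        if not m:
            continue  # header block, not a cue
        # Everything after the timestamps; element 0 is the cue-settings
        # remainder of the timing line, so drop it.
        body = block[m.end():].split("\n")[1:]
        body = [re.sub(r"<[^>]+>", "", ln).strip() for ln in body]
        body = [ln for ln in body if ln]
        if not body:
            continue
        text = body[-1]  # the freshly spoken line sits at the bottom
        if cues and cues[-1][2] == text:
            # Bridge cue: same text again, so only extend the end time.
            cues[-1] = (cues[-1][0], m.group(2), text)
        else:
            cues.append((m.group(1), m.group(2), text))
    return cues


cues = collapse(SAMPLE_VTT)
```

For this sample the three VTT cues collapse to two: `hello there` running from 00:00:00.160 to 00:00:02.399, then `how are you` from 00:00:02.399 to 00:00:04.500. Timestamps are left in VTT form here; an SRT writer would still need to swap the `.` for a `,`.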
99
processes/subtitles/runs/sassy-the-sasquatch.md
Normal file
@@ -0,0 +1,99 @@
+# Subtitle run — `Sassy the Sasquatch (2022)`
+
+Recipe version: v3.5 — YouTube auto-CC via yt-dlp + cleaner (v4 WhisperX planned, see ROADMAP)
+Run date: 2026-05-10
+Operator: Claude Code @ onyx session, ai-lab cwd
+
+## Source
+
+| Field | Value |
+|---|---|
+| Episodes | 5 (S01 only) |
+| Container | mkv |
+| Video | AV1 Main, 1920×1080, 29.97 fps |
+| Audio | `eng` Opus stereo (default) |
+| Embedded subs | none (only font / cover-art attachments) |
+| Existing sidecars | none |
+| Runtime | ~11:20 per episode |
+| Distribution | YouTube (THE BIG LEZ SHOW OFFICIAL channel, creator: Jarrad Wright) |
+
+Niche-show indie animation. Same channel hosts Donny & Clarence Show, Mike
+Nolan Show, Big Lez Saga — all four shows in our library are Jarrad Wright
+productions distributed YouTube-first.
+
+## Series + library context
+
+- Series Id: `b2d1afd8a4a30c59adb42ccaf47376c2`
+- Library: `767bffe4f11c93ef34b805451a696a4e` (TV Shows, `/media/tv`)
+- IMDB series: `tt21209936`
+- TVDB series: `421839`
+- Per-episode IMDB ids: only S01E01 (`tt21215354`) — rest blank in TVDB
+
+## Coverage probe — paid + free providers
+
+Three parallel research agents (2026-05-10) checked every realistic source
+before falling back to YouTube:
+
+| Provider | Hits |
+|---|---|
+| OpenSubtitles.com REST (`parent_imdb_id=21209936`) | 1 — `SASSY THE SASQUATCH.Web-DL.1080p.en` S01E01, **HI-flagged** |
+| OpenSubtitles.org legacy XML-RPC | 0 (account login 401 anyway) |
+| Addic7ed | 0 |
+| SubDL | 0 (`subtitles_count: 0`) |
+| SubSource (Subscene successor) | 0 |
+| Podnapisi | 0 |
+| OS VIP upgrade | **would not unlock anything** — VIP is download-cap relief, not coverage. Same catalog as free. |
+
+Conclusion: nothing exists outside YouTube. Buying VIP would not help; the
+honest path is auto-generated subs.
+
+## Outcome
+
+| Season | Eps | Subs fetched | Quality | Notes |
+|---|---|---|---|---|
+| S01 | 5 | 5 / 5 | YT auto-CC stop-gap (lowercase, no punctuation, names mangled) | Cleaned via `lib/yt-clean.py`. v4 WhisperX rebuild planned |
+
+Net: **5 / 5 (100 %)** — but at the lowest tier of the USER-G quality bar.
+
+## Pipeline used
+
+1. `yt-dlp --skip-download --write-auto-subs --sub-langs en-orig` against
+   the official Sassy playlist (`PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj`) →
+   raw VTT per episode in `/tmp/sassy-research/`.
+2. `lib/yt-clean.py` collapses the rolling-window VTT (each cue carries 2-3
+   stale lines plus the freshly-spoken bottom line) into deduplicated SRT.
+3. SSH cat redirect each cleaned `.srt` to nullstone at
+   `/home/user/media/tv/Sassy the Sasquatch (2022)/Season 01/<base>.eng.srt`
+   with library filename.
+4. Validation-only library refresh; verified all 5 eps show exactly 1
+   external eng sub stream.
+
+Reusable pipeline now lives at `lib/sub-yt-fetch.sh` (wrapper) +
+`lib/yt-clean.py` (cleaner). Same one-liner handles Donny & Clarence,
+Mike Nolan, Big Lez Saga (all on the same channel).
+
+## Quality known issues
+
+- **Lowercase, no punctuation** — YT ASR output verbatim
+- **Proper-noun mishears**: "Sassy" → `sasha`, "Big Lez" → `Big Less`
+- **Profanity censored as `[ __ ]`** — passthrough from YT
+- **Sentence segmentation absent** — cues split on word boundaries
+
+These violate STYLE.md "best quality" and "clean" rules. Documented as
+explicit stop-gap; v4 WhisperX rebuild restores quality bar.
+
+## Mike Nolan special-case (deferred)
+
+A YouTube upload titled "MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES"
+posted Oct 2025 carries hand-typed CC tracks. When subbing Mike Nolan,
+prefer that single video (rip CC tracks) over the per-episode auto-CC
+playlist path. Note added to v4 roadmap.
+
+## Followups
+
+- [ ] visually verify one Sassy episode plays in sync (recipe §6) — YT
+      auto-cap timing is usually tight but worth a sanity check
+- [ ] when v4 WhisperX lands, regenerate Sassy + Donny & Clarence + Big
+      Lez Saga + Mike Nolan in one batch on the 4080 friend node
+- [ ] for Mike Nolan, try the "COMPLETE SEASON | SUBTITLES" YT upload
+      before falling back to Whisper