legacy-arrflix/processes/subtitles/runs/sassy-the-sasquatch.md
s8n eb71cf6beb processes/subtitles: v3.5 YouTube auto-CC stop-gap + Sassy 5/5
Adds lib/sub-yt-fetch.sh (yt-dlp wrapper) and lib/yt-clean.py (collapses
YouTube's rolling-window auto-caption VTT into a flat SRT). For shows
distributed YouTube-first that have no community subs anywhere -- verified
via three parallel research agents covering OpenSubtitles REST, OS legacy,
Addic7ed, SubDL, SubSource, and Podnapisi for the 5 niche shows in the
library, plus a price-vs-coverage analysis of OpenSubtitles VIP.

Findings: OS VIP would not have helped on the niche shows (it is
download-cap relief, not coverage unlock; same catalog as free). All 4
Jarrad Wright shows in the library (Sassy, Big Lez Saga, Donny &
Clarence, Mike Nolan) live on the same channel and have only YouTube
auto-CC available. v3.5 ships those, explicitly violating STYLE.md
'best quality' as a tracked stop-gap.

Sassy the Sasquatch S01 5/5 episodes subbed with cleaned auto-CC. Mike
Nolan special-case noted: a 'COMPLETE SEASON | SUBTITLES' YT upload from
Oct 2025 carries hand-typed CCs and should be preferred over per-episode
auto-CC when subbing that show.

ROADMAP H5 added: v4 WhisperX large-v3 on the friend RTX 4080 node will
regenerate the v3.5 stop-gap with proper-noun-prompted transcription
(~4-6%% WER vs ~12%% YT auto-CC) and restore the STYLE.md quality bar.
H1 OpenSubtitles credentials marked done (was completed 2026-05-09).
2026-05-10 01:05:07 +01:00

99 lines
3.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Subtitle run — `Sassy the Sasquatch (2022)`
Recipe version: v3.5 — YouTube auto-CC via yt-dlp + cleaner (v4 WhisperX planned, see ROADMAP)
Run date: 2026-05-10
Operator: Claude Code @ onyx session, ai-lab cwd
## Source
| Field | Value |
|---|---|
| Episodes | 5 (S01 only) |
| Container | mkv |
| Video | AV1 Main, 1920×1080, 29.97 fps |
| Audio | `eng` Opus stereo (default) |
| Embedded subs | none (only font / cover-art attachments) |
| Existing sidecars | none |
| Runtime | ~11:20 per episode |
| Distribution | YouTube (THE BIG LEZ SHOW OFFICIAL channel, creator: Jarrad Wright) |
Niche-show indie animation. Same channel hosts Donny & Clarence Show, Mike
Nolan Show, Big Lez Saga — all four shows in our library are Jarrad Wright
productions distributed YouTube-first.
## Series + library context
- Series Id: `b2d1afd8a4a30c59adb42ccaf47376c2`
- Library: `767bffe4f11c93ef34b805451a696a4e` (TV Shows, `/media/tv`)
- IMDB series: `tt21209936`
- TVDB series: `421839`
- Per-episode IMDB ids: only S01E01 (`tt21215354`) — rest blank in TVDB
## Coverage probe — paid + free providers
Three parallel research agents (2026-05-10) checked every realistic source
before falling back to YouTube:
| Provider | Hits |
|---|---|
| OpenSubtitles.com REST (`parent_imdb_id=21209936`) | 1 — `SASSY THE SASQUATCH.Web-DL.1080p.en` S01E01, **HI-flagged** |
| OpenSubtitles.org legacy XML-RPC | 0 (account login 401 anyway) |
| Addic7ed | 0 |
| SubDL | 0 (`subtitles_count: 0`) |
| SubSource (Subscene successor) | 0 |
| Podnapisi | 0 |
| OS VIP upgrade | **would not unlock anything** — VIP is download-cap relief, not coverage. Same catalog as free. |
Conclusion: nothing exists outside YouTube. Buying VIP would not help; the
honest path is auto-generated subs.
## Outcome
| Season | Eps | Subs fetched | Quality | Notes |
|---|---|---|---|---|
| S01 | 5 | 5 / 5 | YT auto-CC stop-gap (lowercase, no punctuation, names mangled) | Cleaned via `lib/yt-clean.py`. v4 WhisperX rebuild planned |
Net: **5 / 5 (100 %)** — but at the lowest tier of the USER-G quality bar.
## Pipeline used
1. `yt-dlp --skip-download --write-auto-subs --sub-langs en-orig` against
the official Sassy playlist (`PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj`) →
raw VTT per episode in `/tmp/sassy-research/`.
2. `lib/yt-clean.py` collapses the rolling-window VTT (each cue carries 2-3
stale lines plus the freshly-spoken bottom line) into deduplicated SRT.
3. SSH cat redirect each cleaned `.srt` to nullstone at
`/home/user/media/tv/Sassy the Sasquatch (2022)/Season 01/<base>.eng.srt`
with library filename.
4. Validation-only library refresh; verified all 5 eps show exactly 1
external eng sub stream.
Reusable pipeline now lives at `lib/sub-yt-fetch.sh` (wrapper) +
`lib/yt-clean.py` (cleaner). Same one-liner handles Donny & Clarence,
Mike Nolan, Big Lez Saga (all on the same channel).
## Quality known issues
- **Lowercase, no punctuation** — YT ASR output verbatim
- **Proper-noun mishears**: "Sassy" → `sasha`, "Big Lez" → `Big Less`
- **Profanity censored as `[ __ ]`** — passthrough from YT
- **Sentence segmentation absent** — cues split on word boundaries
These violate STYLE.md "best quality" and "clean" rules. Documented as
explicit stop-gap; v4 WhisperX rebuild restores quality bar.
## Mike Nolan special-case (deferred)
A YouTube upload titled "MIKE NOLAN SHOW | COMPLETE SEASON | SUBTITLES"
posted Oct 2025 carries hand-typed CC tracks. When subbing Mike Nolan,
prefer that single video (rip CC tracks) over the per-episode auto-CC
playlist path. Note added to v4 roadmap.
## Followups
- [ ] visually verify one Sassy episode plays in sync (recipe §6) — YT
auto-cap timing is usually tight but worth a sanity check
- [ ] when v4 WhisperX lands, regenerate Sassy + Donny & Clarence + Big
Lez Saga + Mike Nolan in one batch on the 4080 friend node
- [ ] for Mike Nolan, try the "COMPLETE SEASON | SUBTITLES" YT upload
before falling back to Whisper