Adds lib/sub-yt-fetch.sh (a yt-dlp wrapper) and lib/yt-clean.py (collapses YouTube's rolling-window auto-caption VTT into a flat SRT), for shows distributed YouTube-first that have no community subs anywhere. That gap was verified via three parallel research agents covering OpenSubtitles REST, OS legacy, Addic7ed, SubDL, SubSource, and Podnapisi for the 5 niche shows in the library, plus a price-vs-coverage analysis of OpenSubtitles VIP.

Findings: OS VIP would not have helped on the niche shows (it is download-cap relief, not a coverage unlock; same catalog as free). All 4 Jarrad Wright shows in the library (Sassy, Big Lez Saga, Donny & Clarence, Mike Nolan) live on the same channel and have only YouTube auto-CC available. v3.5 ships those, explicitly violating the STYLE.md 'best quality' rule as a tracked stop-gap. Sassy the Sasquatch S01: 5/5 episodes subbed with cleaned auto-CC.

Mike Nolan special case: a 'COMPLETE SEASON | SUBTITLES' YT upload from Oct 2025 carries hand-typed CCs and should be preferred over per-episode auto-CC when subbing that show.

ROADMAP H5 added: v4 WhisperX large-v3 on the friend RTX 4080 node will regenerate the v3.5 stop-gap with proper-noun-prompted transcription (~4-6% WER vs ~12% for YT auto-CC) and restore the STYLE.md quality bar. H1 OpenSubtitles credentials marked done (completed 2026-05-09).
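To illustrate what "collapsing the rolling window" involves: YouTube auto-caption VTT typically repeats the previous caption line at the top of each cue and embeds inline word-timing tags. The sketch below is not yt-clean.py (which is Python and is not shown in this view). It is a hypothetical awk-in-shell illustration of the idea; the function name `collapse_vtt` and the assumed cue layout (`hh:mm:ss.mmm --> hh:mm:ss.mmm` timing lines, repeats appearing as consecutive identical lines) are assumptions.

```shell
# Hypothetical sketch, NOT the real yt-clean.py: collapse a rolling-window
# auto-caption VTT into flat SRT cues by dropping lines already emitted.
collapse_vtt() {
  awk '
    # Tolerate CRLF input.
    { sub(/\r$/, "") }
    # Cue timing line: remember start/end, converting "." to "," for SRT.
    /-->/ {
      t_start = $1; t_end = $3
      sub(/\./, ",", t_start); sub(/\./, ",", t_end)
      next
    }
    # Ignore the WEBVTT header (everything before the first cue).
    t_start == "" { next }
    {
      gsub(/<[^>]*>/, "")               # strip inline <c> and word-timing tags
      gsub(/^[ \t]+|[ \t]+$/, "")       # trim surrounding whitespace
      if ($0 == "" || $0 == last) next  # blank, or repeat of the previous line
      last = $0
      printf "%d\n%s --> %s\n%s\n\n", ++n, t_start, t_end, $0
    }
  ' "$1"
}
```

Usage would look like `collapse_vtt episode-raw.en-orig.vtt > episode.en.srt`. Each surviving line keeps the timing of the cue in which it first appeared, which is a simplification; a real cleaner may merge or extend timings differently.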
68 lines
2.4 KiB
Bash
Executable file
#!/usr/bin/env bash
# Subtitle fetcher v3.5: YouTube auto-captions via yt-dlp + cleaner.
#
# For shows that distribute on YouTube and have no community subs anywhere
# else (e.g. Big Lez Show universe: Sassy the Sasquatch, Donny & Clarence,
# Mike Nolan, Big Lez Saga). yt-dlp pulls the en-orig auto-CC track, the
# rolling-window VTT goes through yt-clean.py to deduplicate into a flat
# SRT, and the result is dropped on nullstone with the library filename.
#
# Quality caveats (per processes/subtitles/STYLE.md fallback policy):
#   - lowercase, no punctuation
#   - YouTube ASR mishears proper nouns (e.g. "Sassy" → "sasha")
#   - profanity is censored as "[ __ ]"
#   - capitalisation / sentence segmentation is absent
#
# These subs ship as a stop-gap. v4 (WhisperX large-v3 on the 4080 friend
# node) replaces them with full-quality transcriptions; see ROADMAP.
#
# Usage:
#   sub-yt-fetch.sh <playlist-or-channel-url> <out-dir> [name-template]
#
# name-template is optional; default: 'S%(playlist_index)02d - %(title)s'.
#
# Example (Sassy):
#   sub-yt-fetch.sh \
#     'https://www.youtube.com/playlist?list=PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj' \
#     /tmp/sassy-yt \
#     'Sassy the Sasquatch (2022) - S01E%(playlist_index)02d - %(title)s'
#
# After fetch: rename / copy each .en.srt to nullstone with the canonical
# library filename (`<videobasename>.eng.srt`). For now this is manual;
# automate when the next show comes through.

set -euo pipefail

PLAYLIST="${1:?playlist or channel URL required}"
OUTDIR="${2:?output directory required}"
NAMETMPL="${3:-S%(playlist_index)02d - %(title)s}"

mkdir -p "$OUTDIR"

if ! command -v yt-dlp >/dev/null; then
  echo "ERROR: yt-dlp not installed (pip install yt-dlp)" >&2
  exit 1
fi

# Pull raw VTT auto-CC, no video, en-orig only (matches en bytewise but is the
# canonical track to request).
yt-dlp --skip-download --write-auto-subs --sub-langs "en-orig" \
    --sub-format vtt \
    --sleep-requests 1 --sleep-subtitles 2 \
    -o "$OUTDIR/${NAMETMPL}-raw.%(ext)s" \
    "$PLAYLIST"

CLEANER="$(dirname "$0")/yt-clean.py"
# The cleaner is run via python3 below, so it only needs to exist as a file;
# the executable bit is not required.
if [[ ! -f "$CLEANER" ]]; then
  echo "ERROR: $CLEANER not found" >&2
  exit 2
fi

# Convert each raw VTT to clean SRT
shopt -s nullglob
for vtt in "$OUTDIR"/*-raw.en-orig.vtt; do
  out="${vtt%-raw.en-orig.vtt}.en.srt"
  python3 "$CLEANER" "$vtt" "$out"
  echo "OK  $out"
done

echo
echo "next: copy each .en.srt to nullstone with library filename, then library scan."
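The manual rename step at the end of the script could later be scripted. A minimal sketch, assuming the canonical name is simply the video's basename with its extension swapped for `.eng.srt`, as the header comment states. The function name `canonical_srt_name` and the example path are hypothetical, and the actual transfer to nullstone is deliberately left out because the source does not describe it.

```shell
# Hypothetical helper: derive the library subtitle filename for a video file.
canonical_srt_name() {
  local video="$1"
  local base="${video##*/}"            # strip any leading directory
  printf '%s.eng.srt\n' "${base%.*}"   # swap the extension for .eng.srt
}

# Example with a made-up library path:
canonical_srt_name '/library/Sassy the Sasquatch (2022) - S01E01 - Example.mkv'
```

Keeping the name derivation in one small function would let the eventual automation be checked against the library layout before anything is copied.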