legacy-arrflix/playbooks/subtitles/lib/sub-yt-fetch.sh
s8n 24a9497e7d playbooks/ rename + import-media v1.0 + lilo&stitch run
processes/ -> playbooks/ (git mv preserves history; updated cross-refs
in ROADMAP, README, subtitles playbook + scripts).

playbooks/import-media/README.md v1.0 — 7-step import workflow:
  stage on onyx -> rsync to nullstone -> chmod -> verify scan ->
  Items/Counts bump -> optional subtitle pass -> run-log
Cross-references docs/05/07/08, ADMIN-GUIDE, README. Mirrors the
existing subtitles playbook structure (CHANGELOG + runs/_template).

CHANGELOG v1.0 lists known gaps (bin/cleanup-import.sh and
bin/normalize.py still doc-only, ROADMAP M6).

First run logged: playbooks/import-media/runs/lilo-stitch-2002.md.
Lilo & Stitch (2002) imported to /home/user/media/movies/, item
c2f4aff133c1b9631500fadf293b0b2f, TMDb 11544, MovieCount 3 -> 4.
LibraryMonitor didn't auto-fire — needed manual /Library/Refresh;
playbook updated to make this an unconditional step.

Source: 1080p BluRay HEVC 10-bit / EAC3 5.1 / 2x PGS embedded subs.
Per quality bar (README.md:41) — passes.
2026-05-10 02:29:57 +01:00


#!/usr/bin/env bash
# Subtitle fetcher v3.5 — YouTube auto-captions via yt-dlp + cleaner.
#
# For shows that distribute on YouTube and have no community subs anywhere
# else (e.g. Big Lez Show universe: Sassy the Sasquatch, Donny & Clarence,
# Mike Nolan, Big Lez Saga). yt-dlp pulls the en-orig auto-CC track, the
# rolling-window VTT goes through yt-clean.py to deduplicate into a flat
# SRT, and the result is dropped on nullstone with the library filename.
#
# Quality caveats (per playbooks/subtitles/STYLE.md fallback policy):
#   - all lowercase, with no punctuation or sentence segmentation
#   - YouTube ASR mishears proper nouns (e.g. "Sassy" → "sasha")
#   - profanity is censored as "[ __ ]"
#
# These subs ship as a stop-gap. v4 (WhisperX large-v3 on the 4080 friend
# node) replaces them with full-quality transcriptions; see ROADMAP.
#
# Usage:
#   sub-yt-fetch.sh <playlist-or-channel-url> <out-dir> [name-template]
#
# name-template is a yt-dlp output template; when omitted it defaults to
# 'S%(playlist_index)02d - %(title)s'.
#
# Example (Sassy):
#   sub-yt-fetch.sh \
#     'https://www.youtube.com/playlist?list=PLGMC7oz7XpmDMGrALMQiNXCi9p7aqkWbj' \
#     /tmp/sassy-yt \
#     'Sassy the Sasquatch (2022) - S01E%(playlist_index)02d - %(title)s'
#
# After fetch: rename / copy each .en.srt to nullstone with the canonical
# library filename (`<videobasename>.eng.srt`). For now this is manual —
# automate when the next show comes through.
set -euo pipefail
PLAYLIST="${1:?playlist or channel URL required}"
OUTDIR="${2:?output directory required}"
NAMETMPL="${3:-S%(playlist_index)02d - %(title)s}"
mkdir -p "$OUTDIR"
if ! command -v yt-dlp >/dev/null; then
    echo "ERROR: yt-dlp not installed (pip install yt-dlp)" >&2
    exit 1
fi
# Pull only the raw VTT auto-captions (no video). Request en-orig: it is
# byte-identical to the en track but is the canonical one to ask for.
yt-dlp --skip-download --write-auto-subs --sub-langs "en-orig" \
    --sub-format vtt \
    --sleep-requests 1 --sleep-subtitles 2 \
    -o "$OUTDIR/${NAMETMPL}-raw.%(ext)s" \
    "$PLAYLIST"
CLEANER="$(dirname "$0")/yt-clean.py"
if [[ ! -x "$CLEANER" ]]; then
    echo "ERROR: $CLEANER not found / not executable" >&2
    exit 2
fi
# Convert each raw VTT to a clean SRT.
shopt -s nullglob
for vtt in "$OUTDIR"/*-raw.en-orig.vtt; do
    out="${vtt%-raw.en-orig.vtt}.en.srt"
    python3 "$CLEANER" "$vtt" "$out"
    echo "OK $out"
done
echo
echo "next: copy each .en.srt to nullstone with library filename, then library scan."
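
The manual follow-up step described in the header (copying each `.en.srt` to `<videobasename>.eng.srt`) could be automated roughly as below. This is a minimal sketch, not part of the repo: the function names `episode_tag` and `place_subs` are hypothetical, and it assumes that both the fetched subs and the library videos carry an upper-case `SxxEyy` tag in their filenames, that videos are `.mkv` or `.mp4`, and that the library directory is reachable as a local path (for example by running it on nullstone itself).

```shell
# Hypothetical helpers (not in the repo): pair fetched subs with library
# videos by their SxxEyy tag and copy each sub to <videobasename>.eng.srt.

# Print the first SxxEyy tag in a filename (normalised to upper case),
# or return 1 if the name carries no such tag.
episode_tag() {
    local name="${1##*/}"
    if [[ "$name" =~ [Ss]([0-9]{2})[Ee]([0-9]{2}) ]]; then
        printf 'S%sE%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

# place_subs <srt-dir> <video-dir> [--apply]
# Dry run by default: prints the planned copies. Pass --apply to copy.
place_subs() {
    local srtdir="$1" viddir="$2" mode="${3:-}"
    local srt vid tag base
    shopt -s nullglob   # unmatched globs expand to nothing
    for srt in "$srtdir"/*.en.srt; do
        if ! tag="$(episode_tag "$srt")"; then
            echo "SKIP (no SxxEyy tag): $srt" >&2
            continue
        fi
        # Assumption: library video names contain the same upper-case tag.
        for vid in "$viddir"/*"$tag"*.mkv "$viddir"/*"$tag"*.mp4; do
            base="${vid%.*}"
            if [[ "$mode" == "--apply" ]]; then
                cp -- "$srt" "$base.eng.srt"
            fi
            echo "$srt -> $base.eng.srt"
        done
    done
}
```

Sourced into a shell, `place_subs /tmp/sassy-yt <library-dir>` would list the planned copies, and adding `--apply` as a third argument would perform them. Defaulting to a dry run is a deliberate choice here, since ASR-derived titles may not line up with library names and a wrong pairing is easier to spot before anything is written.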