# youtube-learn
Turn a YouTube video into durable, applicable knowledge. Local-only tooling — near-zero API cost. Sub-agents can run this end-to-end and produce output Cyrus reads directly when applying.
## When to use
- OC sends a video URL and asks for lessons / application.
- Capturing design / visual look-and-feel from a reference site walkthrough or build tutorial (use `--design-mode`).
- Building a research file before a sprint (e.g. before a new Atlas build, before a TikTok push).
## When NOT to use
- Quick fact lookup from a video → use `web_search` or read the description/comments instead.
- Live-streamed content with no archived recording.
## Inputs (the spawn prompt MUST supply these)
| Field | Required | Example |
|---|---|---|
| `url` | yes | `https://youtube.com/watch?v=...` |
| `topic_tag` | yes | `tiktok-playbook`, `directory-design`, `ai-stack` |
| `apply_to` | yes | "the directory-builder skill", "next Atlas build", "TikTok content for @cyrusnorthstarf" |
| `mode` | optional | `transcript` (default) \| `design` \| `both` |
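The pipeline below reads these inputs as shell variables. A minimal sketch of that mapping (`URL` and `TOPIC_TAG` appear verbatim in step 1; `MODE` is illustrative, since the steps only branch on it in prose):

```bash
# Sketch: spawn-prompt inputs as the shell variables the pipeline expects
URL="https://youtube.com/watch?v=..."
TOPIC_TAG="tiktok-playbook"
MODE="transcript"   # or: design | both
```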
## Pipeline
### 1. Fetch audio (always)
SLUG="$(echo "$TOPIC_TAG" | tr -cd '[:alnum:]-')"
DATE="$(date +%Y-%m-%d)"
WORKDIR="$HOME/.openclaw/workspace/research/yt-${SLUG}-${DATE}"
mkdir -p "$WORKDIR"
yt-dlp -f "bestaudio[ext=m4a]/bestaudio" \
-o "$WORKDIR/audio.%(ext)s" \
--write-info-json --write-thumbnail \
"$URL"
If `--design-mode` or `--mode=design`/`both`, also pull video:
yt-dlp -f "bestvideo[height<=720]+bestaudio/best[height<=720]" \
-o "$WORKDIR/video.%(ext)s" "$URL"
### 2. Transcribe locally
cd "$WORKDIR"
AUDIO="$(ls audio.* | head -1)"
whisper "$AUDIO" --model base --output_format txt --output_format srt --language en
Use `--model small` if the domain has heavy jargon. `base` is the default: roughly 1.5x realtime on this host, good enough for content extraction.
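The `.srt` output is handy later for anchoring timestamped claims. A rough sketch, assuming whisper wrote `audio.srt` next to the audio ("pricing" is just an example keyword):

```bash
# Sketch: list SRT cue start times near a keyword; trim the leading hours to get [mm:ss]
grep -B1 -i "pricing" audio.srt | grep -Eo '^[0-9]{2}:[0-9]{2}:[0-9]{2}' | head -5
```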
### 3. Smart frame extraction (design mode only)
Don't sample every frame — use scene-change detection:
VIDEO="$(ls video.* | head -1)"
mkdir -p frames
ffmpeg -i "$VIDEO" -vf "select='gt(scene,0.3)',showinfo" \
-vsync vfr -q:v 2 frames/scene-%04d.jpg 2>frames/scene.log
For most 1hr videos this produces 40-150 unique frames. If output is >200, raise threshold to 0.4. If <20, lower to 0.2.
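A quick way to check whether the threshold needs re-tuning; a minimal sketch:

```bash
# Sketch: count scene frames and flag when the 0.3 threshold should be adjusted
N=$(ls frames/scene-*.jpg 2>/dev/null | wc -l | tr -d ' ')
echo "extracted frames: $N"
[ "$N" -gt 200 ] && echo "too many: re-run with gt(scene,0.4)"
[ "$N" -lt 20 ]  && echo "too few:  re-run with gt(scene,0.2)"
```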
### 4. Vision pass on frames (design mode only)
Use the `image` tool in batches of 10-20 frames per call. Structured prompt:
```
Extract for each frame:
- Layout: grid columns, hero/section structure, ATF vs below-fold
- Color palette: 3-5 dominant hex codes
- Typography: hierarchy (H1/H2/body sizes if estimable), serif/sans, weight contrast
- Photography treatment: filter, crop, aspect ratio, subject framing
- Component patterns: card style, button style, navigation
- Spacing rhythm: tight/airy, gutter sizes if estimable
- Anything notable a developer would copy
Output as JSON array, one object per frame.
```
Pick the 5-10 best frames for `frames/keep/`; these become reference attachments on future builds.
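A sketch of the mechanical side (the batch size of 15 and the frame number are assumptions, not requirements):

```bash
# Sketch: chunk frame paths into batches for the vision passes, then stash chosen frames
mkdir -p frames/keep
ls frames/scene-*.jpg | sort | xargs -n 15 echo    # one line per batch of <=15 paths
cp frames/scene-0014.jpg frames/keep/              # example frame; pick the real keepers by eye
```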
### 5. Output: the lessons file
Write `research/yt-${SLUG}-${DATE}.md` with this exact structure:
```markdown
# YT-Learn: <Video Title>
**Source:** <URL>
**Channel / Author:** <name>
**Duration:** <mm:ss>
**Topic tag:** <topic_tag>
**Apply to:** <apply_to>
**Captured:** <date>
## TL;DR (3 lines max)
<what this video teaches, in 3 lines>
## Timestamped claims (≥5)
- `[mm:ss]` <claim verbatim or close paraphrase>
- `[mm:ss]` ...
## Applicable to us (≥3, concrete)
Each item must name a specific Atlas / product / surface and a specific action.
- **<surface>**: <action>. Why: <link to claim above>.
- ...
## Rejected / not applicable
- <claim>: <why we don't apply it>
## Open questions
- ...
## Design spec (only if --mode=design or both)
### Palette
- `#xxxxxx` — used for: ...
### Typography
- ...
### Component patterns
- ...
### Reference frames
- `frames/keep/scene-0014.jpg` — <what it shows>
- ...
```
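The header fields can be pulled from the `--write-info-json` metadata saved in step 1; a sketch, assuming `jq` is available on this host:

```bash
# Sketch: title / channel / duration for the lessons-file header
INFO="$(ls "$WORKDIR"/*.info.json | head -1)"
jq -r '"Title: \(.title)", "Channel: \(.channel // .uploader)", "Duration (s): \(.duration)"' "$INFO"
```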
### 6. Quality gates (sub-agent must pass before reporting done)
- ≥5 timestamped claims, each with a real `[mm:ss]` timestamp.
- ≥3 concrete applications, each tied to a named surface and action.
- No vague platitudes. "Be authentic" → cut. "Use 8-px gutters on listing cards in HSA" → keep.
- If `--mode=design`: ≥3 hex codes extracted, ≥1 typography note, ≥3 component patterns, ≥5 reference frames in `frames/keep/`.
- File saved at the canonical path. Source URL preserved in the file.
- If audio download or transcription failed: report the blocker, don't fabricate.
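A rough self-check before reporting done (the path and regexes are illustrative, not a required tool):

```bash
# Sketch: count gate-relevant lines in the lessons file (path as written in step 5)
FILE="research/yt-${SLUG}-${DATE}.md"
echo "timestamped claims: $(grep -cE '\[[0-9]{1,2}:[0-9]{2}\]' "$FILE")"   # want >= 5
echo "hex codes:          $(grep -cE '#[0-9a-fA-F]{6}' "$FILE")"           # want >= 3 in design mode
```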
### 7. Cyrus review pass (NOT the sub-agent's job)
When the sub-agent reports done:
- Read the lessons file.
- Spot-check 2 timestamps against the transcript (`grep -A2 "<keyword>" *.txt`).
- For each application item: agree, push back, or queue.
- Apply = commit, file edit, or queued task. Not "noted."
- If a lesson changes how I operate (a rule, not a one-off), propose adding to MEMORY.md and wait for OC approval.
## Spawn template
```
sessions_spawn runtime=subagent task="""
Run the youtube-learn skill on:
URL: <url>
Topic tag: <slug>
Apply to: <surface>
Mode: <transcript|design|both>
Read ~/.openclaw/workspace/skills/youtube-learn/SKILL.md for the pipeline and quality gates. Produce the lessons file at the canonical path. Report path + a 3-line summary when done.
"""
## Cost expectations
- 1hr video, transcript-only: ~5-15 min compute, ~$0.20-0.50 in Cyrus's review tokens.
- 1hr video, design-mode: add ~10-20 min for frame extraction + vision passes, +$0.50-1.50 in vision-model spend (depending on how many frame-batches). Still cheap relative to the leverage.
## Reference: tools used
- `yt-dlp` (`/usr/local/bin/yt-dlp`): download.
- `whisper` (`/usr/local/bin/whisper`): transcription, local, no API key.
- `ffmpeg` (`/usr/bin/ffmpeg`): frame extraction with scene detection.
- `image` tool (vision model): batched frame analysis.
- Output dir: `~/.openclaw/workspace/research/yt-<slug>-<date>/`