# youtube-learn
Turn a YouTube video into durable, applicable knowledge. Local-only tooling — near-zero API cost. Sub-agents can run this end-to-end and produce output Cyrus reads directly when applying.
## When to use
- OC sends a video URL and asks for lessons / application.
- Capturing design / visual look-and-feel from a reference site walkthrough or build tutorial (use `--design-mode`).
- Building a research file before a sprint (e.g. before a new Atlas build, before a TikTok push).
## When NOT to use
- Quick fact lookup from a video → use `web_search` or read the description/comments instead.
- Live-streamed content with no archived recording.
## Inputs (the spawn prompt MUST supply these)
| Field | Required | Example |
|---|---|---|
| `url` | yes | `https://youtube.com/watch?v=...` |
| `topic_tag` | yes | `tiktok-playbook`, `directory-design`, `ai-stack` |
| `apply_to` | yes | "the directory-builder skill", "next Atlas build", "TikTok content for @cyrusnorthstarf" |
| `mode` | optional | `transcript` (default) \| `design` \| `both` |
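The pipeline below reads these inputs as shell variables. A minimal sketch of that mapping (`URL` and `TOPIC_TAG` appear verbatim in step 1; `MODE` is illustrative, since the steps only branch on it in prose):

```bash
# Sketch: spawn-prompt inputs as the shell variables the pipeline expects
URL="https://youtube.com/watch?v=..."
TOPIC_TAG="tiktok-playbook"
MODE="transcript"   # or: design | both
```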
## Pipeline
### 1. Fetch audio (always)
SLUG="$(echo "$TOPIC_TAG" | tr -cd '[:alnum:]-')"
DATE="$(date +%Y-%m-%d)"
WORKDIR="$HOME/.openclaw/workspace/research/yt-${SLUG}-${DATE}"
mkdir -p "$WORKDIR"
yt-dlp -f "bestaudio[ext=m4a]/bestaudio" \
-o "$WORKDIR/audio.%(ext)s" \
--write-info-json --write-thumbnail \
"$URL"
If `--design-mode` or `--mode=design`/`both`, also pull video:
yt-dlp -f "bestvideo[height<=720]+bestaudio/best[height<=720]" \
-o "$WORKDIR/video.%(ext)s" "$URL"
### 2. Transcribe locally
cd "$WORKDIR"
AUDIO="$(ls audio.* | head -1)"
whisper "$AUDIO" --model base --output_format txt --output_format srt --language en
Use `--model small` if the domain has heavy jargon. `base` is the default: roughly 1.5x realtime on this host, good enough for content extraction.
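The `.srt` output is handy later for anchoring timestamped claims. A rough sketch, assuming whisper wrote `audio.srt` next to the audio ("pricing" is just an example keyword):

```bash
# Sketch: list SRT cue start times near a keyword; trim the leading hours to get [mm:ss]
grep -B1 -i "pricing" audio.srt | grep -Eo '^[0-9]{2}:[0-9]{2}:[0-9]{2}' | head -5
```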
### 3. Smart frame extraction (design mode only)
Don't sample every frame — use scene-change detection:
VIDEO="$(ls video.* | head -1)"
mkdir -p frames
ffmpeg -i "$VIDEO" -vf "select='gt(scene,0.3)',showinfo" \
-vsync vfr -q:v 2 frames/scene-%04d.jpg 2>frames/scene.log
For most 1hr videos this produces 40-150 unique frames. If output is >200, raise threshold to 0.4. If <20, lower to 0.2.
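A quick way to check whether the threshold needs re-tuning; a minimal sketch:

```bash
# Sketch: count scene frames and flag when the 0.3 threshold should be adjusted
N=$(ls frames/scene-*.jpg 2>/dev/null | wc -l | tr -d ' ')
echo "extracted frames: $N"
[ "$N" -gt 200 ] && echo "too many: re-run with gt(scene,0.4)"
[ "$N" -lt 20 ]  && echo "too few:  re-run with gt(scene,0.2)"
```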
### 4. Vision pass on frames (design mode only)
Use the `image` tool in batches of 10-20 frames per call. Structured prompt:
```
Extract for each frame:
- Layout: grid columns, hero/section structure, ATF vs below-fold
- Color palette: 3-5 dominant hex codes
- Typography: hierarchy (H1/H2/body sizes if estimable), serif/sans, weight contrast
- Photography treatment: filter, crop, aspect ratio, subject framing
- Component patterns: card style, button style, navigation
- Spacing rhythm: tight/airy, gutter sizes if estimable
- Anything notable a developer would copy
Output as JSON array, one object per frame.
```
Pick the 5-10 best frames for `frames/keep/`; these become reference attachments on future builds.
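A sketch of the mechanical side (the batch size of 15 and the frame number are assumptions, not requirements):

```bash
# Sketch: chunk frame paths into batches for the vision passes, then stash chosen frames
mkdir -p frames/keep
ls frames/scene-*.jpg | sort | xargs -n 15 echo    # one line per batch of <=15 paths
cp frames/scene-0014.jpg frames/keep/              # example frame; pick the real keepers by eye
```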
### 5. Output: the lessons file
Write `research/yt-${SLUG}-${DATE}.md` with this exact structure:
```markdown
# YT-Learn: <Video Title>
**Source:** <URL>
**Channel / Author:** <name>
**Duration:** <mm:ss>
**Topic tag:** <topic_tag>
**Apply to:** <apply_to>
**Captured:** <date>
## TL;DR (3 lines max)
<what this video teaches, in 3 lines>
## Timestamped claims (≥5)
- `[mm:ss]` <claim verbatim or close paraphrase>
- `[mm:ss]` ...
## Applicable to us (≥3, concrete)
Each item must name a specific Atlas / product / surface and a specific action.
- **<surface>**: <action>. Why: <link to claim above>.
- ...
## Rejected / not applicable
- <claim>: <why we don't apply it>
## Open questions
- ...
## Design spec (only if --mode=design or both)
### Palette
- `#xxxxxx` — used for: ...
### Typography
- ...
### Component patterns
- ...
### Reference frames
- `frames/keep/scene-0014.jpg` — <what it shows>
- ...
```
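The header fields can be pulled from the `--write-info-json` metadata saved in step 1; a sketch, assuming `jq` is available on this host:

```bash
# Sketch: title / channel / duration for the lessons-file header
INFO="$(ls "$WORKDIR"/*.info.json | head -1)"
jq -r '"Title: \(.title)", "Channel: \(.channel // .uploader)", "Duration (s): \(.duration)"' "$INFO"
```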
### 6. Quality gates (sub-agent must pass before reporting done)
- ≥5 timestamped claims, each with a real `[mm:ss]` timestamp.
- ≥3 concrete applications, each tied to a named surface and action.
- No vague platitudes. "Be authentic" → cut. "Use 8-px gutters on listing cards in HSA" → keep.
- If `--mode=design`: ≥3 hex codes extracted, ≥1 typography note, ≥3 component patterns, ≥5 reference frames in `frames/keep/`.
- File saved at the canonical path. Source URL preserved in the file.
- If audio download or transcription failed: report the blocker, don't fabricate.
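A rough self-check before reporting done (the path and regexes are illustrative, not a required tool):

```bash
# Sketch: count gate-relevant lines in the lessons file (path as written in step 5)
FILE="research/yt-${SLUG}-${DATE}.md"
echo "timestamped claims: $(grep -cE '\[[0-9]{1,2}:[0-9]{2}\]' "$FILE")"   # want >= 5
echo "hex codes:          $(grep -cE '#[0-9a-fA-F]{6}' "$FILE")"           # want >= 3 in design mode
```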
### 7. Cyrus review pass (NOT the sub-agent's job)
When the sub-agent reports done:
- Read the lessons file.
- Spot-check 2 timestamps against the transcript (`grep -A2 "<keyword>" *.txt`).
- For each application item: agree, push back, or queue.
- Apply = commit, file edit, or queued task. Not "noted."
- If a lesson changes how I operate (a rule, not a one-off), propose adding to MEMORY.md and wait for OC approval.
## Spawn template
```
sessions_spawn runtime=subagent task="""
Run the youtube-learn skill on:
URL: <url>
Topic tag: <slug>
Apply to: <surface>
Mode: <transcript|design|both>
Read ~/.openclaw/workspace/skills/youtube-learn/SKILL.md for the pipeline and quality gates. Produce the lessons file at the canonical path. Report path + a 3-line summary when done.
"""
## Cost expectations
- 1hr video, transcript-only: ~5-15 min compute, ~$0.20-0.50 in Cyrus's review tokens.
- 1hr video, design-mode: add ~10-20 min for frame extraction + vision passes, +$0.50-1.50 in vision-model spend (depending on how many frame-batches). Still cheap relative to the leverage.
## Reference: tools used
- `yt-dlp` (`/usr/local/bin/yt-dlp`): download.
- `whisper` (`/usr/local/bin/whisper`): transcription, local, no API key.
- `ffmpeg` (`/usr/bin/ffmpeg`): frame extraction with scene detection.
- `image` tool (vision model): batched frame analysis.
- Output dir: `~/.openclaw/workspace/research/yt-<slug>-<date>/`