Developer Journal
A chronological record of development decisions, discoveries, and lessons learned while building Claude Studio Producer.
Recent Updates
Feb 15, 2026 - YouTube Publishing
I spent the last week playing with OpenClaw and even had my new agent, Lilit, start working with the codebase. Some of the most recent updates, like getting the OpenTTS provider working and YouTube publishing working, were courtesy of Lilit.
I think the most interesting aspect of this was being able to teach Lilit to use the studio CLI; now I can just ask for a podcast about a given topic and I'll get one.
Yesterday I wanted to learn about the latest advances in memory, and she provided two papers from 2026. I picked them and said make a podcast for each. One was very long, 17 pages, and something was causing the subagent to fail while extracting the PDF, so she made a GitHub issue about it. I probably need to figure out how to give that agent a proper identity, like when Claude Code makes an update and you see that we committed together, rather than it coming up as an issue from me.
Anyway, the second paper, about "FadeMem", had no such issue. Here's the video: https://youtu.be/eToEeH0yz4o
One key takeaway is that this was produced by me just having a conversation with the agent, Lilit, and suggesting a more entertaining script: less serious, more John Oliver, more Jon Stewart. The other thing was that I ran out of ElevenLabs credits, so Lilit asked me if I wanted to try OpenTTS. I said sure. It thrashed a little bit and chewed through some Opus credits, but it figured things out and published the video using the Onyx voice… while I had dinner. Pretty neat, actually.
Prior to that interaction, though, I produced this video myself using the CLI, from a PDF about LLM hallucinations; it shows off some other advances in the studio's tooling.
This video is a culmination of several advances:
- karaoke-style text renderings
- more selective image inputs (oh yeah, we have a new Wikimedia provider!)
- the knowledge base has better alignment with the content, and we're also timeline-aware, so the spoken words of the script are better at triggering relevant visuals
- there’s probably more, but I can’t recall right now because it has been a pretty intense week!
I’ll see if Lilit cares to chime in on that. No doubt she will.
Oh! Classic burying the lede: you can just have OpenClaw (or your agent of choice) drive this studio and generate videos from whatever source content you want. The CLI is feature-rich enough that agents can tinker with it and make a wide variety of content about your source material, so my focus on science papers was purely a self-imposed limitation. GLHF!
Feb 9, 2026 - Content-Aware Document Classification
The KB ingestion pipeline was treating all documents identically, which caused metadata pollution — author affiliations and university names were leaking into key themes. Now a ContentClassifier runs before the LLM to identify document type and structural zones.
What changed:
- Pre-LLM classification: Heuristics on font sizes, positions, and text patterns detect document type (scientific paper, news article, etc.) and identify zones (front matter, body, back matter)
- Zone-aware topic filtering: Blocks in metadata zones (affiliations, author bios) no longer produce topics. is_theme_candidate() catches institutional and venue names.
- Chunked LLM analysis: Large documents now classify blocks in batches of ~30, avoiding output token truncation. Truncated JSON repair added as a safety net.
The result: cleaner knowledge graphs with themes that reflect actual content, not bibliographic metadata.
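A toy sketch of the zone-aware filtering idea, in Python (the patterns, zone names, and word-count threshold below are my illustration, not the actual heuristics):

# Hypothetical sketch - real zone names and patterns live in the ContentClassifier.
import re

METADATA_ZONES = {"front_matter", "back_matter"}  # affiliations, bios, references

# Patterns that suggest institutional/venue names rather than content themes.
INSTITUTION_RE = re.compile(
    r"\b(university|institute|laboratory|department|conference|proceedings)\b",
    re.IGNORECASE,
)

def is_theme_candidate(text: str, zone: str) -> bool:
    """Return True if a text block may contribute a key theme."""
    if zone in METADATA_ZONES:        # metadata zones never produce topics
        return False
    if INSTITUTION_RE.search(text):   # catch affiliation/venue leakage
        return False
    return len(text.split()) >= 4     # short fragments rarely carry themes

print(is_theme_candidate("Department of Computer Science, Example University", "front_matter"))  # False
print(is_theme_candidate("We study memory decay in long-context agents", "body"))                # True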
Feb 7, 2026 - Content Model Expansion
Extended StructuredScript with content-agnostic vocabulary and source attribution for broader use cases beyond scientific podcasts.
Content-Agnostic Intent Vocabulary: Replaced paper-specific intents (METHODOLOGY, KEY_FINDING, etc.) with 19 universal intents that work across content types:
- Structural: INTRO, TRANSITION, RECAP, OUTRO
- Exposition: CONTEXT, EXPLANATION, DEFINITION, NARRATIVE
- Evidence: CLAIM, EVIDENCE, DATA_WALKTHROUGH, FIGURE_REFERENCE
- Analysis: ANALYSIS, COMPARISON, COUNTERPOINT, SYNTHESIS
- Editorial: COMMENTARY, QUESTION, SPECULATION
Source Attribution: New SourceType (PAPER, NEWS, DATASET, GOVERNMENT, TRANSCRIPT, NOTE, ARTIFACT, URL) and SourceAttribution models track content provenance with confidence scores.
Variant/Perspective Support: perspective field on segments and scripts enables bias analysis workflows (left/right news variants sharing same source attributions).
Backward Compatibility: Intent mapping preserves existing scripts:
- BACKGROUND → CONTEXT
- METHODOLOGY → EXPLANATION
- KEY_FINDING → CLAIM
- FIGURE_WALKTHROUGH → FIGURE_REFERENCE
- DATA_DISCUSSION → DATA_WALKTHROUGH
This enables news comparison, multi-source synthesis, and policy analysis workflows while maintaining compatibility with the existing podcast training pipeline.
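For a concrete picture, here's roughly what the vocabulary and migration mapping could look like (the enum members come from the lists above; the class and function names are assumptions, not the actual StructuredScript internals):

from enum import Enum

class Intent(str, Enum):
    # Structural
    INTRO = "intro"; TRANSITION = "transition"; RECAP = "recap"; OUTRO = "outro"
    # Exposition
    CONTEXT = "context"; EXPLANATION = "explanation"; DEFINITION = "definition"; NARRATIVE = "narrative"
    # Evidence
    CLAIM = "claim"; EVIDENCE = "evidence"; DATA_WALKTHROUGH = "data_walkthrough"; FIGURE_REFERENCE = "figure_reference"
    # Analysis
    ANALYSIS = "analysis"; COMPARISON = "comparison"; COUNTERPOINT = "counterpoint"; SYNTHESIS = "synthesis"
    # Editorial
    COMMENTARY = "commentary"; QUESTION = "question"; SPECULATION = "speculation"

# Legacy paper-specific intents collapse onto the universal vocabulary.
LEGACY_INTENT_MAP = {
    "BACKGROUND": Intent.CONTEXT,
    "METHODOLOGY": Intent.EXPLANATION,
    "KEY_FINDING": Intent.CLAIM,
    "FIGURE_WALKTHROUGH": Intent.FIGURE_REFERENCE,
    "DATA_DISCUSSION": Intent.DATA_WALKTHROUGH,
}

def migrate_intent(label: str) -> Intent:
    """Upgrade a legacy intent label; pass universal labels through."""
    return LEGACY_INTENT_MAP.get(label) or Intent[label]

print(migrate_intent("KEY_FINDING").name)  # CLAIM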
Feb 7, 2026 - DoP and Unified Production (Phase 4)
Implemented Phase 4 of the Unified Production Architecture: the Director of Photography (DoP) module and ContentLibrarian integration.
DoP Module (core/dop.py):
- Assigns visual display modes to script segments (figure_sync, dall_e, carry_forward, text_only)
- Respects budget tier ratios for proportional image allocation
- Prioritizes segments by importance score for DALL-E generation
- Links to existing approved assets in ContentLibrary
- Generates visual direction hints for image prompts
- 100% deterministic - no LLM calls needed
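A minimal sketch of how a deterministic assignment like this can work (the Segment fields and function names are illustrative; the tier ratios match the proportional-budget entry further down):

from dataclasses import dataclass

TIER_RATIOS = {"low": 0.10, "medium": 0.27, "high": 0.55, "full": 1.00}

@dataclass
class Segment:
    text: str
    importance: float
    kb_figure: str | None = None   # path to an approved KB figure, if any
    is_transition: bool = False

def assign_modes(segments: list[Segment], tier: str) -> list[tuple[str, str]]:
    budget = round(TIER_RATIOS[tier] * len(segments))     # proportional image budget
    ranked = sorted(segments, key=lambda s: s.importance, reverse=True)
    candidates = [s for s in ranked if not s.kb_figure and not s.is_transition]
    dalle_ids = {id(s) for s in candidates[:budget]}      # highest-importance segments win
    plan = []
    for seg in segments:
        if seg.kb_figure:
            mode = "figure_sync"      # sync an existing KB figure to narration
        elif id(seg) in dalle_ids:
            mode = "dall_e"           # new generation needed
        elif seg.is_transition:
            mode = "text_only"        # transitions need no image
        else:
            mode = "carry_forward"    # reuse the previous visual
        plan.append((seg.text, mode))
    return plan

There are no LLM calls anywhere in the loop, which is what makes the module fully deterministic and cheap to rerun on every regeneration.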
Integration (cli/produce_video.py):
- ContentLibrarian now wired into video production pipeline
- StructuredScript is the source of truth when available
- DoP replaces manual budget allocation logic
- Visual planning now shows figure_sync, dall_e, carry_forward, and text_only modes
- Asset reuse across runs - approved images aren’t regenerated
Example Output:
DoP visual assignment:
figure_sync: 3 segments (KB figures)
dall_e: 5 segments (new generation needed)
carry_forward: 7 segments (reuse previous)
text_only: 2 segments (transitions)
Estimated cost: $0.40 (5 DALL-E images)
The pipeline is now unified - both produce and produce-video commands share the same StructuredScript and ContentLibrary data layer. This enables incremental regeneration and asset reuse across runs.
Test Coverage: 116 tests passing (81 unit + 35 integration) covering all phases of the unified architecture, provider integrations, and end-to-end workflows.
Feb 7, 2026 - Training Outputs StructuredScript (Phase 3)
Training pipeline now outputs structured scripts alongside flat text files.
What Changed:
- After generating _script.txt, trainer parses it with StructuredScript.from_script_text()
- Enriches figure inventory with captions from the document graph
- Saves {pair_id}_structured_script.json per training pair
This bridges training and production: video production can now load structured scripts directly instead of re-parsing flat text. Figure references in scripts are pre-resolved with full metadata.
Feb 7, 2026 - Unified Production Architecture (Phase 1)
Implemented Phase 1 of the UNIFIED_PRODUCTION_ARCHITECTURE.md spec, establishing new data models as the foundation for the unified pipeline.
StructuredScript Model: Single source of truth replacing flat _script.txt files. The from_script_text() parser extracts Figure N references and section boundaries from existing scripts, enabling structured access to script content.
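A rough sketch of the kind of parsing from_script_text() performs, per that description (the regexes, section-header convention, and return shape are assumptions):

import re

FIGURE_RE = re.compile(r"\bFigure\s+(\d+)\b")
SECTION_RE = re.compile(r"^##\s+(.+)$")   # assume markdown-ish section headers

def parse_script(text: str) -> list[dict]:
    sections, current = [], {"title": "INTRO", "paragraphs": [], "figures": []}
    for line in text.splitlines():
        if (m := SECTION_RE.match(line)):           # section boundary
            sections.append(current)
            current = {"title": m.group(1), "paragraphs": [], "figures": []}
        elif line.strip():
            current["paragraphs"].append(line.strip())
            current["figures"] += [int(n) for n in FIGURE_RE.findall(line)]
    sections.append(current)
    return sections

demo = "## Methods\nAs shown in Figure 6, the filter fuses both sensors."
print(parse_script(demo)[-1]["figures"])  # [6]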
ContentLibrary Model: Persistent asset registry with approval tracking. Includes from_asset_manifest_v1() for migrating existing asset manifests to the new format. Tracks image/audio assets with generation status and approval state.
All 55 unit tests passing.
Feb 7, 2026 - Proportional Budgets & Audio Source Fix
Fixed architectural issues in the video production pipeline:
1. Proportional Budget Tiers
Previously, budget tiers used absolute image counts (e.g., “medium = 40 images”). This caused inconsistent quality when testing with scene subsets.
Now tiers use ratios:
- low: 10% of scenes get images
- medium: 27% of scenes get images
- high: 55% of scenes get images
- full: 100% of scenes get images
This ensures consistent quality across runs. Testing 5 scenes with medium tier now produces ~1 image (not 5), matching what would happen proportionally in a full production.
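The arithmetic, as a sketch (the minimum-of-one-image floor is my assumption for tiny test runs):

import math

TIER_RATIOS = {"low": 0.10, "medium": 0.27, "high": 0.55, "full": 1.00}

def image_budget(num_scenes: int, tier: str) -> int:
    if num_scenes == 0:
        return 0
    return max(1, math.floor(TIER_RATIOS[tier] * num_scenes))

print(image_budget(5, "medium"))    # 1  - a 5-scene test, not 40 like the old absolute tier
print(image_budget(45, "medium"))   # 12 - the same share in a full production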
2. Audio Uses Generated Script
Audio was incorrectly generated from the original Whisper transcription (“Welcome to Journal Club…”) instead of the new script (“Welcome back to another deep dive…”).
Fixed: Audio now comes from _script.txt paragraphs, not aligned_segments from the original transcription.
3. Audio Respects --limit Parameter
Audio was generating all 45 paragraphs even with --limit 5. Now slices paragraphs proportionally to match scene range.
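Sketched, the slice looks something like this (function and parameter names are illustrative):

def audio_paragraphs(paragraphs: list[str], limit: int | None, total_scenes: int) -> list[str]:
    if not limit or limit >= total_scenes:
        return paragraphs
    share = round(len(paragraphs) * limit / total_scenes)   # proportional slice
    return paragraphs[:max(1, share)]

paras = [f"paragraph {i}" for i in range(45)]
print(len(audio_paragraphs(paras, 5, 45)))   # 5, not 45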
4. Clear Visual Source Display
Scene list now distinguishes between:
- DALL-E - gets unique generated image
- shared - shares image with primary scene
- text only - no image generated
# Output now shows which scene gets the image:
# UAV positioning intro DALL-E Ken Burns
# multi-sensor info intro shared Ken Burns
# Kalman filter intro shared Ken Burns
Feb 6, 2026 (late evening) - Scene-by-Scene Audio Generation
Added audio generation directly to produce-video, fixing a key architectural issue.
The Problem: Training was generating a full script, then trying to send it all to ElevenLabs at once. This hit character limits and was wasteful - training doesn’t need audio, only production does.
The Solution:
- Training generates scripts only (no audio)
- produce-video generates audio scene-by-scene during production
- Each scene gets its own .mp3 file
- Avoids ElevenLabs character limits by chunking naturally
- Asset manifest tracks image_path + audio_path per scene
# Produce video with scene-by-scene audio (default: enabled)
claude-studio produce-video -t trial_000 --budget medium --live --voice lily
# Or specify a different voice
claude-studio produce-video -t trial_000 --budget medium --live --voice rachel
Output structure:
artifacts/video_production/20260206_204449/
├── images/
│ ├── scene_000.png
│ └── scene_001.png
├── audio/
│ ├── scene_000.mp3
│ └── scene_001.mp3
├── visual_plans.json
└── asset_manifest.json # Links images + audio per scene
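For reference, a guess at the manifest's shape (the real schema may differ; field names are assumptions):

import json

manifest = {
    "scenes": [
        {"scene_id": 0,
         "image_path": "images/scene_000.png",
         "audio_path": "audio/scene_000.mp3"},
        {"scene_id": 1,
         "image_path": "images/scene_001.png",
         "audio_path": "audio/scene_001.mp3"},
    ],
}
with open("asset_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)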
Feb 6, 2026 (evening) - Figure-Aware Script Generation
Fixed a key architectural issue: training now knows about figures before generating scripts.
The Problem: Scripts were generated without knowing what figures existed, then we tried to match figures afterward via keyword guessing.
The Solution:
- Training extracts figures from the document graph
- Figure captions/descriptions are passed to Claude in the prompt
- Scripts now explicitly reference figures: “As shown in Figure 6…”
- Video production does exact matching instead of guessing
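A sketch of the prompt-side change (the wording and function name are illustrative):

def build_script_prompt(topic: str, figures: dict[int, str]) -> str:
    # Captions come from the document graph, so the model can cite figures
    # explicitly instead of us keyword-guessing matches after the fact.
    figure_lines = "\n".join(f"Figure {n}: {cap}" for n, cap in sorted(figures.items()))
    return (
        f"Write a podcast script about: {topic}\n\n"
        f"The source document contains these figures:\n{figure_lines}\n\n"
        "Reference figures explicitly (e.g., 'As shown in Figure 6...') "
        "where they support the narration."
    )

print(build_script_prompt("UAV positioning", {6: "Kalman filter fusion results"}))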
Also documented the kb inspect command - shows beautiful quality reports:
claude-studio kb inspect my-project --quality
# Output shows atom distribution with bar charts:
# equation █████░░░░░ 44 (26%)
# paragraph ████░░░░░░ 38 (23%)
# figure ███░░░░░░░ 26 (16%)
Feb 6, 2026 - Training Pipeline & Video Production Integration
Big milestone: the podcast training pipeline and video production workflow are fully integrated!
Training Pipeline (claude-studio training run):
- Transcribes reference podcasts using Whisper
- Classifies segments (INTRO, BACKGROUND, METHODOLOGY, KEY_FINDING, etc.)
- Extracts style profiles for improved script generation
Video Production (claude-studio produce-video):
- Takes training output and produces explainer videos
- Budget tier system (micro=$0 to full=$15+)
- Scene importance scoring allocates images to high-impact moments
- KB figures from PDFs appear in videos synced to narration
claude-studio produce-video -t trial_000 --show-tiers
claude-studio produce-video -t trial_000 --budget medium --kb my-project --live
Jan 30, 2026 - Security Hardening
Read some alarming posts about Clawdbot, so I did a quick security check. Added __repr__ to Config classes to prevent API key leaks in debug output.
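The pattern, roughly (class and field names are illustrative):

from dataclasses import dataclass

@dataclass
class ProviderConfig:
    name: str
    api_key: str

    def __repr__(self) -> str:
        # Show only a short prefix so logs and debuggers never leak the key.
        masked = self.api_key[:4] + "…" if self.api_key else "<unset>"
        return f"ProviderConfig(name={self.name!r}, api_key={masked!r})"

print(ProviderConfig("elevenlabs", "sk-live-abcdef123456"))
# ProviderConfig(name='elevenlabs', api_key='sk-l…')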
Added keychain import feature:
claude-studio secrets import .env
This imports all API keys from .env into your OS keychain, allowing secure storage without environment variables.
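Under the hood it could be as simple as this sketch using the cross-platform keyring package (the service name and .env parsing here are assumptions):

import keyring  # pip install keyring - backs onto the OS keychain

def import_env(path: str = ".env", service: str = "claude-studio") -> None:
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, malformed lines
            key, _, value = line.partition("=")
            keyring.set_password(service, key.strip(), value.strip().strip('"'))

import_env()
print(keyring.get_password("claude-studio", "OPENAI_API_KEY"))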
Jan 28, 2026 - DALL-E Provider
Stayed focused on core mission instead of getting distracted by Remotion graphics (saving that for later).
Successfully onboarded DALL-E using the provider onboarding agent:
claude-studio provider onboard -n dalle -t image --docs-url https://platform.openai.com/docs/guides/images
The system now supports:
- DALL-E 3 (high quality, 1024x1024)
- DALL-E 2 (faster, cheaper, multiple sizes)
This enables the DALL-E → Runway pipeline for image-to-video generation.
Jan 26, 2026 - Multi-Provider Pipelines
Completed the pipeline capability to chain providers:
- DALL-E generates seed image from text
- Runway transforms image to video
This is a key architectural milestone - providers can now feed into each other.
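The contract is small: each provider's output type has to match the next one's input type. A minimal sketch (the Protocol interfaces are my illustration, not the actual provider base classes):

from typing import Protocol

class ImageProvider(Protocol):
    def generate(self, prompt: str) -> str: ...      # returns an image path

class VideoProvider(Protocol):
    def animate(self, image_path: str) -> str: ...   # returns a video path

def image_to_video(img: ImageProvider, vid: VideoProvider, prompt: str) -> str:
    seed = img.generate(prompt)   # e.g., DALL-E: text -> seed image
    return vid.animate(seed)      # e.g., Runway: image -> video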
Jan 23, 2026 - Knowledge Base System
Major feature: Document-to-Video pipeline
- PDF ingestion with PyMuPDF
- Atomic concept extraction from papers/docs
- Knowledge base management CLI
- Generate videos from research papers or documentation
Example workflow:
claude-studio kb create "AI Research" -d "Latest papers on multi-agent systems"
claude-studio kb add "AI Research" --paper paper.pdf
claude-studio kb produce "AI Research" -p "Explain transformer architecture" --style educational
Jan 20, 2026 - Multi-Tenant Memory
Upgraded memory system to support multi-tenant hierarchy:
- SESSION → USER → ORG → PLATFORM
- Namespace isolation and security model
- Learning promotion/demotion based on validation
- Production-ready with Bedrock AgentCore
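A sketch of the namespace idea (the key scheme is assumed, not the AgentCore implementation):

LEVELS = ["SESSION", "USER", "ORG", "PLATFORM"]

def namespace_key(level: str, tenant_id: str, topic: str) -> str:
    assert level in LEVELS
    return f"{level.lower()}/{tenant_id}/{topic}"   # isolation by key prefix

def promote(key: str) -> str:
    """Move a validated learning one level up the hierarchy."""
    level, tenant, topic = key.split("/")
    next_level = LEVELS[LEVELS.index(level.upper()) + 1]
    return namespace_key(next_level, tenant, topic)

print(promote(namespace_key("SESSION", "aaron", "luma-aspect-ratios")))
# user/aaron/luma-aspect-ratios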
Jan 8, 2026 - Memory & Dashboard
- Provider learning system (tips, gotchas, preferences)
- Memory namespace per provider
- Web dashboard for viewing runs and QA scores
Jan 7-8, 2026 - Luma Provider Implementation
First real video provider integration:
- Comprehensive Luma API spec
- Text-to-video without seed images
- Image-to-video with start frames
- Extend/interpolate capabilities
- Aspect ratio mapping
- Full error handling
Jan 7, 2026 - Foundation Sprint
Late night/early morning sprint creating all foundation specs:
- All 7 agent specifications
- System architecture
- Strands integration
- Provider system design
- Audio system tiers
- Testing philosophy
- Docker dev environment
Jan 6, 2026 - Agent Architecture
Initial agent system design:
- ScriptWriter, VideoGenerator, QAVerifier
- Editor, Producer, Critic agents
- Budget-aware competitive pilot system
Jan 9, 2026 - What Is This Even For?
The Vision
This project demonstrates:
- What you can do quickly with Claude
- How to design and implement a working multi-agent workflow
- Using learning/memory systems
- Using rewards and feedback
- Having fun!
The Workflow
A virtual studio where:
- Producer takes your budget and pitch, crafts pilots based on what works and what you can afford
- Script Writer creates scenes knowing the provider’s capabilities and constraints
- Video Generator shoots scenes (parallelizable across providers)
- QA Agents perform technical review (parallelizable)
- Critic assesses overall quality and makes recommendations
- Editor creates Edit Decision List (EDL) for final candidate videos
Studio Reinforcement Learning (StudioRL)
The feedback loop stores learnings in memory for the producer and script writer to leverage. The budget system keeps costs under control, allowing re-runs on promising pilots within budget constraints.
Are We Having Fun Yet?

Prompt: A 15-second story of a developer having a breakthrough: Scene 1 - Wide shot of developer at desk in cozy home office at night, hunched over laptop, frustrated expression, warm desk lamp lighting. Scene 2 - They lean back with a satisfied smile, stretch arms up in victory celebration, coffee cup visible nearby, cinematic triumph moment.
Result: Make it rain coffee…!
Sometimes the AI interprets your vision in unexpected ways. This is part of why we have the QA and Critic agents - to catch these creative interpretations and decide whether they’re happy accidents or need revision.