Knowledge Base Deep Dive
How Claude Studio Producer transforms PDFs into rich, queryable knowledge graphs.
Table of Contents
- Overview
- The Ingestion Pipeline
- What Is an Atom?
- Atom Types
- Real Atom Examples
- Document Graphs
- Content Classification
- Figure Extraction
- The Unified Knowledge Graph
- Cross-Source Linking
- Quality Metrics
- From KB to Production
Overview
When you run cs kb add my-project --paper paper.pdf, a multi-phase pipeline breaks the PDF into atoms — the smallest meaningful units of knowledge. These atoms carry semantic metadata (topics, entities, importance scores, source locations) and are assembled into a DocumentGraph with hierarchy, reading order, and LLM-generated summaries. When multiple papers are added, their atoms merge into a unified KnowledgeGraph with cross-source links and shared topic/entity indices.
For example, the 11-page IEEE paper used throughout this document produces 153 atoms across 14 types, 88 indexed topics, 42 indexed entities, and 11 extracted figures, all queryable, inspectable, and ready for script generation or video production.
The Ingestion Pipeline
PDF File
│
▼
┌─────────────────────────────────────────────────┐
│ Phase 1: PyMuPDF Extraction │
│ ├─ Text blocks (with bbox, font, bold flags) │
│ ├─ Rendered figures (2x zoom, caption-guided) │
│ └─ PDF metadata (title, authors, DOI) │
└─────────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Phase 1.5: Content Classification │
│ ├─ Document type (paper, news, blog, etc.) │
│ ├─ Zone identification (front, body, back) │
│ ├─ Early metadata extraction (institutions) │
│ └─ Extraction rules (what to pull from where) │
└─────────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Phase 2: LLM Semantic Analysis (Claude) │
│ ├─ Block type classification (chunked, ~30/req) │
│ ├─ Topic extraction (1-3 per block) │
│ ├─ Entity extraction (algorithms, systems) │
│ ├─ Importance scoring (0.0-1.0) │
│ ├─ Figure description (Claude Vision) │
│ └─ Document summaries (1-sentence to full) │
└─────────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Phase 3: Graph Assembly │
│ ├─ Build DocumentAtom objects │
│ ├─ Establish hierarchy (sections → paragraphs) │
│ ├─ Set reading order (flow) │
│ ├─ Store figure PNGs to disk │
│ └─ Save DocumentGraph JSON │
└─────────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Phase 4: Knowledge Graph Rebuild │
│ ├─ Merge all sources' atoms │
│ ├─ Build topic_index and entity_index │
│ ├─ Detect cross-source entity links │
│ ├─ Extract key themes (noise-filtered) │
│ └─ Save unified KnowledgeGraph JSON │
└─────────────────────────────────────────────────┘
What Is an Atom?
An atom is the smallest unit of extracted knowledge. Every piece of information from the PDF — a paragraph, a figure, an equation, an author name, a section header — becomes an atom with rich metadata attached.
Atom Fields
| Field | Type | Description |
|---|---|---|
| atom_id | string | Globally unique ID (e.g., doc_a565_atom_014) |
| atom_type | enum | One of 14 types (see below) |
| content | string | The text content or description |
| raw_data | bytes | Image data for figures (stored as PNG, not serialized to JSON) |
| source_page | int | Page number in the original PDF (0-indexed) |
| source_location | (x0,y0,x1,y1) | Bounding box in PDF coordinates |
| topics | list[str] | 1-3 semantic concepts (LLM-extracted) |
| entities | list[str] | Named algorithms, systems, datasets, acronyms |
| relationships | list[str] | Atom IDs this atom references |
| importance_score | float | 0.0-1.0 centrality to the document |
| caption | string | For figures/tables: the caption text |
| figure_number | string | e.g., “FIGURE 1”, “Table 2” |
| data_summary | string | LLM-generated description of visual content |
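As a rough illustration, the fields above could be modeled as a Python dataclass. This is a sketch only, not the actual definition in core/models/document.py: the enum lists just a subset of the 14 types, and the default values and the to_json_dict helper are assumptions made for the example.

```python
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Optional


class AtomType(str, Enum):
    # Subset of the 14 atom types, for illustration only.
    TITLE = "title"
    PARAGRAPH = "paragraph"
    FIGURE = "figure"
    EQUATION = "equation"
    AUTHOR = "author"


@dataclass
class DocumentAtom:
    atom_id: str
    atom_type: AtomType
    content: str
    source_page: int = 0  # 0-indexed page number
    source_location: Optional[tuple] = None  # (x0, y0, x1, y1)
    topics: list = field(default_factory=list)
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)
    importance_score: float = 0.5  # assumed default
    caption: Optional[str] = None
    figure_number: Optional[str] = None
    data_summary: Optional[str] = None
    raw_data: Optional[bytes] = None  # PNG bytes, kept out of JSON

    def to_json_dict(self) -> dict:
        # Hypothetical helper: drop raw_data so the result is JSON-safe.
        data = asdict(self)
        data.pop("raw_data")
        data["atom_type"] = self.atom_type.value
        return data
```

Keeping raw_data out of the serialized form mirrors the note in the table that figure bytes are stored as PNG files rather than in the graph JSON.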
Importance Score Guidelines
| Score | Used For |
|---|---|
| 1.0 | Title, key findings |
| 0.8 | Abstract, section headers, conclusions |
| 0.7 | Figures with key results |
| 0.5 | Body paragraphs, keywords |
| 0.4 | Equations |
| 0.3 | Citations, author info, metadata, boilerplate |
Atom Types
The system recognizes 14 atom types across three categories:
Text Atoms
| Type | Description | Example |
|---|---|---|
| title | Document title | “Precise Positioning Method of UAV…” |
| abstract | Full abstract text | “This study addresses the challenge…” |
| section_header | Section/subsection heading with intro text | “I. INTRODUCTION With the rapid…” |
| paragraph | Body paragraph | Multi-sentence content block |
| quote | Notable quoted text | Direct quotes from sources |
| citation | References, DOIs, licensing info | “Digital Object Identifier 10.1109/…” |
Visual Atoms
| Type | Description | Example |
|---|---|---|
| figure | Extracted figure with AI-generated description | Flowchart of particle filter algorithm |
| chart | Data charts/graphs | Bar charts, line graphs |
| table | Tabular data | Comparison tables |
| equation | Mathematical expressions | “E[Wk] = 0” |
| diagram | Technical diagrams | System architecture diagrams |
Meta Atoms
| Type | Description | Example |
|---|---|---|
| author | Author names and affiliations | “YANGMEI ZHANG, School of Electronic…” |
| date | Publication/submission dates | “Received 26 October 2025, accepted…” |
| keyword | Author-specified keywords | “Dynamic environment, Kalman filter…” |
Real Atom Examples
These are actual atoms from the aerial-vehicle-positioning knowledge base (153 atoms from an 11-page IEEE paper):
Title Atom (importance: 1.0)
{
"atom_id": "doc_a5654cad96dc_atom_002",
"atom_type": "title",
"content": "Precise Positioning Method of Unmanned Aerial Vehicle in Enclosed Environments by Integrating Multi-Sensor Information...",
"source_page": 0,
"source_location": [36.17, 149.18, 449.81, 272.32],
"topics": ["UAV positioning", "multi-sensor fusion", "Kalman filter", "particle filter"],
"entities": ["Kalman Filter", "Particle Filter"],
"importance_score": 1.0
}
Abstract Atom (importance: 0.8)
{
"atom_id": "doc_a5654cad96dc_atom_007",
"atom_type": "abstract",
"content": "ABSTRACT This study addresses the challenge of the precise positioning of Unmanned Aerial Vehicles (UAVs) in enclosed environments...",
"source_page": 0,
"topics": ["UAV positioning", "Kalman filter", "particle filter",
"multi-sensor fusion", "dynamic environment adaptation"],
"entities": ["IKF-PF", "Kalman Filter", "Particle Filter", "ROS",
"Gazebo", "LOS", "NLOS", "GPS"],
"importance_score": 0.8
}
Figure Atom (importance: 0.7, with AI description)
{
"atom_id": "doc_a5654cad96dc_fig_000",
"atom_type": "figure",
"content": "This flowchart illustrates the iterative process of a particle filter algorithm used for UAV positioning in enclosed environments. The process begins with initialization, then cycles through importance sampling, weight normalization, and resampling steps...",
"source_page": 3,
"importance_score": 0.7,
"caption": "FIGURE 1. The principle of the PF algorithm.",
"figure_number": "FIGURE 1",
"data_summary": "This flowchart illustrates the iterative process of a particle filter algorithm..."
}
The content and data_summary are generated by Claude Vision — the actual figure PNG is stored separately in sources/{source_id}/figures/{atom_id}.png.
Body Paragraph Atom (importance: 0.5)
{
"atom_id": "doc_a5654cad96dc_atom_014",
"atom_type": "paragraph",
"content": "delivery, environmental monitoring, emergency rescue, and indoor inspection have become increasingly widespread. The precise positioning capability of UAVs is one of the critical factors...",
"source_page": 1,
"source_location": [36.17, 65.49, 277.38, 614.12],
"topics": ["UAV positioning", "GPS limitations", "multi-sensor fusion"],
"entities": ["GPS", "IMU", "UWB"],
"importance_score": 0.5
}
Equation Atom (importance: 0.4)
{
"atom_id": "doc_a5654cad96dc_atom_023",
"atom_type": "equation",
"content": "E[Wk] = 0,",
"source_page": 2,
"source_location": [110.26, 705.76, 157.88, 717.84],
"topics": ["expected value"],
"importance_score": 0.4
}
Metadata Atom (importance: 0.3, topics suppressed)
{
"atom_id": "doc_a5654cad96dc_atom_004",
"atom_type": "author",
"content": "1School of Electronic Engineering, Xihang University, Xi'an 710077...",
"source_page": 0,
"topics": [],
"entities": ["Xihang University"],
"importance_score": 0.3
}
Note: topics are empty for author/affiliation atoms. The content classifier identifies these as BIOGRAPHICAL zone blocks and suppresses topic extraction to prevent institutional names from polluting the topic index.
Document Graphs
A DocumentGraph wraps all atoms from a single source with structural metadata:
DocumentGraph
├── document_id: "doc_a5654cad96dc"
├── source_path: "/path/to/paper.pdf"
├── title: "Precise Positioning Method of UAV..."
├── authors: ["YANGMEI ZHANG", "YANG BI", ...]
├── page_count: 11
│
├── atoms: {atom_id → DocumentAtom} # All 153 atoms
│
├── hierarchy: # Section → children
│ ├── atom_011 (I. INTRODUCTION) → [atom_014, atom_015, ...]
│ ├── atom_025 (II. METHODS) → [atom_026, atom_027, ...]
│ └── atom_080 (III. RESULTS) → [atom_081, atom_082, ...]
│
├── flow: [atom_000, atom_001, ..., atom_152] # Reading order
│
├── one_sentence: "This study proposes an IKF-PF fusion model..."
├── one_paragraph: "The paper addresses precise UAV positioning..."
├── full_summary: "..." (multi-paragraph)
│
├── figures: [fig_000, fig_001, ..., fig_010] # 11 figure atom IDs
├── tables: []
└── key_quotes: []
Hierarchy
The hierarchy tracks parent-child relationships between atoms. When the LLM identifies a section_header, all subsequent paragraph, equation, quote, and figure atoms become its children until the next section header. This lets you query “give me everything in the Methods section” programmatically:
methods_atoms = doc_graph.get_section("Methods")
# Returns all child atoms under the Methods header
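The grouping rule described above can be sketched in a few lines. This is an illustrative reconstruction, not the project's code: build_hierarchy and the (atom_id, atom_type) tuple input are hypothetical, and atoms that appear before the first section header are simply left unparented here.

```python
# Atom types that attach to the most recent section header (per the text above).
CHILD_TYPES = {"paragraph", "equation", "quote", "figure"}


def build_hierarchy(flow):
    """Map each section_header atom ID to its child atom IDs.

    flow: list of (atom_id, atom_type) tuples in reading order.
    """
    hierarchy = {}
    current_header = None
    for atom_id, atom_type in flow:
        if atom_type == "section_header":
            # A new header closes the previous section and opens a new one.
            current_header = atom_id
            hierarchy[current_header] = []
        elif current_header is not None and atom_type in CHILD_TYPES:
            hierarchy[current_header].append(atom_id)
    return hierarchy
```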
Summaries
The LLM generates three levels of summary from an abbreviated context (first 15 + last 10 blocks, ~15KB):
- one_sentence: Single-sentence thesis/finding
- one_paragraph: Key contributions and approach
- full_summary: Comprehensive multi-paragraph overview
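A minimal sketch of how such an abbreviated context might be assembled is shown below. The function name, the blank-line separator, and the hard byte cap are assumptions; the source only states that the first 15 and last 10 blocks are used and that the result is roughly 15KB.

```python
def abbreviated_context(blocks, head=15, tail=10, max_bytes=15_000):
    """Join the first `head` and last `tail` text blocks into one string.

    blocks: list of block texts in reading order.
    """
    if len(blocks) <= head + tail:
        # Short documents fit entirely; avoid duplicating blocks.
        selected = list(blocks)
    else:
        selected = list(blocks[:head]) + list(blocks[-tail:])
    text = "\n\n".join(selected)
    # Assumed safety cap: truncate on a byte budget, dropping any
    # partial multi-byte character left at the cut.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
```

Sampling both ends of the document keeps the abstract and introduction as well as the conclusion in view, which is usually where a paper states its thesis and findings.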
Content Classification
Before sending anything to the LLM, a ContentClassifier performs fast, deterministic pre-analysis. This is critical for quality — it prevents metadata from contaminating the topic index.
Document Type Detection
The classifier checks signals in priority order:
| Signal | Detected Type | Confidence |
|---|---|---|
| DOI or arXiv pattern | SCIENTIFIC_PAPER | 0.9 |
| “Abstract” header | SCIENTIFIC_PAPER | 0.7 |
| “References” section near end | SCIENTIFIC_PAPER | 0.6 |
| Dateline (CITY, Month Day) | NEWS_ARTICLE | 0.8 |
| AP/Reuters/byline pattern | NEWS_ARTICLE | 0.7 |
| Dataset keywords (columns, schema) | DATASET_README | 0.7 |
| Multiple equations (>3) | SCIENTIFIC_PAPER | 0.5 |
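One way to picture priority-ordered detection is a list of (pattern, type, confidence) entries where the first match wins. The sketch below covers only four of the signals in the table, and every regular expression, the detect_document_type name, and the UNKNOWN fallback are assumptions for illustration rather than the classifier's real rules.

```python
import re

# Ordered signals: earlier entries take priority. Patterns are illustrative.
SIGNALS = [
    (re.compile(r"\b10\.\d{4,9}/\S+|arXiv:\d{4}\.\d{4,5}"),
     "SCIENTIFIC_PAPER", 0.9),
    (re.compile(r"^\s*abstract\b", re.IGNORECASE | re.MULTILINE),
     "SCIENTIFIC_PAPER", 0.7),
    (re.compile(r"^[A-Z]{2,}(?: [A-Z]{2,})*, [A-Z][a-z]+\.? \d{1,2}\b",
                re.MULTILINE),
     "NEWS_ARTICLE", 0.8),
    (re.compile(r"\b(columns|schema)\b", re.IGNORECASE),
     "DATASET_README", 0.7),
]


def detect_document_type(text):
    """Return (document_type, confidence) for the first matching signal."""
    for pattern, doc_type, confidence in SIGNALS:
        if pattern.search(text):
            return doc_type, confidence
    return "UNKNOWN", 0.0  # assumed fallback when nothing matches
```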
Zone Identification
For a scientific paper, the classifier divides the document into zones:
Page 0-1: FRONT_MATTER (title, authors, abstract, affiliations)
Page 1-8: BODY (introduction through conclusion)
Page 8-9: BIOGRAPHICAL (author bios, institution details)
Page 9-11: BACK_MATTER (references, acknowledgments)
How Zones Affect Extraction
| Zone | Topics Extracted? | Entities Extracted? | Treated As |
|---|---|---|---|
| BODY | Yes | Yes | Main content |
| FRONT_MATTER | Yes | Yes | Key context |
| BIOGRAPHICAL | No | Limited | Metadata only |
| BACK_MATTER | No | No | References |
| BOILERPLATE | No | No | Ignored |
This is why author affiliations like “Northwestern Polytechnical University” appear in entities but never in topics — the classifier knows they’re biographical metadata, not paper content.
Theme Candidate Filtering
Even within body zones, an is_theme_candidate() filter rejects:
- Institutional names: university, institute, department, laboratory, school, hospital
- Journal/venue names: IEEE, ACM, Springer, workshop, proceedings, symposium
- Too-short terms: Single words under 6 characters
- Pure numbers: Digit-only strings
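The four rejection rules above could look roughly like this. The real is_theme_candidate() signature and matching strategy are not shown in this document, so the whole-word keyword matching and the str-to-bool signature here are assumptions.

```python
INSTITUTION_WORDS = {"university", "institute", "department",
                     "laboratory", "school", "hospital"}
VENUE_WORDS = {"ieee", "acm", "springer", "workshop",
               "proceedings", "symposium"}


def is_theme_candidate(term: str) -> bool:
    """Return True if the term survives the noise rules listed above."""
    words = term.lower().split()
    if not words:
        return False
    if term.strip().isdigit():
        return False  # pure numbers
    if any(w in INSTITUTION_WORDS or w in VENUE_WORDS for w in words):
        return False  # institutional or venue names
    if len(words) == 1 and len(words[0]) < 6:
        return False  # single words under 6 characters
    return True
```

With these rules, a topic such as "particle filter" passes, while "Xihang University", "IEEE Access", "GPS", and "2025" are rejected.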
Figure Extraction
The system uses two strategies for extracting figures, with the rendered approach as default for academic PDFs.
Rendered Figure Extraction (Primary)
- Render each PDF page at 2x zoom using PyMuPDF
- Scan for caption patterns: FIGURE, Fig., TABLE, Table followed by a number
- For each caption found, clip a region 400 points above the caption
- Export the clipped region as PNG
- Send to Claude Vision for description (2-3 sentences)
- Parse figure_number from caption text
This approach is superior for academic PDFs because figures are often composed from multiple sub-images that are separate elements in the PDF but form a single logical figure.
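The caption matching and clip geometry from the steps above can be sketched without any PDF library. The regular expression, both function names, and the choice to clip the full page width are assumptions; the source only specifies the caption keywords and the 400-point region above the caption. Coordinates follow the PyMuPDF convention of a top-left origin with y increasing downward.

```python
import re

# Caption keywords from the list above, followed by a number.
CAPTION_RE = re.compile(r"^(FIGURE|Fig\.|TABLE|Table)\s+(\d+)")
CLIP_HEIGHT_PT = 400  # points above the caption


def parse_figure_number(caption_text):
    """Return e.g. 'FIGURE 1' for a caption line, or None if it is no caption."""
    match = CAPTION_RE.match(caption_text.strip())
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


def clip_region_above(caption_bbox, page_width):
    """Return (x0, y0, x1, y1) of the region directly above a caption.

    caption_bbox: (x0, y0, x1, y1) of the caption text block.
    The region spans the full page width (an assumption) and is clamped
    to the top edge of the page.
    """
    _, caption_top, _, _ = caption_bbox
    top = max(0.0, caption_top - CLIP_HEIGHT_PT)
    return (0.0, top, page_width, caption_top)
```

In a real pipeline the returned rectangle would presumably be passed as the clip argument of PyMuPDF's page.get_pixmap together with a 2x zoom matrix, and the resulting pixmap exported as PNG.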
Embedded Image Extraction (Fallback)
- Use page.get_images() to find raw embedded images
- Extract via doc.extract_image(xref)
- Deduplicate by xref ID
- Filter out tiny images (<150px or <5KB) and oversized images (>10,000px)
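The deduplication and size filter can be expressed as two small helpers. This is a sketch under assumptions: the thresholds are taken from the list above, but applying the pixel limits to either dimension, treating 5KB as 5 * 1024 bytes, and the dict-based image records are guesses about details the source does not spell out.

```python
def keep_embedded_image(width, height, size_bytes,
                        min_px=150, min_bytes=5 * 1024, max_px=10_000):
    """Return True if an embedded image passes the size filter."""
    if min(width, height) < min_px or size_bytes < min_bytes:
        return False  # too small to be a real figure (icons, rules, logos)
    if max(width, height) > max_px:
        return False  # implausibly large
    return True


def dedupe_by_xref(images):
    """Keep the first image record for each xref, preserving order.

    images: iterable of dicts that each carry an 'xref' key (hypothetical shape).
    """
    seen = set()
    unique = []
    for image in images:
        if image["xref"] not in seen:
            seen.add(image["xref"])
            unique.append(image)
    return unique
```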
Figure Atoms
Each extracted figure becomes an atom with:
- content: Claude Vision’s description of what the figure shows
- caption: The parsed caption text (e.g., “FIGURE 1. The principle of the PF algorithm.”)
- figure_number: The parsed identifier (e.g., “FIGURE 1”)
- data_summary: Same as content (AI-generated description)
- raw_data: PNG bytes (stored to disk, not serialized in JSON)
The Unified Knowledge Graph
When a project has multiple sources, all DocumentGraphs are merged into a single KnowledgeGraph:
KnowledgeGraph
├── project_id: "kb_735f1cffaff7"
│
├── atoms: {all atoms from all sources}
├── atom_sources: {atom_id → source_id} # Track provenance
│
├── topic_index: # Fast lookup
│ ├── "particle filter" → [27 atom IDs]
│ ├── "Kalman filter" → [13 atom IDs]
│ ├── "sensor fusion" → [8 atom IDs]
│ └── ... (88 topics total)
│
├── entity_index:
│ ├── "PF" → [atom IDs]
│ ├── "IMU" → [atom IDs]
│ └── ... (42 entities total)
│
├── cross_links: [CrossSourceLink, ...] # Inter-source connections
│
├── key_themes: ["particle filter", "Kalman filter", "state estimation", ...]
└── unified_summary: "..."
Topic Index
The topic index maps every extracted topic to the atom IDs that mention it. This enables queries like “show me everything about particle filters” across all sources in the project.
For the UAV paper:
- "particle filter" appears in 27 atoms (most prevalent concept)
- "Kalman filter" appears in 13 atoms
- "sensor fusion" appears in 8 atoms
- 88 unique topics total across 153 atoms
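Conceptually, the topic index is an inverted index from topic strings to atom IDs. A minimal sketch, assuming atoms are available as a mapping from atom ID to topic list (the function name and input shape are hypothetical):

```python
from collections import defaultdict


def build_topic_index(atom_topics):
    """Build {topic: [atom_id, ...]} from {atom_id: [topic, ...]}."""
    index = defaultdict(list)
    for atom_id, topics in atom_topics.items():
        # dict.fromkeys drops repeated topics within one atom, keeping order.
        for topic in dict.fromkeys(topics):
            index[topic].append(atom_id)
    return dict(index)
```

The same structure works for the entity index by passing each atom's entities instead of its topics.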
Entity Index
Similarly, entities (algorithms, systems, acronyms, proper names) are indexed:
- Acronyms: PF, IKF-PF, IMU, IKF, KF, GPS, UWB, UKF
- Proper names: Xihang University, Kalman Filter, Particle Filter
- 42 unique entities total
Key Themes
Key themes are the most significant topics after aggressive noise filtering:
- Filter stopwords (87 generic academic terms: “machine”, “learning”, “analysis”, etc.)
- Apply is_theme_candidate() to reject institutional/venue names
- For multi-source projects: require topics to appear in 2+ sources
- Enforce multi-word phrases or 6+ characters for single words
- Take top 10 surviving topics
Result for the UAV paper:
✓ "particle filter"
✓ "Kalman filter"
✓ "particle weight calculation"
✓ "state estimation"
✓ "sensor fusion"
✓ "NLOS environments"
✓ "UAV positioning"
✓ "LOS environments"
✓ "runtime analysis"
✓ "resampling"
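The filtering steps can be sketched as a single pass over the topic index. Several details here are assumptions: the stopword set shows only 3 of the 87 terms, the institutional/venue check is omitted for brevity, and ranking the survivors by atom count is a guess, since the source says only that the top 10 are taken.

```python
STOPWORDS = {"machine", "learning", "analysis"}  # 3 of the 87 terms


def passes_length_rule(topic):
    """Multi-word phrases pass; single words need 6+ characters."""
    topic = topic.strip()
    return " " in topic or len(topic) >= 6


def extract_key_themes(topic_index, atom_sources, limit=10):
    """Return up to `limit` topics that survive the noise filters.

    topic_index: {topic: [atom_id, ...]}
    atom_sources: {atom_id: source_id}
    """
    multi_source = len(set(atom_sources.values())) > 1
    survivors = []
    for topic, atom_ids in topic_index.items():
        if topic.lower() in STOPWORDS or not passes_length_rule(topic):
            continue
        sources = {atom_sources[a] for a in atom_ids}
        if multi_source and len(sources) < 2:
            continue  # multi-source projects need 2+ sources per theme
        survivors.append((len(atom_ids), topic))
    # Assumed ranking: most atoms first, ties broken alphabetically.
    survivors.sort(key=lambda item: (-item[0], item[1]))
    return [topic for _, topic in survivors[:limit]]
```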
Cross-Source Linking
When a project contains multiple papers, the system automatically discovers connections between them.
How Links Are Created
During knowledge graph rebuild, for each entity that appears in atoms from 2+ different sources, the system creates a CrossSourceLink:
{
"link_id": "link_abc123",
"source_atom_id": "doc_aaa_atom_042",
"target_atom_id": "doc_bbb_atom_017",
"source_source_id": "src_aaa",
"target_source_id": "src_bbb",
"relationship": "same_topic",
"confidence": 0.6,
"created_by": "auto"
}
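A hedged sketch of how such links might be generated from the entity index follows. The pairing strategy (one link per pair of sources per entity, using the first atom seen in each source) and the link_id format are assumptions; the relationship, confidence, and created_by values are copied from the example above.

```python
import uuid
from itertools import combinations


def build_cross_links(entity_index, atom_sources):
    """Create same_topic links for entities that occur in 2+ sources.

    entity_index: {entity: [atom_id, ...]}
    atom_sources: {atom_id: source_id}
    """
    links = []
    for atom_ids in entity_index.values():
        # Keep the first atom seen for each source.
        first_by_source = {}
        for atom_id in atom_ids:
            first_by_source.setdefault(atom_sources[atom_id], atom_id)
        # Entities found in a single source produce no pairs, hence no links.
        for (src_a, atom_a), (src_b, atom_b) in combinations(
                first_by_source.items(), 2):
            links.append({
                "link_id": f"link_{uuid.uuid4().hex[:6]}",  # hypothetical format
                "source_atom_id": atom_a,
                "target_atom_id": atom_b,
                "source_source_id": src_a,
                "target_source_id": src_b,
                "relationship": "same_topic",
                "confidence": 0.6,
                "created_by": "auto",
            })
    return links
```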
Link Types
| Relationship | Description |
|---|---|
| same_topic | Both atoms discuss the same entity (auto-detected) |
| supports | One atom provides evidence for the other |
| contradicts | Atoms present conflicting findings |
| extends | One atom builds on the other’s work |
Currently, only same_topic links are auto-generated. Other types are available for future user annotation.
Shared Topics/Entities
The graph provides methods to find overlap between sources:
shared_topics = kg.get_shared_topics() # Topics in 2+ sources
shared_entities = kg.get_shared_entities() # Entities in 2+ sources
Quality Metrics
The kb inspect command computes four quality metrics (two scores and two distributions) entirely from the stored graph data; no LLM calls are needed.
Topic Quality Score (0-100)
Formula: (good_topics / total_topics) * 100
A topic is classified as noise if it matches any of:
- Structural terms (49 hardcoded): “figure”, “abstract”, “methodology”, “introduction”, “results”, etc.
- Institutional/venue names: detected via is_theme_candidate()
- Length-based: less than 3 characters, pure digits, or single words under 6 characters
A score of 100/100 means zero noise topics were detected — every extracted topic is a genuine semantic concept.
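The formula and noise rules can be sketched as follows. The structural-term set lists only 5 of the 49 terms, the institutional/venue check is left out, and returning 0.0 for an empty topic list is an assumption; function names are hypothetical.

```python
STRUCTURAL_TERMS = {"figure", "abstract", "methodology",
                    "introduction", "results"}  # 5 of the 49 terms


def is_noise_topic(topic):
    """Apply the structural and length-based noise rules."""
    t = topic.strip().lower()
    if t in STRUCTURAL_TERMS:
        return True
    if len(t) < 3 or t.isdigit():
        return True
    if " " not in t and len(t) < 6:
        return True  # single word under 6 characters
    return False


def topic_quality_score(topics):
    """(good_topics / total_topics) * 100, on a 0-100 scale."""
    if not topics:
        return 0.0  # assumed convention for an empty index
    good = sum(1 for t in topics if not is_noise_topic(t))
    return good / len(topics) * 100
```

For example, a topic list of "particle filter", "Kalman filter", "figure", and "2025" contains two noise topics out of four and scores 50.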
Entity Quality Score (0-100)
Formula: (acronyms + proper_names) / total_entities * 100
Entities are categorized by pattern:
- Acronyms: All caps, 2-6 chars (e.g., “PF”, “GPS”, “IMU”)
- Proper names: Capitalized multi-word (e.g., “Kalman Filter”, “Xihang University”)
- Other: Everything else (potentially noisy)
A score of 92/100 means 92% of entities are well-formed acronyms or proper names.
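A small sketch of the pattern-based categorization is given below. Both regular expressions are assumptions: in particular, allowing digits and hyphens inside acronyms (so that "IKF-PF" qualifies) is a guess at how the real rules treat such names.

```python
import re

# All caps, 2-6 characters; digits and hyphens allowed after the first letter.
ACRONYM_RE = re.compile(r"^[A-Z][A-Z0-9-]{1,5}$")
# Two or more capitalized words.
PROPER_NAME_RE = re.compile(r"^[A-Z][\w'-]*(?: [A-Z][\w'-]*)+$")


def classify_entity(entity):
    """Return 'acronym', 'proper_name', or 'other'."""
    if ACRONYM_RE.match(entity):
        return "acronym"
    if PROPER_NAME_RE.match(entity):
        return "proper_name"
    return "other"


def entity_quality_score(entities):
    """(acronyms + proper_names) / total_entities * 100."""
    if not entities:
        return 0.0  # assumed convention for an empty index
    well_formed = sum(1 for e in entities if classify_entity(e) != "other")
    return well_formed / len(entities) * 100
```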
Atom Type Distribution
Shows how the document was decomposed:
paragraph ███████░░░░░░░░░ 54 (35.3%)
equation █████░░░░░░░░░░░ 45 (29.4%)
section_header ██░░░░░░░░░░░░░░ 19 (12.4%)
author █░░░░░░░░░░░░░░░ 11 (7.2%)
figure █░░░░░░░░░░░░░░░ 11 (7.2%)
citation █░░░░░░░░░░░░░░░ 8 (5.2%)
This distribution tells you the nature of the source material. The UAV paper is equation-heavy (29.4%), which is typical for a signal processing paper.
Concept Distribution
The top 12 topics ranked by atom count, with bars normalized to the highest count (first four shown here):
particle filter ████████████████████ 27
Kalman filter █████████░░░░░░░░░░░ 13
particle weight calculation ██████░░░░░░░░░░░░░░ 9
state estimation █████░░░░░░░░░░░░░░░ 8
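The bars above are consistent with a 20-cell bar whose filled length is the count scaled to the maximum and rounded down. The helper below reproduces the four sample rows under that assumption; the function name and the rounding rule are inferred from the sample output, not taken from cli/kb.py.

```python
def render_bar(count, max_count, width=20):
    """Render a fixed-width bar with count scaled against max_count."""
    # Integer floor division avoids floating-point rounding surprises.
    filled = count * width // max_count if max_count else 0
    return "█" * filled + "░" * (width - filled)


if __name__ == "__main__":
    rows = [("particle filter", 27), ("Kalman filter", 13),
            ("particle weight calculation", 9), ("state estimation", 8)]
    top = max(count for _, count in rows)
    for topic, count in rows:
        print(f"{topic:<28}{render_bar(count, top)} {count}")
```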
From KB to Production
The knowledge graph feeds directly into the production pipeline:
Script Generation (kb script)
- Reads the unified KG’s atoms, summaries, and key themes
- Figures become explicit references in the script (e.g., “As shown in Figure 3…”)
- Topic distribution guides content emphasis
- Cross-source links inform comparison segments
Video Production (kb produce / produce-video)
- Scene importance scoring uses atom importance_score to allocate visual budget
- Figures from the KB are available as source material for image generation
- The ContentLibrarian tracks which KB figures are used where
- The DoP (Director of Photography) assigns visual treatments based on atom types
Example Flow
KB (153 atoms, 11 figures, 88 topics)
→ Script (10 segments, references 6 figures)
→ Visual Plan (medium tier: 27% images)
→ 3 DALL-E images + Ken Burns animation
→ Final video with per-scene audio
File Reference
| Component | File |
|---|---|
| Atom & Document models | core/models/document.py |
| Knowledge Graph models | core/models/knowledge.py |
| Document Ingestor agent | agents/document_ingestor.py |
| Content Classifier | core/content_classifier.py |
| KB CLI (inspect, add, rebuild) | cli/kb.py |
| JSON Extractor | core/claude_client.py |