Ingestion Pipeline
The ingestion pipeline transforms raw data (PDFs, videos, Q&A transcripts) into a structured, searchable knowledge base. It is a multi-stage process that builds a taxonomy, extracts entities, constructs a knowledge graph, and deploys the results to your search index.
Pipeline Overview
The ingestion pipeline transforms raw content into a searchable knowledge base through four main stages. Each stage builds upon the previous, creating increasingly rich and structured data.
```
Raw Documents (PDFs, Videos, Q&A, Web)
        │
        ▼
Stage 1: PREPROCESS
    Input:   Raw files, transcripts, scraped content
    Process: • Text extraction and cleaning
             • Speaker detection (Q&A)
             • Content chunking with overlap
             • Normalization and deduplication
    Output:  clean_text/*.jsonl (chunked, cleaned content)
        │
        ▼
Stage 2: STRUCTURE
    Input:   clean_text/*.jsonl
    Process: • Taxonomy building (hierarchical categories)
             • Ontology creation (entity-relationship schemas)
             • Content sampling and LLM analysis
             • Semantic relationship mapping
    Output:  taxonomy.json, ontology.json
        │
        ▼
Stage 3: INDEX
    Input:   clean_text/*.jsonl + taxonomy.json + ontology.json
    Process: • Named Entity Recognition (NER)
             • Key phrase extraction
             • Knowledge graph construction
             • Shadow record generation with enriched metadata
             • Relationship linking and graph edges
    Output:  knowledge_graph.pickle, shadow/*.jsonl
        │
        ▼
Stage 4: DEPLOY
    Input:   shadow/*.jsonl + knowledge_graph.pickle
    Process: • Format conversion (JSONL/SQL)
             • Embedding generation
             • Vector index population
             • Knowledge graph upload
             • Ingestion report generation
    Output:  Data in Postgres/Vertex + report.html
```

Quick Start
Using the CLI
The fastest way to run ingestion:
```bash
# Full pipeline for a book
contextrouter ingest run --type book --input ./my-book.pdf

# Full pipeline for Q&A transcripts
contextrouter ingest run --type qa --input ./transcripts/

# Run specific stages
contextrouter ingest preprocess --type book
contextrouter ingest structure --type book
contextrouter ingest index --type book
contextrouter ingest deploy --type book
```

Using Python
```python
from contextrouter.cortex.graphs.rag_ingestion import compile_graph

graph = compile_graph()

result = await graph.ainvoke({
    "ingestion_config_path": "./settings.toml",
    "only_types": ["book"],
    "overwrite": True,
})
```

Content Types & Plugins
ContextRouter uses specialized plugins for different content types. Each plugin includes custom transformers that understand the specific structure and requirements of that content type.
Book Plugin (@register_ingestion_plugin("book"))
For long-form documents (PDFs, ebooks, technical manuals):
Transformers Used:
- BookAnalyzer: Chapter/section detection, table of contents extraction
- BookExtractor: Page-level citations, figure/table handling, footnote processing
- BookNormalizer: Cross-reference resolution, glossary extraction
Special Features:
- Maintains page-level citation accuracy
- Preserves document structure hierarchy
- Handles multi-column layouts and complex formatting
- Extracts mathematical equations and code blocks
Video Plugin (@register_ingestion_plugin("video"))
For video transcripts and multimedia content:
Transformers Used:
- VideoAnalyzer: Timestamp alignment, scene boundary detection
- VideoSpeakerDetector: Speaker identification and attribution
- VideoNormalizer: Timestamp formatting, visual description integration
Special Features:
- Synchronizes text with video timestamps
- Detects speaker changes and interruptions
- Integrates visual scene descriptions
- Handles multiple language tracks
QA Plugin (@register_ingestion_plugin("qa"))
For question-answer transcripts, interviews, and conversational content:
Transformers Used:
- QAAnalyzer: Question-answer pairing, follow-up linking
- QASpeakerDetector: Speaker attribution using heuristics and LLM analysis
- QATaxonomyMapper: Topic clustering and conversation flow analysis
- QATransformer: Answer validation, correction mapping, host detection
Special Features:
- Distinguishes questions from answers automatically
- Links related Q&A pairs in conversation threads
- Identifies session hosts and panelists
- Applies custom corrections for known transcription errors
Web Plugin (@register_ingestion_plugin("web"))
For scraped web content, articles, and online documents:
Transformers Used:
- WebAnalyzer: HTML cleaning, content extraction, readability scoring
- WebNormalizer: URL normalization, date detection, author extraction
- WebLinkExtractor: Related link discovery and categorization
Special Features:
- Removes boilerplate content (headers, footers, ads)
- Preserves article publication dates and authors
- Extracts structured metadata (OpenGraph, schema.org)
- Handles paywall and subscription content
Knowledge Plugin (@register_ingestion_plugin("knowledge"))
For structured knowledge bases and databases:
Transformers Used:
- KnowledgeAnalyzer: Schema detection, relationship mapping
- KnowledgeNormalizer: Data type normalization, validation
- KnowledgeMapper: Ontology alignment, concept linking
Special Features:
- Handles structured data (JSON, CSV, databases)
- Maintains referential integrity
- Supports custom ontologies and taxonomies
Transformers in Ingestion
Transformers are modular components that enrich and structure data during ingestion. Each transformer focuses on a specific type of data processing and can be configured independently.
Core Transformers
NER Transformer (@register_transformer("ner"))
Purpose: Named Entity Recognition and extraction
What it does:
- Identifies persons, organizations, locations, dates, etc.
- Extracts technical terms and domain-specific entities
- Links entities across documents
- Generates confidence scores for extractions
Configuration:
```toml
[ingestion.rag.transformers.ner]
enabled = true
model = "vertex/gemini-2.0-flash"  # or local models
entity_types = ["PERSON", "ORG", "GPE", "DATE", "MONEY", "PERCENT"]
confidence_threshold = 0.7
max_entities_per_chunk = 20
```

Taxonomy Transformer (@register_transformer("taxonomy"))
Purpose: Automatic categorization and tagging
What it does:
- Classifies content into hierarchical categories
- Generates topic tags and keywords
- Creates content clusters for similar documents
- Builds taxonomy trees for navigation
Configuration:
```toml
[ingestion.rag.transformers.taxonomy]
enabled = true
model = "vertex/gemini-2.0-flash"
max_categories = 5
category_depth = 3
confidence_threshold = 0.6
custom_categories = ["Machine Learning", "AI Ethics", "Data Science"]
```

Keyphrases Transformer (@register_transformer("keyphrases"))
Purpose: Extract important phrases and concepts
What it does:
- Identifies key phrases that capture document essence
- Extracts technical terms and jargon
- Generates search-friendly keywords
- Supports multi-language phrase extraction
Configuration:
```toml
[ingestion.rag.transformers.keyphrases]
enabled = true
algorithm = "mixed"  # "llm", "tfidf", "mixed"
max_phrases = 10
min_phrase_length = 2
max_phrase_length = 5
language = "en"
```

Shadow Record Transformer (@register_transformer("shadow"))
Purpose: Generate optimized search metadata
What it does:
- Creates enriched metadata for search optimization
- Combines multiple analysis results
- Generates search-friendly text representations
- Pre-computes frequently accessed fields
Configuration:
```toml
[ingestion.rag.transformers.shadow]
enabled = true
include_keywords = true
include_entities = true
include_summary = true
include_taxonomy = true
summary_length = 200  # characters
```

Graph Builder Transformer (@register_transformer("graph"))
Purpose: Construct knowledge graph relationships
What it does:
- Analyzes entity co-occurrence patterns
- Builds semantic relationships between concepts
- Creates graph edges with confidence scores
- Supports both LLM and rule-based approaches
Configuration:
```toml
[ingestion.rag.transformers.graph]
enabled = true
builder_mode = "hybrid"  # "llm", "local", "hybrid"
max_entities_per_chunk = 10
relationship_types = ["related_to", "part_of", "causes", "affects"]
cognee_enabled = true
min_confidence = 0.3
```

Configuration
```toml
[ingestion.rag]
enabled = true
output_dir = "./ingestion_output"

[ingestion.rag.preprocess]
chunk_size = 1000
chunk_overlap = 200
min_chunk_size = 100

[ingestion.rag.graph]
builder_mode = "hybrid"  # "llm", "local", or "hybrid"
cognee_enabled = true
max_entities_per_chunk = 10

[ingestion.rag.shadow]
include_keywords = true
include_entities = true
include_summary = true

[ingestion.rag.skip]
# Skip stages that are already complete
preprocess = false
structure = false
index = false
deploy = false
```

Detailed Pipeline Stages
Stage 1: Preprocess - Text Extraction & Normalization
The preprocessing stage converts raw input files into clean, structured text chunks that can be processed by subsequent stages.
Input Types Supported:
- PDF files: Text extraction with layout preservation
- Video transcripts: Timestamp synchronization and speaker attribution
- Q&A transcripts: Speaker detection and conversation flow analysis
- Web content: HTML cleaning and content extraction
- Plain text: Encoding detection and normalization
Key Processes:
- Text Extraction
  - PDF: Uses advanced OCR for scanned documents, preserves formatting
  - Video: Aligns transcript text with timestamps, handles multiple speakers
  - Web: Removes HTML tags, extracts main content, preserves metadata
- Content Cleaning
  - Removes noise: headers, footers, page numbers, watermarks
  - Normalizes whitespace and formatting
  - Handles encoding issues and special characters
  - Filters out irrelevant content (advertisements, navigation)
- Speaker Detection (Q&A content)
  - Uses heuristics: punctuation patterns, capitalization
  - Applies LLM analysis for complex cases
  - Identifies conversation participants and roles
- Intelligent Chunking (see the sketch after this list)
  - Sliding window: Overlapping chunks preserve context
  - Semantic boundaries: Respects sentence/paragraph boundaries
  - Content-aware: Avoids splitting related information
  - Size optimization: Balances retrieval precision vs. context
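To make the chunking step concrete, here is a minimal sliding-window chunker in plain Python. It is an illustrative sketch of the behavior described above, not ContextRouter's implementation: sentences are packed into chunks up to `chunk_size` characters, and the tail of each chunk is carried into the next one as overlap. The regex sentence splitter and the sample text are assumptions made for the example; the parameter names mirror the `[ingestion.rag.preprocess]` settings.

```python
import re

def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200,
               min_chunk_size: int = 100) -> list[str]:
    """Sliding-window chunking that respects sentence boundaries (illustrative sketch)."""
    # Naive split on ., !, ? followed by whitespace; real pipelines use smarter splitters.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > chunk_size:
            chunks.append(current)
            # Carry the tail of the previous chunk forward so context spans chunk boundaries.
            current = current[-chunk_overlap:] + " " + sentence
        else:
            current = f"{current} {sentence}".strip()
    if len(current) >= min_chunk_size:
        chunks.append(current)
    return chunks

# Example: pack a repeated sample text into ~300-character chunks with 60 characters of overlap.
sample = "Machine learning is a subset of AI. It enables computers to learn from data. " * 30
for i, chunk in enumerate(chunk_text(sample, chunk_size=300, chunk_overlap=60)):
    print(i, len(chunk))
```

Production splitters handle abbreviations, lists, and code blocks better than this regex, which is why `chunk_size` and `chunk_overlap` are the knobs you normally tune rather than the splitting logic itself.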
Configuration Options:
```toml
[ingestion.rag.preprocess]
chunk_size = 1000            # Target chunk size in characters
chunk_overlap = 200          # Overlap between chunks
min_chunk_size = 100         # Minimum chunk size
max_chunk_size = 2000        # Maximum chunk size
encoding = "utf-8"           # Text encoding
preserve_formatting = true   # Keep bold/italic in markdown
```

CLI Usage:
```bash
# Basic preprocessing
contextrouter ingest preprocess --type book --input ./document.pdf

# Advanced options
contextrouter ingest preprocess \
  --type video \
  --input ./transcripts/ \
  --chunk-size 800 \
  --chunk-overlap 150 \
  --encoding utf-8
```

Output Format: clean_text/{type}.jsonl
{"id": "chunk_001", "content": "Machine learning is a subset of AI...", "metadata": {"page": 1, "speaker": null}}{"id": "chunk_002", "content": "...that enables computers to learn...", "metadata": {"page": 1, "speaker": null}}Stage 2: Structure - Taxonomy & Ontology Building
The structure stage analyzes content to build semantic frameworks that organize knowledge hierarchically and define relationships between concepts.
Taxonomy Building Process:
- Content Sampling (see the sampling sketch after this list)
  - Selects representative chunks across the entire document
  - Uses stratified sampling to ensure coverage of different sections
  - Considers document structure (chapters, sections) when available
- Category Discovery
  - LLM analyzes content to identify main themes and topics
  - Builds hierarchical category trees (e.g., AI → Machine Learning → Deep Learning)
  - Applies confidence scoring and validation
- Semantic Clustering
  - Groups similar concepts and topics together
  - Identifies relationships between categories
  - Creates navigation-friendly hierarchies
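As a rough illustration of the stratified sampling mentioned in the first item, the sketch below groups Stage 1 chunks by the page key in their `metadata` and draws an even number of samples from each group, capped at `max_samples`. It is a simplified stand-in for ContextRouter's sampler; the file path and the grouping key are assumptions based on the Stage 1 output format shown above.

```python
import json
import random
from collections import defaultdict

def stratified_sample(jsonl_path: str, max_samples: int = 100, seed: int = 42) -> list[dict]:
    """Pick a roughly even number of chunks per section so the whole document is represented."""
    random.seed(seed)
    by_section = defaultdict(list)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            chunk = json.loads(line)
            # Group by page here; chapter or section ids work the same way when available.
            by_section[chunk["metadata"].get("page")].append(chunk)
    per_section = max(1, max_samples // max(1, len(by_section)))
    sampled = []
    for chunks in by_section.values():
        sampled.extend(random.sample(chunks, min(per_section, len(chunks))))
    return sampled[:max_samples]

samples = stratified_sample("ingestion_output/clean_text/book.jsonl", max_samples=100)
```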
Ontology Creation Process:
- Entity Type Definition
  - Identifies common entity types in the domain
  - Defines relationships between entity types
  - Creates schemas for structured data extraction
- Relationship Mapping
  - Defines semantic relationships (is-a, part-of, related-to)
  - Establishes domain-specific connection types
  - Validates relationship consistency
- Schema Generation
  - Creates formal ontologies in JSON format
  - Supports multiple ontology standards
  - Enables cross-document relationship linking
Configuration Options:
```toml
[ingestion.rag.structure]
enabled = true

[ingestion.rag.structure.taxonomy]
philosophy_focus = "Extract core concepts, terminology, and relationships"
include_types = ["video", "book", "qa", "knowledge"]
max_samples = 100
scan_model = "vertex/gemini-2.0-flash"
hard_cap_samples = 500
categories = {}  # Custom category overrides

[ingestion.rag.structure.ontology]
enabled = true
relationship_types = ["is_a", "part_of", "related_to", "causes"]
entity_types = ["PERSON", "ORG", "CONCEPT", "EVENT"]
validation_enabled = true
```

CLI Usage:
```bash
# Build taxonomy and ontology
contextrouter ingest structure --type book

# Custom model for analysis
contextrouter ingest structure --type qa --model vertex/gemini-2.0-flash
```

Output Files:
taxonomy.json - Hierarchical category structure:
{ "categories": [ { "name": "Artificial Intelligence", "children": [ { "name": "Machine Learning", "children": [ {"name": "Supervised Learning"}, {"name": "Unsupervised Learning"}, {"name": "Deep Learning"} ] }, {"name": "Natural Language Processing"}, {"name": "Computer Vision"} ] } ], "metadata": { "total_documents": 150, "confidence_score": 0.89, "created_at": "2024-01-15T10:30:00Z" }}ontology.json - Entity relationship schema:
{ "entities": [ { "type": "PERSON", "properties": ["name", "role", "affiliation"], "relationships": ["works_for", "collaborates_with"] }, { "type": "CONCEPT", "properties": ["definition", "examples"], "relationships": ["related_to", "part_of", "prerequisite_for"] } ], "relationships": [ { "name": "works_for", "domain": "PERSON", "range": "ORG", "description": "Employment relationship" } ]}Stage 3: Index - Entity Extraction & Graph Construction
The indexing stage performs deep analysis of content to extract entities, relationships, and build the knowledge graph that powers intelligent search and reasoning.
Entity Recognition Process:
- Named Entity Recognition (NER)
  - Identifies standard entities: Person, Organization, Location, Date, etc.
  - Extracts domain-specific entities based on taxonomy
  - Applies confidence scoring and disambiguation
  - Links entities across document chunks
- Key Phrase Extraction
  - Identifies important multi-word phrases and concepts
  - Uses a combination of statistical and LLM-based methods
  - Generates search-friendly keywords and tags
- Content Enrichment
  - Adds semantic metadata to each chunk
  - Links to taxonomy categories
  - Attaches confidence scores and provenance
Knowledge Graph Construction:
- Entity Relationship Discovery (see the co-occurrence sketch after this list)
  - Analyzes entity co-occurrence patterns
  - Identifies semantic relationships using LLM analysis
  - Creates graph edges with relationship types and confidence
- Graph Integration
  - Merges local document graphs into a global knowledge graph
  - Handles entity disambiguation across documents
  - Maintains graph consistency and removes duplicates
- Graph Enhancement
  - Applies Cognee integration for advanced graph features
  - Adds inferred relationships and transitive connections
  - Optimizes graph structure for query performance
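The sketch below illustrates only the rule-based ("local") side of relationship discovery, using networkx: entities that appear in the same chunk are linked, and repeated co-occurrence increases an edge weight that can serve as a crude confidence signal. The record shape follows the shadow-record format shown later in this stage; the LLM- and Cognee-based enrichment steps are not shown.

```python
from itertools import combinations
import networkx as nx

def build_cooccurrence_graph(records: list[dict], max_entities_per_chunk: int = 10) -> nx.Graph:
    """Link entities that co-occur in a chunk; edge weight counts how often they appear together."""
    graph = nx.Graph()
    for record in records:
        entities = record["metadata"]["entities"][:max_entities_per_chunk]
        for ent in entities:
            graph.add_node(ent["text"], type=ent["type"])
        for a, b in combinations(entities, 2):
            if graph.has_edge(a["text"], b["text"]):
                graph[a["text"]][b["text"]]["weight"] += 1
            else:
                graph.add_edge(a["text"], b["text"], type="related_to", weight=1)
    return graph

# Minimal usage with one record shaped like a shadow record.
records = [{"metadata": {"entities": [
    {"text": "Alan Turing", "type": "PERSON", "confidence": 0.98},
    {"text": "Turing machine", "type": "CONCEPT", "confidence": 0.95},
]}}]
g = build_cooccurrence_graph(records)
print(g.number_of_nodes(), g.number_of_edges())
```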
Shadow Record Generation:
- Metadata Aggregation
  - Combines all extracted information into a searchable format
  - Creates multiple representations for different search types
  - Pre-computes frequently accessed fields
- Search Optimization
  - Generates keyword indexes for full-text search
  - Creates vector-ready text representations
  - Prepares citation metadata for result formatting
Configuration Options:
```toml
[ingestion.rag.index]
enabled = true
incremental = false  # Build on existing graph or start fresh

[ingestion.rag.index.ner]
enabled = true
model = "vertex/gemini-2.0-flash"
entity_types = ["PERSON", "ORG", "GPE", "DATE", "MONEY"]
confidence_threshold = 0.7

[ingestion.rag.index.graph]
enabled = true
builder_mode = "hybrid"  # llm, local, hybrid
cognee_enabled = true
max_entities_per_chunk = 10
relationship_types = ["related_to", "part_of", "causes"]

[ingestion.rag.index.shadow]
enabled = true
include_keywords = true
include_entities = true
include_taxonomy = true
include_summary = true
summary_model = "vertex/gemini-2.0-flash"
```

CLI Usage:
```bash
# Full indexing with all features
contextrouter ingest index --type book

# Incremental indexing (preserve existing graph)
contextrouter ingest index --type qa --incremental

# Custom NER model
contextrouter ingest index --type knowledge --ner-model vertex/gemini-2.0-flash
```

Output Files:
knowledge_graph.pickle - Serialized knowledge graph:
```python
# Graph contains nodes (entities) and edges (relationships)
graph = {
    "nodes": [
        {"id": "entity_001", "type": "PERSON", "name": "Alan Turing", "properties": {...}},
        {"id": "entity_002", "type": "CONCEPT", "name": "Turing Machine", "properties": {...}}
    ],
    "edges": [
        {"source": "entity_001", "target": "entity_002", "type": "invented", "confidence": 0.95}
    ]
}
```

shadow/{type}.jsonl - Enriched search records:
{ "id": "shadow_001", "content": "Alan Turing invented the Turing machine...", "metadata": { "source_type": "book", "page": 45, "entities": [ {"text": "Alan Turing", "type": "PERSON", "confidence": 0.98}, {"text": "Turing machine", "type": "CONCEPT", "confidence": 0.95} ], "keyphrases": ["Turing machine", "computational model", "theoretical computer"], "taxonomy": ["Computer Science", "Theory of Computation"], "summary": "Discussion of Alan Turing's invention of the Turing machine...", "relationships": [ {"entity1": "Alan Turing", "entity2": "Turing machine", "type": "invented"} ] }}Stage 4: Deploy - Index Population & Optimization
The deployment stage transfers processed data to your search infrastructure, making it available for real-time queries and RAG applications.
Format Conversion Process:
- Target Format Selection
  - Postgres: Converts to SQL INSERT statements with pgvector embeddings
  - Vertex AI Search: Transforms to JSONL format for Vertex import
  - Hybrid: Prepares data for multi-provider deployments
- Data Serialization
  - Serializes shadow records into provider-specific formats
  - Handles large datasets with batching and streaming
  - Preserves all metadata and provenance information
- Embedding Generation (see the batching sketch after this list)
  - Generates vector embeddings for semantic search
  - Supports multiple embedding models (Vertex, OpenAI, local)
  - Processes in batches for efficiency with large datasets
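Embedding generation is mostly a batching problem. The sketch below shows the batching pattern in isolation: `embed_batch` is a hypothetical stand-in for whatever embedding client (Vertex, OpenAI, or a local model) your deployment is configured with, and the 768-dimension dummy embedder exists only to make the example runnable.

```python
from typing import Callable

def embed_in_batches(texts: list[str],
                     embed_batch: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 100) -> list[list[float]]:
    """Send texts to the embedding model in fixed-size batches to keep request sizes bounded."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_batch(batch))  # one provider call per batch
    return vectors

# Usage with a dummy embedder; replace with your real client's batch call.
fake_embed = lambda batch: [[0.0] * 768 for _ in batch]
vectors = embed_in_batches(["chunk one", "chunk two"], fake_embed, batch_size=100)
```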
Index Population:
- Database Upload (see the bulk-insert sketch after this list)
  - Postgres: Uses bulk INSERT with pgvector for vector storage
  - Vertex AI Search: Uploads via the Vertex AI Search API
  - Incremental Updates: Supports partial updates without full rebuilds
- Knowledge Graph Integration
  - Uploads graph data to Cognee or the Postgres KG
  - Establishes cross-document relationships
  - Enables graph-powered search features
- Index Optimization
  - Creates appropriate database indexes
  - Optimizes for hybrid search (vector + keyword)
  - Sets up partitioning for large datasets
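For the Postgres path, the bulk upload can be pictured as below. This is a sketch, not ContextRouter's uploader: it assumes a `documents` table shaped like the example record shown under Output Files, with a unique `id` and a pgvector `embedding` column, uses psycopg2's `execute_values` for batched inserts, and passes embeddings as pgvector string literals. The connection string is a placeholder.

```python
import json
import psycopg2
from psycopg2.extras import execute_values

def upload_batch(conn, records: list[dict], vectors: list[list[float]]) -> None:
    """Bulk-insert shadow records and their embeddings into the documents table (sketch)."""
    rows = [
        (
            rec["id"],
            rec["content"],
            "[" + ",".join(f"{x:.6f}" for x in vec) + "]",  # pgvector accepts a bracketed string literal
            json.dumps(rec["metadata"]),
        )
        for rec, vec in zip(records, vectors)
    ]
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO documents (id, content, embedding, metadata) "
            "VALUES %s ON CONFLICT (id) DO NOTHING",  # assumes a unique constraint on id
            rows,
            template="(%s, %s, %s::vector, %s::jsonb)",
        )
    conn.commit()

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/contextrouter")  # placeholder DSN
```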
Reporting & Validation:
- Ingestion Report Generation
  - Statistics: documents processed, entities extracted, relationships created
  - Quality metrics: confidence scores, error rates
  - Performance data: processing times, resource usage
- Data Validation (see the spot-check sketch after this list)
  - Verifies data integrity after upload
  - Checks search functionality with sample queries
  - Validates embedding quality and retrieval accuracy
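The validation step boils down to a spot check like the one sketched below: run a handful of queries with known expected documents against the deployed index and compare the hit rate to `accuracy_threshold`. The `search` callable is a placeholder for whatever retrieval client sits in front of your index; the query/document pair is illustrative.

```python
from typing import Callable

def validate_deployment(search: Callable[[str, int], list[str]],
                        sample_queries: dict[str, str],
                        accuracy_threshold: float = 0.8,
                        top_k: int = 5) -> bool:
    """Check that each sample query retrieves its expected document id within the top-k results."""
    hits = sum(
        1 for query, expected_id in sample_queries.items()
        if expected_id in search(query, top_k)
    )
    accuracy = hits / max(1, len(sample_queries))
    print(f"validation accuracy: {accuracy:.2%} (threshold {accuracy_threshold:.0%})")
    return accuracy >= accuracy_threshold

# Usage: search(query, k) should return the top-k document ids from your deployed index.
ok = validate_deployment(lambda q, k: ["shadow_001"],
                         {"Who invented the Turing machine?": "shadow_001"})
```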
Configuration Options:
```toml
[ingestion.rag.deploy]
enabled = true
provider = "postgres"  # postgres, vertex, hybrid
batch_size = 1000      # Records per batch
max_workers = 4        # Parallel upload workers

[ingestion.rag.deploy.embedding]
enabled = true
model = "vertex/text-embedding-004"
batch_size = 100
dimensions = 768

[ingestion.rag.deploy.validation]
enabled = true
sample_queries = 10
accuracy_threshold = 0.8

[ingestion.rag.deploy.report]
enabled = true
format = "html"  # html, json, markdown
include_charts = true
include_samples = true
```

CLI Usage:
```bash
# Deploy to default provider
contextrouter ingest deploy --type book

# Deploy to specific provider
contextrouter ingest deploy --type qa --provider vertex

# Custom embedding model
contextrouter ingest deploy --type knowledge \
  --embedding-model vertex/text-embedding-004 \
  --batch-size 500

# Skip validation for faster deployment
contextrouter ingest deploy --type web --no-validation
```

Output Files:
Search Index - Data becomes searchable:
```sql
-- Example Postgres record
INSERT INTO documents (
    id, content, embedding, metadata, entities, taxonomy
) VALUES (
    'shadow_001',
    'Alan Turing invented the Turing machine...',
    '[0.123, 0.456, ...]'::vector(768),
    '{"source_type": "book", "page": 45}'::jsonb,
    '{"Alan Turing", "Turing machine"}'::text[],
    '{"Computer Science", "Theory of Computation"}'::text[]
);
```

report.html - Comprehensive ingestion report:
```html
<h1>Ingestion Report: book</h1>

<h2>Summary</h2>
<ul>
  <li>Documents processed: 25</li>
  <li>Chunks created: 1,247</li>
  <li>Entities extracted: 3,891</li>
  <li>Relationships created: 2,156</li>
  <li>Processing time: 45m 32s</li>
</ul>

<h2>Quality Metrics</h2>
<ul>
  <li>Average NER confidence: 87.3%</li>
  <li>Taxonomy coverage: 94.1%</li>
  <li>Graph connectivity: 78.5%</li>
</ul>
```

Skipping Stages
Re-run only what you need:
```bash
# Skip preprocessing (already done)
contextrouter ingest run --type book --skip-preprocess

# Skip structure (taxonomy exists)
contextrouter ingest run --type book --skip-structure

# Only deploy (everything else done)
contextrouter ingest deploy --type book
```

Output Structure
After running the full pipeline:
```
ingestion_output/
├── clean_text/
│   ├── book.jsonl
│   └── qa.jsonl
├── taxonomy.json
├── ontology.json
├── knowledge_graph.pickle
├── shadow/
│   ├── book.jsonl
│   └── qa.jsonl
├── output/
│   └── jsonl/
│       └── book/
│           └── book_001.jsonl
└── report.html
```

Troubleshooting Common Issues
Preprocessing Problems
Issue: PDF text extraction is garbled or missing content
Solution: Check the PDF type; scanned documents need OCR.

```bash
contextrouter ingest preprocess --type book --ocr-enabled --input document.pdf
```

Issue: Video transcripts have incorrect timestamps
Solution: Use timestamp correction in the video plugin.

```toml
[ingestion.rag.plugins.video]
timestamp_correction = true
speaker_sync_enabled = true
```

Issue: Q&A speaker detection is inaccurate
Solution: Enable LLM-based speaker detection.

```toml
[ingestion.rag.plugins.qa]
llm_speaker_detect_enabled = true
llm_host_detect_enabled = true
```

Structure Stage Issues
Issue: Taxonomy categories are too generic or specific
Solution: Adjust the taxonomy parameters.

```toml
[ingestion.rag.structure.taxonomy]
max_categories = 8  # Increase for more specific categories
category_depth = 2  # Reduce for broader categories
```

Issue: Ontology relationships are incorrect
Solution: Customize relationship types for your domain.

```toml
[ingestion.rag.structure.ontology]
relationship_types = ["is_a", "part_of", "used_in", "related_to", "causes"]
```

Index Stage Problems
Issue: NER is missing domain-specific entities
Solution: Add custom entity types.

```toml
[ingestion.rag.index.ner]
entity_types = ["PERSON", "ORG", "PRODUCT", "TECHNOLOGY", "METHOD"]
```

Issue: Knowledge graph has too many/too few connections
Solution: Adjust the graph-building parameters.

```toml
[ingestion.rag.index.graph]
max_entities_per_chunk = 8  # Reduce for fewer connections
min_confidence = 0.5        # Increase for stricter relationships
```

Deploy Stage Issues
Issue: Upload fails due to large dataset
Solution: Reduce the batch size and the number of parallel workers.

```toml
[ingestion.rag.deploy]
batch_size = 500
max_workers = 2
```

Issue: Embedding generation is slow
Solution: Use a faster local embedding model or reduce the embedding dimensions.

```toml
[ingestion.rag.deploy.embedding]
model = "local/all-MiniLM-L6-v2"  # Faster local model
dimensions = 384                  # Smaller embeddings
```

Best Practices
Performance Optimization
- Chunk Size Tuning
  - Smaller chunks (500-800 chars): Better precision, slower search
  - Larger chunks (1000-1500 chars): Better context, faster search
  - Test with your typical query lengths
- Parallel Processing
  ```toml
  [ingestion.rag]
  workers = 4  # Match your CPU cores
  ```
- Incremental Updates
  ```bash
  # Only process new/changed content
  contextrouter ingest run --type book --incremental
  ```
Quality Assurance
- Validation Checks
  ```toml
  [ingestion.rag.deploy.validation]
  enabled = true
  sample_queries = 20
  accuracy_threshold = 0.85
  ```
- Regular Audits
  - Review ingestion reports weekly
  - Monitor entity extraction accuracy
  - Validate taxonomy relevance
Data Management
- Backup Strategy
  - Keep raw source files for reprocessing
  - Back up taxonomy.json and ontology.json
  - Version control your configuration
- Content Updates
  ```bash
  # Update existing content
  contextrouter ingest run --type book --overwrite
  ```
Advanced Usage
Custom Transformers
Create domain-specific transformers for specialized content:
```python
from contextrouter.core.registry import register_transformer
from contextrouter.modules.ingestion.rag.core.types import ShadowRecord

@register_transformer("medical_ner")
class MedicalNERTransformer:
    """Extract medical entities and terminology."""

    def transform(self, record: ShadowRecord) -> ShadowRecord:
        # Custom medical entity extraction
        medical_entities = self.extract_medical_terms(record.content)
        record.metadata["medical_entities"] = medical_entities
        record.add_trace("transformer:medical_ner")
        return record
```

Multi-Stage Pipelines
For complex workflows, run stages separately:
```bash
# Stage 1: Preprocess all content types
contextrouter ingest preprocess --type book
contextrouter ingest preprocess --type qa

# Stage 2: Build unified taxonomy
contextrouter ingest structure --type book
contextrouter ingest structure --type qa

# Stage 3: Create integrated knowledge graph
contextrouter ingest index --type book
contextrouter ingest index --type qa --incremental

# Stage 4: Deploy to production
contextrouter ingest deploy --type book
contextrouter ingest deploy --type qa
```

Learn More
- Taxonomy & Ontology — How category and entity structures are built
- CLI Reference — All ingestion commands
- Configuration — Full ingestion settings