# Taxonomy & Ontology
Taxonomy and ontology provide structure to your knowledge base, enabling filtered search, relationship-aware retrieval, and more accurate categorization.
## Understanding the Concepts
### What is Taxonomy?
A taxonomy is a hierarchical category tree that organizes your content:
```
Knowledge Base
├── Technology
│   ├── Artificial Intelligence
│   │   ├── Machine Learning
│   │   │   ├── Supervised Learning
│   │   │   ├── Unsupervised Learning
│   │   │   └── Reinforcement Learning
│   │   └── Natural Language Processing
│   │       ├── Text Classification
│   │       └── Question Answering
│   └── Software Engineering
│       ├── Backend
│       └── Frontend
├── Business
│   ├── Strategy
│   └── Operations
└── Science
    └── Physics
```
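Throughout this page, categories in such a tree are referred to by dotted paths (for example `Technology.AI.Machine Learning`). As a minimal sketch of that convention, written in plain Python and not part of the ContextRouter API, a nested-dict taxonomy can be flattened into dotted paths:

```python
# Illustrative only: flatten a nested-dict taxonomy into the dotted paths
# used by taxonomy filters later on this page. Not a ContextRouter API.
def flatten_taxonomy(tree: dict, prefix: str = "") -> list[str]:
    paths = []
    for name, children in tree.items():
        path = f"{prefix}.{name}" if prefix else name
        paths.append(path)
        paths.extend(flatten_taxonomy(children, path))
    return paths

taxonomy = {"Technology": {"AI": {"Machine Learning": {}, "NLP": {}}}}
print(flatten_taxonomy(taxonomy))
# ['Technology', 'Technology.AI', 'Technology.AI.Machine Learning', 'Technology.AI.NLP']
```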
### What is Ontology?

An ontology defines the types of entities in your domain and how they can relate:
```yaml
# Entity types
entities:
  - Person
  - Organization
  - Technology
  - Concept
  - Location

# Allowed relationships
relations:
  - (Person, works_at, Organization)
  - (Person, created, Technology)
  - (Technology, used_by, Organization)
  - (Technology, part_of, Technology)
  - (Concept, related_to, Concept)
```
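To make the constraint idea concrete, here is a small illustrative check of a candidate triple against the allowed (source, relation, target) combinations. The schema is hard-coded from the YAML above and the helper is not part of ContextRouter:

```python
# Sketch: validate a candidate triple against the allowed relation schema above.
ALLOWED = {
    ("Person", "works_at", "Organization"),
    ("Person", "created", "Technology"),
    ("Technology", "used_by", "Organization"),
    ("Technology", "part_of", "Technology"),
    ("Concept", "related_to", "Concept"),
}

def is_allowed(source_type: str, relation: str, target_type: str) -> bool:
    return (source_type, relation, target_type) in ALLOWED

print(is_allowed("Person", "works_at", "Organization"))  # True
print(is_allowed("Organization", "works_at", "Person"))  # False
```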
## Building Taxonomy

### Automatic Generation (LLM-Based)
ContextRouter can automatically discover categories from your content:
```bash
contextrouter ingest structure --type book
```

The process (a simplified sketch appears after the list):
- Sample representative chunks from your documents
- Analyze with LLM to identify topics and themes
- Cluster similar concepts
- Build hierarchical tree
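The sketch below walks through these steps in plain Python. It is illustrative only: `propose_topics` is a hypothetical stand-in for the LLM analysis and clustering, and the real builder is driven by the configuration that follows.

```python
import random

def propose_topics(text: str) -> list[str]:
    """Placeholder for the LLM call that identifies topic paths in a chunk."""
    return ["Technology > AI > Machine Learning"]

def build_tree(chunks: list[str], sampling_rate: float = 0.1) -> dict:
    # 1. Sample a fraction of chunks; 2. ask the (stubbed) LLM for topic paths;
    # 3./4. merge the returned paths into one hierarchical tree.
    sample = random.sample(chunks, max(1, int(len(chunks) * sampling_rate)))
    tree: dict = {}
    for chunk in sample:
        for topic_path in propose_topics(chunk):
            node = tree
            for level in (part.strip() for part in topic_path.split(">")):
                node = node.setdefault(level, {})
    return tree

print(build_tree(["a chunk about neural networks"] * 20))
# {'Technology': {'AI': {'Machine Learning': {}}}}
```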
Configuration:
```toml
[ingestion.rag.taxonomy]
builder = "llm"
max_depth = 4                 # Maximum hierarchy depth
min_samples_per_category = 3  # Minimum docs per category
sampling_rate = 0.1           # Fraction of chunks to analyze (0.1 = 10%)
```
### Manual Taxonomy

Provide your own taxonomy for controlled vocabularies:
{ "Technology": { "AI": { "Machine Learning": { "Deep Learning": {}, "Classical ML": {} }, "NLP": {} }, "Cloud": { "AWS": {}, "GCP": {}, "Azure": {} } }, "Business": { "Finance": {}, "Marketing": {} }}[ingestion.rag.taxonomy]builder = "manual"taxonomy_path = "./taxonomy.json"Hybrid Approach
### Hybrid Approach

Start with manual top-level categories, then auto-expand the deeper levels:
```toml
[ingestion.rag.taxonomy]
builder = "hybrid"
seed_taxonomy_path = "./seed_taxonomy.json"
auto_expand_depth = 2  # Auto-generate 2 levels below the seed
```
## Using Taxonomy in Retrieval

### Filtered Search
Query only specific categories:
```python
results = await pipeline.retrieve(
    query="best practices",
    taxonomy_filter="Technology.AI.Machine Learning",
)
```
### Concept Extraction

During intent detection, taxonomy concepts are automatically extracted:
```python
# User: "How do I train a transformer model?"

# Extracted concepts:
taxonomy_concepts = [
    "Technology.AI.Machine Learning.Deep Learning",
    "Technology.AI.NLP",
]
```

These concepts:
- Filter retrieval to relevant categories (see the sketch after this list)
- Guide knowledge graph lookups
- Improve reranking relevance
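To picture how extracted concepts narrow retrieval, a chunk whose category path falls under one of the concept paths would pass the filter. This is an illustration of the idea only, not the pipeline's internal code:

```python
# Illustrative sketch of category filtering: keep chunks whose taxonomy path
# falls under one of the extracted concept paths.
def matches(chunk_path: str, concepts: list[str]) -> bool:
    return any(chunk_path == c or chunk_path.startswith(c + ".") for c in concepts)

taxonomy_concepts = [
    "Technology.AI.Machine Learning.Deep Learning",
    "Technology.AI.NLP",
]
print(matches("Technology.AI.NLP.Question Answering", taxonomy_concepts))  # True
print(matches("Business.Finance", taxonomy_concepts))                      # False
```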
## Building Ontology
### Automatic Extraction
The ontology is built alongside taxonomy:
```bash
contextrouter ingest structure --type book
```

The LLM identifies:
- Entity types — What kinds of things appear in your content?
- Relation types — How do these things connect?
- Constraints — Which relations make sense between which entities?
### Manual Ontology
Define your own schema:
{ "entities": [ {"name": "Person", "description": "A human individual"}, {"name": "Company", "description": "A business organization"}, {"name": "Product", "description": "A software or physical product"}, {"name": "Technology", "description": "A technical concept or tool"} ], "relations": [ {"name": "works_at", "source": "Person", "target": "Company"}, {"name": "founded", "source": "Person", "target": "Company"}, {"name": "created", "source": "Person", "target": "Product"}, {"name": "uses", "source": "Product", "target": "Technology"}, {"name": "competes_with", "source": "Company", "target": "Company"} ]}Entity Extraction Example
### Entity Extraction Example

With the ontology defined, the ingestion pipeline extracts entities and relations:
Input text:

```
"Sam Altman, CEO of OpenAI, announced GPT-5 at the 2024 conference.
The new model uses transformer architecture and was trained on
Azure's infrastructure."
```
Extracted entities:

- Sam Altman (Person)
- OpenAI (Company)
- GPT-5 (Product)
- transformer (Technology)
- Azure (Company)
Extracted relations:

- (Sam Altman, works_at, OpenAI)
- (OpenAI, created, GPT-5)
- (GPT-5, uses, transformer)
- (OpenAI, uses, Azure)

These become nodes and edges in your knowledge graph.
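To make the nodes and edges concrete, here is a small sketch that loads the triples above into a directed graph. `networkx` is used purely for illustration and may not match ContextRouter's actual graph store:

```python
import networkx as nx

# Sketch: turn the extracted entities and relations into a small directed graph.
entities = {
    "Sam Altman": "Person",
    "OpenAI": "Company",
    "GPT-5": "Product",
    "transformer": "Technology",
    "Azure": "Company",
}
relations = [
    ("Sam Altman", "works_at", "OpenAI"),
    ("OpenAI", "created", "GPT-5"),
    ("GPT-5", "uses", "transformer"),
    ("OpenAI", "uses", "Azure"),
]

graph = nx.DiGraph()
for name, entity_type in entities.items():
    graph.add_node(name, type=entity_type)
for source, relation, target in relations:
    graph.add_edge(source, target, relation=relation)

print(graph.number_of_nodes(), graph.number_of_edges())  # 5 4
```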
## Output Files
After running `contextrouter ingest structure`:
```
ingestion_output/
├── taxonomy.json           # Category hierarchy
├── ontology.json           # Entity/relation schema
├── taxonomy_stats.json     # Distribution statistics
└── ontology_examples.json  # Example extractions
```
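A quick way to inspect the generated artifacts, as a sketch that assumes `taxonomy.json` uses the nested-dict layout from the Manual Taxonomy example and `ontology.json` uses the entity/relation schema shown earlier:

```python
import json
from pathlib import Path

# Sketch: peek at the generated artifacts. The JSON layouts are assumptions
# based on the examples earlier on this page.
out = Path("ingestion_output")

taxonomy = json.loads((out / "taxonomy.json").read_text())
ontology = json.loads((out / "ontology.json").read_text())

print("Top-level categories:", list(taxonomy))
print("Entity types:", [entity["name"] for entity in ontology["entities"]])
```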
## Configuration Reference

```toml
[ingestion.rag.taxonomy]
# Builder type: "llm", "manual", or "hybrid"
builder = "llm"

# For manual/hybrid
taxonomy_path = "./taxonomy.json"
seed_taxonomy_path = "./seed.json"

# LLM generation settings
max_depth = 4
min_samples_per_category = 3
sampling_rate = 0.1
merge_similar_threshold = 0.8

[ingestion.rag.ontology]
# Entity types to extract
entity_types = ["Person", "Organization", "Technology", "Concept"]

# Relation types to identify
relation_types = ["works_at", "created", "uses", "part_of", "related_to"]

# Extraction settings
max_entities_per_chunk = 10
confidence_threshold = 0.7
```
## Best Practices

- Start broad, refine later — Let the LLM discover categories, then manually curate
- Balance depth vs. sparsity — Too deep = sparse categories; too shallow = poor filtering
- Review extracted entities — Spot-check the ontology examples before full extraction
- Iterate on taxonomy — Run generation multiple times and compare results
- Version your schemas — Keep taxonomy.json and ontology.json in version control
- Test filtered retrieval — Verify that taxonomy filters actually improve results (see the sketch below)
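For the last point, a comparison of unfiltered and filtered retrieval might look like the sketch below. It reuses the `retrieve()` call shown earlier and assumes that `taxonomy_filter` may be omitted and that the call returns a sized list; how you judge whether results actually improved (spot checks, labeled eval queries) is up to you.

```python
# Sketch: run the same query with and without a taxonomy filter, reusing the
# pipeline.retrieve() call shown earlier on this page.
query = "best practices"

unfiltered = await pipeline.retrieve(query=query)
filtered = await pipeline.retrieve(
    query=query,
    taxonomy_filter="Technology.AI.Machine Learning",
)

print(f"unfiltered: {len(unfiltered)} results, filtered: {len(filtered)} results")
```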