Metadata Extraction System
This guide explains how to use the metadata extraction system to generate navigation structures, search indexes, and cross-references from content metadata.
Overview
The metadata extraction system processes frontmatter metadata from all content files and generates:
- Search Indexes: Structured data for powering search functionality
- Navigation Structures: Automatically generated hierarchical navigation based on content organization
- Taxonomy Data: Classification of content by business area, tags, content level, and implementation difficulty
- Cross-References: Processed relationship data for related content, prerequisites, and next steps
- Content Maps: Comprehensive metadata for all content items
Using the Extraction System
Command Line Usage
The metadata extraction system can be run directly from the command line:
# Extract metadata with default options
node scripts/js/extract-metadata.js
# Extract metadata with custom options
node scripts/js/extract-metadata.js --dir=./custom-docs --output=./custom-output
# Generate only specific outputs
node scripts/js/extract-metadata.js --search-index --relations
Available Options
| Option | Description | Default |
|---|---|---|
| --dir, -d | Directory to scan for Markdown files | ./docs |
| --output, -o | Output directory for generated files | ./scripts/data |
| --search-index, -s | Generate search index | true |
| --navigation, -n | Generate navigation structure | true |
| --taxonomy, -t | Generate taxonomy data | true |
| --relations, -r | Include relationship data | true |
| --verbose, -v | Enable verbose output | false |
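When the script is driven from another Node.js tool, it can help to build the argument list from an options object. The following is a minimal sketch, not part of the extraction system: buildExtractArgs is a hypothetical helper, and it assumes that every boolean output accepts the =false form that this guide shows for --taxonomy.

```javascript
// Hypothetical helper: turns an options object into CLI arguments for
// scripts/js/extract-metadata.js. Only flags listed in the options table are used.
function buildExtractArgs(options = {}) {
  const args = ['scripts/js/extract-metadata.js'];
  if (options.dir) args.push(`--dir=${options.dir}`);
  if (options.output) args.push(`--output=${options.output}`);
  // The outputs default to true, so only the ones explicitly set to false are passed.
  // Assumption: each of these flags accepts "=false", as --taxonomy does.
  for (const flag of ['search-index', 'navigation', 'taxonomy', 'relations']) {
    if (options[flag] === false) args.push(`--${flag}=false`);
  }
  if (options.verbose) args.push('--verbose');
  return args;
}

const args = buildExtractArgs({ dir: './custom-docs', taxonomy: false, verbose: true });
console.log(['node', ...args].join(' '));
```

The resulting array can be handed to Node's child_process.execFileSync('node', args) to run the extraction from a build script.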
Integration with Docusaurus
The metadata extraction system is automatically integrated with the Docusaurus build process through a custom plugin. You don't need to run it manually during builds.
The plugin is configured in docusaurus.config.js:
plugins: [
  require.resolve('./plugins/metadata-extraction-plugin'),
]
This integration provides:
- Client-side access to metadata through aliases
- Preloaded search index for faster search initialization
- Dynamic routes for metadata-based navigation
- Tag and business area exploration pages
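To illustrate how client-side aliases of this kind can be provided, here is a hypothetical sketch of a Docusaurus plugin that registers a webpack alias for the generated data directory. It is not the actual plugin source: the alias name @metadata, the dataDir option, and the default data location are assumptions made for the example. Only the standard plugin shape (a function returning an object with name and configureWebpack) is taken from Docusaurus itself.

```javascript
// Hypothetical sketch, not the real metadata-extraction-plugin.
// In an actual plugin file this function would be the module's export.
function metadataExtractionPlugin(context, options = {}) {
  // Assumption: generated files live in <siteDir>/scripts/data unless overridden.
  const dataDir = options.dataDir || `${context.siteDir}/scripts/data`;
  return {
    name: 'metadata-extraction-plugin',
    // Docusaurus merges the returned object into its webpack configuration,
    // so client code could import '@metadata/content-map.json'.
    configureWebpack() {
      return { resolve: { alias: { '@metadata': dataDir } } };
    },
  };
}

const plugin = metadataExtractionPlugin({ siteDir: '/site' });
console.log(plugin.name, plugin.configureWebpack().resolve.alias);
```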
Generated Data Files
The system generates the following data files:
Content Map (content-map.json)
Contains comprehensive metadata for all content items, including:
- Basic information (ID, title, description)
- File paths and URL slugs
- Frontmatter data
- Extracted headings
- Normalized relationships
{
  "content-id": {
    "id": "content-id",
    "path": "/path/to/file.md",
    "title": "Content Title",
    "description": "Content description...",
    "slug": "/content-slug",
    "headings": [
      { "level": 1, "text": "Heading 1" },
      { "level": 2, "text": "Heading 2" }
    ],
    "relationships": {
      "relatedPages": ["related-id-1", "related-id-2"],
      "prerequisites": ["prereq-id"]
    }
  }
}
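Because the content map is keyed by content ID, relationship IDs can be resolved with plain object lookups. The sketch below assumes the shape shown above. The sample entries and the getPrerequisites helper are made up for illustration, and in practice the object would be loaded from content-map.json.

```javascript
// Sample data in the documented content-map.json shape (entries are invented).
const contentMap = {
  'getting-started': {
    id: 'getting-started',
    title: 'Getting Started',
    slug: '/getting-started',
    relationships: { relatedPages: [], prerequisites: [] },
  },
  'advanced-setup': {
    id: 'advanced-setup',
    title: 'Advanced Setup',
    slug: '/advanced-setup',
    relationships: {
      relatedPages: ['getting-started'],
      prerequisites: ['getting-started', 'missing-page'],
    },
  },
};

// Resolve a page's prerequisite IDs to { id, title, slug }, skipping unknown IDs.
function getPrerequisites(map, id) {
  const entry = map[id];
  if (!entry) return [];
  const ids = (entry.relationships && entry.relationships.prerequisites) || [];
  return ids
    .filter((p) => Object.prototype.hasOwnProperty.call(map, p))
    .map((p) => ({ id: p, title: map[p].title, slug: map[p].slug }));
}

console.log(getPrerequisites(contentMap, 'advanced-setup'));
```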
Search Index (search-index.json)
Optimized data structure for search functionality:
- Page entries with metadata
- Heading entries for in-page navigation
- Business area and tag information for filtering
[
  {
    "id": "content-id",
    "type": "page",
    "title": "Content Title",
    "content": "Content description...",
    "url": "/content-slug",
    "tags": ["tag1", "tag2"],
    "businessArea": "operations"
  },
  {
    "id": "content-id-heading-0",
    "type": "heading",
    "title": "Heading 1",
    "url": "/content-slug#heading-1",
    "pageTitle": "Content Title"
  }
]
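A flat array like this can be queried without a search library. The following sketch shows a naive substring search with optional tag and type filters. The searchIndex function is a hypothetical helper and is not how the site's search is actually implemented. The sample entries mirror the ones above.

```javascript
// Sample entries in the documented search-index.json shape.
const index = [
  {
    id: 'content-id', type: 'page', title: 'Content Title',
    content: 'Content description...', url: '/content-slug',
    tags: ['tag1', 'tag2'], businessArea: 'operations',
  },
  {
    id: 'content-id-heading-0', type: 'heading', title: 'Heading 1',
    url: '/content-slug#heading-1', pageTitle: 'Content Title',
  },
];

// Case-insensitive substring match on title and content,
// optionally restricted to one entry type and/or one tag.
function searchIndex(entries, query, { tag, type } = {}) {
  const q = query.trim().toLowerCase();
  return entries.filter((entry) => {
    if (type && entry.type !== type) return false;
    if (tag && !(entry.tags || []).includes(tag)) return false;
    const haystack = `${entry.title} ${entry.content || ''}`.toLowerCase();
    return haystack.includes(q);
  });
}

console.log(searchIndex(index, 'heading').map((e) => e.url));
console.log(searchIndex(index, 'content', { tag: 'tag1' }).map((e) => e.url));
```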
Navigation (navigation.json)
Hierarchical structure for dynamic navigation menus:
{
  "category-1": {
    "title": "Category 1",
    "items": {
      "subcategory": {
        "title": "Subcategory",
        "items": {
          "page-1": {
            "title": "Page 1",
            "path": "/page-1"
          }
        }
      }
    }
  }
}
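Since categories nest through their items objects, consuming this file usually means walking it recursively. The sketch below flattens the tree into a list of pages with breadcrumbs. It assumes, based on the example above, that nodes with a path are pages and nodes with items are categories. flattenNavigation is a hypothetical helper.

```javascript
// Sample data in the documented navigation.json shape.
const navigation = {
  'category-1': {
    title: 'Category 1',
    items: {
      subcategory: {
        title: 'Subcategory',
        items: { 'page-1': { title: 'Page 1', path: '/page-1' } },
      },
    },
  },
};

// Depth-first walk that collects every node with a path,
// recording the chain of titles leading to it as a breadcrumb.
function flattenNavigation(items, trail = []) {
  const pages = [];
  for (const [key, node] of Object.entries(items)) {
    const titles = [...trail, node.title];
    if (node.path) pages.push({ key, path: node.path, breadcrumb: titles.join(' > ') });
    if (node.items) pages.push(...flattenNavigation(node.items, titles));
  }
  return pages;
}

console.log(flattenNavigation(navigation));
```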
Taxonomy (taxonomy.json)
Classification of content by various attributes:
{
  "businessAreas": {
    "operations": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  },
  "tags": {
    "tag1": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  },
  "contentLevels": {
    "article": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  }
}
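Each taxonomy dimension maps a term to the list of pages carrying it, so summaries such as a tag cloud reduce to counting array lengths. The following is a minimal sketch with invented sample data. countByTerm is a hypothetical helper, not part of the extraction system.

```javascript
// Sample data in the documented taxonomy.json shape (entries are invented).
const taxonomy = {
  tags: {
    tag1: [
      { id: 'page-a', title: 'Page A', url: '/page-a' },
      { id: 'page-b', title: 'Page B', url: '/page-b' },
    ],
    tag2: [{ id: 'page-a', title: 'Page A', url: '/page-a' }],
  },
  businessAreas: {
    operations: [{ id: 'page-a', title: 'Page A', url: '/page-a' }],
  },
};

// Count pages per term in one dimension, most used first, ties sorted by name.
function countByTerm(data, dimension) {
  const terms = data[dimension] || {};
  return Object.entries(terms)
    .map(([term, pages]) => ({ term, count: pages.length }))
    .sort((a, b) => b.count - a.count || a.term.localeCompare(b.term));
}

console.log(countByTerm(taxonomy, 'tags'));
```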
Relationship Map (relationship-map.json)
Processed relationship data with resolved links:
{
  "content-id": {
    "id": "content-id",
    "title": "Content Title",
    "url": "/content-slug",
    "relationships": {
      "relatedPages": [
        {
          "id": "related-id-1",
          "title": "Related Content 1",
          "url": "/related-content-1",
          "exists": true
        }
      ]
    }
  }
}
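The exists flag makes it straightforward to report references that point at missing content. The sketch below assumes that unresolved targets are written with "exists": false, which the example above does not show explicitly. findBrokenLinks is a hypothetical helper and the sample data is invented.

```javascript
// Sample data in the documented relationship-map.json shape.
const relationshipMap = {
  'page-a': {
    id: 'page-a', title: 'Page A', url: '/page-a',
    relationships: {
      relatedPages: [
        { id: 'page-b', title: 'Page B', url: '/page-b', exists: true },
        { id: 'ghost', title: 'ghost', url: '/ghost', exists: false },
      ],
      prerequisites: [{ id: 'intro', title: 'intro', url: '/intro', exists: false }],
    },
  },
};

// Collect every relationship whose target is flagged as missing.
// Assumption: unresolved targets carry exists === false.
function findBrokenLinks(map) {
  const broken = [];
  for (const page of Object.values(map)) {
    for (const [kind, targets] of Object.entries(page.relationships || {})) {
      for (const target of targets) {
        if (target.exists === false) broken.push({ from: page.id, kind, to: target.id });
      }
    }
  }
  return broken;
}

console.log(findBrokenLinks(relationshipMap));
```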
Metadata Explorer
The metadata extraction system provides a Metadata Explorer interface at /metadata-explorer, which allows you to:
- Browse the content structure
- View relationships between content
- Explore content by tag or business area
- Search across all content
- Validate metadata consistency
Troubleshooting
Missing or Incomplete Data
If metadata is missing or incomplete in the generated files:
- Check that your content files have the necessary frontmatter
- Run the extraction script with --verbose to see detailed output
- Look for error messages in the console output
- Verify that the content files are properly formatted Markdown (.md or .mdx)
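The frontmatter check in the list above can be automated with a small script. This is a rough sketch that only recognizes simple top-level "key: value" lines and does not use a real YAML parser. The missingFrontmatterKeys helper and the required key names are hypothetical, since this guide does not list which fields the extractor requires.

```javascript
// Returns the required keys that are absent from a Markdown file's frontmatter.
// Only top-level "key:" lines between the opening and closing --- are recognized.
function missingFrontmatterKeys(source, requiredKeys) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(source);
  if (!match) return [...requiredKeys]; // no frontmatter block at all
  const present = new Set(
    match[1]
      .split(/\r?\n/)
      .map((line) => /^([A-Za-z0-9_-]+)\s*:/.exec(line))
      .filter(Boolean)
      .map((m) => m[1])
  );
  return requiredKeys.filter((key) => !present.has(key));
}

const sample = '---\ntitle: Example Page\ntags: [tag1]\n---\n\nBody text';
// The required keys here are illustrative, not the system's actual schema.
console.log(missingFrontmatterKeys(sample, ['title', 'description', 'tags']));
```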
Integration Issues
If the Docusaurus integration isn't working:
- Ensure the plugin is properly configured in docusaurus.config.js
- Check that the referenced component files exist
- Look for errors in the Docusaurus build logs
- Try running the extraction script manually to see if it works
Performance Concerns
For large documentation sets, extraction might become slow. To optimize:
- Use more specific directory targeting with the --dir option
- Disable generation of unused outputs (e.g., --taxonomy=false)
- Consider splitting documentation into multiple repositories
Extending the System
The metadata extraction system is designed to be extensible. You can:
- Add new output generators in extract-metadata.js
- Create custom visualization components for the extracted data
- Build additional integrations for the generated data
- Extend the taxonomy with additional classification dimensions
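As an illustration of what an additional output generator might look like, the sketch below groups content map entries by implementation difficulty. It is hypothetical throughout. The generateDifficultyIndex name, the frontmatter.difficulty field, and the output shape (modelled on taxonomy.json) are assumptions, and the real generator interface in extract-metadata.js may differ.

```javascript
// Hypothetical generator: groups pages by a frontmatter "difficulty" field.
// Assumption: each content map entry exposes its frontmatter under "frontmatter".
function generateDifficultyIndex(contentMap) {
  const index = {};
  for (const entry of Object.values(contentMap)) {
    const level = (entry.frontmatter && entry.frontmatter.difficulty) || 'unspecified';
    if (!index[level]) index[level] = [];
    index[level].push({ id: entry.id, title: entry.title, url: entry.slug });
  }
  return index;
}

// Invented sample entries for demonstration.
const sampleMap = {
  'page-a': { id: 'page-a', title: 'Page A', slug: '/page-a', frontmatter: { difficulty: 'beginner' } },
  'page-b': { id: 'page-b', title: 'Page B', slug: '/page-b', frontmatter: {} },
};

console.log(JSON.stringify(generateDifficultyIndex(sampleMap), null, 2));
```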