Metadata Extraction System

This guide explains how to use the metadata extraction system to generate navigation structures, search indexes, and cross-references from content metadata.

Overview

The metadata extraction system processes frontmatter metadata from all content files and generates:

  1. Search Indexes: Structured data for powering search functionality
  2. Navigation Structures: Automatically generated hierarchical navigation based on content organization
  3. Taxonomy Data: Classification of content by business area, tags, content level, and implementation difficulty
  4. Cross-References: Processed relationship data for related content, prerequisites, and next steps
  5. Content Maps: Comprehensive metadata for all content items

Using the Extraction System

Command Line Usage

The metadata extraction system can be run directly from the command line:

# Extract metadata with default options
node scripts/js/extract-metadata.js

# Extract metadata with custom options
node scripts/js/extract-metadata.js --dir=./custom-docs --output=./custom-output

# Generate only specific outputs
node scripts/js/extract-metadata.js --search-index --relations

Available Options

Option               Description                             Default
--dir, -d            Directory to scan for Markdown files    ./docs
--output, -o         Output directory for generated files    ./scripts/data
--search-index, -s   Generate search index                   true
--navigation, -n     Generate navigation structure           true
--taxonomy, -t       Generate taxonomy data                  true
--relations, -r      Include relationship data               true
--verbose, -v        Enable verbose output                   false

Integration with Docusaurus

The metadata extraction system is automatically integrated with the Docusaurus build process through a custom plugin. You don't need to run it manually during builds.

The plugin is configured in docusaurus.config.js:

plugins: [
  require.resolve('./plugins/metadata-extraction-plugin'),
]

This integration provides:

  • Client-side access to metadata through aliases (see the sketch after this list)
  • Preloaded search index for faster search initialization
  • Dynamic routes for metadata-based navigation
  • Tag and business area exploration pages
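
For example, a theme component can import the generated files directly. The sketch below assumes the plugin registers a webpack alias such as '@metadata' for the output directory; the actual alias name depends on the plugin, so check plugins/metadata-extraction-plugin for what it registers.

// Client-side sketch: filter the preloaded search index by tag.
// '@metadata' is an assumed alias name, not necessarily the one the plugin uses.
import searchIndex from '@metadata/search-index.json';

const taggedPages = searchIndex.filter(
  (entry) => entry.type === 'page' && (entry.tags || []).includes('tag1')
);

console.log(taggedPages.map((entry) => entry.title));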

Generated Data Files

The system generates the following data files:

Content Map (content-map.json)

Contains comprehensive metadata for all content items, including:

  • Basic information (ID, title, description)
  • File paths and URL slugs
  • Frontmatter data
  • Extracted headings
  • Normalized relationships

{
  "content-id": {
    "id": "content-id",
    "path": "/path/to/file.md",
    "title": "Content Title",
    "description": "Content description...",
    "slug": "/content-slug",
    "headings": [
      { "level": 1, "text": "Heading 1" },
      { "level": 2, "text": "Heading 2" }
    ],
    "relationships": {
      "relatedPages": ["related-id-1", "related-id-2"],
      "prerequisites": ["prereq-id"]
    }
  }
}
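
Because the content map is plain JSON, any Node script can consume it. The following sketch uses the default output path from the options above and reports entries that were extracted without a description; it is an illustrative helper, not part of the toolchain.

// Load the generated content map from the default output directory.
const contentMap = require('./scripts/data/content-map.json');

// Report entries whose frontmatter did not provide a description.
for (const [id, entry] of Object.entries(contentMap)) {
  if (!entry.description) {
    console.log(`Missing description: ${id} (${entry.path})`);
  }
}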

Search Index (search-index.json)

Optimized data structure for search functionality:

  • Page entries with metadata
  • Heading entries for in-page navigation
  • Business area and tag information for filtering

[
  {
    "id": "content-id",
    "type": "page",
    "title": "Content Title",
    "content": "Content description...",
    "url": "/content-slug",
    "tags": ["tag1", "tag2"],
    "businessArea": "operations"
  },
  {
    "id": "content-id-heading-0",
    "type": "heading",
    "title": "Heading 1",
    "url": "/content-slug#heading-1",
    "pageTitle": "Content Title"
  }
]
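
A simple client can query this flat array directly, or hand it to a search library. The sketch below shows a naive keyword match over page entries; it assumes only the fields visible in the example above.

const searchIndex = require('./scripts/data/search-index.json');

// Naive keyword search over page entries.
function search(query) {
  const term = query.toLowerCase();
  return searchIndex.filter(
    (entry) =>
      entry.type === 'page' &&
      (entry.title.toLowerCase().includes(term) ||
        (entry.content || '').toLowerCase().includes(term))
  );
}

console.log(search('metadata').map((entry) => entry.url));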

Navigation Structure (navigation.json)

Hierarchical structure for dynamic navigation menus:

{
  "category-1": {
    "title": "Category 1",
    "items": {
      "subcategory": {
        "title": "Subcategory",
        "items": {
          "page-1": {
            "title": "Page 1",
            "path": "/page-1"
          }
        }
      }
    }
  }
}
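
Because every level follows the same title/items shape, a menu can be rendered with a short recursive walk. The sketch below simply prints the tree; it assumes the navigation output is written to navigation.json in the default output directory.

const navigation = require('./scripts/data/navigation.json');

// Recursively print the navigation tree with indentation.
function printTree(node, depth = 0) {
  for (const entry of Object.values(node)) {
    const label = entry.path ? `${entry.title} (${entry.path})` : entry.title;
    console.log(`${'  '.repeat(depth)}${label}`);
    if (entry.items) {
      printTree(entry.items, depth + 1);
    }
  }
}

printTree(navigation);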

Taxonomy (taxonomy.json)

Classification of content by various attributes:

{
  "businessAreas": {
    "operations": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  },
  "tags": {
    "tag1": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  },
  "contentLevels": {
    "article": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  }
}
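
Each bucket maps a classification value to the pages that carry it, which makes tag and business area index pages straightforward to build. A minimal sketch using the default output path:

const taxonomy = require('./scripts/data/taxonomy.json');

// List every page filed under a given tag.
function pagesForTag(tag) {
  return (taxonomy.tags[tag] || []).map((page) => `${page.title} -> ${page.url}`);
}

console.log(pagesForTag('tag1'));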

Relationship Map (relationship-map.json)

Processed relationship data with resolved links:

{
  "content-id": {
    "id": "content-id",
    "title": "Content Title",
    "url": "/content-slug",
    "relationships": {
      "relatedPages": [
        {
          "id": "related-id-1",
          "title": "Related Content 1",
          "url": "/related-content-1",
          "exists": true
        }
      ]
    }
  }
}
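
The exists flag makes this file a convenient input for link checking. The sketch below walks every relationship group and reports references that did not resolve; it is an illustrative script, not part of the shipped tooling.

const relationshipMap = require('./scripts/data/relationship-map.json');

// Report cross-references that point at content that was not found.
for (const entry of Object.values(relationshipMap)) {
  for (const [kind, targets] of Object.entries(entry.relationships || {})) {
    for (const target of targets) {
      if (!target.exists) {
        console.log(`${entry.id}: broken ${kind} reference to ${target.id}`);
      }
    }
  }
}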

Metadata Explorer

The metadata extraction system provides a Metadata Explorer interface at /metadata-explorer, which allows you to:

  • Browse the content structure
  • View relationships between content
  • Explore content by tag or business area
  • Search across all content
  • Validate metadata consistency

Troubleshooting

Missing or Incomplete Data

If metadata is missing or incomplete in the generated files:

  1. Check that your content files have the necessary frontmatter (see the example after this list)
  2. Run the extraction script with --verbose to see detailed output
  3. Look for error messages in the console output
  4. Verify that the content files are properly formatted markdown (.md or .mdx)
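
As a reference point, the frontmatter below illustrates the kind of fields the extractor works with. The exact field names are defined in extract-metadata.js; keys such as business_area, content_level, and related_pages are guesses based on the generated outputs, so match them against an existing, correctly indexed content file.

---
title: Content Title
description: Content description...
tags:
  - tag1
  - tag2
business_area: operations
content_level: article
related_pages:
  - related-id-1
prerequisites:
  - prereq-id
---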

Integration Issues

If the Docusaurus integration isn't working:

  1. Ensure the plugin is properly configured in docusaurus.config.js
  2. Check that the referenced component files exist
  3. Look for errors in the Docusaurus build logs
  4. Try running the extraction script manually to see if it works

Performance Concerns

For large documentation sets, extraction might become slow. To optimize:

  1. Use more specific directory targeting with the --dir option
  2. Disable generation of unused outputs (e.g., --taxonomy=false)
  3. Consider splitting documentation into multiple repositories

Extending the System

The metadata extraction system is designed to be extensible. You can:

  1. Add new output generators in extract-metadata.js (a sketch follows this list)
  2. Create custom visualization components for the extracted data
  3. Build additional integrations for the generated data
  4. Extend the taxonomy with additional classification dimensions
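
For example, a new generator could reuse the content map the script already builds and write an extra JSON file alongside the existing outputs. The sketch below is a rough illustration only: the function name, the frontmatter field, and the assumption that the script exposes the parsed content map and output directory are all hypothetical and depend on how extract-metadata.js is structured internally.

const fs = require('fs');
const path = require('path');

// Hypothetical generator: groups page slugs by business area and writes them
// next to the other generated files. Field names are assumptions.
function generateBusinessAreaUrls(contentMap, outputDir) {
  const byArea = {};
  for (const entry of Object.values(contentMap)) {
    const area = (entry.frontmatter && entry.frontmatter.business_area) || 'uncategorized';
    (byArea[area] = byArea[area] || []).push(entry.slug);
  }
  fs.writeFileSync(
    path.join(outputDir, 'business-area-urls.json'),
    JSON.stringify(byArea, null, 2)
  );
}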