Metadata Extraction System

This guide explains how to use the metadata extraction system to generate navigation structures, search indexes, and cross-references from content metadata.

Overview

The metadata extraction system processes frontmatter metadata from all content files and generates:

  1. Search Indexes: Structured data for powering search functionality
  2. Navigation Structures: Automatically generated hierarchical navigation based on content organization
  3. Taxonomy Data: Classification of content by business area, tags, content level, and implementation difficulty
  4. Cross-References: Processed relationship data for related content, prerequisites, and next steps
  5. Content Maps: Comprehensive metadata for all content items

Using the Extraction System

Command Line Usage

The metadata extraction system can be run directly from the command line:

# Extract metadata with default options
node scripts/js/extract-metadata.js

# Extract metadata with custom options
node scripts/js/extract-metadata.js --dir=./custom-docs --output=./custom-output

# Generate only specific outputs
node scripts/js/extract-metadata.js --search-index --relations

Available Options

Option               Description                             Default
--dir, -d            Directory to scan for Markdown files    ./docs
--output, -o         Output directory for generated files    ./scripts/data
--search-index, -s   Generate search index                   true
--navigation, -n     Generate navigation structure           true
--taxonomy, -t       Generate taxonomy data                  true
--relations, -r      Include relationship data               true
--verbose, -v        Enable verbose output                   false

Integration with Docusaurus

The metadata extraction system is automatically integrated with the Docusaurus build process through a custom plugin. You don't need to run it manually during builds.

The plugin is configured in docusaurus.config.js:

plugins: [
  require.resolve('./plugins/metadata-extraction-plugin'),
]

This integration provides:

  • Client-side access to metadata through aliases (see the sketch after this list)
  • Preloaded search index for faster search initialization
  • Dynamic routes for metadata-based navigation
  • Tag and business area exploration pages
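
For example, a theme component can import the generated files directly. The sketch below assumes the plugin registers a webpack alias such as '@metadata' for the output directory; the actual alias name depends on the plugin, so check plugins/metadata-extraction-plugin for what it registers.

// Client-side sketch: filter the preloaded search index by tag.
// '@metadata' is an assumed alias name, not necessarily the one the plugin uses.
import searchIndex from '@metadata/search-index.json';

const taggedPages = searchIndex.filter(
  (entry) => entry.type === 'page' && (entry.tags || []).includes('tag1')
);

console.log(taggedPages.map((entry) => entry.title));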

Generated Data Files

The system generates the following data files:

Content Map (content-map.json)

Contains comprehensive metadata for all content items, including:

  • Basic information (ID, title, description)
  • File paths and URL slugs
  • Frontmatter data
  • Extracted headings
  • Normalized relationships

{
  "content-id": {
    "id": "content-id",
    "path": "/path/to/file.md",
    "title": "Content Title",
    "description": "Content description...",
    "slug": "/content-slug",
    "headings": [
      { "level": 1, "text": "Heading 1" },
      { "level": 2, "text": "Heading 2" }
    ],
    "relationships": {
      "relatedPages": ["related-id-1", "related-id-2"],
      "prerequisites": ["prereq-id"]
    }
  }
}
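
Because the content map is plain JSON, any Node script can consume it. The following sketch uses the default output path from the options above and reports entries that were extracted without a description; it is an illustrative helper, not part of the toolchain.

// Load the generated content map from the default output directory.
const contentMap = require('./scripts/data/content-map.json');

// Report entries whose frontmatter did not provide a description.
for (const [id, entry] of Object.entries(contentMap)) {
  if (!entry.description) {
    console.log(`Missing description: ${id} (${entry.path})`);
  }
}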

Search Index (search-index.json)

Optimized data structure for search functionality:

  • Page entries with metadata
  • Heading entries for in-page navigation
  • Business area and tag information for filtering

[
  {
    "id": "content-id",
    "type": "page",
    "title": "Content Title",
    "content": "Content description...",
    "url": "/content-slug",
    "tags": ["tag1", "tag2"],
    "businessArea": "operations"
  },
  {
    "id": "content-id-heading-0",
    "type": "heading",
    "title": "Heading 1",
    "url": "/content-slug#heading-1",
    "pageTitle": "Content Title"
  }
]
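
A simple client can query this flat array directly, or hand it to a search library. The sketch below shows a naive keyword match over page entries; it assumes only the fields visible in the example above.

const searchIndex = require('./scripts/data/search-index.json');

// Naive keyword search over page entries.
function search(query) {
  const term = query.toLowerCase();
  return searchIndex.filter(
    (entry) =>
      entry.type === 'page' &&
      (entry.title.toLowerCase().includes(term) ||
        (entry.content || '').toLowerCase().includes(term))
  );
}

console.log(search('metadata').map((entry) => entry.url));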

Navigation Structure (navigation.json)

Hierarchical structure for dynamic navigation menus:

{
  "category-1": {
    "title": "Category 1",
    "items": {
      "subcategory": {
        "title": "Subcategory",
        "items": {
          "page-1": {
            "title": "Page 1",
            "path": "/page-1"
          }
        }
      }
    }
  }
}
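
Because every level follows the same title/items shape, a menu can be rendered with a short recursive walk. The sketch below simply prints the tree; it assumes the navigation output is written to navigation.json in the default output directory.

const navigation = require('./scripts/data/navigation.json');

// Recursively print the navigation tree with indentation.
function printTree(node, depth = 0) {
  for (const entry of Object.values(node)) {
    const label = entry.path ? `${entry.title} (${entry.path})` : entry.title;
    console.log(`${'  '.repeat(depth)}${label}`);
    if (entry.items) {
      printTree(entry.items, depth + 1);
    }
  }
}

printTree(navigation);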

Taxonomy (taxonomy.json)

Classification of content by various attributes:

{
  "businessAreas": {
    "operations": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  },
  "tags": {
    "tag1": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  },
  "contentLevels": {
    "article": [
      { "id": "content-id", "title": "Content Title", "url": "/content-slug" }
    ]
  }
}
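
Each bucket maps a classification value to the pages that carry it, which makes tag and business area index pages straightforward to build. A minimal sketch using the default output path:

const taxonomy = require('./scripts/data/taxonomy.json');

// List every page filed under a given tag.
function pagesForTag(tag) {
  return (taxonomy.tags[tag] || []).map((page) => `${page.title} -> ${page.url}`);
}

console.log(pagesForTag('tag1'));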

Relationship Map (relationship-map.json)

Processed relationship data with resolved links:

{
  "content-id": {
    "id": "content-id",
    "title": "Content Title",
    "url": "/content-slug",
    "relationships": {
      "relatedPages": [
        {
          "id": "related-id-1",
          "title": "Related Content 1",
          "url": "/related-content-1",
          "exists": true
        }
      ]
    }
  }
}
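
The exists flag makes this file a convenient input for link checking. The sketch below walks every relationship group and reports references that did not resolve; it is an illustrative script, not part of the shipped tooling.

const relationshipMap = require('./scripts/data/relationship-map.json');

// Report cross-references that point at content that was not found.
for (const entry of Object.values(relationshipMap)) {
  for (const [kind, targets] of Object.entries(entry.relationships || {})) {
    for (const target of targets) {
      if (!target.exists) {
        console.log(`${entry.id}: broken ${kind} reference to ${target.id}`);
      }
    }
  }
}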

Metadata Explorer

The metadata extraction system provides a Metadata Explorer interface at /metadata-explorer, which allows you to:

  • Browse the content structure
  • View relationships between content
  • Explore content by tag or business area
  • Search across all content
  • Validate metadata consistency

Troubleshooting

Missing or Incomplete Data

If metadata is missing or incomplete in the generated files:

  1. Check that your content files have the necessary frontmatter (see the example after this list)
  2. Run the extraction script with --verbose to see detailed output
  3. Look for error messages in the console output
  4. Verify that the content files are properly formatted markdown (.md or .mdx)
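
As a reference point, the frontmatter below illustrates the kind of fields the extractor works with. The exact field names are defined in extract-metadata.js; keys such as business_area, content_level, and related_pages are guesses based on the generated outputs, so match them against an existing, correctly indexed content file.

---
title: Content Title
description: Content description...
tags:
  - tag1
  - tag2
business_area: operations
content_level: article
related_pages:
  - related-id-1
prerequisites:
  - prereq-id
---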

Integration Issues

If the Docusaurus integration isn't working:

  1. Ensure the plugin is properly configured in docusaurus.config.js
  2. Check that the referenced component files exist
  3. Look for errors in the Docusaurus build logs
  4. Try running the extraction script manually to see if it works

Performance Concerns

For large documentation sets, extraction might become slow. To optimize:

  1. Use more specific directory targeting with the --dir option
  2. Disable generation of unused outputs (e.g., --taxonomy=false)
  3. Consider splitting documentation into multiple repositories

Extending the System

The metadata extraction system is designed to be extensible. You can:

  1. Add new output generators in extract-metadata.js (a sketch follows this list)
  2. Create custom visualization components for the extracted data
  3. Build additional integrations for the generated data
  4. Extend the taxonomy with additional classification dimensions
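
For example, a new generator could reuse the content map the script already builds and write an extra JSON file alongside the existing outputs. The sketch below is a rough illustration only: the function name, the frontmatter field, and the assumption that the script exposes the parsed content map and output directory are all hypothetical and depend on how extract-metadata.js is structured internally.

const fs = require('fs');
const path = require('path');

// Hypothetical generator: groups page slugs by business area and writes them
// next to the other generated files. Field names are assumptions.
function generateBusinessAreaUrls(contentMap, outputDir) {
  const byArea = {};
  for (const entry of Object.values(contentMap)) {
    const area = (entry.frontmatter && entry.frontmatter.business_area) || 'uncategorized';
    (byArea[area] = byArea[area] || []).push(entry.slug);
  }
  fs.writeFileSync(
    path.join(outputDir, 'business-area-urls.json'),
    JSON.stringify(byArea, null, 2)
  );
}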