The Crawler Engine
The Crawler Engine is the structural foundation of Moltext. It is responsible for the recursive discovery and retrieval of web pages, transforming a single entry point into a comprehensive set of raw HTML data for the Processor to digest.
Designed for documentation sites, the engine prioritizes breadth-first traversal while maintaining strict boundaries to prevent "crawler drift."
Domain-Locked Traversal
To ensure the resulting `context.md` remains relevant and deterministic, the Crawler employs a strict domain lock.
- Hostname Enforcement: The engine extracts the `hostname` from the initial URL. Any links discovered during the crawl that point to a different domain or subdomain are automatically discarded.
- Protocol Filtering: Only `http:` and `https:` protocols are supported.
- Hash Removal: URL fragments (e.g., `docs.com/setup#requirements`) are normalized to their base URL to prevent redundant parsing of the same page.
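Taken together, these three rules reduce to a single URL filter. The sketch below is illustrative only (the function name and exact shape are assumptions, not Moltext internals); it uses the WHATWG `URL` class built into Node:

```typescript
// Hypothetical filter combining the domain-lock rules described above.
// Returns the normalized absolute URL, or null if the link must be discarded.
function normalizeAndFilter(link: string, entryPoint: string): string | null {
  let url: URL;
  try {
    url = new URL(link, entryPoint); // also resolves relative links
  } catch {
    return null; // malformed URL
  }
  // Protocol filtering: only http: and https: survive.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
  // Hostname enforcement: exact match, so subdomains are discarded too.
  if (url.hostname !== new URL(entryPoint).hostname) return null;
  // Hash removal: docs.com/setup#a and docs.com/setup#b collapse to one page.
  url.hash = '';
  return url.toString();
}
```

Because the hostname check is an exact string comparison, `api.docs.com` is rejected when the entry point is `docs.com`, matching the subdomain rule above.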
Recursive Link Discovery
The Crawler uses a queue-based system to navigate documentation hierarchies. It relies on `cheerio` to parse the DOM of every successfully fetched page, extracting all `<a>` tags to find new traversal paths.
- Normalization: Relative paths (e.g., `/api/auth`) are resolved against the current page URL into absolute URLs.
- Deduplication: The engine maintains a `visited` set to ensure no page is processed more than once, even if linked from multiple sections.
- Content Validation: Before parsing, the engine verifies the `Content-Type` header. Only `text/html` resources are enqueued; binary files, PDFs, and images are ignored to preserve agent context density.
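The queue-plus-visited-set discipline described above can be sketched as a plain breadth-first traversal. This is a minimal, self-contained sketch: the real engine fetches pages and extracts links with `cheerio`, whereas here a prebuilt adjacency map stands in for the network:

```typescript
// Illustrative BFS over already-normalized URLs (the link map replaces
// live fetching purely for demonstration purposes).
function bfsOrder(
  start: string,
  links: Map<string, string[]>,
  maxPages: number,
): string[] {
  const queue: string[] = [start];
  const visited = new Set<string>([start]);
  const order: string[] = [];
  while (queue.length > 0 && order.length < maxPages) {
    const url = queue.shift()!; // FIFO queue gives breadth-first order
    order.push(url);
    for (const next of links.get(url) ?? []) {
      if (!visited.has(next)) {
        visited.add(next); // deduplication: enqueue each page at most once
        queue.push(next);
      }
    }
  }
  return order;
}
```

Marking a URL as visited at enqueue time (not at processing time) is what prevents a page linked from multiple sections from entering the queue twice.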
Configuration & Limits
You can control the depth and breadth of the engine via the CLI or the programmatic interface.
CLI Usage
Use the `--limit` (or `-l`) flag to set a safety ceiling on the number of pages retrieved. This is critical for massive documentation sites where you only need the core API reference.
```shell
# Limit the crawl to the first 50 discovered pages
moltext https://docs.example.com --limit 50
```
Programmatic Interface (Technical)
If you are integrating the `Crawler` class directly, the `crawl` method returns a `Promise<Page[]>` and accepts an optional progress callback.
```typescript
import { Crawler } from './src/crawler';

const crawler = new Crawler('https://docs.example.com');
const pages = await crawler.crawl(100, (url) => {
  console.log(`Discovered: ${url}`);
});
```
Parameters:
- `maxPages` (number): The maximum number of pages to crawl before stopping.
- `onUrlFound` (callback): An optional function triggered every time a new URL is pulled from the queue for processing.
Technical Constraints
- User-Agent: The crawler identifies as `Moltext/1.0`.
- Timeout: Requests have a hard-coded 10-second timeout to prevent the compilation process from hanging on dead links.
- Error Handling: If a specific page fails to load (404, 500, or Timeout), the engine logs the failure internally and continues to the next item in the queue to ensure the maximum amount of context is still gathered.
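The skip-and-continue behavior can be sketched as follows. `fetchPage` here is an injected placeholder standing in for the engine's real HTTP client (which is where the 10-second timeout and status checks would live); the function and result names are assumptions for illustration:

```typescript
// Hypothetical resilient fetch loop: a failure is recorded and skipped so a
// single dead link cannot stall the whole compilation.
async function fetchAll(
  urls: string[],
  fetchPage: (url: string) => Promise<string>,
): Promise<{ pages: Map<string, string>; failed: string[] }> {
  const pages = new Map<string, string>();
  const failed: string[] = [];
  for (const url of urls) {
    try {
      pages.set(url, await fetchPage(url));
    } catch {
      failed.push(url); // log internally, continue with the next queue item
    }
  }
  return { pages, failed };
}
```

Injecting the fetcher keeps the loop testable: a fake fetcher that throws for one URL should leave exactly one entry in `failed` while the remaining pages are still gathered.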