The Crawler Engine
The Crawler Engine is the structural foundation of Moltext. It is responsible for the recursive discovery and retrieval of web pages, transforming a single entry point into a comprehensive set of raw HTML data for the Processor to digest.
Designed for documentation sites, the engine prioritizes breadth-first traversal while maintaining strict boundaries to prevent "crawler drift."
Domain-Locked Traversal
To ensure the resulting `context.md` remains relevant and deterministic, the Crawler employs a strict domain lock.
- Hostname Enforcement: The engine extracts the `hostname` from the initial URL. Any links discovered during the crawl that point to a different domain or subdomain are automatically discarded.
- Protocol Filtering: Only `http:` and `https:` protocols are supported.
- Hash Removal: URL fragments (e.g., `docs.com/setup#requirements`) are normalized to their base URL to prevent redundant parsing of the same page.
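Taken together, these three rules reduce to a single URL filter. The sketch below is illustrative only (the function name and exact shape are assumptions, not Moltext internals); it uses the WHATWG `URL` class built into Node:

```typescript
// Hypothetical filter combining the domain-lock rules described above.
// Returns the normalized absolute URL, or null if the link must be discarded.
function normalizeAndFilter(link: string, entryPoint: string): string | null {
  let url: URL;
  try {
    url = new URL(link, entryPoint); // also resolves relative links
  } catch {
    return null; // malformed URL
  }
  // Protocol filtering: only http: and https: survive.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
  // Hostname enforcement: exact match, so subdomains are discarded too.
  if (url.hostname !== new URL(entryPoint).hostname) return null;
  // Hash removal: docs.com/setup#a and docs.com/setup#b collapse to one page.
  url.hash = '';
  return url.toString();
}
```

Because the hostname check is an exact string comparison, `api.docs.com` is rejected when the entry point is `docs.com`, matching the subdomain rule above.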
Recursive Link Discovery
The Crawler uses a queue-based system to navigate documentation hierarchies. It relies on `cheerio` to parse the DOM of every successfully fetched page, extracting all `<a>` tags to find new traversal paths.
- Normalization: Relative paths (e.g., `/api/auth`) are resolved against the current page URL into absolute URLs.
- Deduplication: The engine maintains a `visited` set to ensure no page is processed more than once, even if linked from multiple sections.
- Content Validation: Before parsing, the engine verifies the `Content-Type` header. Only `text/html` resources are enqueued; binary files, PDFs, and images are ignored to preserve agent context density.
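The queue-plus-visited-set discipline described above can be sketched as a plain breadth-first traversal. This is a minimal, self-contained sketch: the real engine fetches pages and extracts links with `cheerio`, whereas here a prebuilt adjacency map stands in for the network:

```typescript
// Illustrative BFS over already-normalized URLs (the link map replaces
// live fetching purely for demonstration purposes).
function bfsOrder(
  start: string,
  links: Map<string, string[]>,
  maxPages: number,
): string[] {
  const queue: string[] = [start];
  const visited = new Set<string>([start]);
  const order: string[] = [];
  while (queue.length > 0 && order.length < maxPages) {
    const url = queue.shift()!; // FIFO queue gives breadth-first order
    order.push(url);
    for (const next of links.get(url) ?? []) {
      if (!visited.has(next)) {
        visited.add(next); // deduplication: enqueue each page at most once
        queue.push(next);
      }
    }
  }
  return order;
}
```

Marking a URL as visited at enqueue time (not at processing time) is what prevents a page linked from multiple sections from entering the queue twice.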
Configuration & Limits
You can control the depth and breadth of the engine via the CLI or the programmatic interface.
CLI Usage
Use the `--limit` (or `-l`) flag to set a safety ceiling on the number of pages retrieved. This is critical for massive documentation sites where you only need the core API reference.
```shell
# Limit the crawl to the first 50 discovered pages
moltext https://docs.example.com --limit 50
```
Programmatic Interface (Technical)
If you are integrating the `Crawler` class directly, the `crawl` method returns a `Promise<Page[]>` and accepts an optional progress callback.
```typescript
import { Crawler } from './src/crawler';

const crawler = new Crawler('https://docs.example.com');
const pages = await crawler.crawl(100, (url) => {
  console.log(`Discovered: ${url}`);
});
```
Parameters:
- `maxPages` (number): The maximum number of pages to crawl before stopping.
- `onUrlFound` (callback): An optional function triggered every time a new URL is pulled from the queue for processing.
Technical Constraints
- User-Agent: The crawler identifies as `Moltext/1.0`.
- Timeout: Requests have a hard-coded 10-second timeout to prevent the compilation process from hanging on dead links.
- Error Handling: If a specific page fails to load (404, 500, or Timeout), the engine logs the failure internally and continues to the next item in the queue to ensure the maximum amount of context is still gathered.
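The skip-and-continue behavior can be sketched as follows. `fetchPage` here is an injected placeholder standing in for the engine's real HTTP client (which is where the 10-second timeout and status checks would live); the function and result names are assumptions for illustration:

```typescript
// Hypothetical resilient fetch loop: a failure is recorded and skipped so a
// single dead link cannot stall the whole compilation.
async function fetchAll(
  urls: string[],
  fetchPage: (url: string) => Promise<string>,
): Promise<{ pages: Map<string, string>; failed: string[] }> {
  const pages = new Map<string, string>();
  const failed: string[] = [];
  for (const url of urls) {
    try {
      pages.set(url, await fetchPage(url));
    } catch {
      failed.push(url); // log internally, continue with the next queue item
    }
  }
  return { pages, failed };
}
```

Injecting the fetcher keeps the loop testable: a fake fetcher that throws for one URL should leave exactly one entry in `failed` while the remaining pages are still gathered.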