System Architecture
Architecture Overview
Moltext is designed as a high-throughput documentation pipeline that transforms fractured, human-centric web content into a singular, high-density Agentic Context File. The system architecture is built on a decoupled "Parse-Process-Compile" pattern, ensuring that documentation is normalized before being presented to an LLM or an Autonomous Agent.
The Documentation Pipeline
The compilation process follows a linear data flow, moving from a remote URL to a local deterministic markdown file:
- Discovery (Crawler): Traverses the target domain to identify and fetch HTML content.
- Extraction (Cheerio/Turndown): Strips non-content elements (scripts, nav, footers) and converts HTML to Markdown.
- Normalization (Processor): (Optional) Uses LLM inference to compress the markdown and remove conversational filler.
- Assembly: Merges all processed pages into a single `context.md` with source attribution.
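The final assembly stage described above can be sketched in a few lines; `Doc` and `assemble` are illustrative names for this document, not the actual Moltext API, and the `<!-- source: … -->` comment is one plausible form of source attribution:

```typescript
// Minimal sketch of the Assembly stage: merge processed pages into one
// context file, prefixing each page with its source URL.
interface Doc {
  url: string;
  markdown: string;
}

function assemble(docs: Doc[]): string {
  return docs
    .map((d) => `<!-- source: ${d.url} -->\n${d.markdown}`)
    .join("\n\n");
}

const compiled = assemble([
  { url: "https://docs.example.com/a", markdown: "# A" },
  { url: "https://docs.example.com/b", markdown: "# B" },
]);
console.log(compiled);
```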
Core Components
1. Crawler (Discovery Engine)
The Crawler is responsible for stateful navigation of documentation sites. It implements a breadth-first search algorithm restricted to the initial domain to prevent "crawler sprawl."
- Domain Locking: Automatically extracts the hostname from the base URL and ignores any external links.
- Link Normalization: Resolves relative paths and strips URL fragments to ensure each unique page is only parsed once.
- Concurrency Control: Supports batching (default size: 5) to respect rate limits of documentation hosts.
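Domain locking and link normalization can be expressed with the WHATWG URL API; `normalizeLink` is a hypothetical helper illustrating the rules above, not the crawler's real interface:

```typescript
// Resolve a discovered link against the base URL, reject external hosts,
// and strip fragments so each unique page is parsed only once.
function normalizeLink(href: string, base: string): string | null {
  const baseUrl = new URL(base);
  const url = new URL(href, baseUrl); // resolves relative paths
  if (url.hostname !== baseUrl.hostname) return null; // domain locking
  url.hash = ""; // fragment stripping
  return url.href;
}

console.log(normalizeLink("/guide#intro", "https://docs.example.com/start"));
// → "https://docs.example.com/guide"
console.log(normalizeLink("https://other.com/x", "https://docs.example.com/"));
// → null (external link ignored)
```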
2. Processor (Transformation Engine)
The Processor handles the conversion of raw HTML into agent-readable markdown. It operates in two distinct modes:
Raw Mode (--raw)
The processor performs a deterministic transformation:
- DOM Cleaning: Uses Cheerio to remove `nav`, `footer`, `script`, and `style` tags.
- Markdown Synthesis: Uses Turndown to convert the remaining semantic HTML (headers, code blocks, tables) into clean Markdown.
- Context Preservation: Ensures all code signatures and technical constraints remain untouched.
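The DOM-cleaning step can be illustrated with a dependency-free stand-in; the real pipeline uses Cheerio's selector-based removal, which is DOM-aware, whereas the regex here is a deliberate simplification for sketch purposes only:

```typescript
// Simplified stand-in for the Cheerio cleaning pass: drop nav/footer/
// script/style blocks before handing the HTML to Turndown.
function stripNonContent(html: string): string {
  return html.replace(/<(nav|footer|script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "");
}

const cleaned = stripNonContent(
  "<nav>menu</nav><h1>API</h1><script>track()</script><p>Body</p>"
);
console.log(cleaned); // "<h1>API</h1><p>Body</p>"
```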
AI-Native Mode (LLM-Enhanced)
When an API key or local inference endpoint is provided, the processor passes the markdown through a "Refinement Prompt." This prompt instructs the LLM to:
- Remove conversational "fluff" and marketing language.
- Optimize text for vector retrieval and high-density logic.
- Fix broken formatting resulting from complex HTML structures.
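The three instructions above suggest the general shape of the Refinement Prompt; the wording below is illustrative (the actual prompt text Moltext sends is not reproduced in this document):

```typescript
// Illustrative construction of the refinement request for an
// OpenAI-compatible chat endpoint; the real prompt may differ.
function buildRefinementMessages(markdown: string) {
  return [
    {
      role: "system",
      content:
        "Rewrite the following documentation as dense, agent-readable " +
        "Markdown. Remove marketing language and conversational filler, " +
        "optimize for retrieval, and fix broken formatting. Preserve all " +
        "code signatures and technical constraints verbatim.",
    },
    { role: "user", content: markdown },
  ];
}

const messages = buildRefinementMessages("# Getting Started\nWelcome!");
console.log(messages.length); // 2
```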
3. Orchestrator (CLI)
The CLI layer, built on commander, manages user configuration and environment variables. It handles authentication logic for OpenAI or local providers (like Ollama) and coordinates the hand-off between the Crawler and the Processor.
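The hand-off logic can be sketched with Node's built-in `util.parseArgs` standing in for commander; the `--raw` flag comes from this document, while the `--base-url` and `--model` flag names and the `OPENAI_API_KEY` variable are assumptions for illustration:

```typescript
import { parseArgs } from "node:util";

// Stand-in for the commander-based CLI: resolve the target URL, mode,
// and provider settings before invoking Crawler and Processor.
const { values, positionals } = parseArgs({
  args: ["https://docs.example.com", "--raw"], // would be process.argv.slice(2)
  options: {
    raw: { type: "boolean", default: false },
    "base-url": { type: "string", default: "https://api.openai.com/v1" },
    model: { type: "string", default: "gpt-4o-mini" },
  },
  allowPositionals: true,
});

// Fall back to raw mode when no API key is available (assumed env var name).
const isRawMode = values.raw || !process.env.OPENAI_API_KEY;
console.log(positionals[0], values.raw, isRawMode);
```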
Data Flow Diagram
(Diagram omitted.) In summary: URL → Crawler (HTML) → Processor (Markdown) → Assembly → `context.md`.
Key Interfaces
Crawler Configuration
The crawler is initialized with a base URL and manages an internal queue of discovered links.
```typescript
interface Page {
  url: string;
  content: string; // Raw HTML
  title: string;
}

// Example Crawler initialization
const crawler = new Crawler("https://docs.example.com");
const pages = await crawler.crawl(limit);
```
Processor Configuration
The processor is model-agnostic and can be pointed to any OpenAI-compatible endpoint.
```typescript
const processor = new Processor(
  apiKey,
  "https://api.openai.com/v1", // or a local base URL
  "gpt-4o-mini" // specified model
);

// Processing a page
const agenticMarkdown = await processor.processPage(page, isRawMode);
```
Integration with ClawHub
Moltext serves as the Ingestion Layer for the ClawHub ecosystem. By outputting a standardized context.md, it allows downstream agents (like OpenClaw) to expand their memory window with ground-truth technical data without requiring manual documentation browsing. This "Agent-Native" approach treats documentation as a compiled binary rather than a collection of readable pages.