The Processing Pipeline
Moltext employs a multi-stage pipeline to transform "human-first" web documentation into a high-density, deterministic format optimized for LLM context windows. The pipeline moves data through four distinct phases — Extraction, Sanitization, Conversion, and Agentic Optimization — followed by a final compilation step.
1. Extraction and Discovery
The pipeline begins with the Crawler engine. Starting from a base URL, it recursively discovers internal links within the same domain.
- Normalization: All URLs are normalized to remove fragments (hashes) and ensure protocol consistency.
- Depth Control: The crawler respects the `--limit` flag to prevent runaway processes on massive documentation sites.
- Concurrency: Pages are fetched using `axios` with a standard `Moltext/1.0` User-Agent to ensure compatibility with most documentation hosting providers.
2. Semantic Sanitization
Once HTML content is retrieved, the Processor invokes a cleaning phase using `cheerio`. This phase is critical for reducing "noise" (tokens that provide no value to an AI agent).
The following elements are aggressively stripped:
- Boilerplate: `<nav>`, `<footer>`, `<script>`, and `<style>` tags.
- Navigation Noise: Elements with a role of `navigation`, sidebars, and `<noscript>` blocks.
- Layout Junk: If a `<main>` or `<article>` tag is present, Moltext extracts only that content, discarding the remaining DOM structure.
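The content-scoping rule above can be illustrated with a simplified sketch. The real pipeline uses `cheerio` selectors rather than regular expressions, and `scopeContent` is a hypothetical name:

```javascript
// Simplified illustration of the scoping rule (the real pipeline uses
// cheerio, not regex): prefer <main>, then <article>, else keep everything.
function scopeContent(html) {
  for (const tag of ['main', 'article']) {
    const match = html.match(new RegExp(`<${tag}[^>]*>([\\s\\S]*?)</${tag}>`, 'i'));
    if (match) return match[1]; // only the inner content survives
  }
  return html; // no semantic container found; fall back to the full document
}
```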
3. Markdown Transformation
The sanitized HTML is converted into Markdown using the Turndown service. Moltext is configured with specific defaults to ensure the output remains predictable for agentic parsing:
- Heading Style: ATX (e.g., `## Header`) for better hierarchy recognition.
- Code Blocks: Fenced blocks (```) to preserve language identifiers and syntax.
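These defaults correspond to Turndown's documented `headingStyle` and `codeBlockStyle` options; the exact wiring inside Moltext is an assumption:

```javascript
// Converter configuration implied by the defaults above (assumed wiring;
// the option names themselves are real Turndown options).
const turndownOptions = {
  headingStyle: 'atx',      // "## Header" rather than underlined Setext headings
  codeBlockStyle: 'fenced', // ``` fences, which keep language identifiers intact
};
```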
If the `--raw` flag is used, the pipeline terminates here, appending the source URL and title to the markdown and writing it to the output file.
4. Agentic Optimization (LLM Layer)
In the default mode, the raw Markdown is passed through an LLM (e.g., `gpt-4o-mini` or a local `llama3` instance). This stage applies "Structural Compression" to the data.
The LLM is tasked with a strict system prompt to:
- Strip Conversational Filler: Removes phrases like "In this tutorial, we will learn how to..." or "Welcome back to our guide."
- Optimize for Vector Retrieval: Increases keyword density and clarifies logic structures.
- Preserve Technical Ground Truth: Strictly preserves all API signatures, code blocks, and technical constraints.
- Normalize Structure: Fixes broken markdown syntax from the initial conversion.
Example Transformation Logic
```javascript
// Internal logic for the LLM compression prompt
const systemPrompt = `
1. Extremely high-density and concise.
2. Optimized for vector retrieval.
3. Stripped of all conversational filler.
4. Strictly preserving ALL code blocks and signatures.
`;
```
5. Final Compilation
The processed pages are batched (default size: 5) to respect rate limits and then merged into a single `context.md` file. Each section is prefixed with its source metadata, providing the agent with a "ground-truth" link for every piece of information in its memory.
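The batching step can be sketched as a plain chunking helper. The batch size of 5 matches the documented default; the function name `toBatches` is hypothetical:

```javascript
// Illustrative batching sketch (assumed helper): split the page list into
// groups of 5 so each LLM round stays within provider rate limits.
function toBatches(pages, size = 5) {
  const batches = [];
  for (let i = 0; i < pages.length; i += size) {
    batches.push(pages.slice(i, i + size));
  }
  return batches;
}
```

Each batch is optimized in turn, and the results are concatenated in crawl order so the final `context.md` reads front to back like the original site.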
| Phase | Input | Tooling | Output |
| :--- | :--- | :--- | :--- |
| Ingestion | URL | Axios/Cheerio | Raw HTML |
| Cleaning | Raw HTML | Cheerio | Sanitized HTML |
| Conversion | Sanitized HTML | Turndown | Standard Markdown |
| Optimization | Standard Markdown | OpenAI/Local LLM | Agentic Context |