Content Normalization
Moltext transforms chaotic, "human-first" web documentation into a deterministic, high-density Markdown format optimized for agentic reasoning. This process, handled by the Processor engine, ensures that noise is eliminated while critical technical signatures are preserved.
The Normalization Pipeline
The normalization process follows a three-stage pipeline to ensure the output is both clean and structurally sound:
- Heuristic Noise Reduction: Stripping non-essential HTML elements.
- Markdown Synthesis: Converting the cleaned DOM into standard Markdown.
- Agentic Refinement (Optional): Using LLMs to compress and optimize the text for vector retrieval.
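The three stages can be pictured as a simple function composition. The following is a minimal sketch, not Moltext's actual internals: the names `PipelineStages`, `reduceNoise`, `toMarkdown`, `refine`, and `normalize` are hypothetical and chosen only for illustration.

```typescript
// Hypothetical stage signatures; Moltext's real internals may differ.
type HtmlStage = (html: string) => string;
type MarkdownStage = (markdown: string) => Promise<string>;

interface PipelineStages {
  reduceNoise: HtmlStage;   // stage 1: strip non-essential HTML
  toMarkdown: HtmlStage;    // stage 2: cleaned HTML -> Markdown
  refine?: MarkdownStage;   // stage 3: optional LLM pass
}

async function normalize(html: string, stages: PipelineStages): Promise<string> {
  const cleaned = stages.reduceNoise(html);
  const markdown = stages.toMarkdown(cleaned);
  // When no refiner is supplied, the structural Markdown is returned as-is.
  return stages.refine ? stages.refine(markdown) : markdown;
}
```

Keeping the third stage optional in the type mirrors the fact that the first two stages are deterministic, while the LLM pass can be switched off.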
1. Heuristic Noise Reduction
To prevent "context drift," Moltext identifies and removes web-native clutter that provides no value to an AI agent. Using cheerio, the processor targets and excises the following elements:
- Global Navigation: <nav>, .nav, .sidebar.
- Boilerplate: <footer>, .footer.
- Executables & Styling: <script>, <style>, <noscript>.
- Embedded Media: <iframe>.
- Accessibility Artifacts: Elements with role="navigation".
The engine prioritizes content within <main> or <article> tags. If these are not present, it falls back to the <body> content to ensure no critical information is missed.
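A minimal sketch of this stage using cheerio might look like the following. The selector list comes from the bullets above; the function name `extractContent` and the exact preference order (`<main>` before `<article>`) are assumptions made for illustration, not confirmed details of the Processor.

```typescript
import * as cheerio from "cheerio";

// Selectors taken from the list above.
const NOISE_SELECTORS = [
  "nav", ".nav", ".sidebar",
  "footer", ".footer",
  "script", "style", "noscript",
  "iframe",
  '[role="navigation"]',
];

// Hypothetical helper: removes clutter, then returns the inner HTML of the
// best available content root.
function extractContent(html: string): string {
  const $ = cheerio.load(html);
  $(NOISE_SELECTORS.join(", ")).remove();

  // Assumed order: <main>, then <article>, then <body> as the fallback.
  for (const selector of ["main", "article", "body"]) {
    const root = $(selector).first();
    if (root.length > 0) return root.html() ?? "";
  }
  return "";
}
```

Removing the noise before selecting the root means that a `<nav>` or `<script>` nested inside `<main>` is dropped as well.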
2. Markdown Synthesis
Once the HTML is cleaned, Moltext uses Turndown to synthesize a structural Markdown document. This stage ensures that:
- Headers use ATX style (# H1, ## H2) for clear hierarchy.
- Code Blocks are fenced with appropriate language identifiers.
- Links are preserved to maintain the documentation's relational integrity.
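The properties above map onto standard Turndown options. The sketch below assumes a plain `TurndownService` with `headingStyle` and `codeBlockStyle` set; whether Moltext adds custom rules on top is not stated here, and the wrapper name `toMarkdown` is hypothetical.

```typescript
import TurndownService from "turndown";

// ATX headings and fenced code blocks, matching the list above.
const turndown = new TurndownService({
  headingStyle: "atx",
  codeBlockStyle: "fenced",
});

// Hypothetical wrapper around the conversion step.
function toMarkdown(cleanedHtml: string): string {
  return turndown.turndown(cleanedHtml);
}
```

With the fenced style, Turndown reads the language identifier from a `language-*` class on the `<code>` element, and links are emitted inline by default.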
3. Agentic Refinement (LLM Mode)
By default (that is, unless --raw is set), Moltext sends the synthesized Markdown to an LLM (e.g., gpt-4o-mini or a local llama3 instance) to perform "Agentic Refinement." This step applies a strict system prompt to:
- Strip Conversational Filler: Removes "In this section, you will learn..." and other human-centric fluff.
- Enhance Density: Increases the information-per-token ratio.
- Preserve Signatures: Ensures API signatures, constants, and technical constraints remain untouched.
- Fix Broken Syntax: Corrects any malformed Markdown resulting from complex HTML-to-MD conversions.
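As an illustration, a refinement call to an OpenAI-compatible chat endpoint could be assembled roughly as follows. The prompt text is a paraphrase of the four goals above and is not Moltext's actual system prompt; `buildRefinementRequest`, `REFINEMENT_PROMPT`, and the `temperature: 0` setting are assumptions introduced for this sketch.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface RefinementRequest {
  model: string;
  temperature: number;
  messages: ChatMessage[];
}

// Hypothetical prompt that restates the four goals; not the real one.
const REFINEMENT_PROMPT = [
  "Rewrite the following documentation for consumption by an AI agent.",
  "Remove conversational filler and introductory fluff.",
  "Increase information density without dropping facts.",
  "Leave API signatures, constants, and technical constraints unchanged.",
  "Repair malformed Markdown.",
].join("\n");

// Builds the JSON body for a chat-completions style request.
function buildRefinementRequest(
  markdown: string,
  model = "gpt-4o-mini",
): RefinementRequest {
  return {
    model,
    temperature: 0, // assumption: favor repeatable output
    messages: [
      { role: "system", content: REFINEMENT_PROMPT },
      { role: "user", content: markdown },
    ],
  };
}
```

The resulting object would be serialized and POSTed to the provider's chat completions route; keeping the request builder separate from the network call makes the prompt easy to inspect and test.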
Raw Mode vs. Enhanced Mode
You can control the depth of normalization using the --raw flag.
Enhanced Mode (Default)
Best for general tool understanding and memory expansion.
moltext https://docs.example.com -k your-api-key
- Output: Compressed, high-density, agent-optimized Markdown.
- Usage: Ideal for feeding into an agent's long-term memory or RAG (Retrieval-Augmented Generation) pipeline.
Raw Mode (--raw)
Best for debugging or when you require the exact "ground truth" of the source documentation without any LLM interference.
moltext https://docs.example.com --raw
- Output: Pure structural Markdown directly from the source HTML.
- Usage: Ideal for local processing, high-fidelity technical audits, or when running in resource-constrained environments where LLM calls are not feasible.
Integration Example
If you are using the Processor class directly in a TypeScript project:
import { Processor } from './processor';

// Initialize with optional LLM configuration
const processor = new Processor(apiKey, 'https://api.openai.com/v1', 'gpt-4o-mini');

// Process a crawled page
const result = await processor.processPage({
  url: 'https://docs.example.com/api',
  content: '<html>...</html>',
  title: 'API Reference'
}, false); // Set to true for Raw Mode