Development Guide
Getting Started
Moltext is built with TypeScript and designed to be modular. You can either contribute to the CLI tool or integrate the core Crawler and Processor engines directly into your own Node.js applications.
Prerequisites
- Node.js: v18.0.0 or higher
- Package Manager: npm or yarn
- API Access: An OpenAI-compatible API key (optional for --raw mode)
Local Environment Setup
- Clone the repository:
  git clone https://github.com/UditAkhourii/moltext.git
  cd moltext
- Install dependencies:
  npm install
- Configure Environment: Create a .env file in the root directory to store your credentials:
  OPENAI_API_KEY=your_key_here
- Build the project:
  npm run build
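Because the key is optional in --raw mode, a script that embeds Moltext may want to fail early only when LLM enhancement is actually requested. The following is a minimal sketch of such a check. The helper name resolveApiKey is hypothetical and not part of Moltext, and the environment is passed in as a plain object so the sketch does not depend on how the .env file is loaded.

```typescript
// Hypothetical helper (not part of Moltext): decide whether an API key is required.
type Env = Record<string, string | undefined>;

function resolveApiKey(env: Env, rawMode: boolean): string | undefined {
  const key = env.OPENAI_API_KEY?.trim();
  if (rawMode) {
    // --raw mode skips the LLM step, so a missing key is acceptable.
    return key || undefined;
  }
  if (!key) {
    throw new Error("OPENAI_API_KEY is not set; add it to .env or use --raw");
  }
  return key;
}

// Usage: a real script would pass process.env here instead of a literal object.
const exampleKey = resolveApiKey({ OPENAI_API_KEY: "your_key_here" }, false);
console.log(exampleKey);
```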
Programmatic Usage
Moltext exposes two primary classes: the Crawler for content discovery and the Processor for content transformation.
The Crawler Class
The Crawler handles recursive link discovery within a specific domain and retrieves raw HTML content.
import { Crawler, Page } from './src/crawler';

const crawler = new Crawler('https://docs.example.com');

// crawl(maxPages, onUrlFound callback)
const pages: Page[] = await crawler.crawl(50, (url) => {
  console.log(`Discovered: ${url}`);
});
Interface: Page
| Property | Type | Description |
| :--- | :--- | :--- |
| url | string | The absolute URL of the page. |
| content | string | The raw HTML content. |
| title | string | The page title (extracted from <title>). |
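To make "recursive link discovery within a specific domain" concrete, here is a minimal breadth-first sketch that produces objects with the Page shape above. It is not Moltext's Crawler implementation. The names crawlSameDomain, extractLinks, extractTitle, and the injected FetchHtml function are assumptions made for illustration, and links are found with a simple regular expression rather than a real HTML parser.

```typescript
// Sketch only: mirrors the Page shape from the table above.
interface Page {
  url: string;
  content: string;
  title: string;
}

// Hypothetical fetcher type, injected so the sketch can run without network access.
type FetchHtml = (url: string) => Promise<string>;

function extractTitle(html: string): string {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : "";
}

function extractLinks(html: string, baseUrl: string): string[] {
  const links: string[] = [];
  const hrefPattern = /<a\s[^>]*href=["']([^"'#]+)/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      // Resolve relative links against the page they were found on.
      links.push(new URL(match[1], baseUrl).toString());
    } catch {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

async function crawlSameDomain(
  startUrl: string,
  maxPages: number,
  fetchHtml: FetchHtml,
  onUrlFound?: (url: string) => void
): Promise<Page[]> {
  const host = new URL(startUrl).host;
  const queue: string[] = [new URL(startUrl).toString()];
  const seen = new Set<string>(queue);
  const pages: Page[] = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const url = queue.shift() as string;
    onUrlFound?.(url);
    const content = await fetchHtml(url);
    pages.push({ url, content, title: extractTitle(content) });

    for (const link of extractLinks(content, url)) {
      // Stay on the starting host and skip URLs that are already queued.
      if (new URL(link).host === host && !seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return pages;
}
```

With Node.js 18 or later, the built-in fetch can serve as the fetcher, for example `(url) => fetch(url).then((res) => res.text())`. The `seen` set is what keeps the traversal from looping when pages link back to each other.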
The Processor Class
The Processor cleans the HTML, converts it to Markdown using Turndown, and optionally enhances the result with an LLM.
import { Processor } from './src/processor';

const processor = new Processor(
  process.env.OPENAI_API_KEY,
  'https://api.openai.com/v1',
  'gpt-4o-mini'
);

// processPage(page, isRawMode)
const agenticMarkdown = await processor.processPage(pages[0], false);
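The snippet above processes a single page. One way to handle a full crawl is sketched below. It assumes that processPage resolves to a Markdown string, which the awaited call suggests but this guide does not state. The PageProcessor interface and the buildContextFile helper are hypothetical names, not part of Moltext.

```typescript
interface Page {
  url: string;
  content: string;
  title: string;
}

// Minimal structural type: anything exposing processPage(page, isRawMode).
// Assumption: the method resolves to a Markdown string.
interface PageProcessor {
  processPage(page: Page, isRawMode: boolean): Promise<string>;
}

// Hypothetical helper: process pages one at a time and join the results.
async function buildContextFile(
  pages: Page[],
  processor: PageProcessor,
  isRawMode: boolean
): Promise<string> {
  const sections: string[] = [];
  for (const page of pages) {
    try {
      const markdown = await processor.processPage(page, isRawMode);
      sections.push(`<!-- Source: ${page.url} -->\n${markdown.trim()}`);
    } catch (error) {
      // One failed page should not discard the rest of the crawl.
      console.warn(`Skipping ${page.url}: ${String(error)}`);
    }
  }
  return sections.join("\n\n---\n\n");
}
```

Pages are processed sequentially rather than with Promise.all. This is slower, but it keeps the request rate against the LLM endpoint low, which may matter outside --raw mode.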
Core Architecture
Content Normalization
The Processor includes a cleanHtml method (internal) that automatically strips noise before conversion. It removes:
- <script>, <style>, <iframe>, and <noscript> tags.
- Navigation elements (<nav>, [role="navigation"], .nav).
- Footers and sidebars (<footer>, .footer, .sidebar).

It also prioritizes content within <main> or <article> tags.
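The two steps described here, stripping noise and then preferring the main content container, can be illustrated with a deliberately simplified sketch. It uses regular expressions so that it runs without dependencies, whereas the real cleanHtml method works on a cheerio document. The function names are hypothetical. The sketch covers only the tag-based rules, not the class or attribute selectors such as .nav, .sidebar, or [role="navigation"], and it does not handle nested elements of the same tag.

```typescript
// Simplified, dependency-free sketch; NOT Moltext's cleanHtml implementation.
const NOISE_TAGS = ["script", "style", "iframe", "noscript", "nav", "footer"];

function stripNoise(html: string): string {
  let cleaned = html;
  for (const tag of NOISE_TAGS) {
    // Remove the element together with everything inside it.
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, "gi");
    cleaned = cleaned.replace(pattern, "");
  }
  return cleaned;
}

function selectMainContent(html: string): string {
  // Prefer <main>, then <article>; otherwise fall back to the whole document.
  for (const tag of ["main", "article"]) {
    const pattern = new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)<\\/${tag}>`, "i");
    const match = pattern.exec(html);
    if (match) return match[1].trim();
  }
  return html.trim();
}

function cleanHtmlSketch(html: string): string {
  return selectMainContent(stripNoise(html));
}
```

Stripping runs before selection so that a script or nav element nested inside the main container is still removed.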
LLM Orchestration
When not in --raw mode, the Processor uses a high-density prompt to transform standard Markdown into "Agentic Context." This involves:
- Stripping conversational filler.
- Hard-preserving code signatures and technical constraints.
- Optimizing for vector retrieval (RAG).
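As an illustration of how these three goals might be expressed against an OpenAI-compatible chat completions endpoint, the sketch below builds a request body. The prompt wording, the buildChatRequest helper, and the temperature setting are all assumptions. The actual prompt Moltext uses is not reproduced in this guide.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  model: string;
  temperature: number;
  messages: ChatMessage[];
}

// Hypothetical prompt text; the actual Moltext prompt is not shown in this guide.
const SYSTEM_PROMPT = [
  "Rewrite the Markdown below as dense reference context for an AI agent.",
  "- Remove conversational filler and marketing language.",
  "- Preserve code blocks, function signatures, and technical constraints verbatim.",
  "- Keep headings self-describing so each chunk can be retrieved in isolation.",
].join("\n");

// Hypothetical helper: build the JSON body for a chat completions request.
function buildChatRequest(markdown: string, model: string): ChatRequest {
  return {
    model,
    temperature: 0, // Assumption: low temperature for more repeatable rewrites.
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: markdown },
    ],
  };
}

const exampleRequest = buildChatRequest("# Install\n\nRun `npm install`.", "gpt-4o-mini");
console.log(JSON.stringify(exampleRequest, null, 2));
```

On an OpenAI-compatible API, a body like this is sent as a POST to the base URL followed by /chat/completions, with the API key in an Authorization: Bearer header.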
Development Workflow
Running in Development
To run the CLI directly from source without manual transpilation:
npx ts-node src/index.ts https://docs.example.com --raw
Extending Processing Logic
If you need to add custom cleaning rules (e.g., removing specific cookie banners or headers found in a particular documentation framework), modify the cleanHtml method in src/processor.ts:
// Example: adding custom selectors to strip before conversion
private cleanHtml(html: string): string {
  const $ = cheerio.load(html);
  $('.custom-cookie-banner').remove();
  $('.v-announcement-bar').remove();
  // ... existing cleaning logic
}
Building for Production
To generate the production-ready JavaScript files in the dist directory:
npm run build
After building, you can link the package locally to test the global moltext command:
npm link