Development Guide

Getting Started

Moltext is built with TypeScript and designed to be modular. You can either contribute to the CLI tool or integrate the core Crawler and Processor engines directly into your own Node.js applications.

Prerequisites

Node.js: v18.0.0 or higher
Package Manager: npm or yarn
API Access: An OpenAI-compatible API key (not required when running in --raw mode)

Local Environment Setup

Clone the repository:

```
git clone https://github.com/UditAkhourii/moltext.git
cd moltext
```

Install dependencies:
```
npm install
```
Configure Environment: Create a .env file in the root directory to store your credentials:
```
OPENAI_API_KEY=your_key_here
```
Build the project:
```
npm run build
```
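
Because the key is only needed when LLM enhancement is enabled, a startup check might look like the sketch below. The `resolveApiKey` helper and the `Env` type are hypothetical names used for illustration and are not part of Moltext; the sketch also assumes the environment has already been loaded and is passed in as a plain object.

```typescript
// Hypothetical helper for illustration; not part of Moltext's API.
type Env = Record<string, string | undefined>;

function resolveApiKey(env: Env, rawMode: boolean): string | undefined {
    const raw = env['OPENAI_API_KEY'];
    const key = raw === undefined ? '' : raw.trim();
    if (key !== '') {
        return key;
    }
    // Assumption: raw mode skips the LLM step, so a missing key is acceptable.
    if (rawMode) {
        return undefined;
    }
    throw new Error('OPENAI_API_KEY is required unless --raw is used');
}
```

Failing early with a clear message is usually friendlier than letting the first LLM request fail with an authentication error.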

Programmatic Usage

Moltext exposes two primary classes: the Crawler for content discovery and the Processor for content transformation.

The Crawler Class

The Crawler handles recursive link discovery within a specific domain and retrieves raw HTML content.

```
import { Crawler, Page } from './src/crawler';

const crawler = new Crawler('https://docs.example.com');

// crawl(maxPages, onUrlFound callback)
const pages: Page[] = await crawler.crawl(50, (url) => {
    console.log(`Discovered: ${url}`);
});
```
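
To illustrate what "link discovery within a specific domain" involves, here is a minimal sketch that extracts links from a page and keeps only those on the same host. It is not Moltext's implementation: `sameDomainLinks` is a hypothetical function, and the regex-based href extraction is a simplification where a real crawler would typically use an HTML parser.

```typescript
// Illustrative only; Moltext's Crawler may discover links differently.
function sameDomainLinks(html: string, pageUrl: string): string[] {
    const base = new URL(pageUrl);
    const found = new Set<string>();
    // Simplified href extraction (quoted attribute values only).
    const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
    let match: RegExpExecArray | null;
    while ((match = hrefPattern.exec(html)) !== null) {
        let target: URL;
        try {
            target = new URL(match[1], base); // resolves relative links
        } catch (err) {
            continue; // skip malformed URLs
        }
        if (target.protocol !== 'http:' && target.protocol !== 'https:') continue;
        if (target.hostname !== base.hostname) continue; // stay on the same domain
        target.hash = ''; // treat #fragments as the same page
        found.add(target.href);
    }
    return Array.from(found);
}
```

Dropping the fragment before deduplicating keeps anchors such as `/guide#install` and `/guide` from being queued as two separate pages.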

The Processor Class

The Processor cleans HTML, converts it to Markdown using Turndown, and optionally enhances the result with an LLM.

```
import { Processor } from './src/processor';

const processor = new Processor(
    process.env.OPENAI_API_KEY, 
    'https://api.openai.com/v1', 
    'gpt-4o-mini'
);

// processPage(page, isRawMode)
const agenticMarkdown = await processor.processPage(pages[0], false);
```
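
The two classes can be combined into a small pipeline that processes every crawled page and joins the results into one document. The sketch below is self-contained and rests on assumptions: it presumes each page exposes a `url` field and that `processPage` resolves to a Markdown string. The `PageLike` and `PageProcessor` interfaces and the `processAll` function are hypothetical and stand in for Moltext's real types.

```typescript
// Minimal structural types so the sketch runs on its own; the real Page and
// Processor types live in Moltext's source and may differ.
interface PageLike {
    url: string; // assumption: pages carry their source URL
}

interface PageProcessor<P> {
    processPage(page: P, isRawMode: boolean): Promise<string>;
}

async function processAll<P extends PageLike>(
    pages: P[],
    processor: PageProcessor<P>,
    isRawMode: boolean
): Promise<string> {
    const sections: string[] = [];
    for (const page of pages) {
        // Sequential on purpose: avoids sending a burst of requests to the LLM endpoint.
        const markdown = await processor.processPage(page, isRawMode);
        sections.push(`<!-- Source: ${page.url} -->\n${markdown.trim()}`);
    }
    return sections.join('\n\n---\n\n');
}
```

Under these assumptions, `await processAll(pages, processor, false)` would yield one Markdown string with a source comment above each page's section.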

Core Architecture

Content Normalization

The Processor includes a cleanHtml method (internal) that automatically strips noise before conversion. It removes:

<script>, <style>, <iframe>, and <noscript> tags.
Navigation elements (<nav>, [role="navigation"], .nav).
Footers and Sidebars (<footer>, .footer, .sidebar).

The method prioritizes content within <main> or <article> tags.
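
The strip-then-prioritize idea can be sketched without any dependencies. This is a deliberately simplified, regex-based stand-in and not the internal cleanHtml method: the `stripNoise` function and `NOISE_TAGS` list are hypothetical, only tag-based rules are covered, and the attribute and class selectors (`[role="navigation"]`, `.nav`, `.footer`, `.sidebar`) are left out because they need a real DOM parser.

```typescript
// Simplified stand-in for illustration; handles tag names only.
const NOISE_TAGS = ['script', 'style', 'iframe', 'noscript', 'nav', 'footer'];

function stripNoise(html: string): string {
    let cleaned = html;
    for (const tag of NOISE_TAGS) {
        // Non-greedy match from an opening tag to its first closing tag.
        const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, 'gi');
        cleaned = cleaned.replace(pattern, '');
    }
    // Prefer <main>, then <article>; otherwise keep the whole cleaned document.
    for (const tag of ['main', 'article']) {
        const pattern = new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}\\s*>`, 'i');
        const match = pattern.exec(cleaned);
        if (match !== null) {
            return match[1].trim();
        }
    }
    return cleaned.trim();
}
```

Removing noise before selecting the content container matters here: a script nested inside `<main>` is gone by the time the main content is extracted.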

LLM Orchestration

When not in --raw mode, the Processor uses a high-density prompt to transform standard Markdown into "Agentic Context." This involves:

Stripping conversational filler.
Hard-preserving code signatures and technical constraints.
Optimizing for vector retrieval (RAG).
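
As a rough illustration of how those three goals might be encoded, the sketch below builds a request body in the OpenAI-compatible chat completions format. The prompt text is a placeholder written for this example and is not Moltext's actual prompt; `buildEnhanceRequest`, `PLACEHOLDER_PROMPT`, and the `temperature` setting are likewise assumptions.

```typescript
// Sketch of an OpenAI-compatible chat completion payload.
interface ChatMessage {
    role: 'system' | 'user';
    content: string;
}

interface ChatRequest {
    model: string;
    temperature: number;
    messages: ChatMessage[];
}

// Placeholder instructions mirroring the goals listed above; not Moltext's prompt.
const PLACEHOLDER_PROMPT = [
    'Rewrite the following documentation for use as retrieval context.',
    '- Remove conversational filler.',
    '- Preserve code signatures and technical constraints verbatim.',
    '- Keep sections short and self-describing so they retrieve well.',
].join('\n');

function buildEnhanceRequest(markdown: string, model: string): ChatRequest {
    return {
        model,
        temperature: 0, // assumption: low randomness suits documentation rewriting
        messages: [
            { role: 'system', content: PLACEHOLDER_PROMPT },
            { role: 'user', content: markdown },
        ],
    };
}
```

With an OpenAI-compatible provider, a body like this is sent as JSON in a POST to the base URL's `/chat/completions` path, with the API key supplied as a Bearer token.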

Development Workflow

Running in Development

To run the CLI directly from source without manual transpilation:

```
npx ts-node src/index.ts https://docs.example.com --raw
```

Extending Processing Logic

If you need to add custom cleaning rules (e.g., removing specific cookie banners or headers found in a particular documentation framework), modify the cleanHtml method in src/processor.ts:

```
// Example: Adding a custom selector to strip
private cleanHtml(html: string): string {
    const $ = cheerio.load(html);
    $('.custom-cookie-banner').remove();
    $('.v-announcement-bar').remove();
    // ... existing cleaning logic
}
```

Building for Production

To generate the production-ready JavaScript files in the dist directory:

```
npm run build
```

After building, you can link the package locally to test the global moltext command:

```
npm link
```