# Rate Limiting & Safety
Moltext is designed to ingest large documentation sites while respecting the target host's resources and your own API quotas. By default, the tool includes several safeguards to prevent runaway crawls and excessive consumption.
## Managing Crawl Depth
To prevent Moltext from attempting to crawl an entire domain or getting stuck in infinite link loops, use the `--limit` flag.
- **Flag:** `-l, --limit <number>`
- **Default:** `100` pages
- **Behavior:** Once the crawler identifies and parses the specified number of unique pages, it stops discovery and begins the normalization process.
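Conceptually, the limit caps a breadth-first discovery queue: cycles are avoided by tracking seen URLs, and the loop halts as soon as the page budget is spent. A minimal sketch (not Moltext's actual implementation; `fetch_links` is a hypothetical callback that returns the links found on a page):

```python
from collections import deque

def crawl(start_url, fetch_links, limit=100):
    """Breadth-first link discovery that stops after `limit` unique pages."""
    seen = {start_url}
    queue = deque([start_url])
    pages = []
    while queue and len(pages) < limit:
        url = queue.popleft()
        pages.append(url)           # count this page against the budget
        for link in fetch_links(url):
            if link not in seen:    # seen-set prevents infinite link loops
                seen.add(link)
                queue.append(link)
    return pages
```

Because visited URLs are never re-queued, even a site whose pages link back to each other terminates once the limit is reached.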
```shell
# Limit the crawl to the first 20 pages for a quick summary
moltext https://docs.example.com --limit 20
```
## Domain Isolation
The crawler is strictly locked to the hostname of the initial URL provided.
- **Internal Links:** Only links belonging to the same domain (e.g., `docs.example.com`) are added to the queue.
- **External Links:** Links pointing to GitHub, Twitter, or other external domains are ignored during the discovery phase to ensure the resulting `context.md` remains focused on the specific tool or library.
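A hostname check of this kind can be sketched with the standard library's URL utilities (an illustrative sketch, not Moltext's source; `same_host_links` is a hypothetical helper):

```python
from urllib.parse import urljoin, urlparse

def same_host_links(page_url, hrefs):
    """Resolve relative hrefs and keep only links on the crawl's origin host."""
    host = urlparse(page_url).hostname
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)      # "/api" -> full URL on same host
        if urlparse(absolute).hostname == host: # external hosts are dropped
            links.append(absolute)
    return links
```

Relative links resolve against the page that contained them, so they always pass the check; absolute links to other hosts (GitHub, Twitter, CDNs) are silently discarded.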
## API Rate Limiting & Batching
When using LLM-enhanced mode (non-raw), Moltext manages requests to your inference provider (OpenAI or local) to avoid `429 Too Many Requests` errors.
- **Sequential Batching:** Moltext processes pages in batches of 5 concurrent requests. This provides a balance between compilation speed and rate-limit safety.
- **Raw Mode:** To bypass LLM API consumption entirely, use the `--raw` flag. This is recommended for massive documentation sets where you intend to perform your own chunking or vectorization later.
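The sequential-batching pattern keeps at most one batch of requests in flight at a time: each group of 5 must finish before the next begins, bounding concurrent load on the provider. A sketch of the idea in Python (Moltext itself is a CLI; `summarize` stands in for the LLM call):

```python
import asyncio

BATCH_SIZE = 5  # matches the documented batch size

async def process_in_batches(pages, summarize, batch_size=BATCH_SIZE):
    """Run `summarize` over pages, at most `batch_size` concurrently."""
    results = []
    for i in range(0, len(pages), batch_size):
        batch = pages[i:i + batch_size]
        # Each batch completes fully before the next starts, which keeps
        # the number of in-flight API requests bounded.
        results.extend(await asyncio.gather(*(summarize(p) for p in batch)))
    return results
```

Compared with firing every request at once, this trades some wall-clock speed for predictable request pacing, which is usually enough to stay under provider rate limits.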
```shell
# Skip LLM processing to avoid API costs and rate limits
moltext https://docs.example.com --raw
```
## Local Inference for Unlimited Processing
For high-volume ingestion without cost or rate-limit constraints, you can point Moltext to a local inference server (such as Ollama or LM Studio) using the `--base-url` option.
```shell
# Using a local Ollama instance (no API key required)
moltext https://docs.example.com --base-url http://localhost:11434/v1 --model llama3
```
## Network Safety
- **Timeouts:** Each network request to the documentation source has a hard-coded timeout of 10,000 ms (10 seconds). If a page fails to respond within this window, the crawler skips it and moves to the next item in the queue.
- **User-Agent:** Requests are identified by the User-Agent `Moltext/1.0`, allowing site administrators to identify the traffic.
- **Content Filtering:** Moltext only attempts to parse `text/html` resources. Binary files, images, and PDFs are automatically ignored during the crawl.
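The content-filtering step amounts to checking a response's `Content-Type` header before handing the body to the parser. A minimal sketch of such a check (illustrative only; `should_parse` is a hypothetical helper, not Moltext's API):

```python
def should_parse(content_type):
    """Return True only for HTML responses; binaries, images, PDFs are skipped."""
    if content_type is None:
        return False
    # Content-Type headers may carry parameters, e.g. "text/html; charset=utf-8",
    # so compare only the media type, case-insensitively.
    return content_type.split(";")[0].strip().lower() == "text/html"
```

Note the parameter stripping: a naive equality check against `"text/html"` would wrongly reject the common `text/html; charset=utf-8` form.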