# Rate Limiting & Safety
Moltext is designed to ingest large documentation sites while respecting the target host's resources and your own API quotas. By default, the tool includes several safeguards to prevent runaway crawls and excessive consumption.
## Managing Crawl Depth
To prevent Moltext from attempting to crawl an entire domain or getting stuck in infinite link loops, use the `--limit` flag.
- **Flag:** `-l, --limit <number>`
- **Default:** `100` pages
- **Behavior:** Once the crawler identifies and parses the specified number of unique pages, it stops discovery and begins the normalization process.
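Conceptually, the limit caps a breadth-first discovery queue: cycles are avoided by tracking seen URLs, and the loop halts as soon as the page budget is spent. A minimal sketch (not Moltext's actual implementation; `fetch_links` is a hypothetical callback that returns the links found on a page):

```python
from collections import deque

def crawl(start_url, fetch_links, limit=100):
    """Breadth-first link discovery that stops after `limit` unique pages."""
    seen = {start_url}
    queue = deque([start_url])
    pages = []
    while queue and len(pages) < limit:
        url = queue.popleft()
        pages.append(url)           # count this page against the budget
        for link in fetch_links(url):
            if link not in seen:    # seen-set prevents infinite link loops
                seen.add(link)
                queue.append(link)
    return pages
```

Because visited URLs are never re-queued, even a site whose pages link back to each other terminates once the limit is reached.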
```shell
# Limit the crawl to the first 20 pages for a quick summary
moltext https://docs.example.com --limit 20
```
## Domain Isolation
The crawler is strictly locked to the hostname of the initial URL provided.
- **Internal Links:** Only links belonging to the same domain (e.g., `docs.example.com`) are added to the queue.
- **External Links:** Links pointing to GitHub, Twitter, or other external domains are ignored during the discovery phase to ensure the resulting `context.md` remains focused on the specific tool or library.
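A hostname check of this kind can be sketched with the standard library's URL utilities (an illustrative sketch, not Moltext's source; `same_host_links` is a hypothetical helper):

```python
from urllib.parse import urljoin, urlparse

def same_host_links(page_url, hrefs):
    """Resolve relative hrefs and keep only links on the crawl's origin host."""
    host = urlparse(page_url).hostname
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)      # "/api" -> full URL on same host
        if urlparse(absolute).hostname == host: # external hosts are dropped
            links.append(absolute)
    return links
```

Relative links resolve against the page that contained them, so they always pass the check; absolute links to other hosts (GitHub, Twitter, CDNs) are silently discarded.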
## API Rate Limiting & Batching
When using LLM-enhanced mode (non-raw), Moltext manages requests to your inference provider (OpenAI or local) to avoid `429 Too Many Requests` errors.
- **Sequential Batching:** Moltext processes pages in batches of 5 concurrent requests. This provides a balance between compilation speed and rate-limit safety.
- **Raw Mode:** To bypass LLM API consumption entirely, use the `--raw` flag. This is recommended for massive documentation sets where you intend to perform your own chunking or vectorization later.
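The sequential-batching pattern keeps at most one batch of requests in flight at a time: each group of 5 must finish before the next begins, bounding concurrent load on the provider. A sketch of the idea in Python (Moltext itself is a CLI; `summarize` stands in for the LLM call):

```python
import asyncio

BATCH_SIZE = 5  # matches the documented batch size

async def process_in_batches(pages, summarize, batch_size=BATCH_SIZE):
    """Run `summarize` over pages, at most `batch_size` concurrently."""
    results = []
    for i in range(0, len(pages), batch_size):
        batch = pages[i:i + batch_size]
        # Each batch completes fully before the next starts, which keeps
        # the number of in-flight API requests bounded.
        results.extend(await asyncio.gather(*(summarize(p) for p in batch)))
    return results
```

Compared with firing every request at once, this trades some wall-clock speed for predictable request pacing, which is usually enough to stay under provider rate limits.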
```shell
# Skip LLM processing to avoid API costs and rate limits
moltext https://docs.example.com --raw
```
## Local Inference for Unlimited Processing
For high-volume ingestion without cost or rate-limit constraints, you can point Moltext to a local inference server (such as Ollama or LM Studio) using the `--base-url` option.
```shell
# Using a local Ollama instance (no API key required)
moltext https://docs.example.com --base-url http://localhost:11434/v1 --model llama3
```
## Network Safety
- **Timeouts:** Each network request to the documentation source has a hard-coded timeout of 10,000 ms (10 seconds). If a page fails to respond within this window, the crawler skips it and moves to the next item in the queue.
- **User-Agent:** Requests are identified by the User-Agent `Moltext/1.0`, allowing site administrators to identify the traffic.
- **Content Filtering:** Moltext only attempts to parse `text/html` resources. Binary files, images, and PDFs are automatically ignored during the crawl.
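The content-filtering step amounts to checking a response's `Content-Type` header before handing the body to the parser. A minimal sketch of such a check (illustrative only; `should_parse` is a hypothetical helper, not Moltext's API):

```python
def should_parse(content_type):
    """Return True only for HTML responses; binaries, images, PDFs are skipped."""
    if content_type is None:
        return False
    # Content-Type headers may carry parameters, e.g. "text/html; charset=utf-8",
    # so compare only the media type, case-insensitively.
    return content_type.split(";")[0].strip().lower() == "text/html"
```

Note the parameter stripping: a naive equality check against `"text/html"` would wrongly reject the common `text/html; charset=utf-8` form.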