# Project Roadmap
Moltext is evolving from a single-URL compiler into a universal ingestion engine for agentic workflows. Our roadmap focuses on expanding the data sources available to agents and improving the structural density of the output.
## Phase 1: Expanded Format Support
While HTML is the primary medium for documentation, critical technical specifications often live in static files.
- PDF Ingestion: Support for parsing and normalizing PDF manuals, whitepapers, and API specifications into the `context.md` format.
- Markdown Native Support: Direct ingestion of existing `.md` and `.mdx` repositories (e.g., GitHub `docs/` folders) to bypass scraping overhead.
- OpenAPI/Swagger Integration: Specialized processing for JSON/YAML spec files to generate high-density API reference tables.
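The OpenAPI integration can be pictured as flattening a spec's `paths` object into a markdown table. A minimal sketch, assuming a JSON spec as input; `openapi_to_table` is a hypothetical name, and a real implementation would also resolve `$ref` pointers, parameters, and response schemas:

```python
import json

def openapi_to_table(spec_json: str) -> str:
    """Render the paths of an OpenAPI JSON spec as a markdown reference table.

    Illustrative sketch only: it reads each operation's summary and ignores
    parameters, request bodies, and $ref resolution.
    """
    spec = json.loads(spec_json)
    rows = ["| Method | Path | Summary |", "| --- | --- | --- |"]
    for path, operations in spec.get("paths", {}).items():
        for method, op in operations.items():
            rows.append(f"| {method.upper()} | {path} | {op.get('summary', '')} |")
    return "\n".join(rows)
```

A table like this is far denser for an agent to scan than the raw YAML/JSON spec, which is the motivation behind this roadmap item.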
## Phase 2: Structural Intelligence
Improving how Moltext handles complex, fragmented information across ecosystems.
- Multi-Domain Documentation Mapping: Currently, the crawler stays within a single domain. Future updates will allow "Trusted Domain" lists, enabling agents to follow documentation that spans multiple sites (e.g., a core library site and its associated plugin ecosystem).
- Cross-Reference Normalization: Automatically resolving and rewriting relative links within the `context.md` to maintain internal consistency when the agent navigates the compiled memory.
- Sitemap-Aware Crawling: Faster discovery of documentation structures by prioritizing `sitemap.xml` and `robots.txt` paths.
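Sitemap-aware discovery amounts to reading the `<loc>` entries from a site's `sitemap.xml` instead of spidering link by link. A minimal sketch using only the standard library; the function name is hypothetical and fetching, sitemap-index recursion, and `robots.txt` parsing are left out:

```python
import xml.etree.ElementTree as ET

# Sitemaps use a fixed XML namespace defined by sitemaps.org.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract the <loc> URLs from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Seeding the crawl queue from this list lets a crawler discover every documented page up front rather than depending on internal link coverage.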
## Phase 3: Agent-Centric Optimizations
Features designed to lower latency and improve retrieval-augmented generation (RAG) performance.
- Context Window Chunking: Automatic splitting of `context.md` into model-optimized chunks (e.g., 32k or 128k blocks) with overlapping headers to prevent context loss.
- Local Embedding Generation: An optional flag to output a `.vector` or `.jsonl` file alongside the markdown, ready for immediate injection into vector databases such as Chroma or Pinecone.
- Incremental Updates: A "watch" mode that re-compiles only the pages that have changed since the last run, reducing LLM token consumption.
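The chunking idea above can be sketched as a budgeted splitter that carries the most recent heading into each new chunk, so no section loses its context. This is an illustrative sketch, not Moltext's implementation: the token budget is approximated by whitespace-separated words, and `chunk_markdown` is a hypothetical name:

```python
def chunk_markdown(md: str, max_tokens: int = 32_000) -> list[str]:
    """Split markdown into chunks under a rough token budget, repeating the
    most recent heading at the top of each new chunk (the "overlapping
    header"). Tokens are approximated as whitespace-separated words."""
    chunks: list[str] = []
    current: list[str] = []
    heading = ""
    count = 0
    for line in md.splitlines():
        if line.startswith("#"):
            heading = line  # remember the active section heading
        words = len(line.split())
        if current and count + words > max_tokens:
            chunks.append("\n".join(current))
            # Start the next chunk with the active heading, unless this
            # line is itself a heading and will supply its own context.
            if heading and not line.startswith("#"):
                current, count = [heading], len(heading.split())
            else:
                current, count = [], 0
        current.append(line)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks
```

A production version would count tokens with the target model's tokenizer and track the full heading path (h1 > h2 > h3) rather than a single heading, but the splitting logic is the same.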
To suggest a feature or report a bug, please open an issue in the GitHub Repository.