Transmutation

High-performance document conversion engine for AI/LLM embeddings

Transmutation is a pure Rust document conversion engine designed to transform various file formats into optimized text and image outputs suitable for LLM processing and vector embeddings. Built as a core component of the HiveLLM Vectorizer ecosystem, Transmutation is a high-performance alternative to Docling, offering superior speed, lower memory usage, and zero runtime dependencies.

🎯 Project Goals

Pure Rust implementation - No Python dependencies, maximum performance
Convert documents to LLM-friendly formats (Markdown, Images, JSON)
Optimize output for embedding generation (text and multimodal)
Maintain maximum quality with minimum size
Competitor to Docling - 98x faster, more efficient, and easier to deploy
Seamless integration with HiveLLM Vectorizer

📊 Benchmark Results

Transmutation vs Docling (Fast Mode - Pure Rust):

Metric	Paper 1 (15 pages)	Paper 2 (25 pages)	Average
Similarity	76.36%	84.44%	80.40%
Speed	108x faster	88x faster	98x faster
Time (Docling)	31.36s	40.56s	~35s
Time (Transmutation)	0.29s	0.46s	~0.37s

✅ 80% similarity - Acceptable for most use cases
✅ 98x faster - Near-instant conversion
✅ Pure Rust - No Python/ML dependencies
✅ Low memory - 50 MB footprint
🎯 Goal: 95% similarity (Precision Mode with C++ FFI - in development)

See BENCHMARK_COMPARISON.md for detailed results.

📋 Supported Formats

Document Formats

Input Format	Output Options	Status	Modes
PDF	Image per page, Markdown (per page/full), JSON	✅ Production	Fast, Precision, FFI
DOCX	Image per page, Markdown (per page/full), JSON	✅ Production	Pure Rust + LibreOffice
XLSX	Markdown tables, CSV, JSON	✅ Production	Pure Rust (148 pg/s)
PPTX	Image per slide, Markdown per slide	✅ Production	Pure Rust (1639 pg/s)
HTML	Markdown, JSON	✅ Production	Pure Rust (2110 pg/s)
XML	Markdown, JSON	✅ Production	Pure Rust (2353 pg/s)
TXT	Markdown, JSON	✅ Production	Pure Rust (2805 pg/s)
CSV/TSV	Markdown tables, JSON	✅ Production	Pure Rust (2647 pg/s)
RTF	Markdown, JSON	⚠️ Beta	Pure Rust (simplified parser)
ODT	Markdown, JSON	⚠️ Beta	Pure Rust (ZIP + XML)
MD	Markdown (normalized), JSON	🔄 Planned	-

Image Formats (OCR)

Input Format	Output Options	OCR Engine	Status
JPG/JPEG	Markdown (OCR), JSON	Tesseract	✅ Production
PNG	Markdown (OCR), JSON	Tesseract	✅ Production
TIFF/TIF	Markdown (OCR), JSON	Tesseract	✅ Production
BMP	Markdown (OCR), JSON	Tesseract	✅ Production
GIF	Markdown (OCR), JSON	Tesseract	✅ Production
WEBP	Markdown (OCR), JSON	Tesseract	✅ Production

Audio/Video Formats

Input Format	Output Options	Engine	Status
MP3	Markdown (transcription), JSON	Whisper	✅ Production
WAV	Markdown (transcription), JSON	Whisper	✅ Production
M4A	Markdown (transcription), JSON	Whisper	✅ Production
FLAC	Markdown (transcription), JSON	Whisper	✅ Production
OGG	Markdown (transcription), JSON	Whisper	✅ Production
MP4	Markdown (transcription), JSON	FFmpeg + Whisper	✅ Production
AVI	Markdown (transcription), JSON	FFmpeg + Whisper	✅ Production
MKV	Markdown (transcription), JSON	FFmpeg + Whisper	✅ Production
MOV	Markdown (transcription), JSON	FFmpeg + Whisper	✅ Production
WEBM	Markdown (transcription), JSON	FFmpeg + Whisper	✅ Production

Archive Formats

Input Format	Output Options	Status	Performance
ZIP	File listing, statistics, Markdown index, JSON	✅ Production	Pure Rust (1864 pg/s)
TAR/GZ	Extract and process contents	🔄 Planned	-
7Z	Extract and process contents	🔄 Planned	-

🚀 Quick Start

Installation

Windows MSI Installer:

# Download from releases or build:
.\build-msi.ps1
msiexec /i target\wix\transmutation-0.3.0-x86_64.msi

See docs/MSI_BUILD.md for details.

Cargo:

# Add to Cargo.toml
[dependencies]
transmutation = "0.2"

# Core features (always enabled, no flags needed):
# - PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT

# With Office formats (default)
[dependencies.transmutation]
version = "0.2"
features = ["office"]  # DOCX, XLSX, PPTX

# With optional features (requires external tools)
features = ["office", "pdf-to-image", "tesseract", "audio"]

External Dependencies

Transmutation is mostly pure Rust, with core features requiring ZERO dependencies:

Feature	Requires	Status
Core (PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT)	✅ None	Always enabled
`office` (DOCX, XLSX, PPTX - Text)	✅ None	Pure Rust (default)
`pdf-to-image`	⚠️ poppler-utils	Optional
`office` + images	⚠️ LibreOffice	Optional
`image-ocr`	⚠️ Tesseract OCR	Optional
`audio`	⚠️ Whisper CLI	Optional
`video`	⚠️ FFmpeg + Whisper	Optional
`archives-extended` (TAR, GZ, 7Z)	⚠️ tar, flate2 crates	Optional

During compilation, build.rs will automatically detect missing dependencies and provide installation instructions:

cargo build --features "pdf-to-image"

# If pdftoppm is missing, you'll see:
⚠️  Optional External Dependencies Missing

  ❌ pdftoppm (poppler-utils): PDF → Image conversion
     Install: sudo apt-get install poppler-utils

📖 Quick install (all dependencies):
   ./install/install-deps-linux.sh

Installation scripts are provided for all platforms:

Linux: ./install/install-deps-linux.sh
macOS: ./install/install-deps-macos.sh
Windows: .\install\install-deps-windows.ps1 (or .bat)

See install/README.md for detailed instructions.

📖 Usage Guide

CLI Usage

Basic Conversion:

# Convert PDF to Markdown
transmutation convert document.pdf -o output.md

# Convert DOCX to Markdown with images
transmutation convert report.docx -o output.md --extract-images

# Convert with precision mode (77% similarity)
transmutation convert paper.pdf -o output.md --precision

# Convert multiple files
transmutation batch *.pdf -o output/ --parallel 4

Format-Specific Examples:

# PDF → Markdown (split by pages)
transmutation convert document.pdf -o output/ --split-pages

# DOCX → Markdown + Images
transmutation convert report.docx -o output.md --images

# XLSX → CSV
transmutation convert data.xlsx -o output.csv --format csv

# PPTX → Markdown (one file per slide)
transmutation convert slides.pptx -o output/ --split-slides

# Image OCR → Markdown
transmutation convert scan.jpg -o output.md --ocr --lang eng

# ZIP → Extract and convert all
transmutation convert archive.zip -o output/ --recursive

Advanced Options:

# Optimize for LLM embeddings
transmutation convert document.pdf \
  --optimize-llm \
  --max-chunk-size 512 \
  --remove-headers \
  --normalize-whitespace

# High-quality image extraction
transmutation convert document.pdf \
  --extract-images \
  --dpi 300 \
  --image-quality high

# Batch processing with progress
transmutation batch papers/*.pdf \
  -o converted/ \
  --parallel 8 \
  --progress \
  --format markdown

Library Usage (Rust)

Basic Conversion:

use transmutation::{Converter, OutputFormat, ConversionOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize converter
    let converter = Converter::new()?;
    
    // Convert PDF to Markdown
    let result = converter
        .convert("document.pdf")
        .to(OutputFormat::Markdown)
        .with_options(ConversionOptions {
            split_pages: true,
            optimize_for_llm: true,
            ..Default::default()
        })
        .execute()
        .await?;
    
    // Save output
    result.save("output/document.md").await?;
    
    println!("Converted {} pages", result.page_count());
    Ok(())
}

Batch Processing

use transmutation::{Converter, BatchProcessor, OutputFormat};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let batch = BatchProcessor::new(converter);
    
    // Process multiple files
    let results = batch
        .add_files(&["doc1.pdf", "doc2.docx", "doc3.pptx"])
        .to(OutputFormat::Markdown)
        .parallel(4)
        .execute()
        .await?;
    
    for (file, result) in results {
        println!("{}: {} -> {}", file, result.input_size(), result.output_size());
    }
    
    Ok(())
}

Vectorizer Integration

use transmutation::{Converter, OutputFormat};
use vectorizer::VectorizerClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let vectorizer = VectorizerClient::new("http://localhost:15002").await?;
    
    // Convert and embed in one pipeline
    let result = converter
        .convert("document.pdf")
        .to(OutputFormat::EmbeddingReady)
        .pipe_to(&vectorizer)
        .execute()
        .await?;
    
    println!("Embedded {} chunks", result.chunk_count());
    Ok(())
}

Python Usage (PyO3 Bindings - Future)

from transmutation import Converter, OutputFormat

# Initialize converter
converter = Converter()

# Convert PDF to Markdown
result = converter.convert(
    "document.pdf",
    output_format=OutputFormat.Markdown,
    split_pages=True,
    optimize_for_llm=True
)

result.save("output/document.md")
print(f"Converted {result.page_count()} pages")

# Batch processing
from transmutation import BatchProcessor

batch = BatchProcessor(converter)
results = batch.add_files([
    "doc1.pdf",
    "doc2.docx",
    "doc3.pptx"
]).to(OutputFormat.Markdown).parallel(4).execute()

for file, result in results:
    print(f"{file}: {result.input_size()} -> {result.output_size()}")

JavaScript/TypeScript (Neon Bindings - Future)

import { Converter, OutputFormat, ConversionOptions } from 'transmutation';

// Initialize converter
const converter = new Converter();

// Convert PDF to Markdown
const result = await converter
  .convert('document.pdf')
  .to(OutputFormat.Markdown)
  .withOptions({
    splitPages: true,
    optimizeForLlm: true,
    extractImages: false
  })
  .execute();

await result.save('output/document.md');
console.log(`Converted ${result.pageCount()} pages`);

// Batch processing
import { BatchProcessor } from 'transmutation';

const batch = new BatchProcessor(converter);
const results = await batch
  .addFiles(['doc1.pdf', 'doc2.docx', 'doc3.pptx'])
  .to(OutputFormat.Markdown)
  .parallel(4)
  .execute();

results.forEach(([file, result]) => {
  console.log(`${file}: ${result.inputSize()} -> ${result.outputSize()}`);
});

🎯 Common Use Cases

1. RAG System Document Ingestion

# Convert research papers for semantic search
transmutation batch papers/*.pdf \
  -o embeddings/ \
  --optimize-llm \
  --split-pages \
  --max-chunk-size 512 \
  --parallel 8

# Then index with Vectorizer
vectorizer insert --collection research_papers embeddings/*.md

2. Document Archive Migration

# Convert legacy documents to Markdown
transmutation batch archive/ \
  -o markdown/ \
  --recursive \
  --format markdown \
  --parallel 16 \
  --progress

# Supported: PDF, DOCX, XLSX, PPTX, RTF, ODT, HTML, XML

3. OCR for Scanned Documents

# Batch OCR with Tesseract
transmutation batch scans/*.jpg \
  -o text/ \
  --ocr \
  --lang eng \
  --dpi 300 \
  --parallel 4

# Multi-language support
transmutation convert document_pt.jpg \
  -o output.md \
  --ocr \
  --lang por

4. Legal Document Processing

# Convert legal PDFs with high precision
transmutation convert contract.pdf \
  -o contract.md \
  --precision \
  --preserve-layout \
  --extract-tables \
  --include-metadata

# Batch process court documents
transmutation batch cases/*.pdf \
  -o processed/ \
  --precision \
  --parallel 4

5. Academic Paper Analysis

# Extract text from arXiv papers
transmutation batch papers/*.pdf \
  -o markdown/ \
  --split-pages \
  --extract-tables \
  --normalize-whitespace

# Create embeddings for similarity search
vectorizer insert --collection arxiv markdown/*.md

6. Data Extraction from Spreadsheets

# Convert Excel to Markdown tables
transmutation convert data.xlsx -o tables.md --format markdown

# Convert to CSV for analysis
transmutation convert data.xlsx -o data.csv --format csv

# Convert to JSON
transmutation convert data.xlsx -o data.json --format json

7. Presentation Content Extraction

# Extract text from PowerPoint slides
transmutation convert presentation.pptx \
  -o slides/ \
  --split-slides \
  --extract-images \
  --format markdown

# Batch process training materials
transmutation batch trainings/*.pptx \
  -o content/ \
  --split-slides \
  --parallel 8

8. Web Content Archiving

# Convert saved HTML pages
transmutation batch pages/*.html \
  -o markdown/ \
  --format markdown \
  --normalize-whitespace

# Process downloaded documentation
transmutation batch docs/*.html \
  -o processed/ \
  --extract-images \
  --parallel 4

🔧 Configuration

Conversion Options

pub struct ConversionOptions {
    // Output control
    pub split_pages: bool,           // Split output by pages
    pub optimize_for_llm: bool,      // Optimize for LLM processing
    pub max_chunk_size: usize,       // Maximum chunk size (tokens)
    
    // Quality settings
    pub image_quality: ImageQuality, // High, Medium, Low
    pub dpi: u32,                    // DPI for image output (default: 150)
    pub ocr_language: String,        // OCR language (default: "eng")
    
    // Processing options
    pub preserve_layout: bool,       // Preserve document layout
    pub extract_tables: bool,        // Extract tables separately
    pub extract_images: bool,        // Extract embedded images
    pub include_metadata: bool,      // Include document metadata
    
    // Optimization
    pub compression_level: u8,       // 0-9 for output compression
    pub remove_headers_footers: bool,
    pub remove_watermarks: bool,
    pub normalize_whitespace: bool,
}

🆚 Why Transmutation vs Docling?

Feature	Transmutation	Docling
Language	100% Rust	Python
Performance	✅ 250x faster	Baseline
Memory Usage	✅ ~20MB	~2-3GB
Dependencies	✅ Zero runtime deps	Python + ML models
Deployment	✅ Single binary (~5MB)	Python env + models (~2GB)
Startup Time	✅ <100ms	~5-10s
Platform Support	✅ Windows/Mac/Linux	Requires Python

LLM Framework Integrations

LangChain: Document loaders and text splitters
LlamaIndex: Document readers and node parsers
Haystack: Document converters and preprocessors
DSPy: Optimized document processing

📊 Performance

Real-World Benchmarks ✅

Test Document: Attention Is All You Need (arXiv:1706.03762v7.pdf)
Size: 2.22 MB, 15 pages

Metric	Transmutation	Docling	Improvement
Conversion Time	0.21s	52.68s	✅ 250x faster
Processing Speed	71 pages/sec	0.28 pages/sec	✅ 254x faster
Memory Usage	~20MB	~2-3GB	✅ 100-150x less
Startup Time	<0.1s	~6s	✅ 60x faster
Output Quality (Fast)	71.8% similarity	100% (reference)	⚠️ Trade-off
Output Quality (Precision)	77.3% similarity	100% (reference)	⚠️ +5.5% better

Projected Performance

Operation	Input Size	Time	Throughput
PDF → Markdown	2.2MB (15 pages)	0.21s	71 pages/s ✅
PDF → Markdown	10MB (100 pages)	~1.4s	71 pages/s
Batch (1,000 PDFs)	2.2GB (15,000 pages)	~4 min	3,750 pages/min

Memory Footprint

Base: ~20MB (pure Rust, no Python runtime) ✅
Per conversion: Minimal (streaming processing)
No ML models required (unlike Docling's 2-3GB)

Precision vs Performance Trade-off

Fast Mode (default) - 71.8% similarity:

✅ 250x faster than Docling
✅ Pure Rust with basic text heuristics
✅ Works on any PDF without training
✅ Zero runtime dependencies

Precision Mode (--precision) - 77.3% similarity:

✅ 250x faster than Docling (same speed as fast mode)
✅ Enhanced text processing with space correction
✅ +5.5% better than fast mode
✅ No hardcoded rules, all generic heuristics

Why not 95%+ similarity?

Docling uses:

docling-parse (C++ library) - Extracts text with precise coordinates, fonts, and layout info
LayoutModel (ML) - Deep learning to detect block types (headings, paragraphs, tables) visually
ReadingOrderModel (ML) - ML-based reading order determination

Transmutation provides three modes:

1. Fast Mode (default):

Pure Rust text extraction (pdf-extract)
Generic heuristics (no ML)
71.8% similarity, 250x faster

2. Precision Mode (--precision):

Enhanced text processing
Generic heuristics + space correction
77.3% similarity, 250x faster

Future: C++ FFI Mode - Direct integration with docling-parse (no Python):

Will use C++ library via FFI for 95%+ similarity
No Python dependency, pure Rust + C++ shared library
In development

Mode	Similarity	Speed	Memory	Dependencies
Fast	71.8%	250x	50 MB	None (pure Rust)
Precision	77.3%	250x	50 MB	None (pure Rust)
FFI (future)	95%+	~50x	100 MB	C++ shared lib only

🛣️ Roadmap

See ROADMAP.md for detailed development plan.

Phase 1: Foundation (Q1 2025) ✅ COMPLETE

✅ Project structure and architecture
✅ Core converter interfaces
✅ PDF conversion (pure Rust - pdf-extract)
✅ Advanced Markdown output with intelligent paragraph joining
✅ 98x faster than Docling benchmark achieved (97 papers tested)

Phase 1.5: Distribution & Tooling (Oct 2025) ✅ COMPLETE

✅ Windows MSI installer with dependency management
✅ Custom icons and professional branding
✅ Multi-platform installation scripts (5 variants)
✅ Build-time dependency detection
✅ Comprehensive documentation

Phase 2: Core Formats (Q2 2025) ✅ 100% COMPLETE

✅ DOCX conversion (Markdown + Images - Pure Rust)
✅ XLSX conversion (Markdown/CSV/JSON - Pure Rust, 148 pg/s)
✅ PPTX conversion (Markdown/Images - Pure Rust, 1639 pg/s)
✅ HTML/XML conversion (Pure Rust, 2110-2353 pg/s)
✅ Text formats (TXT, CSV, TSV, RTF, ODT - Pure Rust)
✅ 11 formats total (8 production, 2 beta)

Phase 2.5: Core Features Architecture ✅ COMPLETE

✅ Core formats always enabled (no feature flags)
✅ Simplified API and user experience
✅ Faster compilation

Phase 3: Advanced Features (Q3 2025) ✅ COMPLETE

✅ Archive handling (ZIP, TAR, TAR.GZ - 1864 pg/s)
✅ Batch processing (Concurrent with Tokio - 4,627 pg/s)
✅ Image OCR (Tesseract - 6 formats, 88x faster than Docling)

Phase 4: Advanced Optimizations

📝 Performance optimizations
📝 Quality improvements (RTF, ODT)
📝 Memory optimizations
📝 v1.0.0 Release

🤝 Contributing

See CONTRIBUTING.md for guidelines.

📝 License

MIT License - see LICENSE for details.

📝 Changelog

See CHANGELOG.md for detailed version history and release notes.

Current Version: 0.3.0 (December 6, 2025)

🔗 Links

GitHub: https://github.com/hivellm/transmutation
Documentation: https://docs.hivellm.org/transmutation
Changelog: CHANGELOG.md
Docling Project: https://github.com/docling-project
HiveLLM Vectorizer: https://github.com/hivellm/vectorizer

🏆 Credits

Built with ❤️ by the HiveLLM Team

Pure Rust implementation - No Python, no ML model dependencies

Powered by:

lopdf - Pure Rust PDF parsing
docx-rs - Pure Rust DOCX parsing
Tesseract - OCR engine (optional)
FFmpeg - Multimedia processing (optional)

Inspired by Docling, but built to be faster, lighter, and easier to deploy.

Status: ✅ v0.3.0 - Performance & Memory Optimization Release

Latest Updates (v0.3.0):

⚡ Memory Optimization: Cached regex patterns, pre-allocated buffers
🔧 Fixed O(n²) Issue: Page extraction now O(n) for split-pages mode
🚀 Reduced Memory Pressure: Early release of PDF bytes after extraction
📉 Lower Memory Footprint: Especially beneficial for library usage

Name		Name	Last commit message	Last commit date
Latest commit History 112 Commits
.cargo		.cargo
.cursor/commands		.cursor/commands
.github		.github
assets		assets
cpp		cpp
data		data
debian		debian
docs		docs
examples		examples
install		install
models		models
pkg/appimage		pkg/appimage
rulebook		rulebook
scripts		scripts
src		src
tests		tests
wix		wix
.clippy.toml		.clippy.toml
.cursorrules		.cursorrules
.dockerignore		.dockerignore
.gitignore		.gitignore
.rulebook		.rulebook
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
build.rs		build.rs
clippy.toml		clippy.toml
rust-toolchain.toml		rust-toolchain.toml
rustfmt.toml		rustfmt.toml

License

hivellm/transmutation

Folders and files

Latest commit

History

Repository files navigation