High-performance document conversion engine for AI/LLM embeddings
Transmutation is a pure Rust document conversion engine designed to transform various file formats into optimized text and image outputs suitable for LLM processing and vector embeddings. Built as a core component of the HiveLLM Vectorizer ecosystem, Transmutation is a high-performance alternative to Docling, offering superior speed, lower memory usage, and zero runtime dependencies.
- Pure Rust implementation - No Python dependencies, maximum performance
- Convert documents to LLM-friendly formats (Markdown, Images, JSON)
- Optimize output for embedding generation (text and multimodal)
- Maintain maximum quality with minimum size
- Competitor to Docling - 98x faster, more efficient, and easier to deploy
- Seamless integration with HiveLLM Vectorizer
Transmutation vs Docling (Fast Mode - Pure Rust):
| Metric | Paper 1 (15 pages) | Paper 2 (25 pages) | Average |
|---|---|---|---|
| Similarity | 76.36% | 84.44% | 80.40% |
| Speed | 108x faster | 88x faster | 98x faster |
| Time (Docling) | 31.36s | 40.56s | ~35s |
| Time (Transmutation) | 0.29s | 0.46s | ~0.37s |
- ✅ 80% similarity - Acceptable for most use cases
- ✅ 98x faster - Near-instant conversion
- ✅ Pure Rust - No Python/ML dependencies
- ✅ Low memory - 50 MB footprint
- 🎯 Goal: 95% similarity (Precision Mode with C++ FFI - in development)
See BENCHMARK_COMPARISON.md for detailed results.
| Input Format | Output Options | Status | Modes |
|---|---|---|---|
| PDF | Image per page, Markdown (per page/full), JSON | ✅ Production | Fast, Precision, FFI |
| DOCX | Image per page, Markdown (per page/full), JSON | ✅ Production | Pure Rust + LibreOffice |
| XLSX | Markdown tables, CSV, JSON | ✅ Production | Pure Rust (148 pg/s) |
| PPTX | Image per slide, Markdown per slide | ✅ Production | Pure Rust (1639 pg/s) |
| HTML | Markdown, JSON | ✅ Production | Pure Rust (2110 pg/s) |
| XML | Markdown, JSON | ✅ Production | Pure Rust (2353 pg/s) |
| TXT | Markdown, JSON | ✅ Production | Pure Rust (2805 pg/s) |
| CSV/TSV | Markdown tables, JSON | ✅ Production | Pure Rust (2647 pg/s) |
| RTF | Markdown, JSON | 🚧 Beta | Pure Rust (simplified parser) |
| ODT | Markdown, JSON | 🚧 Beta | Pure Rust (ZIP + XML) |
| MD | Markdown (normalized), JSON | 📋 Planned | - |
| Input Format | Output Options | OCR Engine | Status |
|---|---|---|---|
| JPG/JPEG | Markdown (OCR), JSON | Tesseract | ✅ Production |
| PNG | Markdown (OCR), JSON | Tesseract | ✅ Production |
| TIFF/TIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
| BMP | Markdown (OCR), JSON | Tesseract | ✅ Production |
| GIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
| WEBP | Markdown (OCR), JSON | Tesseract | ✅ Production |
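Programmatically, an image OCR conversion through the Rust API documented later in this README might look like the following sketch. It assumes the `ocr_language` option listed under Configuration and the optional Tesseract dependency; treat it as an illustration of the builder API, not a verbatim excerpt.

```rust
use transmutation::{Converter, OutputFormat, ConversionOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;

    // OCR of a scanned image; requires the optional Tesseract dependency.
    let result = converter
        .convert("scan.jpg")
        .to(OutputFormat::Markdown)
        .with_options(ConversionOptions {
            ocr_language: "eng".to_string(), // Tesseract language code, e.g. "por"
            ..Default::default()
        })
        .execute()
        .await?;

    result.save("output/scan.md").await?;
    Ok(())
}
```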
| Input Format | Output Options | Engine | Status |
|---|---|---|---|
| MP3 | Markdown (transcription), JSON | Whisper | ✅ Production |
| WAV | Markdown (transcription), JSON | Whisper | ✅ Production |
| M4A | Markdown (transcription), JSON | Whisper | ✅ Production |
| FLAC | Markdown (transcription), JSON | Whisper | ✅ Production |
| OGG | Markdown (transcription), JSON | Whisper | ✅ Production |
| MP4 | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| AVI | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| MKV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| MOV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| WEBM | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| Input Format | Output Options | Status | Performance |
|---|---|---|---|
| ZIP | File listing, statistics, Markdown index, JSON | ✅ Production | Pure Rust (1864 pg/s) |
| TAR/GZ | Extract and process contents | 📋 Planned | - |
| 7Z | Extract and process contents | 📋 Planned | - |
Windows MSI Installer:
# Download from releases or build:
.\build-msi.ps1
msiexec /i target\wix\transmutation-0.3.0-x86_64.msi

See docs/MSI_BUILD.md for details.
Cargo:
# Add to Cargo.toml
[dependencies]
transmutation = "0.2"
# Core features (always enabled, no flags needed):
# - PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT
# With Office formats (default)
[dependencies.transmutation]
version = "0.2"
features = ["office"] # DOCX, XLSX, PPTX
# With optional features (requires external tools)
features = ["office", "pdf-to-image", "tesseract", "audio"]Transmutation is mostly pure Rust, with core features requiring ZERO dependencies:
| Feature | Requires | Status |
|---|---|---|
| Core (PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT) | ✅ None | Always enabled |
| office (DOCX, XLSX, PPTX - Text) | ✅ None | Pure Rust (default) |
| pdf-to-image | pdftoppm (poppler-utils) | Optional |
| office + images | LibreOffice | Optional |
| image-ocr | Tesseract | Optional |
| audio | Whisper | Optional |
| video | FFmpeg + Whisper | Optional |
| archives-extended (TAR, GZ, 7Z) | - | Optional |
During compilation, build.rs will automatically detect missing dependencies and provide installation instructions:
cargo build --features "pdf-to-image"
# If pdftoppm is missing, you'll see:
⚠️ Optional External Dependencies Missing
❌ pdftoppm (poppler-utils): PDF → Image conversion
Install: sudo apt-get install poppler-utils
📦 Quick install (all dependencies):
./install/install-deps-linux.sh

Installation scripts are provided for all platforms:
- Linux: ./install/install-deps-linux.sh
- macOS: ./install/install-deps-macos.sh
- Windows: .\install\install-deps-windows.ps1 (or .bat)
See install/README.md for detailed instructions.
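For reference, the build-time dependency detection described above can be sketched roughly as follows. This is an illustrative fragment, not the crate's actual build.rs; the pdftoppm probe is only an example of the pattern (probe the external tool, emit a cargo:warning with install instructions when it is missing).

```rust
// build.rs (illustrative sketch only, not the crate's actual build script)
use std::process::Command;

fn main() {
    // Cargo exposes enabled features to build scripts as CARGO_FEATURE_* variables.
    if std::env::var("CARGO_FEATURE_PDF_TO_IMAGE").is_ok() {
        // Probe for the external binary; output() fails if it cannot be spawned.
        let found = Command::new("pdftoppm").arg("-v").output().is_ok();
        if !found {
            println!("cargo:warning=pdftoppm (poppler-utils) not found: PDF -> Image conversion will be unavailable");
            println!("cargo:warning=Install: sudo apt-get install poppler-utils");
        }
    }
}
```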
Basic Conversion:
# Convert PDF to Markdown
transmutation convert document.pdf -o output.md
# Convert DOCX to Markdown with images
transmutation convert report.docx -o output.md --extract-images
# Convert with precision mode (77% similarity)
transmutation convert paper.pdf -o output.md --precision
# Convert multiple files
transmutation batch *.pdf -o output/ --parallel 4

Format-Specific Examples:
# PDF → Markdown (split by pages)
transmutation convert document.pdf -o output/ --split-pages
# DOCX → Markdown + Images
transmutation convert report.docx -o output.md --images
# XLSX → CSV
transmutation convert data.xlsx -o output.csv --format csv
# PPTX → Markdown (one file per slide)
transmutation convert slides.pptx -o output/ --split-slides
# Image OCR → Markdown
transmutation convert scan.jpg -o output.md --ocr --lang eng
# ZIP → Extract and convert all
transmutation convert archive.zip -o output/ --recursive

Advanced Options:
# Optimize for LLM embeddings
transmutation convert document.pdf \
--optimize-llm \
--max-chunk-size 512 \
--remove-headers \
--normalize-whitespace
# High-quality image extraction
transmutation convert document.pdf \
--extract-images \
--dpi 300 \
--image-quality high
# Batch processing with progress
transmutation batch papers/*.pdf \
-o converted/ \
--parallel 8 \
--progress \
--format markdown

Basic Conversion:
use transmutation::{Converter, OutputFormat, ConversionOptions};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize converter
let converter = Converter::new()?;
// Convert PDF to Markdown
let result = converter
.convert("document.pdf")
.to(OutputFormat::Markdown)
.with_options(ConversionOptions {
split_pages: true,
optimize_for_llm: true,
..Default::default()
})
.execute()
.await?;
// Save output
result.save("output/document.md").await?;
println!("Converted {} pages", result.page_count());
Ok(())
}

Batch Processing:

use transmutation::{Converter, BatchProcessor, OutputFormat};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let converter = Converter::new()?;
let batch = BatchProcessor::new(converter);
// Process multiple files
let results = batch
.add_files(&["doc1.pdf", "doc2.docx", "doc3.pptx"])
.to(OutputFormat::Markdown)
.parallel(4)
.execute()
.await?;
for (file, result) in results {
println!("{}: {} -> {}", file, result.input_size(), result.output_size());
}
Ok(())
}

Integration with Vectorizer:

use transmutation::{Converter, OutputFormat};
use vectorizer::VectorizerClient;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let converter = Converter::new()?;
let vectorizer = VectorizerClient::new("http://localhost:15002").await?;
// Convert and embed in one pipeline
let result = converter
.convert("document.pdf")
.to(OutputFormat::EmbeddingReady)
.pipe_to(&vectorizer)
.execute()
.await?;
println!("Embedded {} chunks", result.chunk_count());
Ok(())
}

Python Bindings:

from transmutation import Converter, OutputFormat
# Initialize converter
converter = Converter()
# Convert PDF to Markdown
result = converter.convert(
"document.pdf",
output_format=OutputFormat.Markdown,
split_pages=True,
optimize_for_llm=True
)
result.save("output/document.md")
print(f"Converted {result.page_count()} pages")
# Batch processing
from transmutation import BatchProcessor
batch = BatchProcessor(converter)
results = batch.add_files([
"doc1.pdf",
"doc2.docx",
"doc3.pptx"
]).to(OutputFormat.Markdown).parallel(4).execute()
for file, result in results:
print(f"{file}: {result.input_size()} -> {result.output_size()}")import { Converter, OutputFormat, ConversionOptions } from 'transmutation';
// Initialize converter
const converter = new Converter();
// Convert PDF to Markdown
const result = await converter
.convert('document.pdf')
.to(OutputFormat.Markdown)
.withOptions({
splitPages: true,
optimizeForLlm: true,
extractImages: false
})
.execute();
await result.save('output/document.md');
console.log(`Converted ${result.pageCount()} pages`);
// Batch processing
import { BatchProcessor } from 'transmutation';
const batch = new BatchProcessor(converter);
const results = await batch
.addFiles(['doc1.pdf', 'doc2.docx', 'doc3.pptx'])
.to(OutputFormat.Markdown)
.parallel(4)
.execute();
results.forEach(([file, result]) => {
console.log(`${file}: ${result.inputSize()} -> ${result.outputSize()}`);
});

# Convert research papers for semantic search
transmutation batch papers/*.pdf \
-o embeddings/ \
--optimize-llm \
--split-pages \
--max-chunk-size 512 \
--parallel 8
# Then index with Vectorizer
vectorizer insert --collection research_papers embeddings/*.md

# Convert legacy documents to Markdown
transmutation batch archive/ \
-o markdown/ \
--recursive \
--format markdown \
--parallel 16 \
--progress
# Supported: PDF, DOCX, XLSX, PPTX, RTF, ODT, HTML, XML

# Batch OCR with Tesseract
transmutation batch scans/*.jpg \
-o text/ \
--ocr \
--lang eng \
--dpi 300 \
--parallel 4
# Multi-language support
transmutation convert document_pt.jpg \
-o output.md \
--ocr \
--lang por

# Convert legal PDFs with high precision
transmutation convert contract.pdf \
-o contract.md \
--precision \
--preserve-layout \
--extract-tables \
--include-metadata
# Batch process court documents
transmutation batch cases/*.pdf \
-o processed/ \
--precision \
--parallel 4

# Extract text from arXiv papers
transmutation batch papers/*.pdf \
-o markdown/ \
--split-pages \
--extract-tables \
--normalize-whitespace
# Create embeddings for similarity search
vectorizer insert --collection arxiv markdown/*.md

# Convert Excel to Markdown tables
transmutation convert data.xlsx -o tables.md --format markdown
# Convert to CSV for analysis
transmutation convert data.xlsx -o data.csv --format csv
# Convert to JSON
transmutation convert data.xlsx -o data.json --format json

# Extract text from PowerPoint slides
transmutation convert presentation.pptx \
-o slides/ \
--split-slides \
--extract-images \
--format markdown
# Batch process training materials
transmutation batch trainings/*.pptx \
-o content/ \
--split-slides \
--parallel 8

# Convert saved HTML pages
transmutation batch pages/*.html \
-o markdown/ \
--format markdown \
--normalize-whitespace
# Process downloaded documentation
transmutation batch docs/*.html \
-o processed/ \
--extract-images \
--parallel 4

pub struct ConversionOptions {
// Output control
pub split_pages: bool, // Split output by pages
pub optimize_for_llm: bool, // Optimize for LLM processing
pub max_chunk_size: usize, // Maximum chunk size (tokens)
// Quality settings
pub image_quality: ImageQuality, // High, Medium, Low
pub dpi: u32, // DPI for image output (default: 150)
pub ocr_language: String, // OCR language (default: "eng")
// Processing options
pub preserve_layout: bool, // Preserve document layout
pub extract_tables: bool, // Extract tables separately
pub extract_images: bool, // Extract embedded images
pub include_metadata: bool, // Include document metadata
// Optimization
pub compression_level: u8, // 0-9 for output compression
pub remove_headers_footers: bool,
pub remove_watermarks: bool,
pub normalize_whitespace: bool,
}

| Feature | Transmutation | Docling |
|---|---|---|
| Language | 100% Rust | Python |
| Performance | ✅ 250x faster | Baseline |
| Memory Usage | ✅ ~20MB | ~2-3GB |
| Dependencies | ✅ Zero runtime deps | Python + ML models |
| Deployment | ✅ Single binary (~5MB) | Python env + models (~2GB) |
| Startup Time | ✅ <100ms | ~5-10s |
| Platform Support | ✅ Windows/Mac/Linux | Requires Python |
- LangChain: Document loaders and text splitters
- LlamaIndex: Document readers and node parsers
- Haystack: Document converters and preprocessors
- DSPy: Optimized document processing
Test Document: Attention Is All You Need (arXiv:1706.03762v7.pdf)
Size: 2.22 MB, 15 pages
| Metric | Transmutation | Docling | Improvement |
|---|---|---|---|
| Conversion Time | 0.21s | 52.68s | ✅ 250x faster |
| Processing Speed | 71 pages/sec | 0.28 pages/sec | ✅ 254x faster |
| Memory Usage | ~20MB | ~2-3GB | ✅ 100-150x less |
| Startup Time | <0.1s | ~6s | ✅ 60x faster |
| Output Quality (Fast) | 71.8% similarity | 100% (reference) | - |
| Output Quality (Precision) | 77.3% similarity | 100% (reference) | - |
| Operation | Input Size | Time | Throughput |
|---|---|---|---|
| PDF → Markdown | 2.2MB (15 pages) | 0.21s | 71 pages/s ✅ |
| PDF → Markdown | 10MB (100 pages) | ~1.4s | 71 pages/s |
| Batch (1,000 PDFs) | 2.2GB (15,000 pages) | ~4 min | 3,750 pages/min |
- Base: ~20MB (pure Rust, no Python runtime) ✅
- Per conversion: Minimal (streaming processing)
- No ML models required (unlike Docling's 2-3GB)
Fast Mode (default) - 71.8% similarity:
- ✅ 250x faster than Docling
- ✅ Pure Rust with basic text heuristics
- ✅ Works on any PDF without training
- ✅ Zero runtime dependencies
Precision Mode (--precision) - 77.3% similarity:
- ✅ 250x faster than Docling (same speed as fast mode)
- ✅ Enhanced text processing with space correction
- ✅ +5.5% better than fast mode
- ✅ No hardcoded rules, all generic heuristics
Why not 95%+ similarity?
Docling uses:
- docling-parse (C++ library) - Extracts text with precise coordinates, fonts, and layout info
- LayoutModel (ML) - Deep learning to detect block types (headings, paragraphs, tables) visually
- ReadingOrderModel (ML) - ML-based reading order determination
Transmutation currently provides two modes, with a third (C++ FFI) in development:
1. Fast Mode (default):
   - Pure Rust text extraction (pdf-extract)
   - Generic heuristics (no ML)
   - 71.8% similarity, 250x faster
2. Precision Mode (--precision):
   - Enhanced text processing
   - Generic heuristics + space correction
   - 77.3% similarity, 250x faster
Future: C++ FFI Mode - Direct integration with docling-parse (no Python):
- Will use C++ library via FFI for 95%+ similarity
- No Python dependency, pure Rust + C++ shared library
- In development
| Mode | Similarity | Speed | Memory | Dependencies |
|---|---|---|---|---|
| Fast | 71.8% | 250x | 50 MB | None (pure Rust) |
| Precision | 77.3% | 250x | 50 MB | None (pure Rust) |
| FFI (future) | 95%+ | ~50x | 100 MB | C++ shared lib only |
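If you need to pick a mode programmatically today, one simple option is to drive the CLI and toggle the documented --precision flag. A minimal sketch, assuming the transmutation binary is on PATH (library-level mode selection is not covered here):

```rust
use std::process::{Command, ExitStatus};

// Run the CLI in either fast (default) or precision mode.
fn convert(input: &str, output: &str, precision: bool) -> std::io::Result<ExitStatus> {
    let mut cmd = Command::new("transmutation");
    cmd.args(["convert", input, "-o", output]);
    if precision {
        // --precision enables the enhanced heuristics (77.3% similarity)
        cmd.arg("--precision");
    }
    cmd.status()
}

fn main() -> std::io::Result<()> {
    convert("paper.pdf", "paper_fast.md", false)?;
    convert("paper.pdf", "paper_precision.md", true)?;
    Ok(())
}
```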
See ROADMAP.md for detailed development plan.
- ✅ Project structure and architecture
- ✅ Core converter interfaces
- ✅ PDF conversion (pure Rust - pdf-extract)
- ✅ Advanced Markdown output with intelligent paragraph joining
- ✅ 98x faster than Docling benchmark achieved (97 papers tested)
- ✅ Windows MSI installer with dependency management
- ✅ Custom icons and professional branding
- ✅ Multi-platform installation scripts (5 variants)
- ✅ Build-time dependency detection
- ✅ Comprehensive documentation
- ✅ DOCX conversion (Markdown + Images - Pure Rust)
- ✅ XLSX conversion (Markdown/CSV/JSON - Pure Rust, 148 pg/s)
- ✅ PPTX conversion (Markdown/Images - Pure Rust, 1639 pg/s)
- ✅ HTML/XML conversion (Pure Rust, 2110-2353 pg/s)
- ✅ Text formats (TXT, CSV, TSV, RTF, ODT - Pure Rust)
- ✅ 11 formats total (8 production, 2 beta)
- ✅ Core formats always enabled (no feature flags)
- ✅ Simplified API and user experience
- ✅ Faster compilation
- ✅ Archive handling (ZIP, TAR, TAR.GZ - 1864 pg/s)
- ✅ Batch processing (Concurrent with Tokio - 4,627 pg/s)
- ✅ Image OCR (Tesseract - 6 formats, 88x faster than Docling)
- 🚧 Performance optimizations
- 🚧 Quality improvements (RTF, ODT)
- 🚧 Memory optimizations
- 🚧 v1.0.0 Release
See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
See CHANGELOG.md for detailed version history and release notes.
Current Version: 0.3.0 (December 6, 2025)
- GitHub: https://github.com/hivellm/transmutation
- Documentation: https://docs.hivellm.org/transmutation
- Changelog: CHANGELOG.md
- Docling Project: https://github.com/docling-project
- HiveLLM Vectorizer: https://github.com/hivellm/vectorizer
Built with ❤️ by the HiveLLM Team
Pure Rust implementation - No Python, no ML model dependencies
Powered by:
- lopdf - Pure Rust PDF parsing
- docx-rs - Pure Rust DOCX parsing
- Tesseract - OCR engine (optional)
- FFmpeg - Multimedia processing (optional)
Inspired by Docling, but built to be faster, lighter, and easier to deploy.
Status: ✅ v0.3.0 - Performance & Memory Optimization Release
Latest Updates (v0.3.0):
- ⚡ Memory Optimization: Cached regex patterns, pre-allocated buffers
- 🔧 Fixed O(n²) Issue: Page extraction now O(n) for split-pages mode
- 📉 Reduced Memory Pressure: Early release of PDF bytes after extraction
- 📉 Lower Memory Footprint: Especially beneficial for library usage