# EasyMDM Advanced

Enhanced Master Data Management (MDM) package with multi-database support.

EasyMDM Advanced is a comprehensive Python package for Master Data Management (MDM), record linkage, and data deduplication. It provides advanced algorithms for similarity matching, flexible survivorship rules, and support for multiple database systems.
## Supported Databases

- PostgreSQL – Full support with connection pooling and optimization
- SQL Server – Native support via ODBC drivers
- SQLite – Lightweight embedded database support
- DuckDB – High-performance analytical database
- CSV Files – Direct file processing with optimized performance
## Similarity Methods

- Jaro-Winkler – Optimized for names and short strings
- Levenshtein – Edit distance with normalization
- Cosine Similarity – TF-IDF based for longer text
- Jaccard – Set-based similarity with n-grams
- Exact Match – Optimized exact comparisons
- FuzzyWuzzy – Multiple fuzzy matching variants
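To illustrate two of these measures, here is a dependency-free sketch of normalized Levenshtein and n-gram Jaccard similarity. These are illustrative stand-ins, not the package's own implementations:

```python
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    # edit distance normalized to [0, 1]
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def jaccard_ngrams(a: str, b: str, n: int = 2) -> float:
    # set-based similarity over character n-grams
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

For example, `levenshtein("kitten", "sitting")` is 3, so the normalized similarity is 1 − 3/7 ≈ 0.57.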
## Blocking Strategies

- Exact Blocking – Traditional exact key matching
- Fuzzy Blocking – Similarity-based candidate generation
- Sorted Neighborhood – Sliding window approach
- RecordLinkage Integration – Standard blocking methods
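The sorted-neighborhood idea can be sketched in a few lines: sort records by a blocking key, then generate candidate pairs only within a sliding window over the sorted order. The function below is an illustrative stand-in, not the package API:

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Candidate pairs from a sliding window over records sorted by a key."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # compare each record only with its next (window - 1) neighbors
        for j in order[pos + 1 : pos + window]:
            pairs.append((min(i, j), max(i, j)))
    return pairs
```

With a key such as `lambda r: r["last_name"][:3]`, near-duplicates sort next to each other and the quadratic all-pairs comparison is avoided.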
## Clustering Methods

- Network-based – Graph connectivity clustering (default)
- Hierarchical – Distance-based hierarchical clustering
- DBSCAN – Density-based clustering with noise detection
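Network-based clustering treats each matched pair as a graph edge and takes connected components as duplicate clusters. Here is a dependency-free union-find sketch of that idea (equivalent to taking connected components of the match graph):

```python
def cluster_matches(n, matched_pairs):
    """Group n records into clusters via union-find over matched pairs."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # records connected by any chain of matches share one root
    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Note the transitive effect: if record 0 matches 1 and 1 matches 2, all three land in one cluster even though 0 and 2 were never compared directly.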
## Survivorship Strategies

- Priority Rules – Condition-based record selection
- Most Recent – Date/timestamp-based resolution
- Source Priority – Trust-based source ordering
- Longest String – Length-based text selection
- Value-based – Highest/lowest numeric values
- Threshold-based – Conditional value selection
- Most Frequent – Frequency-based resolution
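The most-recent strategy, for instance, keeps the newest record in each duplicate cluster. A pandas sketch of that idea (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "cluster_id":   [0, 0, 1, 1],
    "name":         ["Ann Lee", "A. Lee", "Bob Ray", "Robert Ray"],
    "last_updated": pd.to_datetime(
        ["2023-01-01", "2024-06-01", "2022-05-05", "2021-01-01"]),
})

# within each duplicate cluster, keep the most recently updated record
golden = df.loc[df.groupby("cluster_id")["last_updated"].idxmax()]
```

Here `idxmax` returns the row index of the latest `last_updated` per cluster, so `golden` contains one surviving record per cluster.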
## Performance Features

- Parallel Processing – Multi-core similarity computation
- Vectorized Operations – NumPy/Pandas optimization
- Caching – Intelligent similarity caching
- Batch Processing – Memory-efficient large dataset handling
- Numba Integration – JIT compilation for critical paths
## Installation

```bash
# Core dependencies
pip install pandas numpy pyyaml networkx

# Similarity and record linkage
pip install recordlinkage fuzzywuzzy python-Levenshtein jellyfish textdistance

# Database connectors
pip install sqlalchemy psycopg2-binary pyodbc duckdb

# Performance and ML
pip install scikit-learn numba joblib

# Optional: Rich UI and visualization
pip install rich matplotlib seaborn plotly
```

Install the package from source:

```bash
git clone https://github.com/yourusername/easymdm-advanced.git
cd easymdm-advanced
pip install -e .

# Or from PyPI (when available)
pip install easymdm-advanced
```

## Quick Start

### Command line

Create a starter configuration:

```bash
# PostgreSQL example
easymdm create-config --output config.yaml --database-type postgresql

# CSV example
easymdm create-config --output config.yaml --database-type csv --file-path data.csv
```

Test the connection, validate the configuration, and run processing:

```bash
easymdm test-connection --config config.yaml --sample-size 10
easymdm validate-config --config config.yaml
easymdm process --config config.yaml --output ./results --profile --test-config
```

### Python API

```python
from easymdm import MDMEngine, MDMConfig

# Load configuration
config = MDMConfig.from_yaml('config.yaml')

# Create and run MDM engine
engine = MDMEngine(config)

# Test configuration
test_results = engine.test_configuration()
print("Configuration test:", all(test_results.values()))

# Profile input data
profile = engine.get_data_profile()
print(f"Input records: {profile['total_records']:,}")

# Execute MDM processing
result = engine.process()
print(f"Golden records created: {len(result.golden_records):,}")
print(f"Processing time: {result.execution_time:.2f} seconds")
print(f"Output files: {result.output_files}")
```

## Configuration Examples

### PostgreSQL
```yaml
source:
  type: postgresql
  host: localhost
  port: 5432
  database: mydb
  username: user
  password: password
  schema: public
  table: customers
```

### SQL Server

```yaml
source:
  type: sqlserver
  host: localhost
  port: 1433
  database: CustomerDB
  username: user
  password: password
  schema: dbo
  table: Customers
  options:
    driver: "ODBC Driver 17 for SQL Server"
```

### CSV Files

```yaml
source:
  type: csv
  file_path: ./data/customers.csv
  options:
    encoding: utf-8
    delimiter: ","
    na_values: ["", "NULL", "N/A"]
```

### Similarity rules

```yaml
similarity:
  - column: first_name
    method: jarowinkler
    weight: 2.0
    threshold: 0.7
    options:
      lowercase: true
```

### Survivorship rules

```yaml
survivorship:
  rules:
    - column: last_updated
      strategy: most_recent
  priority_rule:
    conditions:
      - column: is_verified
        value: true
        priority: 1
```

## Advanced Usage

- Custom Similarity Functions – define your own matcher class
- Batch Processing – handle large datasets efficiently with multiprocessing
- Performance Benchmarking – test similarity methods and blocking strategies
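For illustration, a priority rule like the `priority_rule` configuration above can be evaluated as: the first condition a record satisfies determines its rank, and the lowest priority number wins. This is a hypothetical helper, not the package API:

```python
def apply_priority_rule(records, conditions):
    """Pick the record ranked best by the first condition it satisfies."""
    def rank(rec):
        for cond in conditions:
            if rec.get(cond["column"]) == cond["value"]:
                return cond["priority"]  # lower number = higher trust
        return float("inf")  # records matching no condition rank last
    return min(records, key=rank)
```

With a condition such as `{"column": "is_verified", "value": True, "priority": 1}`, verified records beat unverified ones regardless of other attributes.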
## Output Files

- `golden_records_TIMESTAMP.csv` – deduplicated golden records
- `review_pairs_TIMESTAMP.csv` – pairs for manual review
- `processing_summary_TIMESTAMP.txt` – human-readable summary
- `detailed_stats_TIMESTAMP.json` – machine-readable statistics
## Performance Tuning

- Memory: Chunked processing, vectorized operations, caching
- CPU: Parallel processing, Numba JIT, batch operations
- I/O: Connection pooling, bulk read/write, compression
## Troubleshooting

- Database connection errors – test the connection and check drivers
- Memory issues – reduce batch size, enable chunking
- Slow similarity – benchmark methods, optimize blocking
## Comparison with Original EasyMDM

| Feature | Original | Advanced |
|---|---|---|
| Databases | CSV, SQLite, DuckDB | + PostgreSQL, SQL Server |
| Similarity | Basic | + Cosine, Jaccard, Fuzzy variants |
| Blocking | Fuzzy only | + Exact, Sorted Neighborhood, RecordLinkage |
| Clustering | Network only | + Hierarchical, DBSCAN |
| Survivorship | Basic | + 8 advanced strategies |
| Performance | Single-thread | Multi-core, Numba JIT, vectorized |
| CLI | Basic | Rich UI, comprehensive commands |
| Output | CSV only | + Review pairs, stats, multiple formats |
| Memory | Load all | + Chunking, streaming, optimization |
## Contributing

Fork the repository, create a feature branch, commit, push, and open a Pull Request. Install in development mode (`pip install -e ".[dev]"`) and run the test suite with `pytest`.
## License

MIT License – see the LICENSE file for details.
## Support

- Documentation: ReadTheDocs
- GitHub Issues & Discussions
- Email: support@easymdm-advanced.com