DSToolkit is a modular Python library designed to accelerate applied data science workflows.
It provides AutoML pipelines, model evaluation utilities, interpretability tools, and feature engineering helpers for:
- Classification
- Regression
- Clustering
The library follows a scikit-learn–inspired API, focusing on:
- Reproducibility
- Clean abstractions
- Practical diagnostics for real-world ML problems

AutoML pipelines:
- Automated training pipelines for:
  - Classification
  - Regression
  - Clustering
- Built-in support for:
  - LightGBM
  - CatBoost
  - HistGradientBoosting
  - KMeans
  - Gaussian Mixture Models
- Hyperparameter optimization
- Holdout and cross-validation strategies

Model evaluation:
- Classification metrics:
  - ROC, PR, KS, calibration curves
  - Custom scorers and lift-based metrics
- Regression diagnostics:
  - Residual analysis
  - Error by quantile
  - True vs predicted
- Clustering evaluation:
  - Silhouette analysis
  - Cluster size distribution
  - Feature statistics per cluster
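
Two of the diagnostics listed above, the KS statistic and error by quantile, are easy to state in a few lines of NumPy. The sketch below shows one reasonable way to compute them and is not DSToolkit's implementation. The function names `ks_statistic` and `error_by_quantile` are hypothetical. It assumes binary labels with both classes present, and enough distinct target values that no quantile bin is empty.

```python
import numpy as np


def ks_statistic(y_true, y_score):
    """Largest gap between the score CDFs of the positive and negative class."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.unique(y_score)
    pos = np.sort(y_score[y_true])
    neg = np.sort(y_score[~y_true])
    # Empirical CDF of each class, evaluated at every observed score.
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / pos.size
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / neg.size
    return float(np.max(np.abs(cdf_pos - cdf_neg)))


def error_by_quantile(y_true, y_pred, n_bins=4):
    """Mean absolute error within each quantile bin of the true target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Interior quantile edges, e.g. the 25th/50th/75th percentiles for n_bins=4.
    edges = np.quantile(y_true, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, y_true, side="right")  # bin index in [0, n_bins)
    abs_err = np.abs(y_true - y_pred)
    return [float(abs_err[bins == b].mean()) for b in range(n_bins)]


# Toy classification scores: higher scores should mean "more likely positive".
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
ks = ks_statistic(labels, scores)
print(f"KS = {ks:.2f}")

# Toy regression output where the error grows with the target value.
y_true = [1, 2, 3, 4, 5, 6, 7, 8]
y_pred = [1.5, 2.5, 3, 4, 5, 7, 9, 11]
mae_per_bin = error_by_quantile(y_true, y_pred, n_bins=4)
print(f"MAE per quartile of y_true: {mae_per_bin}")
```

In the regression example the top quartile carries most of the error. An aggregate metric such as overall MAE would hide that pattern.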

Interpretability:
- SHAP-based explanations
- Feature importance
- Decision tree surrogates (OvR and OvO)
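
A decision tree surrogate approximates a complex model with a shallow, readable tree trained on that model's predictions. As a rough sketch of the one-vs-rest (OvR) variant mentioned above, the example below fits one small tree per class with plain scikit-learn. How DSToolkit builds its surrogates is not shown here, so the random forest "black box", the depth limit, and the variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# Black-box model whose behaviour we want to summarize (illustrative choice).
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
bb_pred = black_box.predict(X)

# One-vs-rest surrogates: one shallow tree per class, trained on the
# black box's predictions (not on the true labels).
surrogates = {}
fidelity = {}
for cls in black_box.classes_:
    target = (bb_pred == cls).astype(int)  # 1 = black box predicted this class
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, target)
    surrogates[cls] = tree
    # Fidelity: how often the surrogate agrees with the black box.
    fidelity[cls] = float(np.mean(tree.predict(X) == target))

for cls, tree in surrogates.items():
    print(f"class {cls} vs rest, fidelity={fidelity[cls]:.2f}")
    print(export_text(tree, feature_names=list(iris.feature_names)))
```

A one-vs-one (OvO) variant would instead fit one tree per pair of classes, using only the rows the black box assigns to either class in the pair.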

Feature engineering:
- Encoders and wrappers compatible with sklearn pipelines
- Custom transformation utilities
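
For readers unfamiliar with what "compatible with sklearn pipelines" requires, the sketch below shows the usual pattern: subclass `TransformerMixin` and `BaseEstimator`, learn state in `fit`, and apply it in `transform`. The `FrequencyEncoder` class is a hypothetical example written for this illustration. It is not one of DSToolkit's encoders.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class FrequencyEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical encoder: replaces each category with its training frequency."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=object)
        # One {category: relative frequency} mapping per column.
        self.maps_ = []
        for col in X.T:
            counts = Counter(col)
            self.maps_.append({k: v / len(col) for k, v in counts.items()})
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        # Categories unseen during fit are encoded as 0.0.
        encoded = [
            [mapping.get(value, 0.0) for value in col]
            for col, mapping in zip(X.T, self.maps_)
        ]
        return np.array(encoded, dtype=float).T


# Toy single-column dataset: "red" rows are the positive class.
X = np.array([["red"], ["blue"], ["red"], ["green"], ["red"], ["blue"]], dtype=object)
y = np.array([1, 0, 1, 0, 1, 0])

# The encoder drops into a pipeline like any built-in transformer.
pipeline = make_pipeline(FrequencyEncoder(), LogisticRegression())
pipeline.fit(X, y)

encoded = FrequencyEncoder().fit_transform(X)
predictions = pipeline.predict(np.array([["red"], ["purple"]], dtype=object))
print(encoded.ravel())
print(predictions)
```

Because the frequencies are learned in `fit` and stored on the instance, the same mapping is reused at prediction time. This keeps training and inference consistent inside a pipeline.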

Project structure:

    src/dstoolkit/
    ├── automl/               # AutoML pipelines
    ├── feature_engine/       # Feature engineering utilities
    ├── metrics/              # Custom metrics and plots
    ├── model/
    │   ├── analysis/         # Model diagnostics and visualization
    │   └── interpretability/ # Explainability tools
    └── preprocessing/        # Data preprocessing helpers

Example notebooks demonstrating usage can be found in the notebooks/ directory.
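
Install the dependencies:

    pip install -r requirements.txt

Or, for development:

    pip install -e .

Quick start:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    from dstoolkit.automl.classifier import AutoMLLightGBMClassifier

    # Illustrative data only (not part of the original snippet); substitute your own split.
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Settings from the original example: score by ROC AUC, 50 trials.
    automl = AutoMLLightGBMClassifier(
        scoring="roc_auc",
        n_trials=50,
    )
    automl.fit(X_train, y_train)
    automl.evaluate(X_test, y_test)

Detailed documentation is available in the docs/ directory: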
- AutoML APIs
- Metrics and scoring
- Model analysis and interpretability
- Feature engineering utilities

This project is licensed under the MIT License; see the LICENSE file for details.