juniorcl/data-science-toolkit


Data Science Toolkit

DSToolkit is a modular Python library designed to accelerate applied data science workflows.
It provides AutoML pipelines, model evaluation utilities, interpretability tools, and feature engineering helpers for:

  • Classification
  • Regression
  • Clustering

The library follows a scikit-learn–inspired API, focusing on:

  • Reproducibility
  • Clean abstractions
  • Practical diagnostics for real-world ML problems

Key Features

AutoML

  • Automated training pipelines for:
    • Classification
    • Regression
    • Clustering
  • Built-in support for:
    • LightGBM
    • CatBoost
    • HistGradientBoosting
    • KMeans
    • Gaussian Mixture Models
  • Hyperparameter optimization
  • Holdout and cross-validation strategies

Model Analysis & Metrics

  • Classification metrics:
    • ROC, PR, KS, calibration curves
    • Custom scorers and lift-based metrics
  • Regression diagnostics:
    • Residual analysis
    • Error by quantile
    • True vs predicted
  • Clustering evaluation:
    • Silhouette analysis
    • Cluster size distribution
    • Feature statistics per cluster
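To make two of the classification metrics above concrete, the sketch below computes a KS statistic and a lift value with NumPy and scikit-learn. The function names `ks_statistic` and `lift_at_top` are hypothetical and are not taken from DSToolkit; they only illustrate the usual definitions of these metrics.

```python
import numpy as np
from sklearn.metrics import roc_curve


def ks_statistic(y_true, y_score):
    """Largest gap between the cumulative score distributions of the two classes."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))


def lift_at_top(y_true, y_score, fraction=0.1):
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_top = max(1, int(np.ceil(fraction * len(y_true))))
    # Indices of the highest scores first.
    top_idx = np.argsort(-y_score, kind="stable")[:n_top]
    return float(y_true[top_idx].mean() / y_true.mean())


# Toy labels and scores, invented for the example.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.6, 0.7, 0.9])

ks = ks_statistic(y_true, y_score)
lift = lift_at_top(y_true, y_score, fraction=0.2)
print(ks, lift)
```

In this toy data both positives receive the two highest scores, so the classes are perfectly separated and the top 20% of rows contains only positives.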

Interpretability

  • SHAP-based explanations
  • Feature importance
  • Decision tree surrogates (OvR and OvO)
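The idea behind a one-vs-rest (OvR) decision tree surrogate can be sketched with scikit-learn alone: for each class, a shallow tree is trained to reproduce whether a more complex model predicts that class. This is an assumption about the general technique, not a description of DSToolkit's implementation; the random forest "black box", the tree depth, and the iris data are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# "Black-box" model whose behaviour the surrogates try to approximate.
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
black_box_pred = black_box.predict(X)

surrogates = {}
fidelity = {}
for class_index, class_name in enumerate(data.target_names):
    # One-vs-rest target: did the black box predict this class or not?
    is_class = (black_box_pred == class_index).astype(int)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, is_class)
    surrogates[str(class_name)] = tree
    # Fidelity: share of rows where the surrogate agrees with the black box.
    fidelity[str(class_name)] = float((tree.predict(X) == is_class).mean())

# Human-readable rules of the surrogate for one class.
print(export_text(surrogates["setosa"], feature_names=list(data.feature_names)))
print(fidelity)
```

A surrogate is only as trustworthy as its fidelity, so the agreement rate should be checked before reading the tree's rules as an explanation of the original model.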

Feature Engineering

  • Encoders and wrappers compatible with sklearn pipelines
  • Custom transformation utilities
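For readers unfamiliar with what makes an encoder compatible with sklearn pipelines, the sketch below defines a small transformer that implements `fit` and `transform` and plugs into a `Pipeline`. `FrequencyEncoder` is a hypothetical class written for this example and is not part of DSToolkit; the toy data is also invented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class FrequencyEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical encoder: replaces each category with its training frequency."""

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Learn the relative frequency of every category, column by column.
        self.frequencies_ = {
            col: X[col].value_counts(normalize=True) for col in X.columns
        }
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, freq in self.frequencies_.items():
            # Categories unseen during fit get a frequency of 0.0.
            X[col] = X[col].map(freq).fillna(0.0).astype(float)
        return X


train = pd.DataFrame({"city": ["A", "A", "B", "C"]})
target = [1, 1, 0, 0]

pipeline = Pipeline([
    ("encode", FrequencyEncoder()),
    ("model", LogisticRegression()),
])
pipeline.fit(train, target)

# Reuse the fitted encoder on new data, including an unseen category "D".
encoded = pipeline.named_steps["encode"].transform(pd.DataFrame({"city": ["A", "D"]}))
print(encoded)
```

Because the frequencies are learned in `fit` and only applied in `transform`, the encoder can be cross-validated inside a pipeline without leaking information from validation rows.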

Project Structure

src/dstoolkit/
├── automl/                # AutoML pipelines
├── feature_engine/        # Feature engineering utilities
├── metrics/               # Custom metrics and plots
├── model/
│   ├── analysis/          # Model diagnostics and visualization
│   └── interpretability/  # Explainability tools
└── preprocessing/         # Data preprocessing helpers

Example notebooks demonstrating usage can be found in the notebooks/ directory.

Installation

pip install -r requirements.txt

Or, for development:

pip install -e .

Quick Example

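from dstoolkit.automl.classifier import AutoMLLightGBMClassifier

# X_train, y_train, X_test and y_test are assumed to be an existing train/test split.
automl = AutoMLLightGBMClassifier(
    scoring="roc_auc",
    n_trials=50
)

# Fit on the training split, then evaluate on the held-out split.
automl.fit(X_train, y_train)
automl.evaluate(X_test, y_test)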

Documentation

Detailed documentation is available in the docs/ directory:

  • AutoML APIs
  • Metrics and scoring
  • Model analysis and interpretability
  • Feature engineering utilities

License

This project is licensed under the MIT License. See the LICENSE file for details.
