DIVE4Data/DIVE

DIVE Framework

A blockchain digging framework for constructing vulnerability-tagged smart contract datasets.


🔍 Key Features

The DIVE framework builds blockchain datasets through a pipeline of six core components:


1. 🧾 Feature Collection

Fetch smart contract and account data from public blockchains.

  • ✅ Currently supports Ethereum.
  • 🔗 Uses Etherscan.io as a data source.
  • 📊 Collects:
    • Contract metadata
    • Account-level information
    • Opcodes
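As a rough illustration of the collection step, the sketch below builds a request URL for Etherscan's `getsourcecode` action, which returns contract metadata along with any verified source. It assumes the classic `https://api.etherscan.io/api` endpoint; newer Etherscan API versions may need a different base URL or extra parameters, and DIVE's own scripts may query different actions. The function name `build_source_code_request` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: Etherscan's classic endpoint; check the current Etherscan docs before use.
ETHERSCAN_API_URL = "https://api.etherscan.io/api"


def build_source_code_request(address: str, api_key: str) -> str:
    """Return the URL that asks Etherscan for a contract's metadata and verified source."""
    params = {
        "module": "contract",
        "action": "getsourcecode",
        "address": address,
        "apikey": api_key,
    }
    return f"{ETHERSCAN_API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder address and key; replace both before sending a real request.
    print(build_source_code_request("0x" + "0" * 40, "YOUR_API_KEY"))
```

The sketch only constructs the URL, so it can be inspected or logged (with the key redacted) before any network call is made.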

2. 🧠 Solidity Code Extraction

Retrieve verified contract source code (included in Etherscan API responses) and store it as Solidity .sol files.
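A minimal sketch of this step, assuming a parsed `getsourcecode`-style response in which `result` is a list whose first entry carries a `SourceCode` field. The helper name `save_source_code` and the `<address>.sol` naming scheme are assumptions for illustration, not necessarily what `extract_SourceCodes.py` does.

```python
from pathlib import Path
from typing import Optional


def save_source_code(response: dict, address: str, out_dir: Path) -> Optional[Path]:
    """Write the verified source from a parsed API response to <address>.sol."""
    results = response.get("result") or []
    # On errors the API may return a plain string in "result" instead of a list.
    if not isinstance(results, list) or not results:
        return None
    source = results[0].get("SourceCode", "")
    if not source:
        # Unverified contracts come back with an empty SourceCode field.
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{address.lower()}.sol"
    target.write_text(source, encoding="utf-8")
    return target
```

Multi-file contracts may return JSON-encoded sources in `SourceCode`; this sketch writes the field verbatim and leaves any unpacking to a later step.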

3. 🧪 Feature Extraction

Extract structured features from various smart contract attributes, including ABI, Timestamp, Library, TransactionIndex, Code Metrics, Input/Bytecode, and Opcode.
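To make the idea of ABI-based features concrete, the sketch below derives a few counts from an ABI given as a JSON string. The feature names (`n_functions`, `n_payable_functions`, and so on) are hypothetical; the actual features DIVE extracts are documented in Feature list.xlsx.

```python
import json
from collections import Counter


def abi_features(abi_json: str) -> dict:
    """Count entry types and payable functions in a contract ABI (JSON string)."""
    entries = json.loads(abi_json)
    # Per the ABI JSON format, a missing "type" defaults to "function".
    type_counts = Counter(entry.get("type", "function") for entry in entries)
    payable = sum(
        1
        for entry in entries
        if entry.get("type", "function") == "function"
        and entry.get("stateMutability") == "payable"
    )
    return {
        "n_functions": type_counts["function"],
        "n_events": type_counts["event"],
        "n_payable_functions": payable,
        "has_fallback": type_counts["fallback"] > 0,
    }
```

Each contract then contributes one flat dictionary, which maps naturally onto one row of a feature table.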

4. 🏷️ Labeled Data Construction

Merge extracted features with ground-truth vulnerability labels to build a structured dataset.
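The merge can be pictured as an inner join on the contract address. The sketch below uses only the standard library and assumes hypothetical `address` and `label` column names; DIVE's `construct_FinalData.py` may use different keys and a different join strategy.

```python
def merge_features_with_labels(features: list[dict], labels: dict[str, str]) -> list[dict]:
    """Inner-join feature rows with labels on the lowercased contract address."""
    # Normalize case so checksummed and lowercase addresses match each other.
    normalized = {addr.lower(): label for addr, label in labels.items()}
    merged = []
    for row in features:
        key = row["address"].lower()
        if key in normalized:
            merged.append({**row, "address": key, "label": normalized[key]})
    return merged
```

Contracts without a ground-truth label are dropped here; keeping them with an empty label would be an equally reasonable choice, depending on the downstream task.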

5. 🧹 Data Preprocessing

Clean, normalize, and transform the data to prepare it for downstream analysis or machine learning tasks.
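As one possible reading of "clean and normalize", the sketch below drops rows with missing numeric values and min-max scales the remaining columns to [0, 1]. Both choices are assumptions for illustration; `apply_DataPreprocessing.py` may clean and scale the data differently.

```python
def preprocess(rows: list[dict], numeric_columns: list[str]) -> list[dict]:
    """Drop rows with missing numeric values, then min-max scale each numeric column."""
    complete = [r for r in rows if all(r.get(c) is not None for c in numeric_columns)]
    scaled = [dict(r) for r in complete]  # copy so the input rows are not mutated
    for col in numeric_columns:
        values = [float(r[col]) for r in complete]
        if not values:
            continue
        low, high = min(values), max(values)
        span = high - low
        for row, value in zip(scaled, values):
            # A constant column has zero span; map it to 0.0 instead of dividing by zero.
            row[col] = (value - low) / span if span else 0.0
    return scaled
```

Non-numeric columns such as the address or label pass through unchanged.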

6. 📊 Statistical Analysis & Visualization

Generate statistical summaries and visualizations to better understand the dataset's structure and characteristics.
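A small sketch of the kind of summary this step might produce: the label distribution plus the mean and median of one numeric feature. The function name `summarize` and its output keys are hypothetical, and plotting is left out to keep the example within the standard library.

```python
from collections import Counter
from statistics import mean, median


def summarize(rows: list[dict], label_column: str, numeric_column: str) -> dict:
    """Return row count, label distribution, and mean/median of one numeric column."""
    labels = Counter(r[label_column] for r in rows)
    values = [float(r[numeric_column]) for r in rows]
    return {
        "n_rows": len(rows),
        "label_counts": dict(labels),
        "mean": mean(values),
        "median": median(values),
    }
```

The function expects at least one row, since `statistics.mean` raises `StatisticsError` on empty input.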


📦 Requirements

  • Python = 3.12.2

  • solidity-code-metrics = 0.0.26

    Install using one of the following:

    # Using Yarn
    yarn global add solidity-code-metrics@0.0.26
    
    # Or using npm
    npm install -g solidity-code-metrics@0.0.26
  • 🔑 Etherscan API Key

    Create an account at Etherscan.io and follow their API key guide.
    ⚠️ Do not share your API key publicly.

  • Python dependencies are listed in requirements.txt.
    You can install them using:

    pip install -r requirements.txt
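On the API key requirement above: the repository lists config.json as the place for paths and the API key. As an alternative sketch for keeping the key out of shared files, it can be read from an environment variable. The variable name `ETHERSCAN_API_KEY` and the helper below are assumptions, not part of DIVE.

```python
import os


def get_etherscan_api_key(env_var: str = "ETHERSCAN_API_KEY") -> str:
    """Read the API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Etherscan API key.")
    return key
```

Whichever mechanism is used, a config file that contains a real key should stay out of version control.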

📁 Repository Structure

DIVE/
├── Datasets/                    # Generated datasets
│   ├── InitialCombinedData/     # Merged raw features before preprocessing
│   └── PreprocessedData/        # Cleaned, transformed datasets for ML
│
├── Docs/
│   ├── initial-setup.md        # Step-by-step guide for project installation and configuration
│   └── usage.md                # Detailed documentation for using framework functions and scripts
│
├── Features/                    # Extracted features
│   ├── API-based/               # Features collected from Etherscan APIs
│   │   ├── AccountInfo/         # Account-level features
│   │   ├── BlockInfo/           # Block transaction counts
│   │   ├── ContractsInfo/       # Contract metadata from Etherscan
│   │   └── Opcodes/             # Opcode data from Etherscan
│   ├── FE-based/                # Feature engineering outputs
│   │   ├── ABI-based/           # Features extracted from ABI
│   │   ├── CodeMetrics/         # Code metric data
│   │   │   ├── CodeMetrics/     # Parsed metric values
│   │   │   └── Reports/         # Raw/edited Markdown metric reports
│   │   │       ├── EditedReports/
│   │   │       ├── OriginalReports/
│   │   │       └── Raw_CodeMetrics/
│   │   ├── Input-based/         # Features derived from the Input attribute
│   │   ├── Library-based/       # Features derived from the Library attribute
│   │   ├── Opcode-based/        # Features derived from opcode-level analysis
│   │   ├── Timestamp-based/     # Features derived from the Timestamp attribute
│   │   └── TransactionIndex/    # Features derived from the TransactionIndex attribute
│
├── Labels/                      # Ground-truth labels for contracts
│
├── RawData/                     # Data collected or downloaded
│   ├── Samples/                 # Extracted Solidity source code samples
│   ├── SamplesSummary/          # 
│   └── SC_Addresses/            # CSVs of smart contract addresses
│
├── Scripts/                             # Main processing and utility scripts
│
│   ├── FeatureExtraction/               # Scripts for extracting low-level features
│   │   ├── EVM_Opcodes/                 # Contains opcode-related resources
│   │   │   ├── EVM_Opcodes_*.xlsx       # Excel file(s) listing EVM opcodes and metadata
│   │   ├── ABI_FeatureExtraction.py     # Extracts features from ABI (Application Binary Interface)
│   │   ├── Bytecode_FeatureExtraction.py # Extracts bytecode-level features
│   │   ├── get_CodeMetrics.py           # Calls external tools (i.e., solidity-code-metrics) to compute code metrics
│   │   ├── get_OpcodesList.py           # Generates the EVM opcode reference list (EVM_Opcodes_*.xlsx)
│   │   ├── Library_FeatureExtraction.py # Extracts library-based features
│   │   ├── Opcode_FeatureExtraction.py  # Extracts features from opcodes (e.g., opcode metrics)
│   │   ├── Timestamp_FeatureExtraction.py # Extracts timestamp-based features
│   │   └── transactionIndex_FeatureExtraction.py # Extracts transactionIndex-based features
│
│   ├── FeatureSelection/                # Script for selecting relevant features for analysis/modeling
│   │   └── get_FilteredFeatures.py      # Applies feature selection (uses classification defined in Feature list.xlsx)
│
│   ├── apply_DataPreprocessing.py       # Cleans, normalizes, and transforms data
│   ├── apply_FeatureExtraction.py       # Coordinates the execution of multiple feature extraction steps
│   ├── construct_FinalData.py           # Merges feature sets and labels to construct the final dataset
│   ├── extract_SourceCodes.py           # Extracts Solidity source code (included in Etherscan API responses) 
│   ├── get_Addresses.py                 # Loads and filters smart contract addresses from input CSV files
│   ├── get_BlockFeatures.py             # Retrieves transaction counts for each block
│   ├── get_ContractFeatures.py          # Orchestrates retrieval of contract info from Etherscan
│   └── get_DataStatistics.py            # Generates summary statistics and visualizations for the dataset
│
├── Statistics/                  # Analysis outputs and statistical summaries
│
├── config.json                  # Configuration file for paths and API key
├── DIVE_pipeline.yaml           # YAML config defining the full data creation pipeline execution
├── DIVE.ipynb                   # Interactive notebook for demonstrating the framework
├── Feature list.xlsx            # Documentation of features and their descriptions
├── LICENSE.md                   # License: CC BY-NC 4.0
├── README.md                    # Project overview and usage instructions
├── requirements.txt             # Python package dependencies
└── run_DIVE_Pipeline.py         # Entrypoint to run the entire pipeline as a script

🧭 Getting Started

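🔧 Initial Setup: see Docs/initial-setup.md for the step-by-step installation and configuration guide.

🛠️ Using Framework Functions: see Docs/usage.md for detailed documentation of the framework functions and scripts.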


📦 License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

🚫 Patent Rights Reserved

  • This project may be covered by pending or granted patents. The authors reserve all rights under applicable patent laws.
  • The use of this software does not grant any rights to use patented inventions.
  • For commercial licensing or patent-related inquiries, please contact the authors directly.

🛡️ Disclaimer

  • DIVE is provided as a research tool and is under active development. While we strive for reliability, we do not provide warranties or guarantees. Please use it responsibly and at your own discretion.
