A blockchain digging framework for constructing vulnerability-tagged smart contract datasets.
The DIVE framework builds blockchain datasets through a pipeline of six core components:
Fetch smart contract and account data from public blockchains.
- ✅ Currently supports Ethereum.
- 🔗 Uses Etherscan.io as a data source.
- 📊 Collects:
  - Contract metadata
  - Account-level information
  - Opcodes
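As an illustration of the collection step, the sketch below builds a request URL for Etherscan's `getsourcecode` action and picks a few metadata fields out of a decoded response. It is not DIVE's own code: the function names are hypothetical, and the endpoint shown is the classic Etherscan API URL, which may differ depending on the API version your key targets.

```python
from urllib.parse import urlencode

# Assumption: classic Etherscan endpoint; newer API versions use a different URL.
ETHERSCAN_API = "https://api.etherscan.io/api"


def build_source_request(address: str, api_key: str) -> str:
    """Return the URL that asks Etherscan for a contract's metadata and source."""
    query = urlencode({
        "module": "contract",
        "action": "getsourcecode",
        "address": address,
        "apikey": api_key,
    })
    return f"{ETHERSCAN_API}?{query}"


def parse_contract_info(payload: dict) -> dict:
    """Pick a few metadata fields out of an already-decoded JSON response."""
    if payload.get("status") != "1" or not payload.get("result"):
        raise ValueError(f"Etherscan error: {payload.get('message')}")
    entry = payload["result"][0]
    return {
        "name": entry.get("ContractName", ""),
        "compiler": entry.get("CompilerVersion", ""),
        # Unverified contracts come back with an empty SourceCode string.
        "verified": bool(entry.get("SourceCode")),
    }
```

The URL can be passed to any HTTP client; keeping URL construction and response parsing separate makes both easy to test without network access.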
Retrieve verified contract source code and store it as Solidity (.sol) files.
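A minimal sketch of the storage step follows. The helper name and the one-file-per-address naming scheme are assumptions made for illustration, not necessarily what `extract_SourceCodes.py` does.

```python
from pathlib import Path
from typing import Optional


def save_source(address: str, source_code: str, out_dir: Path) -> Optional[Path]:
    """Write one contract's source to <out_dir>/<address>.sol.

    Returns None when the source string is empty (unverified contract).
    The file naming scheme is a hypothetical choice for this sketch.
    """
    if not source_code.strip():
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{address.lower()}.sol"
    target.write_text(source_code, encoding="utf-8")
    return target
```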
Extract structured features from smart contract attributes, including ABI, Timestamp, Library, TransactionIndex, Code Metrics, Input / Bytecode, and Opcode.
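To make the idea of attribute-based features concrete, here is a small sketch that derives a few counts from a contract's ABI JSON. The feature names are hypothetical and chosen for illustration; the features DIVE actually produces are documented in `Feature list.xlsx`.

```python
import json
from collections import Counter


def abi_features(abi_json: str) -> dict:
    """Derive simple counts from an ABI given as a JSON string."""
    entries = json.loads(abi_json)
    # In the ABI JSON format, a missing "type" defaults to "function".
    kinds = Counter(e.get("type", "function") for e in entries)
    functions = [e for e in entries if e.get("type", "function") == "function"]
    return {
        "n_functions": kinds["function"],
        "n_events": kinds["event"],
        "n_payable": sum(e.get("stateMutability") == "payable" for e in functions),
        "has_fallback": kinds["fallback"] > 0 or kinds["receive"] > 0,
    }
```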
Merge extracted features with ground-truth vulnerability labels to build a structured dataset.
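The merge can be pictured as an inner join on the contract address. The sketch below uses plain dictionaries to stay dependency-free; the `address` and `label` field names are assumptions, and a DataFrame-based join would express the same thing.

```python
def merge_with_labels(features: list[dict], labels: dict[str, int]) -> list[dict]:
    """Inner-join feature rows with labels on the case-insensitive address."""
    normalized = {addr.lower(): label for addr, label in labels.items()}
    merged = []
    for row in features:
        key = row["address"].lower()
        if key in normalized:  # contracts without ground truth are dropped
            merged.append({**row, "label": normalized[key]})
    return merged
```

Normalizing the address case before joining avoids silently losing rows when one source uses checksummed addresses and the other uses lowercase.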
Clean, normalize, and transform the data to prepare it for downstream analysis or machine learning tasks.
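One possible shape for this step is sketched below: drop incomplete and duplicate rows, then rescale numeric columns to [0, 1]. The specific cleaning rules and the choice of min-max scaling are assumptions for illustration; the source does not state which transformations `apply_DataPreprocessing.py` applies.

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(rows: list[dict], numeric_cols: list[str]) -> list[dict]:
    # Clean: skip duplicate addresses and rows with a missing numeric value.
    seen, clean = set(), []
    for row in rows:
        if row["address"] in seen or any(row.get(c) is None for c in numeric_cols):
            continue
        seen.add(row["address"])
        clean.append(dict(row))  # copy so the input rows are not mutated
    # Normalize: rescale each numeric column independently.
    for col in numeric_cols:
        scaled = min_max_scale([float(r[col]) for r in clean]) if clean else []
        for r, v in zip(clean, scaled):
            r[col] = v
    return clean
```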
Generate statistical summaries and visualizations to better understand the dataset's structure and characteristics.
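A rough sketch of the kind of summary this step might compute is shown below, using only the standard library. The function names and the `label` field are hypothetical; plotting is left out.

```python
from collections import Counter
from statistics import mean, median


def summarize(rows: list[dict], column: str) -> dict:
    """Basic descriptive statistics for one numeric column."""
    values = [float(r[column]) for r in rows]
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def label_distribution(rows: list[dict]) -> dict:
    """Number of contracts per label, useful for spotting class imbalance."""
    return dict(Counter(r["label"] for r in rows))
```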
- Python = 3.12.2
- solidity-code-metrics = 0.0.26
  Install using one of the following:
      # Using Yarn
      yarn global add solidity-code-metrics@0.0.26
      # Or using npm
      npm install -g solidity-code-metrics@0.0.26
- 🔑 Etherscan API Key
  Create an account at Etherscan.io and follow their API key guide.
  ⚠️ Do not share your API key publicly.
- Python dependencies are listed in requirements.txt.
  You can install them using: pip install -r requirements.txt
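One way to keep the API key out of shared code is to read it at runtime, as in the sketch below. Both the `ETHERSCAN_API_KEY` environment variable and the `api_key` field in `config.json` are hypothetical names used for illustration; check `Docs/initial-setup.md` for the configuration layout DIVE actually expects.

```python
import json
import os
from pathlib import Path


def load_api_key(config_path: str = "config.json") -> str:
    """Return the Etherscan API key from the environment or a config file."""
    # Hypothetical variable name; the environment takes precedence.
    key = os.environ.get("ETHERSCAN_API_KEY")
    if key:
        return key
    path = Path(config_path)
    if path.exists():
        # Hypothetical field name inside the JSON config.
        key = json.loads(path.read_text(encoding="utf-8")).get("api_key", "")
    if not key:
        raise RuntimeError("No Etherscan API key found")
    return key
```

If the key does live in a local config file, keeping that file out of version control helps avoid publishing it by accident.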
DIVE/
├── Datasets/ # Generated datasets
│ ├── InitialCombinedData/ # Merged raw features before preprocessing
│ └── PreprocessedData/ # Cleaned, transformed datasets for ML
│
├── Docs/
│ ├── initial-setup.md # Step-by-step guide for project installation and configuration
│ └── usage.md # Detailed documentation for using framework functions and scripts
│
├── Features/ # Extracted features
│ ├── API-based/ # Features collected from Etherscan APIs
│ │ ├── AccountInfo/ # Account-level features
│ │ ├── BlockInfo/ # Block transaction counts
│ │ ├── ContractsInfo/ # Contract metadata from Etherscan
│ │ └── Opcodes/ # Opcode data from Etherscan
│ ├── FE-based/ # Feature engineering outputs
│ │ ├── ABI-based/ # Features extracted from ABI
│ │ ├── CodeMetrics/ # Code metric data
│ │ │ ├── CodeMetrics/ # Parsed metric values
│ │ │ └── Reports/ # Raw/edited Markdown metric reports
│ │ │ ├── EditedReports/
│ │ │ ├── OriginalReports/
│ │ │ └── Raw_CodeMetrics/
│ │ ├── Input-based/ # Features derived from the Input attribute
│ │ ├── Library-based/ # Features derived from the Library attribute
│ │ ├── Opcode-based/ # Features derived from opcode-level analysis
│ │ ├── Timestamp-based/ # Features derived from the Timestamp attribute
│ │ └── TransactionIndex/ # Features derived from the TransactionIndex attribute
│
├── Labels/ # Ground-truth labels for contracts
│
├── RawData/ # Data collected or downloaded
│ ├── Samples/ # Extracted Solidity source code samples
│ ├── SamplesSummary/ #
│ └── SC_Addresses/ # CSVs of smart contract addresses
│
├── Scripts/ # Main processing and utility scripts
│
│ ├── FeatureExtraction/ # Scripts for extracting low-level features
│ │ ├── EVM_Opcodes/ # Contains opcode-related resources
│ │ │ └── EVM_Opcodes_*.xlsx # Excel file(s) listing EVM opcodes and metadata
│ │ ├── ABI_FeatureExtraction.py # Extracts features from ABI (Application Binary Interface)
│ │ ├── Bytecode_FeatureExtraction.py # Extracts bytecode-level features
│ │ ├── get_CodeMetrics.py # Calls external tools (i.e., solidity-code-metrics) to compute code metrics
│ │ ├── get_OpcodesList.py # Generates the EVM opcode reference list (EVM_Opcodes_*.xlsx)
│ │ ├── Library_FeatureExtraction.py # Extracts library-based features
│ │ ├── Opcode_FeatureExtraction.py # Extracts features from opcodes (e.g., opcode metrics)
│ │ ├── Timestamp_FeatureExtraction.py # Extracts timestamp-based features
│ │ └── transactionIndex_FeatureExtraction.py # Extracts transactionIndex-based features
│
│ ├── FeatureSelection/ # Script for selecting relevant features for analysis/modeling
│ │ └── get_FilteredFeatures.py # Applies feature selection (uses classification defined in Feature list.xlsx)
│
│ ├── apply_DataPreprocessing.py # Cleans, normalizes, and transforms data
│ ├── apply_FeatureExtraction.py # Coordinates the execution of multiple feature extraction steps
│ ├── construct_FinalData.py # Merges feature sets and labels to construct the final dataset
│ ├── extract_SourceCodes.py # Extracts Solidity source code (included in Etherscan API responses)
│ ├── get_Addresses.py # Loads and filters smart contract addresses from input CSV files
│ ├── get_BlockFeatures.py # Retrieves transaction counts for each block
│ ├── get_ContractFeatures.py # Orchestrates retrieval of contract info from Etherscan
│ └── get_DataStatistics.py # Generates summary statistics and visualizations for the dataset
│
├── Statistics/ # Analysis outputs and statistical summaries
│
├── config.json # Configuration file for paths and API key
├── DIVE_pipeline.yaml # YAML config defining the full data creation pipeline execution
├── DIVE.ipynb # Interactive notebook for demonstrating the framework
├── Feature list.xlsx # Documentation of features and their descriptions
├── LICENSE.md # License: CC BY-NC 4.0
├── README.md # Project overview and usage instructions
├── requirements.txt # Python package dependencies
└── run_DIVE_Pipeline.py # Entrypoint to run the entire pipeline as a script
- See full instructions in Docs/initial-setup.md
- Each function is explained in detail in Docs/usage.md
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
🚫 Patent Rights Reserved
- This project may be covered by pending or granted patents. The authors reserve all rights under applicable patent laws.
- The use of this software does not grant any rights to use patented inventions.
- For commercial licensing or patent-related inquiries, please contact the authors directly.
🛡️ Disclaimer
- DIVE is provided as a research tool and is under active development. While we strive for reliability, we do not provide warranties or guarantees. Please use it responsibly and at your own discretion.