A machine learning project for detecting phishing URLs using TF-IDF features, BERT embeddings, and hybrid models that combine the two. It implements and compares several classification algorithms for identifying malicious URLs.
Phishing attacks remain one of the most prevalent cybersecurity threats. This project develops and evaluates multiple machine learning models to automatically classify URLs as either phishing or legitimate. The project explores three main approaches:
- TF-IDF based models - Traditional feature extraction using Term Frequency-Inverse Document Frequency
- BERT based models - Deep learning approach using pre-trained BERT embeddings
- Hybrid models - Combination of TF-IDF and BERT features for enhanced performance
Key features:
- Balanced Dataset: 10,000 URLs (5,000 phishing + 5,000 legitimate)
- Multiple ML Algorithms: SVM, Decision Trees, Random Forest, XGBoost, Neural Networks
- Multiple Feature Extraction Methods: TF-IDF, BERT embeddings, and hybrid approaches
- Comprehensive Evaluation: Detailed performance metrics and model comparisons
- End-to-End Pipeline: From data preprocessing to model evaluation
The dataset combines URLs from two sources:
- PhishTank: 5,000 phishing URLs
- Kaggle Benign URLs: 5,000 legitimate URLs from the Malicious and Benign URLs dataset
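The 5,000 + 5,000 split above implies some balancing step when the two sources are merged. The following is only a minimal sketch of one way to do it, not the notebook's actual code: the function name `balance_urls`, the seed, and the toy URLs are all assumptions made for illustration.

```python
import random

def balance_urls(phishing, legitimate, per_class=5000, seed=42):
    """Return shuffled (url, type, label) rows with equal class counts."""
    # Deduplicate while preserving order so repeated URLs do not skew a class.
    phishing = list(dict.fromkeys(phishing))
    legitimate = list(dict.fromkeys(legitimate))
    # Never sample more than the smaller source can provide.
    n = min(per_class, len(phishing), len(legitimate))
    rng = random.Random(seed)
    rows = [(url, "phishing", 1) for url in rng.sample(phishing, n)]
    rows += [(url, "legitimate", 0) for url in rng.sample(legitimate, n)]
    rng.shuffle(rows)
    return rows

# Toy stand-ins for the two sources (hypothetical URLs, not real data).
phish = [f"http://login-verify-{i}.example.net/account" for i in range(8)]
legit = [f"https://www.example.org/page/{i}" for i in range(5)]

rows = balance_urls(phish, legit)
print(len(rows), sum(label for _, _, label in rows))
```

Capping the sample size at the smaller source keeps the classes equal even when one source yields fewer usable URLs than requested.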
The combined dataset is stored as CSV with three columns:
url,type,label
https://example.com,legitimate,0
https://malicious-site.com,phishing,1
- url: The URL string
- type: Class name, either 'legitimate' or 'phishing'
- label: Binary label (0 for legitimate, 1 for phishing)
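Because `type` and `label` encode the same information, a quick consistency check while loading can catch mislabeled rows. This is a hedged sketch using only the standard library; `load_rows` and the in-memory `SAMPLE` are hypothetical names, and the real file would be opened from `datasets/dataset.csv` instead.

```python
import csv
import io

# In-memory copy of the sample rows shown above.
SAMPLE = """url,type,label
https://example.com,legitimate,0
https://malicious-site.com,phishing,1
"""

# Expected numeric label for each class name.
EXPECTED = {"legitimate": 0, "phishing": 1}

def load_rows(handle):
    """Parse CSV rows and verify that `type` and `label` agree."""
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        label = int(row["label"])
        if EXPECTED.get(row["type"]) != label:
            raise ValueError(
                f"line {line_no}: type {row['type']!r} does not match label {label}"
            )
        rows.append((row["url"], row["type"], label))
    return rows

rows = load_rows(io.StringIO(SAMPLE))
print(rows)
```

URLs that contain commas need to be quoted in the CSV file for this to parse correctly; the `csv` module handles quoted fields on its own.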
phishing_detection_model/
├── README.md
├── requirements.txt
├── datasets/
│ ├── dataset.csv # Combined dataset
│ └── processed_dataset.csv # Processed dataset with features
├── notebooks/
│ ├── 1-dataset-preparation.ipynb
│ ├── 2-feature-extraction.ipynb
│ ├── 3-model-engineering.ipynb
│ ├── 4-model-engineering-TFID.ipynb
│ ├── 5-model-engineering-BERT.ipynb
│ └── 6-model-engineering-TFIDF+BERT.ipynb
- 1-dataset-preparation.ipynb: Data collection, cleaning, and balancing
- 2-feature-extraction.ipynb: URL feature engineering and preprocessing
- 3-model-engineering.ipynb: Initial model development and evaluation
- 4-model-engineering-TFID.ipynb: TF-IDF based classification models
- 5-model-engineering-BERT.ipynb: BERT-based deep learning models
- 6-model-engineering-TFIDF+BERT.ipynb: Hybrid approach combining both methods
- Python 3.8 or higher
- pip package manager
- Clone the repository:
    git clone https://github.com/wissemgrari/phishing_detection_model.git
    cd phishing_detection_model
- Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Start Jupyter Notebook:
    jupyter notebook
- Run notebooks in order:
  - Start with 1-dataset-preparation.ipynb to prepare the dataset
  - Progress through each notebook sequentially
  - Each notebook builds upon the previous ones
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
# Load dataset
data = pd.read_csv('datasets/dataset.csv')
# Vectorize URLs using TF-IDF
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(data['url'])
y = data['label']
# Train model
model = RandomForestClassifier()
model.fit(X, y)
# Predict
prediction = model.predict(tfidf.transform(['http://suspicious-url.com']))
The project implements and compares the following classifiers:
- Support Vector Machine (SVM)
- Decision Tree Classifier
- Random Forest Classifier
- XGBoost Classifier
- Neural Networks with TF-IDF features
- BERT-based classifiers
- Hybrid TF-IDF + BERT models
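To compare several of the classifiers listed above on equal footing, each one can be trained on the same TF-IDF features and scored on the same held-out split. The sketch below assumes scikit-learn is installed and uses a toy list of hypothetical URLs in place of `datasets/dataset.csv`; XGBoost, the neural network, and the BERT-based models are left out to keep it short, and the hyperparameters are placeholders rather than the notebooks' settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical URLs); the real project reads datasets/dataset.csv.
urls = [f"https://www.example.org/docs/page{i}" for i in range(20)]
urls += [f"http://secure-login-{i}.example.net/verify?id={i}" for i in range(20)]
labels = [0] * 20 + [1] * 20

train_urls, test_urls, y_train, y_test = train_test_split(
    urls, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit the vectorizer on the training split only, so no test vocabulary leaks in.
tfidf = TfidfVectorizer(max_features=5000)
X_train = tfidf.fit_transform(train_urls)
X_test = tfidf.transform(test_urls)

models = {
    "SVM": LinearSVC(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Train every model on identical features and record held-out accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Scores on toy data like this say nothing about real-world performance; the point is only the shared split and shared vectorizer, which keep the comparison fair.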
- TF-IDF (Term Frequency-Inverse Document Frequency)
  - Converts URLs into numerical vectors
  - Captures character-level and token-level patterns
  - Fast and efficient for traditional ML algorithms
- BERT (Bidirectional Encoder Representations from Transformers)
  - Pre-trained language model embeddings
  - Captures semantic and contextual information
  - Strong results on many text classification tasks
- Hybrid Approach
  - Combines TF-IDF and BERT features
  - Leverages both statistical and semantic information
  - Potential for improved classification accuracy
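To make the feature extraction methods above concrete, here is a small pure-Python sketch of character n-gram TF-IDF followed by the hybrid concatenation step. Several things are assumptions for illustration: the 3-gram size, the helper names, and especially `fake_embedding`, which merely stands in for a real BERT vector. The IDF formula uses the same smoothing as scikit-learn's default (`smooth_idf=True`).

```python
import math
from collections import Counter

def char_ngrams(url, n=3):
    """Split a URL into overlapping character n-grams."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]

def fit_idf(urls, n=3):
    """Compute smoothed inverse document frequencies over a URL corpus."""
    doc_freq = Counter()
    for url in urls:
        doc_freq.update(set(char_ngrams(url, n)))
    total = len(urls)
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so no weight is ever zero.
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}

def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def tfidf_vector(url, idf, vocab, n=3):
    """Return an L2-normalized TF-IDF vector over a fixed vocabulary order."""
    counts = Counter(char_ngrams(url, n))
    return l2_normalize([counts[g] * idf[g] for g in vocab])

def hybrid_vector(tfidf_vec, dense_vec):
    """Concatenate TF-IDF features with a normalized dense embedding."""
    return tfidf_vec + l2_normalize(dense_vec)

corpus = ["https://example.com", "https://malicious-site.com"]
idf = fit_idf(corpus)
vocab = sorted(idf)

# Placeholder for a BERT embedding; a real one would come from a pre-trained model.
fake_embedding = [0.3, -0.4, 0.0, 1.2]

tfidf_vec = tfidf_vector(corpus[1], idf, vocab)
features = hybrid_vector(tfidf_vec, fake_embedding)
print(len(vocab), len(features))
```

Normalizing both parts before concatenation keeps either feature family from dominating purely because of scale, which is one common design choice for hybrid inputs, not necessarily the one used in the notebooks.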
Performance metrics for each model are documented in the respective notebooks. Key evaluation metrics include:
- Accuracy: Overall classification accuracy
- Precision: Proportion of correct positive predictions
- Recall: Proportion of actual positives correctly identified
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed breakdown of predictions
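The metrics above all derive from the four confusion-matrix counts. As a hedged illustration (the function names and toy labels are made up here, and the notebooks may well use a library's built-in metrics instead), they can be computed directly, treating phishing (label 1) as the positive class:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case two of the four phishing URLs are missed, so recall is 0.5 even though accuracy is 0.625, which is why recall deserves separate attention in phishing detection.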
Contributions are welcome! Please feel free to submit a Pull Request. For major changes:
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- PhishTank for providing phishing URL data
- Kaggle for the benign URLs dataset
- The open-source community for the amazing libraries and tools
Wissem Grari - @wissemgrari
Project Link: https://github.com/wissemgrari/phishing_detection_model
⭐ If you find this project helpful, please consider giving it a star!