Phishing URL Detection Using Machine Learning

A machine learning project for detecting phishing URLs with TF-IDF features, BERT embeddings, and hybrid models that combine the two. The project implements and compares several classification algorithms for separating phishing URLs from legitimate ones.

🎯 Overview

Phishing attacks remain one of the most prevalent cybersecurity threats. This project develops and evaluates multiple machine learning models to automatically classify URLs as either phishing or legitimate. The project explores three main approaches:

TF-IDF based models - Traditional feature extraction using Term Frequency-Inverse Document Frequency
BERT based models - Deep learning approach using pre-trained BERT embeddings
Hybrid models - Combination of TF-IDF and BERT features for enhanced performance

✨ Features

Balanced Dataset: 10,000 URLs (5,000 phishing + 5,000 legitimate)
Multiple ML Algorithms: SVM, Decision Trees, Random Forest, XGBoost, Neural Networks
Multiple Feature Extraction Methods: TF-IDF, BERT embeddings, and hybrid approaches
Comprehensive Evaluation: Detailed performance metrics and model comparisons
End-to-End Pipeline: From data preprocessing to model evaluation

📊 Dataset

The dataset combines URLs from two sources:

PhishTank: 5,000 phishing URLs from PhishTank
Kaggle Benign URLs: 5,000 legitimate URLs from the Malicious and Benign URLs dataset on Kaggle
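
As a rough illustration of how two sources can be combined into a balanced set, the sketch below samples the same number of URLs from each list and attaches the type and label values. The function name build_balanced_dataset, the seed, and the tiny in-memory lists are assumptions made for this example; the project's own preparation steps are in 1-dataset-preparation.ipynb and may differ.

```python
import random


def build_balanced_dataset(phishing_urls, legitimate_urls, per_class, seed=42):
    """Sample `per_class` URLs from each source and return shuffled rows."""
    rng = random.Random(seed)
    # Deduplicate first so the same URL cannot be sampled twice.
    phishing = sorted(set(phishing_urls))
    legitimate = sorted(set(legitimate_urls))
    if min(len(phishing), len(legitimate)) < per_class:
        raise ValueError("not enough unique URLs to sample per_class from each source")
    rows = [(url, "phishing", 1) for url in rng.sample(phishing, per_class)]
    rows += [(url, "legitimate", 0) for url in rng.sample(legitimate, per_class)]
    rng.shuffle(rows)
    return rows


# Toy inputs (hypothetical); the project itself uses 5,000 URLs per class.
phishing_source = [
    "http://login-verify.test/a",
    "http://secure-update.test/b",
    "http://free-gift.test/c",
]
legitimate_source = [
    "https://example.com",
    "https://example.org/docs",
    "https://example.net/about",
]

dataset = build_balanced_dataset(phishing_source, legitimate_source, per_class=2)
for row in dataset:
    print(row)
```

Sampling the same count from each class keeps accuracy meaningful as a metric, since a classifier that always predicts one class would score only 50%.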

Dataset Structure

url,type,label
https://example.com,legitimate,0
https://malicious-site.com,phishing,1

url: The URL string
type: Class name, either 'legitimate' or 'phishing'
label: Binary label (0 for legitimate, 1 for phishing)
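
Because type and label carry the same information, a quick consistency check can catch rows where they disagree. The sketch below uses only the standard library and an in-memory sample; the function name validate_rows and the sample rows are assumptions for illustration, not part of the repository.

```python
import csv
import io

EXPECTED_COLUMNS = ["url", "type", "label"]
TYPE_TO_LABEL = {"legitimate": "0", "phishing": "1"}


def validate_rows(csv_file):
    """Return (row_line, problem) tuples for rows that break the schema."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [(1, f"unexpected header: {reader.fieldnames}")]
    problems = []
    # Data rows start on line 2, right after the header.
    for row_line, row in enumerate(reader, start=2):
        if not row["url"]:
            problems.append((row_line, "empty url"))
        elif TYPE_TO_LABEL.get(row["type"]) != row["label"]:
            problems.append((row_line, "type/label mismatch or unknown type"))
    return problems


# Hypothetical sample; the last row is deliberately inconsistent.
sample = io.StringIO(
    "url,type,label\n"
    "https://example.com,legitimate,0\n"
    "https://malicious-site.com,phishing,1\n"
    "https://mismatch.test,phishing,0\n"
)
problems = validate_rows(sample)
print(problems)
```

To check the real file, pass an open handle instead of the sample, for example `open('datasets/dataset.csv', newline='')`.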

📁 Project Structure

phishing_detection_model/
├── README.md
├── requirements.txt
├── datasets/
│   ├── dataset.csv              # Combined dataset
│   └── processed_dataset.csv    # Processed dataset with features
├── notebooks/
│   ├── 1-dataset-preparation.ipynb
│   ├── 2-feature-extraction.ipynb
│   ├── 3-model-engineering.ipynb
│   ├── 4-model-engineering-TFID.ipynb
│   ├── 5-model-engineering-BERT.ipynb
│   └── 6-model-engineering-TFIDF+BERT.ipynb

Notebooks Description

1-dataset-preparation.ipynb: Data collection, cleaning, and balancing
2-feature-extraction.ipynb: URL feature engineering and preprocessing
3-model-engineering.ipynb: Initial model development and evaluation
4-model-engineering-TFID.ipynb: TF-IDF based classification models
5-model-engineering-BERT.ipynb: BERT-based deep learning models
6-model-engineering-TFIDF+BERT.ipynb: Hybrid approach combining both methods

🚀 Installation

Prerequisites

Python 3.8 or higher
pip package manager

Setup

Clone the repository:

git clone https://github.com/wissemgrari/phishing_detection_model.git
cd phishing_detection_model

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

💻 Usage

Running the Notebooks

Start Jupyter Notebook:

jupyter notebook

Run notebooks in order:
- Start with 1-dataset-preparation.ipynb to prepare the dataset
- Progress through each notebook sequentially
- Each notebook builds upon the previous ones

Quick Start Example

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Load dataset
data = pd.read_csv('datasets/dataset.csv')

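# Vectorize URLs using TF-IDF (the default analyzer splits each URL into word tokens)
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(data['url'])
y = data['label']

# Train model on the full dataset (no held-out split; for a quick demo only)
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Predict: 1 = phishing, 0 = legitimate
prediction = model.predict(tfidf.transform(['http://suspicious-url.com']))
print(prediction)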
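
The quick start fits on the whole dataset, so it gives no estimate of how the model behaves on unseen URLs. A minimal sketch of a held-out evaluation follows. The toy URL list, the 75/25 split, and the character n-gram range are assumptions chosen for illustration, not settings taken from the notebooks.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); with the real dataset, use
# data['url'] and data['label'] loaded from datasets/dataset.csv.
legitimate = [
    "https://example.com/",
    "https://example.org/docs/getting-started",
    "https://example.net/about",
    "https://www.example.com/blog/2023/release-notes",
    "https://docs.example.org/api/reference",
    "https://shop.example.com/products/42",
    "https://example.edu/courses/intro",
    "https://news.example.net/world",
]
phishing = [
    "http://example-login.test/verify/account?id=123",
    "http://secure-update.test/paypal/signin.php",
    "http://192.0.2.10/bank/confirm/password",
    "http://free-gift.test/claim/now/winner.html",
    "http://account-verify.test/webscr/login",
    "http://update-billing.test/secure/card.php",
    "http://signin.test.invalid/appleid/unlock",
    "http://prize-claim.test/redirect?u=login",
]
urls = legitimate + phishing
labels = [0] * len(legitimate) + [1] * len(phishing)

# Stratify so both classes appear in the test split in equal proportion.
X_train, X_test, y_train, y_test = train_test_split(
    urls, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training split only, so no vocabulary
# or IDF statistics from the test split leak into training.
pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=5000),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print(classification_report(
    y_test, y_pred, target_names=["legitimate", "phishing"], zero_division=0
))
```

Scores on a toy list this small say nothing about real performance; the point is the structure of split, fit on train, and report on test.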

🤖 Models

The project implements and compares the following classifiers:

Traditional ML Models

Support Vector Machine (SVM)
Decision Tree Classifier
Random Forest Classifier
XGBoost Classifier
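
One possible way to compare several of these classifiers on identical features is a loop over scikit-learn pipelines with cross-validation. In the sketch below, the toy URLs, the 3-fold setting, the F1 scoring choice, and the use of LinearSVC as the SVM variant are assumptions; XGBoost is left out so the example runs with scikit-learn alone.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data (hypothetical): 6 legitimate and 6 phishing URLs.
urls = [
    "https://example.com/",
    "https://example.org/docs/getting-started",
    "https://example.net/about",
    "https://docs.example.org/api/reference",
    "https://shop.example.com/products/42",
    "https://news.example.net/world",
    "http://example-login.test/verify/account?id=123",
    "http://secure-update.test/paypal/signin.php",
    "http://192.0.2.10/bank/confirm/password",
    "http://free-gift.test/claim/now/winner.html",
    "http://account-verify.test/webscr/login",
    "http://update-billing.test/secure/card.php",
]
labels = [0] * 6 + [1] * 6

candidates = {
    "SVM (linear)": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # xgboost.XGBClassifier could be added here if the xgboost package is installed.
}

scores = {}
for name, classifier in candidates.items():
    # Each candidate gets its own vectorizer, refit inside every fold.
    pipeline = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)), classifier
    )
    # An integer cv with a classifier uses stratified folds.
    fold_scores = cross_val_score(pipeline, urls, labels, cv=3, scoring="f1")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean F1 = {scores[name]:.3f}")
```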

Deep Learning Models

Neural Networks with TF-IDF features
BERT-based classifiers
Hybrid TF-IDF + BERT models

Feature Extraction Methods

TF-IDF (Term Frequency-Inverse Document Frequency)
- Converts URLs into numerical vectors
- Captures character-level and token-level patterns
- Fast and efficient for traditional ML algorithms

BERT (Bidirectional Encoder Representations from Transformers)
- Pre-trained language model embeddings
- Captures semantic and contextual information
- Strong performance on many text classification tasks

Hybrid Approach
- Combines TF-IDF and BERT features
- Leverages both statistical and semantic information
- Potential for improved classification accuracy
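
A common way to combine sparse TF-IDF vectors with dense embeddings is column-wise concatenation. The sketch below shows only that step. The placeholder_embeddings function is a hypothetical stand-in that returns random vectors so the example runs without downloading a model; a real pipeline would substitute actual BERT embeddings. Whether 6-model-engineering-TFIDF+BERT.ipynb combines features this way is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer


def placeholder_embeddings(urls, dim=8, seed=0):
    """Hypothetical stand-in for BERT embeddings: one random dense vector per URL."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(urls), dim))


# Toy URLs (hypothetical).
urls = [
    "https://example.com/",
    "https://example.org/docs/getting-started",
    "http://secure-update.test/paypal/signin.php",
    "http://192.0.2.10/bank/confirm/password",
]

tfidf = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
tfidf_features = tfidf.fit_transform(urls)     # sparse, shape (n_urls, n_ngrams)
dense_features = placeholder_embeddings(urls)  # dense, shape (n_urls, dim)

# Concatenate column-wise; the result stays sparse so the TF-IDF block remains compact.
hybrid_features = hstack([tfidf_features, csr_matrix(dense_features)]).tocsr()
print(tfidf_features.shape, dense_features.shape, hybrid_features.shape)
```

The two blocks can sit on different scales, so scaling the dense block before concatenation may matter for scale-sensitive classifiers such as SVMs.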

📈 Results

Performance metrics for each model are documented in the respective notebooks. Key evaluation metrics include:

Accuracy: Overall classification accuracy
Precision: Proportion of URLs predicted as phishing that are actually phishing
Recall: Proportion of actual phishing URLs that the model identifies
F1-Score: Harmonic mean of precision and recall
Confusion Matrix: Counts of true positives, false positives, true negatives, and false negatives
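
To make these definitions concrete, the sketch below derives each metric from the four confusion matrix counts, treating phishing (label 1) as the positive class. The function names and the example predictions are made up for illustration and are not results from the project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 phishing (1) and 4 legitimate (0) URLs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

print(confusion_counts(y_true, y_pred))  # (3, 2, 1, 2)
print(summarize(y_true, y_pred))
```

Here 3 of the 5 phishing predictions are correct (precision 0.6) and 3 of the 4 phishing URLs are caught (recall 0.75), which shows why the two metrics are reported separately.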

🤝 Contributing

Contributions are welcome. To submit a Pull Request:

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

PhishTank for providing phishing URL data
Kaggle for the benign URLs dataset
The open-source community for the amazing libraries and tools

📧 Contact

Wissem Grari - @wissemgrari

Project Link: https://github.com/wissemgrari/phishing_detection_model

⭐ If you find this project helpful, please consider giving it a star!
