Phishing URL Detection Using Machine Learning

A machine learning project for detecting phishing URLs with three approaches: TF-IDF, BERT, and hybrid models. It implements and compares several classification algorithms for identifying malicious URLs.

🎯 Overview

Phishing attacks remain one of the most prevalent cybersecurity threats. This project develops and evaluates multiple machine learning models to automatically classify URLs as either phishing or legitimate. The project explores three main approaches:

  1. TF-IDF based models - Traditional feature extraction using Term Frequency-Inverse Document Frequency
  2. BERT based models - Deep learning approach using pre-trained BERT embeddings
  3. Hybrid models - Combination of TF-IDF and BERT features for enhanced performance

✨ Features

  • Balanced Dataset: 10,000 URLs (5,000 phishing + 5,000 legitimate)
  • Multiple ML Algorithms: SVM, Decision Trees, Random Forest, XGBoost, Neural Networks
  • Multiple Feature Extraction Methods: TF-IDF, BERT embeddings, and hybrid approaches
  • Comprehensive Evaluation: Detailed performance metrics and model comparisons
  • End-to-End Pipeline: From data preprocessing to model evaluation
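
A balanced split like the one in the first bullet can be obtained by down-sampling each class to the same size. The following is a minimal sketch using only the standard library; the helper name `balance_by_label` and the toy URLs are illustrative assumptions, and the actual procedure in 1-dataset-preparation.ipynb may differ.

```python
import random
from collections import defaultdict


def balance_by_label(rows, per_class, seed=42):
    """Down-sample (url, label) rows so each label has exactly per_class rows.

    Hypothetical helper for illustration; not taken from the notebooks.
    """
    by_label = defaultdict(list)
    for url, label in rows:
        by_label[label].append((url, label))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label, group in sorted(by_label.items()):
        if len(group) < per_class:
            raise ValueError(f"label {label} has only {len(group)} rows")
        balanced.extend(rng.sample(group, per_class))

    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced


# Toy input: 6 legitimate (0) and 4 phishing (1) URLs, all made up.
rows = [(f"https://site{i}.example", 0) for i in range(6)]
rows += [(f"http://login-verify{i}.example", 1) for i in range(4)]

sample = balance_by_label(rows, per_class=3)
counts = {label: sum(1 for _, l in sample if l == label) for label in (0, 1)}
print(counts)  # {0: 3, 1: 3}
```

For the dataset described here, the equivalent call would use `per_class=5000`.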

📊 Dataset

The dataset combines URLs from two sources:

  • PhishTank: phishing URLs
  • Kaggle: benign (legitimate) URLs

Dataset Structure

url,type,label
https://example.com,legitimate,0
https://malicious-site.com,phishing,1

  • url: The URL string
  • type: Class name, either 'legitimate' or 'phishing'
  • label: Binary label (0 for legitimate, 1 for phishing)
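
Because type and label encode the same information, a quick consistency check when loading the file can catch mislabeled rows early. This is a sketch under the assumption that the CSV follows the url,type,label layout described above; `load_rows` and `TYPE_TO_LABEL` are hypothetical names, not part of the repository.

```python
import csv
import io

# Two sample rows in the url,type,label layout described above.
SAMPLE_CSV = """url,type,label
https://example.com,legitimate,0
https://malicious-site.com,phishing,1
"""

# Expected pairing between the text label and the binary label.
TYPE_TO_LABEL = {"legitimate": 0, "phishing": 1}


def load_rows(handle):
    """Parse rows and check that type and label agree on every row."""
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        label = int(row["label"])
        if TYPE_TO_LABEL.get(row["type"]) != label:
            raise ValueError(f"line {line_no}: type/label mismatch in {row}")
        rows.append((row["url"], label))
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows)  # [('https://example.com', 0), ('https://malicious-site.com', 1)]
```

To check the real file, pass an open file handle instead of the in-memory sample, for example `open('datasets/dataset.csv', newline='')`.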

📁 Project Structure

phishing_detection_model/
├── README.md
├── requirements.txt
├── datasets/
│   ├── dataset.csv              # Combined dataset
│   └── processed_dataset.csv    # Processed dataset with features
├── notebooks/
│   ├── 1-dataset-preparation.ipynb
│   ├── 2-feature-extraction.ipynb
│   ├── 3-model-engineering.ipynb
│   ├── 4-model-engineering-TFID.ipynb
│   ├── 5-model-engineering-BERT.ipynb
│   └── 6-model-engineering-TFIDF+BERT.ipynb

Notebooks Description

  1. 1-dataset-preparation.ipynb: Data collection, cleaning, and balancing
  2. 2-feature-extraction.ipynb: URL feature engineering and preprocessing
  3. 3-model-engineering.ipynb: Initial model development and evaluation
  4. 4-model-engineering-TFID.ipynb: TF-IDF based classification models
  5. 5-model-engineering-BERT.ipynb: BERT-based deep learning models
  6. 6-model-engineering-TFIDF+BERT.ipynb: Hybrid approach combining both methods

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

  1. Clone the repository:
git clone https://github.com/wissemgrari/phishing_detection_model.git
cd phishing_detection_model
  2. Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

💻 Usage

Running the Notebooks

  1. Start Jupyter Notebook:
jupyter notebook
  2. Run notebooks in order:
    • Start with 1-dataset-preparation.ipynb to prepare the dataset
    • Progress through each notebook sequentially
    • Each notebook builds upon the previous ones

Quick Start Example

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Load the combined dataset (columns: url, type, label)
data = pd.read_csv('datasets/dataset.csv')

# Vectorize URLs using TF-IDF, keeping the 5,000 most frequent terms
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(data['url'])
y = data['label']

# Train the model; a fixed seed makes the run reproducible
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Predict the class of a new URL (0 = legitimate, 1 = phishing)
prediction = model.predict(tfidf.transform(['http://suspicious-url.com']))
print(prediction[0])
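
The quick start trains on the full dataset, so no rows are left over for measuring accuracy on unseen URLs. One option is to hold out a test set while keeping the class ratio intact. The sketch below uses only the standard library so it runs without extra dependencies; `stratified_split` and the toy URLs are hypothetical and not part of the repository.

```python
import random


def stratified_split(urls, labels, test_fraction=0.2, seed=42):
    """Split URLs into train and test sets, preserving the class ratio.

    Hypothetical helper for illustration; not part of the repository.
    Returns (train_urls, train_labels, test_urls, test_labels).
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(set(labels)):
        group = [u for u, l in zip(urls, labels) if l == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test += [(u, label) for u in group[:n_test]]
        train += [(u, label) for u in group[n_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return (
        [u for u, _ in train], [l for _, l in train],
        [u for u, _ in test], [l for _, l in test],
    )


# Toy data: 10 legitimate (0) and 10 phishing (1) URLs, all made up.
urls = [f"https://site{i}.example" for i in range(10)]
urls += [f"http://verify-account{i}.example" for i in range(10)]
labels = [0] * 10 + [1] * 10

train_urls, train_labels, test_urls, test_labels = stratified_split(urls, labels)
print(len(train_urls), len(test_urls), sorted(test_labels))  # 16 4 [0, 0, 1, 1]
```

With scikit-learn installed, `train_test_split(X, y, test_size=0.2, stratify=y)` from `sklearn.model_selection` performs an equivalent stratified split. In either case, fitting the vectorizer on the training URLs only avoids leaking test-set vocabulary into the model.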

🤖 Models

The project implements and compares the following classifiers:

Traditional ML Models

  • Support Vector Machine (SVM)
  • Decision Tree Classifier
  • Random Forest Classifier
  • XGBoost Classifier

Deep Learning Models

  • Neural Networks with TF-IDF features
  • BERT-based classifiers
  • Hybrid TF-IDF + BERT models

Feature Extraction Methods

  1. TF-IDF (Term Frequency-Inverse Document Frequency)

    • Converts URLs into numerical vectors
    • Captures character-level and token-level patterns
    • Fast and efficient for traditional ML algorithms
  2. BERT (Bidirectional Encoder Representations from Transformers)

    • Pre-trained language model embeddings
    • Captures semantic and contextual information
    • Strong results on many text classification tasks
  3. Hybrid Approach

    • Combines TF-IDF and BERT features
    • Leverages both statistical and semantic information
    • Potential for improved classification accuracy
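
To make the TF-IDF and hybrid ideas concrete, here is a small standard-library sketch: it builds character 3-gram TF-IDF vectors for URLs and concatenates them with a dense embedding. All function names are hypothetical, the embedding is a zero-filled stand-in rather than real BERT output, and the notebooks may use different tokenization and settings.

```python
import math
from collections import Counter


def char_ngrams(url, n=3):
    """Return the overlapping character n-grams of a URL."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


def fit_idf(urls, n=3):
    """Compute a smoothed inverse document frequency for each n-gram."""
    doc_freq = Counter()
    for url in urls:
        doc_freq.update(set(char_ngrams(url, n)))  # count each URL once
    total = len(urls)
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, so every weight stays positive.
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(url, idf, vocab, n=3):
    """Map a URL to a dense TF-IDF vector ordered by vocab."""
    counts = Counter(char_ngrams(url, n))
    return [counts[g] * idf[g] for g in vocab]


def hybrid_vector(tfidf_part, embedding_part):
    """Concatenate TF-IDF features with a dense embedding."""
    return list(tfidf_part) + list(embedding_part)


urls = ["https://example.com", "http://example-login.com"]
idf = fit_idf(urls)
vocab = sorted(idf)  # fixed feature order shared by all URLs

# Stand-in for a BERT embedding; BERT-base would produce 768 values per input.
fake_embedding = [0.0] * 4

features = hybrid_vector(tfidf_vector(urls[0], idf, vocab), fake_embedding)
print(len(vocab), len(features))
```

N-grams shared by every URL (such as "exa" here) receive the minimum weight of 1.0, while rarer n-grams are weighted higher. At dataset scale the TF-IDF part is usually kept sparse, and a function such as `scipy.sparse.hstack` can join it with the dense embedding matrix.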

📈 Results

Performance metrics for each model are documented in the respective notebooks. Key evaluation metrics include:

  • Accuracy: Overall classification accuracy
  • Precision: Proportion of predicted positives that are actually positive
  • Recall: Proportion of actual positives correctly identified
  • F1-Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed breakdown of predictions
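
The metrics above all derive from the four confusion-matrix counts. The sketch below computes them for a binary task with phishing (label 1) as the positive class; the helper names and the toy predictions are illustrative assumptions, not results from the notebooks.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.625, 'precision': 0.6, 'recall': 0.75, 'f1': 0.667}
```

For phishing detection, recall on the phishing class is often the metric to watch, since a false negative means a malicious URL was let through.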

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • PhishTank for providing phishing URL data
  • Kaggle for the benign URLs dataset
  • The open-source community for the amazing libraries and tools

📧 Contact

Wissem Grari - @wissemgrari

Project Link: https://github.com/wissemgrari/phishing_detection_model


⭐ If you find this project helpful, please consider giving it a star!
