# Machine Learning-Based Phishing URL Detection System with Threat Intelligence & Visualization
Created by farixzz
Click here to try the live application!
This is a production-grade security tool that leverages a state-of-the-art machine learning pipeline and real-time threat intelligence to detect phishing URLs. The project features a dual interface: a user-friendly web GUI for interactive analysis and a powerful CLI for automation and batch processing.
## Key Features

**Batch Analysis & Threat Mapping (GUI):**
- Upload a CSV file containing hundreds of URLs for efficient batch processing.
- Visualize the geographic locations of detected phishing domains on an interactive Global Threat Map.
- Download the complete analysis results as a CSV file.

**Production-Grade Machine Learning Model:**
- A pipeline combining a TF-IDF vectorizer with a LightGBM classifier, trained on a dataset of over 500,000 verified URLs.
- Data-Driven Auto-Thresholding: The detection threshold isn't a guess. It's automatically calculated using ROC curve analysis during each training cycle to maintain a security-first posture (optimized for >= 95% recall).
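
The auto-thresholding idea can be illustrated with a dependency-free sketch. It assumes validation labels (1 = phishing) and model scores are available, and it picks the highest threshold that still meets the recall target, which is the same set of operating points a ROC curve enumerates. The function name `pick_threshold` and the sample data are hypothetical and are not taken from `train_model.py`.

```python
def pick_threshold(y_true, scores, min_recall=0.95):
    """Return the highest score threshold whose recall is >= min_recall.

    y_true: iterable of 0/1 labels (1 = phishing).
    scores: iterable of predicted phishing probabilities.
    """
    y_true = list(y_true)
    scores = list(scores)
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive (phishing) sample")

    # Try each distinct score as a threshold, strictest first.
    # Lowering the threshold can only increase recall, so the first
    # threshold that meets the target is the highest one that does.
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(
            1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold
        )
        if true_pos / positives >= min_recall:
            return threshold

    # Only reachable if min_recall > 1.0, which no threshold can satisfy.
    raise ValueError("min_recall must be <= 1.0")


if __name__ == "__main__":
    labels = [1, 1, 1, 1, 0, 0]              # illustrative data only
    probs = [0.9, 0.8, 0.7, 0.4, 0.6, 0.1]
    print(pick_threshold(labels, probs, min_recall=0.95))
```

Choosing the threshold from a recall target rather than fixing it at 0.5 trades some extra false positives for fewer missed phishing URLs, which matches the security-first posture described above.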

**Explainable AI (XAI):**
- When a phishing URL is detected, the tool provides a list of "Potential Red Flags" (e.g., presence of suspicious keywords, use of an IP address), explaining why a URL was flagged.
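
A minimal sketch of how such red flags might be derived is shown here. It covers only the two examples named above (suspicious keywords and IP-address hosts). The function name `find_red_flags` and the keyword list are assumptions for illustration; the project's actual rules live in `ui_helpers.py` and may differ.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical keyword list, for illustration only.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def find_red_flags(url: str) -> list:
    """Return human-readable reasons why a URL looks suspicious."""
    flags = []

    # urlsplit only recognizes the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    # Flag 1: the host is a raw IP address rather than a domain name.
    try:
        ipaddress.ip_address(host)
        flags.append("Uses an IP address instead of a domain name")
    except ValueError:
        pass  # Not an IP address, so nothing to report.

    # Flag 2: the URL contains words commonly used as phishing bait.
    lowered = url.lower()
    hits = [kw for kw in SUSPICIOUS_KEYWORDS if kw in lowered]
    if hits:
        flags.append("Contains suspicious keywords: " + ", ".join(hits))

    return flags


if __name__ == "__main__":
    for flag in find_red_flags("http://192.168.0.1/login"):
        print("-", flag)
```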

**Real-Time API Intelligence:**
- Enhances ML predictions by cross-referencing URLs with the VirusTotal API, leveraging data from over 70 security vendors.
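
As a rough illustration of the lookup, the sketch below builds a request for the VirusTotal v3 URL report endpoint. In that API a URL is identified by its URL-safe base64 encoding with the padding removed, and the key is sent in the `x-apikey` header. The helper names are hypothetical, and the standard library is used here only to keep the example self-contained; the project itself lists Requests as its HTTP client, and `api_checker.py` may be structured differently.

```python
import base64
import urllib.request

VT_URL_ENDPOINT = "https://www.virustotal.com/api/v3/urls/"


def vt_url_id(url: str) -> str:
    """Encode a URL as a VirusTotal v3 URL identifier (base64url, no padding)."""
    encoded = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    return encoded.rstrip("=")


def build_vt_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the URL's analysis report."""
    return urllib.request.Request(
        VT_URL_ENDPOINT + vt_url_id(url),
        headers={"x-apikey": api_key, "Accept": "application/json"},
    )


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder; no network call is made here.
    request = build_vt_request("http://www.example.com/", "YOUR_API_KEY")
    print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` returns a JSON report. Free API keys are rate limited, so batch jobs would likely need to throttle or cache lookups.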

**Dual Interface for All Users:**
- Graphical User Interface (GUI): An intuitive and polished web application built with Streamlit.
- Powerful Command-Line Interface (CLI): A feature-rich CLI for scripting, automation, and integration, supporting JSON and CSV output.

**Enterprise-Grade Alerting:**
- Includes a SIEM integration feature that can send high-confidence alerts in the standard Common Event Format (CEF), allowing it to plug into a professional Security Operations Center (SOC).
## Architecture
```mermaid
flowchart LR
    U[User / Analyst] --> UI[Streamlit UI / CLI]
    UI --> N[URL Normalizer]
    N --> F[Feature Extraction]
    F --> M[ML Pipeline<br/>TF-IDF + LightGBM]
    M --> P[Probability Scoring<br/>+ Auto Threshold]
    P --> E[Explainability Engine<br/>Red Flags]
    P --> TI[Threat Intelligence<br/>VirusTotal API]
    P --> G[Threat Map<br/>Geo Visualization]
    P --> S[SIEM Alerts<br/>CEF Format]
    TI --> UI
    G --> UI
    E --> UI
```
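
The first stage in the flow above is URL normalization. The exact rules in `url_normalizer.py` are not documented here, so the following is only a guess at what such a step might do: trim whitespace, add a default scheme, lowercase the scheme and host, and drop the fragment. The function name `normalize_url` and every rule in it are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> str:
    """Return a canonical form of a URL (illustrative rules only)."""
    candidate = raw.strip()

    # Bare inputs such as "example.com/login" need a scheme to parse correctly.
    if "://" not in candidate:
        candidate = "http://" + candidate

    parts = urlsplit(candidate)

    # Hostnames are case-insensitive; .hostname is already lowercased
    # and has any userinfo and port stripped off.
    host = (parts.hostname or "").lower()
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Keep path and query as-is (they are case-sensitive); drop the fragment,
    # which is never sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path or "/", parts.query, "")
    )


if __name__ == "__main__":
    print(normalize_url("  Example.COM/Login#top "))
```

Normalizing before feature extraction means trivially different spellings of the same URL produce the same features and the same score.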
## Technology Stack
- **Backend & ML:** Python, Scikit-learn, LightGBM, Pandas, Joblib
- **GUI:** Streamlit
- **CLI:** Click
- **Visualization:** Folium, Streamlit-Folium
- **APIs & Data:** Requests, tldextract
- **Packaging:** PyInstaller
- **Version Control:** Git, Git LFS
---
## Setup and Installation
1. **Clone this repository:**
```bash
# Replace the URL with your repository's URL from GitHub
git clone https://github.com/farixzz/phishing-detector-ml.git
cd phishing-detector-ml
```
2. **Set up the environment:**
```bash
python3 -m venv venv
# On Linux/macOS:
source venv/bin/activate
# On Windows: venv\Scripts\activate
```
3. **Install dependencies:**
```bash
pip install -r requirements.txt
```
4. **Configure API Keys:**
- Create a copy of `config.py.template` and rename it to `config.py`.
- Open `config.py` and add your free API key from **VirusTotal**.
---
## How to Use
1. **Graphical User Interface (GUI)**
- Launch the Streamlit web application for the most user-friendly experience.
```bash
streamlit run app.py
```
2. **Command-Line Interface (CLI)**
- The CLI is ideal for automation and batch processing.
- Analyze a Single URL
```bash
python main.py --url "https://www.example.com"
```
- Analyze URLs from a File and Save Results
```bash
python main.py --input-file urls.txt --output-file results.csv
```
- Get JSON Output for Scripting
```bash
python main.py --url "http://suspicious-site.com" --json-output
```
## Project Structure
```
phishing-detector-ml/
├── data/                 # Raw datasets used for aggregation
├── models/               # Trained .joblib model (tracked with Git LFS)
├── .gitignore            # Git ignore rules (data/, config.py, etc.)
├── README.md             # Project documentation
├── aggregate_data.py     # Dataset aggregation & cleaning
├── api_checker.py        # VirusTotal API integration
├── app.py                # Streamlit GUI application
├── config.py.template    # Template for local API keys
├── detector.py           # Core analysis & prediction logic
├── geo_utils.py          # Threat map geolocation helper
├── main.py               # CLI entry point
├── requirements.txt      # Minimal, cloud-safe dependencies
├── siem_alerter.py       # CEF-based SIEM alert generator
├── train_model.py        # Model training with auto-threshold tuning
├── ui_helpers.py         # Explainability (Red Flags) logic
└── url_normalizer.py     # URL normalization & preprocessing
```
## License
This project is intended for educational and research purposes only. Do not use this tool for illegal, unethical, or unauthorized activities.
## Author
**farixzz**
- Portfolio: https://farixzz.github.io
- GitHub: https://github.com/farixzz
If you found this project useful, feel free to star the repository!
