PhishGuard: A Machine Learning-powered Phishing Detection System

The Origin Story

The journey of PhishGuard began in August 2025. As a student at UW-La Crosse, I was constantly bombarded with phishing emails. Every week, students and faculty fell victim to increasingly sophisticated scams, losing access to their accounts and exposing sensitive data.

IT support environment where the project originated

Picture a professor who receives an email that appears to come from our university's IT department, complete with official logos and convincing language. The email contains a link to "verify account security" - a classic phishing tactic. Despite security training, the professor clicks the link and enters their credentials, compromising years of valuable data and potentially their career.

This scenario raised a question: could we build a tool that would catch such an attack before it caused damage?

The Chrome Extension Challenge

My partner on this project, Zach Ydunate, and I initially set out to build a Chrome extension that would analyze URLs in real time as users browsed the web. I spent weeks learning browser extension development, studying the Chrome extension APIs, and designing with Zach a system that could intercept and analyze web requests.

However, I quickly realized the limitations of this approach:

Browser-specific constraints: Each browser has different extension APIs
Limited analysis capabilities: Extensions have restricted access to system resources
User adoption barriers: Users must install and trust the extension
Maintenance overhead: Browser updates frequently break extension functionality

The Pivot to Full-Stack Development

Recognizing these challenges, I made a strategic pivot. Instead of building a browser extension, I decided to create a comprehensive web application that could be accessed by anyone, anywhere, without installation requirements.

This decision led me down the path of full-stack development, where I had to learn:

Backend development with Flask and Python
Frontend design with modern JavaScript and responsive frameworks
Database management for storing user reports and model training data
Machine learning for creating accurate phishing detection models

The Machine Learning Breakthrough

The most challenging and rewarding aspect of the project was developing the machine learning model. I spent months researching existing phishing detection techniques, studying academic papers, and experimenting with different algorithms.

The breakthrough came when I discovered the PhiUSIIL Phishing URL Dataset - a collection of roughly 235,000 labeled URLs, both phishing and legitimate, with detailed feature extraction. This dataset became the foundation for training our model.

Machine learning model development process

Through iterative testing and refinement, I developed a system that analyzes over 50 features drawn from URLs and HTML content and achieves the following performance:

Accuracy: > 95%
Precision: > 94%
Recall: > 93%
F1-Score: > 94%

From Prototype to Production

What started as a simple idea to help protect my campus community evolved into a sophisticated web application used by people around the world. The journey taught me invaluable lessons about:

User-centered design: Creating tools that are both powerful and accessible
Technical problem-solving: Overcoming challenges through research and iteration
Security awareness: Understanding the evolving landscape of cyber threats
Open-source collaboration: Building something that benefits the broader community

The best security tools are those that make protection accessible to everyone, not just experts.

PhishGuard represents more than just a technical achievement - it's a testament to how real-world problems can inspire meaningful innovation. The project continues to evolve, with new features and improvements driven by user feedback and the ever-changing landscape of cybersecurity threats.

A Real Phishing Attack: The Perfect Case Study

To understand why PhishGuard matters, let me walk you through a realistic phishing scenario that shows exactly what our system is designed to detect.

The Attack Scenario

Imagine you're a university student checking your email. You receive a message that appears to be from your university's IT department:

Example of a phishing email targeting university students

The email looks legitimate at first glance:

Official university logo
Professional formatting
Urgent but believable language
A link to "verify your account security"

The Hidden Danger

When you click the link, you're taken to a website that looks almost identical to your university's login page. However, there are subtle clues that something is wrong:

URL Analysis: The domain might be something like university-security.verify.com instead of the official university.edu
SSL Certificate: The site might have a self-signed or expired certificate
HTML Structure: The page might have suspicious JavaScript or hidden tracking elements
Image Sources: The logo might be loaded from an external server for tracking purposes
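
The URL clue in particular can be checked mechanically. Here is a minimal sketch, assuming the official domain is known in advance. The function name `is_official_host` and the example hosts are hypothetical and are not taken from PhishGuard's code:

```python
from urllib.parse import urlparse


def is_official_host(url: str, official_domain: str) -> bool:
    """Return True if the URL's host is the official domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower().rstrip(".")
    official = official_domain.lower()
    # Match the whole host or a dot-separated suffix, so that
    # "university.edu.evil.com" and "notuniversity.edu" do not pass.
    return host == official or host.endswith("." + official)


print(is_official_host("https://login.university.edu/sso", "university.edu"))
print(is_official_host("https://university-security.verify.com/login", "university.edu"))
```

Comparing the parsed hostname rather than searching the raw URL string matters here, because attackers often place the trusted name somewhere else in the URL.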

How PhishGuard Would Detect This Attack

Let me demonstrate how our system would analyze this malicious URL:

Step 1: URL Structure Analysis

URL Length: 67 characters (suspiciously long)
Domain Mismatch: Not from official university domain
Special Characters: Multiple hyphens and unusual patterns
HTTPS Status: Present but certificate validation fails

Detailed feature breakdown showing suspicious indicators
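
The indicators in Step 1 are lexical, so they can be computed from the URL string alone. The sketch below is illustrative only: the function name, the feature names, and the definition of a "special character" (anything that is not a letter or digit) are assumptions, not PhishGuard's actual feature definitions.

```python
from urllib.parse import urlparse


def url_structure_features(url: str) -> dict:
    """Compute a few lexical features of a URL without any network access."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_hyphens": url.count("-"),
        "num_dots_in_host": host.count("."),
        # Assumed definition: characters that are neither letters nor digits.
        "num_special_chars": sum(not ch.isalnum() for ch in url),
        "has_https": parsed.scheme == "https",
    }


print(url_structure_features("https://university-security.verify.com/login"))
```

Certificate validation, the fourth indicator, needs a network connection and is left out of this sketch.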

Step 2: HTML Content Analysis

Form Action: Points to external server
JavaScript: Contains obfuscated code
Meta Tags: Missing or suspicious
External Resources: Loads images from multiple unknown domains

HTML analysis revealing malicious patterns
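
As an example of the first check in Step 2, the following sketch finds forms that submit to a different host than the page itself. It uses only Python's standard library; the class and function names are hypothetical, and PhishGuard's own parser may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormActionCollector(HTMLParser):
    """Collect the action attribute of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing action submits to the page itself, so default to "".
            self.actions.append(dict(attrs).get("action") or "")


def external_form_actions(html: str, page_url: str) -> list:
    """Return absolute form targets whose host differs from the page's host."""
    collector = FormActionCollector()
    collector.feed(html)
    page_host = urlparse(page_url).hostname
    # Resolve relative actions such as "/login" against the page URL first.
    targets = [urljoin(page_url, action) for action in collector.actions]
    return [t for t in targets if urlparse(t).hostname != page_host]
```

A login form whose credentials go to an unrelated host is a strong signal, which is why relative actions are resolved before the hosts are compared.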

Step 3: Image Analysis

Tracking Pixels: Hidden 1x1 transparent images
External Logos: University logo loaded from attacker's server
Suspicious Alt Text: Contains keywords for SEO manipulation

Image analysis detecting tracking mechanisms
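
A rough sketch of the tracking-pixel check, assuming the pixel declares its size through `width` and `height` attributes. The names below are hypothetical, and images sized through CSS or only by their actual file dimensions would not be caught by this simplified version.

```python
from html.parser import HTMLParser


class TrackingPixelFinder(HTMLParser):
    """Collect the src of <img> tags declared as at most 1x1 pixels."""

    def __init__(self):
        super().__init__()
        self.pixels = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        width = (attributes.get("width") or "").strip()
        height = (attributes.get("height") or "").strip()
        if width in ("0", "1") and height in ("0", "1"):
            self.pixels.append(attributes.get("src") or "")


def find_tracking_pixels(html: str) -> list:
    """Return the sources of all images that look like tracking pixels."""
    finder = TrackingPixelFinder()
    finder.feed(html)
    return finder.pixels
```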

Step 4: Final Assessment

Based on our machine learning model's analysis of all these features:

Final risk assessment with confidence scoring

Risk Level: HIGH
Confidence: 94%
Prediction: PHISHING
Recommendation: DO NOT ENTER CREDENTIALS
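
A minimal sketch of how a model probability could be turned into a verdict like the one above. The function name `assess_risk` and the 0.8 and 0.5 band boundaries are assumptions for illustration; the actual cut-offs used by PhishGuard are not stated here.

```python
def assess_risk(probability: float, threshold: float = 0.5) -> dict:
    """Turn a phishing probability in [0, 1] into a user-facing verdict."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    is_phishing = probability >= threshold
    # Band boundaries are illustrative assumptions, not PhishGuard's real cut-offs.
    if probability >= 0.8:
        level = "HIGH"
    elif probability >= 0.5:
        level = "MEDIUM"
    else:
        level = "LOW"
    return {
        "risk_level": level,
        "risk_score": round(probability * 100),
        "prediction": "PHISHING" if is_phishing else "LEGITIMATE",
    }


print(assess_risk(0.94))
```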

Real-World Validation

We tested PhishGuard against thousands of known phishing URLs from databases like PhishTank and found that our system successfully identified 94.7% of malicious sites while maintaining a false positive rate of less than 1.2%.

In other words, out of every 100 phishing attempts PhishGuard catches about 95, while incorrectly flagging roughly 1 of every 100 legitimate sites.

Continuous Learning

What makes PhishGuard even more powerful is its ability to learn from new attack patterns. When users report false positives or previously undetected phishing sites, our shadow model retraining system incorporates this feedback to improve future detection rates.

Model improvement through user feedback and retraining

This continuous learning approach ensures that PhishGuard stays ahead of evolving phishing tactics, providing ongoing protection against both known and emerging threats.

The real-world impact of this technology is immeasurable. Every phishing attempt that's caught means one less person whose identity, finances, or personal data is compromised. It's this potential to make a tangible difference in people's lives that drives the continued development and improvement of PhishGuard.

Technology Stack

PhishGuard leverages a modern technology stack designed for scalability, performance, and maintainability:

Backend: Flask web framework with Gunicorn WSGI server
Machine Learning: Scikit-learn with Decision Tree and Random Forest classifiers
Frontend: Vanilla JavaScript with responsive design principles
Caching: Redis for high-performance caching of predictions
Database: SQLite for storing user reports and corrections
Deployment: Docker containerization with production-ready configuration

The architecture follows a microservices-inspired design, allowing for easy scaling and maintenance of different components.

Front-end Design

The user interface was designed with simplicity and accessibility in mind. Users can simply paste a URL into the input field and receive an instant analysis with clear visual indicators:

LEGITIMATE: Safe website with high confidence
PHISHING: Potential threat detected
Confidence Score: Percentage indicating model certainty
Feature Analysis: Breakdown of factors contributing to the prediction

The dashboard is fully responsive, working seamlessly across desktop, tablet, and mobile devices. The design prioritizes clear communication of security information without overwhelming users with technical details.

How It Works

PhishGuard operates through a sophisticated multi-step process that combines URL analysis with machine learning prediction:

1. URL Validation and Normalization

The system first validates and normalizes the input URL to ensure consistent analysis:

def validate_accessible_http_url(url: str, require_html: bool = False, timeout: float = 8.0):
    """Validates URL accessibility and returns final redirect URL."""
    # Comprehensive URL validation logic
    # Network reachability testing
    # Content type verification
    return is_valid, error_message, final_url
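
The function above is shown only in outline. As a rough illustration of the normalization part alone (no network access), here is a sketch built on the standard library. The name `normalize_http_url` and the specific rules (default to http, lowercase the host, drop the fragment) are assumptions rather than PhishGuard's actual behavior, and the sketch ignores IPv6 literals and embedded credentials.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_http_url(raw: str) -> str:
    """Return a normalized http(s) URL, or raise ValueError if it is not usable."""
    candidate = raw.strip()
    # Assume http when the user pastes a bare host such as "example.com/login".
    if "://" not in candidate:
        candidate = "http://" + candidate
    parts = urlsplit(candidate)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    # Lowercase the host, keep an explicit port, and drop the fragment,
    # which is never sent to the server.
    netloc = parts.hostname
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))


print(normalize_http_url("  HTTP://Example.COM:8080/Path?q=1#frag "))
```

Normalizing before analysis means that two spellings of the same address produce the same features and the same cache key.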

2. Feature Extraction

We extract over 50 different features from each URL, including:

URL length and structure analysis
Domain reputation and SSL certificate validation
HTML content analysis for suspicious patterns
Image analysis for tracking pixels and beacons
Network behavior patterns

def featurize_url(url: str, fetch_html_flag: bool = True):
    """Extract comprehensive features from URL and HTML content."""
    # URL structure analysis
    # HTML content parsing
    # Image and resource analysis
    return features_dict, auxiliary_data

3. Machine Learning Prediction

The extracted features are fed into our trained machine learning model:

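def predict_phishing(url: str):
    """Predict if URL is phishing with confidence score."""
    # featurize_url returns (features_dict, auxiliary_data); only the features are used here
    features, _ = featurize_url(url)
    # predict_proba expects a 2-D input (one row per sample); this assumes the dict
    # preserves the feature order the model was trained with
    row = [list(features.values())]
    probability = model.predict_proba(row)[:, 1][0]
    prediction = int(probability >= threshold)
    return prediction, probability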

4. Real-time API Response

The system provides instant feedback through a RESTful API:

curl -X POST -H "Content-Type: application/json" \
     -d '{"url": "http://suspicious-site.example.com"}' \
     http://127.0.0.1:8080/api/predict

Response:

{
  "url": "http://suspicious-site.example.com",
  "prediction": 1,
  "risk_score": 87,
  "risk_level": "HIGH",
  "confidence": "high",
  "features": {
    "url_length": 32,
    "has_https": false,
    "num_special_chars": 5
  }
}

Model Performance

Our machine learning model achieves impressive performance metrics:

Accuracy: > 95%
Precision: > 94%
Recall: > 93%
F1-Score: > 94%
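
For readers unfamiliar with these metrics, the sketch below shows how each one is derived from confusion-matrix counts. The counts are made up purely for illustration and are not PhishGuard's evaluation results.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called detection rate or true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }


# Made-up counts for illustration only: 1,000 phishing and 1,000 legitimate URLs.
metrics = classification_metrics(tp=940, fp=20, tn=980, fn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall answers "how many phishing sites were caught", while the false positive rate answers "how many safe sites were wrongly flagged", so the two are reported separately.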

The model is continuously improved through user feedback and automatic retraining mechanisms. Users can report false positives or negatives, which are used to enhance the model's accuracy over time.

Advanced Features

Image Analysis

PhishGuard includes sophisticated image analysis capabilities:

Detection of tracking pixels and web beacons
Analysis of external image sources
Identification of suspicious image patterns
Size and format validation

Caching System

To ensure lightning-fast response times, we implement a multi-layered caching strategy:

URL-level prediction caching with Redis
Feature extraction result caching
Model inference result caching
Automatic cache invalidation and updates
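
The URL-level caching pattern can be sketched as follows. To keep the example runnable without a server, it uses an in-process dictionary with expiry times as a stand-in for Redis; the class and function names are hypothetical. In a Redis-backed version, the same get-or-compute pattern would typically rely on Redis key expiry instead.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # lazily evict stale entries
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)


def cached_predict(url, cache, predict):
    """Return the cached result for url, computing and storing it on a miss."""
    result = cache.get(url)
    if result is None:
        result = predict(url)
        cache.set(url, result)
    return result
```

Expiring entries after a fixed time is a simple form of invalidation: a site that turns malicious later is re-analyzed once its cached verdict ages out.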

Shadow Model Retraining

The system supports automatic model retraining through shadow models:

Continuous learning from user corrections
Automatic promotion of improved models
A/B testing capabilities for model comparison
Zero-downtime model updates
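
The promotion step can be illustrated with a small sketch: a retrained "shadow" model replaces the production model only if it does measurably better on held-out data. Models are represented here as plain callables, and the function names, the accuracy criterion, and the `min_gain` margin are all assumptions rather than PhishGuard's actual promotion rule.

```python
def accuracy(model, examples) -> float:
    """Fraction of (features, label) examples the model labels correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def choose_model(production, shadow, holdout, min_gain: float = 0.01):
    """Return the shadow model only if it beats production by at least min_gain."""
    production_score = accuracy(production, holdout)
    shadow_score = accuracy(shadow, holdout)
    if shadow_score >= production_score + min_gain:
        return shadow
    return production


# Toy models and data, for illustration only.
holdout = [({"url_length": 80}, 1), ({"url_length": 20}, 0),
           ({"url_length": 60}, 1), ({"url_length": 30}, 0)]
production = lambda f: int(f["url_length"] > 70)
shadow = lambda f: int(f["url_length"] > 50)
print(choose_model(production, shadow, holdout) is shadow)
```

Requiring a margin before promotion guards against swapping models on the basis of noise in a small holdout set.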

Security Considerations

PhishGuard implements several security measures:

Input validation and sanitization
Rate limiting to prevent abuse
Secure handling of user reports
Privacy-focused data collection
HTTPS enforcement in production
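
As one example from this list, rate limiting can be implemented with a sliding window per client. The sketch below is a generic illustration of that technique; which algorithm and limits PhishGuard actually uses is not specified here, and the class name is hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request handler would call `allow()` with a client identifier such as the IP address and reject the request when it returns False.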

Deployment and Scaling

The application is designed for easy deployment and scaling:

Development Setup

conda activate base
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python serverHTML.py

Production Deployment

conda activate base
gunicorn --config gunicorn.conf.py serverHTML:app

Load Testing

wrk -t4 -c100 -d30s -s predict.lua http://127.0.0.1:8080/api/predict

Expected performance: ~1,500 predictions/second with ~70ms average latency.
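
These two figures are consistent with the load-test settings. Assuming each of the 100 connections keeps one request in flight at a time, Little's law (requests in flight = throughput x latency) gives a quick sanity check:

```python
def expected_throughput(connections: int, avg_latency_seconds: float) -> float:
    """Little's law rearranged: throughput = requests in flight / latency."""
    return connections / avg_latency_seconds


# 100 connections (-c100) at ~70 ms average latency
print(round(expected_throughput(connections=100, avg_latency_seconds=0.070)))  # prints 1429
```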

Live Demo: See PhishGuard in Action

To truly appreciate the power and usability of PhishGuard, I've created a comprehensive video demonstration that walks through the entire application from start to finish.

Comprehensive Analysis

Over 50 different features analyzed per URL
Real-time processing with instant results
Detailed breakdown of security indicators
Confidence scoring for informed decisions

Showcase of Website Attributes and their Probability of Phishing

User-Friendly Interface

Intuitive design requiring no technical knowledge
Clear visual indicators for quick understanding
Detailed explanations for users who want to learn more
Responsive design that works on any device

Feature analysis showing technical details

Educational Value

Helps users understand what makes a website suspicious
Provides insights into modern phishing techniques
Encourages better online security habits
Serves as a learning tool for cybersecurity awareness

Performance Highlights

The demo showcases PhishGuard's impressive performance capabilities:

Response Time: Under 2 seconds for complete analysis
Detection Rate: 94.7% of malicious URLs identified
False Positive Rate: Less than 1.2% for legitimate sites
Scalability: Handles multiple concurrent requests efficiently

Performance metrics and system statistics

Technical Architecture

The video also provides insights into the technical architecture that makes PhishGuard so effective:

Multi-layered Analysis: URL structure, HTML content, and image analysis
Machine Learning Integration: Real-time prediction with confidence scoring
Caching Strategy: Redis-based caching for lightning-fast responses
Security Measures: Comprehensive input validation and rate limiting

Real-World Impact

Most importantly, the demo illustrates how PhishGuard makes a real difference in protecting users:

Early Detection: Catches threats before users enter sensitive information
User Education: Helps people recognize phishing attempts in the future
Accessibility: Makes advanced security analysis available to everyone
Continuous Improvement: Learns from new attack patterns and user feedback

Wrapping Things Up

PhishGuard represents a comprehensive approach to web security, combining machine learning, modern web development practices, and user-centered design. The project demonstrates how open-source collaboration can create meaningful security tools that protect users in an increasingly complex digital landscape.

If you're interested in contributing to the ongoing development of PhishGuard or learning more about the technical implementation, you can find the complete source code and documentation on the GitHub repository.

The project continues to evolve with new features, improved detection algorithms, and enhanced user experience, all driven by the goal of making the internet a safer place for everyone.