PhishGuard: A Machine Learning-powered Phishing Detection System

A real-time phishing URL detection system using ML and modern web technologies.

Brendan Lambrecht


The Origin Story

The journey of PhishGuard began in August of 2025. As a student at UW-La Crosse, I would constantly get bombarded with phishing emails. Every week, students and faculty fall victim to increasingly sophisticated scams, losing access to their accounts and compromising sensitive data.

IT support environment where the project originated

A professor could receive an email that appeared to be from our university's IT department, complete with official logos and convincing language. The email could have a link to "verify account security" - a classic phishing tactic. Despite their security training, the professor could click the link and enter their credentials, compromising years of valuable data and potentially their careers.

Scenarios like this sparked a question: could we build a tool that catches such an attack before it causes damage?

The Chrome Extension Challenge

My partner on this project, Zach Ydunate, and I initially set out to create a Chrome extension that would analyze URLs in real time as users browsed the web. I spent weeks learning about browser extension development, studying the Chrome API, and designing with Zach a system that could intercept and analyze web requests.

Chrome extension architecture design

However, I quickly realized the limitations of this approach:

  1. Browser-specific constraints: Each browser has different extension APIs
  2. Limited analysis capabilities: Extensions have restricted access to system resources
  3. User adoption barriers: Users must install and trust the extension
  4. Maintenance overhead: Browser updates frequently break extension functionality

The Pivot to Full-Stack Development

Recognizing these challenges, I made a strategic pivot. Instead of building a browser extension, I decided to create a comprehensive web application that could be accessed by anyone, anywhere, without installation requirements.

This decision led me down the path of full-stack development, where I had to learn:

  • Backend development with Flask and Python
  • Frontend design with modern JavaScript and responsive frameworks
  • Database management for storing user reports and model training data
  • Machine learning for creating accurate phishing detection models

Full-stack architecture of PhishGuard

The Machine Learning Breakthrough

The most challenging and rewarding aspect of the project was developing the machine learning model. I spent months researching existing phishing detection techniques, studying academic papers, and experimenting with different algorithms.

The breakthrough came when I discovered the PhiUSIIL Phishing URL Dataset - a collection of roughly 235,000 labeled URLs (both phishing and legitimate) with pre-extracted features. This dataset became the foundation for training our model.

Machine learning model development process

Through iterative testing and refinement, I developed a system that could analyze over 50 different features from URLs and HTML content, achieving impressive accuracy rates:

  • Accuracy: > 95%
  • Precision: > 94%
  • Recall: > 93%
  • F1-Score: > 94%

From Prototype to Production

What started as a simple idea to help protect my campus community evolved into a sophisticated web application used by people around the world. The journey taught me invaluable lessons about:

  • User-centered design: Creating tools that are both powerful and accessible
  • Technical problem-solving: Overcoming challenges through research and iteration
  • Security awareness: Understanding the evolving landscape of cyber threats
  • Open-source collaboration: Building something that benefits the broader community

The best security tools are those that make protection accessible to everyone, not just experts.

PhishGuard represents more than just a technical achievement - it's a testament to how real-world problems can inspire meaningful innovation. The project continues to evolve, with new features and improvements driven by user feedback and the ever-changing landscape of cybersecurity threats.

A Real Phishing Attack: The Perfect Case Study

To truly understand the importance of PhishGuard, let me walk you through a real phishing attack that demonstrates exactly what our system is designed to detect.

The Attack Scenario

Imagine you're a university student checking your email. You receive a message that appears to be from your university's IT department:

Example of a phishing email targeting university students

The email looks legitimate at first glance:

  • Official university logo
  • Professional formatting
  • Urgent but believable language
  • A link to "verify your account security"

The Hidden Danger

When you click the link, you're taken to a website that looks almost identical to your university's login page. However, there are subtle clues that something is wrong:

  1. URL Analysis: The domain might be something like university-security.verify.com instead of the official university.edu
  2. SSL Certificate: The site might have a self-signed or expired certificate
  3. HTML Structure: The page might have suspicious JavaScript or hidden tracking elements
  4. Image Sources: The logo might be loaded from an external server for tracking purposes

How PhishGuard Would Detect This Attack

Let me demonstrate how our system would analyze this malicious URL:

PhishGuard analysis of a phishing URL

Step 1: URL Structure Analysis

  • URL Length: 67 characters (suspiciously long)
  • Domain Mismatch: Not from official university domain
  • Special Characters: Multiple hyphens and unusual patterns
  • HTTPS Status: Present but certificate validation fails

Detailed feature breakdown showing suspicious indicators
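To make the checks in Step 1 concrete, here is a minimal sketch of how a few of these lexical features could be computed with Python's standard library. The function name, the feature names, and the set of "special" characters are assumptions for illustration, not PhishGuard's actual feature list.

```python
from urllib.parse import urlparse

# Assumed set of "special" characters; the real feature definition may differ.
SPECIAL_CHARS = set("-_@?=&%~+")


def url_structure_features(url: str, official_domain: str) -> dict:
    """Compute an illustrative subset of lexical URL features."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    official = official_domain.lower()
    # The host matches only if it is the official domain or one of its subdomains.
    on_official_domain = host == official or host.endswith("." + official)
    return {
        "url_length": len(url),
        "has_https": parsed.scheme == "https",
        "num_hyphens": host.count("-"),
        # Naive count: ignores multi-part suffixes such as .co.uk.
        "num_subdomains": max(host.count(".") - 1, 0),
        "num_special_chars": sum(ch in SPECIAL_CHARS for ch in url),
        "domain_mismatch": not on_official_domain,
    }


features = url_structure_features(
    "https://university-security.verify.com/login?session=abc",
    "university.edu",
)
print(features)
```

Note that a lookalike host such as notuniversity.edu is still reported as a mismatch, because the comparison requires a dot before the official domain.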

Step 2: HTML Content Analysis

  • Form Action: Points to external server
  • JavaScript: Contains obfuscated code
  • Meta Tags: Missing or suspicious
  • External Resources: Loads images from multiple unknown domains

HTML analysis revealing malicious patterns
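The form-action and external-resource checks from Step 2 can be sketched with the standard library's HTMLParser. The class name and the sample HTML are hypothetical, and the sketch only compares hostnames; it does not attempt to detect obfuscated JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormAndResourceScanner(HTMLParser):
    """Collect external hosts that forms post to and that resources load from."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).hostname
        self.external_form_hosts = set()
        self.external_resource_hosts = set()

    def _external_host(self, link: str):
        # Resolve relative links against the page URL before comparing hosts.
        host = urlparse(urljoin(self.page_url, link)).hostname
        return host if host and host != self.page_host else None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("action"):
            host = self._external_host(attrs["action"])
            if host:
                self.external_form_hosts.add(host)
        elif tag in ("img", "script") and attrs.get("src"):
            host = self._external_host(attrs["src"])
            if host:
                self.external_resource_hosts.add(host)


# Hypothetical page content for illustration.
sample_html = (
    '<form action="https://collector.example.net/steal"><input name="pw"></form>'
    '<img src="/logo.png">'
    '<script src="https://cdn.example.org/a.js"></script>'
)
scanner = FormAndResourceScanner("https://university-security.verify.com/login")
scanner.feed(sample_html)
print(scanner.external_form_hosts, scanner.external_resource_hosts)
```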

Step 3: Image Analysis

  • Tracking Pixels: Hidden 1x1 transparent images
  • External Logos: University logo loaded from attacker's server
  • Suspicious Alt Text: Contains keywords for SEO manipulation

Image analysis detecting tracking mechanisms
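As a rough illustration of the tracking-pixel check in Step 3, the sketch below flags img tags that declare a 0 or 1 pixel size or are hidden via inline style. It relies only on declared attributes, which is an assumption; a fuller analysis would also inspect the fetched image itself.

```python
from html.parser import HTMLParser


class TrackingPixelFinder(HTMLParser):
    """Collect the src of images that look like tracking pixels."""

    def __init__(self):
        super().__init__()
        self.pixels = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        width, height = attrs.get("width"), attrs.get("height")
        style = (attrs.get("style") or "").replace(" ", "").lower()
        # Declared size of 0 or 1 pixel in both dimensions.
        tiny = width in ("0", "1") and height in ("0", "1")
        # Hidden through inline CSS.
        hidden = "display:none" in style or "visibility:hidden" in style
        if tiny or hidden:
            self.pixels.append(attrs.get("src"))


finder = TrackingPixelFinder()
finder.feed(
    '<img src="https://t.example.net/p.gif" width="1" height="1">'
    '<img src="/logo.png" width="200" height="80">'
    '<img src="/b.gif" style="display: none">'
)
print(finder.pixels)
```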

Step 4: Final Assessment

Based on our machine learning model's analysis of all these features:

  • Risk Level: HIGH
  • Confidence: 94%
  • Prediction: PHISHING
  • Recommendation: DO NOT ENTER CREDENTIALS

Final risk assessment with confidence scoring

Real-World Validation

We tested PhishGuard against thousands of known phishing URLs from databases like PhishTank and found that our system successfully identified 94.7% of malicious sites while maintaining a false positive rate of less than 1.2%.

In other words, out of every 100 phishing attempts PhishGuard catches roughly 95, while incorrectly flagging about 1 of every 100 legitimate sites.
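The two rates quoted here come straight from confusion-matrix counts. The counts below are hypothetical, chosen only to reproduce the quoted figures (with 1.2% as the upper bound on the false positive rate); they are not the project's actual test set.

```python
def detection_and_false_positive_rate(tp: int, fn: int, fp: int, tn: int):
    """Return (detection rate, false positive rate) from confusion-matrix counts."""
    detection_rate = tp / (tp + fn)       # share of phishing URLs that were caught
    false_positive_rate = fp / (fp + tn)  # share of legitimate URLs wrongly flagged
    return detection_rate, false_positive_rate


# Hypothetical counts: 1,000 phishing URLs and 1,000 legitimate URLs.
detection, fpr = detection_and_false_positive_rate(tp=947, fn=53, fp=12, tn=988)
print(f"detection rate: {detection:.1%}, false positive rate: {fpr:.1%}")
```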

Continuous Learning

What makes PhishGuard even more powerful is its ability to learn from new attack patterns. When users report false positives or previously undetected phishing sites, our shadow model retraining system incorporates this feedback to improve future detection rates.

Model improvement through user feedback and retraining

This continuous learning approach ensures that PhishGuard stays ahead of evolving phishing tactics, providing ongoing protection against both known and emerging threats.

The real-world impact of this technology is hard to quantify but easy to picture. Every phishing attempt that's caught means one fewer person whose identity, finances, or personal data is compromised. It's this potential to make a tangible difference in people's lives that drives the continued development and improvement of PhishGuard.

Technology stack

PhishGuard leverages a modern technology stack designed for scalability, performance, and maintainability:

  • Backend: Flask web framework with Gunicorn WSGI server
  • Machine Learning: scikit-learn with Decision Tree and Random Forest classifiers
  • Frontend: Vanilla JavaScript with responsive design principles
  • Caching: Redis for high-performance caching of predictions
  • Database: SQLite for storing user reports and corrections
  • Deployment: Docker containerization with production-ready configuration

The architecture follows a microservices-inspired design, allowing for easy scaling and maintenance of different components.

Front-end Design

The user interface was designed with simplicity and accessibility in mind. Users can simply paste a URL into the input field and receive an instant analysis with clear visual indicators:

  • LEGITIMATE: Safe website with high confidence
  • PHISHING: Potential threat detected
  • Confidence Score: Percentage indicating model certainty
  • Feature Analysis: Breakdown of factors contributing to the prediction

The dashboard is fully responsive, working seamlessly across desktop, tablet, and mobile devices. The design prioritizes clear communication of security information without overwhelming users with technical details.

How It Works

PhishGuard operates through a sophisticated multi-step process that combines URL analysis with machine learning prediction:

1. URL Validation and Normalization

The system first validates and normalizes the input URL to ensure consistent analysis:

def validate_accessible_http_url(url: str, require_html: bool = False, timeout: float = 8.0):
    """Validates URL accessibility and returns final redirect URL."""
    # Comprehensive URL validation logic
    # Network reachability testing
    # Content type verification
    return is_valid, error_message, final_url
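The snippet above only shows the function's signature and outline. As a hedged sketch of what the normalization half of such a step might look like, the helper below (a hypothetical name, not part of PhishGuard) cleans up a URL without touching the network; the reachability and content-type checks are deliberately left out.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_http_url(raw: str):
    """Return (is_valid, error_message, normalized_url); no network access."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = "http://" + candidate  # assumption: default to http
    try:
        parts = urlsplit(candidate)
        port = parts.port  # raises ValueError for a malformed port
    except ValueError:
        return False, "malformed URL", None
    if parts.scheme not in ("http", "https"):
        return False, "only http and https URLs are supported", None
    if not parts.hostname:
        return False, "URL has no host", None
    host = parts.hostname  # already lowercased by urlsplit
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    netloc = f"{host}:{port}" if port else host
    # Drop the fragment and any userinfo; default an empty path to "/".
    normalized = urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))
    return True, None, normalized


print(normalize_http_url("  HTTP://Example.COM/Login#top "))
print(normalize_http_url("ftp://example.com/file"))
```

Normalizing before analysis means that trivially different spellings of the same URL produce the same features and the same cache key.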

2. Feature Extraction

We extract over 50 different features from each URL, including:

  • URL length and structure analysis
  • Domain reputation and SSL certificate validation
  • HTML content analysis for suspicious patterns
  • Image analysis for tracking pixels and beacons
  • Network behavior patterns

def featurize_url(url: str, fetch_html_flag: bool = True):
    """Extract comprehensive features from URL and HTML content."""
    # URL structure analysis
    # HTML content parsing
    # Image and resource analysis
    return features_dict, auxiliary_data

3. Machine Learning Prediction

The extracted features are fed into our trained machine learning model:

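import pandas as pd

def predict_phishing(url: str):
    """Predict if URL is phishing with confidence score."""
    # featurize_url returns (features_dict, auxiliary_data); only the features are used here.
    features, _aux = featurize_url(url)
    # predict_proba expects 2-D input; this assumes the model was trained on these column names.
    X = pd.DataFrame([features])
    probability = model.predict_proba(X)[:, 1][0]  # probability of the phishing class
    prediction = int(probability >= threshold)
    return prediction, probability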

4. Real-time API Response

The system provides instant feedback through a RESTful API:

curl -X POST -H "Content-Type: application/json" \
     -d '{"url": "http://suspicious-site.example.com"}' \
     http://127.0.0.1:8080/api/predict

Response:

{
  "url": "http://suspicious-site.example.com",
  "prediction": 1,
  "risk_score": 87,
  "risk_level": "HIGH",
  "confidence": "high",
  "features": {
    "url_length": 34,
    "has_https": false,
    "num_special_chars": 5
  }
}
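One way the fields in the response above could be derived from a model probability is sketched below. The cut-offs for risk_level and confidence are assumptions picked for illustration; the source does not state the thresholds PhishGuard actually uses.

```python
def build_prediction_response(url: str, probability: float, features: dict,
                              threshold: float = 0.5) -> dict:
    """Shape a phishing probability into the JSON fields of the API response."""
    risk_score = round(probability * 100)
    # Assumed cut-offs for the risk level.
    if risk_score >= 70:
        risk_level = "HIGH"
    elif risk_score >= 40:
        risk_level = "MEDIUM"
    else:
        risk_level = "LOW"
    # Confidence reflects distance from the decision threshold, not the class.
    margin = abs(probability - threshold)
    if margin >= 0.3:
        confidence = "high"
    elif margin >= 0.15:
        confidence = "medium"
    else:
        confidence = "low"
    return {
        "url": url,
        "prediction": int(probability >= threshold),
        "risk_score": risk_score,
        "risk_level": risk_level,
        "confidence": confidence,
        "features": features,
    }


response = build_prediction_response(
    "http://suspicious-site.example.com", 0.87, {"has_https": False}
)
print(response)
```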

Model Performance

Our machine learning model achieves impressive performance metrics:

  • Accuracy: > 95%
  • Precision: > 94%
  • Recall: > 93%
  • F1-Score: > 94%

The model is continuously improved through user feedback and automatic retraining mechanisms. Users can report false positives or negatives, which are used to enhance the model's accuracy over time.

Advanced Features

Image Analysis

PhishGuard includes sophisticated image analysis capabilities:

  • Detection of tracking pixels and web beacons
  • Analysis of external image sources
  • Identification of suspicious image patterns
  • Size and format validation

Caching System

To ensure lightning-fast response times, we implement a multi-layered caching strategy:

  • URL-level prediction caching with Redis
  • Feature extraction result caching
  • Model inference result caching
  • Automatic cache invalidation and updates
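The URL-level caching idea can be sketched without a Redis server by using an in-memory dictionary with per-entry expiry. This is a stand-in under stated assumptions: the class name, the key format, and the one-hour default TTL are all hypothetical, and a real deployment would swap the dictionary for Redis calls.

```python
import hashlib
import time


class PredictionCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def key_for(url: str) -> str:
        # Hash the URL so keys have a fixed length regardless of URL size.
        return "pred:" + hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url: str):
        key = self.key_for(url)
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: invalidate lazily on read
            return None
        return value

    def set(self, url: str, value) -> None:
        self._store[self.key_for(url)] = (value, self.clock() + self.ttl)


def cached_predict(url: str, cache: PredictionCache, predict):
    """Return a cached prediction if present, otherwise compute and store it."""
    hit = cache.get(url)
    if hit is not None:
        return hit
    result = predict(url)
    cache.set(url, result)
    return result
```

Passing the clock in as a parameter keeps the expiry logic easy to test without sleeping.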

Shadow Model Retraining

The system supports automatic model retraining through shadow models:

  • Continuous learning from user corrections
  • Automatic promotion of improved models
  • A/B testing capabilities for model comparison
  • Zero-downtime model updates
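A promotion rule for a shadow model might look like the sketch below: the shadow model replaces the live one only if it beats it by a margin on user-corrected examples, and only once enough corrections exist. The margin, the minimum sample count, and the use of plain accuracy are assumptions; the source does not describe the actual promotion criteria.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], int]  # maps a URL to 0 (legitimate) or 1 (phishing)


def accuracy(model: Model, labeled: Sequence[Tuple[str, int]]) -> float:
    """Fraction of labeled URLs the model classifies correctly."""
    correct = sum(model(url) == label for url, label in labeled)
    return correct / len(labeled)


def choose_model(live: Model, shadow: Model, labeled: Sequence[Tuple[str, int]],
                 min_gain: float = 0.01, min_samples: int = 100) -> Model:
    """Promote the shadow model only on sufficient evidence of improvement."""
    if len(labeled) < min_samples:
        return live  # too little feedback to judge
    if accuracy(shadow, labeled) >= accuracy(live, labeled) + min_gain:
        return shadow
    return live


# Toy models and feedback for illustration only.
live_model = lambda url: 0
shadow_model = lambda url: int("verify" in url)
feedback = [("http://verify.example", 1), ("http://ok.example", 0)] * 50
chosen = choose_model(live_model, shadow_model, feedback)
print("promoted shadow" if chosen is shadow_model else "kept live")
```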

Security Considerations

PhishGuard implements several security measures:

  • Input validation and sanitization
  • Rate limiting to prevent abuse
  • Secure handling of user reports
  • Privacy-focused data collection
  • HTTPS enforcement in production
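Rate limiting, for example, can be implemented as a sliding window per client. The sketch below is one common approach, not necessarily the one PhishGuard uses; the class name and the limits in the usage example are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
print([limiter.allow("203.0.113.7") for _ in range(3)])
```

In a Flask app, a check like this would typically run before the prediction handler and return HTTP 429 when allow() is False.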

Deployment and Scaling

The application is designed for easy deployment and scaling:

Development Setup

conda activate base
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python serverHTML.py

Production Deployment

conda activate base
gunicorn --config gunicorn.conf.py serverHTML:app

Load Testing

wrk -t4 -c100 -d30s -s predict.lua http://127.0.0.1:8080/api/predict

Expected performance: ~1,500 predictions/second with ~70ms average latency.
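These two numbers are consistent with each other. Assuming each of the 100 wrk connections keeps one request in flight at a time, Little's law relates concurrency, throughput, and mean latency, as this small calculation shows:

```python
def expected_latency_seconds(concurrent_connections: int, requests_per_second: float) -> float:
    """Little's law: mean latency = requests in flight / throughput."""
    return concurrent_connections / requests_per_second


# 100 connections (-c100) at roughly 1,500 requests per second.
latency_ms = expected_latency_seconds(100, 1500) * 1000
print(f"{latency_ms:.1f} ms")
```

The result, about 67 ms, matches the ~70ms average latency quoted above.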

Live Demo: See PhishGuard in Action

To truly appreciate the power and usability of PhishGuard, I've created a comprehensive video demonstration that walks through the entire application from start to finish.

Comprehensive Analysis

  • Over 50 different features analyzed per URL
  • Real-time processing with instant results
  • Detailed breakdown of security indicators
  • Confidence scoring for informed decisions

Showcase of Website Attributes and their Probability of Phishing

User-Friendly Interface

  • Intuitive design requiring no technical knowledge
  • Clear visual indicators for quick understanding
  • Detailed explanations for users who want to learn more
  • Responsive design that works on any device

Feature analysis showing technical details

Educational Value

  • Helps users understand what makes a website suspicious
  • Provides insights into modern phishing techniques
  • Encourages better online security habits
  • Serves as a learning tool for cybersecurity awareness

Performance Highlights

The demo showcases PhishGuard's impressive performance capabilities:

  • Response Time: Under 2 seconds for complete analysis
  • Detection Rate: 94.7% of malicious URLs identified
  • False Positive Rate: Less than 1.2% for legitimate sites
  • Scalability: Handles multiple concurrent requests efficiently

Performance metrics and system statistics

Technical Architecture

The video also provides insights into the technical architecture that makes PhishGuard so effective:

  • Multi-layered Analysis: URL structure, HTML content, and image analysis
  • Machine Learning Integration: Real-time prediction with confidence scoring
  • Caching Strategy: Redis-based caching for lightning-fast responses
  • Security Measures: Comprehensive input validation and rate limiting

Real-World Impact

Most importantly, the demo illustrates how PhishGuard makes a real difference in protecting users:

  • Early Detection: Catches threats before users enter sensitive information
  • User Education: Helps people recognize phishing attempts in the future
  • Accessibility: Makes advanced security analysis available to everyone
  • Continuous Improvement: Learns from new attack patterns and user feedback

Wrapping things up

PhishGuard represents a comprehensive approach to web security, combining machine learning, modern web development practices, and user-centered design. The project demonstrates how open-source collaboration can create meaningful security tools that protect users in an increasingly complex digital landscape.

If you're interested in contributing to the ongoing development of PhishGuard or learning more about the technical implementation, you can find the complete source code and documentation on the GitHub repository.

The project continues to evolve with new features, improved detection algorithms, and enhanced user experience, all driven by the goal of making the internet a safer place for everyone.

Tags

  • Machine Learning
  • Web Security
  • Flask
  • JavaScript

Contact

Questions or need more details? Email me or check out my links.