PhishGuard: A Machine Learning-powered Phishing Detection System
A real-time phishing URL detection system using ML and modern web technologies.
Brendan Lambrecht

The Origin Story
The journey of PhishGuard began in August of 2025. As a student at UW-La Crosse, I would constantly get bombarded with phishing emails. Every week, students and faculty fall victim to increasingly sophisticated scams, losing access to their accounts and compromising sensitive data.

A professor could receive an email that appeared to be from our university's IT department, complete with official logos and convincing language. The email could have a link to "verify account security" - a classic phishing tactic. Despite their security training, the professor could click the link and enter their credentials, compromising years of valuable data and potentially their career.
This scenario raised a question: Could we build a tool that would catch such an attack before it caused damage?
The Chrome Extension Challenge
My partner on this project, Zach Ydunate, and I initially set out to create a Chrome extension that would analyze URLs in real time as users browsed the web. I spent weeks learning about browser extension development, studying the Chrome API, and designing a system with Zach that could intercept and analyze web requests.

However, I quickly realized the limitations of this approach:
- Browser-specific constraints: Each browser has different extension APIs
- Limited analysis capabilities: Extensions have restricted access to system resources
- User adoption barriers: Users must install and trust the extension
- Maintenance overhead: Browser updates frequently break extension functionality
The Pivot to Full-Stack Development
Recognizing these challenges, I made a strategic pivot. Instead of building a browser extension, I decided to create a comprehensive web application that could be accessed by anyone, anywhere, without installation requirements.
This decision led me down the path of full-stack development, where I had to learn:
- Backend development with Flask and Python
- Frontend design with modern JavaScript and responsive frameworks
- Database management for storing user reports and model training data
- Machine learning for creating accurate phishing detection models

The Machine Learning Breakthrough
The most challenging and rewarding aspect of the project was developing the machine learning model. I spent months researching existing phishing detection techniques, studying academic papers, and experimenting with different algorithms.
The breakthrough came when I discovered the PhiUSIIL Phishing URL Dataset - a collection of roughly 235,000 labeled URLs (134,850 legitimate and 100,945 phishing) with detailed feature extraction. This dataset became the foundation for training our model.

Through iterative testing and refinement, I developed a system that could analyze over 50 different features from URLs and HTML content, achieving impressive accuracy rates:
- Accuracy: > 95%
- Precision: > 94%
- Recall: > 93%
- F1-Score: > 94%
From Prototype to Production
What started as a simple idea to help protect my campus community evolved into a sophisticated web application used by people around the world. The journey taught me invaluable lessons about:
- User-centered design: Creating tools that are both powerful and accessible
- Technical problem-solving: Overcoming challenges through research and iteration
- Security awareness: Understanding the evolving landscape of cyber threats
- Open-source collaboration: Building something that benefits the broader community
The best security tools are those that make protection accessible to everyone, not just experts.
PhishGuard represents more than just a technical achievement - it's a testament to how real-world problems can inspire meaningful innovation. The project continues to evolve, with new features and improvements driven by user feedback and the ever-changing landscape of cybersecurity threats.
A Real Phishing Attack: The Perfect Case Study
To truly understand the importance of PhishGuard, let me walk you through a real phishing attack that demonstrates exactly what our system is designed to detect.
The Attack Scenario
Imagine you're a university student checking your email. You receive a message that appears to be from your university's IT department.

The email looks legitimate at first glance:
- Official university logo
- Professional formatting
- Urgent but believable language
- A link to "verify your account security"
The Hidden Danger
When you click the link, you're taken to a website that looks almost identical to your university's login page. However, there are subtle clues that something is wrong:
- URL Analysis: The domain might be something like university-security.verify.com instead of the official university.edu
- SSL Certificate: The site might have a self-signed or expired certificate
- HTML Structure: The page might have suspicious JavaScript or hidden tracking elements
- Image Sources: The logo might be loaded from an external server for tracking purposes
How PhishGuard Would Detect This Attack
Let me demonstrate how our system would analyze this malicious URL:

Step 1: URL Structure Analysis
- URL Length: 67 characters (suspiciously long)
- Domain Mismatch: Not from official university domain
- Special Characters: Multiple hyphens and unusual patterns
- HTTPS Status: Present but certificate validation fails
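
As a rough illustration of the URL-structure checks in Step 1, the sketch below computes a few lexical features with Python's standard library. The function name `url_structure_features`, the feature names, and the `official_domain` parameter are hypothetical and are not taken from the PhishGuard source; certificate validation is left out entirely.

```python
from urllib.parse import urlparse

def url_structure_features(url: str, official_domain: str) -> dict:
    """Compute a few lexical URL features (illustrative, not PhishGuard's actual feature set)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    official = official_domain.lower()
    # A host matches if it is the official domain itself or one of its subdomains.
    on_official_domain = host == official or host.endswith("." + official)
    return {
        "url_length": len(url),
        "num_hyphens": url.count("-"),
        "num_dots_in_host": host.count("."),
        "uses_https": parsed.scheme == "https",
        "on_official_domain": on_official_domain,
    }

features = url_structure_features(
    "https://university-security.verify.com/login", "university.edu"
)
print(features)
```

Matching on the full host suffix (with the leading dot) matters here: a look-alike such as evil-university.edu ends with "university.edu" but is not a subdomain of it.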

Step 2: HTML Content Analysis
- Form Action: Points to external server
- JavaScript: Contains obfuscated code
- Meta Tags: Missing or suspicious
- External Resources: Loads images from multiple unknown domains
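
The "form action points to external server" check from Step 2 can be sketched with the standard-library HTML parser. The class and function names below are hypothetical, and the sample page is made up for illustration; a real page would need the HTML to be fetched first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class FormActionCollector(HTMLParser):
    """Collect the action attribute of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or valueless action means the form posts back to the page itself.
            self.actions.append(dict(attrs).get("action") or "")

def external_form_actions(page_url: str, html: str) -> list:
    """Return form targets whose host differs from the page's own host."""
    collector = FormActionCollector()
    collector.feed(html)
    page_host = urlparse(page_url).hostname
    external = []
    for action in collector.actions:
        target = urljoin(page_url, action)  # resolve relative actions against the page URL
        if urlparse(target).hostname != page_host:
            external.append(target)
    return external

sample_html = """
<form action="/login"><input name="user"></form>
<form action="https://collector.example.net/steal"><input name="pw"></form>
"""
flagged = external_form_actions("https://university-security.verify.com/", sample_html)
print(flagged)
```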

Step 3: Image Analysis
- Tracking Pixels: Hidden 1x1 transparent images
- External Logos: University logo loaded from attacker's server
- Suspicious Alt Text: Contains keywords for SEO manipulation
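
A minimal sketch of the tracking-pixel check from Step 3, assuming pixels are identified by their declared width and height attributes. The names `TrackingPixelFinder` and `find_tracking_pixels` are hypothetical; images sized through CSS or served as transparent files would need additional checks.

```python
from html.parser import HTMLParser

class TrackingPixelFinder(HTMLParser):
    """Collect <img> sources whose declared size is 1x1 or smaller."""

    def __init__(self):
        super().__init__()
        self.pixels = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        width, height = attr.get("width"), attr.get("height")
        # Only plain width/height attributes are considered in this sketch.
        if width in ("0", "1") and height in ("0", "1"):
            self.pixels.append(attr.get("src") or "")

def find_tracking_pixels(html: str) -> list:
    finder = TrackingPixelFinder()
    finder.feed(html)
    return finder.pixels

sample = (
    '<img src="https://cdn.example.org/logo.png" width="200" height="80">'
    '<img src="https://tracker.example.net/p.gif" width="1" height="1">'
)
pixels = find_tracking_pixels(sample)
print(pixels)
```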

Step 4: Final Assessment
Based on our machine learning model's analysis of all these features:

- Risk Level: HIGH
- Confidence: 94%
- Prediction: PHISHING
- Recommendation: DO NOT ENTER CREDENTIALS
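
One way a model probability could be turned into the verdict above is sketched here. The function name `assess`, the 80/50 cut-offs for the risk levels, and the default threshold of 0.5 are all assumptions chosen for illustration, not values taken from PhishGuard.

```python
def assess(probability: float, threshold: float = 0.5) -> dict:
    """Turn a phishing probability into a user-facing verdict (cut-offs are illustrative)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    risk_score = round(probability * 100)
    if risk_score >= 80:
        risk_level = "HIGH"
    elif risk_score >= 50:
        risk_level = "MEDIUM"
    else:
        risk_level = "LOW"
    is_phishing = probability >= threshold
    return {
        "prediction": "PHISHING" if is_phishing else "LEGITIMATE",
        "risk_score": risk_score,
        "risk_level": risk_level,
        "recommendation": (
            "DO NOT ENTER CREDENTIALS" if is_phishing else "Proceed with normal caution"
        ),
    }

verdict = assess(0.94)
print(verdict)
```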
Real-World Validation
We tested PhishGuard against thousands of known phishing URLs from databases like PhishTank and found that our system successfully identified 94.7% of malicious sites while maintaining a false positive rate of less than 1.2%.
This means that out of every 100 phishing attempts, PhishGuard catches roughly 95, while incorrectly flagging only about 1 out of every 100 legitimate sites.
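
To make the two rates concrete, the sketch below derives them from confusion-matrix counts. The counts are hypothetical, picked only so that the arithmetic reproduces the 94.7% and 1.2% figures quoted above; they are not PhishGuard's actual test results.

```python
def detection_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Detection rate (recall) and false positive rate from confusion-matrix counts."""
    return {
        # Share of phishing URLs that were flagged.
        "detection_rate": tp / (tp + fn),
        # Share of legitimate URLs that were flagged by mistake.
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical counts: 10,000 phishing URLs and 10,000 legitimate URLs.
metrics = detection_metrics(tp=9470, fn=530, fp=120, tn=9880)
print(f"detection rate: {metrics['detection_rate']:.1%}")
print(f"false positive rate: {metrics['false_positive_rate']:.1%}")
```

Note that the two rates have different denominators: the detection rate is measured over phishing URLs only, and the false positive rate over legitimate URLs only.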
Continuous Learning
What makes PhishGuard even more powerful is its ability to learn from new attack patterns. When users report false positives or previously undetected phishing sites, our shadow model retraining system incorporates this feedback to improve future detection rates.

This continuous learning approach ensures that PhishGuard stays ahead of evolving phishing tactics, providing ongoing protection against both known and emerging threats.
The real-world impact of this technology is immeasurable. Every phishing attempt that's caught means one less person whose identity, finances, or personal data is compromised. It's this potential to make a tangible difference in people's lives that drives the continued development and improvement of PhishGuard.
Technology stack
PhishGuard leverages a modern technology stack designed for scalability, performance, and maintainability:
- Backend: Flask web framework with Gunicorn WSGI server
- Machine Learning: Scikit-learn with Decision Tree and Random Forest classifiers
- Frontend: Vanilla JavaScript with responsive design principles
- Caching: Redis for high-performance caching of predictions
- Database: SQLite for storing user reports and corrections
- Deployment: Docker containerization with production-ready configuration
The architecture follows a microservices-inspired design, allowing for easy scaling and maintenance of different components.
Front-end Design
The user interface was designed with simplicity and accessibility in mind. Users can simply paste a URL into the input field and receive an instant analysis with clear visual indicators:
- LEGITIMATE: Safe website with high confidence
- PHISHING: Potential threat detected
- Confidence Score: Percentage indicating model certainty
- Feature Analysis: Breakdown of factors contributing to the prediction
The dashboard is fully responsive, working seamlessly across desktop, tablet, and mobile devices. The design prioritizes clear communication of security information without overwhelming users with technical details.
How It Works
PhishGuard operates through a sophisticated multi-step process that combines URL analysis with machine learning prediction:
1. URL Validation and Normalization
The system first validates and normalizes the input URL to ensure consistent analysis:
def validate_accessible_http_url(url: str, require_html: bool = False, timeout: float = 8.0):
    """Validates URL accessibility and returns final redirect URL."""
    # Comprehensive URL validation logic
    # Network reachability testing
    # Content type verification
    return is_valid, error_message, final_url
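
The function above is shown only as an outline. As a hedged sketch of what the purely syntactic part of such validation might look like, the example below normalizes a URL with `urllib.parse` and returns a tuple of the same shape. The name `normalize_http_url`, the error messages, and the choice to default to http when no scheme is given are assumptions; reachability testing and content-type checks are deliberately omitted.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_http_url(url: str):
    """Syntactic validation and normalization only; performs no network access."""
    candidate = url.strip()
    if "://" not in candidate:
        candidate = "http://" + candidate  # assumption: default to http
    try:
        parts = urlsplit(candidate)
        port = parts.port  # raises ValueError for a malformed port
    except ValueError as exc:
        return False, f"Malformed URL: {exc}", None
    if parts.scheme not in ("http", "https"):
        return False, "Only http and https URLs are supported", None
    if not parts.hostname:
        return False, "URL has no host", None
    host = parts.hostname  # already lowercased by urlsplit
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    netloc = f"{host}:{port}" if port else host
    # Drop the fragment and default an empty path to "/".
    normalized = urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))
    return True, None, normalized

print(normalize_http_url("HTTP://Example.COM/Path#frag"))
print(normalize_http_url("ftp://example.com"))
```

Normalizing before analysis means that trivially different spellings of the same URL produce the same features and the same cache key.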
2. Feature Extraction
We extract over 50 different features from each URL, including:
- URL length and structure analysis
- Domain reputation and SSL certificate validation
- HTML content analysis for suspicious patterns
- Image analysis for tracking pixels and beacons
- Network behavior patterns
def featurize_url(url: str, fetch_html_flag: bool = True):
    """Extract comprehensive features from URL and HTML content."""
    # URL structure analysis
    # HTML content parsing
    # Image and resource analysis
    return features_dict, auxiliary_data
3. Machine Learning Prediction
The extracted features are fed into our trained machine learning model:
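def predict_phishing(url: str):
    """Predict if URL is phishing with confidence score."""
    # featurize_url returns (features_dict, auxiliary_data); only the features are used here.
    features, _ = featurize_url(url)
    # predict_proba expects a 2D input (one row per URL). This assumes the dict
    # preserves the feature order used at training time.
    row = [list(features.values())]
    probability = model.predict_proba(row)[:, 1][0]
    prediction = int(probability >= threshold)
    return prediction, probability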
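
Scikit-learn's `predict_proba` takes a 2D array with one row per sample and columns in the same order as the training data, so a feature dictionary has to be flattened consistently before prediction. The sketch below shows one way to do that with an explicit column list. `FEATURE_ORDER`, `features_to_row`, and `decide` are hypothetical names, the three feature names are borrowed from the sample API response, and the 0.5 threshold is an assumption.

```python
FEATURE_ORDER = ["url_length", "has_https", "num_special_chars"]  # hypothetical training order

def features_to_row(features: dict, order=FEATURE_ORDER) -> list:
    """Flatten a feature dict into a single-row 2D list in a fixed column order."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Booleans become 0.0 or 1.0; the outer list makes this one row of a 2D input.
    return [[float(features[name]) for name in order]]

def decide(probability: float, threshold: float = 0.5) -> int:
    """Return 1 when the phishing probability reaches the threshold, else 0."""
    return int(probability >= threshold)

row = features_to_row({"num_special_chars": 5, "url_length": 32, "has_https": False})
print(row)
print(decide(0.87))
```

An explicit order avoids relying on dictionary insertion order, which can silently change if the feature extractor is refactored.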
4. Real-time API Response
The system provides instant feedback through a RESTful API:
curl -X POST -H "Content-Type: application/json" \
  -d '{"url": "http://suspicious-site.example.com"}' \
  http://127.0.0.1:8080/api/predict
Response:
{
  "url": "http://suspicious-site.example.com",
  "prediction": 1,
  "risk_score": 87,
  "risk_level": "HIGH",
  "confidence": "high",
  "features": {
    "url_length": 32,
    "has_https": false,
    "num_special_chars": 5
  }
}
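
The same call can be made from Python with the standard library. The sketch below builds the POST request and summarizes a response of the shape shown above; the helper names are hypothetical, and it assumes that a `prediction` of 1 means phishing, which the sample response suggests but does not state outright.

```python
import json
from urllib.request import Request

def build_predict_request(url_to_check: str, base_url: str = "http://127.0.0.1:8080") -> Request:
    """Build (but do not send) a POST request for the /api/predict endpoint."""
    body = json.dumps({"url": url_to_check}).encode("utf-8")
    return Request(
        f"{base_url}/api/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def summarize(response_body: str) -> str:
    """Condense a JSON response into a one-line verdict."""
    data = json.loads(response_body)
    label = "PHISHING" if data["prediction"] == 1 else "LEGITIMATE"  # assumption: 1 = phishing
    return f'{data["url"]}: {label} (risk {data["risk_score"]}, {data["risk_level"]})'

request = build_predict_request("http://suspicious-site.example.com")
sample_response = json.dumps({
    "url": "http://suspicious-site.example.com",
    "prediction": 1,
    "risk_score": 87,
    "risk_level": "HIGH",
})
print(summarize(sample_response))
```

With a server running locally, the request could be sent with `urllib.request.urlopen(request)` and the decoded response body passed to `summarize`.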
Model Performance
Our machine learning model achieves impressive performance metrics:
- Accuracy: > 95%
- Precision: > 94%
- Recall: > 93%
- F1-Score: > 94%
The model is continuously improved through user feedback and automatic retraining mechanisms. Users can report false positives or negatives, which are used to enhance the model's accuracy over time.
Advanced Features
Image Analysis
PhishGuard includes sophisticated image analysis capabilities:
- Detection of tracking pixels and web beacons
- Analysis of external image sources
- Identification of suspicious image patterns
- Size and format validation
Caching System
To ensure lightning-fast response times, we implement a multi-layered caching strategy:
- URL-level prediction caching with Redis
- Feature extraction result caching
- Model inference result caching
- Automatic cache invalidation and updates
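
The idea behind URL-level prediction caching with expiry can be sketched without a Redis server. The in-memory `TTLCache` class below is a stand-in for Redis, not the project's caching code; the class, the `cached_predict` helper, and the 60-second lifetime are all assumptions for illustration.

```python
import time

class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired entries are invalidated on read
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

def cached_predict(url, cache, predict):
    """Return a cached prediction when available, otherwise compute and store it."""
    hit = cache.get(url)
    if hit is not None:
        return hit
    result = predict(url)
    cache.set(url, result)
    return result

calls = []

def fake_predict(url):
    """Placeholder for the real model call; records how often it runs."""
    calls.append(url)
    return {"prediction": 1, "risk_score": 87}

now = [0.0]  # controllable clock so the example does not need to sleep
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
first = cached_predict("http://a.example", cache, fake_predict)
second = cached_predict("http://a.example", cache, fake_predict)  # served from cache
now[0] = 61.0
third = cached_predict("http://a.example", cache, fake_predict)  # expired, recomputed
print(len(calls))
```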
Shadow Model Retraining
The system supports automatic model retraining through shadow models:
- Continuous learning from user corrections
- Automatic promotion of improved models
- A/B testing capabilities for model comparison
- Zero-downtime model updates
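
A shadow-model promotion step can be reduced to a small comparison: evaluate the live and shadow models on the same held-out examples and switch only if the shadow is better by some margin. The sketch below uses plain callables as toy models; the function names, the `min_gain` margin, and the hold-out data are all hypothetical and say nothing about how PhishGuard actually evaluates its models.

```python
def accuracy(model, examples) -> float:
    """Fraction of (features, label) pairs the model classifies correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)

def maybe_promote(live, shadow, holdout, min_gain: float = 0.01):
    """Return the model that should serve traffic, plus both accuracies."""
    live_acc = accuracy(live, holdout)
    shadow_acc = accuracy(shadow, holdout)
    promoted = shadow_acc >= live_acc + min_gain
    return (shadow if promoted else live), live_acc, shadow_acc

# Toy models: each maps a feature dict to 1 (phishing) or 0 (legitimate).
def live_model(f):
    return int(f["num_hyphens"] >= 3)

def shadow_model(f):
    return int(f["num_hyphens"] >= 2 or not f["https"])

# Made-up hold-out set, including user-reported corrections.
holdout = [
    ({"num_hyphens": 3, "https": False}, 1),
    ({"num_hyphens": 2, "https": True}, 1),
    ({"num_hyphens": 0, "https": False}, 1),
    ({"num_hyphens": 0, "https": True}, 0),
    ({"num_hyphens": 1, "https": True}, 0),
]
serving, live_acc, shadow_acc = maybe_promote(live_model, shadow_model, holdout)
print(serving.__name__, live_acc, shadow_acc)
```

Requiring a minimum gain rather than any improvement guards against promoting a model on noise from a small evaluation set.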
Security Considerations
PhishGuard implements several security measures:
- Input validation and sanitization
- Rate limiting to prevent abuse
- Secure handling of user reports
- Privacy-focused data collection
- HTTPS enforcement in production
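
Rate limiting, for example, can be implemented as a sliding window per client. The sketch below is one common approach and not necessarily the one PhishGuard uses; the class name, the limit of 3 requests per 60 seconds, and keying by client IP address are assumptions.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per client within a rolling time window."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

now = [0.0]  # controllable clock so the example does not need to sleep
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60, clock=lambda: now[0])
results = [limiter.allow("203.0.113.7") for _ in range(4)]
now[0] = 60.0
after_window = limiter.allow("203.0.113.7")
print(results, after_window)
```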
Deployment and Scaling
The application is designed for easy deployment and scaling:
Development Setup
conda activate base
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python serverHTML.py
Production Deployment
conda activate base
gunicorn --config gunicorn.conf.py serverHTML:app
Load Testing
wrk -t4 -c100 -d30s -s predict.lua http://127.0.0.1:8080/api/predict
Expected performance: ~1,500 predictions/second with ~70ms average latency.
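
These two figures are consistent with each other. In a closed-loop test like this one, where each of the 100 connections (`-c100`) sends its next request only after the previous response arrives, throughput is approximately the number of connections divided by the average latency. The helper name below is hypothetical; the arithmetic is a back-of-the-envelope check, not a measurement.

```python
def expected_throughput(concurrent_connections: int, avg_latency_seconds: float) -> float:
    """Approximate requests per second for a closed-loop load test."""
    return concurrent_connections / avg_latency_seconds

rps = expected_throughput(concurrent_connections=100, avg_latency_seconds=0.070)
print(f"{rps:.0f} requests/second")
```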
Live Demo: See PhishGuard in Action
To truly appreciate the power and usability of PhishGuard, I've created a comprehensive video demonstration that walks through the entire application from start to finish.
Comprehensive Analysis
- Over 50 different features analyzed per URL
- Real-time processing with instant results
- Detailed breakdown of security indicators
- Confidence scoring for informed decisions

User-Friendly Interface
- Intuitive design requiring no technical knowledge
- Clear visual indicators for quick understanding
- Detailed explanations for users who want to learn more
- Responsive design that works on any device

Educational Value
- Helps users understand what makes a website suspicious
- Provides insights into modern phishing techniques
- Encourages better online security habits
- Serves as a learning tool for cybersecurity awareness
Performance Highlights
The demo showcases PhishGuard's impressive performance capabilities:
- Response Time: Under 2 seconds for complete analysis
- Accuracy Rate: 94.7% detection of malicious URLs
- False Positive Rate: Less than 1.2% for legitimate sites
- Scalability: Handles multiple concurrent requests efficiently

Technical Architecture
The video also provides insights into the technical architecture that makes PhishGuard so effective:
- Multi-layered Analysis: URL structure, HTML content, and image analysis
- Machine Learning Integration: Real-time prediction with confidence scoring
- Caching Strategy: Redis-based caching for lightning-fast responses
- Security Measures: Comprehensive input validation and rate limiting
Real-World Impact
Most importantly, the demo illustrates how PhishGuard makes a real difference in protecting users:
- Early Detection: Catches threats before users enter sensitive information
- User Education: Helps people recognize phishing attempts in the future
- Accessibility: Makes advanced security analysis available to everyone
- Continuous Improvement: Learns from new attack patterns and user feedback
Wrapping things up
PhishGuard represents a comprehensive approach to web security, combining machine learning, modern web development practices, and user-centered design. The project demonstrates how open-source collaboration can create meaningful security tools that protect users in an increasingly complex digital landscape.
If you're interested in contributing to the ongoing development of PhishGuard or learning more about the technical implementation, you can find the complete source code and documentation on the GitHub repository.
The project continues to evolve with new features, improved detection algorithms, and enhanced user experience, all driven by the goal of making the internet a safer place for everyone.