Bot Email Classifier - Project Details

Executive Summary

Developed and deployed a production machine learning system to detect bot-generated email addresses in real time during SSO authentication events. The solution achieved roughly 92% accuracy and 94% precision, blocking coordinated bot registration campaigns while keeping false positives low for legitimate users.

Key Achievement: Took a machine learning model from research to production, including building custom tooling for real-time feature engineering and inference at authentication time.

Business Problem

The organization faced persistent automated account creation attacks where adversaries used bot-generated email patterns to create thousands of fake accounts. These attacks had multiple impacts:

Pollution of user databases with fake accounts
Resource consumption from fraudulent registrations
Potential for coordinated abuse campaigns
Manual investigation overhead for security analysts

Traditional rule-based detection was insufficient: attackers adapted their patterns faster than rules could be updated. A machine learning approach was needed to identify structural patterns in bot-generated emails that human reviewers might miss.

Technical Approach

Feature Engineering

Rather than relying on simple heuristics, I engineered features that captured the structural and statistical patterns in email addresses:

Character-level n-grams (TF-IDF): Captured recurring character sequences typical of bot-generated strings (e.g., random consonant clusters, keyboard walks)
Statistical features: Digit ratios, vowel distributions, character entropy, and length metrics
Domain characteristics: TLD patterns, subdomain presence, disposable email service indicators
Dimensionality reduction (PCA): Compressed the high-dimensional TF-IDF features while retaining discriminative power
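
The feature families above can be illustrated with a minimal sketch. It is not the project's actual code: the function names, the trigram size, the feature names, and the two-entry disposable-domain set are all assumptions chosen for illustration, and the TF-IDF weighting and PCA steps are left out.

```python
import math
from collections import Counter

VOWELS = set("aeiou")
# Placeholder list for illustration only; a real system would load a maintained list.
DISPOSABLE_DOMAINS = {"mailinator.com", "guerrillamail.com"}


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def email_features(address):
    """Compute illustrative structural features for one address (assumes it contains '@')."""
    local, _, domain = address.lower().rpartition("@")
    letters = [ch for ch in local if ch.isalpha()]
    labels = domain.split(".")
    return {
        "length": len(local),
        "digit_ratio": sum(ch.isdigit() for ch in local) / len(local) if local else 0.0,
        "vowel_ratio": sum(ch in VOWELS for ch in letters) / len(letters) if letters else 0.0,
        "entropy": shannon_entropy(local),
        "tld": labels[-1],
        # Naive check: ignores multi-part public suffixes such as "co.uk".
        "has_subdomain": len(labels) > 2,
        "is_disposable": domain in DISPOSABLE_DOMAINS,
        # Raw trigrams; in the real pipeline these would feed a TF-IDF vectorizer.
        "trigrams": char_ngrams(local, 3),
    }
```

Calling email_features("xk7q9z2@mail.example.com") returns one dictionary per address; the numeric entries could be fed to a classifier directly, while the trigrams would first need vectorizing.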

Model Development

Built a logistic regression classifier trained on labeled historical data. The model learned to distinguish between organic email patterns (user-chosen addresses) and programmatically generated ones. Feature selection and hyperparameter tuning focused on maintaining high precision to minimize false positives impacting legitimate users.
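
To make the modeling step concrete, here is a small, self-contained sketch of logistic regression fitted by batch gradient descent. It is an illustration under stated assumptions, not the production training code: the two input features (digit ratio, vowel ratio), the six toy rows, the learning rate, and the epoch count are all invented for the example, and a real project would more likely use an established library.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=2000, l2=0.0):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias)
            err = p - yi  # derivative of the log loss w.r.t. the linear score
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        # l2 is an optional regularization strength (one tunable hyperparameter).
        weights = [w - lr * (g / n_samples + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the positive (bot) class."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy rows of [digit_ratio, vowel_ratio]; label 1 = bot-like, 0 = organic.
# These values are made up for illustration and are not project data.
X_toy = [[0.6, 0.0], [0.5, 0.1], [0.7, 0.05], [0.0, 0.45], [0.1, 0.4], [0.05, 0.5]]
y_toy = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic_regression(X_toy, y_toy)
```

On this toy data the learned weight for digit ratio comes out positive and the weight for vowel ratio negative. Raising the decision threshold above 0.5 is one common way to trade recall for precision.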

Accuracy: 91.8%
Precision: 93.7%
Recall: 89.6%
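
For reference, the three metrics are defined in terms of true/false positives and negatives, where "positive" means "classified as bot". The sketch below computes them from label lists; the function name and the ten toy labels are hypothetical and unrelated to the project's evaluation data.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = bot, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bots correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate users wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bots that slipped through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate users correctly allowed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels for illustration only.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

Precision is the figure most relevant to legitimate users: it is the share of flagged addresses that really were bots, so one minus precision is the share of flags that were false alarms.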

Production Operationalization

The most challenging aspect was bridging the gap between offline model training and real-time production inference. The model was trained on pre-computed features, but production systems only had raw email strings at authentication time.

Solution: Developed a custom command (emailaddressfeatures) that performs real-time feature extraction on streaming authentication events. This command:

Ingests raw email addresses from SSO events
Calculates all required features (n-grams, statistical metrics, domain properties) on the fly
Integrates seamlessly into the existing security monitoring pipeline
Maintains sub-second latency to avoid impacting authentication flows
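
The behavior described above can be sketched as a generator that enriches each event as it streams past. This is only an illustration of the idea: the event shape (a dictionary with an "email" key), the "feat_" prefix, the function signature, and the reduced feature set are assumptions, and the real emailaddressfeatures command depends on the platform's own command interface.

```python
import math
from collections import Counter


def extract_features(email):
    """Compute a small, illustrative subset of features from a raw address."""
    local, _, domain = email.lower().rpartition("@")
    counts = Counter(local)
    total = len(local)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values()) if total else 0.0
    return {
        "length": total,
        "digit_ratio": sum(ch.isdigit() for ch in local) / total if total else 0.0,
        "entropy": entropy,
        "domain": domain,
    }


def emailaddressfeatures(events, field="email"):
    """Yield each event with feature fields added; events without a usable address pass through."""
    for event in events:
        email = event.get(field)
        if not email or "@" not in email:
            yield event
            continue
        enriched = dict(event)  # copy so the incoming event is not mutated
        for name, value in extract_features(email).items():
            enriched[f"feat_{name}"] = value
        yield enriched


# Hypothetical events for illustration.
sample_events = [{"user": "u1", "email": "abc123@example.com"}, {"user": "u2"}]
enriched_events = list(emailaddressfeatures(sample_events))
```

Because the function is a generator, it handles one event at a time and never buffers the stream, which is one way to keep per-event latency low.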

Deployment Architecture

The solution operates as part of the real-time security monitoring stack:

SSO authentication events stream into the security data platform
Custom feature extraction command processes each registration event
Pre-trained TF-IDF vectorizer and PCA models transform features
Logistic regression model scores the email address
High-risk scores trigger automated blocking and alert generation
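
The flow above, from raw event to blocking decision, can be sketched end to end. Everything numeric here is a stand-in: the weights, bias, and 0.9 threshold are hypothetical values rather than the trained model's, only two features are used, and the TF-IDF and PCA transform stage is omitted for brevity.

```python
import math

VOWELS = set("aeiou")
# Hypothetical coefficients standing in for the trained logistic regression model.
WEIGHTS = {"digit_ratio": 4.0, "vowel_ratio": -3.0}
BIAS = -1.0
BLOCK_THRESHOLD = 0.9  # assumed cut-off for "high-risk"


def extract(email):
    """Feature extraction stage (reduced to two features for the sketch)."""
    local = email.lower().rpartition("@")[0]
    letters = [ch for ch in local if ch.isalpha()]
    return {
        "digit_ratio": sum(ch.isdigit() for ch in local) / len(local) if local else 0.0,
        "vowel_ratio": sum(ch in VOWELS for ch in letters) / len(letters) if letters else 0.0,
    }


def score(features):
    """Scoring stage: logistic regression probability that the address is bot-generated."""
    z = BIAS + sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def handle_registration(event, threshold=BLOCK_THRESHOLD):
    """Run one registration event through extract -> score -> decide."""
    risk = score(extract(event["email"]))
    action = "block_and_alert" if risk >= threshold else "allow"
    return {"email": event["email"], "risk": risk, "action": action}
```

With these illustrative weights, an all-digit address such as "8472910365@example.com" lands above the threshold and is blocked, while "maria.lopez@example.com" is allowed.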

Business Impact

Automated Detection: Eliminated manual review of suspicious registration patterns
Scalable Defense: Handles thousands of registrations per hour with consistent accuracy
Adaptive Learning: Model can be retrained on new attack patterns as they emerge
Low False Positives: At roughly 94% precision (93.7% measured), only about 6 in 100 flagged addresses are false alarms, so legitimate users are rarely impacted

Technical Skills Demonstrated

End-to-end ML pipeline development (data collection, feature engineering, training, deployment)
Production ML operationalization and real-time inference systems
Custom tooling development for security automation
Security data analysis and threat pattern recognition
Performance optimization for low-latency security decisions

Key Takeaway: This project demonstrates the ability to move beyond proof-of-concept ML models to production-grade security systems. The challenge was not just building an accurate model; it was engineering the infrastructure to deploy it reliably at scale in a security-critical authentication flow.

Future Enhancements

Potential improvements identified during development:

Incorporate behavioral features (registration velocity, IP reputation, device fingerprinting)
Implement online learning to adapt to evolving bot patterns without full retraining
Add explainability layer to surface which features triggered high-risk scores
Expand to other authentication vectors beyond email (username patterns, etc.)