Executive Summary
Developed and deployed a production machine learning system to detect bot-generated email addresses in real time during SSO authentication events. The solution achieved 92% accuracy with 94% precision, effectively blocking coordinated bot registration campaigns while maintaining low false-positive rates for legitimate users.
Key Achievement: Successfully operationalized a machine learning model from research to production, including custom tooling development for real-time feature engineering and inference at authentication time.
Business Problem
The organization faced persistent automated account creation attacks where adversaries used bot-generated email patterns to create thousands of fake accounts. These attacks had multiple impacts:
- Pollution of user databases with fake accounts
- Resource consumption from fraudulent registrations
- Potential for coordinated abuse campaigns
- Manual investigation overhead for security analysts
Traditional rule-based detection was insufficient because attackers adapted their patterns faster than rules could be updated. A machine learning approach was needed to identify structural patterns in bot-generated emails that humans might miss.
Technical Approach
Feature Engineering
Rather than relying on simple heuristics, I engineered features that captured the structural and statistical patterns in email addresses:
- Character-level n-grams (TF-IDF): Captured recurring character sequences typical of bot-generated strings (e.g., random consonant clusters, keyboard walks)
- Statistical features: Digit ratios, vowel distributions, character entropy, and length metrics
- Domain characteristics: TLD patterns, subdomain presence, disposable email service indicators
- Dimensionality reduction (PCA): Compressed high-dimensional TF-IDF features while retaining discriminative power
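The feature families above can be illustrated with a minimal, standard-library sketch. It is not the original implementation: the function names, the n-gram range (2 to 3), and the exact set of statistics are assumptions chosen for illustration, and the TF-IDF weighting and PCA steps are left out.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def char_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> Counter:
    """Count overlapping character n-grams (the raw counts a TF-IDF step would weight)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in Counter(text).values())


def statistical_features(email: str) -> dict:
    """Compute simple statistics over the local part (text before the last '@')."""
    local, sep, domain = email.lower().rpartition("@")
    if not sep:
        # No '@' present: treat the whole string as the local part.
        local, domain = domain, ""
    letters = [ch for ch in local if ch.isalpha()]
    length = len(local)
    return {
        "length": length,
        "digit_ratio": sum(ch.isdigit() for ch in local) / length if length else 0.0,
        "vowel_ratio": sum(ch in VOWELS for ch in letters) / len(letters) if letters else 0.0,
        "entropy": shannon_entropy(local),
        "domain_label_count": len(domain.split(".")) if domain else 0,
    }
```

For example, statistical_features("x7k9q2zt41@mail.example.org") gives a digit ratio of 0.5 and a vowel ratio of 0.0, the kind of profile a user-chosen address seldom has.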
Model Development
Built a logistic regression classifier trained on labeled historical data. The model learned to distinguish between organic email patterns (user-chosen addresses) and programmatically generated ones. Feature selection and hyperparameter tuning focused on maintaining high precision to minimize false positives impacting legitimate users.
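One common way to favor precision is to tune the decision threshold applied to the classifier's scores on held-out data. The sketch below shows that idea under stated assumptions: the helper names and the toy scores are hypothetical, labels use 1 for bot and 0 for legitimate, and the source does not say that the original tuning worked this way.

```python
def precision_at(scores, labels, threshold):
    """Precision of the rule 'flag when score >= threshold'; None if nothing is flagged."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


def recall_at(scores, labels, threshold):
    """Share of true bots (label 1) that the rule flags; None if there are no bots."""
    positives = sum(labels)
    if positives == 0:
        return None
    hits = sum(1 for score, label in zip(scores, labels) if score >= threshold and label == 1)
    return hits / positives


def lowest_threshold_for_precision(scores, labels, target):
    """Return the smallest observed score that, used as a threshold, meets the target precision.

    Recall never increases as the threshold rises, so the smallest qualifying
    threshold keeps recall as high as possible for that precision.
    """
    for threshold in sorted(set(scores)):
        precision = precision_at(scores, labels, threshold)
        if precision is not None and precision >= target:
            return threshold
    return None


# Toy validation data (made up for illustration): model scores and true labels.
scores = [0.10, 0.30, 0.45, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]
threshold = lowest_threshold_for_precision(scores, labels, target=0.9)
```

On this toy data, a 0.9 precision target selects a threshold of 0.80, which flags two of the three bots. This shows the usual trade-off: raising precision costs recall.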
Production Operationalization
The most challenging aspect was bridging the gap between offline model training and real-time production inference. The model was trained on pre-computed features, but production systems only had raw email strings at authentication time.
Solution: Developed a custom command (emailaddressfeatures) that performs real-time feature extraction on streaming authentication events. This command:
- Ingests raw email addresses from SSO events
- Calculates all required features (n-grams, statistical metrics, domain properties) on the fly
- Integrates seamlessly into the existing security monitoring pipeline
- Maintains sub-second latency to avoid impacting authentication flows
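The behavior listed above can be pictured as a generator that enriches one event at a time. This is a hedged sketch, not the emailaddressfeatures command itself: the event shape (a dict with an "email" field), the function names, the placeholder disposable-domain list, and the feature_ms timing field are all assumptions, and only the domain properties are computed here.

```python
import time
from typing import Dict, Iterable, Iterator

# Hypothetical blocklist; a real deployment would load a maintained list.
DISPOSABLE_DOMAINS = {"disposable.example", "tempmail.example"}


def domain_features(email: str) -> Dict[str, object]:
    """Derive domain properties from the text after the last '@'."""
    domain = email.lower().rpartition("@")[2] if "@" in email else ""
    labels = [part for part in domain.split(".") if part]
    return {
        "tld": labels[-1] if labels else "",
        # Naive check: more than two labels. Multi-part suffixes such as
        # "co.uk" would need a public-suffix list to be handled correctly.
        "has_subdomain": len(labels) > 2,
        "is_disposable": domain in DISPOSABLE_DOMAINS,
    }


def enrich_events(events: Iterable[dict], email_field: str = "email") -> Iterator[dict]:
    """Yield each event with feature fields added, one event at a time.

    Events without the email field pass through unchanged so that the
    rest of the pipeline still sees them.
    """
    for event in events:
        email = event.get(email_field)
        if not email:
            yield event
            continue
        started = time.perf_counter()
        enriched = dict(event)  # copy, so the caller's event is not mutated
        enriched.update(domain_features(email))
        # Per-event extraction time in milliseconds, useful for latency monitoring.
        enriched["feature_ms"] = (time.perf_counter() - started) * 1000.0
        yield enriched
```

Because it is a generator, each event is handled as it arrives rather than waiting for a batch, and the per-event timing makes it possible to check a latency budget.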
Deployment Architecture
The solution operates as part of the real-time security monitoring stack:
- SSO authentication events stream into the security data platform
- Custom feature extraction command processes each registration event
- Pre-trained TF-IDF vectorizer and PCA models transform features
- Logistic regression model scores the email address
- High-risk scores trigger automated blocking and alert generation
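The final two steps of this flow, scoring and acting on the score, can be sketched as follows. The weights, bias, and threshold are invented for illustration (real values come from training and tuning), the feature names are assumptions, and the TF-IDF and PCA transforms are omitted so that the example stays self-contained.

```python
import math

# Illustrative values only; a trained model supplies the real coefficients.
WEIGHTS = {"digit_ratio": 4.0, "entropy": 1.2, "is_disposable": 2.5}
BIAS = -5.0
RISK_THRESHOLD = 0.9  # hypothetical cut-off for "high risk"


def logistic_score(features: dict) -> float:
    """Logistic regression score: sigmoid of the bias plus the weighted feature sum."""
    z = BIAS + sum(WEIGHTS[name] * float(features.get(name, 0.0)) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def decide(score: float) -> dict:
    """High-risk scores trigger both blocking and an alert; others pass through."""
    high_risk = score >= RISK_THRESHOLD
    return {"score": score, "block": high_risk, "alert": high_risk}


suspicious = decide(logistic_score({"digit_ratio": 0.5, "entropy": 3.0, "is_disposable": True}))
ordinary = decide(logistic_score({"digit_ratio": 0.0, "entropy": 2.0, "is_disposable": False}))
```

With these made-up weights, the first address scores about 0.96 and is blocked, while the second scores well below the threshold and is allowed.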
Business Impact
- Automated Detection: Eliminated manual review of suspicious registration patterns
- Scalable Defense: Handles thousands of registrations per hour with consistent accuracy
- Adaptive Learning: Model can be retrained on new attack patterns as they emerge
- Low False Positives: With 94% precision, only about 6 in every 100 flagged addresses belong to legitimate users
Technical Skills Demonstrated
- End-to-end ML pipeline development (data collection, feature engineering, training, deployment)
- Production ML operationalization and real-time inference systems
- Custom tooling development for security automation
- Security data analysis and threat pattern recognition
- Performance optimization for low-latency security decisions
Key Takeaway: This project demonstrates the ability to move beyond proof-of-concept ML models to production-grade security systems. The challenge wasn't just building an accurate model; it was engineering the infrastructure to deploy it reliably at scale in a security-critical authentication flow.
Future Enhancements
Potential improvements identified during development:
- Incorporate behavioral features (registration velocity, IP reputation, device fingerprinting)
- Implement online learning to adapt to evolving bot patterns without full retraining
- Add explainability layer to surface which features triggered high-risk scores
- Expand to other authentication vectors beyond email (username patterns, etc.)
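For a linear model such as logistic regression, one simple form the explainability idea could take is ranking features by their contribution to the logit (weight times feature value). The sketch below uses hypothetical weights and feature names. It applies directly only to features fed to the model as-is; coefficients learned on PCA components would first need to be mapped back through the PCA component matrix.

```python
def top_contributions(weights: dict, features: dict, k: int = 3) -> list:
    """Rank features by the absolute size of weight * value, largest first."""
    contributions = {
        name: weights[name] * float(features.get(name, 0.0)) for name in weights
    }
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical weights and one scored address's features, for illustration.
example_weights = {"digit_ratio": 4.0, "entropy": 1.2, "vowel_ratio": -3.0}
example_features = {"digit_ratio": 0.5, "entropy": 3.0, "vowel_ratio": 0.1}
explanation = top_contributions(example_weights, example_features, k=2)
```

Here the explanation lists entropy first and digit ratio second, which an analyst could read as "flagged mainly for a high-entropy, digit-heavy local part".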