
ML Classifier: Hunting Down Bot Email Addresses

TLP:AMBER November 2025
Machine Learning, Pattern Recognition, Production ML

Executive Summary

Developed and deployed a production machine learning system that detects bot-generated email addresses in real time during SSO authentication events. The model reached 91.8% accuracy, 93.7% precision, and 89.6% recall, blocking coordinated bot registration campaigns while keeping the false-positive rate low for legitimate users.

Key Achievement: Took a machine learning model from research to production, including custom tooling for real-time feature engineering and inference at authentication time.

Business Problem

The organization faced persistent automated account-creation attacks in which adversaries used bot-generated email addresses to create thousands of fake accounts. These attacks had multiple impacts:

Traditional rule-based detection was insufficient because attackers adapted their patterns faster than rules could be updated. A machine learning approach was needed to identify structural patterns in bot-generated emails that a human reviewer might miss.

Technical Approach

Feature Engineering

Rather than relying on simple heuristics, I engineered features that captured the structural and statistical patterns in email addresses:
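As a minimal sketch of what such features can look like, the Python below computes a few common structural signals from the local part of an address. The feature names (local_length, digit_ratio, vowel_ratio, entropy, separator_count, domain_label_count) and the helper functions are illustrative assumptions, not the production feature set.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(email: str) -> dict:
    """Hypothetical structural features for one email address."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep:
        # No "@" present: treat the whole string as the local part.
        local, domain = domain, ""

    length = len(local)
    digits = sum(ch.isdigit() for ch in local)
    letters = sum(ch.isalpha() for ch in local)
    vowels = sum(ch in "aeiou" for ch in local)

    return {
        "local_length": length,
        "digit_ratio": digits / length if length else 0.0,
        "vowel_ratio": vowels / letters if letters else 0.0,
        "entropy": shannon_entropy(local),
        "separator_count": sum(ch in "._-+" for ch in local),
        "domain_label_count": len(domain.split(".")) if domain else 0,
    }


if __name__ == "__main__":
    for address in ("jane.doe@example.com", "x7k2q9m4@example.com"):
        print(address, extract_features(address))
```

Randomly generated local parts tend to show a higher digit ratio and a higher character entropy than names people choose for themselves, which is why signals of this kind are a reasonable starting point.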

Model Development

Built a logistic regression classifier trained on labeled historical data. The model learned to distinguish between organic email patterns (user-chosen addresses) and programmatically generated ones. Feature selection and hyperparameter tuning focused on maintaining high precision to minimize false positives impacting legitimate users.
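One common way to bias a probabilistic classifier toward precision is to choose its decision threshold on held-out data. This is not necessarily the tuning procedure used in this project. The sketch below assumes a list of true labels (1 = bot) and model scores, and picks the lowest threshold that meets a precision target. Because recall can only fall as the threshold rises, that threshold also gives the best recall among those that qualify. The function names and the toy data are hypothetical.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as bot (label 1)."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(labels, scores, min_precision):
    """Return (threshold, precision, recall) for the lowest qualifying threshold.

    Returns None when no candidate threshold reaches min_precision.
    """
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall(labels, scores, threshold)
        if precision >= min_precision:
            return threshold, precision, recall
    return None


if __name__ == "__main__":
    # Toy validation set, invented for illustration only.
    labels = [0, 0, 1, 0, 1, 1, 1]
    scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    print(pick_threshold(labels, scores, min_precision=0.9))
```

Raising the precision target trades away recall, which matches the priority stated above: a missed bot is cheaper than a blocked legitimate user.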

Accuracy: 91.8%
Precision: 93.7%
Recall: 89.6%

Production Operationalization

The most challenging aspect was bridging the gap between offline model training and real-time production inference. The model was trained on pre-computed features, but production systems only had raw email strings at authentication time.

Solution: Developed a custom command (emailaddressfeatures) that performs real-time feature extraction on streaming authentication events. This command:
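The general shape of such a streaming step can be sketched in Python as a generator that enriches each event as it arrives. This is an illustration under stated assumptions rather than the actual emailaddressfeatures implementation: the event layout (a dict with an "email" field), the two features, the weights, and the intercept are all hypothetical, and the host platform's command interface is omitted.

```python
import math

# Hypothetical coefficients; a real deployment would load the trained model's values.
WEIGHTS = {"local_length": 0.05, "digit_ratio": 4.0}
INTERCEPT = -2.5


def features(email: str) -> dict:
    """Compute the (hypothetical) model features from a raw address string."""
    local = email.strip().lower().rpartition("@")[0]
    length = len(local)
    digits = sum(ch.isdigit() for ch in local)
    return {
        "local_length": length,
        "digit_ratio": digits / length if length else 0.0,
    }


def enrich(events, field="email"):
    """Yield each event with feature fields and a bot score added, one at a time."""
    for event in events:
        email = event.get(field)
        if not email:
            yield event  # pass through events that carry no address
            continue
        feats = features(email)
        # Logistic regression score: sigmoid of the weighted feature sum.
        z = INTERCEPT + sum(WEIGHTS[name] * value for name, value in feats.items())
        enriched = dict(event)  # copy so the caller's event is not mutated
        enriched.update(feats)
        enriched["bot_score"] = 1.0 / (1.0 + math.exp(-z))
        yield enriched


if __name__ == "__main__":
    stream = [{"email": "jane@example.com"}, {"email": "a8f3k2j9d7@example.com"}]
    for scored in enrich(stream):
        print(scored)
```

Processing one event at a time keeps memory use flat, and running the same feature code at training and inference time avoids a mismatch between the two feature sets.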

Deployment Architecture

The solution operates as part of the real-time security monitoring stack:

Business Impact

Technical Skills Demonstrated

Key Takeaway: This project demonstrates the ability to move beyond proof-of-concept ML models to production-grade security systems. The challenge was not only building an accurate model but also engineering the infrastructure to deploy it reliably at scale in a security-critical authentication flow.

Future Enhancements

Potential improvements identified during development: