The Bot Landscape: By the Numbers
According to Zlycloud's 2026 Bot Traffic Report, which analyzed over 3 billion daily requests across our network:
- 47% of internet traffic is automated (bot-generated), up from 42% in 2023
- 30% of all traffic is malicious bots — scrapers, credential stuffers, vulnerability scanners, and DDoS bots
- 17% is beneficial automation — search crawlers, monitoring tools, SEO tools, RSS readers
- 53% is human-generated — the traffic you're actually building your application for
Malicious bot traffic costs businesses $116 billion annually in infrastructure costs, content theft, competitive intelligence leakage, and direct fraud. E-commerce sites experience the worst of it: 24% of all login attempts on retail sites are credential stuffing attacks.
Why Bot Traffic Is Growing
AI-powered bot frameworks (Playwright-based bots with natural language instruction sets) have dramatically lowered the skill barrier for bot operations. What previously required custom coding now takes minutes to configure. Botnet-as-a-service markets offer residential proxy pools, CAPTCHA-solving services, and browser automation — for as little as $50/month.
Good Bots: Identifying and Whitelisting Beneficial Automation
Not all automated traffic should be blocked. Search engine crawlers, uptime monitors, and SEO tools provide genuine value. Blocking Googlebot means your pages disappear from search results; blocking your own monitoring means you're blind to outages.
Major Good Bot Categories
Search Engine Crawlers
Googlebot, Bingbot, DuckDuckBot, Baiduspider, Yandexbot — index your content for search ranking. Should always be allowed.
Identification: Reverse DNS verification (IP resolves to googlebot.com, then forward resolves back)
Uptime & Performance Monitors
Pingdom, UptimeRobot, StatusCake, Datadog Synthetic, New Relic Synthetics. Known IP ranges, consistent User-Agents.
SEO & Analytics Crawlers
Semrush, Ahrefs, Moz — crawl for backlink analysis and SEO auditing. Often aggressive crawl rates; rate limiting appropriate but not blocking.
Security Scanners (Authorized)
Your own Qualys, Nessus, or Burp Suite scans. Whitelist by IP to prevent WAF blocking your own security testing.
Verifying Good Bot Identity
User-Agent strings are trivially spoofed. A bot claiming to be Googlebot should be verified through reverse DNS lookup: the source IP must resolve to a hostname ending in googlebot.com or google.com, and that hostname must forward-resolve back to the same IP. This double-verification catches IP spoofing and User-Agent impersonation simultaneously.
# Verify Googlebot authenticity
$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
# Both resolve ✓ — legitimate Googlebot
# Mismatch → spoofed bot claiming to be Googlebot
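The double-verification shown above is easy to automate. Here is a minimal Python sketch — the function name and the injectable resolver parameters are illustrative, added so the logic can be tested without live DNS; the defaults fall back to the standard library's resolvers:

```python
import socket

GOOGLE_CRAWL_DOMAINS = (".googlebot.com", ".google.com")

def verify_googlebot(ip, reverse=None, forward=None):
    """Double verification: the IP's PTR record must point into a Google
    crawl domain, and that hostname must forward-resolve back to the
    same IP. Catches both UA spoofing and PTR spoofing."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or socket.gethostbyname
    try:
        hostname = reverse(ip)
    except OSError:
        return False
    if not hostname.endswith(GOOGLE_CRAWL_DOMAINS):
        return False  # claims Googlebot, but the PTR is elsewhere
    try:
        # Forward resolution must round-trip to the original IP
        return forward(hostname) == ip
    except OSError:
        return False
```

An attacker can set any PTR record on IPs they control, which is why the forward step back to the original IP is essential — only Google can make `crawl-*.googlebot.com` resolve to their address.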
Bad Bot Categories: A Taxonomy
Content and Price Scrapers
Scrapers systematically extract content — product listings, prices, inventory levels, news articles, job postings — to republish it, use it for competitive intelligence, or train AI models. Modern scrapers rotate User-Agents, use residential proxies, implement random delays, and execute JavaScript to handle SPAs.
E-commerce operators face price scrapers that monitor pricing in real time, enabling competitors to undercut automatically by fractions of a cent. At scale, this triggers automated price wars that compress margins across entire market categories.
Credential Stuffing Bots
Using leaked username/password databases (billions of credentials available on dark web markets), credential stuffing bots attempt authentication against target applications at high velocity. Sophisticated campaigns use:
- Residential proxy pools to distribute across millions of IPs (defeating IP-based blocking)
- Slow-and-distributed rates (1-2 attempts per IP per hour — below rate limit thresholds)
- Valid User-Agent strings and TLS fingerprints matching real browsers
- CAPTCHA-solving services (human solvers or ML-based)
Success rates of 0.1-2% sound low but represent millions of compromised accounts when the credential list contains 500 million entries.
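Because slow-and-distributed campaigns stay under any per-IP threshold, one common countermeasure is to count failures per *target account* across all source IPs instead. A minimal sketch of that idea — class name, window, and threshold are illustrative, not a production design (note the IP set here never expires, which a real implementation would handle):

```python
import time
from collections import defaultdict, deque

class StuffingDetector:
    """Counts failed logins per account across ALL source IPs.
    Many failures from many distinct IPs within the window is the
    credential-stuffing signature that per-IP limits miss."""

    def __init__(self, window_s=3600, threshold=5):
        self.window_s = window_s
        self.threshold = threshold
        self.failures = defaultdict(deque)  # account -> failure timestamps
        self.ips = defaultdict(set)         # account -> distinct source IPs

    def record_failure(self, account, ip, now=None):
        now = time.time() if now is None else now
        q = self.failures[account]
        q.append(now)
        self.ips[account].add(ip)
        # Drop failures that have aged out of the sliding window
        while q and q[0] < now - self.window_s:
            q.popleft()
        return (len(q) >= self.threshold
                and len(self.ips[account]) >= self.threshold)
```

A flagged account can then be routed to step-up authentication rather than blocked outright, since the account owner is a victim here, not the attacker.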
Carding Bots
Carding bots test stolen credit card numbers against e-commerce sites, typically using small "micro-authorization" amounts ($0.01 - $5.00) to verify the card is valid before selling it or using it for large purchases. They target sites with simple checkout flows and weak fraud detection. A carding campaign can process 100,000+ card tests per hour against a single merchant.
Scalper Bots (Inventory Hoarding)
Scalper bots monitor limited-inventory products (sneakers, concert tickets, gaming consoles, GPUs) and purchase the instant stock appears — faster than any human can act. They add items to cart, submit payment forms, and complete checkout within milliseconds of availability. Secondary-market markups of 200-500% on scalped goods are common.
L7 DDoS Bots
As covered in our DDoS mitigation guide, application-layer DDoS attacks use botnets sending legitimate-looking HTTP requests at high volume. The distinction from credential stuffers or scrapers is intent (exhaustion vs theft) and rate (thousands of rps vs a few per second).
Advanced Persistent Bots (APBs): Human Mimicry
The most sophisticated category — Advanced Persistent Bots — are specifically designed to defeat detection systems by mimicking human behavior patterns:
- Running in real browser instances (Playwright, Puppeteer) rather than HTTP clients — generating valid JS execution, DOM interaction, and rendering artifacts
- Executing realistic mouse movement patterns (Bezier curves, not straight lines)
- Typing at variable, human-like speeds with realistic inter-keystroke timing
- Spending realistic "reading time" on pages before navigation
- Generating valid browser fingerprints (canvas fingerprint, WebGL, audio context)
- Using residential proxies with clean IP reputation
- Handling CAPTCHAs via human CAPTCHA farms (humans solving at $0.50-2.00/1000)
APBs are used by well-funded scraping operations, nation-state actors, and sophisticated fraud rings. They defeat simple signature-based and threshold-based detection — requiring ML-powered behavioral analysis to identify.
Detection Layer 1: IP Reputation
The first and fastest detection layer scores requests based on the source IP's history and classification:
- ASN classification — Traffic from cloud hosting ASNs (AWS, GCP, Azure, DigitalOcean, Hetzner) is highly suspicious for browser-impersonating bots; legitimate users are on ISP ASNs. ASN-level scoring doesn't block (cloud users exist) but adds signal.
- Datacenter IP ranges — Published lists of datacenter IP blocks (used by bots for cheap, high-bandwidth compute) score heavily negative.
- Tor exit nodes — Tor network exit IPs indicate deliberate anonymization. Legitimate use cases exist, but Tor traffic warrants elevated challenge.
- Residential proxy pools — Increasingly sophisticated detection of residential proxy services using behavioral correlation (multiple IPs in different cities exhibiting identical behavior patterns).
- Historical threat intelligence — IPs previously associated with attacks, in blocklists, or sharing subnets with known malicious infrastructure.
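The signals above combine naturally into an additive risk score. A sketch of that combination — the weights here are purely illustrative assumptions, not Zlycloud's production values, and a real system would learn them from labeled traffic:

```python
# Illustrative weights only — a production system learns these.
SIGNAL_WEIGHTS = {
    "cloud_asn": 0.30,         # AWS/GCP/Azure/DO/Hetzner source
    "datacenter_ip": 0.35,     # published datacenter IP block
    "tor_exit": 0.25,          # known Tor exit node
    "residential_proxy": 0.40, # correlated residential-proxy behavior
    "on_blocklist": 0.50,      # historical threat intelligence hit
}

def ip_reputation_score(signals):
    """Combine boolean reputation signals into a 0.0-1.0 risk score.
    No single signal blocks on its own — each just adds weight,
    matching the 'adds signal, doesn't block' principle above."""
    score = sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))
    return min(score, 1.0)
```

This keeps the layer fast (a few dictionary lookups per request) so it can run before the more expensive fingerprinting and behavioral layers.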
Detection Layer 2: TLS and Browser Fingerprinting
JA3/JA4 TLS Fingerprinting
Every TLS client produces a fingerprint based on the parameters in its ClientHello message: supported cipher suites, TLS extensions, elliptic curves, and elliptic curve point formats. The JA3 hash (MD5 of these parameters) and its successor JA4 (with improved context and consistency) create stable identifiers for specific TLS implementations.
# JA3 fingerprint components (comma-separated, then MD5-hashed):
TLSVersion,Ciphers,Extensions,EllipticCurves,EllipticCurvePointFormats
# Example legitimate Chrome 121 JA3:
771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,
0-23-65281-10-11-35-16-5-13-18-51-45-43-27-17513-21,
29-23-24,0
# MD5: 66918128f1b9b03303d77c6f2eefd128
# curl/7.88 JA3 — distinctly different cipher ordering:
771,4866-4865-4867-49196-49195-49199-49198-49188-49187-49162-49161-52393-52392-49172-49171-157-156-61-60-53-47-255,0-11-10-13-16,29-23-24-25,0
# MD5: different hash → NOT matching real Chrome
Bots using standard HTTP libraries (Python's requests, Go's net/http, Node.js's http) produce JA3 hashes that don't match any real browser — an immediate high-confidence bot signal. Sophisticated bots use browser automation (which produces valid browser JA3s) or patch their TLS stack to match Chrome's fingerprint.
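The JA3 construction described above — five ClientHello fields, dash-joined within a field, comma-joined between fields, then MD5-hashed — can be sketched in a few lines. This is a hypothetical helper, assuming the ClientHello has already been parsed into integer lists:

```python
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string (TLSVersion,Ciphers,Extensions,
    EllipticCurves,EllipticCurvePointFormats) and return its MD5 hex
    digest. Field ORDER matters — reordering ciphers changes the hash."""
    fields = [str(tls_version)] + [
        "-".join(str(v) for v in part)
        for part in (ciphers, extensions, curves, point_formats)
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()
```

Because ordering is part of the fingerprint, two clients offering the same cipher suites in different order hash differently — exactly the property that separates curl from Chrome in the example above.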
HTTP/2 Fingerprinting
Complementing TLS fingerprinting, HTTP/2 settings frames reveal the client implementation: the order and values of SETTINGS parameters (header table size, initial window size, max concurrent streams) form a fingerprint that differs between browsers and HTTP libraries.
Browser Environment Fingerprinting
For requests that do execute JavaScript, browser fingerprinting collects dozens of signals:
- Navigator properties (platform, language, hardware concurrency)
- Canvas rendering fingerprint (GPU/driver-specific rendering differences)
- WebGL vendor and renderer strings
- Audio context fingerprint (AudioBuffer processing characteristics)
- Screen resolution and pixel density
- Installed fonts (via CSS timing attacks)
- WebDriver presence (the navigator.webdriver flag)
- Automation-specific browser properties (Chrome's window.chrome object)
Detection Layer 3: Behavioral Analysis
The most sophisticated detection layer analyzes patterns over time and across sessions — finding anomalies that individual request analysis misses:
Mouse and Interaction Analysis
Human mouse movements follow characteristic patterns: acceleration and deceleration, micro-tremors, path curvature influenced by motor physics. Bot-generated mouse movements use linear interpolation (perfectly straight lines) or overly "perfect" Bezier curves. Features analyzed:
- Mouse velocity distribution (humans show non-uniform speed profiles)
- Direction change frequency and angle distribution
- Time-to-click after mouse-stop (humans pause before clicking)
- Click pressure and duration (touchscreen devices)
- Scroll velocity and acceleration patterns
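Two of the features above — velocity uniformity and path straightness — can be sketched directly. This is an illustrative server-side calculation, assuming mouse telemetry arrives as (x, y, timestamp-in-ms) tuples; the function name and feature choices are assumptions, not a production feature set:

```python
import math

def mouse_features(points):
    """points: list of (x, y, t_ms) samples.
    Returns (velocity coefficient of variation, fraction of collinear
    triples). Linear-interpolation bots score near 0 on the first and
    near 1.0 on the second; human movement scores well away from both."""
    vels = []
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = max(t1 - t0, 1e-6)
        vels.append(math.hypot(x1 - x0, y1 - y0) / dt)
    collinear = 0
    for (x0, y0, _), (x1, y1, _), (x2, y2, _) in zip(points, points[1:], points[2:]):
        # Cross product of consecutive displacement vectors: 0 => straight line
        cross = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)
        if abs(cross) < 1e-6:
            collinear += 1
    mean = sum(vels) / len(vels)
    std = math.sqrt(sum((v - mean) ** 2 for v in vels) / len(vels))
    cv = std / mean if mean else 0.0
    return cv, collinear / max(len(points) - 2, 1)
```

A production model would feed dozens of such statistics into the classifier rather than thresholding any one of them.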
Request Timing Analysis
Human browsing introduces natural timing variations: reading time before navigation, thinking time before form submission, irregular inter-request intervals. Bot requests exhibit mechanical regularity: identical or algorithmically varied timing, and no correlation between content length and time spent on page.
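The regularity contrast above is often quantified with the coefficient of variation of inter-request gaps. A minimal sketch, with an illustrative function name and no claim about production thresholds:

```python
import statistics

def timing_regularity(timestamps):
    """Coefficient of variation (stddev / mean) of inter-request gaps.
    Fixed-interval bots approach 0; human browsing, with its reading
    and thinking pauses, typically scores much higher."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean else 0.0
```

Naive bots that add uniform random jitter still show a telltale flat gap distribution, which richer statistics (entropy, autocorrelation) can catch where the CV alone cannot.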
Session Coherence Analysis
Human sessions tell coherent stories: landing page → browse category → view product → add to cart → checkout. Bot sessions often violate this coherence: direct POST to checkout without visiting the product page, requests in impossible order, missing required cookies that browsers would naturally accumulate.
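Coherence checks like "direct POST to checkout without visiting the product page" reduce to flow rules: each sensitive step requires at least one qualifying predecessor earlier in the session. A sketch under that assumption — the paths and rule table are hypothetical examples, not a real site map:

```python
# Hypothetical flow rules: each step requires at least one of the
# listed pages to have been visited earlier in the same session.
FLOW_PREREQS = {
    "/checkout": {"/cart"},
    "/cart/add": {"/product"},
}

def coherence_violations(session_paths):
    """Return the steps reached without any required predecessor —
    e.g. a session whose first request is a POST to /checkout."""
    seen, violations = set(), []
    for path in session_paths:
        prereqs = FLOW_PREREQS.get(path)
        if prereqs and not (seen & prereqs):
            violations.append(path)
        seen.add(path)
    return violations
```

Missing-cookie checks work the same way: a browser that never received the session cookie from the landing page cannot legitimately present one at checkout.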
ML Model Architecture: Training on 3 Billion Daily Samples
Zlycloud's Bot Shield ML model processes 3+ billion requests per day across our global network, using this traffic as continuous training signal. The production model is a gradient-boosted ensemble with the following feature set:
- Network features (17): IP reputation score, ASN category, geolocation consistency, proxy/VPN/Tor signal
- TLS features (12): JA3/JA4 hash match to known browser fingerprints, TLS version, cipher suite ordering anomaly
- HTTP features (28): User-Agent consistency with JA3, HTTP/2 settings fingerprint, header ordering, Accept-Language consistency with IP geolocation
- Behavioral features (34): Request rate, session coherence score, timing distribution statistics, mouse movement entropy
- Historical features (11): Prior classification history for this IP, client fingerprint's historical behavior, account's risk score
The model outputs a continuous risk score from 0.0 (definitely human) to 1.0 (definitely bot), with configurable thresholds for different response actions. Scores update in real time as each new request provides additional behavioral evidence within a session.
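The "configurable thresholds" mapping can be sketched as a simple escalation ladder. The cutoffs below are illustrative assumptions (only the 0.95+ block threshold echoes the false-positive guidance later in this article), and the action names are placeholders:

```python
def response_action(score):
    """Map the 0.0-1.0 risk score to an escalating response.
    Thresholds are illustrative and would be tuned per deployment."""
    if score >= 0.95:
        return "block"         # outright block only at very high confidence
    if score >= 0.70:
        return "captcha"       # high friction, reserved for strong signals
    if score >= 0.40:
        return "js_challenge"  # transparent to real browsers
    return "allow"
```

Keeping hard blocks confined to the top of the ladder is what lets a falsely flagged human recover by simply passing the challenge.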
Challenge Responses: The Spectrum from Transparent to Friction
| Challenge Type | Bot Resistance | User Friction | Best Used For |
|---|---|---|---|
| Transparent fingerprinting | Moderate — defeats simple bots | Zero — invisible to users | All traffic, passive scoring |
| JS Challenge | High — defeats headless HTTP clients | Low — ~100ms added on first visit | Elevated risk score, first visit |
| Proof-of-Work | High — increases cost per request | Low — 50-200ms compute time | High-value endpoints, DDoS mitigation |
| CAPTCHA (image recognition) | High — defeats automated solvers | High — annoying, accessibility issues | Last resort, very high risk scores |
| Shadow ban / honeypot | High — bots don't know they're blocked | None | Scrapers (show plausible but fake data) |
| Rate limit + 429 response | Low — bots slow down, try again | Low — only affects heavy users | Simple volumetric bot control |
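The proof-of-work row in the table deserves a concrete illustration: the server issues a challenge, the client must find a nonce whose hash clears a difficulty target, and the server verifies in one hash. This is a generic sketch (SHA-256, leading-zero-bits difficulty), not Zlycloud's actual challenge format:

```python
import hashlib
import itertools

def solve_pow(challenge, difficulty_bits=16):
    """Client side: brute-force a nonce so SHA-256(challenge:nonce)
    starts with `difficulty_bits` zero bits. Cheap once per visitor
    (~50-200ms), ruinously expensive at bot request volumes."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify_pow(challenge, nonce, difficulty_bits=16):
    """Server side: a single hash verifies the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```

The asymmetry is the point: each extra difficulty bit doubles the attacker's average cost while the server's verification cost stays at one hash.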
Managing False Positives: Protecting Legitimate Users
The hardest challenge in bot detection is minimizing false positives — blocking or challenging legitimate users. False positives have real costs: frustrated users abandon sessions, accessibility tools may be blocked, corporate proxy users appear as bots, and automated testing pipelines break.
"Every bot management system generates false positives. The measure of a good system isn't zero false positives — it's keeping false positive rate below 0.1% while maintaining 99%+ bot detection accuracy. At 3 billion daily requests, even 0.01% false positives means 300,000 wrongly-challenged legitimate users per day."
Strategies for reducing false positives:
- Score thresholding: Use challenges (JS challenge, PoW) for medium-confidence detections rather than outright blocking. If a real user is falsely challenged, they pass the JS challenge and continue. Only block at very high confidence scores (0.95+).
- Behavioral confirmation: Don't make permanent decisions from a single request. Accumulate evidence across the session before escalating the response.
- Accessibility considerations: Users with assistive technologies may have different behavioral patterns. Screen readers navigate differently; motor impairment affects mouse movement. Factor assistive technology signals into scoring.
- Allowlisting for known-good clients: API clients with valid API keys, monitoring IPs, and verified search engines bypass behavioral scoring.
- Customer feedback loop: Provide mechanisms for users to contest blocks, and use confirmed legitimate-user cases to improve the model.
For organizations securing API endpoints against bot-driven attacks, see our API Security Best Practices guide covering how bot detection integrates with schema validation and rate limiting at the API layer.
Stop Bad Bots Without Blocking Good Traffic
Zlycloud Bot Shield uses ML trained on 3 billion daily requests to distinguish credential stuffers, scrapers, and advanced persistent bots from Googlebot and legitimate users — with a false positive rate under 0.1%. Transparent, no-CAPTCHA detection for 99% of cases.
Start Free Trial →