The Bot Landscape: By the Numbers
According to Zlycloud's 2026 Bot Traffic Report, which analyzed over 3 billion daily requests across our network:
- 47% of internet traffic is automated (bot-generated), up from 42% in 2023
- 30% of all traffic is malicious bots — scrapers, credential stuffers, vulnerability scanners, and DDoS bots
- 17% is beneficial automation — search crawlers, monitoring tools, SEO tools, RSS readers
- 53% is human-generated — the traffic you're actually building your application for
Malicious bot traffic costs businesses $116 billion annually in infrastructure costs, content theft, competitive intelligence leakage, and direct fraud. E-commerce sites experience the worst of it: 24% of all login attempts on retail sites are credential stuffing attacks.
Why Bot Traffic Is Growing
AI-powered bot frameworks (Playwright-based bots with natural language instruction sets) have dramatically lowered the skill barrier for bot operations. What previously required custom coding now takes minutes to configure. Botnet-as-a-service markets offer residential proxy pools, CAPTCHA-solving services, and browser automation — for as little as $50/month.
Good Bots: Identifying and Whitelisting Beneficial Automation
Not all automated traffic should be blocked. Search engine crawlers, uptime monitors, and SEO tools provide genuine value. Blocking Googlebot means your pages disappear from search results; blocking your own monitoring means you're blind to outages.
Major Good Bot Categories
Search Engine Crawlers
Googlebot, Bingbot, DuckDuckBot, Baiduspider, Yandexbot — index your content for search ranking. Should always be allowed.
Identification: Reverse DNS verification (IP resolves to googlebot.com, then forward resolves back)
Uptime & Performance Monitors
Pingdom, UptimeRobot, StatusCake, Datadog Synthetic, New Relic Synthetics. Known IP ranges, consistent User-Agents.
SEO & Analytics Crawlers
Semrush, Ahrefs, Moz — crawl for backlink analysis and SEO auditing. Often aggressive crawl rates; rate limiting appropriate but not blocking.
Security Scanners (Authorized)
Your own Qualys, Nessus, or Burp Suite scans. Whitelist by IP to prevent WAF blocking your own security testing.
Verifying Good Bot Identity
User-Agent strings are trivially spoofed. A bot claiming to be Googlebot should be verified through reverse DNS lookup: the source IP must resolve to a hostname ending in googlebot.com or google.com, and that hostname must forward-resolve back to the same IP. This double-verification catches IP spoofing and User-Agent impersonation simultaneously.
# Verify Googlebot authenticity
$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
# Both resolve ✓ — legitimate Googlebot
# Mismatch → spoofed bot claiming to be Googlebot
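The double-verification shown above is easy to automate. Here is a minimal Python sketch — the function name and the injectable resolver parameters are illustrative, added so the logic can be tested without live DNS; the defaults fall back to the standard library's resolvers:

```python
import socket

GOOGLE_CRAWL_DOMAINS = (".googlebot.com", ".google.com")

def verify_googlebot(ip, reverse=None, forward=None):
    """Double verification: the IP's PTR record must point into a Google
    crawl domain, and that hostname must forward-resolve back to the
    same IP. Catches both UA spoofing and PTR spoofing."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or socket.gethostbyname
    try:
        hostname = reverse(ip)
    except OSError:
        return False
    if not hostname.endswith(GOOGLE_CRAWL_DOMAINS):
        return False  # claims Googlebot, but the PTR is elsewhere
    try:
        # Forward resolution must round-trip to the original IP
        return forward(hostname) == ip
    except OSError:
        return False
```

An attacker can set any PTR record on IPs they control, which is why the forward step back to the original IP is essential — only Google can make `crawl-*.googlebot.com` resolve to their address.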
Bad Bot Categories: A Taxonomy
Content and Price Scrapers
Scrapers systematically extract content — product listings, prices, inventory levels, news articles, job postings — to republish it, use it for competitive intelligence, or train AI models. Modern scrapers rotate User-Agents, use residential proxies, implement random delays, and execute JavaScript to handle SPAs.
E-commerce operators face price scrapers that monitor pricing in real time, enabling competitors to undercut automatically by fractions of a cent. At scale, this triggers automated price wars that compress margins across entire market categories.
Credential Stuffing Bots
Using leaked username/password databases (billions of credentials available on dark web markets), credential stuffing bots attempt authentication against target applications at high velocity. Sophisticated campaigns use:
- Residential proxy pools to distribute across millions of IPs (defeating IP-based blocking)
- Slow-and-distributed rates (1-2 attempts per IP per hour — below rate limit thresholds)
- Valid User-Agent strings and TLS fingerprints matching real browsers
- CAPTCHA-solving services (human solvers or ML-based)
Success rates of 0.1-2% sound low but represent millions of compromised accounts when the credential list contains 500 million entries.
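Because slow-and-distributed campaigns stay under any per-IP threshold, one common countermeasure is to count failures per *target account* across all source IPs instead. A minimal sketch of that idea — class name, window, and threshold are illustrative, not a production design (note the IP set here never expires, which a real implementation would handle):

```python
import time
from collections import defaultdict, deque

class StuffingDetector:
    """Counts failed logins per account across ALL source IPs.
    Many failures from many distinct IPs within the window is the
    credential-stuffing signature that per-IP limits miss."""

    def __init__(self, window_s=3600, threshold=5):
        self.window_s = window_s
        self.threshold = threshold
        self.failures = defaultdict(deque)  # account -> failure timestamps
        self.ips = defaultdict(set)         # account -> distinct source IPs

    def record_failure(self, account, ip, now=None):
        now = time.time() if now is None else now
        q = self.failures[account]
        q.append(now)
        self.ips[account].add(ip)
        # Drop failures that have aged out of the sliding window
        while q and q[0] < now - self.window_s:
            q.popleft()
        return (len(q) >= self.threshold
                and len(self.ips[account]) >= self.threshold)
```

A flagged account can then be routed to step-up authentication rather than blocked outright, since the account owner is a victim here, not the attacker.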
Carding Bots
Carding bots test stolen credit card numbers against e-commerce sites, typically using small "micro-authorization" amounts ($0.01 - $5.00) to verify the card is valid before selling it or using it for large purchases. They target sites with simple checkout flows and weak fraud detection. A carding campaign can process 100,000+ card tests per hour against a single merchant.
Scalper Bots (Inventory Hoarding)
Scalper bots monitor limited-inventory products (sneakers, concert tickets, gaming consoles, GPUs) and purchase the instant stock appears — faster than any human can act. They add items to cart, submit payment forms, and complete checkout within milliseconds of availability. Secondary-market markups of 200-500% on scalped goods are common.
L7 DDoS Bots
As covered in our DDoS mitigation guide, application-layer DDoS attacks use botnets sending legitimate-looking HTTP requests at high volume. The distinction from credential stuffers or scrapers is intent (exhaustion vs theft) and rate (thousands of rps vs a few per second).
Advanced Persistent Bots (APBs): Human Mimicry
The most sophisticated category — Advanced Persistent Bots — are specifically designed to defeat detection systems by mimicking human behavior patterns:
- Running in real browser instances (Playwright, Puppeteer) rather than HTTP clients — generating valid JS execution, DOM interaction, and rendering artifacts
- Executing realistic mouse movement patterns (Bezier curves, not straight lines)
- Typing at variable, human-like speeds with realistic inter-keystroke timing
- Spending realistic "reading time" on pages before navigation
- Generating valid browser fingerprints (canvas fingerprint, WebGL, audio context)
- Using residential proxies with clean IP reputation
- Handling CAPTCHAs via human CAPTCHA farms (humans solving at $0.50-2.00/1000)
APBs are used by well-funded scraping operations, nation-state actors, and sophisticated fraud rings. They defeat simple signature-based and threshold-based detection — requiring ML-powered behavioral analysis to identify.
Detection Layer 1: IP Reputation
The first and fastest detection layer scores requests based on the source IP's history and classification:
- ASN classification — Traffic from cloud hosting ASNs (AWS, GCP, Azure, DigitalOcean, Hetzner) is highly suspicious for browser-impersonating bots; legitimate users are on ISP ASNs. ASN-level scoring doesn't block (cloud users exist) but adds signal.
- Datacenter IP ranges — Published lists of datacenter IP blocks (used by bots for cheap, high-bandwidth compute) score heavily negative.
- Tor exit nodes — Tor network exit IPs indicate deliberate anonymization. Legitimate use cases exist, but Tor traffic warrants elevated challenge.
- Residential proxy pools — Increasingly sophisticated detection of residential proxy services using behavioral correlation (multiple IPs in different cities exhibiting identical behavior patterns).
- Historical threat intelligence — IPs previously associated with attacks, in blocklists, or sharing subnets with known malicious infrastructure.
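The signals above combine naturally into an additive risk score. A sketch of that combination — the weights here are purely illustrative assumptions, not Zlycloud's production values, and a real system would learn them from labeled traffic:

```python
# Illustrative weights only — a production system learns these.
SIGNAL_WEIGHTS = {
    "cloud_asn": 0.30,         # AWS/GCP/Azure/DO/Hetzner source
    "datacenter_ip": 0.35,     # published datacenter IP block
    "tor_exit": 0.25,          # known Tor exit node
    "residential_proxy": 0.40, # correlated residential-proxy behavior
    "on_blocklist": 0.50,      # historical threat intelligence hit
}

def ip_reputation_score(signals):
    """Combine boolean reputation signals into a 0.0-1.0 risk score.
    No single signal blocks on its own — each just adds weight,
    matching the 'adds signal, doesn't block' principle above."""
    score = sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))
    return min(score, 1.0)
```

This keeps the layer fast (a few dictionary lookups per request) so it can run before the more expensive fingerprinting and behavioral layers.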
Detection Layer 2: TLS and Browser Fingerprinting
JA3/JA4 TLS Fingerprinting
Every TLS client produces a fingerprint based on the parameters in its ClientHello message: supported cipher suites, TLS extensions, elliptic curves, and elliptic curve point formats. The JA3 hash (MD5 of these parameters) and its successor JA4 (with improved context and consistency) create stable identifiers for specific TLS implementations.
# JA3 fingerprint components (comma-separated, then MD5-hashed):
TLSVersion,Ciphers,Extensions,EllipticCurves,EllipticCurvePointFormats
# Example legitimate Chrome 121 JA3:
771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,
0-23-65281-10-11-35-16-5-13-18-51-45-43-27-17513-21,
29-23-24,0
# MD5: 66918128f1b9b03303d77c6f2eefd128
# curl/7.88 JA3 — distinctly different cipher ordering:
771,4866-4865-4867-49196-49195-49199-49198-49188-49187-49162-49161-52393-52392-49172-49171-157-156-61-60-53-47-255,0-11-10-13-16,29-23-24-25,0
# MD5: different hash → NOT matching real Chrome
Bots using standard HTTP libraries (Python's requests, Go's net/http, Node.js's http) produce JA3 hashes that don't match any real browser — an immediate high-confidence bot signal. Sophisticated bots use browser automation (which produces valid browser JA3s) or patch their TLS stack to match Chrome's fingerprint.
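The JA3 construction described above — five ClientHello fields, dash-joined within a field, comma-joined between fields, then MD5-hashed — can be sketched in a few lines. This is a hypothetical helper, assuming the ClientHello has already been parsed into integer lists:

```python
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string (TLSVersion,Ciphers,Extensions,
    EllipticCurves,EllipticCurvePointFormats) and return its MD5 hex
    digest. Field ORDER matters — reordering ciphers changes the hash."""
    fields = [str(tls_version)] + [
        "-".join(str(v) for v in part)
        for part in (ciphers, extensions, curves, point_formats)
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()
```

Because ordering is part of the fingerprint, two clients offering the same cipher suites in different order hash differently — exactly the property that separates curl from Chrome in the example above.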
HTTP/2 Fingerprinting
Complementing TLS fingerprinting, HTTP/2 settings frames reveal the client implementation: the order and values of SETTINGS parameters (header table size, initial window size, max concurrent streams) form a fingerprint that differs between browsers and HTTP libraries.
Browser Environment Fingerprinting
For requests that do execute JavaScript, browser fingerprinting collects dozens of signals:
- Navigator properties (platform, language, hardware concurrency)
- Canvas rendering fingerprint (GPU/driver-specific rendering differences)
- WebGL vendor and renderer strings
- Audio context fingerprint (AudioBuffer processing characteristics)
- Screen resolution and pixel density
- Installed fonts (via CSS timing attacks)
- WebDriver presence (the navigator.webdriver flag)
- Automation-specific browser properties (Chrome's window.chrome object)
Detection Layer 3: Behavioral Analysis
The most sophisticated detection layer analyzes patterns over time and across sessions — finding anomalies that individual request analysis misses:
Mouse and Interaction Analysis
Human mouse movements follow characteristic patterns: acceleration and deceleration, micro-tremors, path curvature influenced by motor physics. Bot-generated mouse movements use linear interpolation (perfectly straight lines) or overly "perfect" Bezier curves. Features analyzed:
- Mouse velocity distribution (humans show non-uniform speed profiles)
- Direction change frequency and angle distribution
- Time-to-click after mouse-stop (humans pause before clicking)
- Click pressure and duration (touchscreen devices)
- Scroll velocity and acceleration patterns
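Two of the features above — velocity uniformity and path straightness — can be sketched directly. This is an illustrative server-side calculation, assuming mouse telemetry arrives as (x, y, timestamp-in-ms) tuples; the function name and feature choices are assumptions, not a production feature set:

```python
import math

def mouse_features(points):
    """points: list of (x, y, t_ms) samples.
    Returns (velocity coefficient of variation, fraction of collinear
    triples). Linear-interpolation bots score near 0 on the first and
    near 1.0 on the second; human movement scores well away from both."""
    vels = []
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = max(t1 - t0, 1e-6)
        vels.append(math.hypot(x1 - x0, y1 - y0) / dt)
    collinear = 0
    for (x0, y0, _), (x1, y1, _), (x2, y2, _) in zip(points, points[1:], points[2:]):
        # Cross product of consecutive displacement vectors: 0 => straight line
        cross = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)
        if abs(cross) < 1e-6:
            collinear += 1
    mean = sum(vels) / len(vels)
    std = math.sqrt(sum((v - mean) ** 2 for v in vels) / len(vels))
    cv = std / mean if mean else 0.0
    return cv, collinear / max(len(points) - 2, 1)
```

A production model would feed dozens of such statistics into the classifier rather than thresholding any one of them.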
Request Timing Analysis
Human browsing introduces natural timing variations: reading time before navigation, thinking time before form submission, irregular inter-request intervals. Bot requests exhibit mechanical regularity: identical or algorithmically varied timing, and no correlation between content length and time spent on page.
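The regularity contrast above is often quantified with the coefficient of variation of inter-request gaps. A minimal sketch, with an illustrative function name and no claim about production thresholds:

```python
import statistics

def timing_regularity(timestamps):
    """Coefficient of variation (stddev / mean) of inter-request gaps.
    Fixed-interval bots approach 0; human browsing, with its reading
    and thinking pauses, typically scores much higher."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean else 0.0
```

Naive bots that add uniform random jitter still show a telltale flat gap distribution, which richer statistics (entropy, autocorrelation) can catch where the CV alone cannot.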
Session Coherence Analysis
Human sessions tell coherent stories: landing page → browse category → view product → add to cart → checkout. Bot sessions often violate this coherence: direct POST to checkout without visiting the product page, requests in impossible order, missing required cookies that browsers would naturally accumulate.
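Coherence checks like "direct POST to checkout without visiting the product page" reduce to flow rules: each sensitive step requires at least one qualifying predecessor earlier in the session. A sketch under that assumption — the paths and rule table are hypothetical examples, not a real site map:

```python
# Hypothetical flow rules: each step requires at least one of the
# listed pages to have been visited earlier in the same session.
FLOW_PREREQS = {
    "/checkout": {"/cart"},
    "/cart/add": {"/product"},
}

def coherence_violations(session_paths):
    """Return the steps reached without any required predecessor —
    e.g. a session whose first request is a POST to /checkout."""
    seen, violations = set(), []
    for path in session_paths:
        prereqs = FLOW_PREREQS.get(path)
        if prereqs and not (seen & prereqs):
            violations.append(path)
        seen.add(path)
    return violations
```

Missing-cookie checks work the same way: a browser that never received the session cookie from the landing page cannot legitimately present one at checkout.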
ML Model Architecture: Training on 3 Billion Daily Samples
Zlycloud's Bot Shield ML model processes 3+ billion requests per day across our global network, using this traffic as continuous training signal. The production model is a gradient-boosted ensemble with the following feature set:
- Network features (17): IP reputation score, ASN category, geolocation consistency, proxy/VPN/Tor signal
- TLS features (12): JA3/JA4 hash match to known browser fingerprints, TLS version, cipher suite ordering anomaly
- HTTP features (28): User-Agent consistency with JA3, HTTP/2 settings fingerprint, header ordering, Accept-Language consistency with IP geolocation
- Behavioral features (34): Request rate, session coherence score, timing distribution statistics, mouse movement entropy
- Historical features (11): Prior classification history for this IP, client fingerprint's historical behavior, account's risk score
The model outputs a continuous risk score from 0.0 (definitely human) to 1.0 (definitely bot), with configurable thresholds for different response actions. Scores update in real time as each new request provides additional behavioral evidence within a session.
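The "configurable thresholds" mapping can be sketched as a simple escalation ladder. The cutoffs below are illustrative assumptions (only the 0.95+ block threshold echoes the false-positive guidance later in this article), and the action names are placeholders:

```python
def response_action(score):
    """Map the 0.0-1.0 risk score to an escalating response.
    Thresholds are illustrative and would be tuned per deployment."""
    if score >= 0.95:
        return "block"         # outright block only at very high confidence
    if score >= 0.70:
        return "captcha"       # high friction, reserved for strong signals
    if score >= 0.40:
        return "js_challenge"  # transparent to real browsers
    return "allow"
```

Keeping hard blocks confined to the top of the ladder is what lets a falsely flagged human recover by simply passing the challenge.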
Challenge Responses: The Spectrum from Transparent to Friction
| Challenge Type | Bot Resistance | User Friction | Best Used For |
|---|---|---|---|
| Transparent fingerprinting | Moderate — defeats simple bots | Zero — invisible to users | All traffic, passive scoring |
| JS Challenge | High — defeats headless HTTP clients | Low — ~100ms added on first visit | Elevated risk score, first visit |
| Proof-of-Work | High — increases cost per request | Low — 50-200ms compute time | High-value endpoints, DDoS mitigation |
| CAPTCHA (image recognition) | High — defeats automated solvers | High — annoying, accessibility issues | Last resort, very high risk scores |
| Shadow ban / honeypot | High — bots don't know they're blocked | None | Scrapers (show plausible but fake data) |
| Rate limit + 429 response | Low — bots slow down, try again | Low — only affects heavy users | Simple volumetric bot control |
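The proof-of-work row in the table deserves a concrete illustration: the server issues a challenge, the client must find a nonce whose hash clears a difficulty target, and the server verifies in one hash. This is a generic sketch (SHA-256, leading-zero-bits difficulty), not Zlycloud's actual challenge format:

```python
import hashlib
import itertools

def solve_pow(challenge, difficulty_bits=16):
    """Client side: brute-force a nonce so SHA-256(challenge:nonce)
    starts with `difficulty_bits` zero bits. Cheap once per visitor
    (~50-200ms), ruinously expensive at bot request volumes."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify_pow(challenge, nonce, difficulty_bits=16):
    """Server side: a single hash verifies the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```

The asymmetry is the point: each extra difficulty bit doubles the attacker's average cost while the server's verification cost stays at one hash.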
Managing False Positives: Protecting Legitimate Users
The hardest challenge in bot detection is minimizing false positives — blocking or challenging legitimate users. False positives have real costs: frustrated users abandon sessions, accessibility tools may be blocked, corporate proxy users appear as bots, and automated testing pipelines break.
"Every bot management system generates false positives. The measure of a good system isn't zero false positives — it's keeping false positive rate below 0.1% while maintaining 99%+ bot detection accuracy. At 3 billion daily requests, even 0.01% false positives means 300,000 wrongly-challenged legitimate users per day."
Strategies for reducing false positives:
- Score thresholding: Use challenges (JS challenge, PoW) for medium-confidence detections rather than outright blocking. If a real user is falsely challenged, they pass the JS challenge and continue. Only block at very high confidence scores (0.95+).
- Behavioral confirmation: Don't make permanent decisions from a single request. Accumulate evidence across the session before escalating the response.
- Accessibility considerations: Users with assistive technologies may have different behavioral patterns. Screen readers navigate differently; motor impairment affects mouse movement. Factor assistive technology signals into scoring.
- Allowlisting for known-good clients: API clients with valid API keys, monitoring IPs, and verified search engines bypass behavioral scoring.
- Customer feedback loop: Provide mechanisms for users to contest blocks, and use confirmed legitimate-user cases to improve the model.
For organizations securing API endpoints against bot-driven attacks, see our API Security Best Practices guide covering how bot detection integrates with schema validation and rate limiting at the API layer.
Stop Bad Bots Without Blocking Good Traffic
Zlycloud Bot Shield uses ML trained on 3 billion daily requests to distinguish credential stuffers, scrapers, and advanced persistent bots from Googlebot and legitimate users — with a false positive rate under 0.1%. Transparent, no-CAPTCHA detection for 99% of cases.
Start Free Trial →