This article isn't about Puppeteer being a bad tool. It is an excellent tool. Just like curl. And proper TLS fingerprinting via uTLS will bypass most protections. However, there is a class of tasks where even a perfect network stack won't save you—because detection has long moved beyond HTTP headers and landed at the rendering engine's behavioral level. Let's break down exactly where this boundary lies.

Five years ago, anti-fraud lived at the network level: it looked at IP reputation, checked the User-Agent, and verified the Referer. Today, Cloudflare, Akamai, and DataDome operate across several echelons:

  • L1 — Network: IP reputation, ASN, JA3/JA4 TLS handshake fingerprint, MTU/TTL packet parameters.

  • L2 — Browser/Runtime: navigator.webdriver, WebGL RENDERER/VENDOR, Canvas hash, font enumeration via document.fonts, AudioContext fingerprint, navigator.hardwareConcurrency & deviceMemory. The browser honestly reports the number of CPU cores and RAM capacity. A server-side Chromium on a VPS with one vCPU and 512 MB of memory immediately stands out from a user's laptop (8 cores, 16 GB). In a cloud profile, these values are kept consistent with the rest of the hardware story.

  • L3 — Behavior: Mouse movement, keystroke dynamics, scroll patterns, XHR request timings. The most aggressive L3 implementation is Akamai sensor data: an obfuscated JS bundle (~150 KB) that collects 50+ behavioral signals (accelerometer, touch pressure, mouse movement deviation from a straight line) and sends an encrypted blob to /akam/11/... before any target request. Decoding it manually is an entirely separate discipline.

  • L4 — Passive Network Analysis: HTTP/2 fingerprint (order of HEADERS frame headers, SETTINGS values, window size), TCP stack (buffer size, SYN packet options). Akamai Bot Manager builds a fingerprint precisely at this level—before the browser has sent its first byte of JavaScript.
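To make the L2 echelon concrete, here is a toy consistency check over a navigator-like snapshot. The field names, thresholds, and scoring are my own illustrative assumptions, not any vendor's real logic:

```javascript
// Illustrative L2 consistency check — NOT any anti-fraud vendor's real logic.
// Takes a navigator-like snapshot and returns the anomalies it would flag.
function scoreRuntimeSignals(env) {
  const flags = [];
  if (env.webdriver) flags.push("navigator.webdriver is true");
  if (env.hardwareConcurrency <= 2) flags.push("suspiciously few CPU cores");
  if (env.deviceMemory !== undefined && env.deviceMemory < 2) {
    flags.push("less RAM than a typical consumer device");
  }
  if (env.pluginsLength === 0) flags.push("empty navigator.plugins (headless hint)");
  return flags;
}

// A bare server Chromium vs. a typical laptop profile:
const vps = { webdriver: true, hardwareConcurrency: 1, deviceMemory: 0.5, pluginsLength: 0 };
const laptop = { webdriver: false, hardwareConcurrency: 8, deviceMemory: 8, pluginsLength: 5 };
console.log(scoreRuntimeSignals(vps));           // several flags
console.log(scoreRuntimeSignals(laptop).length); // 0
```

Real systems correlate dozens of such signals against each other and against the network layer; any single flag is a hint, not a verdict.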

Bypassing L1 with raw curl and proper uTLS is realistic. Bypassing L2-L3 without a full-fledged browser is no longer possible. This is exactly where the scope of "pure" tools ends and the domain of cloud browsers (BaaS) begins.

Connection Architecture: From Script to Target

In a classic setup, your server makes the requests itself, wrestles with TLS fingerprints, and tries not to "leak" via WebRTC. In a cloud browser setup, your code acts merely as a remote control. The "heavy lifting" happens in the cloud, where a fully managed browser infrastructure is deployed.

Hitting a specific wall is what forced me to look for BaaS solutions. Local Playwright with residential proxies stopped passing L2 detections: the Canvas hash "leaked" hardware mismatches, the WebGL RENDERER exposed a server VM, and the behavioral track was too "smooth." Stealth plugins were patching symptoms, not the cause. I needed a tool where the browser context is formed from scratch on the right hardware, rather than patched over. That's when NodeMaven Scraping Browser came up—not as a recommendation, but as one of the few services with proper CDP documentation.

Where Does the Local Stack Hit the Ceiling?

The main pain point of local automation is leakage. Even when using residential proxies, the real hardware fingerprint often leaks through Puppeteer's patches.

  • Hardware Fingerprinting: Sites ask the browser to render a complex 3D shape via WebGL. The way a graphics card processes shadows and anti-aliasing is unique. Cloud browsers mask this metadata, serving valid but substituted data instead of the real hardware.

  • Stack Synchronization: Anti-fraud detects a mismatch between your IP (home provider) and your data center server's MTU/TTL parameters. In BaaS solutions, the proxy layer and the browser are synchronized at the infrastructure level.

  • Resource Intensity: Running 50 Chrome windows locally will crash an average VPS. The cloud allows you to open sessions in parallel (scaling is only limited by your plan) since JS rendering and DOM processing happen on the provider's side.

Anatomy of a Session: What's Under the Hood?

When forming a connection string, you are assembling your bot's "identity." Everything from GEO to session lifetime (sid) and a unique fingerprint profile (pid) is specified in the URL parameters.

// Playwright connection example
const { chromium } = require("playwright");

(async () => {
  const endpoint = "wss://user_country-us_sid-888:pass@browser.nodemaven.com";
  const browser = await chromium.connectOverCDP(endpoint);
  const page = await browser.newPage();
  
  await page.goto("https://target-website.com");
  // All CPU load is in the cloud; you only get the result
  await browser.close();
})();

Practical Test: The "Lie Detector"

To test the actual pool quality, I picked an exotic GEO—Fiji (Vodafone). The logic is simple: if a provider mixes in cheap data-center IPs, an exotic location will expose it faster than a standard US/DE node. I ran it through three independent public checkers: Scamalytics, IPQualityScore, and Pixelscan. I specifically looked at: Fraud Score (IP reputation), connection type (ISP/datacenter/mobile), and the presence of DNS and WebRTC leaks.

The results were expectedly clean for a mobile channel: Fraud Score at 0–2 on Scamalytics, connection type identified as ISP/Mobile (Vodafone Fiji Limited), with no DNS or WebRTC leaks. This doesn't mean it's a "perfect IP"—it means the address isn't flagged in known spam databases and behaves like a real subscriber.

| Parameter   | Result                | Tool        |
|-------------|-----------------------|-------------|
| Fraud Score | 2/100                 | Scamalytics |
| Fingerprint | Consistent            | Pixelscan   |
| Latency     | ~150–250 ms (for WSS) | Custom ping |

When Is It Worth Paying Extra for BaaS?

A cloud browser is not a magic wand; it's a tool for resource optimization and bypassing L2-L3 protections. In my scenarios, BaaS turned out to be more expensive than regular proxies in direct costs, but cheaper in terms of maintenance time: no need to constantly fix stealth patches and administer a dedicated server fleet for dozens of Chromium instances. It's not a universal solution, but an option that makes sense when you hit a wall specifically with L2–L3 detections.

Anatomy of a Scraping Browser: WSS, Autoscaling, and Real Latency

When you work with standard Puppeteer locally, the library launches a Chromium instance directly on your system. This consumes hundreds of megabytes of RAM per tab. With a Scraping Browser, interaction shifts to the WSS (WebSocket Secure) protocol level.

Dissecting the WebSocket Connection

Instead of launching a chrome.exe process, your script initiates a secure connection with a remote endpoint. This isn't just an HTTP "send-receive" request. WSS creates a persistent two-way communication channel. This is critical for CDP (Chrome DevTools Protocol)—through it, your Playwright or Puppeteer sends commands (click, type, wait for selector) and receives events back in real-time.

A nuance not written in the documentation: the mere fact of a CDP connection is potentially detectable. Some sites check if the --remote-debugging-port is open via window.chrome.runtime and non-standard properties of the performance object. Scraping Browser solves this through isolation—the CDP tunnel exists only between your script and the cloud; the target site sees a clean Chromium without debugging artifacts.

The Connection String as a Config

The most interesting part is buried right in the URL. It's not just an address; it's a full-fledged one-line configuration file:

wss://{user}_country-us_sid-888_pid-123:{pass}@browser.nodemaven.com

By changing parameters, you switch GEOs on the fly or bind a session to a specific IP without restarting the code:

  • country-us: Forces the browser to boot with American "hardware" and IP.

  • sid (Session ID): Any random string turns the session into a "Sticky" one. As long as you use the same sid, you stay on the same IP, which is critical for logins or cart management.

  • pid (Profile ID): A unique fingerprint. The site will "recognize" you as an old acquaintance, even if you return a week later.
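Since the whole configuration lives in the login string, it is convenient to assemble it programmatically. A minimal sketch, assuming the underscore-separated key-value format from the example above parses the way the provider's documentation shows:

```javascript
// Assembles the WSS endpoint from key-value parameters, using the
// underscore-separated format shown above. The exact format accepted by
// the provider's parser is an assumption based on the documented examples.
function buildEndpoint({ user, pass, host = "browser.nodemaven.com", ...params }) {
  const login = [user, ...Object.entries(params).map(([k, v]) => `${k}-${v}`)].join("_");
  return `wss://${login}:${pass}@${host}`;
}

const endpoint = buildEndpoint({
  user: "your_user",
  pass: "your_password",
  country: "us",
  sid: "888", // sticky session: same sid → same IP
  pid: "123", // persistent fingerprint profile
});
console.log(endpoint);
// wss://your_user_country-us_sid-888_pid-123:your_password@browser.nodemaven.com
```

Switching GEO or rotating identity then becomes a matter of changing one field in an options object rather than string surgery.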

The Advantage: Unlimited Concurrent Sessions

The main feature of the cloud is autoscaling. Since all DOM rendering, heavy JavaScript execution, and graphics processing happen on the provider's side, the load on your local CPU and RAM trends toward zero. You can run 100 or even 500 parallel sessions from a standard laptop. The infrastructure distributes resources itself, spinning up new browser instances for your request. To anti-fraud systems, this looks like hundreds of different users logging in from different devices, even though they are controlled from a single script.

An Important Nuance About the Network Stack

There is a common misconception here: "zero load" refers only to your server's CPU and RAM—rendering truly happens in the cloud. However, if you pull 500 parallel sessions from sites that load images, videos, and ad scripts, your network channel still has to digest the incoming data stream via WSS.

The solution is simple: use page.route() in Playwright (or its equivalent in Puppeteer) to block unnecessary resources directly on the cloud browser side, before the data is transferred to you:

await page.route('**/*.{png,jpg,jpeg,gif,svg,mp4,webp,woff2}', r => r.abort());
await page.route('**/{ads,analytics,tracking}/**', r => r.abort());

This saves both your traffic and channel bandwidth, which is especially critical when working with media-heavy sites like TikTok.

Goodbye, Overheating: Admin Time, Not CPU Cycles

You often hear: "One instance of Chromium eats RAM, the cloud saves your CPU." An engineer will reply: "So what? I have a 64-core server, I don't care about CPU, I care about profit."

The truth lies elsewhere. Shifting rendering to a BaaS saves the system administrator's and developer's time, not just hardware resources. You don't need to administer a farm of 20 servers with X11 and VMs just to stably run 500 browsers, regularly update their profiles, and monitor memory leaks.

Let's Talk Numbers: What About Latency?

Practical measurements (measured via Date.now() around a CDP command):

| Operation                      | Local Chromium | Cloud Browser (US→US) | Cloud Browser (RU→US) |
|--------------------------------|----------------|-----------------------|-----------------------|
| page.goto() + DOMContentLoaded | ~180 ms        | ~320 ms               | ~580 ms               |
| page.click() → response        | ~15 ms         | ~45 ms                | ~90 ms                |
| page.waitForResponse() XHR     | ~200 ms        | ~350 ms               | ~620 ms               |

(Measurement: median of 20 runs between 14:00–16:00 UTC; target: a public portal with UAM enabled; configuration: country-us, fixed sid.)

The honest conclusion: the overhead of a WSS tunnel is a real ~30–100 ms per command depending on geography. For tasks where raw throughput is paramount (mass data scraping without authorization), this matters. For tasks where the "cleanliness" of bypassing anti-fraud is crucial (login, registration, cart operations), these milliseconds are not critical—you still need human-like delays between actions there anyway.
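Since human-like pauses are needed anyway, here is a minimal jittered-delay helper. The injectable rng parameter is my own design choice (so the math is testable); in production you would just pass Math.random. This is a sketch: real behavioral models are far richer than uniform jitter.

```javascript
// Computes a jittered pause length in milliseconds. The rng is injectable
// so the function is deterministic under test; defaults to Math.random.
function jitterMs(minMs, maxMs, rng = Math.random) {
  return Math.floor(minMs + rng() * (maxMs - minMs));
}

// Resolves after a "human" pause — drop between Playwright actions.
function humanPause(minMs, maxMs, rng = Math.random) {
  return new Promise(resolve => setTimeout(resolve, jitterMs(minMs, maxMs, rng)));
}

// Usage:
// await page.click('.add-to-cart');
// await humanPause(400, 1500);
// await page.click('.checkout');
console.log(jitterMs(400, 1500, () => 0.5)); // 950
```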

A separate scenario is Cloudflare Turnstile (a replacement for classic CAPTCHA). Unlike hCaptcha, Turnstile doesn't show the user anything visual: it silently runs a JS challenge in the background and issues a token. For a cloud browser, this means an additional 2–4 seconds on the first page.goto()—Chromium must pass the challenge before serving content. Factor this into your timeouts.

Code Examples: From "Hello World" to Real Tasks

Integration looks as native as possible. Instead of launching a local browser with the chromium.launch() command, we use connectOverCDP().

// Basic connection in 30 seconds
const { chromium } = require("playwright");

(async () => {
  const auth = "your_user_country-us_sid-random123:your_password";
  const endpoint = `wss://${auth}@browser.nodemaven.com`;

  console.log("Connecting to the cloud browser...");
  const browser = await chromium.connectOverCDP(endpoint);
  const page = await browser.newPage();
  
  await page.goto("https://target-website.com");
  console.log("Page title:", await page.title());

  await browser.close();
})();

Case 1: Intercepting a Hidden API Behind a Chain of Redirects

Many sites don't serve data directly—they make XHR requests to an internal API after several redirects and JS initialization. Solution: we intercept the responses with a network hook directly in the cloud browser.

// ...browser connection...

// Block unnecessary items before loading to save traffic
await page.route('**/*.{png,jpg,gif,svg,mp4,woff2}', r => r.abort());

// Intercept the target API response hiding behind JS initialization
const apiDataPromise = page.waitForResponse(
  resp => resp.url().includes('/api/v2/products') && resp.status() === 200
);

// Set custom headers to simulate a real request from the app
await page.setExtraHTTPHeaders({
  'X-Requested-With': 'XMLHttpRequest',
  'Accept-Language': 'en-US,en;q=0.9',
});

await page.goto('https://target-site.com/catalog', { waitUntil: 'networkidle' });

const apiResponse = await apiDataPromise;
const json = await apiResponse.json();
console.log('Products received:', json.items.length);

// ...close browser...

Important: don't kill your own trust score with your code. A perfect cloud profile won't save you if your script behaves like a robot. A standard await page.click('.button') in Playwright literally "teleports" the cursor to the exact geometric center of the element in 0 milliseconds, in a straight line. For the L3 echelon, this is an instant detection. Always randomize click coordinates (offset) and use human-like trajectory generation (Bézier curves) for mouse movement prior to clicking.
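A curved trajectory can be sketched with a cubic Bézier. The control points should be randomized per move; here they are explicit parameters so the output is reproducible — this is an illustration of the math, not a full behavioral library:

```javascript
// Samples points along a cubic Bézier curve between p0 and p3.
// c1/c2 are control points — randomize them per move in production;
// they are explicit here so the output is deterministic.
function bezierPath(p0, c1, c2, p3, steps = 25) {
  const pts = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps, u = 1 - t;
    pts.push({
      x: u * u * u * p0.x + 3 * u * u * t * c1.x + 3 * u * t * t * c2.x + t * t * t * p3.x,
      y: u * u * u * p0.y + 3 * u * u * t * c1.y + 3 * u * t * t * c2.y + t * t * t * p3.y,
    });
  }
  return pts;
}

// Feed the points to page.mouse.move() with small jittered delays, then
// click with a random offset from the element's center:
const path = bezierPath({ x: 100, y: 100 }, { x: 180, y: 40 }, { x: 260, y: 220 }, { x: 300, y: 150 });
console.log(path.length); // 26 points along a curve, not a straight line
```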

Case 2: CDP Session for Live CAPTCHA Solving

If a site throws a CAPTCHA, don't panic. Via CDP, you can pass the session URL to a human operator or a solver service, which will solve it in real-time right inside the exact same browser context.

// ...inside an async function after page.goto()...

// Create a CDP session for the current page
const cdpSession = await page.context().newCDPSession(page);

// Get the inspect link — this can be opened in local Chrome DevTools
const { frameTree } = await cdpSession.send('Page.getFrameTree');
console.log('Frame URL:', frameTree.frame.url);

// Wait for the signal of successful CAPTCHA completion
// (a human or solver opens the URL and clicks in the cloud browser)
await page.waitForFunction(() => !document.querySelector('.captcha-container'));
console.log('CAPTCHA solved, continuing...');

The fundamental difference from a local solution: the browser context already has the "correct" fingerprint from the very first byte—solving the CAPTCHA doesn't create anomalies in the behavioral track.

Combat Case: Cloudflare UAM

Objective: Automate data collection from a portal that enables UAM (Under Attack Mode) during peak hours.

Baseline: Raw curl is blocked instantly (JA3 mismatch). Puppeteer on server IPs passes the JS Challenge but doesn't get a cf_clearance due to the data center ASN.

Solution: Scraping Browser + residential ISP. Passes the JS Challenge in 5-7 seconds, issues the cookie, establishes the session.

// Cloudflare UAM triggers a JS Challenge — wait for it to pass
await page.waitForFunction(() => {
  return !document.title.includes('Just a moment');
}, { timeout: 30000 });

const cookies = await page.context().cookies();
const cfClearance = cookies.find(c => c.name === 'cf_clearance');
console.log('cf_clearance received:', cfClearance?.value);

Fingerprinting: How Sites Determine Your "Hardware" — And What TLS Has to Do With It

If you thought IP and User-Agent were all a site knows about you, I have bad news. Modern anti-fraud operates in multiple echelons, and the most underestimated of them is the network level.

JA3/JA4 — TLS Handshake Fingerprint

When your client opens an HTTPS connection, it sends a ClientHello with a set of supported ciphers, extensions, and their order. This set is hashed into a JA3 fingerprint (or the newer JA4). Chrome 120 has one, curl has a radically different one, Python requests has a third. Cloudflare sees this hash before you've even sent the first HTTP byte.

How Scraping Browser Solves This:

Since the connection to the target site is established by the cloud Chromium itself, not your script, the JA3 fingerprint is identical to a real browser of the required version. Your connectOverCDP is merely a tunnel inside an already established TLS connection.

It's important to remember that TLS is only half the picture. Cloudflare and Akamai additionally build an HTTP/2 fingerprint: the order of pseudo-headers (:method, :path, :authority), SETTINGS frame values, and the initial window size. Chrome's pattern here is specific and differs from curl or Requests even with an identical JA3. The tls-fingerprint tool by lwthiker allows you to check your fingerprint publicly.

Fonts and Plugins — A Silent Leak

Through document.fonts.check(), a site iterates through hundreds of system fonts. The font set is unique to the OS and locale. Similarly with navigator.plugins: on a server Chromium, there are usually zero, which is an immediate red flag. A cloud browser emulates profiles with a realistic set of fonts and plugins for a specific platform via the pid (Profile ID).
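The font check reduces to a cross-signal consistency test: do the reported fonts plausibly match the claimed platform? A toy sketch — the marker fonts below are my own illustrative picks, not a real anti-fraud ruleset:

```javascript
// Illustrative cross-check: does the reported font set plausibly match
// the claimed platform? Marker fonts are assumptions for illustration.
const PLATFORM_FONTS = {
  windows: ["Segoe UI", "Calibri"],
  macos: ["Helvetica Neue", "Menlo"],
};

function fontsMatchPlatform(platform, reportedFonts) {
  const markers = PLATFORM_FONTS[platform] || [];
  return markers.every(f => reportedFonts.includes(f));
}

console.log(fontsMatchPlatform("windows", ["Segoe UI", "Calibri", "Arial"])); // true
// A Linux server claiming a Windows User-Agent is exactly the kind of
// mismatch this catches:
console.log(fontsMatchPlatform("windows", ["DejaVu Sans", "Liberation Sans"])); // false
```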

The Four Horsemen of L2 Detection:

  • Headless Anomalies: Bots used to be caught by the navigator.webdriver = true flag. Now they look at indirect signs: the absence of the Chrome PDF Viewer plugin, zero window.outerWidth/outerHeight dimensions, or blocked notifications (Notification.permission). BaaS infrastructure runs instances in full headful mode (with a virtual display) or uses a deeply patched --headless=new mode, where the entire graphics stack is identical to desktop.

  • Canvas Fingerprinting: A site asks the browser to render hidden text or a shape. The result depends on the graphics card, drivers, and fonts. A one-pixel difference, and you are "suspicious."

  • WebGL Metadata: Deep scanning of the graphics subsystem. Anti-fraud sees the GPU model, VRAM capacity, and 3D scene rendering nuances.

  • WebRTC Leakage: A technology that frequently "leaks" your real local IP, bypassing any proxies.

Masking vs. Noise: How Not to Shoot Yourself in the Foot

Many beginners try to simply disable JavaScript or block hardware data transmission. To a security system, an "empty" profile is just as much of a bot marker as a server IP. Scraping Browser takes the Masking approach:

  • Real Hardware Parameters: Instead of blocking, the browser returns valid but substituted data of real hardware.

  • Canvas Noise: Minimal "noise" is added to the rendering, altering the fingerprint hash but keeping it similar to the result of a regular device.

  • Resolution & Language: You set a screen resolution and system language that 100% match your proxy's GEO.

Important Nuance: Modern anti-fraud systems rarely hand out a hard block with an HTTP 403. A shadow ban triggers much more often—requests go through but return empty results, faked prices, or demoted search outputs. This is harder to detect on the scraper's side. The symptom: a discrepancy in data between an authorized session and the bot during identical requests.
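A practical way to catch a shadow ban is a canary check: periodically compare a trusted reference sample (e.g., from an authorized manual session) against what the bot sees for identical queries. A minimal sketch with an assumed discrepancy threshold:

```javascript
// Shadow bans rarely return 403 — they return *different* data.
// Canary check: fraction of keys where the bot's view diverges from a
// trusted reference sample. The 0.3 threshold below is an assumption.
function discrepancyRate(reference, observed) {
  const keys = Object.keys(reference);
  const mismatches = keys.filter(k => reference[k] !== observed[k]);
  return mismatches.length / keys.length;
}

// Prices from an authorized session vs. the bot's view of the same SKUs:
const trusted = { "sku-1": 19.99, "sku-2": 5.49, "sku-3": 102.0 };
const botView = { "sku-1": 24.99, "sku-2": 5.49, "sku-3": 129.0 }; // faked prices
const rate = discrepancyRate(trusted, botView);
console.log(rate > 0.3 ? "possible shadow ban" : "ok"); // possible shadow ban
```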

A related point: cheap open-source stealth plugins often make a fatal mistake—they mix Math.random() into every toDataURL() call. Anti-fraud (DataDome, for example) simply asks the browser to render the canvas twice in a row within a millisecond. If the hashes differ without any DOM changes, you are a bot. Cloud browsers apply deterministic noise (tied to the pid): it distorts the image, but always identically within a given session, simulating a real hardware quirk of a specific GPU.
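The deterministic-noise idea can be demonstrated with a seeded PRNG: the same profile id always produces the same perturbation, so a double-render check passes. This is my own sketch of the principle (using the well-known mulberry32 generator), not any vendor's implementation:

```javascript
// mulberry32 — a small, well-known seeded PRNG.
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Perturb pixel values by ±1, deterministically per profile id: rendering
// the same canvas twice yields an identical hash — unlike Math.random()
// patches, which fail the double-render check described above.
function noisePixels(pixels, pid) {
  const rng = mulberry32(pid);
  return pixels.map(p => p + (rng() < 0.5 ? -1 : 1));
}

const px = [120, 121, 122, 123];
const a = noisePixels(px, 42);
const b = noisePixels(px, 42);
console.log(JSON.stringify(a) === JSON.stringify(b)); // true — stable within a session
```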

One-Click "Identity" Configuration

In the profile creation interface, this looks like a survival checklist: Windows/macOS/Linux emulation, time zone spoofing, and simulation of specific graphics card models. When you connect via pid, all these settings are already "glued" to your session.

Proxy Management: From Residential to Static ISP

Proxies are the foundation of trust. If your "hardware" is perfect but the request comes from a spammed data center, anti-fraud will shoot you down on takeoff.

  • Rotating: Every new request is a new IP. Ideal for mass scraping of thousands of pages without authorization.

  • Sticky: Holding a single IP for up to 24 hours. Critical for multi-accounting: you log in, browse the site, and put an item in the cart.

Quality Filter: How Not to Feed "Dirty" IPs to Anti-Fraud

The interface implements a filtration mechanic:

  • Quality (Default): Filters out addresses with a low Trust Score or those on blacklists.

  • Quality + Speed: Prioritizes cleanliness and minimum ping.

  • Max Pool Size: The widest possible base, if volume is more important than cleanliness.

Deep Targeting via WSS String

You can change locations "on the fly" simply by modifying the connection string:

wss://{username}_country-{country}-region-{region}-city-{city}-isp-{isp}-sid-{sid}:{password}@browser.nodemaven.com

  • country-us: Selects the exit country.

  • region-new_york: Access to local state content.

  • city-brooklyn: Targeting down to the city level.

  • isp-t_mobile: Simulating a user of a specific provider.

  • sid-{any_string}: Any string makes the session "sticky."

Advanced Automation: API and Live Debugging

If you need to manage hundreds of accounts or build your own service on top of the infrastructure, manual profile creation isn't enough. This is where the NodeMaven API v2 comes into play.

Edge Case: Catching a "Runaway" Sub-User

Listing endpoints is boring, so here's a real-world situation. In production, we had several scripts, each under its own sub-user. One of them started receiving mass 403 errors (the site changed its structure). The script didn't crash; it kept hammering requests, burning through residential traffic at ~2 GB/hour.

Solution: A watchdog via API.

import requests, time

SUB_USER = "project_parser_v2"
API_KEY = "your_api_key"
ERROR_THRESHOLD = 50   # 403 error limit
CHECK_INTERVAL = 60   # check interval (sec)

while True:
    # Request stats for a specific sub-user
    stats = requests.get(
        f"https://api.nodemaven.com/v2/statistics/data/?sub_user={SUB_USER}",
        headers={"Authorization": f"Bearer {API_KEY}"}
    ).json()

    error_rate = stats.get("errors_403", 0)
    if error_rate > ERROR_THRESHOLD:
        # Block sub-user via API until causes are determined
        requests.post(
            f"https://api.nodemaven.com/v2/sub-users/{SUB_USER}/disable/",
            headers={"Authorization": f"Bearer {API_KEY}"}
        )
        print(f"[ALERT] Sub-user {SUB_USER} disabled: {error_rate} errors")
        break
    time.sleep(CHECK_INTERVAL)

This is exactly the main reason why an API is necessary: programmatic control of budget and safety.

Live Debugging: Looking Into the Cloud Bot's "Brain"

Usually, working with a remote browser is a "black box." But CDP support turns it into transparent glass. You create a CDP session via page.context().newCDPSession(page), request Page.inspect, and get a link. Now you can see in real-time how the bot solves a CAPTCHA or at what step the layout "broke."

The Pixelscan Paradox: Why Are Just Proxies Not Enough?

If we run this proxy through Pixelscan in a standard anti-detect setup, we often see: Location: Green, but Hardware: Red (Unreliable).

This is classic: anti-fraud spots a mismatch between the hardware and the network fingerprint. Scraping Browser solves this at the browserContext formation stage. WebGL, AudioContext, and font parameters are patched at the Chromium runtime level before the page even loads. To anti-fraud, "red zones" simply do not exist—it sees a consistent profile from the very first byte.

Conclusion: An Honest Look at BaaS

Browser as a Service (BaaS) is a specialized tool. BaaS with residential proxies will cost about 5 times more than regular data-center addresses. If you just need to download images from unprotected sites, this is "overkill" (using a sledgehammer to crack a nut).

When you actually need it:

  • You've hit a wall with L2-L3 detections (Cloudflare/Akamai), and stealth plugins aren't saving you.

  • IP reputation and a native TLS fingerprint are crucial.

  • You don't want to administer Docker farms for hundreds of Chromium instances and battle memory leaks.

Promo Codes for Readers: The article was written for specific tasks—your scenarios might yield different numbers. If you want to replicate the tests, the guys at NodeMaven provided promo codes: PROXY35 (−35% on mobile and residential) and PROXY40 (−40% on ISP). The scripts from the article can be run without changes.

If you have ideas for a breakdown, found an error in a config, or want to suggest a topic, write to aleksandr@murzin.digital. I reply.