# How It Works

## Core Concepts

### States and the State Graph
A State is a DOM snapshot identified by `SHA256(strippedDOM)[:16]`. The State Graph is a directed graph whose nodes are states and whose edges are the actions that caused transitions between them. Navigation between states uses Dijkstra's shortest path, with Yen's K-shortest paths as a fallback.

Near-duplicate detection uses normalized Levenshtein distance with a 10% threshold. For large DOMs (over 10K characters), a sampling-based distance approximation is used for performance.
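The state-ID and near-duplicate checks above can be sketched as follows. This is an illustrative implementation, not the crawler's actual code; the function names are assumptions, and the Levenshtein routine is a standard two-row dynamic-programming version (the sampling optimization for large DOMs is omitted).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// stateID derives the 16-character identifier SHA256(strippedDOM)[:16].
func stateID(strippedDOM string) string {
	sum := sha256.Sum256([]byte(strippedDOM))
	return hex.EncodeToString(sum[:])[:16]
}

// nearDuplicate reports whether two DOMs fall within the 10% normalized
// Levenshtein threshold (distance divided by the longer length).
func nearDuplicate(a, b string) bool {
	maxLen := len(a)
	if len(b) > maxLen {
		maxLen = len(b)
	}
	if maxLen == 0 {
		return true
	}
	return float64(levenshtein(a, b))/float64(maxLen) <= 0.10
}

// levenshtein computes edit distance with two rolling rows.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			d := prev[j] + 1 // deletion
			if cur[j-1]+1 < d {
				d = cur[j-1] + 1 // insertion
			}
			if prev[j-1]+cost < d {
				d = prev[j-1] + cost // substitution
			}
			cur[j] = d
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}
```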
### Actions and Candidate Elements
Candidate clickable elements are discovered via CSS selectors (`a`, `button`, `[onclick]`, `[role=button]`, `input[type=submit]`, and framework-specific bindings such as `[ng-click]` and `[v-on:click]`). Each candidate becomes an Eventable (a graph edge) once fired, linking a source state to a target state with an event type (click, hover, or enter).
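A minimal sketch of the edge structure and the discovery selectors described above; the type and field names here are illustrative assumptions, not the crawler's actual API.

```go
package main

// EventType is the kind of interaction that fired the edge.
type EventType string

const (
	EventClick EventType = "click"
	EventHover EventType = "hover"
	EventEnter EventType = "enter"
)

// Eventable links a source state to a target state via one fired action
// (hypothetical shape of the graph edge described in the text).
type Eventable struct {
	SourceState string    // state ID before firing
	TargetState string    // state ID after firing
	Selector    string    // CSS selector of the element acted on
	Event       EventType // click, hover, or enter
}

// candidateSelectors mirrors the discovery selectors listed above.
var candidateSelectors = []string{
	"a", "button", "[onclick]", "[role=button]",
	"input[type=submit]", "[ng-click]", "[v-on:click]",
}
```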
### Fragments (Visual Page Segmentation)
Pages are decomposed into Fragments: DOM regions identified by XPath, bounding box, subtree size, and content hash. Two modes are supported:

- Landmark (default): fast DOM-based extraction
- VIPS: vision-based page segmentation with multi-pass decreasing thresholds
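The fragment identification fields and the multi-pass threshold schedule can be sketched like this. The struct layout and the threshold values are illustrative assumptions; the source only states that VIPS runs multiple passes with decreasing thresholds.

```go
package main

// Fragment carries the identification fields named in the text
// (hypothetical layout).
type Fragment struct {
	XPath       string
	BoundingBox [4]int // x, y, width, height
	SubtreeSize int    // number of DOM nodes in the region
	ContentHash string
}

// vipsThresholds yields a decreasing threshold per segmentation pass;
// start/step values here are assumptions for illustration.
func vipsThresholds(start, step float64, passes int) []float64 {
	ts := make([]float64, 0, passes)
	for i := 0; i < passes; i++ {
		ts = append(ts, start-float64(i)*step)
	}
	return ts
}
```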
### Form Handling
The Form Handler detects and fills forms with smart value generation:

- Field-name-aware values (email, password, phone, URL, etc.)
- Constraint-aware generation (respects `pattern`, `min`/`max`, and `minlength`/`maxlength`)
- Pairwise fallback when filling all inputs at once fails
- File upload support with type-aware file selection
## Exploration Strategies
| Strategy | Description |
|---|---|
| Default (BFS/DFS) | Deterministic traversal using fragment-based prioritization |
| adaptive | Exp3.1 multi-armed bandit: balances exploitation (known-good actions) with exploration (untried actions) via importance-weighted probability sampling. Rewards are based on new-state discovery. |
## Browser Management
- Embedded binaries: ships Chromium (macOS/Windows/Linux) and ungoogled-Chromium (Linux). Binaries are extracted on first run and cached by version.
- Headless mode: uses `headless=new` when extensions are loaded (the new headless mode supports Chrome extensions, unlike legacy headless).
- Extensions: loaded via `--load-extension` (e.g., uBlock Origin Lite for ad blocking during the crawl).
- Security flags disabled for crawling: `--disable-web-security`, `--ignore-certificate-errors`, `--allow-running-insecure-content`.
- Pool: multiple browser instances with round-robin selection.
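How the flags above might be assembled for launch can be sketched as follows. The function and its signature are hypothetical; the flags themselves are real Chromium switches named in the text.

```go
package main

// crawlFlags builds the Chromium command line for a crawl session
// (illustrative wiring; the real browser pool's API is assumed).
func crawlFlags(extensionDir string, headless bool) []string {
	flags := []string{
		// Security relaxations so cross-origin and self-signed
		// targets can be crawled.
		"--disable-web-security",
		"--ignore-certificate-errors",
		"--allow-running-insecure-content",
	}
	if extensionDir != "" {
		flags = append(flags, "--load-extension="+extensionDir)
	}
	if headless {
		// "new" headless supports extensions, unlike legacy headless.
		flags = append(flags, "--headless=new")
	}
	return flags
}
```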
## Network Capture
Traffic is captured at the browser level (not the page level) via CDP events, covering all tabs, popups, and iframes:

- `Network.requestWillBeSent` → record request
- `Network.responseReceived` → record response headers
- `Network.loadingFinished` → fetch response body

Captured traffic is stored as `httpmsg.HttpRequestResponse` and saved via the `RecordSaver` interface with source `"spidering"`.
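The three-event capture flow can be sketched as a dispatcher keyed on the CDP method name. The `capture` type is hypothetical; the method strings are the actual CDP Network-domain event names.

```go
package main

// capture tallies what each CDP event contributes to one recorded
// exchange (illustrative stand-in for the real capture state).
type capture struct {
	requests  int
	responses int
	bodies    int
}

func (c *capture) handle(method string) {
	switch method {
	case "Network.requestWillBeSent":
		c.requests++ // record the outgoing request
	case "Network.responseReceived":
		c.responses++ // record the response headers
	case "Network.loadingFinished":
		c.bodies++ // body is now complete; fetch it via Network.getResponseBody
	}
}
```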
## Termination Conditions
The crawl stops when any of these conditions is met:

- Maximum states discovered
- Maximum duration elapsed
- Maximum crawl depth reached
- Maximum consecutive failures
- No more candidate actions to explore
- Context cancellation
## Entry Point
The crawl returns a `SpiderResult` with: states discovered, actions executed/failed, forms submitted, duration, and records saved.