Deparos is an intelligent content discovery engine that performs directory enumeration, directory fuzzing, and endpoint discovery against web applications. It goes beyond static wordlist brute-forcing by learning from every response - adapting its strategy, growing its wordlists dynamically, and filtering false positives through fingerprint-based soft-404 detection.

How It Works

Target URL


┌──────────────────────────────────────────────────────┐
│  Initialization                                      │
│  1. Probe target, extract host components            │
│  2. Fetch robots.txt                                 │
│  3. Learn baseline fingerprints (3-sample soft-404)  │
│  4. Load prior session data (if resuming)            │
│  5. Generate initial tasks from wordlists + observed │
└──────────────────┬───────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│  Priority Queue                                      │
│  ┌────┬────┬────┬────┬─────┬──────┬──────┬────────┐ │
│  │ P0 │ P1 │ P2 │ P4 │ P5  │ P7   │ P11  │ P12   │ │
│  │Spdr│Obs │Obs │Obs │Short│ExtVar│Long  │Fuzz   │ │
│  │ JS │Name│File│Dir │Word │Numric│Word  │       │ │
│  └────┴────┴────┴────┴─────┴──────┴──────┴────────┘ │
└──────────────────┬───────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│  Payload Coordinator                                 │
│  Expander pulls tasks → Expand() yields payloads     │
│  N workers execute payloads concurrently             │
│                                                      │
│  For each response:                                  │
│    Fingerprint check (soft-404?) ──→ discard         │
│    WAF detection ──→ track/backoff                   │
│    Real discovery ──→ callbacks                      │
└──────────────────┬───────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│  Discovery Callbacks                                 │
│  OnDirectoryDiscovered():                            │
│    • Learn new fingerprints for directory             │
│    • Create recursive tasks (wordlists + observed)   │
│    • Extract breadcrumb directories                  │
│  OnFileDiscovered():                                 │
│    • Extract extension → trigger extension tasks     │
│    • Numeric segment → fuzz ±10 variations           │
│    • Queue extension variant probes (.bak, .old, …)  │
└──────────────────┬───────────────────────────────────┘

        ┌──── loop back to Priority Queue ────┐
        │  (new tasks from discoveries)       │
        └─────────────────────────────────────┘
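The queue in the diagram can be illustrated with a small min-heap. This is only a sketch of the idea, not the engine's code: the priority numbers are copied from the diagram, while the task-kind keys and the TaskQueue class are hypothetical names chosen for the example.

```python
import heapq
import itertools

# Priority numbers copied from the queue diagram; the keys are illustrative names.
PRIORITY = {
    "spider_js": 0,
    "observed_names": 1,
    "observed_files": 2,
    "long_wordlist_dirs": 11,
    "fuzz": 12,
}


class TaskQueue:
    """Min-heap of tasks: the lowest priority number is served first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within one priority

    def push(self, kind, target):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._order), kind, target))

    def pop(self):
        _, _, kind, target = heapq.heappop(self._heap)
        return kind, target

    def __len__(self):
        return len(self._heap)
```

Because discoveries push new tasks while workers are still popping, a spider-extracted URL found late in the scan still jumps ahead of any wordlist or FUZZ work already waiting.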

What Makes It Adaptive

1. Fingerprint-Based Soft-404 Detection

Before scanning, the engine requests 3 random non-existent paths and extracts response attributes (status code, content-type, headers, body hash, content-length ranges). Only attributes stable across all 3 samples become the baseline signature. During scanning, responses matching this signature are discarded as false positives. When an unknown response pattern appears, a 4-strategy wildcard validation (prefix, suffix, extension, middle) confirms whether the discovery is real or a new soft-404 variant - and learns the new pattern.
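A minimal sketch of the baseline-learning step, assuming each response is reduced to a flat dictionary of attributes. The function names are hypothetical, SHA-256 stands in for whatever body hash the engine uses, and exact length equality replaces the content-length ranges mentioned above.

```python
import hashlib


def fingerprint(status, headers, body):
    """Reduce one response to the attributes compared during detection."""
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    return {
        "status": status,
        "content_type": content_type,
        "body_hash": hashlib.sha256(body).hexdigest(),  # assumption: any stable hash works
        "length": len(body),  # simplification: exact length, not a range
    }


def learn_baseline(samples):
    """Keep only the attributes that are identical across every sample."""
    first, rest = samples[0], samples[1:]
    return {k: v for k, v in first.items() if all(s.get(k) == v for s in rest)}


def is_soft_404(baseline, candidate):
    """A response matches when it agrees with every stable baseline attribute."""
    if not baseline:
        return False  # nothing was stable, so nothing can be matched
    return all(candidate.get(k) == v for k, v in baseline.items())
```

If the three error pages echo the requested path, the body hash and length differ between samples and drop out of the baseline, leaving only status and content type. A baseline that loose matches many real pages too, which is why an unknown pattern needs the wildcard validation step before it is accepted or learned.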

2. Observed Collection System

Four data pools grow continuously during the scan:
| Pool | Source | Priority |
| --- | --- | --- |
| Observed Names | Spider links, JS parsing, response body tokenization | P1 |
| Observed Files | Complete filenames from discoveries | P2 |
| Observed Extensions | File extensions from discoveries | P5 |
| Observed Paths | Full path segments from URLs | P4 |
Every newly discovered directory is probed with ALL observed values as high-priority tasks. When a new extension is found for the first time, it triggers tasks across ALL known directories.
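The two triggers described above (new directory, new extension) can be sketched as a cross-product over the pools. The ObservedPools class and its method names are hypothetical, and only the name, extension, and directory pools are modelled.

```python
class ObservedPools:
    """Toy model of the observed pools and the tasks they trigger."""

    def __init__(self):
        self.names = set()
        self.extensions = set()
        self.directories = set()

    def on_directory(self, directory):
        """New directory: probe it with every observed name and extension."""
        if directory in self.directories:
            return []
        self.directories.add(directory)
        return sorted(f"{directory}{n}.{e}" for n in self.names for e in self.extensions)

    def on_extension(self, ext):
        """First sighting of an extension: replay names across all known directories."""
        if ext in self.extensions:
            return []
        self.extensions.add(ext)
        return sorted(f"{d}{n}.{ext}" for d in self.directories for n in self.names)
```

Each call returns only the probes that the new fact makes possible, so repeated sightings of the same directory or extension generate no extra work.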

3. JavaScript Intelligence

Two layers of JS analysis feed endpoints back into the discovery queue:
  • JSScan (embedded binary): Deobfuscates bundled JS, resolves string concatenation, traces variable assignments, and extracts fetch() / XMLHttpRequest / $.ajax call sites into full HTTP request specs.
  • Spider extractors: Parse inline <script> tags and JS string literals for URL patterns.
Extracted endpoints become priority-0 tasks - tested before any wordlist fuzzing.

4. Dynamic Wordlist Growth

Response bodies are tokenized (content-type-aware for HTML, JSON, JS, CSS) to extract candidate words. These feed into the observed name pool and are replayed against every directory.
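As a rough illustration of content-type-aware tokenization, the sketch below handles only HTML and JSON and falls back to a plain regex for everything else. The word pattern and the function names are assumptions for the example; the engine's real tokenizers are not shown in this document.

```python
import json
import re

# Assumption: a candidate word starts with a letter and is at least 3 characters long.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9_-]{2,}")


def _json_words(node):
    """Collect words from JSON keys and string values, recursively."""
    words = set()
    if isinstance(node, dict):
        for key, value in node.items():
            words |= set(WORD.findall(str(key)))
            words |= _json_words(value)
    elif isinstance(node, list):
        for item in node:
            words |= _json_words(item)
    elif isinstance(node, str):
        words |= set(WORD.findall(node))
    return words


def tokenize(body, content_type):
    """Return candidate words from a response body, based on its content type."""
    if "json" in content_type:
        try:
            return _json_words(json.loads(body))
        except ValueError:
            pass  # malformed JSON: fall back to the plain regex below
    if "html" in content_type:
        body = re.sub(r"<[^>]+>", " ", body)  # crude tag stripping, visible text only
    return set(WORD.findall(body))
```

Parsing JSON structurally means keys such as user_id become candidates even though they never appear as visible text, which is the kind of word a static list tends to miss.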

5. Recursive Directory Expansion

When a file is found at /a/b/c/file.txt, the engine extracts /a/, /a/b/, /a/b/c/ as directories to test. Each new directory triggers its own full task set (wordlists + observed + modules).
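The breadcrumb extraction can be written in a few lines. This sketch assumes a path that does not end in a slash names a file; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def breadcrumb_directories(url_or_path):
    """Return every ancestor directory of a URL path, shallowest first."""
    path = urlsplit(url_or_path).path
    segments = [s for s in path.split("/") if s]
    if not path.endswith("/"):
        segments = segments[:-1]  # the last segment is a file name, not a directory
    directories, current = [], "/"
    for segment in segments:
        current += segment + "/"
        directories.append(current)
    return directories
```

For /a/b/c/file.txt this yields /a/, /a/b/, and /a/b/c/, matching the example above.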

Task Types

| Task | Priority | Description |
| --- | --- | --- |
| Spider/JS Extracted | 0 | URLs from link extraction and JS analysis |
| Observed Names | 1 | Filenames seen during scan, replayed per directory |
| Observed Files | 2 | Complete name+extension pairs |
| Observed Paths | 4 | Full path segments from URLs |
| Short Wordlist (files) | 5 | Common filenames from short wordlist x extensions |
| Short Wordlist (dirs) | 6 | Common directory names from short wordlist |
| Extension Variants | 7 | Backup/alternate extensions (.bak, .old, .zip, .tar.gz) |
| Numeric Fuzz | 7 | ±10 variations of numeric path segments |
| Long Wordlist (files) | 9 | Extended filename dictionary x extensions |
| Long Wordlist (dirs) | 11 | Extended directory dictionary |
| FUZZ | 12 | Template-based fuzzing (FUZZ marker replacement) |
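To make the Numeric Fuzz task concrete, here is a sketch that generates the ±10 neighbours of each numeric run in a path. Preserving zero padding and skipping negative values are assumptions made for the example, and the function name is hypothetical.

```python
import re


def numeric_variations(path, spread=10):
    """Yield paths where each numeric run is replaced by its +/- spread neighbours."""
    variants = []
    for match in re.finditer(r"\d+", path):
        original = int(match.group())
        width = len(match.group())  # assumption: keep the original zero padding
        for value in range(max(0, original - spread), original + spread + 1):
            if value == original:
                continue  # the original path is already known
            replacement = str(value).zfill(width)
            variants.append(path[:match.start()] + replacement + path[match.end():])
    return variants
```

A discovery such as /reports/2023.pdf would produce probes from /reports/2013.pdf through /reports/2033.pdf, excluding the file already found.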
Auto-fuzz on low-yield targets. FUZZ tasks are normally opt-in (--fuzz-wordlist, --intensity deep, or a discovery-only run with discover). On a full scan they also turn on automatically when the preceding spidering phase comes up nearly empty: it found very few records, or it was redirected off-host to an SSO/login wall. Such sites frequently still expose hidden routes that don't redirect, so Vigolium brute-forces them anyway. Auto-fuzz always targets the original -t host (the off-host SSO/identity-provider domain is excluded from scope), and a console line announces when it kicks in. Disable it with discovery.auto_fuzz_low_yield: false.

Deduplication

Multiple layers prevent redundant work during the crawl:
  • Task-level: FNV-1a hash prevents duplicate task enqueueing
  • Request-level: Cache prevents sending the same HTTP request twice
  • URL-level: DiskSet tracks processed URLs
  • Body-level: Hash prevents re-analyzing identical responses with JSScan
  • Directory/file trackers: Prevent re-processing the same discovery
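The task-level layer can be sketched with a standard FNV-1a hash over a task key. The 64-bit width, the key layout (kind, directory, payload source), and the TaskDeduper class are assumptions for illustration. Only the use of FNV-1a comes from the list above.

```python
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    digest = FNV64_OFFSET
    for byte in data:
        digest ^= byte
        digest = (digest * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return digest


class TaskDeduper:
    """Remembers task hashes so the same task is never enqueued twice."""

    def __init__(self):
        self._seen = set()

    def should_enqueue(self, kind, directory, payload_source):
        # Assumption: a task is identified by these three fields.
        key = "\x00".join((kind, directory, payload_source)).encode()
        digest = fnv1a_64(key)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Storing an 8-byte digest instead of the full key keeps memory use flat when many directories and wordlists are combined, at the cost of a very small collision probability.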

Output deduplication

Before discoveries are saved and handed to the scanner, two more layers run over the collected results:
  • Exact body dedup: Records are keyed on host + status + response-body hash. Identical bodies served across different paths collapse to one (the shortest path is kept). The hash covers only the response body — not the raw response — so volatile headers (Date, Set-Cookie) and header ordering don’t defeat it.
  • Near-identical cluster cap: A backstop for catch-all / SPA targets that answer 200 with the same page for every path (where each response differs by a few bytes or words, so the exact hash and soft-404 detection can’t collapse them). Records are grouped into clusters by host + status + content-type, with body size and word count each within 0.5%; at most dedup_cluster_cap records (default 10) are kept per cluster, preferring the shallowest paths. Because the band is relative, small distinct responses (e.g. JSON API endpoints) need a near-exact match to cluster and are never collapsed — only large near-identical pages are capped.
The cluster cap is on by default. Set discovery.dedup_cluster_cap: 0 to disable it, or raise/lower the value to keep more/fewer representatives per cluster. It trims both the report and the downstream dynamic-assessment workload, since the capped records are never emitted to the scanner.
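Both output layers can be approximated as follows. This is a sketch under stated assumptions, not the shipped algorithm: SHA-256 stands in for the body hash, word count is taken by whitespace splitting, and clusters are formed greedily around the first (shallowest) record seen. The Record type and the function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Record:
    host: str
    path: str
    status: int
    content_type: str
    body: bytes


def path_depth(path):
    return len([s for s in path.split("/") if s])


def exact_body_dedup(records):
    """Collapse identical bodies per host + status, keeping the shortest path."""
    best = {}
    for rec in records:
        key = (rec.host, rec.status, hashlib.sha256(rec.body).hexdigest())
        if key not in best or len(rec.path) < len(best[key].path):
            best[key] = rec
    return list(best.values())


def _within(a, b, band):
    return abs(a - b) <= band * max(a, b)


def cluster_cap(records, cap=10, band=0.005):
    """Keep at most `cap` records per near-identical cluster; cap=0 disables it."""
    if cap <= 0:
        return list(records)
    clusters, kept = [], []
    # Shallowest paths first, so they are the ones that survive the cap.
    for rec in sorted(records, key=lambda r: (path_depth(r.path), len(r.path))):
        size, words = len(rec.body), len(rec.body.split())
        group = (rec.host, rec.status, rec.content_type)
        for c in clusters:
            if (c["group"] == group and _within(size, c["size"], band)
                    and _within(words, c["words"], band)):
                if c["kept"] < cap:
                    c["kept"] += 1
                    kept.append(rec)
                break
        else:
            clusters.append({"group": group, "size": size, "words": words, "kept": 1})
            kept.append(rec)
    return kept
```

With a 0.5% band, a 10,000-byte page tolerates a difference of about 50 bytes, while a 10-byte JSON body tolerates none, which is why small distinct API responses stay separate.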

Built-In Modules

YAML-configured modules trigger specialized tasks when matching directories are found:
| Module | Triggers On | What It Does |
| --- | --- | --- |
| backup | Any directory | Tests backup extensions (.bak, .old, .zip, .tar.gz) |
| js | Any directory | Tests .js, .mjs, .map extensions |
| api | /api/, /v1/, etc. | REST/GraphQL/SOAP endpoint wordlists |
| admin | /admin/, /manage/ | Admin panel paths |
| docs | /docs/, /api-docs/ | Swagger, OpenAPI, GraphQL playground |
| static | /static/, /assets/ | Blocks recursion to avoid noise |

Supporting Systems

| Component | Purpose |
| --- | --- |
| WAF Detection | Identifies Cloudflare, Akamai, AWS WAF, F5, Imperva, Sucuri, ModSecurity. Tracks consecutive blocks for backoff/early exit |
| Scope Enforcement | Three modes: any (no check), subdomain (same eTLD+1), exact (same host). Checked on every discovery and redirect |
| Case Sensitivity Detection | Auto-detected on first file discovery by re-requesting with altered casing |
| Storage | SQLite-backed sitemap with semantic dedup (FNV-1a-64). Supports session comparison for differential scanning across runs |
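The three scope modes can be sketched as a single check. Note the loud assumption: eTLD+1 is approximated here as the last two host labels, which is wrong for suffixes such as co.uk. A faithful implementation would consult the Public Suffix List. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def registrable_domain(host):
    """ASSUMPTION: naive eTLD+1 (last two labels); real code needs the Public Suffix List."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def in_scope(mode, target_host, candidate_url):
    """Return True when candidate_url is allowed under the given scope mode."""
    host = (urlsplit(candidate_url).hostname or "").lower()
    target = target_host.lower()
    if mode == "any":
        return True  # no check at all
    if mode == "exact":
        return host == target
    if mode == "subdomain":
        return registrable_domain(host) == registrable_domain(target)
    raise ValueError(f"unknown scope mode: {mode}")
```

Running the same check on redirects as on discoveries is what keeps a scan from following a login redirect onto an identity provider's domain.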

Integration with Vigolium

Deparos runs as an input source (DeparosDiscoverySource) in the scanning pipeline. Each discovery is converted to an httpmsg.HttpRequestResponse and fed to the executor as a work item - where it flows through active and passive vulnerability scanning modules.
DeparosDiscoverySource.Next()
  → Engine.Start() → discoveries stream out
  → Convert to httpmsg.HttpRequestResponse
  → Save to DB (optional)
  → Return as WorkItem → Executor → Scanner Modules