Deparos - Modern Adaptive Content Discovery

Deparos is an intelligent content discovery engine that performs directory enumeration, directory fuzzing, and endpoint discovery against web applications. It goes beyond static wordlist brute-forcing by learning from every response - adapting its strategy, growing its wordlists dynamically, and filtering false positives through fingerprint-based soft-404 detection.

How It Works

Target URL
  │
  ▼
┌──────────────────────────────────────────────────────┐
│  Initialization                                      │
│  1. Probe target, extract host components            │
│  2. Fetch robots.txt                                 │
│  3. Learn baseline fingerprints (3-sample soft-404)  │
│  4. Load prior session data (if resuming)            │
│  5. Generate initial tasks from wordlists + observed │
└──────────────────┬───────────────────────────────────┘
                   ▼
┌──────────────────────────────────────────────────────┐
│  Priority Queue                                      │
│  ┌────┬────┬────┬────┬─────┬──────┬──────┬────────┐  │
│  │ P0 │ P1 │ P2 │ P4 │ P5  │ P7   │ P11  │ P12    │  │
│  │Spdr│Obs │Obs │Obs │Short│ExtVar│Long  │Fuzz    │  │
│  │ JS │Name│File│Dir │Word │Numric│Word  │        │  │
│  └────┴────┴────┴────┴─────┴──────┴──────┴────────┘  │
└──────────────────┬───────────────────────────────────┘
                   ▼
┌──────────────────────────────────────────────────────┐
│  Payload Coordinator                                 │
│  Expander pulls tasks → Expand() yields payloads     │
│  N workers execute payloads concurrently             │
│                                                      │
│  For each response:                                  │
│    Fingerprint check (soft-404?) ──→ discard         │
│    WAF detection ──→ track/backoff                   │
│    Real discovery ──→ callbacks                      │
└──────────────────┬───────────────────────────────────┘
                   ▼
┌──────────────────────────────────────────────────────┐
│  Discovery Callbacks                                 │
│  OnDirectoryDiscovered():                            │
│    • Learn new fingerprints for directory             │
│    • Create recursive tasks (wordlists + observed)   │
│    • Extract breadcrumb directories                  │
│  OnFileDiscovered():                                 │
│    • Extract extension → trigger extension tasks     │
│    • Numeric segment → fuzz ±10 variations           │
│    • Queue extension variant probes (.bak, .old, …)  │
└──────────────────┬───────────────────────────────────┘
                   ▼
        ┌──── loop back to Priority Queue ────┐
        │  (new tasks from discoveries)       │
        └─────────────────────────────────────┘
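
The loop above can be pictured as a priority queue that workers drain while discovery callbacks keep pushing new tasks into it. The following is a minimal sketch, not Deparos's actual implementation: `Task`, `TaskQueue`, and `drain_demo` are hypothetical names, and the only thing taken from the diagram is that a lower priority number runs first.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Ordering uses (priority, seq); seq keeps FIFO order within one priority.
    priority: int
    seq: int
    kind: str = field(compare=False)
    target: str = field(compare=False)


class TaskQueue:
    """Hypothetical min-heap of tasks; lower priority number is served first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, priority, kind, target):
        heapq.heappush(self._heap, Task(priority, next(self._counter), kind, target))

    def pop(self):
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


def drain_demo():
    """Show that a task queued mid-scan still jumps ahead of lower-priority work."""
    q = TaskQueue()
    q.push(12, "fuzz", "/FUZZ")
    q.push(5, "short-wordlist", "/")
    q.push(0, "spider-js", "/api/users")
    first = q.pop()                       # the P0 task runs first
    q.push(1, "observed-names", "/api/")  # a discovery callback adds new work
    return [first.kind] + [q.pop().kind for _ in range(len(q))]
```

In this sketch the observed-name task created by a discovery is served before the wordlist and fuzz tasks that were already waiting, which is the behaviour the loop-back arrow describes.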

What Makes It Adaptive

1. Fingerprint-Based Soft-404 Detection

Before scanning, the engine requests 3 random non-existent paths and extracts response attributes (status code, content-type, headers, body hash, content-length ranges). Only attributes stable across all 3 samples become the baseline signature. During scanning, responses matching this signature are discarded as false positives. When an unknown response pattern appears, a 4-strategy wildcard validation (prefix, suffix, extension, middle) confirms whether the discovery is real or a new soft-404 variant - and learns the new pattern.
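
As an illustration of the "stable across all samples" rule, here is a minimal sketch under simplifying assumptions: it considers only status, content type, body hash, and exact body length, and leaves out header attributes and content-length ranges. `Response`, `learn_baseline`, and `matches_baseline` are hypothetical names, not the engine's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    status: int
    content_type: str
    body: bytes


def attributes(resp):
    """Extract the comparable attributes of one response (simplified set)."""
    return {
        "status": resp.status,
        "content_type": resp.content_type,
        "body_hash": hashlib.sha256(resp.body).hexdigest(),
        "length": len(resp.body),
    }


def learn_baseline(samples):
    """Keep only the attributes whose value is identical in every sample."""
    attr_sets = [attributes(s) for s in samples]
    first = attr_sets[0]
    return {k: v for k, v in first.items() if all(a[k] == v for a in attr_sets[1:])}


def matches_baseline(resp, baseline):
    """A response is treated as a soft-404 if it agrees on every stable attribute."""
    if not baseline:
        return False  # nothing stable was learned, so nothing can match
    attrs = attributes(resp)
    return all(attrs[k] == v for k, v in baseline.items())
```

If the three probe responses echo the requested path in the body, the body hash and length differ between samples and drop out of the signature, leaving only the attributes that really are constant.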

2. Observed Collection System

Four data pools grow continuously during the scan:

| Pool | Source | Priority |
| --- | --- | --- |
| Observed Names | Spider links, JS parsing, response body tokenization | P1 |
| Observed Files | Complete filenames from discoveries | P2 |
| Observed Extensions | File extensions from discoveries | P5 |
| Observed Paths | Full path segments from URLs | P4 |

Every newly discovered directory is probed with ALL observed values as high-priority tasks. When a new extension is found for the first time, it triggers tasks across ALL known directories.
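
The two fan-out rules in the paragraph above can be sketched as follows. This is an assumption-laden illustration: `ObservedState` and its methods are hypothetical, only the name and extension pools are modelled, and the shape of the generated tasks (candidate paths, or directory and extension pairs) is invented for the example.

```python
class ObservedState:
    """Hypothetical container for the growing pools and the known directories."""

    def __init__(self):
        self.names = set()
        self.extensions = set()
        self.directories = set()

    def add_name(self, name):
        self.names.add(name)

    def on_new_directory(self, directory):
        """Probe a newly found directory with every observed name."""
        if directory in self.directories:
            return []  # already processed, no new work
        self.directories.add(directory)
        return [directory + name for name in sorted(self.names)]

    def on_new_extension(self, ext):
        """The first sighting of an extension fans out to all known directories."""
        if ext in self.extensions:
            return []  # only the first sighting triggers tasks
        self.extensions.add(ext)
        return [(directory, ext) for directory in sorted(self.directories)]
```

The design point is the asymmetry: a new directory is replayed against everything already observed, while a new observation is replayed against every directory already known.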

3. JavaScript Intelligence

Two layers of JS analysis feed endpoints back into the discovery queue:

- JSScan (embedded binary): Deobfuscates bundled JS, resolves string concatenation, traces variable assignments, and extracts fetch() / XMLHttpRequest / $.ajax call sites into full HTTP request specs.
- Spider extractors: Parse inline <script> tags and JS string literals for URL patterns.

Extracted endpoints become priority-0 tasks - tested before any wordlist fuzzing.

4. Dynamic Wordlist Growth

Response bodies are tokenized (content-type-aware for HTML, JSON, JS, CSS) to extract candidate words. These feed into the observed name pool and are replayed against every directory.
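
A rough sketch of content-type-aware tokenization, assuming only two cases: JSON bodies are walked structurally (keys and string values), and everything else falls back to a plain word scan. The word pattern, the length limits, and the function names are assumptions for illustration; the real tokenizer also has HTML, JS, and CSS handling that is not reproduced here.

```python
import json
import re

# Assumed word shape: starts with a letter or underscore, 3 to 32 characters.
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_-]{2,31}")


def _json_tokens(node):
    """Collect words from JSON keys and string values, recursively."""
    words = set()
    if isinstance(node, dict):
        for key, value in node.items():
            words.update(WORD_RE.findall(key))
            words |= _json_tokens(value)
    elif isinstance(node, list):
        for item in node:
            words |= _json_tokens(item)
    elif isinstance(node, str):
        words.update(WORD_RE.findall(node))
    return words


def tokenize(body, content_type):
    """Return candidate words from a response body."""
    if "json" in content_type:
        try:
            return _json_tokens(json.loads(body))
        except ValueError:
            pass  # malformed JSON: fall through to the generic scan
    return set(WORD_RE.findall(body))
```

For example, a JSON body such as `{"users": [{"profile_url": "/account/settings"}]}` yields the candidates `users`, `profile_url`, `account`, and `settings`, each of which could then be replayed as a name against known directories.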

5. Recursive Directory Expansion

When a file is found at /a/b/c/file.txt, the engine extracts /a/, /a/b/, /a/b/c/ as directories to test. Each new directory triggers its own full task set (wordlists + observed + modules).
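
The breadcrumb extraction described above amounts to listing every ancestor directory of a discovered path. A minimal sketch, with `breadcrumb_directories` as a hypothetical helper name:

```python
def breadcrumb_directories(path):
    """Return each ancestor directory of a path, shallowest first."""
    segments = [s for s in path.split("/") if s]
    if not path.endswith("/"):
        segments = segments[:-1]  # drop the file name, keep directories only
    dirs = []
    current = "/"
    for segment in segments:
        current += segment + "/"
        dirs.append(current)
    return dirs


# The example from the text:
# breadcrumb_directories("/a/b/c/file.txt") -> ["/a/", "/a/b/", "/a/b/c/"]
```

Each returned directory would then be checked against the directory tracker, so that only directories not seen before generate a new task set.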

Task Types

| Task | Priority | Description |
| --- | --- | --- |
| Spider/JS Extracted | 0 | URLs from link extraction and JS analysis |
| Observed Names | 1 | Filenames seen during scan, replayed per directory |
| Observed Files | 2 | Complete name+extension pairs |
| Observed Paths | 4 | Full path segments from URLs |
| Short Wordlist (files) | 5 | Common filenames from short wordlist × extensions |
| Short Wordlist (dirs) | 6 | Common directory names from short wordlist |
| Extension Variants | 7 | Backup/alternate extensions (.bak, .old, .zip, .tar.gz) |
| Numeric Fuzz | 7 | ±10 variations of numeric path segments |
| Long Wordlist (files) | 9 | Extended filename dictionary × extensions |
| Long Wordlist (dirs) | 11 | Extended directory dictionary |
| FUZZ | 12 | Template-based fuzzing (`FUZZ` marker replacement) |
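
To make the Numeric Fuzz row concrete, here is a small sketch of generating ±10 neighbours for the numbers in a path. It assumes that any run of digits counts as a numeric segment, that negative values are skipped, and that zero padding is preserved; none of these details are confirmed by the description, and `numeric_variations` is a hypothetical name.

```python
import re


def numeric_variations(path, spread=10):
    """Yield paths where each digit run is replaced by a nearby value."""
    variations = []
    for match in re.finditer(r"\d+", path):
        value = int(match.group())
        width = len(match.group())  # keep zero padding, e.g. 007 -> 008
        for candidate in range(max(0, value - spread), value + spread + 1):
            if candidate == value:
                continue  # the original path is already known
            replacement = str(candidate).zfill(width)
            variations.append(path[:match.start()] + replacement + path[match.end():])
    return variations
```

A discovered `/img/007.png` would produce `/img/000.png` through `/img/017.png`, minus the original, as new candidates.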

Auto-fuzz on low-yield targets. FUZZ fuzzing is normally opt-in (--fuzz-wordlist, --intensity deep, or a discovery-only run discover). On a full scan it also turns on automatically when the preceding Spidering phase comes up empty-handed: either it found very few records, or it bounced off-host to an SSO/login wall. Such sites frequently still expose hidden routes that don't redirect, so Vigolium brute-forces them anyway. Auto-fuzz always targets the original -t host (the off-host SSO/identity-provider domain is excluded from scope), and a console line announces it when it kicks in. Disable it with discovery.auto_fuzz_low_yield: false.

Deduplication

Multiple layers prevent redundant work during the crawl:

- Task-level: FNV-1a hash prevents duplicate task enqueueing
- Request-level: Cache prevents sending the same HTTP request twice
- URL-level: DiskSet tracks processed URLs
- Body-level: Hash prevents re-analyzing identical responses with JSScan
- Directory/file trackers: Prevent re-processing the same discovery
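
The task-level layer can be sketched with a hand-rolled FNV-1a hash and a set of seen digests. The 64-bit variant and the fields that make up the task key are assumptions made for this example; `TaskDeduper` is a hypothetical name.

```python
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data):
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    digest = FNV64_OFFSET
    for byte in data:
        digest ^= byte
        digest = (digest * FNV64_PRIME) & MASK64
    return digest


class TaskDeduper:
    """Remember task digests so the same task is never enqueued twice."""

    def __init__(self):
        self._seen = set()

    def should_enqueue(self, kind, directory, source):
        # Assumed task identity: task kind + target directory + payload source.
        key = "\x00".join((kind, directory, source)).encode("utf-8")
        digest = fnv1a_64(key)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Storing an 8-byte digest instead of the full key keeps the seen-set small, at the cost of a very small chance that two different tasks collide and one is skipped.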

Output deduplication

Before discoveries are saved and handed to the scanner, two more layers run over the collected results:

- Exact body dedup: Records are keyed on host + status + response-body hash. Identical bodies served across different paths collapse to one (the shortest path is kept). The hash covers only the response body, not the raw response, so volatile headers (Date, Set-Cookie) and header ordering don't defeat it.
- Near-identical cluster cap: A backstop for catch-all / SPA targets that answer 200 with the same page for every path, where each response differs by a few bytes or words, so neither the exact hash nor soft-404 detection can collapse them. Records are grouped into clusters by host + status + content-type, with body size and word count each within 0.5%; at most dedup_cluster_cap records (default 10) are kept per cluster, preferring the shallowest paths. Because the band is relative, small distinct responses (e.g. JSON API endpoints) need a near-exact match to cluster, so in practice only large near-identical pages are capped.

The cluster cap is on by default. Set discovery.dedup_cluster_cap: 0 to disable it, or raise/lower the value to keep more/fewer representatives per cluster. It trims both the report and the downstream dynamic-assessment workload, since the capped records are never emitted to the scanner.
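
The two output layers can be sketched as below. This is an illustration under stated assumptions rather than the shipped algorithm: the body hash is SHA-256, "shallowest" is taken to mean fewest slashes and then shortest path, and each cluster's 0.5% band is measured against its first (shallowest) member. `Record`, `exact_body_dedup`, and `cluster_cap` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    host: str
    path: str
    status: int
    content_type: str
    body: bytes


def exact_body_dedup(records):
    """Collapse identical bodies per host + status, keeping the shortest path."""
    best = {}
    for rec in records:
        key = (rec.host, rec.status, hashlib.sha256(rec.body).hexdigest())
        kept = best.get(key)
        if kept is None or len(rec.path) < len(kept.path):
            best[key] = rec
    return list(best.values())


def within_band(a, b, band):
    """True when two values differ by at most `band` relative to the larger one."""
    return abs(a - b) <= band * max(a, b)


def cluster_cap(records, cap=10, band=0.005):
    """Keep at most `cap` records per cluster of near-identical responses."""
    if cap <= 0:
        return list(records)  # a cap of 0 disables the layer
    ordered = sorted(records, key=lambda r: (r.path.count("/"), len(r.path)))
    clusters = []
    kept = []
    for rec in ordered:
        key = (rec.host, rec.status, rec.content_type)
        size, words = len(rec.body), len(rec.body.split())
        for cluster in clusters:
            if (cluster["key"] == key
                    and within_band(size, cluster["size"], band)
                    and within_band(words, cluster["words"], band)):
                cluster["count"] += 1
                if cluster["count"] <= cap:
                    kept.append(rec)
                break
        else:
            clusters.append({"key": key, "size": size, "words": words, "count": 1})
            kept.append(rec)
    return kept
```

The relative band is what protects small responses: for a 10,000-byte page, 0.5% allows a difference of 50 bytes, while for an 11-byte JSON body it allows less than one byte, so two small bodies of different sizes never share a cluster.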

Built-In Modules

YAML-configured modules trigger specialized tasks when matching directories are found:

| Module | Triggers On | What It Does |
| --- | --- | --- |
| `backup` | Any directory | Tests backup extensions (.bak, .old, .zip, .tar.gz) |
| `js` | Any directory | Tests .js, .mjs, .map extensions |
| `api` | `/api/`, `/v1/`, etc. | REST/GraphQL/SOAP endpoint wordlists |
| `admin` | `/admin/`, `/manage/` | Admin panel paths |
| `docs` | `/docs/`, `/api-docs/` | Swagger, OpenAPI, GraphQL playground |
| `static` | `/static/`, `/assets/` | Blocks recursion to avoid noise |

Supporting Systems

| Component | Purpose |
| --- | --- |
| WAF Detection | Identifies Cloudflare, Akamai, AWS WAF, F5, Imperva, Sucuri, ModSecurity. Tracks consecutive blocks for backoff/early exit |
| Scope Enforcement | Three modes: `any` (no check), `subdomain` (same eTLD+1), `exact` (same host). Checked on every discovery and redirect |
| Case Sensitivity Detection | Auto-detected on first file discovery by re-requesting with altered casing |
| Storage | SQLite-backed sitemap with semantic dedup (FNV-1a-64). Supports session comparison for differential scanning across runs |
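
The "consecutive blocks for backoff/early exit" behaviour in the WAF Detection row might look roughly like the sketch below. Every threshold and delay here is an invented placeholder, and `BlockTracker` is a hypothetical name; how a response is classified as blocked is left to the caller.

```python
class BlockTracker:
    """Count consecutive WAF blocks and suggest continue, backoff, or abort."""

    def __init__(self, backoff_after=3, abort_after=10, base_delay=1.0, max_delay=60.0):
        # All four values are illustrative defaults, not Deparos settings.
        self.backoff_after = backoff_after
        self.abort_after = abort_after
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.consecutive = 0

    def record(self, blocked):
        """Update the streak with one response and return the suggested action."""
        if not blocked:
            self.consecutive = 0  # any normal response resets the streak
            return "continue"
        self.consecutive += 1
        if self.consecutive >= self.abort_after:
            return "abort"
        if self.consecutive >= self.backoff_after:
            return "backoff"
        return "continue"

    def delay(self):
        """Exponential delay in seconds once the backoff threshold is reached."""
        over = self.consecutive - self.backoff_after
        if over < 0:
            return 0.0
        return min(self.max_delay, self.base_delay * (2 ** over))
```

Counting only consecutive blocks means a single 403 on a protected path does not slow the scan, while a sustained run of blocks first slows it down and then stops it.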

Integration with Vigolium

Deparos runs as an input source (DeparosDiscoverySource) in the scanning pipeline. Each discovery is converted to an httpmsg.HttpRequestResponse and fed to the executor as a work item - where it flows through active and passive vulnerability scanning modules.

DeparosDiscoverySource.Next()
  → Engine.Start() → discoveries stream out
  → Convert to httpmsg.HttpRequestResponse
  → Save to DB (optional)
  → Return as WorkItem → Executor → Scanner Modules
