Deparos is an intelligent content discovery engine that performs directory enumeration, directory fuzzing, and endpoint discovery against web applications. It goes beyond static wordlist brute-forcing by learning from every response - adapting its strategy, growing its wordlists dynamically, and filtering false positives through fingerprint-based soft-404 detection.
How It Works
Target URL
│
▼
┌──────────────────────────────────────────────────────┐
│ Initialization │
│ 1. Probe target, extract host components │
│ 2. Fetch robots.txt │
│ 3. Learn baseline fingerprints (3-sample soft-404) │
│ 4. Load prior session data (if resuming) │
│ 5. Generate initial tasks from wordlists + observed │
└──────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ Priority Queue │
│ ┌────┬────┬────┬────┬─────┬──────┬──────┬────────┐ │
│ │ P0 │ P1 │ P2 │ P4 │ P5 │ P7 │ P11 │ P12 │ │
│ │Spdr│Obs │Obs │Obs │Short│ExtVar│Long │Fuzz │ │
│ │ JS │Name│File│Dir │Word │Numric│Word │ │ │
│ └────┴────┴────┴────┴─────┴──────┴──────┴────────┘ │
└──────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ Payload Coordinator │
│ Expander pulls tasks → Expand() yields payloads │
│ N workers execute payloads concurrently │
│ │
│ For each response: │
│ Fingerprint check (soft-404?) ──→ discard │
│ WAF detection ──→ track/backoff │
│ Real discovery ──→ callbacks │
└──────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────┐
│ Discovery Callbacks │
│ OnDirectoryDiscovered(): │
│ • Learn new fingerprints for directory │
│ • Create recursive tasks (wordlists + observed) │
│ • Extract breadcrumb directories │
│ OnFileDiscovered(): │
│ • Extract extension → trigger extension tasks │
│ • Numeric segment → fuzz ±10 variations │
│ • Queue extension variant probes (.bak, .old, …) │
└──────────────────┬───────────────────────────────────┘
▼
┌──── loop back to Priority Queue ────┐
│ (new tasks from discoveries) │
└─────────────────────────────────────┘
What Makes It Adaptive
1. Fingerprint-Based Soft-404 Detection
Before scanning, the engine requests 3 random non-existent paths and extracts response attributes (status code, content-type, headers, body hash, content-length ranges). Only attributes stable across all 3 samples become the baseline signature. During scanning, responses matching this signature are discarded as false positives.
When an unknown response pattern appears, a 4-strategy wildcard validation (prefix, suffix, extension, middle) confirms whether the discovery is real or a new soft-404 variant - and learns the new pattern.
2. Observed Collection System
Four data pools grow continuously during the scan:
| Pool | Source | Priority |
|---|
| Observed Names | Spider links, JS parsing, response body tokenization | P1 |
| Observed Files | Complete filenames from discoveries | P2 |
| Observed Extensions | File extensions from discoveries | P5 |
| Observed Paths | Full path segments from URLs | P4 |
Every newly discovered directory is probed with ALL observed values as high-priority tasks. When a new extension is found for the first time, it triggers tasks across ALL known directories.
3. JavaScript Intelligence
Two layers of JS analysis feed endpoints back into the discovery queue:
- JSScan (embedded binary): Deobfuscates bundled JS, resolves string concatenation, traces variable assignments, and extracts
fetch() / XMLHttpRequest / $.ajax call sites into full HTTP request specs.
- Spider extractors: Parse inline
<script> tags and JS string literals for URL patterns.
Extracted endpoints become priority-0 tasks - tested before any wordlist fuzzing.
4. Dynamic Wordlist Growth
Response bodies are tokenized (content-type-aware for HTML, JSON, JS, CSS) to extract candidate words. These feed into the observed name pool and are replayed against every directory.
5. Recursive Directory Expansion
When a file is found at /a/b/c/file.txt, the engine extracts /a/, /a/b/, /a/b/c/ as directories to test. Each new directory triggers its own full task set (wordlists + observed + modules).
Task Types
| Task | Priority | Description |
|---|
| Spider/JS Extracted | 0 | URLs from link extraction and JS analysis |
| Observed Names | 1 | Filenames seen during scan, replayed per directory |
| Observed Files | 2 | Complete name+extension pairs |
| Observed Paths | 4 | Full path segments from URLs |
| Short Wordlist (files) | 5 | Common filenames from short wordlist x extensions |
| Short Wordlist (dirs) | 6 | Common directory names from short wordlist |
| Extension Variants | 7 | Backup/alternate extensions (.bak, .old, .zip, .tar.gz) |
| Numeric Fuzz | 7 | +/-10 variations of numeric path segments |
| Long Wordlist (files) | 9 | Extended filename dictionary x extensions |
| Long Wordlist (dirs) | 11 | Extended directory dictionary |
| FUZZ | 12 | Template-based fuzzing (FUZZ marker replacement) |
Auto-fuzz on low-yield targets. FUZZ fuzzing is normally opt-in (--fuzz-wordlist, --intensity deep, or a discovery-only run discover). On a full scan it also turns on automatically when the preceding Spidering phase comes up empty-handed — it found very few records, or it bounced off-host to an SSO/login wall. Such sites frequently still expose hidden routes that don’t redirect, so vigolium brute-forces them anyway. The auto-fuzz always targets the original -t host (the off-host SSO/identity-provider domain is excluded from scope), and a console line announces it when it kicks in. Disable with discovery.auto_fuzz_low_yield: false.
Deduplication
Multiple layers prevent redundant work during the crawl:
- Task-level: FNV-1a hash prevents duplicate task enqueueing
- Request-level: Cache prevents sending the same HTTP request twice
- URL-level: DiskSet tracks processed URLs
- Body-level: Hash prevents re-analyzing identical responses with JSScan
- Directory/file trackers: Prevent re-processing the same discovery
Output deduplication
Before discoveries are saved and handed to the scanner, two more layers run over the collected results:
- Exact body dedup: Records are keyed on
host + status + response-body hash. Identical bodies served across different paths collapse to one (the shortest path is kept). The hash covers only the response body — not the raw response — so volatile headers (Date, Set-Cookie) and header ordering don’t defeat it.
- Near-identical cluster cap: A backstop for catch-all / SPA targets that answer
200 with the same page for every path (where each response differs by a few bytes or words, so the exact hash and soft-404 detection can’t collapse them). Records are grouped into clusters by host + status + content-type, with body size and word count each within 0.5%; at most dedup_cluster_cap records (default 10) are kept per cluster, preferring the shallowest paths. Because the band is relative, small distinct responses (e.g. JSON API endpoints) need a near-exact match to cluster and are never collapsed — only large near-identical pages are capped.
The cluster cap is on by default. Set discovery.dedup_cluster_cap: 0 to disable it, or raise/lower the value to keep more/fewer representatives per cluster. It trims both the report and the downstream dynamic-assessment workload, since the capped records are never emitted to the scanner.
Built-In Modules
YAML-configured modules trigger specialized tasks when matching directories are found:
| Module | Triggers On | What It Does |
|---|
backup | Any directory | Tests backup extensions (.bak, .old, .zip, .tar.gz) |
js | Any directory | Tests .js, .mjs, .map extensions |
api | /api/, /v1/, etc. | REST/GraphQL/SOAP endpoint wordlists |
admin | /admin/, /manage/ | Admin panel paths |
docs | /docs/, /api-docs/ | Swagger, OpenAPI, GraphQL playground |
static | /static/, /assets/ | Blocks recursion to avoid noise |
Supporting Systems
| Component | Purpose |
|---|
| WAF Detection | Identifies Cloudflare, Akamai, AWS WAF, F5, Imperva, Sucuri, ModSecurity. Tracks consecutive blocks for backoff/early exit |
| Scope Enforcement | Three modes: any (no check), subdomain (same eTLD+1), exact (same host). Checked on every discovery and redirect |
| Case Sensitivity Detection | Auto-detected on first file discovery by re-requesting with altered casing |
| Storage | SQLite-backed sitemap with semantic dedup (FNV-1a-64). Supports session comparison for differential scanning across runs |
Integration with Vigolium
Deparos runs as an input source (DeparosDiscoverySource) in the scanning pipeline. Each discovery is converted to an httpmsg.HttpRequestResponse and fed to the executor as a work item - where it flows through active and passive vulnerability scanning modules.
DeparosDiscoverySource.Next()
→ Engine.Start() → discoveries stream out
→ Convert to httpmsg.HttpRequestResponse
→ Save to DB (optional)
→ Return as WorkItem → Executor → Scanner Modules