Data & Persistence Architecture


Every scan path in Vigolium (native, agentic, ingestion) converges on the same persistence layer. This document explains how data is scoped (multi-tenancy), modeled (the database schema), written (the repository pattern and async writer), and moved between machines (cloud storage). It is the architectural companion to the task-oriented Projects API and Storage API guides; reach for those for CLI/API recipes.

1. Multi-Tenancy: the `project_uuid` spine

All scan data is partitioned by project: a named container with a UUID, an optional config overlay, and optional access-control lists. There is no separate database per project; isolation comes from a project_uuid column on every major table, filtered on every read and stamped on every write.

Configuration is layered, and each level below overrides the one above it:

Built-in defaults
  → ~/.vigolium/vigolium-configs.yaml          (global config)
    → ~/.vigolium/projects/<uuid>/config.yaml  (per-project overlay)
      → --scanning-profile                     (scanning profile)
        → CLI flags                            (highest precedence)
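
The layering above can be sketched as an ordered merge in which each later layer overrides only the keys it sets. This is a minimal illustration, not Vigolium's loader: the `layer` type, `mergeLayers`, and the example keys are hypothetical, and a real YAML overlay is probably merged recursively rather than key by key at the top level.

```go
package main

import "fmt"

// layer is a hypothetical flat view of one configuration source.
type layer map[string]any

// mergeLayers applies layers in order, lowest precedence first.
// A later layer overrides only the keys it actually sets.
func mergeLayers(layers ...layer) layer {
	out := layer{}
	for _, l := range layers {
		for k, v := range l {
			out[k] = v
		}
	}
	return out
}

func main() {
	defaults := layer{"concurrency": 10, "timeout": "30s"} // built-in defaults
	global := layer{"timeout": "60s"}                      // global config file
	project := layer{"concurrency": 4}                     // per-project overlay
	profile := layer{}                                     // scanning profile sets nothing here
	flags := layer{"timeout": "5s"}                        // CLI flags, highest precedence

	cfg := mergeLayers(defaults, global, project, profile, flags)
	fmt.Println(cfg["concurrency"], cfg["timeout"]) // prints: 4 5s
}
```

The project overlay changes only `concurrency`, so `timeout` is still decided by the layers around it.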

Default project: 00000000-0000-0000-0000-000000000001, created during vigolium init. Used whenever no project is selected.
Selection precedence: --project-uuid > --project-name > VIGOLIUM_PROJECT_UUID > VIGOLIUM_PROJECT (legacy) > default. On the server, the X-Project-UUID request header plays the same role.
Project config is a partial YAML overlay (same shape as a scanning profile) at ~/.vigolium/projects/<uuid>/config.yaml; only the keys it sets are overridden.
Access control: allowed_emails / allowed_domains on the project row gate server requests that carry X-User-Email. The exact-email list wins, then the domain list applies, otherwise access is open. A missing email header skips the check, and a denied request receives a 403. VIGOLIUM_PROJECT_READONLY=true disables all mutating project CLI subcommands.
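
The access-control rule can be read as a three-way decision. The sketch below takes one plausible reading (a non-empty email list decides alone, otherwise a non-empty domain list decides, otherwise the project is open); the source does not say whether a domain match can still admit a user when an email list exists. The `project` struct, `allowed`, and the case-insensitive comparison are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// project carries only the two access-control fields named in the text.
type project struct {
	AllowedEmails  []string
	AllowedDomains []string
}

// containsFold reports whether s matches any list entry, ignoring case (an assumption).
func containsFold(list []string, s string) bool {
	for _, item := range list {
		if strings.EqualFold(strings.TrimSpace(item), s) {
			return true
		}
	}
	return false
}

// allowed decides whether a request carrying the given X-User-Email value passes.
// A false result is where the server would answer 403.
func allowed(p project, email string) bool {
	email = strings.TrimSpace(email)
	if email == "" {
		return true // missing header: the check is skipped
	}
	switch {
	case len(p.AllowedEmails) > 0: // exact-email list wins
		return containsFold(p.AllowedEmails, email)
	case len(p.AllowedDomains) > 0: // else the domain list
		at := strings.LastIndex(email, "@")
		return at >= 0 && containsFold(p.AllowedDomains, email[at+1:])
	default: // else open
		return true
	}
}

func main() {
	p := project{AllowedDomains: []string{"example.com"}}
	fmt.Println(allowed(p, "alice@example.com"))
	fmt.Println(allowed(p, "mallory@evil.test"))
	fmt.Println(allowed(p, ""))
}
```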

Tables carrying `project_uuid`

scans · http_records · findings · scopes · source_repos · oast_interactions · scan_logs

Existing databases are migrated in place: the column is added with the default-project UUID as its default, so pre-multi-tenancy data lands in the default project.

2. The database backend

Vigolium uses the repository pattern over Bun ORM, with two interchangeable backends:

Backend	Driver	Use
SQLite (default)	`sqliteshim` → modernc	Single-binary, zero-config, local scans
PostgreSQL	`pgdriver`	Shared/server deployments, concurrent writers

The schema is intentionally denormalized: there are no separate hosts or parameters tables, and JSONB columns carry structured sub-data. This keeps a single http_records row self-contained and avoids join fan-out on the hot ingestion path.

Core models (`pkg/database/models.go`)

HTTPRecord (table http_records), the unit of ingested traffic:

Identity: UUID (PK), RequestHash (SHA-256 of the raw request, used for per-source dedup)
Host: Scheme, Hostname, Port, IP
Request: Method, Path, URL, RequestHeaders (JSONB), RawRequest, RequestBody
Response: StatusCode, ResponseHeaders (JSONB), RawResponse, ResponseBody, ResponseTitle, ResponseWords
Derived: Parameters (JSONB array of EmbeddedParam), RiskScore, Remarks
Metadata: Source, SentAt, ReceivedAt, CreatedAt

Finding (table findings), a detected issue:

Identity: ID (auto-increment), FindingHash (unique constraint → dedup key, set from the ResultEvent ID)
Module: ModuleID, ModuleName, Description, Severity, Confidence
Evidence: MatchedAt (JSONB), ExtractedResults, Request, Response, AdditionalEvidence (merged-duplicate request/response pairs, capped at 10)
Relations: HTTPRecordUUIDs (JSONB), ScanUUID; the finding_records junction table is the many-to-many link to HTTP records.

Agent-produced findings (autopilot/swarm/vigolium-audit/piolium) flow into the same findings table, tagged by source so native and AI results coexist and dedup together.

Converters (`pkg/database/converters.go`)

The in-memory scan types never touch the DB directly. HTTPRecord.FromHttpRequestResponse() and Finding.FromResultEvent() are the only seam: they generate UUIDs, compute hashes, parse URLs, extract titles, and count words, which keeps persistence concerns out of the executor and modules.
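
To make the converter's job concrete, here is a hedged sketch of the kind of derivation it performs, using only the Go standard library. It is not the code in converters.go: `recordFields` and `deriveFields` are hypothetical names, hex encoding of the SHA-256 digest is an assumption, and counting words by whitespace splitting is only one possible definition of ResponseWords.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"strings"
)

// recordFields is a hypothetical subset of the HTTPRecord columns.
type recordFields struct {
	RequestHash   string
	Scheme        string
	Hostname      string
	Port          string
	Path          string
	ResponseWords int
}

// deriveFields computes the derived columns from raw request/response data.
func deriveFields(rawRequest []byte, rawURL, responseBody string) (recordFields, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return recordFields{}, err
	}
	sum := sha256.Sum256(rawRequest) // SHA-256 of the raw request
	return recordFields{
		RequestHash:   hex.EncodeToString(sum[:]), // hex form is an assumption
		Scheme:        u.Scheme,
		Hostname:      u.Hostname(),
		Port:          u.Port(),
		Path:          u.Path,
		ResponseWords: len(strings.Fields(responseBody)), // whitespace-separated words
	}, nil
}

func main() {
	raw := []byte("GET /login?next=/ HTTP/1.1\r\nHost: example.com\r\n\r\n")
	f, err := deriveFields(raw, "https://example.com:8443/login?next=/", "Sign in to continue")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(f.Hostname, f.Port, f.Path, f.ResponseWords, f.RequestHash[:12])
}
```

Because the hash depends only on the raw request bytes, re-ingesting an identical request yields the same RequestHash, which is what per-source dedup relies on.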

3. Write paths

Method (`pkg/database/repository.go`)	Role
`SaveRecord()` / `SaveRecordsBatch()`	Single vs. bulk INSERT (batch = one transaction)
`SaveFinding()`	`INSERT … ON CONFLICT (finding_hash) DO NOTHING` + evidence append + junction rows
`DeduplicateFindings()`	Post-phase grouping: merge findings sharing `(module_id, severity, matched_at URL)`
`CreateScanWithCursor()` / `CountRecordsAfterCursor()`	Cursor bookkeeping for incremental rescans
`GetRecordsWithResponseBody()`	UUID-cursor pagination for batch scanners (e.g. Kingfisher)
`UpdateRiskScores()`	Batched `CASE/WHEN` UPDATE, 500 UUIDs per statement

Async batched ingestion, `RecordWriter`

High-throughput ingestion (proxy capture, bulk import, spidering) does not call the repository synchronously. pkg/database/record_writer.go fronts it with a buffered channel:

Write() ──► buffered chan (cap 4096) ──► single flushLoop goroutine
                                            │  batch of 128, or 50ms tick
                                            ▼
                                   repo.SaveRecordsBatch()  (one txn)

Each caller blocks until its row is flushed and gets a WriteResult{UUID, Err} back on a per-request result channel. Backpressure comes from the channel capacity, ordering is preserved, and the DB sees large batched transactions instead of one write per request.
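
The flow described above can be sketched with a buffered channel and a single flushing goroutine. This is a simplified model, not the code in record_writer.go: records are plain strings that double as their own UUID, the `save` callback stands in for repo.SaveRecordsBatch(), and every lower-case name here is hypothetical. Only the capacity (4096), batch size (128), and tick (50ms) come from the text.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type writeResult struct {
	UUID string
	Err  error
}

type writeReq struct {
	record string
	done   chan writeResult // per-request result channel
}

type recordWriter struct {
	reqs chan writeReq
	save func(batch []string) error // stand-in for the batch repository call
	wg   sync.WaitGroup
}

func newRecordWriter(save func([]string) error) *recordWriter {
	w := &recordWriter{reqs: make(chan writeReq, 4096), save: save}
	w.wg.Add(1)
	go w.flushLoop()
	return w
}

// Write blocks until the record's batch has been flushed.
func (w *recordWriter) Write(record string) writeResult {
	done := make(chan writeResult, 1)
	w.reqs <- writeReq{record: record, done: done} // blocks when the buffer is full
	return <-done
}

// Close flushes whatever is pending and stops the flush goroutine.
func (w *recordWriter) Close() {
	close(w.reqs)
	w.wg.Wait()
}

func (w *recordWriter) flushLoop() {
	defer w.wg.Done()
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	var pending []writeReq
	flush := func() {
		if len(pending) == 0 {
			return
		}
		batch := make([]string, len(pending))
		for i, r := range pending {
			batch[i] = r.record
		}
		err := w.save(batch) // one call (one transaction) per batch
		for _, r := range pending {
			r.done <- writeResult{UUID: r.record, Err: err}
		}
		pending = pending[:0]
	}
	for {
		select {
		case r, ok := <-w.reqs:
			if !ok {
				flush()
				return
			}
			pending = append(pending, r)
			if len(pending) >= 128 {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

func main() {
	batches := 0 // touched only by the flush goroutine, read after Close
	w := newRecordWriter(func(batch []string) error { batches++; return nil })
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if res := w.Write(fmt.Sprintf("rec-%d", i)); res.Err != nil {
				fmt.Println("write failed:", res.Err)
			}
		}(i)
	}
	wg.Wait()
	w.Close()
	fmt.Println("records: 1000, batches:", batches)
}
```

A single consumer is what makes ordering trivial: records leave the channel in the order they entered it, and every batch is saved before the next one starts.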

SQLite DSN note: the modernc driver needs pragmas in _pragma=name(value) form; the mattn-style _busy_timeout= is silently ignored. Relevant when tuning concurrent-writer behavior.
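
As an illustration of the pragma form, the helper below assembles a DSN string with `_pragma=name(value)` parameters. `sqliteDSN`, the `pragma` type, the file name, and the chosen pragma values are hypothetical; how Vigolium actually builds its DSN is not shown in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// pragma is one setting to pass to the driver as _pragma=name(value).
type pragma struct{ Name, Value string }

// sqliteDSN builds a file: DSN carrying the given pragmas.
func sqliteDSN(path string, pragmas ...pragma) string {
	if len(pragmas) == 0 {
		return "file:" + path
	}
	parts := make([]string, 0, len(pragmas))
	for _, p := range pragmas {
		parts = append(parts, fmt.Sprintf("_pragma=%s(%s)", p.Name, p.Value))
	}
	return "file:" + path + "?" + strings.Join(parts, "&")
}

func main() {
	dsn := sqliteDSN("vigolium.db",
		pragma{"busy_timeout", "5000"},
		pragma{"journal_mode", "WAL"},
	)
	fmt.Println(dsn)
	// prints: file:vigolium.db?_pragma=busy_timeout(5000)&_pragma=journal_mode(WAL)
}
```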

4. Deduplication

Two layers, by design:

Per-source HTTP-record dedup: RequestHash (SHA-256 of the raw request) plus DeduplicateRecordsBySource collapses re-ingested identical requests within a source.
Finding dedup: the finding_hash unique constraint prevents exact duplicates at insert time; DeduplicateFindings() runs after a phase to group near-duplicates (same module/severity/URL), folding the extra request/response pairs into the survivor’s AdditionalEvidence (capped). The multi-driver audit command runs an additional project-wide findings dedup pass once its drivers exit.
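
The post-phase grouping can be sketched in memory as follows. This is an illustration of the idea rather than the repository's DeduplicateFindings(): the struct fields are a hypothetical subset, keeping the first finding in each group as the survivor is an assumption, and only the grouping key and the cap of 10 come from the text.

```go
package main

import "fmt"

const maxAdditionalEvidence = 10 // cap stated in the text

type evidence struct{ Request, Response string }

// finding is a hypothetical subset of the Finding model.
type finding struct {
	ModuleID           string
	Severity           string
	MatchedURL         string
	Request, Response  string
	AdditionalEvidence []evidence
}

type groupKey struct{ ModuleID, Severity, MatchedURL string }

// dedupFindings keeps the first finding per (module, severity, URL) group and
// folds later duplicates into its AdditionalEvidence, up to the cap.
func dedupFindings(in []finding) []finding {
	index := map[groupKey]int{} // group key -> position of the survivor in out
	var out []finding
	for _, f := range in {
		k := groupKey{f.ModuleID, f.Severity, f.MatchedURL}
		i, seen := index[k]
		if !seen {
			index[k] = len(out)
			out = append(out, f)
			continue
		}
		if len(out[i].AdditionalEvidence) < maxAdditionalEvidence {
			out[i].AdditionalEvidence = append(out[i].AdditionalEvidence,
				evidence{f.Request, f.Response})
		}
	}
	return out
}

func main() {
	in := []finding{
		{ModuleID: "xss", Severity: "high", MatchedURL: "https://example.com/q", Request: "r1"},
		{ModuleID: "xss", Severity: "high", MatchedURL: "https://example.com/q", Request: "r2"},
		{ModuleID: "sqli", Severity: "high", MatchedURL: "https://example.com/q", Request: "r3"},
	}
	for _, f := range dedupFindings(in) {
		fmt.Println(f.ModuleID, len(f.AdditionalEvidence))
	}
}
```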

5. Cloud storage (optional)

Storage is disabled by default. When enabled (storage.enabled: true), a single minio-go S3 client talks to GCS (HMAC), S3, or self-hosted MinIO; the driver differs, the rest is identical.

gs://<project-uuid>/<key>   ⇒   s3://<storage.bucket>/<project-uuid>/<key>

The key architectural point: the project UUID is the in-bucket prefix, not the bucket name. Every key is validated (storage.ValidateKey rejects .., backslashes, absolute paths) and project-prefixed server-side, so one bucket safely holds many projects and clients cannot reach outside their own.
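
A rough sketch of that validation and prefixing step follows. It mirrors the three rejections named above but is not storage.ValidateKey itself: `validateKey` and `objectKey` are hypothetical, rejecting an empty key is an added assumption, and treating `..` as a whole path segment (rather than any substring) is one possible reading.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errInvalidKey = errors.New("invalid storage key")

// validateKey rejects empty keys, absolute paths, backslashes, and ".." segments.
func validateKey(key string) error {
	if key == "" || strings.HasPrefix(key, "/") || strings.Contains(key, `\`) {
		return errInvalidKey
	}
	for _, seg := range strings.Split(key, "/") {
		if seg == ".." {
			return errInvalidKey
		}
	}
	return nil
}

// objectKey maps a client-supplied key to its in-bucket key: <project-uuid>/<key>.
func objectKey(projectUUID, key string) (string, error) {
	if err := validateKey(key); err != nil {
		return "", err
	}
	return projectUUID + "/" + key, nil
}

func main() {
	const project = "00000000-0000-0000-0000-000000000001"
	for _, key := range []string{"ugc/report.json", "../other-project/secret", "/etc/passwd"} {
		full, err := objectKey(project, key)
		fmt.Printf("%q -> %q, err=%v\n", key, full, err)
	}
}
```

Because the prefix is applied on the server after validation, a client never supplies the project portion of the object key.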

Conventional prefix	Producer
`ugc/<file>`	`vigolium storage upload` (default)
`imports/<base>-<ts>.<ext>`	`vigolium import --upload`
`native-scans/<scan-uuid>/results.tar.gz`	`vigolium scan --upload-results`
`agentic-scans/<run-uuid>/results.tar.gz`	`vigolium agent … --upload-results`

gs:// URLs are first-class inputs/outputs: vigolium import gs://… downloads-then-imports (detecting archon folders or JSONL inside .tar.gz/.zip), and any export -o gs://… writes locally then uploads on success. The {ts} and {project-uuid} placeholders expand in any -o path. The bundle export format round-trips a full snapshot (JSONL + HTML report + manifest + agent session dirs) that another machine can re-import.

Related

Projects API: project CLI/API recipes and access-control management.
Storage API: full vigolium storage command and gs:// reference.
Native Scan: Stage 11 traces a finding from ResultEvent to row.
Configuration: the storage: and project config YAML blocks.
