Table of Contents
- Overview
- Architecture
- Directory Structure
- Test Layers
- Fixture System
- Running Benchmarks
- YAML Definition Format
- Generating Fixtures
- Adding New Benchmarks
- Harness Reference
- CI Integration
- Troubleshooting
Overview
The whitebox agent benchmarks validate the four layers of the agent-driven scanning pipeline:

| Layer | What It Tests | Key Function | Assertion |
|---|---|---|---|
| 1. Parsing | ParseFindings / ParseHTTPRecords extract structured data from raw agent output | agent.ParseFindings(), agent.ParseHTTPRecords() | Strict |
| 2. Quality | Agent-reported findings contain expected CWEs, vuln types, and severity distribution | agent.ParseFindings() + field inspection | Soft |
| 3. Handoff | Agent HTTP records convert to HttpRequestResponse via ToHTTPRequestResponse | agent.ToHTTPRequestResponse() | Strict |
| 4. E2E | Converted HTTP records produce real findings when scanned against Docker apps | Modules’ ScanPerInsertionPoint / ScanPerRequest | Soft |
Why Fixtures?
AI agent calls are expensive and non-deterministic. Rather than calling the LLM on every test run, the system:

- Runs the agent once against source code stubs (the same stubs used by the SAST benchmark)
- Caches the raw output as JSON fixture files
- Runs all subsequent benchmark tests against those cached fixtures
Strict vs Soft Assertions
- Strict (Layers 1 & 3): These layers test our code — the JSON parser and the HRR converter. If they fail, we have a bug. The test should fail.
- Soft (Layers 2 & 4): These layers validate agent data quality and end-to-end detection. Agent output quality varies across models and regenerations. Soft assertions log warnings but do not fail the test.
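The strict/soft split can be sketched with a small helper. The names below (`check`, `assertMode`, `reporter`) are hypothetical and only illustrate the pattern; the actual harness helpers may look different:

```go
package main

import "fmt"

// assertMode controls whether a failed check fails the test or only logs.
type assertMode string

const (
	strict assertMode = "strict"
	soft   assertMode = "soft"
)

// reporter is the minimal subset of *testing.T this sketch needs.
type reporter interface {
	Errorf(format string, args ...any)
	Logf(format string, args ...any)
}

// check fails the test under strict mode, but only logs under soft mode.
func check(t reporter, mode assertMode, ok bool, msg string) {
	if ok {
		return
	}
	if mode == strict {
		t.Errorf("FAIL: %s", msg)
	} else {
		t.Logf("WARN (soft): %s", msg)
	}
}

// consoleReporter stands in for *testing.T so the sketch is runnable.
type consoleReporter struct{ failed bool }

func (c *consoleReporter) Errorf(f string, a ...any) { c.failed = true; fmt.Printf(f+"\n", a...) }
func (c *consoleReporter) Logf(f string, a ...any)   { fmt.Printf(f+"\n", a...) }

func main() {
	r := &consoleReporter{}
	check(r, soft, false, "expected CWE-79 in findings") // logs a warning only
	fmt.Println("failed after soft check:", r.failed)    // false
	check(r, strict, false, "parser returned zero findings") // marks the test failed
	fmt.Println("failed after strict check:", r.failed)      // true
}
```

The same failed condition is a test failure in Layers 1 and 3 but only a logged warning in Layers 2 and 4.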
Relationship to Other Benchmarks
| Benchmark | Build Tag | Tests | Requirements |
|---|---|---|---|
| SAST (Layers 1-3) | sast | Route extraction, SARIF parsing, handoff | ast-grep binary |
| SAST E2E (Layer 4) | sast_e2e | Full source-to-scan pipeline | ast-grep binary |
| Agent (Layers 1-3) | agent_benchmark | Parsing, quality, HRR conversion | None |
| Agent E2E (Layer 4) | agent_benchmark + canary | Scan converted records against Docker apps | Docker |
| Agent Generate | agent_generate | Populate fixture files from real LLM | Configured agent |
Architecture
Key Design Decisions
- Fixture-first: Agent output is cached as JSON. Tests never call the LLM. This makes all layers fast and deterministic.
- One fixture per (stub x template) pair: Each fixture captures the raw output from running a specific prompt template against a specific source code stub.
- YAML-driven definitions: Test expectations are declared in YAML files. Adding a new test case is a YAML addition, not a code change.
- Reuses SAST source stubs: The same framework stubs in `test/testdata/sast-stubs/` serve as agent input, avoiding duplication.
- Shared harness: Type definitions and loaders live in `test/benchmark/harness/`, following the same pattern as the SAST benchmark (`sast_types.go`/`sast_loader.go`).
Directory Structure
Test Layers
Layer 1: Parsing
Tests the JSON extraction and parsing logic that converts raw agent text output (which may contain markdown fences, preamble text, or bare arrays) into structured `AgentFinding` or `AgentHTTPRecord` slices.
What it validates:
- `agent.ParseFindings()` correctly extracts findings from raw output
- `agent.ParseHTTPRecords()` correctly extracts HTTP records from raw output
- Finding/record counts match the expected values from YAML definitions
- Required fields (title, severity, CWE, method, URL) are non-empty
- Markdown fence stripping works (`` ```json ... ``` ``)
- Bare JSON arrays parse correctly (`[{...}]` without a wrapper object)
- Empty and malformed input return appropriate errors
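The kind of normalization this layer exercises can be illustrated with a self-contained sketch. This is not the actual `agent.ParseFindings` implementation, only a minimal demonstration of fence stripping, preamble skipping, and bare-array parsing:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// extractJSONArray strips an optional ```json fence and any preamble text,
// then parses the remaining bare JSON array. It mirrors the input shapes
// the Layer 1 tests exercise; the real parser is more thorough.
func extractJSONArray(raw string) ([]map[string]any, error) {
	s := strings.TrimSpace(raw)
	if s == "" {
		return nil, errors.New("empty agent output")
	}
	// Strip a markdown fence if present: ```json ... ```
	if i := strings.Index(s, "```"); i >= 0 {
		s = s[i+3:]
		s = strings.TrimPrefix(s, "json")
		if j := strings.Index(s, "```"); j >= 0 {
			s = s[:j]
		}
	}
	// Skip any preamble before the first '[' of the array.
	if k := strings.Index(s, "["); k >= 0 {
		s = s[k:]
	}
	var out []map[string]any
	if err := json.Unmarshal([]byte(strings.TrimSpace(s)), &out); err != nil {
		return nil, fmt.Errorf("no parsable JSON array: %w", err)
	}
	return out, nil
}

func main() {
	fenced := "Here are the findings:\n```json\n[{\"title\": \"XSS\"}]\n```"
	items, err := extractJSONArray(fenced)
	fmt.Println(len(items), err) // 1 <nil>

	if _, err := extractJSONArray("   "); err != nil {
		fmt.Println("empty input rejected")
	}
}
```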
| Fixture | Schema | Count |
|---|---|---|
| gin-security-code-review.json | findings | 3 |
| gin-endpoint-discovery.json | http_records | 8 |
| flask-security-code-review.json | findings | 4 |
| flask-endpoint-discovery.json | http_records | 7 |
| Function | Description |
|---|---|
| TestParsing_All | Runs all parsing definitions from YAML |
| TestParsing_GinFindings | Gin findings parsing |
| TestParsing_GinRecords | Gin HTTP records parsing |
| TestParsing_FlaskFindings | Flask findings parsing |
| TestParsing_FlaskRecords | Flask HTTP records parsing |
| TestParsing_EmptyOutput | Empty/whitespace input → error |
| TestParsing_MalformedJSON | Invalid JSON → error |
| TestParsing_MarkdownFences | JSON inside ``` fences parses correctly |
| TestParsing_BareArray | Bare [{...}] array parses correctly |
Layer 2: Quality
Validates that agent-reported findings contain meaningful security information. These are soft assertions — they measure agent quality, not code correctness.

What it validates:
- Finding count falls within expected range (`min_findings`/`max_findings`)
- Expected CWE identifiers are present in the findings (searched by `AgentFinding.CWE`)
- Expected vulnerability types appear in finding titles or tags
- Severity distribution meets minimum thresholds (e.g., at least 1 high, at least 2 medium)
| Fixture | Stub | Expected CWEs | Key Vuln Types |
|---|---|---|---|
| gin-security-code-review.json | gin | CWE-20 | Input validation |
| flask-security-code-review.json | flask | CWE-79 | XSS, reflection |
| express-security-code-review.json | express | CWE-79, CWE-798 | XSS, hardcoded credentials |
| django-security-code-review.json | django | CWE-79 | XSS, reflection |
| fastapi-security-code-review.json | fastapi | CWE-306 | Missing authentication |
| Function | Description |
|---|---|
| TestQuality_All | Runs all quality definitions from YAML |
| TestQuality_GinSecurityReview | Gin quality validation |
| TestQuality_FlaskSecurityReview | Flask quality validation |
| TestQuality_ExpressSecurityReview | Express quality validation |
| TestQuality_DjangoSecurityReview | Django quality validation |
| TestQuality_FastAPISecurityReview | FastAPI quality validation |
Layer 3: Handoff
Tests the conversion of agent-reported HTTP records into Vigolium’s internal `HttpRequestResponse` format via `agent.ToHTTPRequestResponse()`. This is the bridge between agent output and the DAST scanning engine.
What it validates:
- `agent.ToHTTPRequestResponse()` successfully converts records with method, URL, headers, and body
- Convertible and skipped record counts match expectations
- Specific records can be found by method and URL prefix
- Converted requests have the correct HTTP method
- Host headers are preserved after conversion
- Empty URLs produce an error (not a nil pointer)
- Empty methods default to GET
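The defaulting rules asserted above can be sketched in isolation. The `record` type and `toRequest` function here are stand-ins, not the real `agent.ToHTTPRequestResponse`, which builds a full `HttpRequestResponse`:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// record is a simplified stand-in for the agent's HTTP record type.
type record struct {
	Method string
	URL    string
}

// toRequest sketches the defaulting rules the handoff tests assert:
// an empty URL is an error, an empty method falls back to GET.
func toRequest(r record) (*http.Request, error) {
	if strings.TrimSpace(r.URL) == "" {
		return nil, errors.New("record has empty URL")
	}
	method := r.Method
	if method == "" {
		method = http.MethodGet // empty method defaults to GET
	}
	return http.NewRequest(method, r.URL, nil)
}

func main() {
	if _, err := toRequest(record{URL: ""}); err != nil {
		fmt.Println("empty URL rejected:", err)
	}
	req, _ := toRequest(record{URL: "http://localhost:8080/users"})
	fmt.Println(req.Method, req.URL.Path) // GET /users
}
```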
| Fixture | Records | Convertible | Validated Records |
|---|---|---|---|
| gin-endpoint-discovery.json | 8 | 8 | GET /users, POST /users, GET /health, GET /api/v2/items |
| flask-endpoint-discovery.json | 7 | 7 | GET /users, POST /users, GET /health |
| express-endpoint-discovery.json | 8 | 8 | GET /api/v1/users, POST /api/v1/users, POST /login |
| fastapi-endpoint-discovery.json | 10 | 10 | GET /users, POST /users, GET /items, OPTIONS /config |
| Function | Description |
|---|---|
| TestHandoff_All | Runs all handoff definitions from YAML |
| TestHandoff_GinEndpoints | Gin endpoint conversion |
| TestHandoff_FlaskEndpoints | Flask endpoint conversion |
| TestHandoff_ExpressEndpoints | Express endpoint conversion |
| TestHandoff_FastAPIEndpoints | FastAPI endpoint conversion |
| TestHandoff_EmptyURL | Empty URL → error |
| TestHandoff_DefaultMethod | Empty method → GET |
Layer 4: E2E
Tests the full pipeline: load cached HTTP records from a fixture, rewrite their URLs to point at a running Docker vulnerable app, convert them to HRR, create insertion points, and run active scanner modules. This validates that agent-generated requests actually produce findings against real applications.

What it validates:
- Cached HTTP records load and parse correctly
- URL rewriting updates host, scheme, and Host header
- Records convert to HRR and produce insertion points
- Active scanner modules find vulnerabilities when given agent-generated requests
- Minimum finding count meets the (soft) threshold
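The URL-rewriting step can be sketched with the standard library. Function and parameter names here are illustrative, not the harness's actual helpers:

```go
package main

import (
	"fmt"
	"net/url"
)

// rewriteTarget repoints a recorded URL at a running Docker app while
// keeping the path and query intact, the way the E2E layer rewrites
// fixture records before scanning. It returns the rewritten URL and
// the value the Host header should be updated to.
func rewriteTarget(raw, scheme, host string) (string, string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	u.Scheme = scheme
	u.Host = host
	// The Host header must match the new target as well.
	return u.String(), host, nil
}

func main() {
	rewritten, hostHeader, err := rewriteTarget(
		"https://api.example.com/users?active=1", "http", "localhost:5000")
	fmt.Println(rewritten, hostHeader, err)
	// http://localhost:5000/users?active=1 localhost:5000 <nil>
}
```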
| Function | Description |
|---|---|
| TestE2E_All | Runs all E2E definitions from YAML |
| TestE2E_VAmPI | VAmPI agent scan (SQLi + XSS modules) |
Fixture System
Fixture Format
Each fixture is a JSON file with three top-level fields:

| Field | Description |
|---|---|
| metadata.stub | Source stub directory name (e.g., gin, flask) |
| metadata.template | Prompt template ID used to generate output |
| metadata.agent_name | Agent backend name from config |
| metadata.output_schema | Expected output format: findings or http_records |
| metadata.generated_at | UTC timestamp when the fixture was generated |
| metadata.agent_model | Model identifier (optional, for provenance) |
| raw_output | The complete raw text output from the agent, including any markdown fences or preamble |
| parsed.finding_count | Pre-computed finding count (for quick reference) |
| parsed.record_count | Pre-computed record count (for quick reference) |
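An illustrative fixture might look like the following. All values are invented for illustration; only the three top-level fields and the metadata keys come from the table above:

```json
{
  "metadata": {
    "stub": "gin",
    "template": "security-code-review",
    "agent_name": "claude-cli",
    "output_schema": "findings",
    "generated_at": "2024-01-01T00:00:00Z",
    "agent_model": "example-model"
  },
  "raw_output": "```json\n[{\"title\": \"Missing input validation\", \"severity\": \"high\", \"cwe\": \"CWE-20\"}]\n```",
  "parsed": {
    "finding_count": 1
  }
}
```

Note that `raw_output` preserves the markdown fence exactly as the agent emitted it; stripping it is the parser's job, exercised in Layer 1.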
Fixture Matrix
The current fixture set covers 5 frameworks and 2-3 templates each:

| Stub | Template | Schema | Fixture File |
|---|---|---|---|
| gin | security-code-review | findings | gin-security-code-review.json |
| gin | endpoint-discovery | http_records | gin-endpoint-discovery.json |
| gin | api-input-gen | http_records | gin-api-input-gen.json |
| flask | security-code-review | findings | flask-security-code-review.json |
| flask | endpoint-discovery | http_records | flask-endpoint-discovery.json |
| flask | api-input-gen | http_records | flask-api-input-gen.json |
| express | security-code-review | findings | express-security-code-review.json |
| express | endpoint-discovery | http_records | express-endpoint-discovery.json |
| django | security-code-review | findings | django-security-code-review.json |
| django | endpoint-discovery | http_records | django-endpoint-discovery.json |
| fastapi | security-code-review | findings | fastapi-security-code-review.json |
| fastapi | endpoint-discovery | http_records | fastapi-endpoint-discovery.json |
Staleness
Fixtures include a `generated_at` timestamp. The generator (`generate_test.go`) skips fixtures that are less than 30 days old by default. To force regeneration, delete the fixture file and re-run the generator.
Running Benchmarks
Make Targets
| Command | Layers | Timeout | Requirements |
|---|---|---|---|
| make test-agent-benchmark | 1 + 2 + 3 | 10 min | None |
| make test-agent-parsing | 1 only | 5 min | None |
| make test-agent-quality | 2 only | 5 min | None |
| make test-agent-handoff | 3 only | 5 min | None |
| make test-agent-benchmark-e2e | 1 + 2 + 3 + 4 | 20 min | Docker |
| make benchmark-agent-generate | (generator) | 30 min | Configured agent (real LLM) |
Running Individual Tests
Example Output
YAML Definition Format
Parsing Definition (Layer 1)
Each file in `definitions/whitebox/agent/parsing/` declares expected parsing results for one fixture:
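A plausible sketch of such a file; the exact key names are assumptions inferred from the harness types (`AgentParsingDefinition`, `AgentParsingExpected`, `AgentRequiredField`), not verified against the loader:

```yaml
# Hypothetical key names — check harness/agent types before copying.
name: gin-findings
fixture: gin-security-code-review.json
output_schema: findings
expected:
  finding_count: 3
  required_fields:
    - name: title
      non_empty: true
    - name: severity
      non_empty: true
    - name: cwe
      non_empty: true
```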
Quality Definition (Layer 2)
Each file in `definitions/whitebox/agent/quality/` declares expected quality metrics:
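A plausible sketch, with key names inferred from `AgentQualityDefinition` and `AgentQualityExpected` rather than taken from a real definition file:

```yaml
# Hypothetical key names — check harness/agent types before copying.
name: flask-security-review
fixture: flask-security-code-review.json
stub: flask
template: security-code-review
assertion: soft        # soft is the loader default for quality tests
expected:
  min_findings: 2
  max_findings: 10
  cwes: [CWE-79]
  vuln_types: [xss, reflection]
  severity_distribution:
    high: 1
    medium: 2
```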
Handoff Definition (Layer 3)
Each file in `definitions/whitebox/agent/handoff/` declares expected conversion results:
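A plausible sketch, with key names inferred from `AgentHandoffDefinition`, `AgentHandoffExpected`, and `AgentExpectedRecord`:

```yaml
# Hypothetical key names — check harness/agent types before copying.
name: gin-endpoints
fixture: gin-endpoint-discovery.json
expected:
  convertible_count: 8
  skipped_count: 0
  records:
    - method: GET
      url_prefix: /users
      has_host: true
      assertion: strict   # strict is the loader default for records
    - method: POST
      url_prefix: /users
      has_host: true
```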
E2E Definition (Layer 4)
Each file in `definitions/whitebox/agent/e2e/` declares an end-to-end scan test:
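A plausible sketch, with key names inferred from `AgentE2EDefinition` and its nested `AgentE2EApp`, `AgentE2EScanConfig`, and `AgentE2EExpected` types; the compose path and module IDs are invented placeholders:

```yaml
# Hypothetical key names and values — check harness/agent types before copying.
name: vampi-agent-scan
fixture: fastapi-endpoint-discovery.json
app:
  name: vampi
  compose_file: path/to/docker-compose.yml   # placeholder
  base_url: http://localhost:5000
  wait_path: /
scan:
  module_ids: [sqli-module-id, xss-module-id]  # placeholders
  max_records: 10
expected:
  min_findings: 1
  assertion: soft
```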
Generating Fixtures
Fixture generation runs the real agent against source stubs and writes the JSON fixture files. This is expensive (real LLM API calls) and is designed to be run infrequently.

Prerequisites
- A configured agent in `~/.vigolium/vigolium-configs.yaml`:
- The agent backend must be installed and authenticated (e.g., Claude CLI, OpenCode, etc.).
Running the Generator
What the Generator Does
For each (stub x template) pair in the matrix:

- Checks if the fixture file exists and is less than 30 days old. If so, skips.
- Loads the source stub from `test/testdata/sast-stubs/<stub>/`.
- Runs `engine.Run()` with the specified prompt template and source code.
- Captures the raw output.
- Pre-parses the output to populate `parsed.finding_count`/`parsed.record_count`.
- Writes the fixture JSON to `test/testdata/agent-fixtures/<stub>-<template>.json`.
After Generation
After generating fixtures, you should:

- Review the raw output in each fixture file for correctness.
- Update the YAML definition counts to match the actual fixture data.
- Run `make test-agent-benchmark` to verify everything passes.
- Commit the fixture files — they are checked into the repository.
Adding New Benchmarks
Adding a New Fixture (New Stub x Template Pair)
- Add the pair to the generator matrix in `test/benchmark/agent/generate_test.go`:
- Create the source stub (if it doesn’t exist) at `test/testdata/sast-stubs/spring/`.
- Run the generator to create the fixture:
- Create YAML definitions for each layer you want to test (see below).
Adding a New Parsing Test (Layer 1)
Create a YAML file at `test/benchmark/definitions/whitebox/agent/parsing/<name>.yaml`:
`TestParsing_All` automatically picks up new YAML files.
Adding a New Quality Test (Layer 2)
Create a YAML file at `test/benchmark/definitions/whitebox/agent/quality/<name>.yaml`:
Adding a New Handoff Test (Layer 3)
Create a YAML file at `test/benchmark/definitions/whitebox/agent/handoff/<name>.yaml`:
Adding a New E2E Test (Layer 4)
Create a YAML file at `test/benchmark/definitions/whitebox/agent/e2e/<name>.yaml`:
Harness Reference
Agent Types
| Type | Description |
|---|---|
| AgentFixture | Cached agent output: metadata, raw output, pre-parsed counts |
| AgentFixtureMetadata | Provenance: stub, template, agent name, schema, timestamp, model |
| AgentFixtureParsed | Pre-computed finding/record counts |
| AgentParsingDefinition | Layer 1: fixture path, output schema, expected counts and required fields |
| AgentParsingExpected | Finding count, record count, error flag, required fields |
| AgentRequiredField | Field name + non-empty constraint |
| AgentQualityDefinition | Layer 2: fixture, stub, template, assertion mode, expected quality metrics |
| AgentQualityExpected | Min/max findings, expected CWEs, vuln types, severity distribution |
| AgentHandoffDefinition | Layer 3: fixture, expected convertible/skipped counts, specific records |
| AgentHandoffExpected | Convertible count, skipped count, expected records |
| AgentExpectedRecord | Method, URL prefix, has-host flag, assertion mode |
| AgentE2EDefinition | Layer 4: fixture, target app config, scan config, expected findings |
| AgentE2EApp | App name, compose file, base URL, wait path |
| AgentE2EScanConfig | Module IDs, max records |
| AgentE2EExpected | Min findings, assertion mode |
Loader Functions
| Function | Description |
|---|---|
| LoadAgentFixture(path) | Load a JSON fixture file |
| LoadAgentParsingDefinition(path) | Load one parsing YAML |
| LoadAgentParsingDefinitionsFromDir(dir) | Load all parsing YAMLs from directory |
| LoadAgentQualityDefinition(path) | Load one quality YAML (defaults assertion to “soft”) |
| LoadAgentQualityDefinitionsFromDir(dir) | Load all quality YAMLs from directory |
| LoadAgentHandoffDefinition(path) | Load one handoff YAML (defaults record assertion to “strict”) |
| LoadAgentHandoffDefinitionsFromDir(dir) | Load all handoff YAMLs from directory |
| LoadAgentE2EDefinition(path) | Load one E2E YAML (defaults assertion to “soft”) |
| LoadAgentE2EDefinitionsFromDir(dir) | Load all E2E YAMLs from directory |
| AgentDefinitionsDir() | Returns path to definitions/whitebox/agent/ |
| AgentFixturesDir() | Returns path to test/testdata/agent-fixtures/ |
Test Helpers
| Function | Description |
|---|---|
| fixturePath(name) | Resolves to test/testdata/agent-fixtures/{name} |
| definitionsDir() | Resolves to test/benchmark/definitions/whitebox/agent |
| stubPath(framework) | Resolves to test/testdata/sast-stubs/{framework} |
| findFindingByCWE(findings, cwe) | Search findings by CWE identifier |
| findFindingByVulnType(findings, vulnType) | Search findings by title or tag substring |
| findRecordByMethod(records, method, urlPrefix) | Search records by method and URL prefix |
| buildSeverityDistribution(findings) | Count findings per severity level |
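The severity-distribution helper amounts to a per-level tally. A self-contained sketch of the idea, using a stand-in `finding` type rather than the harness's actual `AgentFinding` (case-insensitive counting is an assumption here):

```go
package main

import (
	"fmt"
	"strings"
)

// finding is a minimal stand-in for the harness finding type.
type finding struct{ Severity string }

// severityDistribution counts findings per (lower-cased) severity level,
// ready to be compared against a definition's severity thresholds.
func severityDistribution(findings []finding) map[string]int {
	dist := make(map[string]int)
	for _, f := range findings {
		dist[strings.ToLower(f.Severity)]++
	}
	return dist
}

func main() {
	dist := severityDistribution([]finding{
		{"High"}, {"medium"}, {"MEDIUM"}, {"low"},
	})
	fmt.Println(dist["high"], dist["medium"], dist["low"]) // 1 2 1
}
```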
Exported Agent Functions Under Test
| Function | Package | Used In |
|---|---|---|
ParseFindings(raw) | pkg/agent | Layers 1, 2 |
ParseHTTPRecords(raw) | pkg/agent | Layers 1, 3, 4 |
ToHTTPRequestResponse(rec) | pkg/agent | Layers 3, 4 |
ToDBFinding(af, moduleID, scanUUID) | pkg/agent | (Available for future use) |
CI Integration
Recommended Strategy
| Trigger | What to Run | Timeout | Notes |
|---|---|---|---|
| On every PR | make test-agent-benchmark | 10 min | No Docker, no LLM, no external deps |
| Nightly | make test-agent-benchmark-e2e | 20 min | Requires Docker apps running |
| Monthly / manual | make benchmark-agent-generate | 30 min | Requires configured agent, regenerates fixtures |
Example GitHub Actions Workflow
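A minimal workflow sketch implementing the table above. Action versions, the Go version, and the cron schedule are assumptions to adapt; only the make targets and timeouts come from this document:

```yaml
# Illustrative only — adapt names, versions, and schedule to the repo's CI.
name: agent-benchmarks
on:
  pull_request:
  schedule:
    - cron: "0 3 * * *"   # nightly E2E run
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: "1.22"
      - name: Layers 1-3 (no Docker, no LLM)
        run: make test-agent-benchmark
        timeout-minutes: 10
      - name: Layer 4 E2E (nightly only)
        if: github.event_name == 'schedule'
        run: make test-agent-benchmark-e2e
        timeout-minutes: 20
```

The monthly `make benchmark-agent-generate` run is better left as a manually triggered workflow, since it needs a configured agent with real credentials.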
Troubleshooting
Fixture file not found
- Check out the repository with LFS (fixtures are committed)
- Run `make benchmark-agent-generate` to regenerate them (requires configured agent)
Parsing count mismatch
Update the expected `finding_count` in the YAML definition to match the actual fixture, or regenerate the fixture.
Quality soft assertion warnings
- The fixture was generated with a different model
- The source stub doesn’t clearly exhibit the vulnerability
- The prompt template doesn’t emphasize that vulnerability class
Handoff conversion errors
Records that cannot be converted are skipped; the `skipped_count` in the YAML definition should account for these.
E2E target app not reachable
Generator fails with “agent not found”
Check the agent configuration in `~/.vigolium/vigolium-configs.yaml`:
Build tag errors
The benchmark test files use the `agent_benchmark` build tag, which gopls doesn’t include by default. Add it to your editor’s gopls configuration:
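For VS Code this means setting gopls `buildFlags` in `settings.json` (other editors expose the same gopls option under their own config keys):

```json
{
  "gopls": {
    "buildFlags": ["-tags=agent_benchmark"]
  }
}
```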
`go test -tags=agent_benchmark` handles this correctly.