mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 17:41:29 +00:00

Files

T

shankar0123 a923cf697c harden(auth): demo-mode residual-grants detector + cleanup endpoint + CI guard (A-8)

Audit 2026-05-11 A-8 closure. Closes the deferred Phase 2 leg of the
2026-05-10 HIGH-12 closure (2e97cc1) — production-startup observability
for actor-demo-anon residual grants + CI guard banning new synthetic-
admin code paths.

What this changes:

* cmd/server/preflight_demo_residual.go (new) runs after the DB pool +
  audit service are constructed and before the HTTPS listener starts.
  Under any non-'none' auth type it queries actor_roles for the
  synthetic actor-demo-anon and emits a WARN log + a categorized audit
  row (auth.demo_residual_grants_detected) listing every grant
  present. Migration 000029 unconditionally seeds the ar-demo-anon-admin
  row at install time, so EVERY production deploy will see this WARN
  on first boot; the intended cutover workflow is cleanup-once at
  production handover.

* CERTCTL_DEMO_MODE_RESIDUAL_STRICT (new env var on AuthConfig,
  default false) pivots the WARN to fail-closed startup refusal for
  operators who want a paranoid posture against re-seeding.

* POST /api/v1/auth/demo-residual/cleanup (new handler at
  internal/api/handler/demo_residual.go) is an admin-class
  (auth.role.assign) endpoint that removes every actor-demo-anon row
  from actor_roles and returns {removed: int64}. Idempotent; refuses
  503 under Auth.Type=none (deleting the row would break the demo
  path); audit-logs every invocation including no-op zero-removed
  calls so the admin's action is always recorded.

* scripts/ci-guards/no-new-synthetic-admin.sh pins the 17-entry
  allowlist of source files that legitimately reference the
  actor-demo-anon literal. New runtime code paths that resolve to the
  synthetic actor (the same pattern that produced the original CRIT
  class) are rejected at PR time. CI workflow auto-picks the script
  via the existing scripts/ci-guards/*.sh loop in .github/workflows/
  ci.yml; no workflow edit needed.

Regression matrix:

* cmd/server/preflight_demo_residual_test.go — 7 tests covering the
  4 main behaviour branches (testcontainers-backed, testing.Short()-
  skipped: DemoModeActive_Skips, NoResidue_Passes, HasResidue_LogsAnd
  Audits, StrictMode_RefusesStartup, DeleteDemoAnonResidue_Idempotent)
  plus 3 pure-Go stdlib unit tests for the row-string formatter +
  nil-safety contracts on both helpers.

* internal/api/handler/demo_residual_test.go — 7 stdlib+httptest
  cases: HappyPath, Idempotent_ReturnsZero, RejectsInDemoMode (503),
  CleanupError_Surfaces500, NilCleanupFn (defensive 500),
  NilAuditWriter_DoesNotPanic, MissingActorContext (falls back to
  'unknown' actor in the audit row).

* internal/api/router/openapi_parity_test.go — new
  POST /api/v1/auth/demo-residual/cleanup entry plus 6 pre-existing
  pre-A-8 entries (oidc/test, jwks-status, users CRUD, runtime-config)
  that had drifted out of SpecParityExceptions; the parity test was
  red on dev/auth-bundle-2 before my work; this commit returns it to
  green with full per-entry justifications + parity-debt notes.

Docs:

* docs/operator/security.md — new 'Demo-to-production cutover (Audit
  2026-05-11 A-8)' section explaining the WARN message, the cleanup
  curl one-liner, the equivalent SQL, the strict-mode env var, and
  the CI guard.

* docs/operator/rbac.md — Last-reviewed bump + pointer to the new
  env var + the security.md section.

* cowork/auth-bundles-audit-2026-05-10.md — HIGH-12 row gains an
  'A-8 follow-on CLOSED 2026-05-11' annotation describing the
  deferred Phase 2 leg now landed.

* CHANGELOG.md — Unreleased ### Security entry summarizing the four
  legs (detector + cleanup + strict-mode flag + CI guard) and the
  acquisition-readiness narrative this closes.

Operator-facing impact: this closes a credibility gap, not an
exploitable vulnerability. The residue requires a regression
elsewhere in the middleware chain to be exploitable. After this
fix, the canonical narrative ('RBAC primitive with no synthetic-
admin fallback') is fully true.

Refs cowork/auth-bundles-fixes-2026-05-11/08-high-demo-mode-residual-
cleanup.md.

2026-05-11 11:45:54 +00:00

17 KiB

Raw Blame History

RBAC operator reference

Last reviewed: 2026-05-11

Audit 2026-05-11 A-8 follow-on: demo-mode residual-grants detector

cleanup endpoint shipped. New env var: CERTCTL_DEMO_MODE_RESIDUAL_STRICT (default false). Operator workflow at security.md#demo-to-production-cutover-audit-2026-05-11-a-8.

This is the operator-facing reference for the role-based access control primitive that ships with Bundle 1 (auth bundle 1) of certctl. Read this if you're running certctl in production and need to grant / revoke access to API keys, set up the auditor split, or onboard the first admin.

For the threat model behind these controls, see auth-threat-model.md. For the migration flow from a pre-Bundle-1 deployment, see docs/migration/api-keys-to-rbac.md.

Mental model

Every action against the certctl HTTP / CLI / MCP / GUI surface is performed by an actor (an API key, an agent's machine identity, the synthetic demo-anon actor when the server runs in CERTCTL_AUTH_TYPE=none mode). Each actor holds zero or more roles. Each role grants a set of permissions at a scope. A request to a gated endpoint succeeds when the actor's effective permission set (the union across all held roles) contains the permission the endpoint requires.

The schema lives in migrations/000029_rbac.up.sql and ships with seven seeded default roles + a 33-permission canonical catalogue. The middleware that gates requests lives at internal/auth/require_permission.go. The service-layer authorizer that resolves "actor → permissions" lives at internal/service/auth/authorizer.go.

Default roles (seeded by migration 000029)

Role	ID	Use case	Permission shape
Admin	`r-admin`	Operator with full control	Every permission in the canonical catalogue
Operator	`r-operator`	Day-to-day cert lifecycle	`cert.`, `profile.read`, `issuer.read`, `target.`, `agent.read`, `audit.read`
Viewer	`r-viewer`	Read-only console access	`*.read` for every resource type
Agent	`r-agent`	Machine identity for `certctl-agent`	`cert.read` + `agent.heartbeat` + `agent.job.poll` + `agent.job.complete` + `agent.job.report`
MCP	`r-mcp`	Operator-equivalent for the MCP server, minus destructive ops	Like Operator without `*.delete`
CLI	`r-cli`	Day-to-day operator CLI	Like Operator + `auth.key.list` / `auth.key.create` / `auth.key.rotate`
Auditor	`r-auditor`	Compliance reviewer	`audit.read` + `audit.export` ONLY

Note on actor-type binding (Audit 2026-05-10 LOW-8): Roles in the catalogue are NOT bound to a specific actor_type. r-mcp is named for clarity ("the role MCP service accounts hold") but the schema permits granting it to any actor — including a human OIDC user. Same goes for r-cli and r-agent. The role-grant API accepts {actor_id, actor_type, role_id} tuples; the actor_type constraint lives on the grant row, not the role definition. Operators who want to enforce "only API-key actors hold r-mcp" should write that as an operator-side policy + verify via a periodic audit query against actor_roles joined to api_keys / users. Native role-to- actor-type binding is on the v2 roadmap.

The auditor split is the load-bearing one: an auditor cannot read certificates, profiles, or issuers - only audit events. That makes the role legitimate to hand to a SOC 2 / FedRAMP / PCI auditor without giving them the keys to the kingdom. The internal/domain/auth/auditor_test.go invariants pin this set going forward.

The five admin-only fine-grained perms seeded by migration 000030 (Phase 3.5 conversion) gate the high-blast-radius endpoints:

cert.bulk_revoke - POST /api/v1/certificates/bulk-revoke and the EST sibling
crl.admin - /api/v1/admin/crl/cache
scep.admin - /api/v1/admin/scep/intune/*
est.admin - /api/v1/admin/est/*
ca.hierarchy.manage - /api/v1/issuers/{id}/intermediates, /api/v1/intermediates/{id}

Only r-admin holds these by default. To delegate one, create a custom role with the specific perm and grant it to the right actor.

Permission catalogue

The catalogue is namespaced. Permission strings are stable across releases; new permissions add to the namespace, never reshape an existing one. Run certctl-cli auth permissions list (or GET /api/v1/auth/permissions) for the live catalogue.

Namespace	Examples	What the namespace gates
`cert.*`	`cert.read`, `cert.issue`, `cert.revoke`, `cert.delete`, `cert.bulk_revoke`	The certificate lifecycle surface (`/api/v1/certificates`)
`profile.*`	`profile.read`, `profile.edit`, `profile.delete`	`CertificateProfile` CRUD
`issuer.*`	`issuer.read`, `issuer.edit`, `issuer.delete`	Issuer connector config
`target.*`	`target.read`, `target.edit`, `target.delete`	Deployment target config
`agent.*`	`agent.read`, `agent.edit`, `agent.retire`, `agent.heartbeat`, `agent.job.*`	Agent fleet + agent self-service endpoints
`audit.*`	`audit.read`, `audit.export`	The audit-events surface
`auth.role.*`	`auth.role.list`, `auth.role.create`, `auth.role.edit`, `auth.role.delete`, `auth.role.assign`	RBAC management
`auth.key.*`	`auth.key.list`, `auth.key.create`, `auth.key.rotate`, `auth.key.delete`	API key management
`auth.bootstrap.*`	`auth.bootstrap.use`	Day-0 first-admin path
`crl.admin`, `scep.admin`, `est.admin`, `ca.hierarchy.manage`	(single perms)	The five admin-only fine-grained perms (see above)
`job.*`	`job.read`, `job.cancel`	Deployment job lifecycle
`approval.*`	`approval.read`, `approval.approve`, `approval.reject`	Two-person approval workflow (cert-issuance + profile-edit)
`policy.*`	`policy.read`, `policy.edit`, `policy.delete`	Compliance policies + renewal policies
`team.`, `owner.`	`team.read`, `team.edit`, `team.delete`, `owner.*`	Organizational metadata
`notification.*`	`notification.read`, `notification.edit`	Notification queue + requeue
`discovery.*`	`discovery.read`, `discovery.run`, `discovery.claim`	Agent + cloud-secret-store discovery
`network_scan.*`	`network_scan.read`, `network_scan.edit`, `network_scan.run`	TLS network scanning + SCEP probing
`healthcheck.*`	`healthcheck.read`, `healthcheck.edit`, `healthcheck.delete`, `healthcheck.acknowledge`	Uptime monitors
`digest.*`	`digest.read`, `digest.send`	Operator-summary digest emails
`verification.*`	`verification.read`, `verification.run`	Post-deploy verification
`stats.read`, `metrics.read`	(single perms)	Dashboard summary + Prometheus exposition

The full catalogue lives in internal/domain/auth/validate.go. The router-level enforcement sits in internal/api/router/router.go; the AST-level CI guard TestRouterRBACGateCoverage pins the contract — adding a new state-changing or read endpoint without an rbacGate / rbacGateScoped wrap fails CI.

Scope semantics

Permissions are granted at one of three scopes:

global - applies to every resource in the tenant. The default for the seeded role grants. A cert.read grant at global scope lets the actor read any certificate.
profile - applies only to the named CertificateProfile (matched by ID). cert.issue at scope profile/p-corp-cdn lets the actor issue against p-corp-cdn only.
issuer - applies only to the named issuer. Lets you grant issuer.edit on the production issuer to a senior operator without giving them edit on every issuer.

Global beats specific: an actor with cert.read at global scope passes a cert.read check against any specific profile or issuer even if no scoped grant exists. The reverse is also true - a scoped grant doesn't satisfy a request against a different scope. The Authorizer's CheckPermission is the single point of truth.

Note (Bundle 1 deferral): the scope_id column is not currently FK-constrained against the resource tables. An operator can grant a permission at scope profile/p-bogus without p-bogus existing; the gate still works (no rows match at request time), but the API does not 404 the grant. Bundle 2 tracks the strict-FK closure. See internal/repository/postgres/auth.go::AddPermission's TODO(bundle-2) comment.

Granting + revoking access

From the GUI

/auth/roles lists every role; click into one to see its permissions and (if you hold auth.role.edit) add or remove a permission. /auth/keys lists every actor with role grants; click "Assign role" to grant, click the × on a role tag to revoke.

The synthetic actor-demo-anon row is shown but flagged "system-managed" with the mutation buttons hidden - the server-side reserved-actor guard rejects mutations against it regardless.

From the CLI

# Identity probe - what can the current API key actually do?
certctl-cli auth me

# Roles
certctl-cli auth roles list
certctl-cli auth roles get r-admin

# Permissions catalogue
certctl-cli auth permissions list

# Key → role assignment
certctl-cli auth keys list
certctl-cli auth keys assign alice --role r-operator
certctl-cli auth keys revoke alice --role r-admin

# Walk-every-key prompt for downgrade
certctl-cli auth keys scope-down

# Audit-driven role suggestion (last 30 days of audit events)
certctl-cli auth keys scope-down --suggest
certctl-cli auth keys scope-down --suggest --apply

# JSON-driven scope-down for automation (Helm post-upgrade hook etc.)
certctl-cli auth keys scope-down --non-interactive ./scope-down.json

The mutating role-lifecycle commands (certctl-cli auth roles create / update / delete + roles add-permission / remove-permission) are tracked as Bundle 1 Phase 5.5 follow-up; today, manage custom roles via the HTTP API or GUI.

From the HTTP API

Every endpoint is documented in api/openapi.yaml under the [Auth] tag. Quick reference:

Endpoint	Permission
`GET /v1/auth/me`	(none - own data)
`GET /v1/auth/roles`	`auth.role.list`
`GET /v1/auth/roles/{id}`	`auth.role.list`
`POST /v1/auth/roles`	`auth.role.create`
`PUT /v1/auth/roles/{id}`	`auth.role.edit`
`DELETE /v1/auth/roles/{id}`	`auth.role.delete`
`GET /v1/auth/permissions`	`auth.role.list`
`POST /v1/auth/roles/{id}/permissions`	`auth.role.edit`
`DELETE /v1/auth/roles/{id}/permissions/{perm}`	`auth.role.edit`
`GET /v1/auth/keys`	`auth.role.list`
`POST /v1/auth/keys/{id}/roles`	`auth.role.assign`
`DELETE /v1/auth/keys/{id}/roles/{role_id}` (+ optional `?scope_type=` / `?scope_id=`)	`auth.role.assign`
`GET /v1/auth/check`	(authenticated; surfaces effective perms)
`GET /v1/auth/bootstrap` + `POST /v1/auth/bootstrap`	(auth-exempt; gated by env-var token)

Revoke: legacy "all variants" vs scope-selective (Audit 2026-05-11 A-4)

DELETE /v1/auth/keys/{id}/roles/{role_id} runs in one of two modes, selected by presence of the optional query parameters:

No query params (legacy "revoke all variants") — every scoped grant of this role held by this actor is dropped. Idempotent: zero-row deletes return 204 (no error). This is the pre-A-4 behaviour and remains the default for the CLI / GUI buttons that don't know about scope.
```
# Drop EVERY variant of r-operator from alice (global, profile-scoped,
# issuer-scoped — all gone).
curl -X DELETE https://certctl.example.com/api/v1/auth/keys/alice/roles/r-operator
```

?scope_type= (+ optional ?scope_id=) — drop ONE variant. Used when an actor holds the same role at multiple scopes (HIGH-10 made that representable; A-4 makes it selectively revocable). scope_type=global requires scope_id to be absent; scope_type=profile / issuer require scope_id. No match returns 404 so operators get feedback when they target a scope variant the actor doesn't hold.

# Alice holds r-operator scoped to p-acme AND p-globex.
# Drop ONLY the p-acme grant; the p-globex grant stays.
curl -X DELETE 'https://certctl.example.com/api/v1/auth/keys/alice/roles/r-operator?scope_type=profile&scope_id=p-acme'

# Drop ONLY the global grant of r-operator (keeps any profile / issuer variants):
curl -X DELETE 'https://certctl.example.com/api/v1/auth/keys/alice/roles/r-operator?scope_type=global'

The audit row's details payload records which mode fired — scope: "all_variants" for the legacy path, or the explicit scope_type + scope_id for selective revoke — so SOC / SIEM can distinguish wide cleanups from targeted demotions in the access log.

From the MCP server

Bundle 1 Phase 11 ships 12 RBAC tools: certctl_auth_me, certctl_auth_list_roles, certctl_auth_get_role, certctl_auth_create_role, certctl_auth_update_role, certctl_auth_delete_role, certctl_auth_list_permissions, certctl_auth_add_permission_to_role, certctl_auth_remove_permission_from_role, certctl_auth_list_keys, certctl_auth_assign_role_to_key, certctl_auth_revoke_role_from_key. Each routes through the same HTTP surface above; permission gates fire server-side.

The auditor pattern

Hand the auditor key to compliance reviewers. They get:

GET /api/v1/audit?category=auth - every auth/authz mutation in the system (role creates, role grants on actors, bootstrap consumption, etc.).
GET /api/v1/audit?category=cert_lifecycle - every cert event.
GET /api/v1/audit?category=config - every issuer / target / settings edit.
GET /api/v1/audit/export - bulk export.

They do NOT get cert read, profile read, issuer read, or any mutating permission. The categorization is enforced by the database CHECK constraint (migration 000032); the WORM trigger from migration 000018 keeps the audit table append-only at the DB layer.

To create an auditor key:

certctl-cli auth keys assign <key-id> --role r-auditor
(Optional) Revoke any other roles the key holds with certctl-cli auth keys revoke <key-id> --role r-...
Confirm via certctl-cli auth me while authenticated as the auditor key - the response should show only audit.read and audit.export in effective_permissions.

Day-0 bootstrap (first-admin path)

Bundle 1 Phase 6 ships a one-shot bootstrap endpoint for fresh deployments where no admin actor exists yet.

Set CERTCTL_BOOTSTRAP_TOKEN=$(openssl rand -hex 32) in the server environment.
Boot the server. Logs include "bootstrap endpoint enabled - POST /api/v1/auth/bootstrap to mint the first admin key (one-shot)" when the path is callable.

Run a single curl:

curl -X POST $URL/api/v1/auth/bootstrap \
  -H 'Content-Type: application/json' \
  -d '{"token":"<the-token>","actor_name":"first-admin"}'

Capture the key_value from the response. It is shown ONCE. The server never logs it.
Use the new key to authenticate against the rest of the API. The bootstrap path is now closed: subsequent calls return HTTP 410 Gone, even with the same valid token, because an admin actor exists.

The token is constant-time-compared. The server logs a startup warning if CERTCTL_BOOTSTRAP_TOKEN is set AND admin actors already exist (config-drift signal). For OIDC-first-admin (the "first user who signs in via SSO becomes admin" pattern), wait for Bundle 2.

Demo mode (`CERTCTL_AUTH_TYPE=none`)

When auth is disabled, the server injects a synthetic actor actor-demo-anon into every request context. That actor holds r-admin at global scope (seeded by migration 000029), so every gated route resolves with a populated actor and admin grants. The synthetic actor is reserved: the API rejects any mutation that targets it (HTTP 409 with ErrAuthReservedActor).

Production deployments MUST NOT use demo mode - there is no per-request actor identity for the audit trail, and every request flows as admin. Use it for the docker compose up demo + the five example folders only.

Where to look next

Threat model - what attacks this primitive defends against and which it does not
Migration guide - moving pre-Bundle-1 deployments onto RBAC
Profiles - the RequiresApproval=true flow that Bundle 1 Phase 9 closure protects from flip-flop
Approval workflow - the Rank 7 Infisical deep-research deliverable that the Phase 9 closure piggybacks on
internal/auth/ - the middleware + keystore + RequirePermission
internal/service/auth/ - the service-layer Authorizer
cowork/auth-bundle-1-prompt.md - the design + phase plan
cowork/auth-bundles-index.md - the per-phase status tracker

17 KiB Raw Blame History Unescape Escape