certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 18:51:32 +00:00

Author	SHA1	Message	Date
shankar0123	17b30c1f7f	auth-bundle-2 Phase 4: session service (cookie minting + signature validation, idle/absolute expiry, signing-key rotation, CSRF, GC), 15-case negative-test matrix, fail-fatal initial-key bootstrap

Phase 4 of the bundle ships the post-login session lifecycle that backs every authenticated request once Phase 5 wires the OIDC handlers + the session middleware. The state machine is the load-bearing primitive for the Bundle 2 control plane: forge a session cookie and you bypass every RBAC gate.

Service surface (internal/auth/session/service.go, ~880 LOC):
- Service.Create(actorID, actorType, ip, ua) -> CreateResult
  Mints a session row; signs the cookie value with the active signing key; returns the cookie payload AND the CSRF token plaintext for the handler to set on the response.
- Service.Validate(ValidateInput) -> Session
  Parses the cookie, looks up the signing key (incl. retired-but-in-retention), recomputes HMAC-SHA256, loads the session row, enforces revocation + absolute + idle expiry + optional IP/UA bind. Maps to one of 9 sentinel errors; the handler uniformly returns 401 to the wire (specific reason in the audit row).
- Service.ValidateCSRF(headerValue, *Session) error
  Constant-time compares SHA-256(header) against the stored hash on the session row.
- Service.UpdateLastSeen / Revoke / RevokeAllForActor
- Service.RotateCSRFToken: mints fresh token, persists hash, returns plaintext; called on login completion, logout, role-change against actor, explicit operator rotate.
- Service.RotateSigningKey: mints new active key, retires previous; retired keys stay valid for cfg.SigningKeyRetention so existing cookies don't immediately fail.
- Service.EnsureInitialSigningKey: idempotent; mints first key on fresh deploys; emits auth.session_signing_key_bootstrap audit row with event_category=auth.
  Wired into cmd/server/main.go AFTER migrations + RBAC backfill, BEFORE the HTTP listener binds; failure is FATAL (logger.Error + os.Exit(1)) per the prompt: the server refuses to boot rather than serve session-less.
- Service.GarbageCollect: sweeps expired post-login sessions + pre-login rows >10min + retired-past-retention signing keys. Wired into the new internal/scheduler/scheduler.go::sessionGCLoop on a CERTCTL_SESSION_GC_INTERVAL tick.

Cookie wire format (load-bearing):

  v1.<session_id>.<signing_key_id>.<base64url-no-pad(HMAC-SHA256)>

The HMAC input is LENGTH-PREFIXED to defeat concatenation collisions:

  len(session_id) || ":" || session_id || ":" || len(signing_key_id) || ":" || signing_key_id

where len(...) is the ASCII decimal byte-length. Without the length prefix, the bare-concatenation form `session_id || signing_key_id` would let a forger swap one byte across the boundary: `<a, bc>` and `<ab, c>` produce identical HMAC inputs. The length prefix moves the boundary into the input itself so the two cases can never collide.

The v1. version prefix is reserved. A future incompatible upgrade ships as v2. and the parser rejects unknown prefixes (no fallback).

CSRF token model:
- Plaintext goes in a JS-readable certctl_csrf cookie (HttpOnly=false intentional; the GUI must read it to echo into X-CSRF-Token header).
- SHA-256 hash of the plaintext lives on the session row.
- Validation: SHA-256(X-CSRF-Token) constant-time-compared.
- Rotated by Service.RotateCSRFToken on login / logout / role-change / explicit admin-trigger.

Optional defense-in-depth (default OFF):
- CERTCTL_SESSION_BIND_IP: Validate compares client IP to row's recorded IP. Mismatch -> 401, audit row, session NOT auto-revoked (user may have legitimate IP change). Mobile + corporate-NAT environments leave this off.
- CERTCTL_SESSION_BIND_USER_AGENT: same shape against UA.
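
The length-prefixed HMAC input and the v1 cookie layout described above can be sketched roughly as follows. This is a minimal illustration using only the Go standard library, not the code in internal/auth/session/service.go; the function names hmacInput and signCookie and the demo secret are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strconv"
)

// hmacInput builds the length-prefixed message:
// len(session_id) ":" session_id ":" len(signing_key_id) ":" signing_key_id
// where len is the ASCII decimal byte length. (Hypothetical helper name.)
func hmacInput(sessionID, keyID string) []byte {
	return []byte(strconv.Itoa(len(sessionID)) + ":" + sessionID + ":" +
		strconv.Itoa(len(keyID)) + ":" + keyID)
}

// signCookie returns v1.<session_id>.<signing_key_id>.<base64url-no-pad(HMAC-SHA256)>.
// (Hypothetical helper name; the secret handling is an assumption.)
func signCookie(secret []byte, sessionID, keyID string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(hmacInput(sessionID, keyID))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return "v1." + sessionID + "." + keyID + "." + sig
}

func main() {
	left, right := "a"+"bc", "ab"+"c"
	// Bare concatenation cannot tell the two pairs apart.
	fmt.Println(left == right) // true

	// The length-prefixed form keeps the boundary inside the input.
	fmt.Println(string(hmacInput("a", "bc"))) // 1:a:2:bc
	fmt.Println(string(hmacInput("ab", "c"))) // 2:ab:1:c

	secret := []byte("demo-secret-not-for-production")
	fmt.Println(signCookie(secret, "a", "bc") == signCookie(secret, "ab", "c")) // false
}
```

A verifier would recompute the same HMAC over the parsed segments and compare it with hmac.Equal, which runs in constant time.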
Configurable lifetimes (env vars wired in internal/config/config.go):

  CERTCTL_SESSION_IDLE_TIMEOUT            1h
  CERTCTL_SESSION_ABSOLUTE_TIMEOUT        8h
  CERTCTL_SESSION_SIGNING_KEY_RETENTION   24h
  CERTCTL_SESSION_GC_INTERVAL             1h
  CERTCTL_SESSION_SAMESITE                Lax
  CERTCTL_SESSION_BIND_IP                 false
  CERTCTL_SESSION_BIND_USER_AGENT         false

Test surface (internal/auth/session/service_test.go, ~860 LOC):

All 15 prompt-mandated negative cases:
 1. Tampered cookie (HMAC byte flipped near segment start where all 6 bits are real; the last char of a base64url-no-pad 32-byte HMAC carries only 4 significant bits, the other 2 being padding, so a tail-flip is unreliable).
 1b. Tampered SESSION_ID segment (same HMAC-recompute outcome).
 2. Cookie missing v1. prefix.
 3. Cookie with unknown version prefix (v99).
 4. Idle expiry: back-dated last_seen_at + idle_expires_at.
 5. Absolute expiry: back-dated absolute_expires_at.
 6. Revoked session.
 7. Wrong signing key id (no row matches).
 8. Cookie signed under retired-but-in-retention key SUCCEEDS.
 9. Cookie signed under retired-past-retention key FAILS.
10. Concatenation collision: direct evidence that computeHMAC("abc","de") != computeHMAC("ab","cde") AND that a forged-boundary-slide cookie is rejected.
11. CSRF token missing.
12. CSRF token mismatch (constant-time compare).
13. IP-bind enabled + IP changed -> ErrSessionIPMismatch + audit row.
14. UA-bind enabled + UA changed -> ErrSessionUAMismatch + audit row.
15. EnsureInitialSigningKey RNG failure -> ErrInitialSigningKeyMintFailed wrap (cmd/server/main.go treats as fatal).
Plus coverage-lift batch covering: every error wrap on every repo collaborator (Create, Get, UpdateLastSeen, UpdateCSRFTokenHash, Revoke, RevokeAllForActor, GC), every RNG-failure surface in Create / RotateCSRFToken / RotateSigningKey, every alg-pinning helper edge, the cookie parser's full negative matrix (empty, wrong segment count, missing prefixes, bad base64, wrong HMAC length), and a real-encryption round-trip via internal/crypto.EncryptIfKeySet -> DecryptIfKeySet so the v3-blob path is exercised end-to-end at the session-cookie level.

Coverage:
  internal/auth/session          94.5% (floor 90)
  internal/auth/session/domain   96+%  (floor 90, Phase 1)

.github/coverage-thresholds.yml extended with 2 new gate entries (internal/auth/session and internal/auth/session/domain). The `why:` paragraphs explain why each fail-closed branch is load-bearing.

Repository extensions: internal/repository/session.go gains UpdateCSRFTokenHash on the SessionRepository interface; internal/repository/postgres/session.go ships the implementation. RotateCSRFToken consumes it.

Scheduler extensions: internal/scheduler/scheduler.go gains SessionGarbageCollector interface + sessionGC field + sessionGCInterval + SetSessionGarbageCollector + SetSessionGCInterval + sessionGCLoop. Pattern matches the existing acmeGCLoop: atomic.Bool guard prevents concurrent sweeps, sync.WaitGroup tracks for graceful shutdown, per-tick context.WithTimeout(1m) bounds a stuck Postgres.

Server wiring: cmd/server/main.go constructs sessionService AFTER the bootstrap block (post-RBAC backfill) and BEFORE the policy-service block. EnsureInitialSigningKey runs immediately; failure is fatal via os.Exit(1). The scheduler section wires SetSessionGarbageCollector + SetSessionGCInterval alongside the other interval setters and emits an Info log so operators can confirm the loop is enabled.

Phase 4 deviation note: Service.GarbageCollect() returns (int, error) rather than the prompt's literal `error`.
The int is the count of session rows deleted on this sweep; the scheduler discards it (`_, err := ...`) but tests + future operator-facing audit rows can read it. The wider behavior matches the spec exactly.

Verifications: gofmt clean, go vet ./internal/auth/session/... ./internal/scheduler/... ./internal/config/... ./cmd/server/... ./internal/repository/... clean, go test -short -count=1 -race green across all 3 session packages, full repository + auth + scheduler + config test sweeps green, no regressions in Bundle 1 packages.	2026-05-10 05:31:24 +00:00
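
The CSRF token model in the Phase 4 message above (plaintext in a JS-readable cookie, SHA-256 hash on the session row, constant-time comparison on validation) might look roughly like this. It is a sketch under assumptions, not the repository's Service.ValidateCSRF: the 32-byte token size, the helper names newCSRFToken and validateCSRF, and the errCSRFMismatch sentinel are all hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"errors"
	"fmt"
)

// errCSRFMismatch is a hypothetical sentinel for this sketch.
var errCSRFMismatch = errors.New("csrf token missing or mismatched")

// newCSRFToken returns the plaintext (for the certctl_csrf cookie) and its
// SHA-256 hash (for the session row). The 32-byte size is an assumption.
func newCSRFToken() (plaintext string, hash [32]byte, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", hash, err
	}
	plaintext = base64.RawURLEncoding.EncodeToString(buf)
	return plaintext, sha256.Sum256([]byte(plaintext)), nil
}

// validateCSRF hashes the X-CSRF-Token header value and compares it with the
// stored hash in constant time. An empty header fails closed.
func validateCSRF(headerValue string, stored [32]byte) error {
	if headerValue == "" {
		return errCSRFMismatch
	}
	got := sha256.Sum256([]byte(headerValue))
	if subtle.ConstantTimeCompare(got[:], stored[:]) != 1 {
		return errCSRFMismatch
	}
	return nil
}

func main() {
	token, hash, err := newCSRFToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(validateCSRF(token, hash) == nil)     // true
	fmt.Println(validateCSRF(token+"x", hash) == nil) // false
	fmt.Println(validateCSRF("", hash) == nil)        // false
}
```

Storing only the hash means a leaked session row does not reveal a usable CSRF token.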
shankar0123	854135dfb7	auth-bundle-2 Phase 3: OIDC service (HandleAuthRequest, HandleCallback, RefreshKeys), hand-rolled group-claim resolver, 21+ negative-test matrix, token-leak hygiene, IdP downgrade-attack defense

Phase 3 of the bundle ships the business logic that turns the Phase 2 storage primitives into a working OpenID Connect 1.0 + RFC 7636 PKCE authorization-code flow against any enterprise IdP (Okta / Azure AD / Google Workspace / Keycloak / Authentik / Auth0).

Service surface:
- Service.HandleAuthRequest(providerID) -> authURL, cookie, preLoginID
  Builds the IdP redirect with PKCE-S256 (mandatory; RFC 9700 §2.1.1), server-generated 32-byte state + nonce, persisted to the pre-login row keyed by the cookie value.
- Service.HandleCallback(cookie, code, state, ip, ua) -> CallbackResult
  11-step validation: pre-login lookup-and-consume (single-use), constant-time state compare, code-for-token exchange with PKCE verifier, ID-token verify (alg pin via go-oidc/v3), service-layer re-checks of iss / aud / azp (multi-aud requires it; mismatch rejected) / at_hash (REQUIRED when access_token returned; Phase 3 lifts the OIDC core "MAY" to a service-level "MUST") / exp / iat-window / nonce, group-claim resolution with userinfo fallback, group->role mapping (fail-closed on no match), user upsert, session mint via SessionMinter port.
- Service.RefreshKeys(providerID): explicit cache eviction + re-load. Re-runs the IdP downgrade-attack defense so a provider that later rotates to advertising HS / none is caught BEFORE the next user login attempt.

Security posture (every fail-closed branch is a sentinel error + test):
- Algorithm pinning: allow-list {RS256, RS512, ES256, ES384, EdDSA}; deny-list {HS256, HS384, HS512, none}. Belt-and-braces re-check via isDisallowedAlg after go-oidc.Verify.
- PKCE-S256 mandatory (oauth2.GenerateVerifier + S256ChallengeOption); `plain` rejection sentinel exists for defense-in-depth.
- State + nonce: 32-byte crypto/rand, base64url-no-pad, constant-time compare, single-use.
- IdP downgrade-attack defense: at provider creation / RefreshKeys, reject any IdP whose discovery doc advertises HS* / none in id_token_signing_alg_values_supported.
- JWKS fail-closed: in-flight login fails 503; existing sessions untouched. isJWKSFetchError detects the gooidc verify-error shape; ErrJWKSUnreachable is the wire mapping.
- Token-leak hygiene: ID tokens, access tokens, refresh tokens, authorization codes, PKCE verifiers, state, nonce, signing key bytes are NEVER logged at any level. logging_test.go pins the invariant via a slog buffer + grep-assert across HandleAuthRequest, HandleCallback, alg rejection, and provider-load paths.

Group-claim resolver (internal/auth/oidc/groupclaim/):
- Hand-rolled per Decision 10 (no JSON-path lib; ~150 LOC).
- URL-shape paths (https:// / http://) treated as a single literal key, so Auth0 namespaced claims like https://your-namespace/groups work without splitting on the dots in the URL.
- Dot-separated paths walked through nested map[string]interface{}.
- []interface{} / []string / single-string normalized to []string; bool / number / object / nil → fail closed.
- 18 unit tests + sentinels (ErrPathEmpty, ErrSegmentMissing, ErrSegmentNotObject, ErrInvalidValueType).
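
The resolver rules listed above (URL-shaped path as one literal key, dot-separated walk otherwise, normalization to []string, fail closed on other types) can be sketched as below. This is an illustrative reconstruction, not the code in internal/auth/oidc/groupclaim/: the sentinel names come from the commit message, while their messages, the resolve and normalize function names, and the handling of non-string array elements are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Sentinel names are taken from the commit message; the messages are assumed.
var (
	ErrPathEmpty        = errors.New("groupclaim: empty path")
	ErrSegmentMissing   = errors.New("groupclaim: path segment missing")
	ErrSegmentNotObject = errors.New("groupclaim: path segment is not an object")
	ErrInvalidValueType = errors.New("groupclaim: unsupported value type")
)

// resolve walks claims along path and returns the groups as []string.
func resolve(claims map[string]interface{}, path string) ([]string, error) {
	if path == "" {
		return nil, ErrPathEmpty
	}
	// URL-shaped paths are one literal key; anything else is split on dots.
	segments := []string{path}
	if !strings.HasPrefix(path, "https://") && !strings.HasPrefix(path, "http://") {
		segments = strings.Split(path, ".")
	}
	var cur interface{} = claims
	for _, seg := range segments {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return nil, ErrSegmentNotObject
		}
		next, ok := obj[seg]
		if !ok {
			return nil, ErrSegmentMissing
		}
		cur = next
	}
	return normalize(cur)
}

// normalize accepts string, []string, and []interface{} of strings.
// Everything else (bool, number, object, nil) fails closed.
func normalize(v interface{}) ([]string, error) {
	switch t := v.(type) {
	case string:
		return []string{t}, nil
	case []string:
		return t, nil
	case []interface{}:
		out := make([]string, 0, len(t))
		for _, e := range t {
			s, ok := e.(string)
			if !ok {
				return nil, ErrInvalidValueType // assumption: mixed arrays are rejected
			}
			out = append(out, s)
		}
		return out, nil
	default:
		return nil, ErrInvalidValueType
	}
}

func main() {
	claims := map[string]interface{}{
		"realm_access":                  map[string]interface{}{"roles": []interface{}{"admin", "auditor"}},
		"https://your-namespace/groups": []interface{}{"ops"},
		"active":                        true,
	}
	groups, _ := resolve(claims, "realm_access.roles")
	fmt.Println(groups) // [admin auditor]
	groups, _ = resolve(claims, "https://your-namespace/groups")
	fmt.Println(groups) // [ops]
	_, err := resolve(claims, "active")
	fmt.Println(err == ErrInvalidValueType) // true
}
```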
Test surface:
- service_test.go: 57 test functions including all 21 prompt-mandated negative cases (wrong aud / wrong iss / expired / unknown alg / alg=none / HMAC alg / azp missing on multi-aud / azp mismatched / at_hash missing / at_hash mismatched / iat in future / iat too old / nonce mismatched / state mismatched / state replayed / PKCE plain sentinel / pre-login replay / forged cookie / IdP downgrade / group-claim missing / group-claim unmapped) plus the userinfo fallback matrix (happy path + endpoint-missing + endpoint-failing + userinfo-also-empty), HandleAuthRequest entry point + RNG-failure paths, upsertUser update + create + display-name fallback + Validate-error paths, decryptClientSecret real-encrypt round-trip + bad-passphrase, alg-parser malformed-header matrix.
- logging_test.go: 4 hygiene tests pinning no token / code / verifier / state / cookie / client_secret / alg name appears in any captured log line.
- groupclaim/resolver_test.go: 18 cases covering Okta string-array, Keycloak realm_access.roles, Auth0 namespaced URL claim, single-string normalization, deeply-nested 3-segment walks, and every fail-closed branch.

Coverage:
  internal/auth/oidc              92.2%  (floor: 90)
  internal/auth/oidc/groupclaim   100.0% (floor: 95)
  internal/auth/oidc/domain       96.2%  (floor: 90)

Coverage gates added at .github/coverage-thresholds.yml so a future regression in any fail-closed branch fails CI before the commit lands.

Phase 3 of cowork/auth-bundle-2-prompt.md is closed. Next up: Phase 4 (Session service: cookies, revocation, sliding-vs-absolute expiry).	2026-05-10 04:56:03 +00:00
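
For readers unfamiliar with the PKCE-S256 requirement in the Phase 3 message above, the challenge derivation from RFC 7636 can be shown with the standard library alone. The commit says the service relies on oauth2.GenerateVerifier and S256ChallengeOption; the hypothetical s256Challenge helper below only illustrates what that option computes, using the worked example from RFC 7636 Appendix B.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// s256Challenge derives the PKCE code_challenge per RFC 7636 section 4.2:
// BASE64URL-ENCODE(SHA256(ASCII(code_verifier))), without padding.
// (Hypothetical helper name, for illustration only.)
func s256Challenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	// Verifier and expected challenge are the example values from RFC 7636 Appendix B.
	verifier := "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
	fmt.Println(s256Challenge(verifier))
	// Prints: E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM
}
```

Because the IdP only ever sees the hash during the authorization request, an attacker who intercepts the authorization code cannot redeem it without the original verifier.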
shankar0123	cbb47aaf5d	auth-bundle-1 Phase 11 + 12: RBAC MCP tools + negative-test coverage gate

# Phase 11: RBAC MCP tools

12 new tools in internal/mcp/tools_auth.go mirroring the Phase-4 + Phase-7 HTTP surface so operators driving certctl from Claude / VS Code / any MCP client get the same management capability the GUI + CLI already expose:

  certctl_auth_me                            GET    /v1/auth/me
  certctl_auth_list_roles                    GET    /v1/auth/roles
  certctl_auth_get_role                      GET    /v1/auth/roles/{id}
  certctl_auth_create_role                   POST   /v1/auth/roles
  certctl_auth_update_role                   PUT    /v1/auth/roles/{id}
  certctl_auth_delete_role                   DELETE /v1/auth/roles/{id}
  certctl_auth_list_permissions              GET    /v1/auth/permissions
  certctl_auth_add_permission_to_role        POST   /v1/auth/roles/{id}/permissions
  certctl_auth_remove_permission_from_role   DELETE /v1/auth/roles/{id}/permissions/{perm}
  certctl_auth_list_keys                     GET    /v1/auth/keys
  certctl_auth_assign_role_to_key            POST   /v1/auth/keys/{id}/roles
  certctl_auth_revoke_role_from_key          DELETE /v1/auth/keys/{id}/roles/{role_id}

Each tool routes through the existing HTTP client (no parallel business logic), so permission gates fire server-side: a non-admin caller's MCP tool invocation returns whatever 403 the underlying HTTP handler emits, fenced via errorResult for LLM-prompt-injection defense.

Input types in internal/mcp/types.go (AuthRoleIDInput, AuthCreateRoleInput, AuthUpdateRoleInput, AuthRolePermissionGrantInput, AuthRolePermissionRevokeInput, AuthAssignKeyRoleInput, AuthRevokeKeyRoleInput) carry jsonschema descriptions so the MCP consumer's tool catalogue shows operator-friendly hints.
internal/mcp/tools_auth_test.go ships 14 tests:
- TestAuthMCP_AllToolsRegister (registration must not panic)
- TestAuthMCP_PathsAndMethods (table-driven, 12 rows pinning each tool's HTTP method + URL)
- TestAuthMCP_ForbiddenSurfacesFencedError (12 tools × 403 mock → error surface)

internal/mcp/tools_per_tool_test.go's allHappyPathCases extended with the 12 new rows so the in-memory dispatch coverage gate (TestMCP_RegisterTools_DispatchableToolCount) stays green at the new total of 139 registered tools. Re-derived total via 'grep -cE "gomcp\.AddTool\(" internal/mcp/tools.go': 133 (121 in tools.go + 12 in tools_auth.go).

# Phase 12: negative-test coverage gate

Audit of the prompt's 12 negative-test paths against existing coverage:
 1. Missing actor → 401
    ✓ TestRequirePermission_NoActorReturns401, TestRBACGate_NoActorReturns401
 2. No roles → 403
    ✓ TestRequirePermission_DeniedActorReturns403, TestRBACGate_AuditorRole_403sOnAdminRoutes
 3. Role lacks specific perm → 403
    ✓ same suite
 4. Wrong scope → 403
    ✓ TestAuthorizer_SpecificScopeMatchesExactID (wrongID arm)
 5. Self-grant w/o auth.role.assign → 403
    ✓ TestActorRoleService_GrantRequiresAuthRoleAssign
 6. Bootstrap token wrong → 401
    ✓ TestEnvTokenStrategy_WrongTokenReturnsInvalidToken, TestBootstrapHandler_Mint_WrongToken_401
 7. Bootstrap used twice → 410
    ✓ TestEnvTokenStrategy_OneShotConsumption, TestBootstrapHandler_Mint_TwiceReturns410
 8. Bootstrap when admin exists → 410
    ✓ TestEnvTokenStrategy_AdminExistsClosesPath, TestBootstrapHandler_Mint_AdminExists410
 9. Role delete with assignees → 409
    NEW: TestRoleService_DeleteWithActorsAssignedReturns409
10. Profile-edit loophole → gated
    ✓ TestProfileEdit_RequiresApprovalLoopholeClosed
11. Permission not in catalog → 400
    ✓ TestRoleService_AddPermissionRejectsNonCanonical
12. Scope ID for nonexistent resource → 404
    (validation deferred: no FK constraint between role_permissions.scope_id and the resource tables; documented for a future bundle)

Filled the gap at #9 with TestRoleService_DeleteWithActorsAssignedReturns409 which pins the repository sentinel pass-through (postgres FK ON DELETE RESTRICT → repository.ErrAuthRoleInUse → service returns the sentinel verbatim → handler maps to HTTP 409).

# Coverage gates

.github/coverage-thresholds.yml gains 2 entries:
- internal/auth: floor 85
- internal/service/auth: floor 85

.github/workflows/ci.yml's coverage test command extended with ./internal/auth/... and ./internal/api/router/... so the threshold check has data to evaluate.

# Protocol-endpoint not-gated test (Category F)

internal/api/router/phase12_protocol_allowlist_test.go (new) adds 3 router-level invariant tests:
- TestPhase12_ProtocolEndpointsNotGated: AST-walks router.go, asserts no rbacGate(...) call references a path under any protocol-endpoint prefix (/acme, /scep, /.well-known/est, /.well-known/pki/ocsp, /.well-known/pki/crl).
- TestPhase12_IsProtocolEndpoint_CoversCanonicalPrefixes: pins auth.IsProtocolEndpoint against the canonical prefix set; if a future protocol lands without lockstep allowlist update, this fails.
- TestPhase12_RBACGateRoutesAreUnderAPIv1: belt-and-braces; every rbacGate-wrapped route MUST start with /api/v1/. Catches accidental cross-prefix wraps.

Complements the existing TestRequirePermission_ProtocolEndpointBypassesGate (middleware-level) + TestRouter_AuthExemptAllowlist_PinsActualRegistrations (allowlist drift) so the Category F invariant is pinned at all three layers (middleware + router + dispatch).

# Verifications

* gofmt clean repo-wide.
* go vet ./... clean.
* staticcheck across internal/auth + handler + router + cli + service + repository + cmd + domain + mcp: clean.
* go test -short -count=1 green across internal/auth (incl. bootstrap), internal/api/handler, internal/api/router, internal/cli, internal/service (incl. auth), internal/domain/auth, internal/mcp, cmd/server, cmd/cli.	2026-05-09 23:46:01 +00:00
shankar0123	86d92efd2b	ci-pipeline-cleanup Phase 2: coverage thresholds → YAML manifest

Bundle: ci-pipeline-cleanup, Phase 2 / frozen decision 0.3.

Move 9 hardcoded coverage thresholds from inline bash to a YAML manifest at .github/coverage-thresholds.yml. The load-bearing per-package context (Bundle reference, HEAD measurement, gap rationale) survives in the YAML's `why:` field instead of in inline bash comments. Adding a new gated package: one YAML entry instead of ~30 lines of bash + 50 lines of comment.

Coverage check logic extracted to scripts/check-coverage-thresholds.sh so the operator can run the same check locally:

  bash scripts/check-coverage-thresholds.sh

ci.yml dropped 557 → 417 lines (-140, total Phase 1+2: -1071, -72% from baseline 1488).

Same 9 floors, same fail-on-miss semantics (pure relocation):

  internal/service:                  70 (was: 70)
  internal/api/handler:              75 (was: 75)
  internal/domain:                   40 (was: 40)
  internal/api/middleware:           30 (was: 30)
  internal/crypto:                   88 (was: 88)
  internal/connector/issuer/local:   86 (was: 86)
  internal/connector/issuer/acme:    80 (was: 80)
  internal/connector/issuer/stepca:  80 (was: 80)
  internal/mcp:                      85 (was: 85)

Sandbox verification:
- ci.yml YAML-parses cleanly
- coverage-thresholds.yml YAML-parses cleanly with all 9 entries
- scripts/check-coverage-thresholds.sh extracts the (pkg, floor) table correctly from the YAML	2026-04-30 20:39:30 +00:00

4 Commits