skybridge/kms/docs/SYSTEM_IMPLEMENTATION_GUIDE.md
2025-08-26 19:16:41 -04:00

KMS System Implementation Guide

This document provides detailed implementation guidance for the KMS system, covering areas not extensively documented in other files. It serves as a comprehensive reference for developers working on system components.

Table of Contents

  1. Documentation Consistency Analysis
  2. Audit System Implementation
  3. Multi-Tenancy Support
  4. Cache Implementation Details
  5. Error Handling Framework
  6. Validation System
  7. Metrics and Monitoring
  8. Database Migration System
  9. Frontend Architecture
  10. Configuration Management

Documentation Consistency Analysis

Current State Assessment

The existing documentation is comprehensive but has some minor inconsistencies with the actual codebase:

Accurate Documentation Areas:

  • API endpoints match the implementation in handlers
  • Database schema aligns with migrations (especially the new audit_events table)
  • Authentication flows are correctly documented
  • Docker compose setup matches actual configuration
  • Security architecture accurately reflects implementation
  • Permission system documentation is consistent with code

⚠️ Minor Inconsistencies Found:

  1. Port references: Some docs mention port 80, but nginx actually listens on 8081
  2. Container names: The documentation uses generic names, while the compose file uses specific names such as kms-postgres
  3. Rate limiting values: The documented values differ from those in the middleware implementation
  4. Frontend build process: The docs state React 18, but package.json specifies 19+

Recently Added Features (Not in Original Docs):

  • Audit system with comprehensive event logging
  • Multi-tenancy support in database schema
  • Advanced caching layer with Redis integration
  • SAML authentication implementation
  • Advanced security middleware with brute force protection

Audit System Implementation

Overview

The KMS implements a comprehensive audit logging system that tracks all system events, user actions, and security-related activities.

Core Components

Audit Event Structure

// File: internal/audit/audit.go
type AuditEvent struct {
    ID           uuid.UUID              `json:"id"`
    Type         EventType              `json:"type"`
    Severity     Severity               `json:"severity"`
    Status       Status                 `json:"status"`
    Timestamp    time.Time              `json:"timestamp"`
    
    // Actor information
    ActorID      string                 `json:"actor_id"`
    ActorType    ActorType              `json:"actor_type"`
    ActorIP      string                 `json:"actor_ip"`
    UserAgent    string                 `json:"user_agent"`
    
    // Multi-tenancy support
    TenantID     *uuid.UUID             `json:"tenant_id,omitempty"`
    
    // Resource information
    ResourceID   string                 `json:"resource_id"`
    ResourceType string                 `json:"resource_type"`
    
    // Event details
    Action       string                 `json:"action"`
    Description  string                 `json:"description"`
    Details      map[string]interface{} `json:"details"`
    
    // Request context
    RequestID    string                 `json:"request_id"`
    SessionID    string                 `json:"session_id"`
    
    // Metadata
    Tags         []string               `json:"tags"`
    Metadata     map[string]interface{} `json:"metadata"`
}

Event Types Taxonomy

auth.* - Authentication events
├── auth.login - Successful user login
├── auth.login_failed - Failed login attempt
├── auth.logout - User logout
├── auth.token_created - Token generation
├── auth.token_revoked - Token revocation
└── auth.token_validated - Token validation

session.* - Session management
├── session.created - New session created
├── session.revoked - Session terminated
└── session.expired - Session timeout

app.* - Application management
├── app.created - Application created
├── app.updated - Application modified
└── app.deleted - Application removed

permission.* - Permission operations
├── permission.granted - Permission assigned
├── permission.revoked - Permission removed
└── permission.denied - Access denied

tenant.* - Multi-tenant operations
├── tenant.created - New tenant
├── tenant.updated - Tenant modified
├── tenant.suspended - Tenant suspended
└── tenant.activated - Tenant reactivated
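
Because every event type follows the category.name shape shown above, consumers can route or filter on the category prefix. The following is a minimal sketch of that idea, not code from internal/audit; SplitEventType and knownCategories are hypothetical names introduced here for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// knownCategories mirrors the top-level prefixes of the taxonomy above.
// The map itself is an illustrative assumption, not part of the codebase.
var knownCategories = map[string]bool{
	"auth": true, "session": true, "app": true, "permission": true, "tenant": true,
}

// SplitEventType splits "auth.login_failed" into ("auth", "login_failed").
// ok is false when the dot is missing, either part is empty, or the
// category is not one of the known prefixes.
func SplitEventType(eventType string) (category, name string, ok bool) {
	i := strings.IndexByte(eventType, '.')
	if i <= 0 || i == len(eventType)-1 {
		return "", "", false
	}
	category, name = eventType[:i], eventType[i+1:]
	if !knownCategories[category] {
		return "", "", false
	}
	return category, name, true
}

func main() {
	for _, t := range []string{"auth.login_failed", "tenant.suspended", "billing.charged", "auth"} {
		category, name, ok := SplitEventType(t)
		fmt.Printf("%-20s category=%q name=%q ok=%v\n", t, category, name, ok)
	}
}
```

Rejecting unknown categories at the point where events are created keeps typos such as "autth.login" out of the audit table, where they would otherwise be silently missed by prefix-based queries.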

Database Schema

-- File: migrations/004_add_audit_events.up.sql
CREATE TABLE audit_events (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    type VARCHAR(50) NOT NULL,
    severity VARCHAR(20) NOT NULL CHECK (severity IN ('info', 'warning', 'error', 'critical')),
    status VARCHAR(20) NOT NULL CHECK (status IN ('success', 'failure', 'pending')),
    timestamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    
    -- Actor information
    actor_id VARCHAR(255),
    actor_type VARCHAR(50) CHECK (actor_type IN ('user', 'system', 'service')),
    actor_ip INET,
    user_agent TEXT,
    
    -- Multi-tenancy
    tenant_id UUID,
    
    -- Resource tracking
    resource_id VARCHAR(255),
    resource_type VARCHAR(100),
    action VARCHAR(100) NOT NULL,
    description TEXT NOT NULL,
    details JSONB DEFAULT '{}',
    
    -- Request context
    request_id VARCHAR(100),
    session_id VARCHAR(255),
    
    -- Metadata
    tags TEXT[],
    metadata JSONB DEFAULT '{}'
);

Frontend Integration

// File: kms-frontend/src/components/Audit.tsx
interface AuditEvent {
  id: string;
  type: string;
  severity: 'info' | 'warning' | 'error' | 'critical';
  status: 'success' | 'failure' | 'pending';
  timestamp: string;
  actor_id: string;
  actor_type: string;
  resource_type: string;
  action: string;
  description: string;
}

const Audit: React.FC = () => {
  // Real-time audit log viewing with filtering
  // Timeline view for event sequences
  // Statistics dashboard for audit metrics
  return null; // rendering omitted in this excerpt; a React.FC must return JSX or null
};

Implementation Guidelines

Logging Best Practices

  1. Log all security-relevant events
  2. Include sufficient context for forensic analysis
  3. Use structured logging with consistent fields
  4. Implement log retention policies
  5. Ensure tamper-evident logging
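
One common way to make a log tamper-evident is a hash chain: each record stores a hash computed over its own payload and the previous record's hash, so changing any stored record invalidates every later link. The sketch below illustrates the technique only; ChainedRecord, Append, and Verify are hypothetical names, and nothing here is taken from the KMS audit package.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ChainedRecord is a hypothetical audit record that carries the link to its predecessor.
type ChainedRecord struct {
	Payload  string // serialized event, for example its JSON form
	PrevHash string // Hash of the previous record, empty for the first one
	Hash     string // SHA-256 over PrevHash and Payload
}

// hashRecord derives a record hash from the previous hash and the payload.
func hashRecord(prevHash, payload string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + payload))
	return hex.EncodeToString(sum[:])
}

// Append links a new payload to the end of the chain.
func Append(chain []ChainedRecord, payload string) []ChainedRecord {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, ChainedRecord{Payload: payload, PrevHash: prev, Hash: hashRecord(prev, payload)})
}

// Verify returns the index of the first record whose link is broken, or -1 if the chain is intact.
func Verify(chain []ChainedRecord) int {
	prev := ""
	for i, r := range chain {
		if r.PrevHash != prev || r.Hash != hashRecord(prev, r.Payload) {
			return i
		}
		prev = r.Hash
	}
	return -1
}

func main() {
	var chain []ChainedRecord
	for _, p := range []string{"auth.login user=alice", "app.created id=42", "auth.logout user=alice"} {
		chain = Append(chain, p)
	}
	fmt.Println("intact chain:", Verify(chain))

	chain[1].Payload = "app.deleted id=42" // simulate an edit made directly in storage
	fmt.Println("after tampering:", Verify(chain))
}
```

A chain like this only detects edits if the most recent hash is also stored somewhere the writer of the log cannot modify, for example in a separate system or a periodically signed checkpoint.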

Performance Considerations

  1. Asynchronous logging to avoid blocking operations
  2. Batch inserts for high-volume events
  3. Proper indexing on commonly queried fields
  4. Archival strategy for historical data
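
The first two points can be combined in a small background writer: callers enqueue events on a buffered channel, and a single goroutine groups them into batches that are flushed when the batch is full, when a timer fires, or on shutdown. This is a sketch under stated assumptions; BatchWriter, Event, and the flush callback are hypothetical, and in the real system flush would perform a multi-row INSERT into audit_events.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a trimmed-down stand-in for the AuditEvent struct.
type Event struct{ Type, Action string }

// BatchWriter buffers events and hands them to a flush callback in batches.
type BatchWriter struct {
	events chan Event
	done   chan struct{}
}

// NewBatchWriter starts the background worker. interval must be positive.
func NewBatchWriter(batchSize int, interval time.Duration, flush func([]Event)) *BatchWriter {
	w := &BatchWriter{events: make(chan Event, 1024), done: make(chan struct{})}
	go func() {
		defer close(w.done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()

		batch := make([]Event, 0, batchSize)
		emit := func() {
			if len(batch) > 0 {
				flush(batch)
				batch = make([]Event, 0, batchSize) // fresh slice so flush may keep the old one
			}
		}
		for {
			select {
			case e, ok := <-w.events:
				if !ok { // channel closed: flush what is left and stop
					emit()
					return
				}
				batch = append(batch, e)
				if len(batch) >= batchSize {
					emit()
				}
			case <-ticker.C:
				emit()
			}
		}
	}()
	return w
}

// Log enqueues an event; it blocks only when the buffer is full.
func (w *BatchWriter) Log(e Event) { w.events <- e }

// Close flushes the remaining events and waits for the worker to exit.
func (w *BatchWriter) Close() {
	close(w.events)
	<-w.done
}

func main() {
	w := NewBatchWriter(3, time.Hour, func(batch []Event) {
		fmt.Println("flushing batch of", len(batch))
	})
	for i := 0; i < 7; i++ {
		w.Log(Event{Type: "auth.login", Action: "login"})
	}
	w.Close()
}
```

With a batch size of 3 and a timer that does not fire during the run, the seven events are flushed as batches of 3, 3, and 1. Blocking in Log when the buffer is full is a deliberate choice for audit data: dropping events would be faster but loses records.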

Multi-Tenancy Support

Architecture

The KMS implements a multi-tenant architecture where each tenant has isolated data and permissions while sharing the same application instance.

Database Design

Tenant Model

// File: internal/domain/tenant.go
type Tenant struct {
    ID              uuid.UUID              `json:"id" db:"id"`
    Name            string                 `json:"name" db:"name"`
    Slug            string                 `json:"slug" db:"slug"`
    Status          TenantStatus           `json:"status" db:"status"`
    Settings        TenantSettings         `json:"settings" db:"settings"`
    Metadata        map[string]interface{} `json:"metadata" db:"metadata"`
    CreatedAt       time.Time              `json:"created_at" db:"created_at"`
    UpdatedAt       time.Time              `json:"updated_at" db:"updated_at"`
}

type TenantStatus string

const (
    TenantStatusActive    TenantStatus = "active"
    TenantStatusSuspended TenantStatus = "suspended"
    TenantStatusPending   TenantStatus = "pending"
)

Data Isolation Strategy

-- All tenant-specific tables include tenant_id
ALTER TABLE applications ADD COLUMN tenant_id UUID REFERENCES tenants(id);
ALTER TABLE static_tokens ADD COLUMN tenant_id UUID REFERENCES tenants(id);
ALTER TABLE user_sessions ADD COLUMN tenant_id UUID REFERENCES tenants(id);
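-- audit_events already has tenant_id when created by migration 004, hence IF NOT EXISTS
ALTER TABLE audit_events ADD COLUMN IF NOT EXISTS tenant_id UUID;

-- Row-level security policies
-- Note: superusers always bypass row-level security, and the table owner does too
-- unless FORCE ROW LEVEL SECURITY is set, so the application must connect as kms_user.
ALTER TABLE applications ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON applications 
    FOR ALL TO kms_user 
    USING (tenant_id = current_setting('app.current_tenant')::UUID);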

Implementation Pattern

Tenant Context Middleware

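// File: internal/middleware/tenant.go
func TenantMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        tenantID := extractTenantID(c)
        if tenantID == "" {
            c.AbortWithStatusJSON(http.StatusBadRequest, gin.H{"error": "tenant_required"})
            return
        }

        // Set tenant context
        c.Set("tenant_id", tenantID)

        // set_config(..., true) is transaction-local, so it has to run inside the
        // transaction that executes the tenant-scoped queries. Running it on the
        // pool (*sql.DB) would lose the setting as soon as the statement finishes.
        db := c.MustGet("db").(*sql.DB)
        tx, err := db.BeginTx(c.Request.Context(), nil)
        if err != nil {
            c.AbortWithStatusJSON(http.StatusInternalServerError, gin.H{"error": "tenant_setup_failed"})
            return
        }
        defer tx.Rollback() // no-op once the transaction has been committed

        if _, err := tx.ExecContext(c.Request.Context(),
            "SELECT set_config('app.current_tenant', $1, true)", tenantID); err != nil {
            c.AbortWithStatusJSON(http.StatusInternalServerError, gin.H{"error": "tenant_setup_failed"})
            return
        }

        // Assumption: handlers fetch this transaction via c.MustGet("tx") and run
        // their queries on it; the "tx" key is illustrative.
        c.Set("tx", tx)
        c.Next()

        // Commit only if no handler reported an error; otherwise the deferred Rollback applies.
        if len(c.Errors) == 0 {
            if err := tx.Commit(); err != nil {
                _ = c.Error(err)
            }
        }
    }
}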

Usage Guidelines

  1. Always include tenant_id in database queries
  2. Validate tenant access in middleware
  3. Implement tenant-aware caching
  4. Audit cross-tenant operations
  5. Test tenant isolation thoroughly
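
For guideline 3, the simplest form of tenant-aware caching is to make the tenant ID part of every cache key, so that two tenants can never read each other's entries even when the rest of the key is identical. The helper below is a sketch; TenantCacheKey and the "t:" segment are assumptions made for illustration, not the key layout used by the codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// TenantCacheKey builds "<prefix>t:<tenantID>:<suffix>".
// It refuses to build a key without a tenant ID, so a missing tenant context
// fails loudly instead of producing a key shared by all tenants.
func TenantCacheKey(prefix, tenantID, suffix string) (string, error) {
	if tenantID == "" {
		return "", errors.New("tenant ID is required for tenant-scoped cache keys")
	}
	return prefix + "t:" + tenantID + ":" + suffix, nil
}

func main() {
	key, err := TenantCacheKey("perm:", "3f2c9a", "user42:app.read")
	fmt.Println(key, err)

	_, err = TenantCacheKey("perm:", "", "user42:app.read")
	fmt.Println(err)
}
```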

Cache Implementation Details

Architecture

The KMS implements a layered caching system with multiple providers and configurable TTL policies.

Cache Interface

// File: internal/cache/cache.go
type CacheManager interface {
    Get(ctx context.Context, key string) ([]byte, error)
    Set(ctx context.Context, key string, value []byte, ttl time.Duration) error
    GetJSON(ctx context.Context, key string, dest interface{}) error
    SetJSON(ctx context.Context, key string, value interface{}, ttl time.Duration) error
    Delete(ctx context.Context, key string) error
    Clear(ctx context.Context) error
    Exists(ctx context.Context, key string) (bool, error)
}

Redis Implementation

// File: internal/cache/redis.go
type RedisCacheManager struct {
    client     *redis.Client
    keyPrefix  string
    serializer JSONSerializer
    logger     *zap.Logger
}

func (r *RedisCacheManager) GetJSON(ctx context.Context, key string, dest interface{}) error {
    prefixedKey := r.keyPrefix + key
    data, err := r.client.Get(ctx, prefixedKey).Bytes()
    if err != nil {
        if errors.Is(err, redis.Nil) {
            return ErrCacheMiss
        }
        return fmt.Errorf("failed to get key %s: %w", prefixedKey, err)
    }
    
    return r.serializer.Deserialize(data, dest)
}

Cache Key Management

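const (
    KeyPrefixAuth       = "auth:"
    KeyPrefixToken      = "token:"
    KeyPrefixPermission = "perm:"
    KeyPrefixSession    = "sess:"
    KeyPrefixApp        = "app:"
)

// CacheKey joins a key prefix and a suffix, e.g. CacheKey(KeyPrefixAuth, "u1:app1") yields "auth:u1:app1".
// (A type named CacheKey cannot coexist with this function in the same package.)
func CacheKey(prefix, suffix string) string {
    return prefix + suffix
}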

Usage Patterns

Authentication Caching

// Cache authentication results for 5 minutes
cacheKey := cache.CacheKey(cache.KeyPrefixAuth, fmt.Sprintf("%s:%s", userID, appID))
err := cacheManager.SetJSON(ctx, cacheKey, authResult, 5*time.Minute)

Token Revocation List

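// Cache revoked tokens until their expiration
revokedKey := cache.CacheKey(cache.KeyPrefixToken, "revoked:"+tokenID)
// Skip tokens that have already expired. If Set passes the TTL straight to
// go-redis, a non-positive TTL stores the key without any expiry.
if ttl := time.Until(tokenExpiry); ttl > 0 {
    if err := cacheManager.Set(ctx, revokedKey, []byte("1"), ttl); err != nil {
        return err
    }
}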

Configuration

# Cache configuration
CACHE_ENABLED=true
CACHE_PROVIDER=redis  # or memory
REDIS_ADDR=localhost:6379
REDIS_PASSWORD=
REDIS_DB=0
CACHE_DEFAULT_TTL=5m
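
The configuration above lists memory as an alternative provider. As an illustration of what such a provider could look like, here is a minimal in-process cache whose methods follow the CacheManager signatures. It is a sketch, not the project's implementation: MemoryCache is a hypothetical name, Clear is omitted for brevity, and expired entries are only hidden on read rather than evicted.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrCacheMiss mirrors the sentinel returned by the Redis implementation above.
var ErrCacheMiss = errors.New("cache miss")

type entry struct {
	value     []byte
	expiresAt time.Time // zero value means the entry never expires
}

// MemoryCache is a hypothetical in-process provider guarded by a RWMutex.
type MemoryCache struct {
	mu    sync.RWMutex
	items map[string]entry
}

func NewMemoryCache() *MemoryCache {
	return &MemoryCache{items: make(map[string]entry)}
}

func (m *MemoryCache) Get(_ context.Context, key string) ([]byte, error) {
	m.mu.RLock()
	e, ok := m.items[key]
	m.mu.RUnlock()
	if !ok || (!e.expiresAt.IsZero() && !time.Now().Before(e.expiresAt)) {
		return nil, ErrCacheMiss
	}
	return append([]byte(nil), e.value...), nil // copy so callers cannot mutate the cache
}

func (m *MemoryCache) Set(_ context.Context, key string, value []byte, ttl time.Duration) error {
	e := entry{value: append([]byte(nil), value...)}
	if ttl > 0 { // in this sketch a non-positive TTL means "no expiry"
		e.expiresAt = time.Now().Add(ttl)
	}
	m.mu.Lock()
	m.items[key] = e
	m.mu.Unlock()
	return nil
}

func (m *MemoryCache) GetJSON(ctx context.Context, key string, dest interface{}) error {
	data, err := m.Get(ctx, key)
	if err != nil {
		return err
	}
	return json.Unmarshal(data, dest)
}

func (m *MemoryCache) SetJSON(ctx context.Context, key string, value interface{}, ttl time.Duration) error {
	data, err := json.Marshal(value)
	if err != nil {
		return fmt.Errorf("failed to serialize key %s: %w", key, err)
	}
	return m.Set(ctx, key, data, ttl)
}

func (m *MemoryCache) Delete(_ context.Context, key string) error {
	m.mu.Lock()
	delete(m.items, key)
	m.mu.Unlock()
	return nil
}

func (m *MemoryCache) Exists(ctx context.Context, key string) (bool, error) {
	_, err := m.Get(ctx, key)
	if errors.Is(err, ErrCacheMiss) {
		return false, nil
	}
	return err == nil, err
}

func main() {
	ctx := context.Background()
	c := NewMemoryCache()
	_ = c.SetJSON(ctx, "auth:u1:app1", map[string]bool{"allowed": true}, 5*time.Minute)

	var result map[string]bool
	err := c.GetJSON(ctx, "auth:u1:app1", &result)
	fmt.Println(result["allowed"], err)

	_, err = c.Get(ctx, "auth:missing")
	fmt.Println(errors.Is(err, ErrCacheMiss))
}
```

An in-process cache like this is not shared between application instances, so it is unsuitable for the token revocation list in a multi-instance deployment; a revocation written on one instance would not be visible on the others.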

Error Handling Framework

Error Type Hierarchy

// File: internal/errors/errors.go
type ErrorCode string

const (
    ErrorCodeValidation      ErrorCode = "validation_error"
    ErrorCodeAuthentication  ErrorCode = "authentication_error"
    ErrorCodeAuthorization   ErrorCode = "authorization_error"
    ErrorCodeNotFound        ErrorCode = "not_found"
    ErrorCodeConflict        ErrorCode = "conflict"
    ErrorCodeInternal        ErrorCode = "internal_error"
    ErrorCodeRateLimit       ErrorCode = "rate_limit_exceeded"
    ErrorCodeBadRequest      ErrorCode = "bad_request"
)

type APIError struct {
    Code       ErrorCode   `json:"code"`
    Message    string      `json:"message"`
    Details    interface{} `json:"details,omitempty"`
    HTTPStatus int         `json:"-"`
    Cause      error       `json:"-"`
}

Error Factory Functions

func NewValidationError(message string, details interface{}) *APIError {
    return &APIError{
        Code:       ErrorCodeValidation,
        Message:    message,
        Details:    details,
        HTTPStatus: http.StatusBadRequest,
    }
}

func NewAuthenticationError(message string) *APIError {
    return &APIError{
        Code:       ErrorCodeAuthentication,
        Message:    message,
        HTTPStatus: http.StatusUnauthorized,
    }
}

Error Handler Middleware

// File: internal/errors/secure_responses.go
func (e *ErrorHandler) HandleError(c *gin.Context, err error) {
    var apiErr *APIError
    if errors.As(err, &apiErr) {
        // Log error with context
        e.logger.Error("API error",
            zap.String("error_code", string(apiErr.Code)),
            zap.String("message", apiErr.Message),
            zap.Int("http_status", apiErr.HTTPStatus),
            zap.Error(apiErr.Cause))
        
        c.JSON(apiErr.HTTPStatus, gin.H{
            "error":   apiErr.Code,
            "message": apiErr.Message,
            "details": apiErr.Details,
        })
        return
    }
    
    // Handle unexpected errors
    e.logger.Error("Unexpected error", zap.Error(err))
    c.JSON(http.StatusInternalServerError, gin.H{
        "error":   ErrorCodeInternal,
        "message": "An internal error occurred",
    })
}
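
The errors.As call in HandleError only compiles and matches if *APIError implements the error interface, and the Cause field is only reachable through errors.Is and errors.As if the type also has an Unwrap method. Neither method appears in the excerpt above, so the sketch below shows one way they could look. The struct is trimmed, and NewNotFoundError, StatusFor, wrap, and errRowMissing are hypothetical names used only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

type ErrorCode string

const ErrorCodeNotFound ErrorCode = "not_found"

// APIError is a trimmed copy of the struct above (Details omitted).
type APIError struct {
	Code       ErrorCode
	Message    string
	HTTPStatus int
	Cause      error
}

// Error makes *APIError satisfy the error interface.
func (e *APIError) Error() string {
	if e.Cause != nil {
		return fmt.Sprintf("%s: %s: %v", e.Code, e.Message, e.Cause)
	}
	return fmt.Sprintf("%s: %s", e.Code, e.Message)
}

// Unwrap exposes Cause to errors.Is and errors.As.
func (e *APIError) Unwrap() error { return e.Cause }

// errRowMissing stands in for a lower-level storage error.
var errRowMissing = errors.New("row missing")

// NewNotFoundError follows the factory pattern shown earlier.
func NewNotFoundError(message string, cause error) *APIError {
	return &APIError{Code: ErrorCodeNotFound, Message: message, HTTPStatus: http.StatusNotFound, Cause: cause}
}

// wrap simulates a handler adding context on the way up the call stack.
func wrap(err error) error { return fmt.Errorf("get application: %w", err) }

// StatusFor picks the HTTP status the way HandleError does: the APIError's
// status if one is found anywhere in the chain, otherwise 500.
func StatusFor(err error) int {
	var apiErr *APIError
	if errors.As(err, &apiErr) {
		return apiErr.HTTPStatus
	}
	return http.StatusInternalServerError
}

func main() {
	err := wrap(NewNotFoundError("application not found", errRowMissing))
	fmt.Println(err)
	fmt.Println("status:", StatusFor(err))
	fmt.Println("caused by missing row:", errors.Is(err, errRowMissing))
	fmt.Println("status for plain error:", StatusFor(errRowMissing))
}
```

Keeping Cause out of the JSON response (as the `json:"-"` tag above does) while still logging it is what lets the handler return a safe message to the client without losing the underlying error.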

Validation System

Validator Implementation

// File: internal/validation/validator.go
type Validator struct {
    validator *validator.Validate
    logger    *zap.Logger
}

func NewValidator(logger *zap.Logger) *Validator {
    v := validator.New()
    
    // Register custom validators
    v.RegisterValidation("app_id", validateAppID)
    v.RegisterValidation("token_type", validateTokenType)
    v.RegisterValidation("permission_scope", validatePermissionScope)
    
    return &Validator{
        validator: v,
        logger:    logger,
    }
}

Custom Validation Rules

// Compile the patterns once at package level instead of on every call.
var (
    appIDPattern           = regexp.MustCompile(`^[a-z0-9]+(\.[a-z0-9]+)*\.[a-z0-9]+$`)
    permissionScopePattern = regexp.MustCompile(`^[a-z_]+(\.[a-z_]+)*$`)
)

func validateAppID(fl validator.FieldLevel) bool {
    appID := fl.Field().String()
    // App ID format: domain.app (e.g., com.example.app)
    return len(appID) >= 3 && len(appID) <= 100 && appIDPattern.MatchString(appID)
}

func validatePermissionScope(fl validator.FieldLevel) bool {
    scope := fl.Field().String()
    // Permission format: domain.action (e.g., app.read)
    return len(scope) >= 1 && len(scope) <= 50 && permissionScopePattern.MatchString(scope)
}
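
To make the two formats concrete, the sketch below applies the same regular expressions and length limits to a few sample inputs without the validator library. IsValidAppID and IsValidPermissionScope are hypothetical stand-alone helpers written for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same patterns as the custom validators above.
var (
	appIDPattern = regexp.MustCompile(`^[a-z0-9]+(\.[a-z0-9]+)*\.[a-z0-9]+$`)
	scopePattern = regexp.MustCompile(`^[a-z_]+(\.[a-z_]+)*$`)
)

// IsValidAppID requires at least two dot-separated segments of lowercase letters and digits.
func IsValidAppID(s string) bool {
	return len(s) >= 3 && len(s) <= 100 && appIDPattern.MatchString(s)
}

// IsValidPermissionScope accepts one or more dot-separated segments of lowercase letters and underscores.
func IsValidPermissionScope(s string) bool {
	return len(s) >= 1 && len(s) <= 50 && scopePattern.MatchString(s)
}

func main() {
	for _, id := range []string{"com.example.app", "a.b", "example", "Com.Example.App", "my-app.example"} {
		fmt.Printf("app ID %-18q valid=%v\n", id, IsValidAppID(id))
	}
	for _, scope := range []string{"app.read", "audit_log.export", "app", "app.read2", "app..read"} {
		fmt.Printf("scope  %-18q valid=%v\n", scope, IsValidPermissionScope(scope))
	}
}
```

Two consequences of the patterns are worth knowing: app IDs cannot contain hyphens or uppercase letters, and a permission scope with a single segment such as "app" passes even though the comment describes the format as domain.action.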

Middleware Integration

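// File: internal/middleware/validation.go
// Note: schema is shared by every request that passes through this middleware,
// so concurrent requests bind into the same value. Pass a fresh instance per
// request (for example via a factory function) if handlers read the bound data.
func (v *ValidationMiddleware) ValidateJSON(schema interface{}) gin.HandlerFunc {
    return func(c *gin.Context) {
        if err := c.ShouldBindJSON(schema); err != nil {
            var validationErrors []ValidationError
            
            if errs, ok := err.(validator.ValidationErrors); ok {
                for _, e := range errs {
                    validationErrors = append(validationErrors, ValidationError{
                        Field:   e.Field(),
                        Message: e.Tag(),
                        Value:   e.Value(),
                    })
                }
            }
            
            apiErr := errors.NewValidationError("Request validation failed", validationErrors)
            v.errorHandler.HandleError(c, apiErr)
            c.Abort() // stop the handler chain; returning alone does not prevent later handlers from running
            return
        }
        
        c.Next()
    }
}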

Metrics and Monitoring

Prometheus Integration

// File: internal/metrics/metrics.go
type Metrics struct {
    // HTTP metrics
    httpRequestsTotal     *prometheus.CounterVec
    httpRequestDuration   *prometheus.HistogramVec
    httpRequestsInFlight  prometheus.Gauge
    
    // Auth metrics
    authAttemptsTotal     *prometheus.CounterVec
    authSuccessTotal      *prometheus.CounterVec
    authFailuresTotal     *prometheus.CounterVec
    
    // Token metrics
    tokensIssuedTotal     *prometheus.CounterVec
    tokenValidationsTotal *prometheus.CounterVec
    
    // Business metrics
    applicationsTotal     prometheus.Gauge
    activeSessionsTotal   prometheus.Gauge
}

Metrics Collection

func (m *Metrics) RecordHTTPRequest(method, path string, statusCode int, duration time.Duration) {
    m.httpRequestsTotal.WithLabelValues(method, path, strconv.Itoa(statusCode)).Inc()
    m.httpRequestDuration.WithLabelValues(method, path).Observe(duration.Seconds())
}

func (m *Metrics) RecordAuthAttempt(provider, result string) {
    m.authAttemptsTotal.WithLabelValues(provider, result).Inc()
    if result == "success" {
        m.authSuccessTotal.WithLabelValues(provider).Inc()
    } else {
        m.authFailuresTotal.WithLabelValues(provider).Inc()
    }
}

Dashboard Configuration

# Grafana dashboard config
panels:
  - title: "Request Rate"
    type: "graph"
    targets:
      - expr: "rate(http_requests_total[5m])"
        legendFormat: "{{method}} {{path}}"
  
  - title: "Authentication Success Rate"
    type: "stat"
    targets:
      - expr: "sum(rate(auth_success_total[5m])) / sum(rate(auth_attempts_total[5m])) * 100"
        legendFormat: "Success Rate %"
  
  - title: "Active Applications"
    type: "stat"
    targets:
      - expr: "applications_total"
        legendFormat: "Applications"

Database Migration System

Migration Structure

migrations/
├── 001_initial_schema.up.sql
├── 001_initial_schema.down.sql
├── 002_user_sessions.up.sql
├── 002_user_sessions.down.sql
├── 003_add_token_prefix.up.sql
├── 003_add_token_prefix.down.sql
├── 004_add_audit_events.up.sql
└── 004_add_audit_events.down.sql

Migration Runner

// File: internal/database/postgres.go
func RunMigrations(db *sql.DB, migrationPath string) error {
    driver, err := postgres.WithInstance(db, &postgres.Config{})
    if err != nil {
        return fmt.Errorf("failed to create migration driver: %w", err)
    }
    
    m, err := migrate.NewWithDatabaseInstance(
        fmt.Sprintf("file://%s", migrationPath),
        "postgres", driver)
    if err != nil {
        return fmt.Errorf("failed to create migration instance: %w", err)
    }
    
    if err := m.Up(); err != nil && !errors.Is(err, migrate.ErrNoChange) {
        return fmt.Errorf("failed to run migrations: %w", err)
    }
    
    return nil
}

Migration Best Practices

  1. Always create both up and down migrations
  2. Test migrations on a copy of production data
  3. Make migrations idempotent
  4. Add proper indexes for performance
  5. Include rollback procedures
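
The first practice can be checked automatically, for example in CI. The sketch below takes a list of file names that follow the NNN_name.up.sql / NNN_name.down.sql convention shown earlier and reports every migration that lacks its counterpart. MissingCounterparts is a hypothetical helper, not part of the migration runner.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// migrationName matches "<version>_<name>.<up|down>.sql".
var migrationName = regexp.MustCompile(`^(\d+)_(.+)\.(up|down)\.sql$`)

// MissingCounterparts returns the "<version>_<name>" of every migration that
// has only an up file or only a down file. Non-migration files are ignored.
func MissingCounterparts(files []string) []string {
	seen := map[string]map[string]bool{}
	for _, f := range files {
		m := migrationName.FindStringSubmatch(f)
		if m == nil {
			continue
		}
		key := m[1] + "_" + m[2]
		if seen[key] == nil {
			seen[key] = map[string]bool{}
		}
		seen[key][m[3]] = true
	}

	var missing []string
	for key, directions := range seen {
		if !directions["up"] || !directions["down"] {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing) // map iteration order is random, so sort for stable output
	return missing
}

func main() {
	files := []string{
		"001_initial_schema.up.sql",
		"001_initial_schema.down.sql",
		"004_add_audit_events.up.sql", // down file deliberately left out
		"README.md",
	}
	fmt.Println(MissingCounterparts(files))
}
```

In practice the file list would come from reading the migrations directory, for instance with os.ReadDir, and a non-empty result would fail the build.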

Example Migration

-- 005_add_oauth_providers.up.sql
CREATE TABLE IF NOT EXISTS oauth_providers (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    name VARCHAR(100) NOT NULL UNIQUE,
    client_id VARCHAR(255) NOT NULL,
    client_secret_encrypted TEXT NOT NULL,
    authorization_url TEXT NOT NULL,
    token_url TEXT NOT NULL,
    user_info_url TEXT NOT NULL,
    scopes TEXT[] DEFAULT ARRAY['openid', 'profile', 'email'],
    enabled BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

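-- IF NOT EXISTS keeps the index statements idempotent, like the CREATE TABLE above.
-- The UNIQUE constraint on name already creates an index, so the first index is redundant.
CREATE INDEX IF NOT EXISTS idx_oauth_providers_name ON oauth_providers(name);
CREATE INDEX IF NOT EXISTS idx_oauth_providers_enabled ON oauth_providers(enabled) WHERE enabled = true;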

Frontend Architecture

Component Structure

src/
├── components/
│   ├── Applications.tsx      # Application management
│   ├── Tokens.tsx           # Token operations  
│   ├── Users.tsx            # User management
│   ├── Audit.tsx            # Audit log viewer
│   ├── Dashboard.tsx        # Main dashboard
│   ├── Login.tsx            # Authentication
│   ├── TokenTester.tsx      # Token testing utility
│   └── TokenTesterCallback.tsx
├── contexts/
│   └── AuthContext.tsx      # Authentication state
├── services/
│   └── apiService.ts        # API client
├── App.tsx                  # Main application
└── index.tsx                # Entry point

API Service Implementation

// File: kms-frontend/src/services/apiService.ts
class APIService {
  private baseURL: string;
  private token: string | null = null;

  constructor(baseURL: string) {
    this.baseURL = baseURL;
  }

  async request<T>(endpoint: string, options: RequestInit = {}): Promise<T> {
    const url = `${this.baseURL}${endpoint}`;
    const headers = {
      'Content-Type': 'application/json',
      'X-User-Email': this.getUserEmail(),
      ...options.headers,
    };

    const response = await fetch(url, {
      ...options,
      headers,
    });

    if (!response.ok) {
      const error = await response.json().catch(() => ({}));
      throw new APIError(error.message || 'Request failed', response.status);
    }

    return response.json();
  }

  // Application management
  async getApplications(): Promise<Application[]> {
    return this.request<Application[]>('/api/applications');
  }

  // Audit log access
  async getAuditEvents(params: AuditQueryParams): Promise<AuditEvent[]> {
    const queryString = new URLSearchParams(params).toString();
    return this.request<AuditEvent[]>(`/api/audit/events?${queryString}`);
  }
}

Authentication Context

// File: kms-frontend/src/contexts/AuthContext.tsx
interface AuthContextType {
  user: User | null;
  login: (email: string) => Promise<void>;
  logout: () => void;
  isAuthenticated: boolean;
  isLoading: boolean;
}

export const AuthContext = React.createContext<AuthContextType | null>(null);

export const AuthProvider: React.FC<{ children: React.ReactNode }> = ({ children }) => {
  const [user, setUser] = useState<User | null>(null);
  const [isLoading, setIsLoading] = useState(true);

  const login = async (email: string) => {
    setIsLoading(true);
    try {
      const response = await apiService.login(email);
      setUser({ email, token: response.token });
      localStorage.setItem('kms_user', JSON.stringify({ email }));
    } finally {
      // Errors propagate to the caller unchanged; only the loading flag is reset here
      setIsLoading(false);
    }
  };

  // ... rest of implementation
};

Configuration Management

Configuration Interface

// File: internal/config/config.go
type ConfigProvider interface {
    GetString(key string) string
    GetInt(key string) int
    GetBool(key string) bool
    GetDuration(key string) time.Duration
    GetStringSlice(key string) []string
    IsSet(key string) bool
    Validate() error
    GetDatabaseDSN() string
    GetServerAddress() string
    IsDevelopment() bool
    IsProduction() bool
}
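
As an illustration of the typed getters, the sketch below implements a subset of the interface on top of a lookup function, so the same code can read from the process environment in production and from a map in tests. EnvConfig, NewEnvConfig, and NewMapConfig are hypothetical names, and two behaviors are assumptions of this sketch rather than documented behavior: unparsable values fall back to the zero value, and IsSet treats an empty string as unset.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// EnvConfig resolves keys through a lookup function with the shape of os.LookupEnv.
type EnvConfig struct {
	lookup func(string) (string, bool)
}

// NewEnvConfig reads from the process environment.
func NewEnvConfig() *EnvConfig { return &EnvConfig{lookup: os.LookupEnv} }

// NewMapConfig reads from a fixed map, which is convenient in tests.
func NewMapConfig(values map[string]string) *EnvConfig {
	return &EnvConfig{lookup: func(key string) (string, bool) {
		v, ok := values[key]
		return v, ok
	}}
}

// IsSet reports whether the key is present and non-empty (an assumption of this sketch).
func (c *EnvConfig) IsSet(key string) bool {
	v, ok := c.lookup(key)
	return ok && v != ""
}

func (c *EnvConfig) GetString(key string) string {
	v, _ := c.lookup(key)
	return v
}

// The typed getters return the zero value when the key is missing or malformed.
func (c *EnvConfig) GetInt(key string) int {
	n, _ := strconv.Atoi(c.GetString(key))
	return n
}

func (c *EnvConfig) GetBool(key string) bool {
	b, _ := strconv.ParseBool(c.GetString(key))
	return b
}

func (c *EnvConfig) GetDuration(key string) time.Duration {
	d, _ := time.ParseDuration(c.GetString(key))
	return d
}

func main() {
	cfg := NewMapConfig(map[string]string{
		"DB_PORT":           "5432",
		"CACHE_ENABLED":     "true",
		"CACHE_DEFAULT_TTL": "5m",
		"REDIS_PASSWORD":    "",
	})
	fmt.Println(cfg.GetInt("DB_PORT"), cfg.GetBool("CACHE_ENABLED"), cfg.GetDuration("CACHE_DEFAULT_TTL"))
	fmt.Println("REDIS_PASSWORD set:", cfg.IsSet("REDIS_PASSWORD"))
}
```

Silently falling back to zero values keeps the getters simple, which is why a separate Validate step that checks required keys at startup is important.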

Configuration Validation

func (c *Config) Validate() error {
    // Named problems rather than errors to avoid shadowing the errors package
    var problems []string
    
    // Required configuration
    required := []string{
        "INTERNAL_HMAC_KEY",
        "JWT_SECRET", 
        "AUTH_SIGNING_KEY",
        "DB_HOST",
        "DB_NAME",
    }
    
    for _, key := range required {
        if !c.IsSet(key) {
            problems = append(problems, fmt.Sprintf("required configuration %s is not set", key))
        }
    }
    
    // Validate key lengths
    if len(c.GetString("INTERNAL_HMAC_KEY")) < 32 {
        problems = append(problems, "INTERNAL_HMAC_KEY must be at least 32 characters")
    }
    
    if len(problems) > 0 {
        return fmt.Errorf("configuration validation failed: %s", strings.Join(problems, ", "))
    }
    
    return nil
}

Environment Configuration

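# Security Configuration
# Example values only: generate your own (for example with `openssl rand -hex 32`) and never commit real secrets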
INTERNAL_HMAC_KEY=3924f352b7ea63b27db02bf4b0014f2961a5d2f7c27643853a4581bb3a5457cb
JWT_SECRET=7f5e11d55e957988b00ce002418680af384219ef98c50d08cbbbdd541978450c
AUTH_SIGNING_KEY=484f921b39c383e6b3e0cc5a7cef3c2cec3d7c8d474ab5102891dc4c2bf63a68

# Database Configuration  
DB_HOST=postgres
DB_PORT=5432
DB_NAME=kms
DB_USER=postgres
DB_PASSWORD=postgres

# Feature Flags
RATE_LIMIT_ENABLED=true
CACHE_ENABLED=false
METRICS_ENABLED=true
SAML_ENABLED=false

Implementation Best Practices

Code Organization

  1. Follow clean architecture principles
  2. Use dependency injection throughout
  3. Implement comprehensive error handling
  4. Add structured logging to all components
  5. Write unit tests for business logic

Security Guidelines

  1. Always validate input at API boundaries
  2. Use parameterized database queries
  3. Implement proper authentication and authorization
  4. Log all security-relevant events
  5. Follow principle of least privilege

Performance Considerations

  1. Implement caching for frequently accessed data
  2. Use database indexes appropriately
  3. Monitor and optimize slow queries
  4. Implement proper connection pooling
  5. Use asynchronous operations where beneficial

Testing Strategy

  1. Unit tests for business logic
  2. Integration tests for API endpoints
  3. End-to-end tests for critical workflows
  4. Load testing for performance validation
  5. Security testing for vulnerability assessment

Update this guide as the system evolves and new features are added.