skybridge/docs/DEPLOYMENT_GUIDE.md
2025-08-24 10:09:58 -04:00


KMS Deployment Guide

Table of Contents

  1. Deployment Overview
  2. Container Architecture
  3. Environment Configuration
  4. Docker Compose Setup
  5. Production Deployment
  6. Monitoring and Health Checks
  7. Security Configuration
  8. Backup and Recovery
  9. Troubleshooting

Deployment Overview

The KMS is designed as a containerized application using Docker and Docker Compose for orchestration. The system consists of multiple services working together to provide secure API key management capabilities.

Service Architecture

  • kms-nginx: Reverse proxy and load balancer
  • kms-api: Go backend API service
  • kms-frontend: React TypeScript frontend
  • kms-postgres: PostgreSQL database
  • kms-redis: Redis cache (optional)

Deployment Modes

  • Development: Docker Compose with hot reload
  • Testing: Isolated container environment
  • Staging: Production-like with monitoring
  • Production: High availability with load balancing

Container Architecture

graph TB
    subgraph "External Network"
        Internet[Internet]
        CDN[Content Delivery Network<br/>Static Assets]
        DNS[DNS<br/>Load Balancing]
    end
    
    subgraph "Load Balancer Tier"
        LB1[Load Balancer 1<br/>HAProxy/Nginx]
        LB2[Load Balancer 2<br/>HAProxy/Nginx]
        VIP[Virtual IP<br/>Failover]
    end
    
    subgraph "Container Orchestration - Docker Compose"
        subgraph "Reverse Proxy"
            Nginx1[Nginx Container<br/>kms-nginx<br/>Port 8081:80]
        end
        
        subgraph "API Tier"
            API1[API Service Container<br/>kms-api<br/>Port 8080:8080]
            Metrics1[Metrics Endpoint<br/>Port 9090:9090<br/>Prometheus]
        end
        
        subgraph "Frontend Tier"
            Frontend1[React SPA Container<br/>kms-frontend<br/>Port 3000:80]
            StaticAssets[Static Assets<br/>Nginx served]
        end
        
        subgraph "Network"
            InternalNet[kms-network<br/>Bridge Network<br/>Internal Communication]
        end
    end
    
    subgraph "Data Tier"
        subgraph "Primary Database"
            PostgreSQL1[PostgreSQL 15<br/>kms-postgres<br/>Port 5432:5432]
            DBData[Persistent Volume<br/>postgres_data]
        end
        
        subgraph "Cache Layer"
            Redis1[Redis Cache<br/>Optional<br/>Session Store]
        end
        
        subgraph "Migration System"
            Migrations[Database Migrations<br/>Auto-run on startup<br/>Volume mounted]
        end
    end
    
    subgraph "Monitoring & Observability"
        subgraph "Metrics Collection"
            Prometheus[Prometheus<br/>Metrics Scraping]
            Grafana[Grafana<br/>Dashboards]
        end
        
        subgraph "Logging"
            LogAggregator[Log Aggregation<br/>ELK Stack / Loki]
            StructuredLogs[Structured Logging<br/>JSON format]
        end
        
        subgraph "Health Monitoring"
            HealthCheck[Health Checks<br/>/health endpoint]
            Alerting[Alerting<br/>PagerDuty/Slack]
        end
    end
    
    subgraph "Security & Compliance"
        subgraph "Certificate Management"
            TLS[TLS Certificates<br/>Let's Encrypt/Manual]
            CertRotation[Certificate Rotation<br/>Automated]
        end
        
        subgraph "Secrets Management"
            EnvVars[Environment Variables<br/>Container secrets]
            VaultIntegration[Vault Integration<br/>Secret rotation]
        end
        
        subgraph "Backup & Recovery"
            DBBackup[Database Backups<br/>Automated daily]
            DisasterRecovery[Disaster Recovery<br/>Multi-region]
        end
    end
    
    subgraph "Development Environment"
        LocalDev[Local Development<br/>docker-compose.yml]
        TestEnv[Test Environment<br/>Isolated containers]
        CICDPipeline[CI/CD Pipeline<br/>GitHub Actions]
    end
    
    %% External Connections
    Internet --> DNS
    DNS --> VIP
    CDN --> Frontend1
    
    %% Load Balancer Configuration
    VIP --> LB1
    VIP --> LB2
    LB1 --> Nginx1
    LB2 --> Nginx1
    
    %% Container Communication
    Nginx1 --> API1
    Nginx1 --> Frontend1
    API1 --> PostgreSQL1
    API1 --> Redis1
    API1 --> Metrics1
    
    %% Data Flow
    PostgreSQL1 --> DBData
    API1 --> Migrations
    Migrations --> PostgreSQL1
    
    %% Network Isolation
    InternalNet --> API1
    InternalNet --> Frontend1
    InternalNet --> PostgreSQL1
    InternalNet --> Redis1
    InternalNet --> Nginx1
    
    %% Monitoring Connections
    API1 --> Prometheus
    Prometheus --> Grafana
    API1 --> LogAggregator
    LogAggregator --> StructuredLogs
    HealthCheck --> API1
    HealthCheck --> PostgreSQL1
    HealthCheck --> Alerting
    
    %% Security Connections
    TLS --> Nginx1
    CertRotation --> TLS
    EnvVars --> API1
    VaultIntegration --> EnvVars
    DBBackup --> PostgreSQL1
    DisasterRecovery --> DBBackup
    
    %% Development Flow
    CICDPipeline --> LocalDev
    LocalDev --> TestEnv
    TestEnv --> API1
    
    %% Styling
    classDef external fill:#ffecb3,stroke:#f57c00,stroke-width:2px
    classDef loadbalancer fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px
    classDef container fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    classDef data fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef monitoring fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    classDef security fill:#ffcdd2,stroke:#d32f2f,stroke-width:2px
    classDef development fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    
    class Internet,CDN,DNS external
    class LB1,LB2,VIP loadbalancer
    class Nginx1,API1,Frontend1,Metrics1,InternalNet,StaticAssets container
    class PostgreSQL1,Redis1,DBData,Migrations data
    class Prometheus,Grafana,LogAggregator,StructuredLogs,HealthCheck,Alerting monitoring
    class TLS,CertRotation,EnvVars,VaultIntegration,DBBackup,DisasterRecovery security
    class LocalDev,TestEnv,CICDPipeline development

Environment Configuration

Environment Variables

Database Configuration

# PostgreSQL Database
DB_HOST=kms-postgres
DB_PORT=5432
DB_NAME=kms
DB_USER=postgres
DB_PASSWORD=secure_password_here
DB_MAX_OPEN_CONNECTIONS=25
DB_MAX_IDLE_CONNECTIONS=5
DB_CONNECTION_MAX_LIFETIME=300s
DB_CONNECTION_MAX_IDLE_TIME=60s

Server Configuration

# API Server
SERVER_HOST=0.0.0.0
SERVER_PORT=8080
SERVER_READ_TIMEOUT=30s
SERVER_WRITE_TIMEOUT=30s
SERVER_IDLE_TIMEOUT=120s

# Environment
ENVIRONMENT=production
LOG_LEVEL=info
DEBUG_MODE=false

Authentication Configuration

# Authentication Provider
AUTH_PROVIDER=header
AUTH_HEADER_USER_EMAIL=X-User-Email
AUTH_SIGNING_KEY=your_hmac_signing_key_here

# JWT Configuration
JWT_SECRET=your_jwt_secret_here
JWT_ISSUER=kms-api-service
JWT_EXPIRY=1h

# OAuth2 Configuration (optional)
OAUTH2_CLIENT_ID=your_oauth2_client_id
OAUTH2_CLIENT_SECRET=your_oauth2_client_secret
OAUTH2_REDIRECT_URL=https://your-domain.com/api/oauth2/callback
OAUTH2_PROVIDER_URL=https://oauth-provider.com

# SAML Configuration (optional)
SAML_IDP_URL=https://saml-provider.com/sso
SAML_CERT_PATH=/etc/ssl/certs/saml.crt
SAML_KEY_PATH=/etc/ssl/private/saml.key

Security Configuration

# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_RPS=100
RATE_LIMIT_BURST=200
AUTH_RATE_LIMIT_RPS=5
AUTH_RATE_LIMIT_BURST=10

# Security Settings
CSRF_TOKEN_MAX_AGE=1h
MAX_AUTH_FAILURES=5
IP_BLOCK_DURATION=1h
AUTH_FAILURE_WINDOW=15m
REQUEST_MAX_AGE=5m

# CORS Settings
CORS_ALLOWED_ORIGINS=https://your-frontend-domain.com,https://admin.your-domain.com
CORS_ALLOWED_METHODS=GET,POST,PUT,DELETE,OPTIONS
CORS_ALLOWED_HEADERS=Origin,Content-Type,Accept,Authorization,X-User-Email,X-CSRF-Token

Monitoring Configuration

# Metrics
METRICS_ENABLED=true
METRICS_PORT=9090
METRICS_PATH=/metrics

# Health Checks
HEALTH_CHECK_ENABLED=true
HEALTH_CHECK_INTERVAL=30s

Configuration Management

Environment Files

# Development environment
.env.development

# Testing environment  
.env.test

# Staging environment
.env.staging

# Production environment
.env.production

Docker Environment

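# docker-compose.override.yml for local development
# (hot reload assumes an image with the Go toolchain and air installed;
# the production stage of the Dockerfile below contains neither)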
version: '3.8'
services:
  kms-api:
    environment:
      - LOG_LEVEL=debug
      - DEBUG_MODE=true
    volumes:
      - ./:/app
    command: air -c .air.toml  # Hot reload

Docker Compose Setup

Development Configuration

docker-compose.yml

version: '3.8'

services:
  # PostgreSQL Database
  kms-postgres:
    image: postgres:15-alpine
    container_name: kms-postgres
    environment:
      POSTGRES_DB: ${DB_NAME:-kms}
      POSTGRES_USER: ${DB_USER:-postgres}
      POSTGRES_PASSWORD: ${DB_PASSWORD:-postgres}
      POSTGRES_INITDB_ARGS: "--encoding=UTF-8 --lc-collate=C --lc-ctype=C"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      # Scripts in docker-entrypoint-initdb.d run only when the data volume is empty (first start)
      - ./migrations:/docker-entrypoint-initdb.d:ro
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER:-postgres}"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    networks:
      - kms-network
    restart: unless-stopped

  # Redis Cache (Optional)
  kms-redis:
    image: redis:7-alpine
    container_name: kms-redis
    command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD:-redis_password}
    volumes:
      - redis_data:/data
    ports:
      - "6379:6379"
    healthcheck:
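      # requirepass is enabled, so the probe must authenticate to receive PONG
      test: ["CMD-SHELL", "redis-cli -a \"${REDIS_PASSWORD:-redis_password}\" --no-auth-warning ping | grep -q PONG"]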
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - kms-network
    restart: unless-stopped

  # API Service
  kms-api:
    build:
      context: .
      dockerfile: Dockerfile
      target: production
    container_name: kms-api
    environment:
      - DB_HOST=kms-postgres
      - DB_PORT=5432
      - DB_NAME=${DB_NAME:-kms}
      - DB_USER=${DB_USER:-postgres}
      - DB_PASSWORD=${DB_PASSWORD:-postgres}
      - REDIS_HOST=kms-redis
      - REDIS_PORT=6379
      - REDIS_PASSWORD=${REDIS_PASSWORD:-redis_password}
      - SERVER_PORT=8080
      - JWT_SECRET=${JWT_SECRET}
      - AUTH_SIGNING_KEY=${AUTH_SIGNING_KEY}
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - RATE_LIMIT_ENABLED=true
    ports:
      - "8080:8080"
      - "9090:9090"  # Metrics port
    depends_on:
      kms-postgres:
        condition: service_healthy
      kms-redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - kms-network
    restart: unless-stopped
    volumes:
      - ./logs:/app/logs

  # Frontend Service
  kms-frontend:
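    build:
      context: ./kms-frontend
      dockerfile: Dockerfile
      target: production
      # REACT_APP_* values are inlined into the bundle at build time, so pass the
      # API URL as a build argument (the frontend Dockerfile must declare a
      # matching ARG); setting it in the runtime environment has no effect
      args:
        REACT_APP_API_BASE_URL: http://localhost:8081/api
    container_name: kms-frontend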
    ports:
      - "3000:80"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - kms-network
    restart: unless-stopped

  # Nginx Reverse Proxy
  kms-nginx:
    image: nginx:alpine
    container_name: kms-nginx
    ports:
      - "8081:80"
      - "8443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/default.conf:/etc/nginx/conf.d/default.conf:ro
      - ./ssl:/etc/ssl/certs:ro
    depends_on:
      - kms-api
      - kms-frontend
    healthcheck:
      test: ["CMD", "nginx", "-t"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - kms-network
    restart: unless-stopped

networks:
  kms-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

volumes:
  postgres_data:
    driver: local
  redis_data:
    driver: local

Dockerfile (Multi-stage)

# Build stage
FROM golang:1.21-alpine AS builder

WORKDIR /app

# Install dependencies
RUN apk add --no-cache git ca-certificates tzdata

# Copy go mod files
COPY go.mod go.sum ./
RUN go mod download

# Copy source code
COPY . .

# Build the binary
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o main ./cmd/server

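# Production stage
FROM alpine:latest AS production

RUN apk --no-cache add ca-certificates curl

# Create non-root user (before copying, so the copied files can be owned by it)
RUN addgroup -g 1001 appgroup && \
    adduser -D -u 1001 -G appgroup appuser

# Use /app rather than /root/: /root is not accessible to appuser, so the
# binary could not be executed from there. /app/logs also matches the
# ./logs bind mount in docker-compose.yml.
WORKDIR /app

# Copy the binary and migrations from the builder stage
COPY --from=builder --chown=appuser:appgroup /app/main .
COPY --from=builder --chown=appuser:appgroup /app/migrations ./migrations
RUN mkdir -p /app/logs && chown appuser:appgroup /app/logs

USER appuser

EXPOSE 8080 9090

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

CMD ["./main"]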

Production Configuration

docker-compose.prod.yml

version: '3.8'

services:
  kms-postgres:
    image: postgres:15-alpine
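    environment:
      POSTGRES_DB: ${DB_NAME}
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres/postgresql.conf:/etc/postgresql/postgresql.conf:ro
      # Only used if postgresql.conf sets hba_file = '/etc/postgresql/pg_hba.conf'
      - ./postgres/pg_hba.conf:/etc/postgresql/pg_hba.conf:ro
    # The official image has no POSTGRES_MAX_CONNECTIONS variable; set
    # max_connections on the command line (or in postgresql.conf) instead
    command: postgres -c config_file=/etc/postgresql/postgresql.conf -c max_connections=100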
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
        reservations:
          memory: 512M
          cpus: '0.25'
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  kms-api:
    image: kms-api:latest
    environment:
      - DB_HOST=kms-postgres
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - JWT_SECRET=${JWT_SECRET}
      - AUTH_SIGNING_KEY=${AUTH_SIGNING_KEY}
      - LOG_LEVEL=info
      - ENVIRONMENT=production
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
        reservations:
          memory: 256M
          cpus: '0.25'
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"

  kms-nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/nginx.prod.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/ssl/certs:ro
    deploy:
      resources:
        limits:
          memory: 256M
          cpus: '0.25'
    restart: always

volumes:
  postgres_data:

Production Deployment

Load Balancer Configuration

HAProxy Configuration

# /etc/haproxy/haproxy.cfg
global
    daemon
    maxconn 4096
    log stdout local0
    
defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    option httplog

frontend kms_frontend
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/kms.pem
    redirect scheme https if !{ ssl_fc }
    
    # Rate limiting
    stick-table type ip size 100k expire 30s store http_req_rate(10s)
    http-request track-sc0 src
    http-request reject if { sc_http_req_rate(0) gt 20 }
    
    default_backend kms_api_servers

backend kms_api_servers
    balance roundrobin
    option httpchk GET /health
    
    server api1 kms-api-1:8080 check
    server api2 kms-api-2:8080 check
    server api3 kms-api-3:8080 check

Nginx Load Balancer

# /etc/nginx/nginx.conf (excerpt: all directives below belong inside the http { } block)

# Rate-limit zones may only be declared in the http context, not inside server { }
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=auth:10m rate=5r/s;

upstream kms_api_backend {
    least_conn;
    server kms-api-1:8080 weight=3 max_fails=3 fail_timeout=30s;
    server kms-api-2:8080 weight=3 max_fails=3 fail_timeout=30s;
    server kms-api-3:8080 weight=3 max_fails=3 fail_timeout=30s;
}

# proxy_pass cannot append a port to an upstream name, so the metrics
# port needs its own upstream group
upstream kms_api_metrics {
    server kms-api-1:9090;
    server kms-api-2:9090;
    server kms-api-3:9090;
}

upstream kms_frontend_backend {
    server kms-frontend-1:80;
    server kms-frontend-2:80;
}

server {
    listen 80;
    server_name kms.yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name kms.yourdomain.com;
    
    # SSL Configuration
    ssl_certificate /etc/ssl/certs/kms.crt;
    ssl_certificate_key /etc/ssl/private/kms.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;
    
    # Security Headers
    add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload";
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;
    add_header X-XSS-Protection "1; mode=block";
    add_header Referrer-Policy "strict-origin-when-cross-origin";
    
    # API Routes
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        proxy_pass http://kms_api_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    
    # Authentication Routes (stricter rate limiting)
    location /api/login {
        limit_req zone=auth burst=10 nodelay;
        proxy_pass http://kms_api_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    
    # Frontend Routes
    # The SPA fallback to index.html belongs in the frontend container's own
    # nginx: try_files here would test this proxy's local filesystem, never
    # find the files, and loop on /index.html.
    location / {
        proxy_pass http://kms_frontend_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    
    # Health Check
    location /health {
        access_log off;
        proxy_pass http://kms_api_backend;
    }
    
    # Metrics (restricted to private networks; each request reaches a single
    # replica, so Prometheus should scrape the replicas directly)
    location /metrics {
        allow 10.0.0.0/8;
        allow 172.16.0.0/12;
        allow 192.168.0.0/16;
        deny all;
        proxy_pass http://kms_api_metrics;
    }
}
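To make the least_conn choice concrete, the following simplified Go sketch implements weighted least-connections selection. It picks the available backend with the lowest ratio of active connections to weight and skips backends marked down, as max_fails does. This illustrates the technique and is not nginx's implementation; nginx, for instance, rotates among tied servers with weighted round-robin. The Backend type is hypothetical.

```go
package main

import "fmt"

// Backend is a hypothetical view of one upstream server.
type Backend struct {
	Addr   string
	Weight int  // like nginx's weight=
	Active int  // requests currently in flight
	Down   bool // e.g. after max_fails within fail_timeout
}

// pickLeastConn returns the index of the available backend with the lowest
// Active/Weight ratio, or -1 if none is available. Ties go to the earlier entry.
func pickLeastConn(backends []Backend) int {
	best := -1
	for i, b := range backends {
		if b.Down || b.Weight <= 0 {
			continue
		}
		// Compare Active_i/Weight_i < Active_best/Weight_best without division.
		if best == -1 || b.Active*backends[best].Weight < backends[best].Active*b.Weight {
			best = i
		}
	}
	return best
}

func main() {
	pool := []Backend{
		{Addr: "kms-api-1:8080", Weight: 3, Active: 4},
		{Addr: "kms-api-2:8080", Weight: 3, Active: 1},
		{Addr: "kms-api-3:8080", Weight: 3, Active: 0, Down: true},
	}
	i := pickLeastConn(pool)
	fmt.Println("next request goes to", pool[i].Addr) // prints "next request goes to kms-api-2:8080"
	pool[i].Active++                                  // the caller tracks the new in-flight request
}
```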

Database High Availability

PostgreSQL Primary-Replica Setup

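# docker-compose.ha.yml
# POSTGRES_PRIMARY_HOST and the POSTGRES_REPLICATION_* variables are not read by
# the official postgres image; the setup-replication.sh and setup-replica.sh
# scripts (not shown) must consume them and perform the replication setup.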
version: '3.8'

services:
  postgres-primary:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: kms
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_REPLICATION_USER: replicator
      POSTGRES_REPLICATION_PASSWORD: ${REPLICATION_PASSWORD}
    volumes:
      - postgres_primary_data:/var/lib/postgresql/data
      - ./postgres/primary.conf:/etc/postgresql/postgresql.conf
      - ./postgres/setup-replication.sh:/docker-entrypoint-initdb.d/setup-replication.sh
    command: postgres -c config_file=/etc/postgresql/postgresql.conf
    ports:
      - "5432:5432"

  postgres-replica:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_PRIMARY_HOST: postgres-primary
      POSTGRES_REPLICATION_USER: replicator
      POSTGRES_REPLICATION_PASSWORD: ${REPLICATION_PASSWORD}
    volumes:
      - postgres_replica_data:/var/lib/postgresql/data
      - ./postgres/replica.conf:/etc/postgresql/postgresql.conf
      - ./postgres/setup-replica.sh:/docker-entrypoint-initdb.d/setup-replica.sh
    command: postgres -c config_file=/etc/postgresql/postgresql.conf
    ports:
      - "5433:5432"
    depends_on:
      - postgres-primary

volumes:
  postgres_primary_data:
  postgres_replica_data:

Monitoring and Health Checks

Health Check Configuration

Application Health Checks

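// File: internal/handlers/health.go
package handlers

import (
    "context"
    "database/sql"
    "net/http"
    "time"

    "github.com/gin-gonic/gin"
    "github.com/redis/go-redis/v9" // assumed Redis client; adapt to the one in use
)

type HealthHandler struct {
    db      *sql.DB
    redis   *redis.Client // nil when the optional cache is disabled
    version string
}

type HealthResponse struct {
    Status    string            `json:"status"`
    Timestamp time.Time         `json:"timestamp"`
    Services  map[string]string `json:"services"`
    Version   string            `json:"version"`
}

func (h *HealthHandler) GetHealth(c *gin.Context) {
    // Bound the dependency checks so a hung connection cannot outlast the
    // 10s Docker healthcheck timeout
    ctx, cancel := context.WithTimeout(c.Request.Context(), 2*time.Second)
    defer cancel()

    response := &HealthResponse{
        Status:    "healthy",
        Timestamp: time.Now().UTC(),
        Services:  make(map[string]string),
        Version:   h.version,
    }
    
    // The database is required: if it is unreachable, the service is unhealthy
    if err := h.db.PingContext(ctx); err != nil {
        response.Status = "unhealthy"
        response.Services["database"] = "unhealthy"
    } else {
        response.Services["database"] = "healthy"
    }
    
    // Redis is optional: report its state without failing the overall check
    if h.redis != nil {
        if err := h.redis.Ping(ctx).Err(); err != nil {
            response.Services["cache"] = "unhealthy"
        } else {
            response.Services["cache"] = "healthy"
        }
    }
    
    statusCode := http.StatusOK
    if response.Status == "unhealthy" {
        statusCode = http.StatusServiceUnavailable
    }
    
    c.JSON(statusCode, response)
}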

Docker Health Checks

# API Service Health Check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

# Database Health Check
HEALTHCHECK --interval=30s --timeout=10s --retries=5 \
  CMD pg_isready -U postgres || exit 1

# Frontend Health Check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD curl -f http://localhost:80 || exit 1

Prometheus Metrics

Metrics Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
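  # With several kms-api replicas, this single target reaches only one replica
  # per scrape; list each replica (or use DNS service discovery) instead
  - job_name: 'kms-api'
    static_configs:
      - targets: ['kms-api:9090']
    metrics_path: /metrics
    scrape_interval: 15s

  # The jobs below assume exporter containers (nginx-prometheus-exporter,
  # postgres_exporter, redis_exporter) that the Compose files above do not define
  - job_name: 'nginx'
    static_configs:
      - targets: ['kms-nginx:9113']
    
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']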

Grafana Dashboards

{
  "dashboard": {
    "title": "KMS System Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{status}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph", 
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "95th percentile"
          }
        ]
      },
      {
        "title": "Database Connections",
        "type": "graph",
        "targets": [
          {
            "expr": "pg_stat_database_numbackends",
            "legendFormat": "Active connections"
          }
        ]
      },
      {
        "title": "Authentication Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(auth_attempts_success_total[5m])) / sum(rate(auth_attempts_total[5m]))",
            "legendFormat": "Success rate"
          }
        ]
      }
    ]
  }
}

Security Configuration

SSL/TLS Configuration

Certificate Management

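# Let's Encrypt certificate generation. Standalone mode runs its own web server
# on port 80, so publish that port and stop kms-nginx (or anything else bound
# to port 80) while it runs.
docker run --rm -it \
  -p 80:80 \
  -v /etc/letsencrypt:/etc/letsencrypt \
  -v /var/lib/letsencrypt:/var/lib/letsencrypt \
  certbot/certbot certonly \
  --standalone \
  --email admin@yourdomain.com \
  --agree-tos \
  --no-eff-email \
  -d kms.yourdomain.com

# Certificate renewal cron job. A crontab entry must fit on one line, and cron
# provides no TTY, so -it is dropped. Renewal reuses the standalone method and
# therefore also needs port 80 free; nginx is reloaded to pick up new certificates.
0 2 * * * docker run --rm -p 80:80 -v /etc/letsencrypt:/etc/letsencrypt -v /var/lib/letsencrypt:/var/lib/letsencrypt certbot/certbot renew --quiet && docker exec kms-nginx nginx -s reload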

SSL Configuration

# Strong SSL configuration
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-CHACHA20-POLY1305;
ssl_prefer_server_ciphers off;
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1d;
ssl_session_tickets off;
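# OCSP stapling requires an OCSP responder URL in the certificate; recent
# Let's Encrypt certificates omit it, in which case nginx ignores these lines
ssl_stapling on;
ssl_stapling_verify on;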

# HSTS
add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload";

Secrets Management

Docker Secrets

version: '3.8'

secrets:
  db_password:
    file: ./secrets/db_password.txt
  jwt_secret:
    file: ./secrets/jwt_secret.txt
  auth_signing_key:
    file: ./secrets/auth_signing_key.txt

services:
  kms-api:
    image: kms-api:latest
    secrets:
      - db_password
      - jwt_secret
      - auth_signing_key
    environment:
      - DB_PASSWORD_FILE=/run/secrets/db_password
      - JWT_SECRET_FILE=/run/secrets/jwt_secret
      - AUTH_SIGNING_KEY_FILE=/run/secrets/auth_signing_key

HashiCorp Vault Integration

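// Vault secret retrieval
import (
    "fmt"
    "os"

    vault "github.com/hashicorp/vault/api"
)

// getSecretFromVault returns the "value" field of the secret at path.
// For a KV version 2 engine, path must include the "data/" segment
// (for example "secret/data/kms/jwt"), and the fields arrive nested under "data".
func getSecretFromVault(path string) (string, error) {
    // DefaultConfig reads VAULT_ADDR (and TLS settings) from the environment
    config := vault.DefaultConfig()
    if config.Error != nil {
        return "", config.Error
    }
    client, err := vault.NewClient(config)
    if err != nil {
        return "", fmt.Errorf("create vault client: %w", err)
    }
    
    client.SetToken(os.Getenv("VAULT_TOKEN"))
    
    secret, err := client.Logical().Read(path)
    if err != nil {
        return "", fmt.Errorf("read %s: %w", path, err)
    }
    if secret == nil || secret.Data == nil {
        return "", fmt.Errorf("secret %s not found", path)
    }
    
    // KV v2 wraps the key/value pairs in a nested "data" map
    data := secret.Data
    if nested, ok := data["data"].(map[string]interface{}); ok {
        data = nested
    }
    
    value, ok := data["value"].(string)
    if !ok {
        return "", fmt.Errorf("secret %s has no string field \"value\"", path)
    }
    
    return value, nil
}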

Backup and Recovery

Database Backup Strategy

Automated Backup Script

#!/bin/bash
# backup.sh

set -euo pipefail

# Configuration
BACKUP_DIR="/backups"
POSTGRES_CONTAINER="kms-postgres"
S3_BUCKET="kms-backups"
RETENTION_DAYS=30

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Generate backup filename with timestamp
BACKUP_NAME="kms_backup_$(date +%Y%m%d_%H%M%S).sql"
BACKUP_PATH="$BACKUP_DIR/$BACKUP_NAME"

# Create database dump
echo "Creating database backup..."
docker exec "$POSTGRES_CONTAINER" pg_dump -U postgres kms > "$BACKUP_PATH"

# Compress backup (replaces the .sql file with .sql.gz)
gzip "$BACKUP_PATH"

# Upload to S3
echo "Uploading backup to S3..."
aws s3 cp "$BACKUP_PATH.gz" "s3://$S3_BUCKET/$(date +%Y/%m/%d)/$BACKUP_NAME.gz"

# Clean up local backups older than the retention period
find "$BACKUP_DIR" -name "kms_backup_*.sql.gz" -mtime +"$RETENTION_DAYS" -delete

# Clean up S3 objects older than the retention period (requires GNU date).
# An S3 lifecycle expiration rule on the bucket is a simpler alternative.
olderThan=$(date -d "$RETENTION_DAYS days ago" +%s)
aws s3 ls "s3://$S3_BUCKET/" --recursive | while read -r createDate createTime _size fileName; do
    if [[ $(date -d "$createDate $createTime" +%s) -lt $olderThan ]]; then
        aws s3 rm "s3://$S3_BUCKET/$fileName"
    fi
done

echo "Backup completed successfully"

Backup Cron Job

# Daily backup at 2 AM
0 2 * * * /opt/kms/scripts/backup.sh >> /var/log/kms-backup.log 2>&1

# Weekly full backup at 1 AM on Sundays
0 1 * * 0 /opt/kms/scripts/full-backup.sh >> /var/log/kms-backup.log 2>&1

Disaster Recovery

Recovery Procedure

#!/bin/bash
# restore.sh

set -euo pipefail

BACKUP_FILE=${1:-}
POSTGRES_CONTAINER="kms-postgres"

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup_file>"
    exit 1
fi

# Download backup from S3 if needed
if [[ $BACKUP_FILE == s3://* ]]; then
    LOCAL_FILE="/tmp/$(basename "$BACKUP_FILE")"
    aws s3 cp "$BACKUP_FILE" "$LOCAL_FILE"
    BACKUP_FILE=$LOCAL_FILE
fi

# Extract if compressed
if [[ $BACKUP_FILE == *.gz ]]; then
    gunzip -c "$BACKUP_FILE" > "/tmp/backup.sql"
    BACKUP_FILE="/tmp/backup.sql"
fi

# Stop API services (DROP DATABASE fails while clients are still connected)
echo "Stopping API services..."
docker-compose stop kms-api kms-frontend

# Drop and recreate database
echo "Recreating database..."
docker exec "$POSTGRES_CONTAINER" psql -U postgres -c "DROP DATABASE IF EXISTS kms;"
docker exec "$POSTGRES_CONTAINER" psql -U postgres -c "CREATE DATABASE kms;"

# Restore database; ON_ERROR_STOP makes psql exit non-zero on the first SQL error
echo "Restoring database..."
docker exec -i "$POSTGRES_CONTAINER" psql -v ON_ERROR_STOP=1 -U postgres kms < "$BACKUP_FILE"

# Start services
echo "Starting services..."
docker-compose up -d

# Verify restoration: poll the health endpoint for up to two minutes
echo "Verifying restoration..."
for attempt in $(seq 1 24); do
    if curl -fsS http://localhost:8081/health > /dev/null; then
        echo "Database restoration completed successfully"
        exit 0
    fi
    sleep 5
done

echo "Health check failed after restoration"
exit 1

Troubleshooting

Common Issues

Database Connection Issues

# Check database container logs
docker logs kms-postgres

# Test database connectivity
docker exec kms-postgres pg_isready -U postgres

# Check the database status reported by the health endpoint (healthy/unhealthy)
curl -s http://localhost:8080/health | jq '.services.database'

Authentication Problems

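# Check that the JWT secret is set (without printing its value)
docker exec kms-api sh -c 'test -n "$JWT_SECRET" && echo "JWT_SECRET is set" || echo "JWT_SECRET is missing"'

# Check that the HMAC signing key is set (without printing its value)
docker exec kms-api sh -c 'test -n "$AUTH_SIGNING_KEY" && echo "AUTH_SIGNING_KEY is set" || echo "AUTH_SIGNING_KEY is missing"'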

# Test authentication endpoint
curl -X POST http://localhost:8081/api/login \
  -H "Content-Type: application/json" \
  -d '{"app_id": "test-app", "user_id": "test@example.com"}'

Performance Issues

# Check API response times (connect, time to first byte, total)
curl -s -o /dev/null -w "connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n" http://localhost:8081/health

# Monitor database connections
docker exec kms-postgres psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# Check memory usage
docker stats kms-api

SSL/TLS Issues

# Test SSL certificate
openssl s_client -connect kms.yourdomain.com:443 -servername kms.yourdomain.com

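# Check certificate expiration
echo | openssl s_client -connect kms.yourdomain.com:443 -servername kms.yourdomain.com 2>/dev/null | openssl x509 -noout -enddate

# Verify certificate chain (curl fails if the chain or hostname cannot be verified)
curl -sSI https://kms.yourdomain.com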

Debugging Commands

Container Debugging

# View container logs
docker logs -f kms-api

# Execute shell in container
docker exec -it kms-api /bin/sh

# Inspect container configuration
docker inspect kms-api

# Check resource usage
docker stats

Network Debugging

# Test inter-container connectivity
docker exec kms-api ping kms-postgres

# Check port binding
netstat -tlnp | grep :8080

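# Inspect Docker network (Compose prefixes the network with the project name,
# which defaults to the directory name; list networks with docker network ls)
docker network inspect kms_kms-network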

This deployment guide provides comprehensive instructions for deploying, configuring, and maintaining the KMS in various environments, from development to production-scale deployments.