
Incident.io Progressive Delivery: Canary Deployment with Automated Rollback

2025-06-08 · อ. บอม, SiamCafe.net · 1,326 words

What Are Incident.io and Progressive Delivery?

Incident.io is an incident management platform that helps teams respond to incidents quickly. It integrates with Slack, PagerDuty, and Datadog, and covers the incident workflow end to end: detection, response, resolution, and post-mortem.

Progressive Delivery is a deployment strategy that rolls changes out to users gradually, checks metrics at each step, and rolls back automatically if problems appear. It covers canary deployments (send 1-5% of traffic to the new version), blue-green deployments (switch between 2 environments), feature flags (turn features on/off based on conditions), and A/B testing (compare variants across user groups).

Using Incident.io with Progressive Delivery lets you auto-create incidents when canary metrics degrade, notify the on-call engineer as soon as a deployment has problems, trigger automated rollback based on incident severity, and run post-deployment analysis that correlates deployment data with incident data.
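As a rough illustration of the severity-driven behavior described above, the sketch below maps a severity ID to a list of follow-up actions. The severity values mirror the example configuration used in this article. `SeverityPolicy`, `POLICIES`, and `plan_actions` are hypothetical names chosen for illustration and are not part of the Incident.io API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    """What the pipeline should do for one severity level (illustrative)."""
    name: str
    auto_rollback: bool
    escalation_minutes: int


# Values mirror the example severity_levels configuration in this article.
POLICIES = {
    "sev1": SeverityPolicy("Critical", True, 5),
    "sev2": SeverityPolicy("Major", True, 15),
    "sev3": SeverityPolicy("Minor", False, 30),
}


def plan_actions(severity_id: str) -> list[str]:
    """Return the ordered actions to take for an incident of this severity."""
    policy = POLICIES[severity_id]
    actions = ["create_incident", "notify_on_call"]
    if policy.auto_rollback:
        actions.append("rollback")
    actions.append(f"escalate_after_{policy.escalation_minutes}m")
    return actions


if __name__ == "__main__":
    for sev in POLICIES:
        print(sev, plan_actions(sev))
```

Keeping the policy in data rather than in branching code makes it easier to review which severities are allowed to roll back without a human decision.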

Setting Up Incident.io for Deployment Safety

Set up the Incident.io integration with the deployment pipeline.

# === Incident.io Setup for Progressive Delivery ===

# 1. Incident.io API Configuration
cat > incident_config.yaml << 'EOF'
incident_io:
  api_key_env: "INCIDENT_IO_API_KEY"
  base_url: "https://api.incident.io/v2"
  
  severity_levels:
    - id: "sev1"
      name: "Critical"
      description: "Service completely down"
      auto_rollback: true
      escalation_minutes: 5
    - id: "sev2"
      name: "Major"
      description: "Significant degradation"
      auto_rollback: true
      escalation_minutes: 15
    - id: "sev3"
      name: "Minor"
      description: "Partial impact"
      auto_rollback: false
      escalation_minutes: 30

  deployment_triggers:
    error_rate_threshold: 1.0      # % error rate
    latency_p99_threshold: 2000    # ms
    success_rate_threshold: 99.0   # %
    canary_failure_threshold: 3    # consecutive failures

  integrations:
    slack:
      channel: "#deployments"
      alert_channel: "#incidents"
    pagerduty:
      service_id: "PXXXXXX"
    datadog:
      api_key_env: "DD_API_KEY"
      app_key_env: "DD_APP_KEY"
    argo_rollouts:
      enabled: true
      namespace: "production"
EOF

# 2. Argo Rollouts with Incident.io webhook
cat > argo-rollout.yaml << 'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
  namespace: production
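spec:
  # NOTE: a complete Rollout also needs spec.selector and spec.template
  # (or spec.workloadRef); they are omitted from this excerpt.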
  replicas: 10
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: api-vsvc
              routes:
                - primary
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: canary-analysis
            args:
              - name: service-name
                value: api-service
        - setWeight: 20
        - pause: {duration: 10m}
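        - analysis:
            templates:
              - templateName: canary-analysis
            # The template declares service-name without a default,
            # so every analysis step has to pass it.
            args:
              - name: service-name
                value: api-service
        - setWeight: 50
        - pause: {duration: 15m}
        - analysis:
            templates:
              - templateName: canary-analysis
            args:
              - name: service-name
                value: api-service
        - setWeight: 100

  # rollbackWindow is a field of spec, not of strategy.canary
  rollbackWindow:
    revisions: 3

  # successCondition and failureLimit are per-metric settings and belong in
  # the canary-analysis AnalysisTemplate, not under strategy.canary.analysis.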
EOF

kubectl apply -f argo-rollout.yaml
echo "Incident.io + Argo Rollouts configured"
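The configuration above sets `canary_failure_threshold: 3` with the comment "consecutive failures". A minimal sketch of that rule, assuming each health check yields a boolean, could look like the following. `should_abort` is a hypothetical helper and not an Argo Rollouts or Incident.io function.

```python
def should_abort(check_results, threshold=3):
    """Return True once `threshold` consecutive health checks have failed.

    check_results is an iterable of booleans (True = healthy check).
    A healthy check resets the failure streak.
    """
    streak = 0
    for healthy in check_results:
        streak = 0 if healthy else streak + 1
        if streak >= threshold:
            return True
    return False


if __name__ == "__main__":
    # Two failures, a recovery, then one failure: the streak never reaches 3.
    print(should_abort([True, False, False, True, False]))
    # Three failures in a row trigger the abort.
    print(should_abort([True, False, False, False]))
```

Counting consecutive failures rather than total failures tolerates isolated blips, at the cost of reacting more slowly to an intermittent fault.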

Building a Progressive Delivery Pipeline

A Python pipeline for progressive deployment.

#!/usr/bin/env python3
# progressive_delivery.py: Progressive Delivery with Incident Management
import json
import logging
import time
import random
from typing import Dict, List, Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("delivery")

class ProgressiveDeliveryPipeline:
    """Progressive delivery with automated incident creation"""
    
    def __init__(self):
        self.deployment = None
        self.incidents = []
        self.metrics_history = []
    
    def start_deployment(self, service, version, strategy="canary"):
        """Start progressive deployment"""
        self.deployment = {
            "service": service,
            "version": version,
            "strategy": strategy,
            "status": "in_progress",
            "started_at": time.time(),
            "current_weight": 0,
            "steps_completed": 0,
        }
        logger.info(f"Starting {strategy} deployment: {service} v{version}")
        return self.deployment
    
    def check_canary_metrics(self, weight_pct):
        """Check canary health metrics"""
        # Simulate metrics collection
        metrics = {
            "weight_pct": weight_pct,
            "error_rate": random.uniform(0, 2),
            "latency_p50_ms": random.uniform(50, 150),
            "latency_p99_ms": random.uniform(100, 500),
            "success_rate": random.uniform(97, 100),
            "requests_per_sec": random.randint(100, 500),
        }
        
        self.metrics_history.append(metrics)
        
        # Check thresholds
        issues = []
        if metrics["error_rate"] > 1.0:
            issues.append(f"Error rate {metrics['error_rate']:.2f}% exceeds 1.0% threshold")
        if metrics["latency_p99_ms"] > 2000:
            issues.append(f"P99 latency {metrics['latency_p99_ms']:.0f}ms exceeds 2000ms threshold")
        if metrics["success_rate"] < 99.0:
            issues.append(f"Success rate {metrics['success_rate']:.2f}% below 99.0% threshold")
        
        return {
            "metrics": metrics,
            "healthy": len(issues) == 0,
            "issues": issues,
        }
    
    def create_incident(self, severity, title, description):
        """Create incident via Incident.io API"""
        incident = {
            "id": f"INC-{len(self.incidents) + 1:04d}",
            "severity": severity,
            "title": title,
            "description": description,
            "status": "open",
            "created_at": time.time(),
            "deployment": self.deployment,
        }
        self.incidents.append(incident)
        logger.warning(f"INCIDENT CREATED: [{severity}] {title}")
        
        # In production: POST to Incident.io API
        # requests.post("https://api.incident.io/v2/incidents", json={...})
        
        return incident
    
    def rollback(self, reason):
        """Rollback deployment"""
        if self.deployment:
            self.deployment["status"] = "rolled_back"
            self.deployment["rollback_reason"] = reason
            logger.warning(f"ROLLBACK: {self.deployment['service']} - {reason}")
        return self.deployment
    
    def run_canary(self, service, version, steps=None):
        """Run complete canary deployment"""
        if steps is None:
            steps = [5, 20, 50, 100]
        
        self.start_deployment(service, version, "canary")
        
        for weight in steps:
            logger.info(f"Canary step: {weight}% traffic")
            self.deployment["current_weight"] = weight
            
            # Check metrics
            check = self.check_canary_metrics(weight)
            
            if not check["healthy"]:
                # Create incident and rollback
                incident = self.create_incident(
                    severity="sev2",
                    title=f"Canary failure: {service} v{version} at {weight}%",
                    description=f"Issues: {'; '.join(check['issues'])}",
                )
                self.rollback(f"Canary failed at {weight}%: {check['issues'][0]}")
                return {"status": "failed", "incident": incident, "weight": weight}
            
            self.deployment["steps_completed"] += 1
            logger.info(f"  Metrics OK: error={check['metrics']['error_rate']:.2f}%, p99={check['metrics']['latency_p99_ms']:.0f}ms")
        
        self.deployment["status"] = "completed"
        return {"status": "success", "deployment": self.deployment}

# Demo
pipeline = ProgressiveDeliveryPipeline()
result = pipeline.run_canary("api-service", "2.5.0", steps=[5, 20, 50, 100])

print(f"\nDeployment Result: {result['status']}")
if result["status"] == "success":
    print(f"  Service: {result['deployment']['service']} v{result['deployment']['version']}")
    print(f"  Steps completed: {result['deployment']['steps_completed']}")
else:
    print(f"  Incident: {result['incident']['id']} - {result['incident']['title']}")
    print(f"  Failed at: {result['weight']}% traffic")
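The pipeline above leaves the actual API call as a comment. As a sketch of what that step might involve, the code below only builds the HTTP request and does not send it. The endpoint URL and the payload field names (`idempotency_key`, `severity_id`, `name`, `summary`) are taken from the webhook example in this article. `build_incident_request` is a hypothetical helper, and a real workspace may require additional fields and its own severity IDs, so check the Incident.io API reference before relying on it.

```python
import json
import urllib.request

# Endpoint as named in this article's examples.
INCIDENTS_URL = "https://api.incident.io/v2/incidents"


def build_incident_request(api_key, service, version, weight, issues):
    """Build (but do not send) a POST request describing a canary failure."""
    payload = {
        # Deterministic key: retrying the same failed step reuses the same key.
        "idempotency_key": f"deploy-{service}-{version}-{weight}",
        "severity_id": "sev2",  # assumption: replace with a real severity ID
        "name": f"Canary failure: {service} v{version} at {weight}%",
        "summary": "; ".join(issues),
    }
    return urllib.request.Request(
        INCIDENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_incident_request(
        "test-key", "api-service", "2.5.0", 20,
        ["Error rate 1.42% exceeds 1.0% threshold"],
    )
    print(req.get_method(), req.full_url)
    print(json.loads(req.data))
    # To send it for real: urllib.request.urlopen(req, timeout=10)
```

Deriving the idempotency key from the service, version, and step means a retried call for the same failure should not open a second incident, assuming the API deduplicates on that key.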

Automated Rollback and Incident Detection

A system that rolls back automatically when an incident is detected.

# === Automated Rollback System ===

# 1. Argo Rollouts Analysis Template
cat > analysis-template.yaml << 'EOF'
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      successCondition: "result[0] <= 1.0"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",
            code=~"5.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m])) * 100

    - name: latency-p99
      interval: 60s
      successCondition: "result[0] <= 2000"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m]))
            by (le)) * 1000

    - name: success-rate
      interval: 60s
      successCondition: "result[0] >= 99.0"
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",
            code=~"2.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m])) * 100

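    # Webhook to Incident.io on failure
    # CAUTION: illustrative only. A metric runs in every AnalysisRun, and
    # failureCondition: "true" fails each measurement, so as written this
    # would open an incident and abort every rollout. To notify only on
    # aborts, consider Argo Rollouts notifications (on-rollout-aborted).
    - name: incident-webhook
      interval: 300s
      failureCondition: "true"
      provider:
        web:
          url: "https://api.incident.io/v2/incidents"
          method: POST
          headers:
            - key: Authorization
              # {{INCIDENT_IO_API_KEY}} is a placeholder. Argo Rollouts
              # substitutes {{args.<name>}} values, so pass the key as an
              # arg sourced from a Secret (valueFrom.secretKeyRef).
              value: "Bearer {{INCIDENT_IO_API_KEY}}"
          # {{now}} below is likewise a placeholder, not a built-in variable.
          body: |
            {
              "idempotency_key": "deploy-{{args.service-name}}-{{now}}",
              "severity_id": "sev2",
              "incident_type_id": "deployment_failure",
              "name": "Canary failure: {{args.service-name}}",
              "summary": "Automated canary analysis detected issues"
            }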
EOF

# 2. Flagger with Incident.io webhook
cat > flagger-canary.yaml << 'EOF'
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
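    webhooks:
      # A rollback-type webhook is a gate: Flagger calls it during analysis
      # and rolls the canary back if it returns a successful HTTP status.
      - name: incident-notification
        type: rollback
        url: http://incident-webhook-svc/rollback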
        metadata:
          service: api-service
          severity: sev2
EOF

kubectl apply -f analysis-template.yaml
kubectl apply -f flagger-canary.yaml
echo "Automated rollback configured"
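To get a feel for how long the Flagger analysis above takes, the sketch below applies the rule-of-thumb formulas from the Flagger documentation: minimum promotion time is roughly `interval * (maxWeight / stepWeight)` and rollback time is roughly `interval * threshold`. The function names are hypothetical, and the results are estimates that ignore webhook latency and promotion overhead.

```python
def flagger_weights(step_weight, max_weight):
    """Canary traffic weights Flagger steps through before promotion."""
    return list(range(step_weight, max_weight + 1, step_weight))


def min_promotion_minutes(interval_min, step_weight, max_weight):
    """Approximate minimum time to validate and promote a healthy canary."""
    return interval_min * (max_weight / step_weight)


def rollback_minutes(interval_min, threshold):
    """Approximate time until rollback when every check fails."""
    return interval_min * threshold


if __name__ == "__main__":
    # Values from the Canary spec above: interval 1m, stepWeight 10,
    # maxWeight 50, threshold 5.
    print("weights:", flagger_weights(10, 50))
    print("min promotion (min):", min_promotion_minutes(1, 10, 50))
    print("rollback (min):", rollback_minutes(1, 5))
```

With the values in the spec, both estimates come out at about five minutes, which is worth comparing against the MTTR target you set for deployment-related incidents.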

Canary Deployment and Feature Flags

Combining canary deployment with feature flag management.

#!/usr/bin/env python3
# feature_flag_delivery.py: Feature Flags + Progressive Delivery
import hashlib
import json
import logging
import random
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("flags")

class FeatureFlagDelivery:
    """Combine feature flags with progressive delivery"""
    
    def __init__(self):
        self.flags = {}
        self.metrics = {}
    
    def create_flag(self, name, config):
        self.flags[name] = {
            "name": name,
            "enabled": config.get("enabled", False),
            "rollout_pct": config.get("rollout_pct", 0),
            "targeting": config.get("targeting", {}),
            "kill_switch": False,
        }
    
    def evaluate(self, flag_name, user_context):
        """Evaluate feature flag for user"""
        flag = self.flags.get(flag_name)
        if not flag or not flag["enabled"] or flag["kill_switch"]:
            return False
        
        # Check rollout percentage. Use a stable digest: the built-in hash()
        # of a string is salted per process, so buckets would change on restart.
        key = f"{flag_name}:{user_context.get('user_id', '')}"
        user_hash = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
        if user_hash >= flag["rollout_pct"]:
            return False
        
        # Check targeting rules
        targeting = flag.get("targeting", {})
        if "countries" in targeting:
            if user_context.get("country") not in targeting["countries"]:
                return False
        
        return True
    
    def progressive_rollout(self, flag_name, steps=None):
        """Progressive rollout of feature flag"""
        if steps is None:
            steps = [1, 5, 10, 25, 50, 100]
        
        results = []
        for pct in steps:
            self.flags[flag_name]["rollout_pct"] = pct
            
            # Simulate metrics check
            error_rate = random.uniform(0, 1.5)
            healthy = error_rate < 1.0
            
            result = {
                "rollout_pct": pct,
                "error_rate": round(error_rate, 3),
                "healthy": healthy,
            }
            results.append(result)
            
            if not healthy:
                # Kill switch
                self.flags[flag_name]["kill_switch"] = True
                logger.warning(f"Kill switch activated for {flag_name} at {pct}%")
                break
            
            logger.info(f"{flag_name}: {pct}% rollout OK (error={error_rate:.3f}%)")
        
        return results
    
    def deployment_strategy(self):
        return {
            "phase_1_internal": {
                "description": "Deploy to internal employees only",
                "duration": "1 day",
                "targeting": {"group": "internal"},
                "rollout": "100% of internal users",
                "metrics": ["error_rate", "latency", "user_feedback"],
            },
            "phase_2_canary": {
                "description": "1% of production traffic",
                "duration": "2 hours",
                "rollout": "1%",
                "auto_rollback": True,
            },
            "phase_3_gradual": {
                "description": "Gradual increase to 100%",
                "steps": [5, 10, 25, 50, 100],
                "pause_between": "30 minutes",
                "auto_rollback": True,
            },
        }

delivery = FeatureFlagDelivery()
delivery.create_flag("new_checkout_flow", {
    "enabled": True,
    "rollout_pct": 0,
    "targeting": {"countries": ["TH", "SG", "MY"]},
})

results = delivery.progressive_rollout("new_checkout_flow")
print("Progressive Rollout Results:")
for r in results:
    status = "OK" if r["healthy"] else "FAILED"
    print(f"  {r['rollout_pct']}%: error={r['error_rate']}% [{status}]")

strategy = delivery.deployment_strategy()
print("\nDeployment Strategy:")
for phase, info in strategy.items():
    print(f"  {phase}: {info['description']}")
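One property worth checking in a percentage rollout is that raising the percentage only adds users and never removes anyone who already has the feature. The sketch below demonstrates this with a stable SHA-256 bucket. `in_rollout` is a hypothetical standalone helper, separate from the class above, and the user IDs are made up for the demonstration.

```python
import hashlib


def in_rollout(flag_name, user_id, rollout_pct):
    """Return True if this user falls inside the rollout percentage.

    The bucket depends only on the flag name and user ID, so it is the
    same in every process and on every run.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < rollout_pct


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    at_5 = {u for u in users if in_rollout("new_checkout_flow", u, 5)}
    at_25 = {u for u in users if in_rollout("new_checkout_flow", u, 25)}
    # Every user enabled at 5% is still enabled at 25%.
    print(at_5 <= at_25)  # prints True
```

Because a user's bucket is fixed, comparing it against a larger percentage can only turn the feature on for more users, which keeps each person's experience consistent as the rollout widens.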

Monitoring and Post-Incident Analysis

Tracking and analyzing results after an incident.

#!/usr/bin/env python3
# incident_analytics.py: Incident Analytics Dashboard
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("analytics")

class IncidentAnalytics:
    def __init__(self):
        pass
    
    def dashboard(self):
        return {
            "deployment_metrics_30d": {
                "total_deployments": 45,
                "successful": 42,
                "rolled_back": 3,
                "success_rate": "93.3%",
                "avg_rollout_duration_min": 35,
                "incidents_from_deployments": 3,
            },
            "incident_metrics_30d": {
                "total_incidents": 8,
                "deployment_related": 3,
                "mttr_minutes": 18,
                "mttr_trend": "Improving (was 32 min last month)",
                "by_severity": {"sev1": 0, "sev2": 2, "sev3": 6},
            },
            "recent_incidents": [
                {
                    "id": "INC-0042",
                    "title": "Canary failure: checkout-service v3.2.0",
                    "severity": "sev2",
                    "duration_min": 8,
                    "action": "Auto-rollback at 5% traffic",
                    "root_cause": "Database connection pool exhaustion",
                    "status": "resolved",
                },
                {
                    "id": "INC-0039",
                    "title": "Elevated error rate after feature flag rollout",
                    "severity": "sev3",
                    "duration_min": 22,
                    "action": "Kill switch activated, flag disabled",
                    "root_cause": "Missing null check in new code path",
                    "status": "resolved",
                },
            ],
            "improvement_actions": [
                "Add database connection pool monitoring to canary analysis",
                "Increase unit test coverage for checkout-service to 85%",
                "Add null safety checks to feature flag code paths",
                "Reduce canary analysis interval from 5m to 2m for faster detection",
            ],
        }

analytics = IncidentAnalytics()
dash = analytics.dashboard()
deploy = dash["deployment_metrics_30d"]
print(f"Deployment & Incident Dashboard (30d):")
print(f"  Deployments: {deploy['total_deployments']} ({deploy['success_rate']} success)")
print(f"  Rolled back: {deploy['rolled_back']}")
print(f"  Avg rollout: {deploy['avg_rollout_duration_min']} min")

incident = dash["incident_metrics_30d"]
print(f"\nIncidents: {incident['total_incidents']} (MTTR: {incident['mttr_minutes']} min)")
print(f"  Trend: {incident['mttr_trend']}")

print(f"\nRecent Incidents:")
for inc in dash["recent_incidents"]:
    print(f"  [{inc['severity']}] {inc['id']}: {inc['title']}")
    print(f"    Duration: {inc['duration_min']}m, Action: {inc['action']}")

print(f"\nImprovement Actions:")
for action in dash["improvement_actions"]:
    print(f"  - {action}")
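The dashboard above returns hard-coded figures. In practice those numbers would be derived from raw records, and a sketch of that aggregation might look like the following. The record lists are hypothetical and only shaped to match the dashboard totals (45 deployments with 3 rollbacks, 8 incidents). The function names are illustrative.

```python
from collections import Counter

# Hypothetical raw records shaped to match the dashboard totals above.
deployment_outcomes = ["success"] * 42 + ["rolled_back"] * 3
incident_severities = ["sev2"] * 2 + ["sev3"] * 6


def success_rate_pct(outcomes):
    """Share of deployments that finished without a rollback, in percent."""
    return 100 * sum(o == "success" for o in outcomes) / len(outcomes)


def count_by_severity(severities, levels=("sev1", "sev2", "sev3")):
    """Count incidents per severity, including levels with zero incidents."""
    counts = Counter(severities)
    return {level: counts.get(level, 0) for level in levels}


if __name__ == "__main__":
    print(f"success_rate: {success_rate_pct(deployment_outcomes):.1f}%")
    print(f"by_severity: {count_by_severity(incident_severities)}")
```

Run against these records, the computed values agree with the 93.3% success rate and the severity breakdown shown in the dashboard.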

FAQ: Frequently Asked Questions

Q: How do Incident.io, PagerDuty, and Opsgenie differ?

A: Incident.io focuses on incident lifecycle management. Its strength is a Slack-native workflow: it creates an incident channel automatically, tracks the timeline, and provides post-mortem templates. It suits teams that use Slack as their main tool. PagerDuty focuses on on-call management. Its strengths are alerting, escalation, and scheduling; it is the most mature of the three (15+ years) and has the most integrations. It suits large enterprises. Opsgenie (Atlassian) focuses on integration with Jira and Confluence, is priced lower than PagerDuty, and fits the Atlassian ecosystem. It suits teams that use Jira as their main tool. For progressive delivery integration, all 3 support webhooks that can be connected to Argo Rollouts/Flagger.

Q: How do canary and blue-green deployments differ?

A: Canary shifts traffic gradually (1% → 5% → 20% → 100%). Its advantage is that problems are caught early while the impact is still small, and rollback is fast. Its drawback is complexity, because it needs traffic splitting. It suits services with high traffic where the goal is to minimize risk. Blue-green maintains 2 environments (blue=current, green=new) and switches traffic all at once. It is simple and easy to understand, but it is all-or-nothing: a problem impacts all users, and it requires double the infrastructure cost. It suits services with low traffic, such as batch processing. Recommendation: canary for customer-facing services, blue-green for internal/batch services.

Q: How do Argo Rollouts and Flagger differ?

A: Argo Rollouts is a Kubernetes controller for progressive delivery. It is CRD-based (a Rollout replaces the Deployment), supports canary, blue-green, and experiments, provides analysis templates plus metrics providers, and integrates well with Argo CD. Flagger is a progressive delivery operator by Weaveworks that works with existing Deployments (no need to switch to a different CRD), supports Istio, Linkerd, Nginx, and Contour, and has webhooks and alerting built in. Recommendation: if you already use Argo CD, choose Argo Rollouts; if you want minimal changes and do not want to change CRDs, choose Flagger.

Q: How can MTTR (Mean Time to Recovery) be reduced?

A: MTTR is made up of Detection time + Response time + Resolution time. Ways to reduce it: Detection: use automated canary analysis to detect problems within 1-2 minutes instead of relying on manual monitoring. Alert routing: send each alert to the right person (PagerDuty/Incident.io routing rules). Automated rollback: roll back as soon as metrics degrade, without waiting for a manual decision. Runbooks: prepare runbooks for common issues to cut investigation time. Practice: run game days/chaos engineering to train team response. Post-mortems: learn from incidents to prevent them from recurring. A good MTTR target for progressive delivery is under 15 minutes, and with automated rollback under 5 minutes.
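The decomposition above can be worked through with numbers. The sketch below sums the three phases for two hypothetical scenarios and checks them against the targets mentioned in the answer (under 15 minutes, and under 5 minutes with automated rollback). All phase durations are made-up illustrative values, not measurements.

```python
def mttr_minutes(detection, response, resolution):
    """MTTR as the sum of its three phases, all in minutes."""
    return detection + response + resolution


def meets_target(mttr, target_minutes):
    """True when the recovery time is strictly under the target."""
    return mttr < target_minutes


if __name__ == "__main__":
    # Hypothetical timings: manual monitoring and a manual rollback decision.
    manual = mttr_minutes(detection=10, response=5, resolution=15)
    # Hypothetical timings: automated canary analysis and automated rollback.
    automated = mttr_minutes(detection=2, response=1, resolution=1)

    print(f"manual:    {manual} min, under 15 min target: {meets_target(manual, 15)}")
    print(f"automated: {automated} min, under 5 min target: {meets_target(automated, 5)}")
```

Breaking MTTR into phases like this shows where automation helps most: in this example, detection and resolution account for most of the difference between the two scenarios.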
