Incident.io and Progressive Delivery: Safer Deployments with Automated Incident Management

Incident.io is an incident management platform that helps teams respond to incidents quickly. It integrates with Slack, PagerDuty, and Datadog, and covers the full incident workflow: detection, response, resolution, and post-mortem.
Progressive Delivery is a deployment strategy that rolls out changes to a small subset of users first, watches metrics as traffic shifts, and rolls back automatically when something goes wrong. The main techniques are Canary deployments (send 1-5% of traffic to the new version first), Blue-Green deployments (run two parallel environments and switch between them), Feature flags (turn features on/off based on conditions), and A/B testing (compare variants across user groups).
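Of the strategies above, the blue-green switch is the simplest to express in code. A minimal sketch of the idea (all class and method names here are hypothetical, for illustration only):

```python
# Blue-green switch sketch: two environments, one traffic pointer.
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlueGreenSwitch:
    blue_version: str                    # currently live version
    green_version: Optional[str] = None  # staged candidate
    live: str = "blue"

    def deploy_green(self, version: str) -> None:
        """Stage the new version in the idle environment."""
        self.green_version = version

    def switch(self) -> None:
        """Point all traffic at green; blue stays warm for rollback."""
        if self.green_version is None:
            raise RuntimeError("nothing staged in green")
        self.live = "green"

    def rollback(self) -> None:
        """All-or-nothing rollback: point traffic back at blue."""
        self.live = "blue"

    def live_version(self) -> str:
        return self.green_version if self.live == "green" else self.blue_version


env = BlueGreenSwitch(blue_version="2.4.0")
env.deploy_green("2.5.0")
env.switch()
print(env.live_version())  # 2.5.0 now serves all traffic
```

Note the all-or-nothing nature: after `switch()` every user is on the new version, which is exactly why canary is preferred for customer-facing services.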
Combining Incident.io with Progressive Delivery enables: auto-creating incidents when canary metrics breach thresholds, notifying the on-call engineer when a deployment runs into trouble, automated rollback triggers based on incident severity, and post-deployment analysis that links deployment data with incident data.
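The "auto-create incidents" piece boils down to a single API call when a threshold is breached. A sketch of that call follows; the endpoint and field names mirror the config used later in this article, but treat the exact v2 request schema as an assumption and verify it against the Incident.io API documentation:

```python
# Sketch: auto-creating an incident when canary metrics breach thresholds.
# Field names are assumptions modeled on this article's config, not a
# verified Incident.io v2 schema.
import json
import os
import urllib.request


def build_incident_payload(service, version, weight_pct, issues):
    """Assemble the incident body from deployment context."""
    return {
        "idempotency_key": f"deploy-{service}-{version}-{weight_pct}",
        "severity_id": "sev2",
        "name": f"Canary failure: {service} v{version} at {weight_pct}%",
        "summary": "; ".join(issues),
    }


def post_incident(payload, dry_run=True):
    """POST to the Incident.io API; dry_run skips the real network call."""
    if dry_run:
        return payload
    req = urllib.request.Request(
        "https://api.incident.io/v2/incidents",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['INCIDENT_IO_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_incident_payload(
    "api-service", "2.5.0", 5, ["Error rate 2.10% exceeds 1.0% threshold"])
print(post_incident(payload)["name"])
```

The idempotency key ties the incident to a specific deployment step, so retries from the pipeline do not open duplicate incidents.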
Setting Up Incident.io for Deployment Safety

Set up the Incident.io integration in the deployment pipeline:
# === Incident.io Setup for Progressive Delivery ===

# 1. Incident.io API configuration
cat > incident_config.yaml << 'EOF'
incident_io:
  api_key_env: "INCIDENT_IO_API_KEY"
  base_url: "https://api.incident.io/v2"
  severity_levels:
    - id: "sev1"
      name: "Critical"
      description: "Service completely down"
      auto_rollback: true
      escalation_minutes: 5
    - id: "sev2"
      name: "Major"
      description: "Significant degradation"
      auto_rollback: true
      escalation_minutes: 15
    - id: "sev3"
      name: "Minor"
      description: "Partial impact"
      auto_rollback: false
      escalation_minutes: 30

deployment_triggers:
  error_rate_threshold: 1.0       # % error rate
  latency_p99_threshold: 2000     # ms
  success_rate_threshold: 99.0    # %
  canary_failure_threshold: 3     # consecutive failures

integrations:
  slack:
    channel: "#deployments"
    alert_channel: "#incidents"
  pagerduty:
    service_id: "PXXXXXX"
  datadog:
    api_key_env: "DD_API_KEY"
    app_key_env: "DD_APP_KEY"
  argo_rollouts:
    enabled: true
    namespace: "production"
EOF
# 2. Argo Rollout with canary analysis steps
cat > argo-rollout.yaml << 'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: api-vsvc
              routes:
                - primary
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: canary-analysis
            args:
              - name: service-name
                value: api-service
        - setWeight: 20
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: canary-analysis
            args:
              - name: service-name
                value: api-service
        - setWeight: 50
        - pause: {duration: 15m}
        - analysis:
            templates:
              - templateName: canary-analysis
            args:
              - name: service-name
                value: api-service
        - setWeight: 100
  rollbackWindow:
    revisions: 3
# The success/failure conditions (successCondition "result[0] >= 99",
# failureLimit 2) live in the canary-analysis AnalysisTemplate defined
# in the next section, which also fires a webhook to Incident.io on failure.
EOF

kubectl apply -f argo-rollout.yaml
echo "Incident.io + Argo Rollouts configured"
Building the Progressive Delivery Pipeline

A Python pipeline for progressive deployment:
#!/usr/bin/env python3
# progressive_delivery.py - Progressive Delivery with Incident Management
import json
import logging
import time
import random
from typing import Dict, List, Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("delivery")


class ProgressiveDeliveryPipeline:
    """Progressive delivery with automated incident creation"""

    def __init__(self):
        self.deployment = None
        self.incidents = []
        self.metrics_history = []

    def start_deployment(self, service, version, strategy="canary"):
        """Start a progressive deployment"""
        self.deployment = {
            "service": service,
            "version": version,
            "strategy": strategy,
            "status": "in_progress",
            "started_at": time.time(),
            "current_weight": 0,
            "steps_completed": 0,
        }
        logger.info(f"Starting {strategy} deployment: {service} v{version}")
        return self.deployment

    def check_canary_metrics(self, weight_pct):
        """Check canary health metrics"""
        # Simulate metrics collection
        metrics = {
            "weight_pct": weight_pct,
            "error_rate": random.uniform(0, 2),
            "latency_p50_ms": random.uniform(50, 150),
            "latency_p99_ms": random.uniform(100, 500),
            "success_rate": random.uniform(97, 100),
            "requests_per_sec": random.randint(100, 500),
        }
        self.metrics_history.append(metrics)

        # Check thresholds
        issues = []
        if metrics["error_rate"] > 1.0:
            issues.append(f"Error rate {metrics['error_rate']:.2f}% exceeds 1.0% threshold")
        if metrics["latency_p99_ms"] > 2000:
            issues.append(f"P99 latency {metrics['latency_p99_ms']:.0f}ms exceeds 2000ms threshold")
        if metrics["success_rate"] < 99.0:
            issues.append(f"Success rate {metrics['success_rate']:.2f}% below 99.0% threshold")

        return {
            "metrics": metrics,
            "healthy": len(issues) == 0,
            "issues": issues,
        }

    def create_incident(self, severity, title, description):
        """Create an incident via the Incident.io API"""
        incident = {
            "id": f"INC-{len(self.incidents) + 1:04d}",
            "severity": severity,
            "title": title,
            "description": description,
            "status": "open",
            "created_at": time.time(),
            "deployment": self.deployment,
        }
        self.incidents.append(incident)
        logger.warning(f"INCIDENT CREATED: [{severity}] {title}")
        # In production: POST to the Incident.io API, e.g.
        # requests.post("https://api.incident.io/v2/incidents", json={...})
        return incident

    def rollback(self, reason):
        """Roll back the deployment"""
        if self.deployment:
            self.deployment["status"] = "rolled_back"
            self.deployment["rollback_reason"] = reason
            logger.warning(f"ROLLBACK: {self.deployment['service']} - {reason}")
        return self.deployment

    def run_canary(self, service, version, steps=None):
        """Run a complete canary deployment"""
        if steps is None:
            steps = [5, 20, 50, 100]
        self.start_deployment(service, version, "canary")
        for weight in steps:
            logger.info(f"Canary step: {weight}% traffic")
            self.deployment["current_weight"] = weight

            # Check metrics at this traffic weight
            check = self.check_canary_metrics(weight)
            if not check["healthy"]:
                # Create an incident and roll back
                incident = self.create_incident(
                    severity="sev2",
                    title=f"Canary failure: {service} v{version} at {weight}%",
                    description=f"Issues: {'; '.join(check['issues'])}",
                )
                self.rollback(f"Canary failed at {weight}%: {check['issues'][0]}")
                return {"status": "failed", "incident": incident, "weight": weight}

            self.deployment["steps_completed"] += 1
            logger.info(f"  Metrics OK: error={check['metrics']['error_rate']:.2f}%, p99={check['metrics']['latency_p99_ms']:.0f}ms")

        self.deployment["status"] = "completed"
        return {"status": "success", "deployment": self.deployment}


# Demo
pipeline = ProgressiveDeliveryPipeline()
result = pipeline.run_canary("api-service", "2.5.0", steps=[5, 20, 50, 100])
print(f"\nDeployment Result: {result['status']}")
if result["status"] == "success":
    print(f"  Service: {result['deployment']['service']} v{result['deployment']['version']}")
    print(f"  Steps completed: {result['deployment']['steps_completed']}")
else:
    print(f"  Incident: {result['incident']['id']} - {result['incident']['title']}")
    print(f"  Failed at: {result['weight']}% traffic")
Automated Rollback and Incident Detection

Configure automated rollback that opens an incident on its own when it fires:
# === Automated Rollback System ===

# 1. Argo Rollouts AnalysisTemplate
cat > analysis-template.yaml << 'EOF'
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-analysis
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      successCondition: "result[0] <= 1.0"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",
              code=~"5.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m])) * 100
    - name: latency-p99
      interval: 60s
      successCondition: "result[0] <= 2000"
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m]))
              by (le)) * 1000
    - name: success-rate
      interval: 60s
      successCondition: "result[0] >= 99.0"
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",
              code=~"2.."}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m])) * 100
    # Webhook to Incident.io on failure
    - name: incident-webhook
      interval: 300s
      failureCondition: "true"
      provider:
        web:
          url: "https://api.incident.io/v2/incidents"
          method: POST
          headers:
            - key: Authorization
              value: "Bearer {{INCIDENT_IO_API_KEY}}"
          body: |
            {
              "idempotency_key": "deploy-{{args.service-name}}-{{now}}",
              "severity_id": "sev2",
              "incident_type_id": "deployment_failure",
              "name": "Canary failure: {{args.service-name}}",
              "summary": "Automated canary analysis detected issues"
            }
EOF
# 2. Flagger Canary with an Incident.io webhook
cat > flagger-canary.yaml << 'EOF'
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: incident-notification
        type: rollback
        url: http://incident-webhook-svc/rollback
        metadata:
          service: api-service
          severity: sev2
EOF

kubectl apply -f analysis-template.yaml
kubectl apply -f flagger-canary.yaml
echo "Automated rollback configured"
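The Flagger manifest above points its rollback webhook at `incident-webhook-svc`, which is not defined anywhere in this article. A minimal sketch of what that receiver could look like follows; the payload handling assumes Flagger forwards the webhook `metadata` map in its POST body, which you should verify against the Flagger webhook documentation:

```python
# Hypothetical receiver behind http://incident-webhook-svc/rollback.
# Maps a Flagger rollback event to an Incident.io-style incident body.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def incident_from_event(event):
    """Translate a Flagger webhook payload into an incident body."""
    meta = event.get("metadata", {})
    service = meta.get("service", event.get("name", "unknown"))
    return {
        "severity_id": meta.get("severity", "sev2"),
        "name": f"Rollback: {service}",
        "summary": f"Flagger rolled back {service} during canary analysis",
    }


class RollbackWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/rollback":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        incident = incident_from_event(event)
        # Here you would POST `incident` to the Incident.io API.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(json.dumps(incident).encode())


# To run the receiver:
#   HTTPServer(("", 8080), RollbackWebhookHandler).serve_forever()
print(incident_from_event({"metadata": {"service": "api-service",
                                        "severity": "sev2"}})["name"])
```

Keeping the event-to-incident mapping in a pure function (`incident_from_event`) makes the translation testable without running the HTTP server.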
Canary Deployments with Feature Flags

Combine canary deployments with feature flag management:
#!/usr/bin/env python3
# feature_flag_delivery.py - Feature Flags + Progressive Delivery
import json
import logging
import random
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("flags")


class FeatureFlagDelivery:
    """Combine feature flags with progressive delivery"""

    def __init__(self):
        self.flags = {}
        self.metrics = {}

    def create_flag(self, name, config):
        self.flags[name] = {
            "name": name,
            "enabled": config.get("enabled", False),
            "rollout_pct": config.get("rollout_pct", 0),
            "targeting": config.get("targeting", {}),
            "kill_switch": False,
        }

    def evaluate(self, flag_name, user_context):
        """Evaluate a feature flag for a user"""
        flag = self.flags.get(flag_name)
        if not flag or not flag["enabled"] or flag["kill_switch"]:
            return False

        # Check rollout percentage
        user_hash = hash(f"{flag_name}:{user_context.get('user_id', '')}") % 100
        if user_hash >= flag["rollout_pct"]:
            return False

        # Check targeting rules
        targeting = flag.get("targeting", {})
        if "countries" in targeting:
            if user_context.get("country") not in targeting["countries"]:
                return False
        return True

    def progressive_rollout(self, flag_name, steps=None):
        """Progressively roll out a feature flag"""
        if steps is None:
            steps = [1, 5, 10, 25, 50, 100]
        results = []
        for pct in steps:
            self.flags[flag_name]["rollout_pct"] = pct

            # Simulate a metrics check
            error_rate = random.uniform(0, 1.5)
            healthy = error_rate < 1.0
            result = {
                "rollout_pct": pct,
                "error_rate": round(error_rate, 3),
                "healthy": healthy,
            }
            results.append(result)

            if not healthy:
                # Kill switch
                self.flags[flag_name]["kill_switch"] = True
                logger.warning(f"Kill switch activated for {flag_name} at {pct}%")
                break
            logger.info(f"{flag_name}: {pct}% rollout OK (error={error_rate:.3f}%)")
        return results

    def deployment_strategy(self):
        return {
            "phase_1_internal": {
                "description": "Deploy to internal employees only",
                "duration": "1 day",
                "targeting": {"group": "internal"},
                "rollout": "100% of internal users",
                "metrics": ["error_rate", "latency", "user_feedback"],
            },
            "phase_2_canary": {
                "description": "1% of production traffic",
                "duration": "2 hours",
                "rollout": "1%",
                "auto_rollback": True,
            },
            "phase_3_gradual": {
                "description": "Gradual increase to 100%",
                "steps": [5, 10, 25, 50, 100],
                "pause_between": "30 minutes",
                "auto_rollback": True,
            },
        }


delivery = FeatureFlagDelivery()
delivery.create_flag("new_checkout_flow", {
    "enabled": True,
    "rollout_pct": 0,
    "targeting": {"countries": ["TH", "SG", "MY"]},
})

results = delivery.progressive_rollout("new_checkout_flow")
print("Progressive Rollout Results:")
for r in results:
    status = "OK" if r["healthy"] else "FAILED"
    print(f"  {r['rollout_pct']}%: error={r['error_rate']}% [{status}]")

strategy = delivery.deployment_strategy()
print("\nDeployment Strategy:")
for phase, info in strategy.items():
    print(f"  {phase}: {info['description']}")
Monitoring and Post-Incident Analysis

Track deployment metrics and analyze incidents after the fact:
#!/usr/bin/env python3
# incident_analytics.py - Incident Analytics Dashboard
import json
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("analytics")


class IncidentAnalytics:
    def dashboard(self):
        return {
            "deployment_metrics_30d": {
                "total_deployments": 45,
                "successful": 42,
                "rolled_back": 3,
                "success_rate": "93.3%",
                "avg_rollout_duration_min": 35,
                "incidents_from_deployments": 3,
            },
            "incident_metrics_30d": {
                "total_incidents": 8,
                "deployment_related": 3,
                "mttr_minutes": 18,
                "mttr_trend": "Improving (was 32 min last month)",
                "by_severity": {"sev1": 0, "sev2": 2, "sev3": 6},
            },
            "recent_incidents": [
                {
                    "id": "INC-0042",
                    "title": "Canary failure: checkout-service v3.2.0",
                    "severity": "sev2",
                    "duration_min": 8,
                    "action": "Auto-rollback at 5% traffic",
                    "root_cause": "Database connection pool exhaustion",
                    "status": "resolved",
                },
                {
                    "id": "INC-0039",
                    "title": "Elevated error rate after feature flag rollout",
                    "severity": "sev3",
                    "duration_min": 22,
                    "action": "Kill switch activated, flag disabled",
                    "root_cause": "Missing null check in new code path",
                    "status": "resolved",
                },
            ],
            "improvement_actions": [
                "Add database connection pool monitoring to canary analysis",
                "Increase unit test coverage for checkout-service to 85%",
                "Add null safety checks to feature flag code paths",
                "Reduce canary analysis interval from 5m to 2m for faster detection",
            ],
        }


analytics = IncidentAnalytics()
dash = analytics.dashboard()

deploy = dash["deployment_metrics_30d"]
print("Deployment & Incident Dashboard (30d):")
print(f"  Deployments: {deploy['total_deployments']} ({deploy['success_rate']} success)")
print(f"  Rolled back: {deploy['rolled_back']}")
print(f"  Avg rollout: {deploy['avg_rollout_duration_min']} min")

incident = dash["incident_metrics_30d"]
print(f"\nIncidents: {incident['total_incidents']} (MTTR: {incident['mttr_minutes']} min)")
print(f"  Trend: {incident['mttr_trend']}")

print("\nRecent Incidents:")
for inc in dash["recent_incidents"]:
    print(f"  [{inc['severity']}] {inc['id']}: {inc['title']}")
    print(f"    Duration: {inc['duration_min']}m, Action: {inc['action']}")

print("\nImprovement Actions:")
for action in dash["improvement_actions"]:
    print(f"  - {action}")
FAQ: Frequently Asked Questions
Q: How does Incident.io differ from PagerDuty and Opsgenie?
A: Incident.io focuses on incident lifecycle management with a Slack-native workflow: it creates an incident channel, tracks the timeline, and ships post-mortem templates, which suits teams that live in Slack. PagerDuty focuses on on-call management, with mature alerting, escalation, and scheduling, the largest integration catalog (built up over 15+ years), and a strong fit for large enterprises. Opsgenie (Atlassian) integrates tightly with Jira and Confluence, is cheaper than PagerDuty, and suits teams already in the Atlassian ecosystem. For progressive delivery, all three expose webhooks that Argo Rollouts or Flagger can call.
Q: When should I use Canary versus Blue-Green?
A: Canary shifts traffic gradually (1% → 5% → 20% → 100%), which limits the impact of a bad release and makes rollback fast; it requires traffic splitting between services and suits services with enough traffic to produce meaningful metrics, where minimizing risk matters most. Blue-Green maintains two environments (blue = current, green = new) and switches traffic all at once; it is simpler to operate but all-or-nothing, so a bad release hits every user, and it doubles infrastructure cost. It suits services with little traffic or batch processing. Rule of thumb: canary for customer-facing services, blue-green for internal or batch services.
Q: Argo Rollouts or Flagger?
A: Argo Rollouts is a Kubernetes controller for progressive delivery: it is CRD-based (a Rollout replaces a Deployment), supports canary, blue-green, and experiments, provides analysis templates with metrics providers, and integrates well with Argo CD. Flagger is a progressive delivery operator from Weaveworks that works with existing Deployments (no new CRD to adopt for your workloads), supports Istio, Linkerd, NGINX, and Contour, and has webhooks and alerting built in. If you already use Argo CD, pick Argo Rollouts; if you want minimal changes and no new workload CRD, pick Flagger.
Q: How do I reduce MTTR (Mean Time to Recovery)?
A: MTTR breaks down into detection time + response time + resolution time, and each part can be attacked separately. Detection: automated canary analysis catches problems in 1-2 minutes, versus manual monitoring. Alert routing: send the alert to the right person immediately (PagerDuty/Incident.io routing rules). Automated rollback: roll back as soon as metrics breach thresholds instead of waiting on a manual decision. Runbooks: runbooks for common issues cut investigation time. Practice: game days and chaos engineering keep the team's response sharp. Post-mortems: learn from incidents so they do not recur. A good MTTR with progressive delivery is under 15 minutes; with automated rollback, under 5 minutes.
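The decomposition in this answer can be written out as simple arithmetic. The per-stage numbers below are illustrative, chosen to line up with the figures quoted in this article (32 minutes for the largely manual process, 5 minutes with automated rollback):

```python
# MTTR = detection time + response time + resolution time.
# Stage durations are illustrative, not measurements.
def mttr_minutes(detection, response, resolution):
    """Total recovery time as the sum of its three stages, in minutes."""
    return detection + response + resolution


manual = mttr_minutes(detection=10, response=8, resolution=14)
automated = mttr_minutes(detection=2, response=1, resolution=2)
print(f"manual monitoring:  {manual} min")    # 32 min
print(f"automated rollback: {automated} min")  # 5 min
```

The point of the breakdown is that automation attacks all three terms at once: canary analysis shrinks detection, routing rules shrink response, and automated rollback shrinks resolution.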