PagerDuty Incidents and Progressive Delivery

What Progressive Delivery Is and Why It Matters

Progressive Delivery is an approach to deploying software that exposes a new version to users gradually, one group at a time, instead of releasing it to everyone at once. This reduces risk: if something goes wrong, only a small group of users is affected, and the release can be rolled back before the problem spreads across the whole system.

The main Progressive Delivery patterns are Canary Release, which sends a small share of users (for example 5%) to the new version first; Blue-Green Deployment, which keeps two environments and switches traffic between them; Feature Flags, which turn features on or off in real time; and A/B Testing, which serves different versions to different user groups.
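
To make the canary idea concrete, here is a minimal Python sketch of how a fixed share of users could be assigned to the new version. It is an illustration only: in the setup described in this article the split happens at the traffic layer (service mesh or ingress), and the helper name `in_canary` and the hash-based bucketing are assumptions, not part of any tool mentioned here.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Return True if this user falls in the canary group (hypothetical helper)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash the user ID so the same user always lands in the same bucket (0-99)
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Roughly 5% of a synthetic user population should be routed to the canary
users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u, 5) for u in users) / len(users)
print(f"canary share at 5%: {share:.1%}")
```

Because the bucket depends only on the user ID, raising the percentage from 5 to 10 keeps every existing canary user in the canary group and only adds new ones.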

PagerDuty plays a key role as the incident management platform: it surfaces problems that appear during a progressive rollout and triggers an automatic rollback when needed.

Progressive Delivery Architecture with PagerDuty

Integrating PagerDuty into a Progressive Delivery pipeline means connecting several components:

CI/CD Pipeline: GitHub Actions, GitLab CI, or ArgoCD deploys the application and manages the rollout
Service Mesh / Ingress: Istio, Linkerd, or Nginx splits traffic between canary and stable
Monitoring Stack: Prometheus + Grafana tracks canary metrics against the baseline
Canary Analysis: Flagger or Argo Rollouts evaluates the canary results
PagerDuty: receives alerts from monitoring and canary analysis to notify the team and trigger automation

Configuring Argo Rollouts with PagerDuty

Argo Rollouts is a Progressive Delivery tool for Kubernetes that supports Canary and Blue-Green deployments. The following configuration shows a canary rollout and its analysis template, which the PagerDuty integration later in this article builds on.

# argo-rollout-canary.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-frontend
  namespace: production
spec:
  replicas: 10
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: web-frontend
  strategy:
    canary:
      canaryService: web-frontend-canary
      stableService: web-frontend-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: web-frontend-vsvc
              routes:
                - primary
      analysis:
        templates:
          - templateName: canary-analysis
        startingStep: 2
        args:
          - name: service-name
            value: web-frontend-canary
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 75
        - pause: { duration: 10m }
        - setWeight: 100
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web-frontend
          image: registry.company.com/web-frontend:v2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

---
# canary-analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
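metadata:
  name: canary-analysis
  # AnalysisTemplate is namespaced and must live in the same namespace as the Rollout
  namespace: production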
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      successCondition: result[0] < 0.05
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))

    - name: latency-p99
      interval: 60s
      successCondition: result[0] < 500
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m]))
              by (le)
            ) * 1000

    - name: success-rate
      interval: 60s
      successCondition: result[0] > 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}", status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
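
How `successCondition` and `failureLimit` interact can be sketched in a few lines of Python. This is a simplified model, not Argo Rollouts code: it assumes each measurement is a single number, that a metric fails once the count of failed measurements exceeds `failureLimit`, and it ignores inconclusive or errored measurements. The function name `metric_phase` is hypothetical.

```python
from typing import Callable, Iterable


def metric_phase(values: Iterable[float],
                 success: Callable[[float], bool],
                 failure_limit: int) -> str:
    """Simplified model: fail once failed measurements exceed failure_limit."""
    failures = 0
    for value in values:
        if not success(value):
            failures += 1
            if failures > failure_limit:
                return "Failed"
    return "Successful"


# Modeled on the error-rate metric above: success when value < 0.05, failureLimit 3
error_rates = [0.01, 0.08, 0.02, 0.09, 0.07, 0.06]
print(metric_phase(error_rates, lambda v: v < 0.05, failure_limit=3))
```

Under this model, `failureLimit: 3` tolerates three bad measurements and fails the metric on the fourth, which is why the sample series above ends in a failure.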

Configuring PagerDuty Alerts for Canary Failures

When canary analysis fails, the team must be notified immediately through PagerDuty, with details of which metric failed and whether the rollback succeeded.

# Python webhook that receives events from Argo Rollouts and forwards them to PagerDuty
import os
from datetime import datetime, timezone

import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

# Secrets come from the environment; the placeholders are fallbacks for local testing only
PAGERDUTY_ROUTING_KEY = os.environ.get("PAGERDUTY_ROUTING_KEY", "R0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
SLACK_WEBHOOK = os.environ.get("SLACK_WEBHOOK", "https://hooks.slack.com/services/T00/B00/xxx")
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

@app.route("/webhook/rollout", methods=["POST"])
def rollout_webhook():
    """Receive a webhook from Argo Rollouts notifications."""
    data = request.get_json(silent=True) or {}
    rollout_name = data.get("name", "unknown")
    phase = data.get("phase", "unknown")
    message = data.get("message", "")
    # One dedup key per rollout, so a later "Healthy" event resolves the same incident
    dedup_key = f"rollout-{rollout_name}"

    if phase in ("Degraded", "Failed"):
        # Send a PagerDuty trigger event
        pd_payload = {
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": dedup_key,
            "payload": {
                "summary": f"Canary Rollout Failed: {rollout_name}: {message}",
                "severity": "critical",
                "source": "argo-rollouts",
                "component": rollout_name,
                "group": "progressive-delivery",
                "class": "deployment",
                "custom_details": {
                    "rollout": rollout_name,
                    "phase": phase,
                    "message": message,
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "dashboard": f"https://grafana.company.com/d/rollouts/{rollout_name}"
                }
            },
            "links": [
                {
                    "href": f"https://argocd.company.com/rollouts/{rollout_name}",
                    "text": "Argo Rollouts Dashboard"
                }
            ]
        }
        resp = requests.post(PAGERDUTY_EVENTS_URL, json=pd_payload, timeout=10)
        print(f"PagerDuty response: {resp.status_code}")

        # Send a Slack notification
        slack_msg = {
            "text": f":rotating_light: *Canary Rollout Failed*\n"
                    f"*Rollout:* {rollout_name}\n"
                    f"*Phase:* {phase}\n"
                    f"*Message:* {message}\n"
                    f"*Action:* Automatic rollback triggered"
        }
        requests.post(SLACK_WEBHOOK, json=slack_msg, timeout=10)

    elif phase == "Healthy":
        # Resolve the PagerDuty incident opened for this rollout
        resolve_payload = {
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "resolve",
            "dedup_key": dedup_key
        }
        requests.post(PAGERDUTY_EVENTS_URL, json=resolve_payload, timeout=10)

    return jsonify({"status": "ok"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9095)

Blue-Green Deployment with PagerDuty

Blue-Green Deployment is another Progressive Delivery pattern. It keeps two environments (Blue = current, Green = new) and switches all traffic to Green once it is ready. The advantage is a very fast rollback, because traffic is simply switched back; the drawback is that any problem hits all users immediately.

# blue-green-rollout.yaml for Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-backend
  namespace: production
spec:
  replicas: 5
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: api-backend
  strategy:
    blueGreen:
      activeService: api-backend-active
      previewService: api-backend-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 300
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test
        args:
          - name: preview-url
            value: "http://api-backend-preview.production.svc.cluster.local"
      postPromotionAnalysis:
        templates:
          - templateName: post-deploy-check
        args:
          - name: active-url
            value: "http://api-backend-active.production.svc.cluster.local"
  template:
    metadata:
      labels:
        app: api-backend
    spec:
      containers:
        - name: api-backend
          image: registry.company.com/api-backend:v3.0.0
          ports:
            - containerPort: 3000

---
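# smoke-test.yaml: checks that run before promotion
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
  # Must live in the same namespace as the Rollout that references it
  namespace: production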
spec:
  args:
    - name: preview-url
  metrics:
    - name: health-check
      count: 5
      interval: 30s
      # A job metric succeeds when the Job exits with code 0, so no successCondition is set
      provider:
        job:
          spec:
            template:
              spec:
                containers:
                  - name: smoke-test
                    image: curlimages/curl:latest
                    command:
                      - sh
                      - -c
                      - |
                        STATUS=$(curl -s -o /dev/null -w "%{http_code}" {{args.preview-url}}/healthz)
                        if [ "$STATUS" = "200" ]; then echo "ok"; else echo "fail"; exit 1; fi
                restartPolicy: Never
            backoffLimit: 0

Feature Flags with PagerDuty Integration

Feature Flags are another Progressive Delivery technique: they let a feature be turned on or off in real time without a new deploy. When a feature flag causes problems, PagerDuty can trigger an automatic shutdown of the flag.

#!/bin/bash
# feature-flag-rollback.sh
# Auto-disables a feature flag when a PagerDuty webhook fires
set -euo pipefail

# The API key is read from the environment; the flag key is the first argument
LAUNCHDARKLY_API_KEY="${LAUNCHDARKLY_API_KEY:-}"
PROJECT_KEY="default"
ENVIRONMENT_KEY="production"
FLAG_KEY="${1:-}"

if [ -z "$FLAG_KEY" ]; then
  echo "Usage: $0 <flag-key>"
  exit 1
fi

echo "Disabling feature flag: ${FLAG_KEY}"

# Turn the flag off through the LaunchDarkly API (semantic patch)
curl -sf -X PATCH \
  "https://app.launchdarkly.com/api/v2/flags/${PROJECT_KEY}/${FLAG_KEY}" \
  -H "Authorization: ${LAUNCHDARKLY_API_KEY}" \
  -H "Content-Type: application/json; domain-model=launchdarkly.semanticpatch" \
  -d "{
    \"environmentKey\": \"${ENVIRONMENT_KEY}\",
    \"instructions\": [{ \"kind\": \"turnFlagOff\" }],
    \"comment\": \"Auto-disabled by PagerDuty incident response\"
  }"

echo "Feature flag disabled"

# Resolve the PagerDuty incident
curl -sf -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d "{
    \"routing_key\": \"R0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\",
    \"event_action\": \"resolve\",
    \"dedup_key\": \"feature-flag-${FLAG_KEY}\"
  }"

echo "PagerDuty incident resolved"

Prometheus Alert Rules for Progressive Delivery

Dedicated alert rules for canary traffic are needed to catch problems before the new version is promoted to all users.

# prometheus-progressive-delivery-rules.yml
groups:
  - name: progressive_delivery
    interval: 15s
    rules:
      - alert: CanaryHighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{canary="true", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{canary="true"}[5m]))
          ) > 0.05
        for: 1m
        labels:
          severity: critical
          team: platform
          deployment_type: canary
        annotations:
          summary: "Canary error rate is high: {{ $value | humanizePercentage }}"
          description: "The canary version's error rate is above 5%; roll back immediately"
          runbook: "https://wiki.company.com/runbooks/canary-rollback"

      - alert: CanaryHighLatency
        expr: |
          (
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{canary="true"}[5m])) by (le)
            )
            /
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{canary="false"}[5m])) by (le)
            )
          ) > 1.5
        for: 2m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Canary P99 latency is more than 1.5x the stable version"

      - alert: BlueGreenPostDeployError
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total[2m])) > 0.01
        for: 30s
        labels:
          severity: critical
          team: platform
          deployment_type: blue-green
        annotations:
          summary: "Error rate after the Blue-Green switch is high: {{ $value | humanizePercentage }}"
          description: "High error rate detected after the traffic switch; consider rolling back"

Best Practices for Progressive Delivery and Incident Management

Define rollback criteria in advance: decide the conditions that trigger an automatic rollback before the deploy starts, for example error rate above 5%, P99 latency above 500ms, or success rate below 99%
Use automated canary analysis: do not rely on human judgment alone. Let tools such as Kayenta or Argo Rollouts analysis make the promote/rollback decision automatically
Set an appropriate bake time: do not promote a canary too quickly. Watch the metrics for at least 5-10 minutes at each step
Split PagerDuty services by deployment type: create separate services for Canary, Blue-Green, and Feature Flag rollouts so each can have its own escalation policy
Run a post-mortem every time a canary fails: analyze why it failed so that pre-deploy testing can be improved
Use comprehensive observability: monitor both business metrics (conversion, revenue) and technical metrics (error rate, latency), because some problems never show up in technical metrics

What is Progressive Delivery and how does it differ from Continuous Delivery?

Progressive Delivery means rolling a deployment out gradually, one user group at a time, for example a Canary Release that starts at 5% and then grows. It differs from a plain Continuous Delivery release, where a new version typically reaches all users at once. The benefit is lower risk and the chance to catch problems before they affect every user.

How does PagerDuty help with Progressive Delivery?

PagerDuty receives alerts about rising error rates and latency during a canary release and notifies the team immediately. It can trigger an automatic rollback through webhooks or Automation Actions when an incident occurs during a progressive rollout, which shortens the team's response time.

How should canary analysis thresholds be set?

Compare the canary against the baseline: error rate no more than 1.5x the baseline, P99 latency no more than 1.3x the baseline, and success rate no lower than 99%. These values should be tuned to the application's behavior and its real traffic patterns.
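
The relative thresholds above can be expressed as a small comparison function. The following Python sketch is one hypothetical way to do it: the function name `within_limits`, the dictionary keys, and the sample numbers are all illustrative, and only the three ratios come from the text.

```python
def within_limits(canary: dict, baseline: dict,
                  max_error_ratio: float = 1.5,
                  max_latency_ratio: float = 1.3,
                  min_success_rate: float = 0.99) -> dict:
    """Compare canary metrics with the baseline; True means the check passed."""
    return {
        # Multiplying instead of dividing avoids a division by zero
        # when the baseline error rate is 0
        "error_rate": canary["error_rate"] <= max_error_ratio * baseline["error_rate"],
        "p99_latency": canary["p99_ms"] <= max_latency_ratio * baseline["p99_ms"],
        "success_rate": canary["success_rate"] >= min_success_rate,
    }


# Illustrative numbers only
baseline = {"error_rate": 0.004, "p99_ms": 300.0, "success_rate": 0.996}
canary = {"error_rate": 0.005, "p99_ms": 420.0, "success_rate": 0.995}

checks = within_limits(canary, baseline)
print(checks)
print("promote" if all(checks.values()) else "rollback")
```

With these sample values the error rate and success rate pass, but the canary P99 of 420 ms exceeds 1.3 x 300 ms = 390 ms, so the decision is to roll back. Note that a baseline error rate of exactly 0 makes any canary error fail the first check, which is one reason these ratios need tuning per application.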

How does Blue-Green Deployment differ from Canary in terms of incident management?

Blue-Green switches all traffic to the new environment at once, so rollback is fast (just switch back) but the impact is wider because every user is affected. Canary increases traffic gradually, which limits the impact to a small group. PagerDuty thresholds should therefore be set differently for each pattern.
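
The difference in exposure can be worked out from the canary step schedule used earlier in this article (5%, 10%, 25%, 50%, 75% with pauses of 2, 5, 5, 10, and 10 minutes). The Python sketch below is a simplified model: it assumes weight changes are instantaneous and that analysis never aborts the rollout, and the helper name `canary_weight_at` is hypothetical.

```python
# (weight %, pause minutes) pairs taken from the canary steps earlier in the article
CANARY_STEPS = [(5, 2), (10, 5), (25, 5), (50, 10), (75, 10)]


def canary_weight_at(minutes_elapsed: float) -> int:
    """Traffic share (%) on the new version after the given time (hypothetical helper)."""
    boundary = 0.0
    for weight, pause in CANARY_STEPS:
        boundary += pause
        if minutes_elapsed < boundary:
            return weight
    return 100  # all pauses finished: full promotion


# Share of traffic exposed if a problem is detected t minutes after the rollout starts
for t in (1, 5, 10, 30):
    print(f"t={t:>2} min  canary={canary_weight_at(t):>3}%  blue-green=100%")
```

A problem caught in the first two minutes reaches about 5% of canary traffic, while the same problem after a Blue-Green switch reaches all of it. That gap is the reason the alerting thresholds and escalation urgency differ between the two patterns.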

Summary and Practical Guidance

Progressive Delivery is a safe, standard way to deploy for organizations that want to reduce release risk. Integrating PagerDuty into the Progressive Delivery pipeline helps teams detect problems quickly, roll back automatically, and follow a clear incident response process. The key points are to define rollback criteria in advance, let automated analysis make the decision, and review the results after every deploy to keep improving the process.