CircleCI Orbs Post-mortem Analysis

Published 28 May 2026 (B.E. 2569)

Post-mortem + CircleCI

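This article covers post-mortem analysis for deploys run through CircleCI orbs: building an incident timeline, finding the root cause, keeping the review blameless, tracking action items, and preventing repeats with CI/CD pipeline monitoring, alerts, and automation.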

Phase       | Duration         | Activities                       | Owner              | Output
------------|------------------|----------------------------------|--------------------|--------------------
Detection   | 0-5 min          | Alert fired, on-call notified    | Monitoring         | Incident declared
Triage      | 5-15 min         | Assess impact, assign severity   | On-call engineer   | Severity level
Mitigation  | 15-60 min        | Rollback, hotfix, or workaround  | Incident commander | Service restored
Resolution  | 1-4 hours        | Root cause fix deployed          | Dev team           | Permanent fix
Post-mortem | 1-3 days after   | Analysis meeting, document       | Team lead          | Post-mortem doc
Follow-up   | 1-2 sprints      | Action items completed           | Assigned owners    | Prevention measures
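
The phase targets in the table can be encoded as data so that a finished incident is checked against them. The sketch below is one assumed way to do that: the `PHASE_BUDGET_MIN` name, the `over_budget` helper, and the choice to treat each range's upper bound as a budget measured from incident start are illustrative, not part of any CircleCI tooling.

```python
# Upper bound of each phase in minutes from incident start, taken from the table above.
PHASE_BUDGET_MIN = {
    "Detection": 5,
    "Triage": 15,
    "Mitigation": 60,
    "Resolution": 240,
}

def over_budget(actual_min: dict[str, int]) -> list[str]:
    """Return the phases that finished later than the table's upper bound."""
    return [
        phase
        for phase, budget in PHASE_BUDGET_MIN.items()
        if phase in actual_min and actual_min[phase] > budget
    ]

# Hypothetical incident: detected after 7 minutes, mitigated after 25 minutes.
sample = {"Detection": 7, "Triage": 10, "Mitigation": 25}
print(over_budget(sample))
```

Phases missing from the input are skipped, so the check can run while an incident is still open.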

Incident Analysis with CI/CD Data

# === CircleCI Incident Analysis ===

# CircleCI API v2: list recent pipelines on main
# curl -s -H "Circle-Token: $CIRCLE_TOKEN" \
#   "https://circleci.com/api/v2/project/gh/org/repo/pipeline?branch=main" | \
#   jq '.items[:10] | .[] | {id: .id, state: .state, created_at: .created_at}'

# Get workflow details (workflow items carry created_at/stopped_at rather than a duration field)
# curl -s -H "Circle-Token: $CIRCLE_TOKEN" \
#   "https://circleci.com/api/v2/pipeline/$PIPELINE_ID/workflow" | \
#   jq '.items[] | {name: .name, status: .status, created_at: .created_at, stopped_at: .stopped_at}'

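# .circleci/config.yml with rollback and Slack notification orbs
# NOTE: my-org/rollback is a hypothetical private orb, assumed to accept the
# `when` and `version` parameters. deploy-to-production and health-check are
# assumed to be commands defined elsewhere in this config.
# version: 2.1
# orbs:
#   slack: circleci/slack@4.12.0
#   rollback: my-org/rollback@1.0.0
#
# jobs:
#   deploy:
#     docker:
#       - image: cimg/base:stable
#     steps:
#       - deploy-to-production
#       - health-check:
#           url: https://api.myapp.com/health
#           retries: 5
#           interval: 10s
#       - rollback/auto:
#           when: on_fail
#           version: previous
#       - slack/notify:
#           event: fail
#           channel: incidents
#           # DEPLOY_FAILED must exist as an environment variable holding a custom template
#           template: DEPLOY_FAILED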
from dataclasses import dataclass

@dataclass
class IncidentTimeline:
    time: str
    event: str
    source: str
    impact: str

timeline = [
    IncidentTimeline("14:00", "Deploy #1234 triggered (main branch)", "CircleCI", "None"),
    IncidentTimeline("14:05", "Deploy completed, health check passed", "CircleCI", "None"),
    IncidentTimeline("14:12", "Error rate spike 1% → 25%", "Datadog alert", "Users see 500 errors"),
    IncidentTimeline("14:15", "On-call paged, incident declared SEV-2", "PagerDuty", "25% requests failing"),
    IncidentTimeline("14:20", "Root cause identified: DB migration issue", "Engineer", "Ongoing"),
    IncidentTimeline("14:25", "Rollback initiated via CircleCI", "CircleCI", "Reducing"),
    IncidentTimeline("14:30", "Rollback complete, error rate back to 1%", "Datadog", "Resolved"),
    IncidentTimeline("14:45", "All clear, monitoring continues", "Team", "None"),
]

print("=== Incident Timeline ===")
for t in timeline:
    print(f"  [{t.time}] {t.event}")
    print(f"    Source: {t.source} | Impact: {t.impact}")
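
The timeline above also yields the intervals usually reported in a post-mortem. The sketch below assumes every timestamp is `HH:MM` on the same day and picks the deploy, alert, page, and rollback-complete entries by hand; the `minutes_between` helper and the dictionary keys are hypothetical names.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps copied from the timeline above.
deployed = "14:05"   # deploy completed
alerted = "14:12"    # error rate spike detected
paged = "14:15"      # on-call paged, incident declared
restored = "14:30"   # rollback complete

intervals = {
    "deploy_to_alert": minutes_between(deployed, alerted),
    "alert_to_page": minutes_between(alerted, paged),
    "alert_to_restore": minutes_between(alerted, restored),
}
for name, minutes in intervals.items():
    print(f"{name}: {minutes} min")
```

Keeping these as named intervals makes it easier to compare one incident with the next.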

Root Cause Analysis

# === 5 Whys Analysis ===

from dataclasses import dataclass

@dataclass
class WhyStep:
    level: int
    question: str
    answer: str

five_whys = [
    WhyStep(1, "Why did users see 500 errors?",
        "Database queries failed due to missing column"),
    WhyStep(2, "Why was the column missing?",
        "DB migration ran but was incompatible with old code"),
    WhyStep(3, "Why was an incompatible migration deployed?",
        "Migration and code change were in separate deploys"),
    WhyStep(4, "Why were they in separate deploys?",
        "No process required the migration and code to be in the same PR"),
    WhyStep(5, "Why was there no such process?",
        "Deploy guidelines didn't cover DB migration ordering"),
]

print("=== 5 Whys ===")
for w in five_whys:
    print(f"  Why #{w.level}: {w.question}")
    print(f"    → {w.answer}")

# Action Items
@dataclass
class ActionItem:
    action: str
    priority: str
    owner: str
    deadline: str
    status: str

actions = [
    ActionItem("Add DB migration check to CI pipeline", "P0", "DevOps team", "This sprint", "In progress"),
    ActionItem("Create deploy runbook for DB changes", "P0", "Tech lead", "This sprint", "Todo"),
    ActionItem("Add integration test for DB schema", "P1", "Backend team", "Next sprint", "Todo"),
    ActionItem("Implement canary deploy (10% → 50% → 100%)", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Add auto-rollback on error rate > 5%", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Update post-mortem template with CI/CD section", "P2", "Team lead", "Next sprint", "Todo"),
]

print("\n\n=== Action Items ===")
for a in actions:
    print(f"  [{a.priority}] {a.action}")
    print(f"    Owner: {a.owner} | Deadline: {a.deadline} | Status: {a.status}")
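
Action items are easier to follow up when the open ones are grouped by owner. The sketch below is a minimal, assumed approach: the `Item` class and `open_items_by_owner` helper are hypothetical, the `"Done"` status is an assumption (the list above only uses "In progress" and "Todo"), and sorting priorities as strings only works for single-digit levels such as P0 to P9.

```python
from dataclasses import dataclass

@dataclass
class Item:
    action: str
    priority: str  # "P0" is the most urgent
    owner: str
    status: str

def open_items_by_owner(items: list[Item]) -> dict[str, list[str]]:
    """Group unfinished items by owner, most urgent first within each owner."""
    grouped: dict[str, list[str]] = {}
    for item in sorted(items, key=lambda i: i.priority):
        if item.status.lower() != "done":
            grouped.setdefault(item.owner, []).append(f"[{item.priority}] {item.action}")
    return grouped

items = [
    Item("Implement canary deploy", "P1", "DevOps team", "Todo"),
    Item("Add DB migration check to CI pipeline", "P0", "DevOps team", "In progress"),
    Item("Create deploy runbook for DB changes", "P0", "Tech lead", "Done"),
]
for owner, todo in open_items_by_owner(items).items():
    print(owner, todo)
```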

Prevention Automation

# === Automated Prevention with CircleCI ===

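# Canary Deploy Config (sketch: the scripts/ paths and their flags are hypothetical)
# jobs:
#   canary-deploy:
#     docker:
#       - image: cimg/base:stable
#     steps:
#       - checkout
#       - run:
#           name: Deploy canary to 10% of traffic
#           command: ./scripts/deploy.sh --percentage 10
#       - run:
#           name: Wait 5 minutes, then check metrics
#           command: |
#             sleep 300
#             ./scripts/check-metrics.sh --error-threshold 2 --latency-threshold-ms 500
#       - run:
#           name: Deploy to 100% (runs only if the checks above passed)
#           command: ./scripts/deploy.sh --percentage 100
#       - run:
#           name: Roll back (runs only if an earlier step failed)
#           when: on_fail
#           command: ./scripts/rollback.sh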
#
# Health Check Orb (my-org/health-check is a hypothetical private orb)
# orbs:
#   health: my-org/health-check@1.0.0
# jobs:
#   post-deploy:
#     steps:
#       - health/check:
#           endpoints:
#             - url: https://api.myapp.com/health
#               expected_status: 200
#             - url: https://api.myapp.com/db/health
#               expected_status: 200
#           timeout: 30s
#           retries: 3

from dataclasses import dataclass

@dataclass
class PreventionMeasure:
    measure: str
    trigger: str
    automation: str
    effectiveness: str

measures = [
    PreventionMeasure("Pre-deploy DB check", "Migration file detected in PR",
        "CI job validates migration compatibility", "Catches 80% of DB issues"),
    PreventionMeasure("Canary deploy", "Every production deploy",
        "10% traffic → check metrics → full deploy", "Limits blast radius to 10%"),
    PreventionMeasure("Auto rollback", "Error rate > 5% post-deploy",
        "CircleCI triggers rollback pipeline", "MTTR reduced from 30min to 5min"),
    PreventionMeasure("Health check gate", "After every deploy",
        "Hit /health endpoint, fail if not 200", "Catches service startup issues"),
    PreventionMeasure("Slack incident bot", "Deploy failure or rollback",
        "Auto-create incident channel, notify team", "Faster coordination"),
    PreventionMeasure("Post-mortem reminder", "3 days after incident",
        "Slack reminder to schedule post-mortem meeting", "Ensures follow-through"),
]

print("Prevention Measures:")
for m in measures:
    print(f"  [{m.measure}] Trigger: {m.trigger}")
    print(f"    Automation: {m.automation}")
    print(f"    Impact: {m.effectiveness}")

# DORA Metrics tracking
dora = {
    "Deployment Frequency": "6.2/day → target 10/day",
    "Lead Time for Changes": "2.5 hours → target 1 hour",
    "MTTR": "22 min → target 15 min (auto-rollback helps)",
    "Change Failure Rate": "8% → target 5% (canary deploy helps)",
}

print("\n\nDORA Metrics:")
for k, v in dora.items():
    print(f"  [{k}]: {v}")
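
The canary thresholds from the config above (2% error rate, 500 ms latency) come down to a small decision function. The sketch below is an assumption about how such a gate could be written: `CanaryMetrics`, `canary_decision`, and the use of p95 latency are hypothetical choices, and the metric values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    requests: int
    errors: int
    p95_latency_ms: float

def canary_decision(m: CanaryMetrics,
                    max_error_rate: float = 0.02,
                    max_latency_ms: float = 500.0) -> str:
    """Return "promote" if the canary is within both thresholds, else "rollback"."""
    if m.requests == 0:
        return "rollback"  # no traffic means no evidence, so fail safe
    error_rate = m.errors / m.requests
    if error_rate > max_error_rate or m.p95_latency_ms > max_latency_ms:
        return "rollback"
    return "promote"

# Made-up sample metrics: 0.8% errors passes, 4% errors does not.
print(canary_decision(CanaryMetrics(requests=1000, errors=8, p95_latency_ms=310)))
print(canary_decision(CanaryMetrics(requests=1000, errors=40, p95_latency_ms=310)))
```

A CI step could call a function like this and exit non-zero on "rollback", which would let a later step marked `when: on_fail` run.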

Tips

  • Blameless: A post-mortem must be blameless. Blame the process, not the people.
  • Timeline: Build a detailed timeline; it makes the root cause easier to find.
  • 5 Whys: Ask "why" five times to reach the real root cause.
  • Action Items: Give each one an owner and a deadline, and track them every sprint.
  • Automate: Automate every action item that can be automated.

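What is post-mortem analysis?

Post-mortem analysis examines an incident to find its cause and prevent a repeat. It is blameless and combines a timeline, root cause analysis with the 5 Whys, action items, and lessons learned with CI/CD data on deploys, builds, tests, and config changes.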
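
Using CI/CD data in a post-mortem often starts with finding which deploy preceded the alert. The sketch below assumes a pipeline list shaped like the `items` array returned by the CircleCI pipeline endpoint (with `id`, `state`, and `created_at` fields) and ISO 8601 timestamps; the `parse_ts` and `last_pipeline_before` helpers and the sample data are hypothetical.

```python
import json
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing "Z" is treated as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def last_pipeline_before(pipelines_json: str, alert_time: str):
    """Return the newest pipeline created at or before alert_time, or None."""
    alert = parse_ts(alert_time)
    items = json.loads(pipelines_json)["items"]
    earlier = [p for p in items if parse_ts(p["created_at"]) <= alert]
    return max(earlier, key=lambda p: parse_ts(p["created_at"]), default=None)

# Made-up sample data for illustration only.
sample = json.dumps({"items": [
    {"id": "p-3", "state": "created", "created_at": "2026-01-01T14:20:00Z"},
    {"id": "p-2", "state": "created", "created_at": "2026-01-01T14:00:00Z"},
    {"id": "p-1", "state": "created", "created_at": "2026-01-01T09:30:00Z"},
]})
suspect = last_pipeline_before(sample, "2026-01-01T14:12:00Z")
print(suspect["id"])
```

The newest pipeline before the alert is only a starting suspect; the timeline and the 5 Whys still have to confirm it.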