CircleCI Orbs Post-mortem Analysis

2025-12-26 · อ. บอม, SiamCafe.net · 8,215 words

Post-mortem + CircleCI

A post-mortem analysis for CircleCI Orbs covers the incident timeline, the root cause, a blameless culture, action items, and prevention. It draws on deploy data from the CI/CD pipeline together with monitoring, alerts, and automation.

Phase       | Duration         | Activities                      | Owner              | Output
Detection   | 0-5 min          | Alert fired, on-call notified   | Monitoring         | Incident declared
Triage      | 5-15 min         | Assess impact, assign severity  | On-call engineer   | Severity level
Mitigation  | 15-60 min        | Rollback, hotfix, or workaround | Incident commander | Service restored
Resolution  | 1-4 hours        | Root cause fix deployed         | Dev team           | Permanent fix
Post-mortem | 1-3 days after   | Analysis meeting, document      | Team lead          | Post-mortem doc
Follow-up   | 1-2 sprints      | Action items completed          | Assigned owners    | Prevention measures

Incident Analysis with CI/CD Data

# === CircleCI Incident Analysis ===

# CircleCI API v2: list the 10 most recent pipelines on main
# curl -H "Circle-Token: $CIRCLE_TOKEN" \
#   "https://circleci.com/api/v2/project/gh/org/repo/pipeline?branch=main" | \
#   jq '.items[:10] | .[] | {id: .id, state: .state, created_at: .created_at}'

# Get workflow details. The workflow object carries no duration field,
# so derive the duration from created_at and stopped_at.
# curl -H "Circle-Token: $CIRCLE_TOKEN" \
#   "https://circleci.com/api/v2/pipeline/$PIPELINE_ID/workflow" | \
#   jq '.items[] | {name: .name, status: .status, created_at: .created_at, stopped_at: .stopped_at}'

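# .circleci/config.yml with post-mortem orbs
# (my-org/rollback is a placeholder for a private orb; deploy-to-production
#  and health-check stand for commands you define yourself)
# version: 2.1   # orbs require config version 2.1
# orbs:
#   slack: circleci/slack@4.12.0
#   rollback: my-org/rollback@1.0.0
#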
# jobs:
#   deploy:
#     steps:
#       - deploy-to-production
#       - health-check:
#           url: https://api.myapp.com/health
#           retries: 5
#           interval: 10s
#       - rollback/auto:
#           when: on_fail
#           version: previous
#       - slack/notify:
#           event: fail
#           channel: incidents
#           template: DEPLOY_FAILED

from dataclasses import dataclass

@dataclass
class IncidentTimeline:
    time: str
    event: str
    source: str
    impact: str

timeline = [
    IncidentTimeline("14:00", "Deploy #1234 triggered (main branch)", "CircleCI", "None"),
    IncidentTimeline("14:05", "Deploy completed, health check passed", "CircleCI", "None"),
    IncidentTimeline("14:12", "Error rate spike 5% → 25%", "Datadog alert", "Users see 500 errors"),
    IncidentTimeline("14:15", "On-call paged, incident declared SEV-2", "PagerDuty", "25% requests failing"),
    IncidentTimeline("14:20", "Root cause identified: DB migration issue", "Engineer", "Ongoing"),
    IncidentTimeline("14:25", "Rollback initiated via CircleCI", "CircleCI", "Reducing"),
    IncidentTimeline("14:30", "Rollback complete, error rate back to 1%", "Datadog", "Resolved"),
    IncidentTimeline("14:45", "All clear, monitoring continues", "Team", "None"),
]

print("=== Incident Timeline ===")
for t in timeline:
    print(f"  [{t.time}] {t.event}")
    print(f"    Source: {t.source} | Impact: {t.impact}")
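The timeline above already holds enough data to derive response-time figures for the post-mortem document. The following is a minimal sketch, assuming all events fall on the same day. Which rows count as impact start, declaration, and recovery is an interpretation of the timeline, and the key names in `events` are hypothetical.

```python
from datetime import datetime

# Times copied from the incident timeline; the key names are illustrative.
events = {
    "impact_start": "14:12",  # error rate spike
    "declared": "14:15",      # on-call paged, incident declared
    "recovered": "14:30",     # rollback complete
}

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

time_to_detect = minutes_between(events["impact_start"], events["declared"])
time_to_mitigate = minutes_between(events["declared"], events["recovered"])
time_to_recover = minutes_between(events["impact_start"], events["recovered"])

print(f"Time to detect:   {time_to_detect} min")
print(f"Time to mitigate: {time_to_mitigate} min")
print(f"Time to recover:  {time_to_recover} min")
```

For this incident the sketch reports 3, 15, and 18 minutes, which can be compared against the phase targets in the table at the top of the article.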

Root Cause Analysis

# === 5 Whys Analysis ===

from dataclasses import dataclass

@dataclass
class WhyStep:
    level: int
    question: str
    answer: str

five_whys = [
    WhyStep(1, "Why did users see 500 errors?",
        "Database queries failed due to missing column"),
    WhyStep(2, "Why was the column missing?",
        "DB migration ran but was incompatible with old code"),
    WhyStep(3, "Why was incompatible migration deployed?",
        "Migration and code change were in separate deploys"),
    WhyStep(4, "Why were they in separate deploys?",
        "No process requiring migration + code in same PR"),
    WhyStep(5, "Why was there no such process?",
        "Deploy guidelines didn't cover DB migration ordering"),
]

print("=== 5 Whys ===")
for w in five_whys:
    print(f"  Why #{w.level}: {w.question}")
    print(f"    → {w.answer}")

# Action Items
@dataclass
class ActionItem:
    action: str
    priority: str
    owner: str
    deadline: str
    status: str

actions = [
    ActionItem("Add DB migration check to CI pipeline", "P0", "DevOps team", "This sprint", "In progress"),
    ActionItem("Create deploy runbook for DB changes", "P0", "Tech lead", "This sprint", "Todo"),
    ActionItem("Add integration test for DB schema", "P1", "Backend team", "Next sprint", "Todo"),
    ActionItem("Implement canary deploy (10% → 50% → 100%)", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Add auto-rollback on error rate > 5%", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Update post-mortem template with CI/CD section", "P2", "Team lead", "Next sprint", "Todo"),
]

print("\n\n=== Action Items ===")
for a in actions:
    print(f"  [{a.priority}] {a.action}")
    print(f"    Owner: {a.owner} | Deadline: {a.deadline} | Status: {a.status}")
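The P0 action item "Add DB migration check to CI pipeline" could start as a simple static scan of migration files. The sketch below is one possible shape, not a complete compatibility checker. The pattern list and the function name `find_risky_statements` are assumptions made for illustration, and splitting on `;` is a naive stand-in for a real SQL parser.

```python
import re

# Hypothetical rule set: statements likely to break code that still
# reads the old schema (the failure mode found in the 5 Whys).
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r"\bRENAME\s+(COLUMN|TO)\b", re.IGNORECASE),
]

def find_risky_statements(migration_sql: str) -> list:
    """Return the SQL statements that match a destructive pattern."""
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    return [s for s in statements
            if any(p.search(s) for p in DESTRUCTIVE_PATTERNS)]

sample_migration = """
ALTER TABLE users ADD COLUMN nickname TEXT;
ALTER TABLE users DROP COLUMN legacy_name;
"""

risky = find_risky_statements(sample_migration)
for stmt in risky:
    print(f"RISKY: {stmt}")
```

In a CI job, the script would exit with a non-zero status when `risky` is non-empty, so the pipeline stops before the deploy step and the migration gets a manual review.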

Prevention Automation

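# === Automated Prevention with CircleCI ===

from dataclasses import dataclass

# Canary Deploy Config (illustrative: deploy-canary, wait, check-metrics,
# deploy-full and rollback are custom commands, not built-in CircleCI steps)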
# jobs:
#   canary-deploy:
#     steps:
#       - deploy-canary:
#           percentage: 10
#       - wait: { duration: 5m }
#       - check-metrics:
#           error_threshold: 2%
#           latency_threshold: 500ms
#       - deploy-full:
#           when: metrics_pass
#       - rollback:
#           when: metrics_fail
#
# Health Check Orb
# orbs:
#   health: my-org/health-check@1.0.0
# jobs:
#   post-deploy:
#     steps:
#       - health/check:
#           endpoints:
#             - url: https://api.myapp.com/health
#               expected_status: 200
#             - url: https://api.myapp.com/db/health
#               expected_status: 200
#           timeout: 30s
#           retries: 3

@dataclass
class PreventionMeasure:
    measure: str
    trigger: str
    automation: str
    effectiveness: str

measures = [
    PreventionMeasure("Pre-deploy DB check", "Migration file detected in PR",
        "CI job validates migration compatibility", "Catches 80% of DB issues"),
    PreventionMeasure("Canary deploy", "Every production deploy",
        "10% traffic → check metrics → full deploy", "Limits blast radius to 10%"),
    PreventionMeasure("Auto rollback", "Error rate > 5% post-deploy",
        "CircleCI triggers rollback pipeline", "MTTR reduced from 30min to 5min"),
    PreventionMeasure("Health check gate", "After every deploy",
        "Hit /health endpoint, fail if not 200", "Catches service startup issues"),
    PreventionMeasure("Slack incident bot", "Deploy failure or rollback",
        "Auto-create incident channel, notify team", "Faster coordination"),
    PreventionMeasure("Post-mortem reminder", "3 days after incident",
        "Slack reminder to schedule post-mortem meeting", "Ensures follow-through"),
]

print("Prevention Measures:")
for m in measures:
    print(f"  [{m.measure}] Trigger: {m.trigger}")
    print(f"    Automation: {m.automation}")
    print(f"    Impact: {m.effectiveness}")

# DORA Metrics tracking
dora = {
    "Deployment Frequency": "6.2/day → target 10/day",
    "Lead Time for Changes": "2.5 hours → target 1 hour",
    "MTTR": "22 min → target 15 min (auto-rollback helps)",
    "Change Failure Rate": "8% → target 5% (canary deploy helps)",
}

print("\n\nDORA Metrics:")
for k, v in dora.items():
    print(f"  [{k}]: {v}")
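The canary gate and the auto-rollback rule listed above both reduce to threshold comparisons. Here is a minimal sketch of that decision logic, using the thresholds from the config comments (2% errors and 500 ms latency for the canary, 5% errors for auto-rollback). The names `CanaryMetrics`, `canary_decision`, and `should_auto_rollback` are hypothetical. The source does not say which latency percentile is meant, so `latency_ms` stands for whatever figure your monitoring reports.

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    error_rate: float   # fraction of failed requests, 0.02 means 2%
    latency_ms: float   # latency figure reported by monitoring

def canary_decision(m: CanaryMetrics,
                    error_threshold: float = 0.02,
                    latency_threshold_ms: float = 500.0) -> str:
    """Promote the canary only if both metrics stay within thresholds."""
    if m.error_rate > error_threshold or m.latency_ms > latency_threshold_ms:
        return "rollback"
    return "promote"

def should_auto_rollback(error_rate: float, threshold: float = 0.05) -> bool:
    """Post-deploy rule: roll back when the error rate exceeds 5%."""
    return error_rate > threshold

print(canary_decision(CanaryMetrics(error_rate=0.01, latency_ms=320.0)))
print(canary_decision(CanaryMetrics(error_rate=0.25, latency_ms=320.0)))
print(should_auto_rollback(0.25))
```

With the 25% error rate from the incident timeline, both rules would have chosen a rollback without waiting for a human.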

Tips

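What is Post-mortem Analysis?

Post-mortem analysis examines an incident to find its cause and prevent a repeat. It is blameless and covers the timeline, root cause, 5 Whys, action items, and lessons learned, drawing on CI/CD data from deploys, builds, tests, and config changes.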

How do you use CircleCI data in the analysis?

Look at the build timeline, deploys, config changes, test results, and workflow duration, all available through the API and the dashboard. Track deploy frequency, failure rate, and MTTR, and use notifications and rollback automation.
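As a rough illustration of turning workflow data into numbers, the sketch below computes workflow durations and a failure rate. It assumes records shaped like the `items` returned by the pipeline workflow endpoint (`name`, `status`, `created_at`, `stopped_at`). The sample records are invented for the example, not real API output. The failure rate here counts failed workflows, which is only a proxy for the DORA change failure rate.

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing Z (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def duration_seconds(wf: dict) -> float:
    """Workflow duration derived from created_at and stopped_at."""
    return (parse_ts(wf["stopped_at"]) - parse_ts(wf["created_at"])).total_seconds()

# Invented sample records, shaped like workflow items from the API.
workflows = [
    {"name": "deploy", "status": "success",
     "created_at": "2025-01-01T14:00:00Z", "stopped_at": "2025-01-01T14:05:00Z"},
    {"name": "deploy", "status": "failed",
     "created_at": "2025-01-01T15:00:00Z", "stopped_at": "2025-01-01T15:02:30Z"},
    {"name": "deploy", "status": "success",
     "created_at": "2025-01-01T16:00:00Z", "stopped_at": "2025-01-01T16:04:00Z"},
    {"name": "deploy", "status": "success",
     "created_at": "2025-01-01T17:00:00Z", "stopped_at": "2025-01-01T17:05:30Z"},
]

durations = [duration_seconds(w) for w in workflows]
average_duration = sum(durations) / len(durations)
failure_rate = sum(w["status"] == "failed" for w in workflows) / len(workflows)

print(f"Average workflow duration: {average_duration:.0f} s")
print(f"Workflow failure rate: {failure_rate:.0%}")
```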

How do you create a post-mortem template?

Include an incident summary, timeline, impact, root cause, contributing factors, what went well, what went wrong, action items with owner and deadline, lessons learned, and a follow-up date.
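The template sections listed above can be generated as a plain-text skeleton so that every post-mortem starts from the same structure. This is a minimal sketch. The function name `render_template`, the "TODO" placeholder, and the output layout are assumptions for illustration.

```python
# Section names taken from the template description.
SECTIONS = [
    "Incident Summary",
    "Timeline",
    "Impact",
    "Root Cause",
    "Contributing Factors",
    "What Went Well",
    "What Went Wrong",
    "Action Items (Owner, Deadline)",
    "Lessons Learned",
    "Follow-up Date",
]

def render_template(title, filled=None):
    """Build a plain-text post-mortem skeleton; unfilled sections get TODO."""
    filled = filled or {}
    lines = [f"Post-mortem: {title}", ""]
    for section in SECTIONS:
        lines.append(section)
        lines.append(filled.get(section, "TODO"))
        lines.append("")
    return "\n".join(lines)

doc = render_template(
    "Deploy #1234 error spike",
    {"Root Cause": "DB migration was incompatible with old code"},
)
print(doc)
```

Sections left as TODO make it obvious in review which parts of the document still need an owner.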

How do you prevent repeat incidents?

Use automated tests (integration and E2E) in the CI pipeline, canary deploys, health checks, auto-rollback on error rate, and monitoring with alerts. Review action items every sprint.

Summary

A post-mortem analysis for CircleCI Orbs brings together the incident timeline, root cause analysis with 5 Whys, a blameless culture, and action items. Canary deploys, auto-rollback, health checks, and DORA metrics support prevention.
