CircleCI Orbs Post-mortem Analysis

Post-mortem + CircleCI

This article covers post-mortem analysis with CircleCI Orbs: the incident timeline, root cause analysis, blameless culture, action items, and prevention through deploy, CI/CD pipeline, monitoring, alerting, and automation.


Phase	Duration	Activities	Owner	Output
Detection	0-5 min	Alert fired, on-call notified	Monitoring	Incident declared
Triage	5-15 min	Assess impact, assign severity	On-call engineer	Severity level
Mitigation	15-60 min	Rollback, hotfix, or workaround	Incident commander	Service restored
Resolution	1-4 hours	Root cause fix deployed	Dev team	Permanent fix
Post-mortem	1-3 days after	Analysis meeting, document	Team lead	Post-mortem doc
Follow-up	1-2 sprints	Action items completed	Assigned owners	Prevention measures

Incident Analysis with CI/CD Data

# === CircleCI Incident Analysis ===

# CircleCI API v2: list the 10 most recent pipelines on main
# (replace gh/org/repo with your own project slug)
# curl -H "Circle-Token: $CIRCLE_TOKEN" \
#   "https://circleci.com/api/v2/project/gh/org/repo/pipeline?branch=main" | \
#   jq '.items[:10] | .[] | {id: .id, state: .state, created_at: .created_at}'

# Get the workflows of one pipeline. Workflow objects carry created_at and
# stopped_at timestamps rather than a duration field.
# curl -H "Circle-Token: $CIRCLE_TOKEN" \
#   "https://circleci.com/api/v2/pipeline/$PIPELINE_ID/workflow" | \
#   jq '.items[] | {name: .name, status: .status, created_at: .created_at, stopped_at: .stopped_at}'
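Workflow objects returned by the v2 API carry `created_at` and `stopped_at` timestamps, so a run time can be derived from them when building the timeline. The following is a minimal sketch, assuming ISO 8601 timestamps with a trailing `Z`; the sample payload and the helper names `parse_ts` and `workflow_duration_seconds` are made up for illustration.

```python
from datetime import datetime
from typing import Optional


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2024-01-01T14:00:00Z'."""
    # Older Python versions reject a trailing 'Z' in fromisoformat().
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def workflow_duration_seconds(workflow: dict) -> Optional[float]:
    """Return the run time in seconds, or None while the workflow is running."""
    stopped = workflow.get("stopped_at")
    if not stopped:
        return None
    started = parse_ts(workflow["created_at"])
    return (parse_ts(stopped) - started).total_seconds()


# Hypothetical sample shaped like the "items" array of the workflow endpoint.
sample_items = [
    {"name": "build-and-deploy", "status": "success",
     "created_at": "2024-01-01T14:00:00Z", "stopped_at": "2024-01-01T14:05:00Z"},
    {"name": "nightly", "status": "running",
     "created_at": "2024-01-01T14:10:00Z", "stopped_at": None},
]

for wf in sample_items:
    print(wf["name"], wf["status"], workflow_duration_seconds(wf))
```

Returning `None` for a running workflow keeps unfinished runs out of any averages computed later.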



# .circleci/config.yml with post-mortem orbs
# Illustrative fragment: my-org/rollback, deploy-to-production and
# health-check are placeholders for your own orb and commands, and
# DEPLOY_FAILED is a placeholder template name.
# orbs:
#   slack: circleci/slack@4.12.0
#   rollback: my-org/rollback@1.0.0
#
# jobs:
#   deploy:
#     steps:
#       - deploy-to-production
#       - health-check:
#           url: https://api.myapp.com/health
#           retries: 5
#           interval: 10s
#       - rollback/auto:
#           when: on_fail
#           version: previous
#       - slack/notify:
#           event: fail
#           channel: incidents
#           template: DEPLOY_FAILED
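The health-check step in the config polls a URL a fixed number of times before the job gives up. What such a step might do internally can be sketched in Python as below. This is an assumption about the behaviour, not the implementation of any real orb; `http_status` and `health_check` are hypothetical names, and the probe and sleep functions are injectable so the logic can be tried without a network.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, ...


def health_check(url: str, retries: int = 5, interval: float = 10.0,
                 probe: Callable[[str], int] = http_status,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True as soon as one probe returns 200, False after all retries."""
    for attempt in range(1, retries + 1):
        if probe(url) == 200:
            return True
        if attempt < retries:
            sleep(interval)  # wait before the next attempt
    return False


# Dry run with a fake probe: two failures, then a healthy response.
responses = iter([503, 503, 200])
waits = []
ok = health_check("https://api.myapp.com/health",
                  probe=lambda url: next(responses),
                  sleep=waits.append)
print("healthy:", ok, "| waits:", waits)
```

In a pipeline, a `False` result would translate into a non-zero exit code so that the rollback and notification steps run.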



from dataclasses import dataclass


@dataclass
class IncidentTimeline:
    """One timestamped entry in the incident timeline."""
    time: str    # local time, HH:MM
    event: str   # what happened
    source: str  # system or person that reported the event
    impact: str  # user-facing impact at that moment


timeline = [
    IncidentTimeline("14:00", "Deploy #1234 triggered (main branch)", "CircleCI", "None"),
    IncidentTimeline("14:05", "Deploy completed, health check passed", "CircleCI", "None"),
    IncidentTimeline("14:12", "Error rate spike 5% → 25%", "Datadog alert", "Users see 500 errors"),
    IncidentTimeline("14:15", "On-call paged, incident declared SEV-2", "PagerDuty", "25% requests failing"),
    IncidentTimeline("14:20", "Root cause identified: DB migration issue", "Engineer", "Ongoing"),
    IncidentTimeline("14:25", "Rollback initiated via CircleCI", "CircleCI", "Reducing"),
    IncidentTimeline("14:30", "Rollback complete, error rate back to 1%", "Datadog", "Resolved"),
    IncidentTimeline("14:45", "All clear, monitoring continues", "Team", "None"),
]

# Print the timeline in chronological order.
print("=== Incident Timeline ===")
for t in timeline:
    print(f"  [{t.time}] {t.event}")
    print(f"    Source: {t.source} | Impact: {t.impact}")
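Once the timeline is written down, the intervals that matter in a post-mortem (time to detect, time to recover) follow from simple arithmetic on the timestamps. A small sketch, using the times from the timeline above and assuming all events fall on the same day; `minutes_between` is a hypothetical helper:

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Key timestamps copied from the incident timeline.
deploy_done = "14:05"
alert_fired = "14:12"
incident_declared = "14:15"
rollback_started = "14:25"
rollback_done = "14:30"

intervals = {
    "Deploy to first alert": minutes_between(deploy_done, alert_fired),
    "Alert to incident declared": minutes_between(alert_fired, incident_declared),
    "Alert to rollback start": minutes_between(alert_fired, rollback_started),
    "Alert to recovery": minutes_between(alert_fired, rollback_done),
}

for label, minutes in intervals.items():
    print(f"  {label}: {minutes} min")
```

For this incident the alert fired 7 minutes after the deploy finished and service recovered 18 minutes after the alert, which is the figure to compare against the team's MTTR target.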

Root Cause Analysis

# === 5 Whys Analysis ===
from dataclasses import dataclass


@dataclass
class WhyStep:
    """One question/answer pair in the 5 Whys chain."""
    level: int
    question: str
    answer: str


five_whys = [
    WhyStep(1, "Why did users see 500 errors?",
        "Database queries failed due to missing column"),
    WhyStep(2, "Why was the column missing?",
        "DB migration ran but was incompatible with old code"),
    WhyStep(3, "Why was incompatible migration deployed?",
        "Migration and code change were in separate deploys"),
    WhyStep(4, "Why were they in separate deploys?",
        "No process requiring migration + code in same PR"),
    WhyStep(5, "Why was there no such process?",
        "Deploy guidelines didn't cover DB migration ordering"),
]

print("=== 5 Whys ===")
for w in five_whys:
    print(f"  Why #{w.level}: {w.question}")
    print(f"    → {w.answer}")


# Action Items
@dataclass
class ActionItem:
    """A follow-up task coming out of the post-mortem."""
    action: str
    priority: str  # P0 (most urgent) to P2
    owner: str
    deadline: str
    status: str


actions = [
    ActionItem("Add DB migration check to CI pipeline", "P0", "DevOps team", "This sprint", "In progress"),
    ActionItem("Create deploy runbook for DB changes", "P0", "Tech lead", "This sprint", "Todo"),
    ActionItem("Add integration test for DB schema", "P1", "Backend team", "Next sprint", "Todo"),
    ActionItem("Implement canary deploy (10% → 50% → 100%)", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Add auto-rollback on error rate > 5%", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Update post-mortem template with CI/CD section", "P2", "Team lead", "Next sprint", "Todo"),
]

print("\n\n=== Action Items ===")
for a in actions:
    print(f"  [{a.priority}] {a.action}")
    print(f"    Owner: {a.owner} | Deadline: {a.deadline} | Status: {a.status}")
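Action items only help if someone follows up on them each sprint. One possible way to prepare that review is to group the open items by owner with the most urgent first, as sketched below. The helper name `open_items_by_owner` and the "Done" status on the third sample item are assumptions for illustration; the other values come from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ActionItem:
    action: str
    priority: str  # "P0" sorts before "P1" and "P2"
    owner: str
    deadline: str
    status: str


def open_items_by_owner(items: List[ActionItem]) -> Dict[str, List[ActionItem]]:
    """Group items that are not done by owner, highest priority first."""
    grouped = defaultdict(list)
    for item in sorted(items, key=lambda i: i.priority):
        if item.status.lower() != "done":
            grouped[item.owner].append(item)
    return dict(grouped)


sample = [
    ActionItem("Implement canary deploy", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Add DB migration check to CI pipeline", "P0", "DevOps team", "This sprint", "In progress"),
    # Hypothetical status, used here to show that finished items are skipped.
    ActionItem("Create deploy runbook for DB changes", "P0", "Tech lead", "This sprint", "Done"),
]

review = open_items_by_owner(sample)
for owner, items in review.items():
    print(owner)
    for item in items:
        print(f"  [{item.priority}] {item.action} ({item.status}, due {item.deadline})")
```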

Prevention Automation

# === Automated Prevention with CircleCI ===

# Canary deploy config (pseudo-config: deploy-canary, wait, check-metrics,
# deploy-full and rollback are placeholder commands you would define
# yourself, and the when: values are not built-in CircleCI conditions)
# jobs:
#   canary-deploy:
#     steps:
#       - deploy-canary:
#           percentage: 10
#       - wait: { duration: 5m }
#       - check-metrics:
#           error_threshold: 2%
#           latency_threshold: 500ms
#       - deploy-full:
#           when: metrics_pass
#       - rollback:
#           when: metrics_fail
#
# Health check orb (my-org/health-check is a placeholder private orb)
# orbs:
#   health: my-org/health-check@1.0.0
# jobs:
#   post-deploy:
#     steps:
#       - health/check:
#           endpoints:
#             - url: https://api.myapp.com/health
#               expected_status: 200
#             - url: https://api.myapp.com/db/health
#               expected_status: 200
#           timeout: 30s
#           retries: 3
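The check-metrics step in the canary config comes down to a threshold comparison: promote the release if the canary stays under the error and latency limits, otherwise roll back. A minimal sketch of that decision, using the 2% and 500 ms thresholds from the config. The names `CanaryMetrics` and `canary_decision` are hypothetical, and treating the latency threshold as a p95 value is an assumption, since the config does not name a percentile.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of failed requests, e.g. 0.015 for 1.5%
    p95_latency_ms: float  # assumed percentile; the config only says "latency"


def canary_decision(metrics: CanaryMetrics,
                    error_threshold: float = 0.02,
                    latency_threshold_ms: float = 500.0) -> str:
    """Return 'promote' if both metrics are within limits, else 'rollback'."""
    if metrics.error_rate > error_threshold:
        return "rollback"
    if metrics.p95_latency_ms > latency_threshold_ms:
        return "rollback"
    return "promote"


healthy = CanaryMetrics(error_rate=0.01, p95_latency_ms=320.0)
failing = CanaryMetrics(error_rate=0.25, p95_latency_ms=320.0)
slow = CanaryMetrics(error_rate=0.01, p95_latency_ms=800.0)

for name, m in [("healthy", healthy), ("failing", failing), ("slow", slow)]:
    print(f"{name}: {canary_decision(m)}")
```

A 25% error rate like the one in the incident would have been stopped at the canary stage, with only the canary's share of traffic affected.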



from dataclasses import dataclass


@dataclass
class PreventionMeasure:
    """An automated safeguard and the condition that triggers it."""
    measure: str
    trigger: str
    automation: str
    effectiveness: str


measures = [
    PreventionMeasure("Pre-deploy DB check", "Migration file detected in PR",
        "CI job validates migration compatibility", "Catches 80% of DB issues"),
    PreventionMeasure("Canary deploy", "Every production deploy",
        "10% traffic → check metrics → full deploy", "Limits blast radius to 10%"),
    PreventionMeasure("Auto rollback", "Error rate > 5% post-deploy",
        "CircleCI triggers rollback pipeline", "MTTR reduced from 30min to 5min"),
    PreventionMeasure("Health check gate", "After every deploy",
        "Hit /health endpoint, fail if not 200", "Catches service startup issues"),
    PreventionMeasure("Slack incident bot", "Deploy failure or rollback",
        "Auto-create incident channel, notify team", "Faster coordination"),
    PreventionMeasure("Post-mortem reminder", "3 days after incident",
        "Slack reminder to schedule post-mortem meeting", "Ensures follow-through"),
]

print("Prevention Measures:")
for m in measures:
    print(f"  [{m.measure}] Trigger: {m.trigger}")
    print(f"    Automation: {m.automation}")
    print(f"    Impact: {m.effectiveness}")


# DORA metrics tracking: current value → target
dora = {
    "Deployment Frequency": "6.2/day → target 10/day",
    "Lead Time for Changes": "2.5 hours → target 1 hour",
    "MTTR": "22 min → target 15 min (auto-rollback helps)",
    "Change Failure Rate": "8% → target 5% (canary deploy helps)",
}

print("\n\nDORA Metrics:")
for k, v in dora.items():
    print(f"  [{k}]: {v}")
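The DORA figures above are entered by hand. If deploy records are available, two of them can be computed directly: change failure rate (failed deploys divided by all deploys) and MTTR (the mean restore time over failed deploys). A sketch under those definitions; the `Deploy` record, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deploy:
    deploy_id: str
    failed: bool
    minutes_to_restore: Optional[float] = None  # set only for failed deploys


def change_failure_rate(deploys: List[Deploy]) -> float:
    """Fraction of deploys that caused a failure in production."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def mean_time_to_restore(deploys: List[Deploy]) -> Optional[float]:
    """Average minutes to restore service, or None if nothing failed."""
    times = [d.minutes_to_restore for d in deploys
             if d.failed and d.minutes_to_restore is not None]
    return sum(times) / len(times) if times else None


# Hypothetical history: five deploys, two of which needed a rollback.
history = [
    Deploy("d1", failed=False),
    Deploy("d2", failed=True, minutes_to_restore=18),
    Deploy("d3", failed=False),
    Deploy("d4", failed=True, minutes_to_restore=26),
    Deploy("d5", failed=False),
]

cfr = change_failure_rate(history)
mttr = mean_time_to_restore(history)
print(f"Change Failure Rate: {cfr:.0%}")
print(f"MTTR: {mttr} min")
```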

Tips

Blameless: A post-mortem must be blameless. Blame the process, not people.
Timeline: Build a detailed timeline; it makes the root cause easier to find.
5 Whys: Ask "why" five times to reach the real root cause.
Action Items: Assign an owner and a deadline to each item, and track progress every sprint.
Automate: Automate every action item that can be automated.

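What is Post-mortem Analysis?

Post-mortem analysis examines an incident to find its cause and prevent it from happening again. It is blameless and covers the timeline, the root cause (5 Whys), action items, and lessons learned, drawing on CI/CD data such as deploys, builds, tests, and config.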
