Post-mortem + CircleCI
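This article covers post-mortem analysis with CircleCI orbs: building an incident timeline, finding the root cause, keeping a blameless culture, tracking action items, and preventing repeats through the deploy and CI/CD pipeline with monitoring, alerts, and automation.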
| Phase | Duration | Activities | Owner | Output |
|---|---|---|---|---|
| Detection | 0-5 min | Alert fired, on-call notified | Monitoring | Incident declared |
| Triage | 5-15 min | Assess impact, assign severity | On-call engineer | Severity level |
| Mitigation | 15-60 min | Rollback, hotfix, or workaround | Incident commander | Service restored |
| Resolution | 1-4 hours | Root cause fix deployed | Dev team | Permanent fix |
| Post-mortem | 1-3 days after | Analysis meeting, document | Team lead | Post-mortem doc |
| Follow-up | 1-2 sprints | Action items completed | Assigned owners | Prevention measures |
Incident Analysis with CI/CD Data
# === CircleCI Incident Analysis ===

# CircleCI API: list recent pipelines on main
# curl -H "Circle-Token: $CIRCLE_TOKEN" \
#   "https://circleci.com/api/v2/project/gh/org/repo/pipeline?branch=main" | \
#   jq '.items[:10] | .[] | {id: .id, state: .state, created_at: .created_at}'

# Get workflow details (the workflow object carries no duration field;
# derive the duration from created_at and stopped_at)
# curl -H "Circle-Token: $CIRCLE_TOKEN" \
#   "https://circleci.com/api/v2/pipeline/$PIPELINE_ID/workflow" | \
#   jq '.items[] | {name: .name, status: .status, created_at: .created_at, stopped_at: .stopped_at}'

# .circleci/config.yml with post-mortem orb
# (illustrative: deploy-to-production, health-check, and the my-org/rollback
# orb are placeholders for commands you would define yourself)
# orbs:
#   slack: circleci/slack@4.12.0
#   rollback: my-org/rollback@1.0.0
#
# jobs:
#   deploy:
#     steps:
#       - deploy-to-production
#       - health-check:
#           url: https://api.myapp.com/health
#           retries: 5
#           interval: 10s
#       - rollback/auto:
#           when: on_fail
#           version: previous
#       - slack/notify:
#           event: fail
#           channel: incidents
#           template: DEPLOY_FAILED
from dataclasses import dataclass


@dataclass
class IncidentTimeline:
    time: str
    event: str
    source: str
    impact: str


timeline = [
    IncidentTimeline("14:00", "Deploy #1234 triggered (main branch)", "CircleCI", "None"),
    IncidentTimeline("14:05", "Deploy completed, health check passed", "CircleCI", "None"),
    IncidentTimeline("14:12", "Error rate spike 5% → 25%", "Datadog alert", "Users see 500 errors"),
    IncidentTimeline("14:15", "On-call paged, incident declared SEV-2", "PagerDuty", "25% requests failing"),
    IncidentTimeline("14:20", "Root cause identified: DB migration issue", "Engineer", "Ongoing"),
    IncidentTimeline("14:25", "Rollback initiated via CircleCI", "CircleCI", "Reducing"),
    IncidentTimeline("14:30", "Rollback complete, error rate back to 1%", "Datadog", "Resolved"),
    IncidentTimeline("14:45", "All clear, monitoring continues", "Team", "None"),
]

print("=== Incident Timeline ===")
for t in timeline:
    print(f" [{t.time}] {t.event}")
    print(f" Source: {t.source} | Impact: {t.impact}")
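The timeline above already contains enough data to derive the intervals a post-mortem usually reports. The following is a minimal sketch, assuming same-day `HH:MM` timestamps; the short event names (`deploy_triggered`, `error_spike`, and so on) are hypothetical labels introduced here, not part of the original timeline.

```python
from datetime import datetime

# (time, label) pairs taken from the incident timeline; labels are hypothetical.
events = [
    ("14:00", "deploy_triggered"),
    ("14:12", "error_spike"),
    ("14:15", "incident_declared"),
    ("14:25", "rollback_started"),
    ("14:30", "rollback_complete"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Look up each event's time by its label.
at = {label: time for time, label in events}

deploy_to_impact = minutes_between(at["deploy_triggered"], at["error_spike"])
time_to_detect = minutes_between(at["error_spike"], at["incident_declared"])
time_to_mitigate = minutes_between(at["incident_declared"], at["rollback_complete"])
user_impact = minutes_between(at["error_spike"], at["rollback_complete"])

print(f"Deploy to first impact: {deploy_to_impact} min")
print(f"Time to detect:         {time_to_detect} min")
print(f"Time to mitigate:       {time_to_mitigate} min")
print(f"Total user impact:      {user_impact} min")
```

Keeping these numbers next to the timeline makes it easier to compare one incident with the next.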
Root Cause Analysis
# === 5 Whys Analysis ===
from dataclasses import dataclass


@dataclass
class WhyStep:
    level: int
    question: str
    answer: str


five_whys = [
    WhyStep(1, "Why did users see 500 errors?",
            "Database queries failed due to missing column"),
    WhyStep(2, "Why was the column missing?",
            "DB migration ran but was incompatible with old code"),
    WhyStep(3, "Why was incompatible migration deployed?",
            "Migration and code change were in separate deploys"),
    WhyStep(4, "Why were they in separate deploys?",
            "No process requiring migration + code in same PR"),
    WhyStep(5, "Why was there no such process?",
            "Deploy guidelines didn't cover DB migration ordering"),
]

print("=== 5 Whys ===")
for w in five_whys:
    print(f" Why #{w.level}: {w.question}")
    print(f" → {w.answer}")


# Action Items
@dataclass
class ActionItem:
    action: str
    priority: str
    owner: str
    deadline: str
    status: str


actions = [
    ActionItem("Add DB migration check to CI pipeline", "P0", "DevOps team", "This sprint", "In progress"),
    ActionItem("Create deploy runbook for DB changes", "P0", "Tech lead", "This sprint", "Todo"),
    ActionItem("Add integration test for DB schema", "P1", "Backend team", "Next sprint", "Todo"),
    ActionItem("Implement canary deploy (10% → 50% → 100%)", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Add auto-rollback on error rate > 5%", "P1", "DevOps team", "Next sprint", "Todo"),
    ActionItem("Update post-mortem template with CI/CD section", "P2", "Team lead", "Next sprint", "Todo"),
]

print("\n\n=== Action Items ===")
for a in actions:
    print(f" [{a.priority}] {a.action}")
    print(f" Owner: {a.owner} | Deadline: {a.deadline} | Status: {a.status}")
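For sprint follow-up it can help to see open action items grouped by owner. The sketch below is one possible way to do that; it uses a trimmed copy of the `ActionItem` class (the `deadline` field is dropped), and the `"Done"` status value is an assumption, since the list above only uses "In progress" and "Todo".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ActionItem:
    action: str
    priority: str
    owner: str
    status: str


items = [
    ActionItem("Add DB migration check to CI pipeline", "P0", "DevOps team", "In progress"),
    ActionItem("Create deploy runbook for DB changes", "P0", "Tech lead", "Todo"),
    ActionItem("Add integration test for DB schema", "P1", "Backend team", "Todo"),
    ActionItem("Implement canary deploy (10% → 50% → 100%)", "P1", "DevOps team", "Todo"),
]


def open_items_by_owner(items):
    """Group unfinished items by owner, highest priority first."""
    grouped = defaultdict(list)
    # "P0" < "P1" < "P2" sorts correctly as plain strings.
    for item in sorted(items, key=lambda i: i.priority):
        if item.status != "Done":  # "Done" is an assumed status value
            grouped[item.owner].append(item)
    return dict(grouped)


summary = open_items_by_owner(items)
for owner, owned in summary.items():
    print(f"{owner}: {len(owned)} open")
    for item in owned:
        print(f"  [{item.priority}] {item.action} ({item.status})")
```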
Prevention Automation
# === Automated Prevention with CircleCI ===

# Canary Deploy Config
# (pseudo-config: deploy-canary, wait, check-metrics, and the
# metrics_pass / metrics_fail conditions stand in for custom commands)
# jobs:
#   canary-deploy:
#     steps:
#       - deploy-canary:
#           percentage: 10
#       - wait: { duration: 5m }
#       - check-metrics:
#           error_threshold: 2%
#           latency_threshold: 500ms
#       - deploy-full:
#           when: metrics_pass
#       - rollback:
#           when: metrics_fail
#
# Health Check Orb
# orbs:
#   health: my-org/health-check@1.0.0
# jobs:
#   post-deploy:
#     steps:
#       - health/check:
#           endpoints:
#             - url: https://api.myapp.com/health
#               expected_status: 200
#             - url: https://api.myapp.com/db/health
#               expected_status: 200
#           timeout: 30s
#           retries: 3
from dataclasses import dataclass


@dataclass
class PreventionMeasure:
    measure: str
    trigger: str
    automation: str
    effectiveness: str


measures = [
    PreventionMeasure("Pre-deploy DB check", "Migration file detected in PR",
                      "CI job validates migration compatibility", "Catches 80% of DB issues"),
    PreventionMeasure("Canary deploy", "Every production deploy",
                      "10% traffic → check metrics → full deploy", "Limits blast radius to 10%"),
    PreventionMeasure("Auto rollback", "Error rate > 5% post-deploy",
                      "CircleCI triggers rollback pipeline", "MTTR reduced from 30min to 5min"),
    PreventionMeasure("Health check gate", "After every deploy",
                      "Hit /health endpoint, fail if not 200", "Catches service startup issues"),
    PreventionMeasure("Slack incident bot", "Deploy failure or rollback",
                      "Auto-create incident channel, notify team", "Faster coordination"),
    PreventionMeasure("Post-mortem reminder", "3 days after incident",
                      "Slack reminder to schedule post-mortem meeting", "Ensures follow-through"),
]

print("Prevention Measures:")
for m in measures:
    print(f" [{m.measure}] Trigger: {m.trigger}")
    print(f" Automation: {m.automation}")
    print(f" Impact: {m.effectiveness}")

# DORA Metrics tracking
dora = {
    "Deployment Frequency": "6.2/day → target 10/day",
    "Lead Time for Changes": "2.5 hours → target 1 hour",
    "MTTR": "22 min → target 15 min (auto-rollback helps)",
    "Change Failure Rate": "8% → target 5% (canary deploy helps)",
}

print("\n\nDORA Metrics:")
for k, v in dora.items():
    print(f" [{k}]: {v}")
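The `check-metrics` step in the canary config above boils down to a promote-or-rollback decision. Here is a minimal sketch of that decision, using the 2% error and 500 ms latency thresholds from the config. The `CanaryMetrics` class and the choice of p95 latency are assumptions; the config does not say which latency percentile it measures.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of failed requests, e.g. 0.015 means 1.5%
    p95_latency_ms: float  # assumed percentile; not specified in the config


ERROR_THRESHOLD = 0.02      # error_threshold: 2%
LATENCY_THRESHOLD_MS = 500  # latency_threshold: 500ms


def canary_decision(m: CanaryMetrics) -> str:
    """Return 'promote' if both metrics are within thresholds, else 'rollback'."""
    if m.error_rate > ERROR_THRESHOLD or m.p95_latency_ms > LATENCY_THRESHOLD_MS:
        return "rollback"
    return "promote"


healthy = canary_decision(CanaryMetrics(error_rate=0.012, p95_latency_ms=310))
failing = canary_decision(CanaryMetrics(error_rate=0.25, p95_latency_ms=310))
print(healthy, failing)
```

In a pipeline, a script like this would exit non-zero on "rollback" so that a later step can react to the failure.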
Tips
- Blameless: A post-mortem must be blameless. Blame the process, not the people.
- Timeline: Build a detailed timeline; it makes the root cause easier to find.
- 5 Whys: Ask "why" five times to reach the real root cause.
- Action Items: Assign an owner and a deadline, and track progress every sprint.
- Automate: Automate every action item that can be automated.
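As one example of automating an action item, the "Add DB migration check to CI pipeline" item could start as a small script that inspects the changed files in a PR. The sketch below assumes migrations live under a `migrations/` directory and flags change sets that contain a migration without any accompanying code change; both the path patterns and the policy are assumptions to adapt to your repository.

```python
from fnmatch import fnmatch

# Assumed repository layout; adjust the patterns to match yours.
MIGRATION_PATTERNS = ["migrations/*", "*/migrations/*"]


def migration_files(changed_paths):
    """Return the changed paths that look like migration files."""
    return [
        path for path in changed_paths
        if any(fnmatch(path, pattern) for pattern in MIGRATION_PATTERNS)
    ]


def check_change_set(changed_paths):
    """Fail when a change set contains migrations but no other code."""
    migrations = migration_files(changed_paths)
    others = [path for path in changed_paths if path not in migrations]
    if migrations and not others:
        return False, f"migration without code change: {migrations}"
    return True, "ok"


print(check_change_set(["app/migrations/0002_add_column.py"]))
print(check_change_set(["app/migrations/0002_add_column.py", "app/models.py"]))
```

In CI, the list of changed paths could come from `git diff --name-only` against the target branch.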
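What is post-mortem analysis?
It is the analysis of an incident to find its cause and prevent it from happening again. It is blameless and covers the timeline, the root cause (5 Whys), action items, and lessons learned, drawing on CI/CD data such as deploys, builds, tests, and config.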
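How do you use CircleCI data in the analysis?
Reconstruct the build timeline, deploys, and config changes, then review test results and workflow durations through the API and dashboard. Track deploy frequency, failure rate, and MTTR, and check how notifications and rollback automation behaved.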
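Workflow duration and failure rate can be derived from the workflow list returned by the CircleCI v2 API. This is a minimal sketch that works on an already-fetched response; the sample payload is invented for illustration and only uses the `name`, `status`, `created_at`, and `stopped_at` fields.

```python
import json
from datetime import datetime

# Hypothetical sample shaped like a v2 "pipeline workflows" response.
sample = json.loads("""
{"items": [
  {"name": "build-test-deploy", "status": "success",
   "created_at": "2024-01-01T14:00:00Z", "stopped_at": "2024-01-01T14:05:00Z"},
  {"name": "build-test-deploy", "status": "failed",
   "created_at": "2024-01-01T15:00:00Z", "stopped_at": "2024-01-01T15:02:30Z"}
]}
""")


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def workflow_stats(items):
    """Return average duration (seconds) and failure rate for finished workflows."""
    durations = [
        (parse_ts(w["stopped_at"]) - parse_ts(w["created_at"])).total_seconds()
        for w in items
        if w.get("stopped_at")  # skip workflows that are still running
    ]
    failed = sum(1 for w in items if w["status"] == "failed")
    return {
        "avg_duration_s": sum(durations) / len(durations) if durations else 0.0,
        "failure_rate": failed / len(items) if items else 0.0,
    }


stats = workflow_stats(sample["items"])
print(stats)
```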
How do you create a post-mortem template?
Include these sections: incident summary, timeline, impact, root cause, contributing factors, what went well, what went wrong, action items (with owner and deadline), lessons learned, and a follow-up date.
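A template like this can be generated so that every post-mortem starts from the same skeleton. The sketch below is one simple way to render it as plain text; the `render_template` function and the `TODO` placeholder are illustrative choices, not a prescribed format.

```python
SECTIONS = [
    "Incident Summary",
    "Timeline",
    "Impact",
    "Root Cause",
    "Contributing Factors",
    "What Went Well",
    "What Went Wrong",
    "Action Items (owner, deadline)",
    "Lessons Learned",
    "Follow-up Date",
]


def render_template(title: str) -> str:
    """Return a plain-text post-mortem skeleton with numbered sections."""
    lines = [f"Post-mortem: {title}", ""]
    for number, section in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {section}")
        lines.append("   TODO")
        lines.append("")
    return "\n".join(lines)


doc = render_template("Deploy #1234 error rate spike")
print(doc)
```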
How do you prevent repeat incidents?
Add automated tests (integration and E2E) to the CI pipeline, use canary deploys, health checks, and auto-rollback, monitor the error rate with alerts, and review action items every sprint.
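The health check gate can be sketched as a retry loop that passes only once the endpoint returns 200. The version below takes the probe as a function so it can be tried without a network; the `health_gate` name and its parameters are assumptions, with defaults mirroring the 5 retries and 10 second interval from the deploy config.

```python
import time
from typing import Callable


def health_gate(probe: Callable[[], int], retries: int = 5,
                interval_s: float = 10.0,
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once probe() returns 200; False after `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # treat connection errors as a failed attempt
        if attempt < retries:
            sleep(interval_s)
    return False


# Usage with a fake probe that fails twice, then succeeds.
responses = iter([503, 503, 200])
ok = health_gate(lambda: next(responses), retries=5, interval_s=0)
print("deploy healthy" if ok else "trigger rollback")
```

In a real pipeline the probe could wrap `urllib.request.urlopen(url).status`; a `False` result would fail the job and let a rollback step take over.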
Summary
Post-mortem analysis with CircleCI orbs combines an incident timeline, root cause analysis with the 5 Whys, a blameless culture, and tracked action items. Prevention relies on canary deploys, auto-rollback, health checks, and DORA metrics.
