Incident.io
Topics: Incident.io, incident management, on-call, Status Page, post-mortem, Slack, SRE, DevOps, MTTR, alert escalation, timeline, blameless culture, production operations, Conference 2026.
| Tool | Integration | On-call | Post-mortem | Best for |
|---|---|---|---|---|
| Incident.io | Slack-native | Built-in | Automated | Slack-first Teams |
| PagerDuty | Multi-platform | Advanced | Manual | Enterprise |
| Opsgenie | Atlassian | Good | Jira Integration | Jira Users |
| FireHydrant | Slack + Web | Basic | Automated | Mid-size |
| Rootly | Slack-native | Basic | Automated | Startups |
Incident Workflow
# === Incident Management Workflow ===
# Incident.io Slack Integration
# /incident create "Payment API returning 500 errors"
# → Creates #inc-2026-0142 channel
# → Assigns Incident Commander
# → Notifies on-call engineer
# → Creates Status Page update
# → Starts Timeline tracking
# Severity Levels
# SEV1: Critical — Full outage, all users affected
# Response: Immediate, all hands, CEO notified
# Target MTTR: 30 minutes
#
# SEV2: Major — Partial outage, many users affected
# Response: On-call team + escalation
# Target MTTR: 1 hour
#
# SEV3: Minor — Degraded service, some users affected
# Response: On-call engineer
# Target MTTR: 4 hours
#
# SEV4: Low — Cosmetic issue, workaround available
# Response: Next business day
# Target MTTR: 24 hours
# Incident Roles
# Commander: controls the situation and makes decisions
# Communicator: updates the Status Page and informs stakeholders
# Investigator: finds the root cause and fixes it
# Scribe: records the incident timeline
from dataclasses import dataclass


@dataclass
class Incident:
    id: str
    title: str
    severity: str
    status: str
    commander: str
    duration_min: int
    users_affected: int
    root_cause: str


# Sample incident records used for the report below.
incidents = [
    Incident("INC-142", "Payment API 500 errors", "SEV1", "Resolved", "Alice", 25, 15000, "DB connection pool exhausted"),
    Incident("INC-141", "Slow dashboard loading", "SEV3", "Resolved", "Bob", 120, 500, "N+1 query in reports"),
    Incident("INC-140", "Email delivery delay", "SEV2", "Resolved", "Carol", 45, 8000, "SES rate limit hit"),
    Incident("INC-139", "Login page CSS broken", "SEV4", "Resolved", "Dave", 180, 200, "CDN cache stale"),
    Incident("INC-138", "API latency spike", "SEV2", "Resolved", "Alice", 35, 12000, "Redis cluster failover"),
]

print("=== Recent Incidents ===")
for inc in incidents:
    print(f" [{inc.severity}] {inc.id}: {inc.title}")
    print(f" Status: {inc.status} | Duration: {inc.duration_min}min")
    print(f" Commander: {inc.commander} | Affected: {inc.users_affected:,}")
    print(f" Root Cause: {inc.root_cause}")
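The severity levels above pair each level with a target MTTR. A minimal sketch of checking resolved incidents against those targets follows. The `TARGET_MTTR_MIN` mapping and the `met_target` helper are illustrative names introduced here, not part of Incident.io; the sample durations are copied from the incident list above.

```python
# Target MTTR per severity, in minutes, taken from the severity levels above.
TARGET_MTTR_MIN = {"SEV1": 30, "SEV2": 60, "SEV3": 240, "SEV4": 1440}


def met_target(severity: str, duration_min: int) -> bool:
    """Return True if the incident was resolved within its severity's target."""
    return duration_min <= TARGET_MTTR_MIN[severity]


# (id, severity, duration in minutes) for the sample incidents.
samples = [
    ("INC-142", "SEV1", 25),
    ("INC-141", "SEV3", 120),
    ("INC-140", "SEV2", 45),
    ("INC-139", "SEV4", 180),
    ("INC-138", "SEV2", 35),
]

for inc_id, severity, duration in samples:
    verdict = "within target" if met_target(severity, duration) else "missed target"
    print(f"{inc_id} [{severity}] {duration}min: {verdict}")
```

With these sample values every incident lands inside its target, so a check like this is mostly useful as a regression signal when a later incident runs long.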
On-call and Escalation
# === On-call Schedule & Escalation ===
# On-call Schedule (YAML config)
# schedules:
#   - name: primary-oncall
#     timezone: Asia/Bangkok
#     rotation: weekly
#     start_day: monday
#     start_time: "09:00"
#     participants:
#       - alice@company.com
#       - bob@company.com
#       - carol@company.com
#       - dave@company.com
#
# escalation_policies:
#   - name: production-critical
#     steps:
#       - targets: [primary-oncall]
#         timeout_minutes: 5
#       - targets: [secondary-oncall]
#         timeout_minutes: 5
#       - targets: [engineering-manager]
#         timeout_minutes: 10
#       - targets: [vp-engineering]
#         timeout_minutes: 0 # final escalation
# Alert Routing Rules
# routes:
#   - match:
#       severity: critical
#       service: payment
#     notify: production-critical
#     channels: [slack, phone, sms]
#   - match:
#       severity: warning
#     notify: primary-oncall
#     channels: [slack]
from dataclasses import dataclass


@dataclass
class OnCallShift:
    engineer: str
    schedule: str
    start: str
    end: str
    incidents_handled: int
    escalations: int
    status: str


# Current and upcoming shifts for the primary and secondary rotations.
shifts = [
    OnCallShift("Alice", "Primary", "Mon 09:00", "Mon 09:00 (+1w)", 3, 0, "Active"),
    OnCallShift("Bob", "Primary", "Next Mon", "Mon +1w", 0, 0, "Upcoming"),
    OnCallShift("Carol", "Secondary", "Mon 09:00", "Mon 09:00 (+1w)", 1, 1, "Active"),
    OnCallShift("Dave", "Secondary", "Next Mon", "Mon +1w", 0, 0, "Upcoming"),
]

print("\n=== On-call Schedule ===")
for s in shifts:
    print(f" [{s.status}] {s.engineer} - {s.schedule}")
    print(f" Shift: {s.start} → {s.end}")
    print(f" Incidents: {s.incidents_handled} | Escalations: {s.escalations}")

oncall_metrics = {
    "Avg Incidents/week": "4.2",
    "Avg Response Time": "3.5 minutes",
    "Escalation Rate": "8%",
    "MTTR (SEV1)": "28 minutes",
    "MTTR (SEV2)": "52 minutes",
    "False Positive Alert Rate": "12%",
    "On-call Satisfaction": "3.8/5",
}

print("\n\nOn-call Metrics:")
for k, v in oncall_metrics.items():
    print(f" {k}: {v}")
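To make the `production-critical` escalation policy above concrete, here is a small sketch that works out who is being paged after a given number of minutes without an acknowledgement. It assumes each `timeout_minutes` value is the wait before moving to the next step and that a timeout of 0 marks the final step; `STEPS` and `target_at` are hypothetical names, and a real tool may interpret the timeouts differently.

```python
# Escalation steps from the production-critical policy above:
# (target, minutes to wait for an acknowledgement before moving on).
# Assumption: a timeout of 0 marks the final step, which never escalates.
STEPS = [
    ("primary-oncall", 5),
    ("secondary-oncall", 5),
    ("engineering-manager", 10),
    ("vp-engineering", 0),
]


def target_at(elapsed_min: float) -> str:
    """Return who is being paged after `elapsed_min` minutes without an ack."""
    deadline = 0
    for target, timeout in STEPS:
        if timeout == 0:
            return target  # final escalation step
        deadline += timeout
        if elapsed_min < deadline:
            return target
    return STEPS[-1][0]


for minutes in (0, 4, 5, 12, 25):
    print(f"{minutes:>2} min unacknowledged -> {target_at(minutes)}")
```

Under these assumptions an unacknowledged alert reaches the VP of Engineering after 20 minutes, which is a useful number to sanity-check against the SEV1 target MTTR of 30 minutes.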
Post-mortem
# === Blameless Post-mortem ===
# Post-mortem Template
# ## Incident INC-142: Payment API 500 Errors
# **Date:** 2025-01-20
# **Duration:** 25 minutes (14:05 - 14:30)
# **Severity:** SEV1
# **Commander:** Alice
#
# ### Impact
# - 15,000 users unable to make payments
# - Estimated revenue loss: 150,000 THB
# - 500 support tickets created
#
# ### Timeline
# - 14:05 Alert: Payment API error rate > 5%
# - 14:07 On-call Alice acknowledged
# - 14:10 Incident channel created
# - 14:12 Root cause identified: DB connection pool
# - 14:15 Mitigation: Increased pool size
# - 14:20 Error rate dropping
# - 14:30 Fully resolved
#
# ### Root Cause (5 Whys)
# 1. Why did API return 500? → DB connections timed out
# 2. Why did connections time out? → Pool exhausted
# 3. Why was pool exhausted? → Spike in traffic
# 4. Why spike in traffic? → Marketing campaign launched
# 5. Why wasn't pool sized for spike? → No capacity planning
#
# ### Action Items
# - [ ] Increase DB pool to 200 (Owner: Dave, Due: Jan 22)
# - [ ] Add connection pool monitoring alert (Owner: Bob, Due: Jan 24)
# - [ ] Capacity planning for marketing campaigns (Owner: Carol, Due: Feb 1)
# - [ ] Load test before major campaigns (Owner: Alice, Due: Feb 1)
from dataclasses import dataclass


@dataclass
class ActionItem:
    action: str
    owner: str
    due: str
    priority: str
    status: str


# Follow-up work from the INC-142 post-mortem.
actions = [
    ActionItem("Increase DB connection pool to 200", "Dave", "Jan 22", "P0", "Done"),
    ActionItem("Add pool monitoring alert at 80%", "Bob", "Jan 24", "P0", "Done"),
    ActionItem("Capacity planning process for campaigns", "Carol", "Feb 1", "P1", "In Progress"),
    ActionItem("Load test playbook before major launches", "Alice", "Feb 1", "P1", "In Progress"),
    ActionItem("Auto-scaling for DB connections", "Dave", "Feb 15", "P2", "Planned"),
]

print("Post-mortem Action Items:")
for a in actions:
    print(f" [{a.status}] [{a.priority}] {a.action}")
    print(f" Owner: {a.owner} | Due: {a.due}")

reliability_metrics = {
    "Uptime (30d)": "99.95%",
    "Total Incidents (30d)": "18",
    "SEV1 Count": "2",
    "Avg MTTR": "38 minutes",
    "Post-mortems Written": "5",
    "Action Items Completed": "12/15 (80%)",
    "Recurring Incidents": "1 (down from 4)",
}

print("\n\nReliability Dashboard:")
for k, v in reliability_metrics.items():
    print(f" {k}: {v}")
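The timeline in the post-mortem template above already contains enough to derive response metrics. The following sketch computes time to acknowledge, mitigate, and resolve from four of its entries; `minutes_between` is an illustrative helper and assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Timeline entries copied from the INC-142 post-mortem template above.
timeline = [
    ("14:05", "Alert: Payment API error rate > 5%"),
    ("14:07", "On-call Alice acknowledged"),
    ("14:15", "Mitigation: Increased pool size"),
    ("14:30", "Fully resolved"),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Unpack the four timestamps in timeline order.
detected, acknowledged, mitigated, resolved = (t for t, _ in timeline)

print("Time to acknowledge:", minutes_between(detected, acknowledged), "min")
print("Time to mitigate:   ", minutes_between(detected, mitigated), "min")
print("Time to resolve:    ", minutes_between(detected, resolved), "min")
```

The resolve figure matches the 25-minute duration stated in the template, which is a quick consistency check worth running on any post-mortem before it is published.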
Tips
- Blameless: a post-mortem does not blame people; it focuses on improving the system.
- Automate: use Incident.io to create the incident channel and timeline automatically.
- Escalation: define a clear Escalation Policy with a timeout of no more than 5 minutes per step.
- Action Items: follow up on post-mortem action items every week.
- Practice: rehearse incident response every month with a Game Day.
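What is Incident.io?
Incident.io is an incident management platform that automatically creates a Slack channel for each incident and handles roles, the timeline, status updates, post-mortems, on-call, and escalation. It is aimed at SRE and DevOps teams that want to reduce MTTR.
What are the steps in the incident management process?
Detection (an alert fires), triage (assign SEV1-4), response (open a channel and assign roles), mitigation (apply an initial fix), resolution (apply the permanent fix), and post-mortem (identify the cause and define action items).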
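The process described above is a fixed sequence of phases, which can be modelled as a small ordered enum. This is only an illustrative sketch: the `Phase` enum and `next_phase` helper are names invented here and do not correspond to any Incident.io API.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """Incident lifecycle phases, in the order listed above."""
    DETECTION = 1
    TRIAGE = 2
    RESPONSE = 3
    MITIGATION = 4
    RESOLUTION = 5
    POST_MORTEM = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    phases = list(Phase)  # Enum members iterate in definition order
    idx = phases.index(current)
    return phases[idx + 1] if idx + 1 < len(phases) else None


phase: Optional[Phase] = Phase.DETECTION
while phase is not None:
    print(phase.name)
    phase = next_phase(phase)
```

Keeping mitigation and resolution as separate phases mirrors the distinction in the text between an initial fix and a permanent one.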
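How do you manage on-call?
Rotate the schedule weekly or daily, and set an Escalation Policy with a 5-minute timeout. Tools such as PagerDuty and Opsgenie also cover this. Support overrides and handoffs, and compensate on-call engineers with extra pay or time off.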
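A weekly rotation like the one described above reduces to simple date arithmetic. The sketch below returns who is on call at a given moment, using the participant list from the schedule config earlier in the article. The anchor date `ROTATION_START` is a hypothetical value chosen for illustration, and times are treated as naive Asia/Bangkok local time; overrides and handoffs are not modelled.

```python
from datetime import datetime, timedelta

PARTICIPANTS = [
    "alice@company.com",
    "bob@company.com",
    "carol@company.com",
    "dave@company.com",
]

# Hypothetical anchor: the Monday 09:00 (local time) on which the first
# participant's shift began. 2026-01-05 is a Monday.
ROTATION_START = datetime(2026, 1, 5, 9, 0)


def on_call_at(moment: datetime) -> str:
    """Return the engineer on call at `moment` (naive local time)."""
    if moment < ROTATION_START:
        raise ValueError("moment is before the rotation started")
    # Whole weeks elapsed since the anchor selects the slot in the rotation.
    weeks_elapsed = (moment - ROTATION_START) // timedelta(weeks=1)
    return PARTICIPANTS[weeks_elapsed % len(PARTICIPANTS)]


print(on_call_at(datetime(2026, 1, 5, 9, 0)))    # first shift
print(on_call_at(datetime(2026, 1, 12, 8, 59)))  # one minute before handoff
print(on_call_at(datetime(2026, 1, 12, 9, 0)))   # handoff to the next engineer
```

With four participants each engineer is on call one week in four, so the fifth week returns to the first participant.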
How do you write a post-mortem?
Keep it blameless: do not blame individuals. Cover the timeline, impact, root cause (5 Whys), contributing factors, action items, and lessons learned, then review and follow up every week.
Summary
Incident.io (Tech Conference 2026): incident management with on-call and escalation, blameless post-mortems, Slack-based timelines, and Status Page updates, aimed at lowering MTTR for SRE and DevOps teams running production operations.
