Incident.io Tech Conference 2026


Tool	Integration	On-call	Post-mortem	Best for
Incident.io	Slack-native	Built-in	Automated	Slack-first Teams
PagerDuty	Multi-platform	Advanced	Manual	Enterprise
Opsgenie	Atlassian	Good	Jira Integration	Jira Users
FireHydrant	Slack + Web	Basic	Automated	Mid-size
Rootly	Slack-native	Basic	Automated	Startups

Incident Workflow

# === Incident Management Workflow ===

# Incident.io Slack Integration
# /incident create "Payment API returning 500 errors"
# → Creates #inc-2026-0142 channel
# → Assigns Incident Commander
# → Notifies on-call engineer
# → Creates Status Page update
# → Starts Timeline tracking

# Severity Levels
# SEV1: Critical — Full outage, all users affected
#   Response: Immediate, all hands, CEO notified
#   Target MTTR: 30 minutes
#
# SEV2: Major — Partial outage, many users affected
#   Response: On-call team + escalation
#   Target MTTR: 1 hour
#
# SEV3: Minor — Degraded service, some users affected
#   Response: On-call engineer
#   Target MTTR: 4 hours
#
# SEV4: Low — Cosmetic issue, workaround available
#   Response: Next business day
#   Target MTTR: 24 hours

# Incident Roles
# Commander: controls the situation and makes decisions
# Communicator: updates the Status Page and informs stakeholders
# Investigator: finds the root cause and fixes it
# Scribe: records the incident timeline

from dataclasses import dataclass

@dataclass
class Incident:
    id: str
    title: str
    severity: str
    status: str
    commander: str
    duration_min: int
    users_affected: int
    root_cause: str

incidents = [
    Incident("INC-142", "Payment API 500 errors", "SEV1", "Resolved", "Alice", 25, 15000, "DB connection pool exhausted"),
    Incident("INC-141", "Slow dashboard loading", "SEV3", "Resolved", "Bob", 120, 500, "N+1 query in reports"),
    Incident("INC-140", "Email delivery delay", "SEV2", "Resolved", "Carol", 45, 8000, "SES rate limit hit"),
    Incident("INC-139", "Login page CSS broken", "SEV4", "Resolved", "Dave", 180, 200, "CDN cache stale"),
    Incident("INC-138", "API latency spike", "SEV2", "Resolved", "Alice", 35, 12000, "Redis cluster failover"),
]

print("=== Recent Incidents ===")
for inc in incidents:
    print(f"  [{inc.severity}] {inc.id}: {inc.title}")
    print(f"    Status: {inc.status} | Duration: {inc.duration_min}min")
    print(f"    Commander: {inc.commander} | Affected: {inc.users_affected:,}")
    print(f"    Root Cause: {inc.root_cause}")
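The severity levels above each carry a target MTTR, so a natural follow-up is to check resolved incidents against those targets. The following is a minimal sketch under stated assumptions: `TARGET_MTTR_MIN`, `ResolvedIncident`, `met_target`, and `mttr_by_severity` are hypothetical names introduced for illustration and are not part of Incident.io. The durations are copied from the incident list above.

```python
from dataclasses import dataclass

# Target MTTR in minutes per severity, taken from the severity levels above
# (30 minutes, 1 hour, 4 hours, 24 hours).
TARGET_MTTR_MIN = {"SEV1": 30, "SEV2": 60, "SEV3": 240, "SEV4": 1440}


@dataclass
class ResolvedIncident:
    """Hypothetical, trimmed-down record: only the fields the check needs."""
    id: str
    severity: str
    duration_min: int


def met_target(incident: ResolvedIncident) -> bool:
    """Return True if the incident was resolved within its severity's target."""
    return incident.duration_min <= TARGET_MTTR_MIN[incident.severity]


def mttr_by_severity(incidents):
    """Average resolution time in minutes, grouped by severity."""
    durations = {}
    for inc in incidents:
        durations.setdefault(inc.severity, []).append(inc.duration_min)
    return {sev: sum(d) / len(d) for sev, d in durations.items()}


# Same IDs, severities, and durations as the incident list above.
sample = [
    ResolvedIncident("INC-142", "SEV1", 25),
    ResolvedIncident("INC-141", "SEV3", 120),
    ResolvedIncident("INC-140", "SEV2", 45),
    ResolvedIncident("INC-139", "SEV4", 180),
    ResolvedIncident("INC-138", "SEV2", 35),
]

for inc in sample:
    verdict = "within target" if met_target(inc) else "over target"
    print(f"{inc.id} [{inc.severity}] {inc.duration_min}min: {verdict}")

print(mttr_by_severity(sample))
```

Grouping by severity rather than averaging everything together keeps one long SEV4 from hiding a slow SEV1 response.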

On-call and Escalation

# === On-call Schedule & Escalation ===

# On-call Schedule (YAML config)
# schedules:
#   - name: primary-oncall
#     timezone: Asia/Bangkok
#     rotation: weekly
#     start_day: monday
#     start_time: "09:00"
#     participants:
#       - alice@company.com
#       - bob@company.com
#       - carol@company.com
#       - dave@company.com
#
# escalation_policies:
#   - name: production-critical
#     steps:
#       - targets: [primary-oncall]
#         timeout_minutes: 5
#       - targets: [secondary-oncall]
#         timeout_minutes: 5
#       - targets: [engineering-manager]
#         timeout_minutes: 10
#       - targets: [vp-engineering]
#         timeout_minutes: 0  # final escalation

# Alert Routing Rules
# routes:
#   - match:
#       severity: critical
#       service: payment
#     notify: production-critical
#     channels: [slack, phone, sms]
#   - match:
#       severity: warning
#     notify: primary-oncall
#     channels: [slack]

from dataclasses import dataclass

@dataclass
class OnCallShift:
    engineer: str
    schedule: str
    start: str
    end: str
    incidents_handled: int
    escalations: int
    status: str

shifts = [
    OnCallShift("Alice", "Primary", "Mon 09:00", "Mon 09:00 (+1w)", 3, 0, "Active"),
    OnCallShift("Bob", "Primary", "Next Mon", "Mon +1w", 0, 0, "Upcoming"),
    OnCallShift("Carol", "Secondary", "Mon 09:00", "Mon 09:00 (+1w)", 1, 1, "Active"),
    OnCallShift("Dave", "Secondary", "Next Mon", "Mon +1w", 0, 0, "Upcoming"),
]

print("\n=== On-call Schedule ===")
for s in shifts:
    print(f"  [{s.status}] {s.engineer} — {s.schedule}")
    print(f"    Shift: {s.start} → {s.end}")
    print(f"    Incidents: {s.incidents_handled} | Escalations: {s.escalations}")

oncall_metrics = {
    "Avg Incidents/week": "4.2",
    "Avg Response Time": "3.5 minutes",
    "Escalation Rate": "8%",
    "MTTR (SEV1)": "28 minutes",
    "MTTR (SEV2)": "52 minutes",
    "False Positive Alert Rate": "12%",
    "On-call Satisfaction": "3.8/5",
}

print("\n\nOn-call Metrics:")
for k, v in oncall_metrics.items():
    print(f"  {k}: {v}")
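The routing rules and escalation policy above can be read as two small functions: one picks the first matching route for an alert, the other lists who has been paged after an alert has gone unacknowledged for a given number of minutes. This is only a sketch of that reading, not Incident.io's actual behavior. `ROUTES`, `PRODUCTION_CRITICAL`, `route_alert`, and `paged_targets` are hypothetical names, and first-match-wins routing is an assumption.

```python
# Steps mirror the production-critical policy above:
# (target, minutes to wait for an acknowledgement before escalating further).
PRODUCTION_CRITICAL = [
    ("primary-oncall", 5),
    ("secondary-oncall", 5),
    ("engineering-manager", 10),
    ("vp-engineering", 0),  # final escalation, nothing after it
]

# Routes mirror the alert routing rules above.
ROUTES = [
    {"match": {"severity": "critical", "service": "payment"},
     "notify": "production-critical", "channels": ["slack", "phone", "sms"]},
    {"match": {"severity": "warning"},
     "notify": "primary-oncall", "channels": ["slack"]},
]


def route_alert(alert, routes=ROUTES):
    """Return the first route whose match fields all equal the alert's fields.

    Assumption: routes are evaluated top to bottom and the first match wins.
    Returns None when no route matches.
    """
    for route in routes:
        if all(alert.get(key) == value for key, value in route["match"].items()):
            return route
    return None


def paged_targets(steps, minutes_unacknowledged):
    """List every target paged so far for an alert nobody has acknowledged."""
    paged = []
    elapsed = 0  # minutes after the alert at which the current step fires
    for target, timeout in steps:
        if minutes_unacknowledged < elapsed:
            break
        paged.append(target)
        elapsed += timeout
    return paged


alert = {"severity": "critical", "service": "payment"}
route = route_alert(alert)
print(route["notify"], route["channels"])
for minutes in (0, 5, 12, 20):
    print(minutes, paged_targets(PRODUCTION_CRITICAL, minutes))
```

With these timeouts the steps fire at minutes 0, 5, 10, and 20, so an unacknowledged critical payment alert reaches vp-engineering 20 minutes after it starts.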

Post-mortem

# === Blameless Post-mortem ===

# Post-mortem Template
# ## Incident INC-142: Payment API 500 Errors
# **Date:** 2025-01-20
# **Duration:** 25 minutes (14:05 - 14:30)
# **Severity:** SEV1
# **Commander:** Alice
#
# ### Impact
# - 15,000 users unable to make payments
# - Estimated revenue loss: 150,000 THB
# - 500 support tickets created
#
# ### Timeline
# - 14:05 Alert: Payment API error rate > 5%
# - 14:07 On-call Alice acknowledged
# - 14:10 Incident channel created
# - 14:12 Root cause identified: DB connection pool
# - 14:15 Mitigation: Increased pool size
# - 14:20 Error rate dropping
# - 14:30 Fully resolved
#
# ### Root Cause (5 Whys)
# 1. Why did API return 500? → DB connections timed out
# 2. Why did connections time out? → Pool exhausted
# 3. Why was pool exhausted? → Spike in traffic
# 4. Why spike in traffic? → Marketing campaign launched
# 5. Why wasn't pool sized for spike? → No capacity planning
#
# ### Action Items
# - [ ] Increase DB pool to 200 (Owner: Dave, Due: Jan 22)
# - [ ] Add connection pool monitoring alert (Owner: Bob, Due: Jan 24)
# - [ ] Capacity planning for marketing campaigns (Owner: Carol, Due: Feb 1)
# - [ ] Load test before major campaigns (Owner: Alice, Due: Feb 1)

from dataclasses import dataclass

@dataclass
class ActionItem:
    action: str
    owner: str
    due: str
    priority: str
    status: str

actions = [
    ActionItem("Increase DB connection pool to 200", "Dave", "Jan 22", "P0", "Done"),
    ActionItem("Add pool monitoring alert at 80%", "Bob", "Jan 24", "P0", "Done"),
    ActionItem("Capacity planning process for campaigns", "Carol", "Feb 1", "P1", "In Progress"),
    ActionItem("Load test playbook before major launches", "Alice", "Feb 1", "P1", "In Progress"),
    ActionItem("Auto-scaling for DB connections", "Dave", "Feb 15", "P2", "Planned"),
]

print("Post-mortem Action Items:")
for a in actions:
    print(f"  [{a.status}] [{a.priority}] {a.action}")
    print(f"    Owner: {a.owner} | Due: {a.due}")

reliability_metrics = {
    "Uptime (30d)": "99.95%",
    "Total Incidents (30d)": "18",
    "SEV1 Count": "2",
    "Avg MTTR": "38 minutes",
    "Post-mortems Written": "5",
    "Action Items Completed": "12/15 (80%)",
    "Recurring Incidents": "1 (down from 4)",
}

print("\n\nReliability Dashboard:")
for k, v in reliability_metrics.items():
    print(f"  {k}: {v}")
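The post-mortem timeline above already contains enough information to derive response metrics such as time to acknowledge and time to resolve. The sketch below shows one way to compute them. `TIMELINE`, `minutes_between`, and `first_time` are hypothetical names, and the keyword matching is a simplification that assumes all events fall on the same day.

```python
from datetime import datetime

# Timeline entries copied from the INC-142 post-mortem above: (HH:MM, event).
TIMELINE = [
    ("14:05", "Alert: Payment API error rate > 5%"),
    ("14:07", "On-call Alice acknowledged"),
    ("14:10", "Incident channel created"),
    ("14:12", "Root cause identified: DB connection pool"),
    ("14:15", "Mitigation: Increased pool size"),
    ("14:20", "Error rate dropping"),
    ("14:30", "Fully resolved"),
]


def minutes_between(start, end):
    """Minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def first_time(timeline, keyword):
    """Timestamp of the first event whose text contains keyword (case-insensitive)."""
    for timestamp, event in timeline:
        if keyword.lower() in event.lower():
            return timestamp
    raise ValueError(f"no event containing {keyword!r}")


alert_at = first_time(TIMELINE, "alert")
metrics = {
    "Time to acknowledge": minutes_between(alert_at, first_time(TIMELINE, "acknowledged")),
    "Time to root cause": minutes_between(alert_at, first_time(TIMELINE, "root cause")),
    "Time to mitigate": minutes_between(alert_at, first_time(TIMELINE, "mitigation")),
    "Time to resolve": minutes_between(alert_at, first_time(TIMELINE, "resolved")),
}

for name, minutes in metrics.items():
    print(f"{name}: {minutes} min")
```

The time to resolve comes out at 25 minutes, which matches the duration recorded in the post-mortem header.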

Tips

Blameless: Post-mortems do not blame individuals; they focus on improving the system.
Automate: Use Incident.io to create the incident channel and timeline automatically.
Escalation: Set a clear escalation policy that escalates within 5 minutes.
Action Items: Follow up on post-mortem action items every week.
Practice: Rehearse incident response every month with a Game Day.

What is Incident.io?

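Incident.io is an incident management platform that automatically creates a Slack channel for each incident and covers roles, timeline, status updates, post-mortems, on-call, and escalation. It is aimed at SRE and DevOps teams that want to reduce MTTR.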