Incident.io Internal Developer Platform

Incident.io + IDP

This guide covers using Incident.io as part of an Internal Developer Platform: incident management, Slack automation, postmortems, a service catalog, Backstage, and production use.

Feature	Incident.io	PagerDuty	Opsgenie
Slack Integration	Native (Slack-first)	Add-on	Add-on
Auto Channel	Creates channel automatically	No	No
Postmortem	Auto-generated from Timeline	Manual	Manual
Service Catalog	Built-in	Built-in	Basic
Workflow	Visual builder + API	Event Orchestration	Basic
Status Page	Built-in	Separate product	Built-in

Incident Workflow

# === Incident.io Workflow ===

# Slack command to declare incident
# /incident new
#   Summary: "API response time > 5s for all users"
#   Severity: P1 (Critical)
#   Affected: API Gateway, Payment Service
#
# Automatic actions (P1):
# 1. Create #inc-2024-0142-api-degradation channel
# 2. Invite on-call engineers from rotation
# 3. Assign Incident Commander (IC)
# 4. Post to #incidents channel
# 5. Update Status Page → "Investigating"
# 6. Create Jira ticket INC-142
# 7. Start incident timer
# 8. Page backup engineer if no response in 10 min

# Severity definitions
# severity_config:
#   P1_Critical:
#     description: "Service down, all users affected"
#     response_time: "5 minutes"
#     auto_page: true
#     status_page: true
#     exec_notify: true
#   P2_High:
#     description: "Degraded, >50% users affected"
#     response_time: "15 minutes"
#     auto_page: true
#     status_page: true
#   P3_Medium:
#     description: "Partial impact, <50% users"
#     response_time: "1 hour"
#     auto_page: false
#     status_page: false
#   P4_Low:
#     description: "Minor issue, workaround exists"
#     response_time: "4 hours"
#     auto_page: false

from dataclasses import dataclass

@dataclass
class IncidentRole:
    role: str
    responsibility: str
    who: str
    when: str

roles = [
    IncidentRole("Incident Commander (IC)",
        "Makes decisions, coordinates, delegates tasks, summarizes the situation",
        "Senior Engineer / Engineering Manager",
        "Every P1 and P2 incident"),
    IncidentRole("Communication Lead",
        "Updates the status page; notifies stakeholders and customers",
        "Product Manager / Support Lead",
        "Every P1 and P2 incident that affects customers"),
    IncidentRole("Subject Matter Expert",
        "Resolves the technical problem: debugs and deploys the fix",
        "The engineer who knows the affected system best",
        "Every incident, chosen by affected service"),
    IncidentRole("Scribe",
        "Records the timeline, actions, and decisions",
        "Junior Engineer / Rotation",
        "P1 incidents, to feed the postmortem"),
]

print("=== Incident Roles ===")
for r in roles:
    print(f"  [{r.role}]")
    print(f"    Responsibility: {r.responsibility}")
    print(f"    Who: {r.who}")
    print(f"    When: {r.when}")
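
The severity definitions in the comments above could be turned into a lookup table along these lines. This is a minimal sketch, not Incident.io's configuration format: `SeverityPolicy`, `SEVERITIES`, and `planned_actions` are hypothetical names, and any flag the listing leaves out (for example `status_page` on P4) is assumed to be false.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    response_time: timedelta
    auto_page: bool
    status_page: bool = False   # assumption: omitted flags default to False
    exec_notify: bool = False


# Values copied from the severity_config comments above.
SEVERITIES = {
    "P1": SeverityPolicy("Service down, all users affected",
                         timedelta(minutes=5), True, True, True),
    "P2": SeverityPolicy("Degraded, >50% users affected",
                         timedelta(minutes=15), True, True),
    "P3": SeverityPolicy("Partial impact, <50% users",
                         timedelta(hours=1), False),
    "P4": SeverityPolicy("Minor issue, workaround exists",
                         timedelta(hours=4), False),
}


def planned_actions(severity: str) -> list[str]:
    """Return the flag-driven automatic actions for a severity level."""
    policy = SEVERITIES[severity]
    actions = []
    if policy.auto_page:
        actions.append("page on-call engineer")
    if policy.status_page:
        actions.append("update status page")
    if policy.exec_notify:
        actions.append("notify executives")
    return actions


for level in SEVERITIES:
    print(level, "->", planned_actions(level))
```

Keeping the policy in one table means the paging, status page, and executive notification decisions all read from the same source, so changing a severity definition does not require touching the workflow logic.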

Service Catalog

# === Service Catalog ===

# Incident.io Catalog or Backstage catalog-info.yaml
# apiVersion: backstage.io/v1alpha1
# kind: Component
# metadata:
#   name: api-gateway
#   description: Main API Gateway (Kong)
#   annotations:
#     incident.io/service: api-gateway
#     pagerduty.com/service-id: PXXXXXX
#     github.com/project-slug: myorg/api-gateway
# spec:
#   type: service
#   lifecycle: production
#   owner: platform-team
#   dependsOn:
#     - component:auth-service
#     - component:payment-service
#   providesApis:
#     - api-gateway-rest

from dataclasses import dataclass

@dataclass
class CatalogService:
    service: str
    owner: str
    tier: str
    oncall: str
    dependencies: str
    runbook: str

services = [
    CatalogService("API Gateway",
        "Platform Team", "Tier 1 (Critical)",
        "platform-oncall rotation",
        "Auth Service, Rate Limiter, Config Service",
        "https://wiki/runbooks/api-gateway"),
    CatalogService("Payment Service",
        "Payment Team", "Tier 1 (Critical)",
        "payment-oncall rotation",
        "Stripe API, Database (PG), Message Queue",
        "https://wiki/runbooks/payment"),
    CatalogService("User Service",
        "Backend Team", "Tier 2 (High)",
        "backend-oncall rotation",
        "Database (PG), Cache (Redis), Auth Service",
        "https://wiki/runbooks/user-service"),
    CatalogService("Notification Service",
        "Backend Team", "Tier 3 (Medium)",
        "backend-oncall rotation",
        "Email (SES), SMS (Twilio), Push (FCM)",
        "https://wiki/runbooks/notification"),
]

print("=== Service Catalog ===")
for s in services:
    print(f"  [{s.service}] Owner: {s.owner} | Tier: {s.tier}")
    print(f"    On-call: {s.oncall}")
    print(f"    Dependencies: {s.dependencies}")
    print(f"    Runbook: {s.runbook}")
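
One practical use of the catalog is working out which on-call rotations to involve when a shared dependency fails. The sketch below assumes the comma-separated `dependencies` strings from the listing above; `CatalogEntry`, `dependents_of`, and `rotations_to_page` are hypothetical helpers, not part of Incident.io or Backstage.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    service: str
    oncall: str
    dependencies: str  # comma-separated, as in the catalog listing above


# A subset of the catalog above, reduced to the fields this lookup needs.
CATALOG = [
    CatalogEntry("API Gateway", "platform-oncall rotation",
                 "Auth Service, Rate Limiter, Config Service"),
    CatalogEntry("Payment Service", "payment-oncall rotation",
                 "Stripe API, Database (PG), Message Queue"),
    CatalogEntry("User Service", "backend-oncall rotation",
                 "Database (PG), Cache (Redis), Auth Service"),
]


def dependents_of(dependency: str, catalog: list[CatalogEntry]) -> list[CatalogEntry]:
    """Return catalog entries that list `dependency` as a direct dependency."""
    wanted = dependency.strip().lower()
    return [
        entry for entry in catalog
        if wanted in {d.strip().lower() for d in entry.dependencies.split(",")}
    ]


def rotations_to_page(dependency: str, catalog: list[CatalogEntry]) -> list[str]:
    """Unique on-call rotations for services hit by a failing dependency."""
    rotations: list[str] = []
    for entry in dependents_of(dependency, catalog):
        if entry.oncall not in rotations:
            rotations.append(entry.oncall)
    return rotations


print(rotations_to_page("Auth Service", CATALOG))
print(rotations_to_page("Database (PG)", CATALOG))
```

This only follows direct dependencies. A real catalog would more likely store dependencies as a list and walk them transitively.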

Postmortem Template

# === Postmortem Template ===

from dataclasses import dataclass

@dataclass
class PostmortemSection:
    section: str
    content: str
    auto_generated: bool
    owner: str

sections = [
    PostmortemSection("Summary",
        "2-3 sentences: what happened, what was affected, and for how long",
        True, "Incident Commander"),
    PostmortemSection("Timeline",
        "Sequence of events from start to finish, built from Slack messages",
        True, "Auto from Incident.io"),
    PostmortemSection("Impact",
        "Number of users affected, revenue loss, error rate, duration",
        False, "IC + Product Manager"),
    PostmortemSection("Root Cause",
        "The underlying cause and why it happened, using the 5 Whys",
        False, "Subject Matter Expert"),
    PostmortemSection("Action Items",
        "Preventive actions, each with an owner, deadline, and priority",
        False, "IC + Team"),
    PostmortemSection("Lessons Learned",
        "Lessons: what went well and what to improve",
        False, "Everyone involved"),
]

print("=== Postmortem Template ===")
for s in sections:
    auto = "Auto" if s.auto_generated else "Manual"
    print(f"  [{s.section}] ({auto}) Owner: {s.owner}")
    print(f"    Content: {s.content}")
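
The Timeline section is described as being built from Slack messages. How Incident.io does this internally is not covered here; the sketch below only illustrates the general idea of filtering key messages and ordering them by time. `ChannelMessage` and its `pinned` flag are hypothetical and do not reflect the Slack or Incident.io API schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChannelMessage:
    # Hypothetical shape, for illustration only.
    sent_at: datetime
    author: str
    text: str
    pinned: bool = False


def build_timeline(messages: list[ChannelMessage]) -> list[str]:
    """Keep pinned messages, order them by time, and format one line each."""
    key_events = sorted((m for m in messages if m.pinned), key=lambda m: m.sent_at)
    return [f"{m.sent_at:%H:%M} {m.author}: {m.text}" for m in key_events]


messages = [
    ChannelMessage(datetime(2024, 1, 1, 10, 12), "bob",
                   "Rolled back gateway config", pinned=True),
    ChannelMessage(datetime(2024, 1, 1, 10, 5), "alice",
                   "looking at dashboards"),
    ChannelMessage(datetime(2024, 1, 1, 10, 2), "alice",
                   "Declared P1: API latency > 5s", pinned=True),
]

for line in build_timeline(messages):
    print(line)
```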

Tips

Slack-first: Run Incident.io inside Slack, where everyone already works, so nobody has to switch tools.
Severity: Define clear P1-P4 severity levels so everyone shares the same understanding.
Blameless: Postmortems do not blame people; they examine the system and improve the process.
Action Items: Every action item needs an owner and a deadline, and is tracked until it is done.
Catalog: Keep the Service Catalog up to date, including who owns each service and who is on call.
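
Tracking action items until they are done can be as simple as a periodic check for open items past their deadline. A minimal sketch, assuming a hypothetical `ActionItem` record and made-up sample data:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    deadline: date
    done: bool = False


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose deadline has passed, oldest deadline first."""
    late = [item for item in items if not item.done and item.deadline < today]
    return sorted(late, key=lambda item: item.deadline)


# Hypothetical sample data.
items = [
    ActionItem("Add latency alert on API Gateway", "platform-team", date(2024, 2, 1)),
    ActionItem("Document rollback steps in runbook", "backend-team", date(2024, 1, 15)),
    ActionItem("Tune connection pool", "payment-team", date(2024, 1, 10), done=True),
]

for item in overdue_items(items, today=date(2024, 1, 20)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.deadline})")
```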

Best Practices for Developers

Writing good code is not just about making a program work; it also has to be readable, maintainable, and able to scale. The SOLID principles are a foundation every developer should understand: Single Responsibility (each class does one job), Open-Closed (open for extension, closed for modification), Liskov Substitution (a subclass must be usable in place of its parent), Interface Segregation (keep interfaces small and focused), and Dependency Inversion (depend on abstractions, not implementations).

Testing is just as essential. Aim for unit tests covering at least 80% of the code base, use integration tests to check that modules work together, and add E2E tests for critical user flows. Popular tools such as Jest, Pytest, and JUnit make writing tests easier.

For version control with Git, choose a branching strategy that fits the team, such as Git Flow for large projects or Trunk-Based Development for teams that deploy often. Review every pull request, and use a CI/CD pipeline for automated testing and deployment.

Pros and Cons

Pros	Cons
High efficiency: fast and accurate, with less repetitive work	Takes time to learn at first; the learning curve is steep
Large community with plenty of help and learning resources	Some features may still be unstable or change often between versions
Integrates with a wide range of other tools and services	Costs can be high for an enterprise license or cloud service
Open source, or offers a free version to get started	Requires adequate hardware or infrastructure

The comparison suggests the pros outweigh the cons, particularly in efficiency and the ability to scale. Most of the cons can be addressed through systematic learning and appropriate resource planning.

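What is Incident.io?

Incident.io is an incident management platform that works inside Slack. It provides incident channels, roles, a timeline, a status page, postmortems, and a catalog; integrates with PagerDuty, Jira, and GitHub; exposes an API; and uses automation with the aim of reducing MTTR.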
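
MTTR is read here as mean time to resolve: the average of resolved time minus declared time across incidents. Teams define the metric differently, so treat this as one possible definition. The function name and sample data below are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - declared) over a list of resolved incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - declared for declared, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: (declared, resolved) pairs.
sample = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),   # 30 minutes
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 15, 30)),   # 90 minutes
]

print(mean_time_to_resolve(sample))  # 1:00:00
```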

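What is an Internal Developer Platform?

An Internal Developer Platform (IDP) is a self-service platform for developers that brings together a service catalog, CI/CD, infrastructure, observability, incident management, and documentation. Examples include Backstage, Port, Cortex, and OpsLevel.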

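How do you set up an incident workflow?

Install the Slack app, define severities P1-P4, and define the roles (IC, Communication Lead, SME) along with escalation rules. Then build workflows that create the incident channel automatically, page responders, update the status page, and open a Jira ticket. Add custom fields such as Affected Service.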

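How do you run a postmortem?

Start from the timeline generated automatically from Slack, then write the summary, impact, and root cause (using the 5 Whys). Record action items with an owner and a deadline, capture lessons learned, keep the review blameless, and follow up until the actions are complete.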

Summary

Incident.io fits into an Internal Developer Platform by handling incident management in Slack: automation, postmortems, a service catalog, and clearly defined roles, severities, and workflows for production.