
Pulumi IaC Post-mortem Analysis: Analyzing Infrastructure as Code Problems | SiamCafe Blog

2026-02-21 · อ. บอม, SiamCafe.net · 2410 words

Pulumi IaC Post-mortem

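This article covers Pulumi Infrastructure as Code in Python and TypeScript, and how to run a blameless post-mortem analysis when it fails: finding the root cause, detecting drift, managing state, and enforcing Policy as Code across AWS, Azure, GCP, and Kubernetes.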

IaC Tool | Language | State | Learning Curve | Testing
Pulumi | Python/TS/Go/C# | Pulumi Cloud/S3 | Medium | Unit Test
Terraform | HCL | S3/Terraform Cloud | Low-Medium | Terratest
CDK (AWS) | Python/TS/Java | CloudFormation | Medium | Unit Test
Crossplane | YAML | Kubernetes | High | kubectl

Pulumi Infrastructure Code

# === Pulumi Python Infrastructure ===

# pulumi new python
# pip install pulumi-aws

# __main__.py
# import pulumi
# import pulumi_aws as aws
#
# # VPC
# vpc = aws.ec2.Vpc("main-vpc",
#     cidr_block="10.0.0.0/16",
#     enable_dns_hostnames=True,
#     tags={"Name": "production-vpc", "Environment": "prod"},
# )
#
# # Subnets
# public_subnet = aws.ec2.Subnet("public-subnet",
#     vpc_id=vpc.id,
#     cidr_block="10.0.1.0/24",
#     availability_zone="ap-southeast-1a",
#     map_public_ip_on_launch=True,
# )
#
# # Security Group
# web_sg = aws.ec2.SecurityGroup("web-sg",
#     vpc_id=vpc.id,
#     ingress=[
#         {"protocol": "tcp", "from_port": 80, "to_port": 80,
#          "cidr_blocks": ["0.0.0.0/0"]},
#         {"protocol": "tcp", "from_port": 443, "to_port": 443,
#          "cidr_blocks": ["0.0.0.0/0"]},
#     ],
#     egress=[
#         {"protocol": "-1", "from_port": 0, "to_port": 0,
#          "cidr_blocks": ["0.0.0.0/0"]},
#     ],
# )
#
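# # RDS
# db = aws.rds.Instance("main-db",
#     engine="postgres",
#     engine_version="15",
#     instance_class="db.t3.medium",
#     allocated_storage=100,
#     db_name="production",
#     # "admin" is a reserved word for the PostgreSQL engine on RDS
#     username="dbadmin",
#     password=pulumi.Config().require_secret("db_password"),
#     skip_final_snapshot=False,
#     # Required when skip_final_snapshot is False; the name is illustrative
#     final_snapshot_identifier="main-db-final",
#     backup_retention_period=7,
#     multi_az=True,
# )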
#
# pulumi.export("vpc_id", vpc.id)
# pulumi.export("db_endpoint", db.endpoint)

# CLI Commands
# pulumi preview    - show changes before deploying
# pulumi up         - deploy the infrastructure
# pulumi refresh    - detect drift
# pulumi destroy    - delete everything in the stack
# pulumi stack ls   - list stacks

from dataclasses import dataclass

@dataclass
class PulumiResource:
    name: str
    type: str
    status: str
    provider: str

# Type tokens follow the provider's module/type form, e.g. aws:ec2/vpc:Vpc
resources = [
    PulumiResource("main-vpc", "aws:ec2/vpc:Vpc", "created", "aws"),
    PulumiResource("public-subnet", "aws:ec2/subnet:Subnet", "created", "aws"),
    PulumiResource("web-sg", "aws:ec2/securityGroup:SecurityGroup", "created", "aws"),
    PulumiResource("main-db", "aws:rds/instance:Instance", "created", "aws"),
    PulumiResource("web-cluster", "aws:ecs/cluster:Cluster", "updated", "aws"),
    PulumiResource("api-service", "aws:ecs/service:Service", "updated", "aws"),
]

print("=== Pulumi Stack Resources ===")
for r in resources:
    print(f"  [{r.status}] {r.name} ({r.type})")

Post-mortem Template

# === Post-mortem Analysis ===

# Post-mortem Template
# ## Incident: Database Outage due to IaC Drift
# **Date:** 2024-03-15
# **Duration:** 45 minutes
# **Severity:** P1 - Critical
# **Author:** Platform Team
#
# ### Timeline
# - 14:00 — Alert: Database connection errors
# - 14:05 — On-call acknowledges, starts investigation
# - 14:10 — Found: Security Group rules changed manually
# - 14:15 — Root cause identified: Manual SG change blocked DB port
# - 14:20 — pulumi refresh to detect full drift
# - 14:25 — pulumi up to restore correct state
# - 14:30 — Verified: Database connections restored
# - 14:45 — All services healthy, incident resolved
#
# ### Root Cause
# Engineer manually modified Security Group via AWS Console
# to add temporary rule, accidentally deleted port 5432 rule
#
# ### Impact
# - 45 minutes downtime for all services using PostgreSQL
# - ~500 failed API requests
# - ~200 affected users
#
# ### Action Items
# 1. Enable AWS Config rule to detect SG changes
# 2. Restrict console write access with IAM and validate SG rules with a Pulumi policy (CrossGuard) at deploy time
# 3. Schedule drift detection every 15 minutes
# 4. Add database connectivity check to health checks

@dataclass
class PostMortem:
    incident: str
    date: str
    duration: str
    severity: str
    root_cause: str
    action_items: int
    status: str

incidents = [
    PostMortem("DB Outage (IaC Drift)", "2024-03-15", "45 min", "P1",
               "Manual SG change deleted DB port rule", 4, "Resolved"),
    PostMortem("SSL Cert Expired", "2024-02-20", "15 min", "P2",
               "Certificate renewal not in IaC", 3, "Resolved"),
    PostMortem("Wrong Instance Type", "2024-01-10", "2 hours", "P2",
               "Typo in Pulumi config: t3.micro instead of t3.large", 2, "Resolved"),
    PostMortem("State Lock Conflict", "2024-01-05", "30 min", "P3",
               "Two engineers ran pulumi up simultaneously", 3, "Resolved"),
]

print("\n=== Post-mortem Registry ===")
for pm in incidents:
    print(f"  [{pm.severity}] {pm.incident} ({pm.date})")
    print(f"    Duration: {pm.duration} | Root Cause: {pm.root_cause}")
    print(f"    Actions: {pm.action_items} | Status: {pm.status}")
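The timeline in the template above can also be turned into numbers for the post-mortem record. The following is a minimal sketch, assuming timeline entries are written as same-day `HH:MM` strings; the helper `minutes_between` and the metric names are illustrative and not part of any Pulumi tooling.

```python
from datetime import datetime

# Timeline entries copied from the post-mortem template (same-day HH:MM).
timeline = [
    ("14:00", "Alert: Database connection errors"),
    ("14:05", "On-call acknowledges, starts investigation"),
    ("14:15", "Root cause identified"),
    ("14:30", "Database connections restored"),
    ("14:45", "All services healthy, incident resolved"),
]

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

times = [t for t, _ in timeline]

# Illustrative metric names; each is measured from the first alert.
metrics = {
    "time_to_acknowledge": minutes_between(times[0], times[1]),
    "time_to_root_cause": minutes_between(times[0], times[2]),
    "time_to_restore": minutes_between(times[0], times[3]),
    "total_duration": minutes_between(times[0], times[-1]),
}

print("=== Incident Metrics ===")
for name, value in metrics.items():
    print(f"  {name}: {value} min")
```

Tracking these numbers across incidents shows whether action items such as scheduled drift detection actually shorten detection and recovery.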

Drift Detection and Prevention

# === Drift Detection & Prevention ===

# Automated Drift Detection
# GitHub Actions — Run every 15 min
# name: Drift Detection
# on:
#   schedule:
#     - cron: '*/15 * * * *'
# jobs:
#   detect-drift:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - uses: pulumi/actions@v5
#         with:
#           command: refresh
#           stack-name: production
#           expect-no-changes: true
#         env:
#           PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}

# Pulumi Policy (CrossGuard)
# from pulumi_policy import (
#     EnforcementLevel, PolicyPack, ResourceValidationPolicy
# )
#
# def no_public_s3(args, report_violation):
#     if args.resource_type == "aws:s3/bucket:Bucket":
#         acl = args.props.get("acl")
#         if acl == "public-read" or acl == "public-read-write":
#             report_violation("S3 buckets must not be public")
#
# PolicyPack("security-policies", policies=[
#     ResourceValidationPolicy(
#         name="no-public-s3",
#         description="Prevent public S3 buckets",
#         validate=no_public_s3,
#         enforcement_level=EnforcementLevel.MANDATORY,
#     ),
# ])

prevention = {
    "Drift Detection": "Run pulumi refresh every 15 minutes and alert when drift is found",
    "Policy as Code": "CrossGuard blocks misconfiguration",
    "State Locking": "Lock state to prevent concurrent updates",
    "Code Review": "PR review for every infrastructure change",
    "Testing": "Unit tests and integration tests before deploy",
    "Audit Log": "Record every change: who did what, and when",
    "Rollback Plan": "Have a rollback plan for every deploy",
    "No Manual Changes": "No edits through the console; change infrastructure through code only",
}

print("Prevention Strategies:")
for strategy, desc in prevention.items():
    print(f"  [{strategy}]: {desc}")
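The Policy as Code entry above could also cover the "Wrong Instance Type" incident from the registry. The following is a sketch only: it mirrors the `(args, report_violation)` callback shape used by the CrossGuard example, but runs against a stand-in object instead of a real policy pack. The allowlist, the function name, and the assumption that the property key is `instanceType` (camelCase, as in the provider schema) are all hypothetical.

```python
from types import SimpleNamespace

# Hypothetical allowlist; adjust to the sizes production actually permits.
ALLOWED_INSTANCE_TYPES = {"t3.large", "t3.xlarge"}

def require_allowed_instance_type(args, report_violation):
    """Report a violation when an EC2 instance type is outside the allowlist."""
    if args.resource_type != "aws:ec2/instance:Instance":
        return  # Not an EC2 instance; nothing to check.
    # Assumption: the property key is camelCase, as in the provider schema.
    instance_type = args.props.get("instanceType")
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        report_violation(
            f"instance type {instance_type!r} is not in the allowlist"
        )

# Stand-ins for the args object a policy engine would pass in.
typo = SimpleNamespace(
    resource_type="aws:ec2/instance:Instance",
    props={"instanceType": "t3.micro"},
)
intended = SimpleNamespace(
    resource_type="aws:ec2/instance:Instance",
    props={"instanceType": "t3.large"},
)

violations = []
require_allowed_instance_type(typo, violations.append)
require_allowed_instance_type(intended, violations.append)

print("Violations:")
for v in violations:
    print(f"  {v}")
```

Keeping the check as a plain function makes it easy to unit test locally before wiring it into a `ResourceValidationPolicy`.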

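Tips

What is Pulumi?

Pulumi is an IaC platform that uses real programming languages (Python, TypeScript, Go) to manage AWS, Azure, GCP, and Kubernetes, with state management, preview, policy, and unit testing.

What is post-mortem analysis?

A post-mortem is an analysis carried out after an incident, done blamelessly. It records the timeline, root cause, impact, action items, and follow-up. It is part of SRE and DevOps culture and aims to prevent the same incident from recurring.

How do Pulumi and Terraform differ?

Pulumi uses general-purpose programming languages, which brings IDE autocomplete, type safety, and unit testing. Terraform uses HCL, a DSL that is easier to pick up, and it has a larger community and more providers.

What is drift detection?

Drift detection checks whether the real infrastructure still matches the code. Manual changes cause drift; pulumi refresh shows the diff, pulumi up restores the declared state, and a team policy against manual changes keeps it from recurring.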
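The idea behind drift detection can be sketched in plain Python as a set comparison between what the code declares and what the cloud actually has. This is a simplified illustration, not how `pulumi refresh` is implemented; the function name is made up, and the port values are assumptions modeled on the Security Group incident (8080 stands in for the temporary rule, 5432 for the deleted one).

```python
def diff_ingress_ports(desired: set, actual: set) -> dict:
    """Compare declared ports with live ports and list the differences."""
    return {
        "missing": sorted(desired - actual),      # declared in code, gone in cloud
        "unexpected": sorted(actual - desired),   # present in cloud, not in code
    }

# Ports declared in code (assumed for this example).
desired_ports = {80, 443, 5432}
# Ports found on the live Security Group after a manual console edit (assumed).
actual_ports = {80, 443, 8080}

drift = diff_ingress_ports(desired_ports, actual_ports)
has_drift = any(drift.values())

print(f"Drift detected: {has_drift}")
print(f"  Missing rules:    {drift['missing']}")
print(f"  Unexpected rules: {drift['unexpected']}")
```

A scheduled job would raise an alert whenever `has_drift` is true, which is the role `expect-no-changes` plays in the GitHub Actions workflow shown earlier.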

Summary

Pulumi brings IaC to Python and TypeScript, and a blameless post-mortem analysis with a clear root cause and timeline turns each incident into concrete prevention: drift detection, CrossGuard policy, state management, preview, testing, and CI/CD across AWS, Azure, and GCP.