SiamCafe · Blog
Pulumi IaC Post-mortem Analysis: Analyzing Infrastructure as Code Incidents


Published 28 May 2569 BE (2026 CE)

Pulumi IaC Post-mortem

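This article looks at Pulumi as an Infrastructure as Code tool (Python and TypeScript) and at how to run a blameless post-mortem analysis when infrastructure fails: finding the root cause, detecting drift, managing state, and enforcing Policy as Code across AWS, Azure, GCP and Kubernetes.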

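| IaC Tool   | Language        | State               | Learning Curve | Testing   |
|------------|-----------------|---------------------|----------------|-----------|
| Pulumi     | Python/TS/Go/C# | Pulumi Cloud/S3     | Medium         | Unit Test |
| Terraform  | HCL             | S3/Terraform Cloud  | Low to medium  | Terratest |
| CDK (AWS)  | Python/TS/Java  | CloudFormation      | Medium         | Unit Test |
| Crossplane | YAML            | Kubernetes          | High           | kubectl   |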

Pulumi Infrastructure Code

=== Pulumi Python Infrastructure ===

# Create a new Pulumi Python project and install the AWS provider:
#   pulumi new python
#   pip install pulumi-aws

# __main__.py
import pulumi
import pulumi_aws as aws

# VPC
vpc = aws.ec2.Vpc("main-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    tags={"Name": "production-vpc", "Environment": "prod"},
)

# Subnets
public_subnet = aws.ec2.Subnet("public-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.1.0/24",
    availability_zone="ap-southeast-1a",
    map_public_ip_on_launch=True,
)

# Security Group: allow HTTP/HTTPS in, everything out
web_sg = aws.ec2.SecurityGroup("web-sg",
    vpc_id=vpc.id,
    ingress=[
        {"protocol": "tcp", "from_port": 80, "to_port": 80,
         "cidr_blocks": ["0.0.0.0/0"]},
        {"protocol": "tcp", "from_port": 443, "to_port": 443,
         "cidr_blocks": ["0.0.0.0/0"]},
    ],
    egress=[
        {"protocol": "-1", "from_port": 0, "to_port": 0,
         "cidr_blocks": ["0.0.0.0/0"]},
    ],
)

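# RDS
# Note: no subnet group or security group is attached here, so the instance
# uses the account defaults rather than the VPC defined above.
db = aws.rds.Instance("main-db",
    engine="postgres",
    engine_version="15",
    instance_class="db.t3.medium",
    allocated_storage=100,
    db_name="production",
    # "admin" is a reserved word for PostgreSQL on RDS, so use another name
    username="dbadmin",
    password=pulumi.Config().require_secret("db_password"),
    skip_final_snapshot=False,
    # Needed when skip_final_snapshot is False, otherwise deletion fails
    final_snapshot_identifier="main-db-final",
    backup_retention_period=7,
    multi_az=True,
)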

pulumi.export("vpc_id", vpc.id)

pulumi.export("db_endpoint", db.endpoint)
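The CIDR blocks in the program above are plain strings, so a typo only surfaces when the cloud provider rejects the deployment. A minimal sketch of a pre-deploy sanity check, using only the Python standard library, is shown here. The helper name `subnet_fits` is hypothetical and not part of Pulumi.

```python
import ipaddress


def subnet_fits(vpc_cidr: str, subnet_cidr: str) -> bool:
    """Return True when subnet_cidr lies entirely inside vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vpc)


# Values taken from the VPC and subnet definitions in the article
print(subnet_fits("10.0.0.0/16", "10.0.1.0/24"))  # True
# A subnet outside the VPC range is rejected
print(subnet_fits("10.0.0.0/16", "10.1.1.0/24"))  # False
```

A check like this could run in a unit test or at the top of the program before any resource is declared.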

CLI Commands

pulumi preview: show the pending changes before deploying
pulumi up: deploy the infrastructure
pulumi refresh: compare state with the real cloud resources to detect drift
pulumi destroy: delete every resource in the stack
pulumi stack ls: list stacks
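One way to make "preview before up" a habit is to script the command sequence rather than type it. The sketch below only builds the argument lists; the function names are hypothetical, and the stack name is whatever your project uses. It relies on the `--stack`, `--diff`, `--yes` and `--expect-no-changes` CLI flags.

```python
from typing import List


def deploy_commands(stack: str) -> List[List[str]]:
    """Build a preview-then-deploy sequence for one stack."""
    return [
        ["pulumi", "preview", "--stack", stack, "--diff"],
        ["pulumi", "up", "--stack", stack, "--yes"],
    ]


def drift_check_command(stack: str) -> List[str]:
    """Build a refresh command that exits with an error when drift is found."""
    return ["pulumi", "refresh", "--stack", stack, "--yes", "--expect-no-changes"]


for cmd in deploy_commands("production"):
    print(" ".join(cmd))
print(" ".join(drift_check_command("production")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that a failed preview stops the script before `pulumi up` runs.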

from dataclasses import dataclass

@dataclass
class PulumiResource:
    name: str
    type: str
    status: str
    provider: str

# Example snapshot of the resources in a stack
resources = [
    PulumiResource("main-vpc", "aws:ec2:Vpc", "created", "aws"),
    PulumiResource("public-subnet", "aws:ec2:Subnet", "created", "aws"),
    PulumiResource("web-sg", "aws:ec2:SecurityGroup", "created", "aws"),
    PulumiResource("main-db", "aws:rds:Instance", "created", "aws"),
    PulumiResource("web-cluster", "aws:ecs:Cluster", "updated", "aws"),
    PulumiResource("api-service", "aws:ecs:Service", "updated", "aws"),
]

print("=== Pulumi Stack Resources ===")
for r in resources:
    print(f"  [{r.status}] {r.name} ({r.type})")

Post-mortem Template

=== Post-mortem Analysis ===


## Incident: Database Outage due to IaC Drift

**Date:** 2024-03-15

**Duration:** 45 minutes

**Severity:** P1 - Critical

**Author:** Platform Team

### Timeline

  • 14:00 — Alert: Database connection errors
  • 14:05 — On-call acknowledges, starts investigation
  • 14:10 — Found: Security Group rules changed manually
  • 14:15 — Root cause identified: Manual SG change blocked DB port
  • 14:20 — pulumi refresh to detect full drift
  • 14:25 — pulumi up to restore correct state
  • 14:30 — Verified: Database connections restored
  • 14:45 — All services healthy, incident resolved
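The timeline above is enough to derive the usual response metrics. This is a small sketch of that arithmetic; the dictionary keys and the helper `minutes_between` are illustrative names, and the timestamps are copied from the timeline.

```python
from datetime import datetime

# Key moments copied from the timeline (same day, 24-hour clock)
timeline = {
    "alert": "14:00",
    "acknowledged": "14:05",
    "root_cause": "14:15",
    "restored": "14:30",
    "resolved": "14:45",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


for event in ("acknowledged", "root_cause", "restored", "resolved"):
    print(f"alert -> {event}: {minutes_between(timeline['alert'], timeline[event])} min")
```

The last value, 45 minutes, matches the duration stated in the incident header.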

### Root Cause

A Security Group was modified manually through the AWS Console to add a temporary rule, and the rule allowing port 5432 was accidentally deleted in the process.

### Impact

  • 45 minutes downtime for all services using PostgreSQL
  • ~500 failed API requests
  • ~200 affected users

### Action Items

1. Enable AWS Config rule to detect SG changes

2. Restrict console write access to Security Groups with IAM, and add a Pulumi Policy (CrossGuard) check that runs on every preview and update

3. Schedule drift detection every 15 minutes

4. Add database connectivity check to health checks
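Action item 4 can start as a plain TCP reachability probe. The sketch below uses only the standard library; `can_connect` is a hypothetical helper, and the host and port in the example call are placeholders for your own database endpoint. It only proves that the port accepts connections, not that PostgreSQL answers queries.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True when a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures
        return False


if __name__ == "__main__":
    # Placeholder endpoint; replace with the exported db_endpoint host
    status = "reachable" if can_connect("127.0.0.1", 5432, timeout=1.0) else "unreachable"
    print(f"database port is {status}")
```

A blocked Security Group rule, as in this incident, would show up here as a timeout and therefore as `False`.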

from dataclasses import dataclass

@dataclass
class PostMortem:
    incident: str
    date: str
    duration: str
    severity: str
    root_cause: str
    action_items: int  # number of action items
    status: str

incidents = [
    PostMortem("DB Outage (IaC Drift)", "2024-03-15", "45 min", "P1",
               "Manual SG change deleted DB port rule", 4, "Resolved"),
    PostMortem("SSL Cert Expired", "2024-02-20", "15 min", "P2",
               "Certificate renewal not in IaC", 3, "Resolved"),
    PostMortem("Wrong Instance Type", "2024-01-10", "2 hours", "P2",
               "Typo in Pulumi config: t3.micro instead of t3.large", 2, "Resolved"),
    PostMortem("State Lock Conflict", "2024-01-05", "30 min", "P3",
               "Two engineers ran pulumi up simultaneously", 3, "Resolved"),
]

print("\n=== Post-mortem Registry ===")
for pm in incidents:
    print(f"  [{pm.severity}] {pm.incident} ({pm.date})")
    print(f"    Duration: {pm.duration} | Root Cause: {pm.root_cause}")
    print(f"    Actions: {pm.action_items} | Status: {pm.status}")
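Because the registry stores durations as free text ("45 min", "2 hours"), totals need a small parsing step. The following is a sketch under the assumption that every duration uses one of those two units; `duration_minutes` is a hypothetical helper, and the severity and duration pairs are copied from the registry.

```python
import re
from collections import defaultdict


def duration_minutes(text: str) -> int:
    """Convert strings such as '45 min' or '2 hours' to minutes."""
    match = re.fullmatch(r"(\d+)\s*(min|minutes?|hours?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * 60 if unit.startswith("hour") else value


# (severity, duration) pairs from the four incidents in the registry
records = [("P1", "45 min"), ("P2", "15 min"), ("P2", "2 hours"), ("P3", "30 min")]

downtime = defaultdict(int)
for severity, duration in records:
    downtime[severity] += duration_minutes(duration)

for severity in sorted(downtime):
    print(f"{severity}: {downtime[severity]} min")
print(f"Total: {sum(downtime.values())} min")  # Total: 210 min
```

Storing the duration as an integer number of minutes in the dataclass would remove the need for parsing altogether.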

Drift Detection and Prevention

=== Drift Detection & Prevention ===

Automated Drift Detection

# GitHub Actions: run every 15 minutes
name: Drift Detection
on:
  schedule:
    - cron: '*/15 * * * *'
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pulumi/actions@v5
        with:
          command: refresh
          stack-name: production
          expect-no-changes: true
        env:
          # Assumes a repository secret named PULUMI_ACCESS_TOKEN
          PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}
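Conceptually, drift detection is a comparison between what the code declares and what actually exists. The sketch below illustrates that idea with plain sets; it is not how Pulumi implements `refresh`. The rule tuples, including the port 5432 and port 8080 entries, are made-up values modelled on the incident described earlier.

```python
from typing import Dict, Set, Tuple

Rule = Tuple[str, int, str]  # (protocol, port, cidr)


def diff_rules(desired: Set[Rule], actual: Set[Rule]) -> Dict[str, Set[Rule]]:
    """Report rules missing from the cloud and rules that were never declared."""
    return {"missing": desired - actual, "unexpected": actual - desired}


# What the code declares (hypothetical values)
desired = {
    ("tcp", 80, "0.0.0.0/0"),
    ("tcp", 443, "0.0.0.0/0"),
    ("tcp", 5432, "10.0.0.0/16"),
}
# What exists after a manual console edit (hypothetical values)
actual = {
    ("tcp", 80, "0.0.0.0/0"),
    ("tcp", 443, "0.0.0.0/0"),
    ("tcp", 8080, "0.0.0.0/0"),
}

drift = diff_rules(desired, actual)
print("missing:", sorted(drift["missing"]))        # the deleted database rule
print("unexpected:", sorted(drift["unexpected"]))  # the temporary rule
```

An empty result in both sets corresponds to the "no changes" outcome that the scheduled workflow expects.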

Pulumi Policy (CrossGuard)

from pulumi_policy import (
    EnforcementLevel,
    PolicyPack,
    ResourceValidationPolicy,
)

def no_public_s3(args, report_violation):
    # Policy checks see the full type token, e.g. "aws:s3/bucket:Bucket"
    if args.resource_type == "aws:s3/bucket:Bucket":
        acl = args.props.get("acl")
        if acl in ("public-read", "public-read-write"):
            report_violation("S3 buckets must not be public")

PolicyPack("security-policies", policies=[
    ResourceValidationPolicy(
        name="no-public-s3",
        description="Prevent public S3 buckets",
        validate=no_public_s3,
        enforcement_level=EnforcementLevel.MANDATORY,
    ),
])
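A validation function like `no_public_s3` is ordinary Python, so its logic can be exercised without the policy SDK or a cloud account. The sketch below re-declares the function with the full type token `aws:s3/bucket:Bucket` and feeds it a stand-in object; `run_policy` and the use of `SimpleNamespace` are test scaffolding invented for this example, not part of `pulumi_policy`.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def no_public_s3(args, report_violation):
    """Same check as the policy above, repeated so this example is self-contained."""
    if args.resource_type == "aws:s3/bucket:Bucket":
        acl = args.props.get("acl")
        if acl in ("public-read", "public-read-write"):
            report_violation("S3 buckets must not be public")


def run_policy(resource_type: str, props: Dict[str, Any]) -> List[str]:
    """Call the policy with a fake args object and collect reported violations."""
    violations: List[str] = []
    fake_args = SimpleNamespace(resource_type=resource_type, props=props)
    no_public_s3(fake_args, violations.append)
    return violations


print(run_policy("aws:s3/bucket:Bucket", {"acl": "public-read"}))
print(run_policy("aws:s3/bucket:Bucket", {"acl": "private"}))
print(run_policy("aws:ec2/vpc:Vpc", {"acl": "public-read"}))
```

Only the first call reports a violation; the private bucket and the non-S3 resource pass.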

prevention = {
    "Drift Detection": "Run pulumi refresh every 15 minutes and alert when drift is found",
    "Policy as Code": "CrossGuard blocks misconfigurations",
    "State Locking": "Lock the state to prevent concurrent updates",
    "Code Review": "PR review for every infrastructure change",
    "Testing": "Unit tests and integration tests before deploying",
    "Audit Log": "Record every change: who did what, and when",
    "Rollback Plan": "Have a rollback plan for every deployment",
    "No Manual Changes": "No console edits; make changes through code only",
}

print("Prevention Strategies:")
for strategy, desc in prevention.items():
    print(f"  [{strategy}]: {desc}")
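State locking is the strategy that addresses the "two engineers ran pulumi up simultaneously" incident. The idea can be illustrated with an exclusive lock file: whoever creates the file first proceeds, and anyone else fails fast. This is a conceptual sketch only and makes no claim about how Pulumi backends implement locking; `stack_lock` and the lock file name are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def stack_lock(path: str):
    """Hold an exclusive lock file for the duration of the with-block."""
    # O_EXCL makes creation fail with FileExistsError if the file already exists
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)


lock_path = os.path.join(tempfile.mkdtemp(), "production.lock")

with stack_lock(lock_path):
    # A second update attempted while the first is still running
    try:
        with stack_lock(lock_path):
            second_run = "acquired"
    except FileExistsError:
        second_run = "blocked"

print(second_run)  # blocked
```

Once the first block exits, the lock file is removed and the next update can acquire it.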

Tips

  • Preview: run pulumi preview every time before pulumi up
  • Drift: check for drift automatically every 15 minutes
  • Blameless: a post-mortem must be blameless; focus on the system, not on who made the mistake
  • Policy: use CrossGuard to prevent misconfigurations
  • Test: write unit tests for infrastructure code

What is Pulumi?

Pulumi is an IaC platform that uses real programming languages such as Python, TypeScript and Go. It supports AWS, Azure, GCP and Kubernetes, and provides state management, previews, policy enforcement and unit testing.