Pulumi IaC Post-mortem Analysis: Analyzing Infrastructure as Code Incidents
Topics: Pulumi, Infrastructure as Code, Python, TypeScript, post-mortem analysis, root cause, blameless culture, drift detection, state management, AWS, Azure, GCP, Kubernetes, policy as code.
| IaC Tool | Language | State Backend | Learning Curve | Testing |
|---|---|---|---|---|
| Pulumi | Python/TS/Go/C# | Pulumi Cloud/S3 | Medium | Unit Test |
| Terraform | HCL | S3/Terraform Cloud | Low to medium | Terratest |
| CDK (AWS) | Python/TS/Java | CloudFormation | Medium | Unit Test |
| Crossplane | YAML | Kubernetes | High | kubectl |
Pulumi Infrastructure Code
=== Pulumi Python Infrastructure ===
# Setup (run in a shell):
#   pulumi new python
#   pip install pulumi-aws

# __main__.py
import pulumi
import pulumi_aws as aws

# VPC
vpc = aws.ec2.Vpc(
    "main-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    tags={"Name": "production-vpc", "Environment": "prod"},
)

# Subnets
public_subnet = aws.ec2.Subnet(
    "public-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.1.0/24",
    availability_zone="ap-southeast-1a",
    map_public_ip_on_launch=True,
)

# Security Group: inbound HTTP/HTTPS from anywhere, all outbound traffic
web_sg = aws.ec2.SecurityGroup(
    "web-sg",
    vpc_id=vpc.id,
    ingress=[
        {"protocol": "tcp", "from_port": 80, "to_port": 80,
         "cidr_blocks": ["0.0.0.0/0"]},
        {"protocol": "tcp", "from_port": 443, "to_port": 443,
         "cidr_blocks": ["0.0.0.0/0"]},
    ],
    egress=[
        {"protocol": "-1", "from_port": 0, "to_port": 0,
         "cidr_blocks": ["0.0.0.0/0"]},
    ],
)

# RDS
# Note: without db_subnet_group_name and vpc_security_group_ids, this
# instance is not placed in the VPC defined above.
db = aws.rds.Instance(
    "main-db",
    engine="postgres",
    engine_version="15",
    instance_class="db.t3.medium",
    allocated_storage=100,
    db_name="production",
    # "admin" is a reserved word for RDS PostgreSQL master usernames
    username="dbadmin",
    # Stored encrypted; set with: pulumi config set --secret db_password <value>
    password=pulumi.Config().require_secret("db_password"),
    skip_final_snapshot=False,
    # Required when skip_final_snapshot is False (the name is illustrative)
    final_snapshot_identifier="main-db-final",
    backup_retention_period=7,
    multi_az=True,
)

pulumi.export("vpc_id", vpc.id)
pulumi.export("db_endpoint", db.endpoint)
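The subnet CIDR in the program above has to fall inside the VPC CIDR, otherwise AWS rejects the subnet at deploy time. A minimal sketch of a pre-deploy check, using only Python's standard `ipaddress` module; the helper name `subnet_fits_vpc` is hypothetical and not part of Pulumi.

```python
import ipaddress


def subnet_fits_vpc(vpc_cidr: str, subnet_cidr: str) -> bool:
    """Return True if subnet_cidr lies entirely inside vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vpc)


# The values used by the VPC and public subnet above.
print(subnet_fits_vpc("10.0.0.0/16", "10.0.1.0/24"))
# A subnet outside the VPC range, for contrast.
print(subnet_fits_vpc("10.0.0.0/16", "10.1.0.0/24"))
```

A check like this could run in a unit test or CI step so that a mistyped CIDR is caught before `pulumi preview`.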
CLI Commands
pulumi preview    # show changes before deploying
pulumi up         # deploy the infrastructure
pulumi refresh    # sync state with the real resources, surfacing drift
pulumi destroy    # delete everything in the stack
pulumi stack ls   # list stacks
from dataclasses import dataclass


@dataclass
class PulumiResource:
    name: str
    type: str
    status: str
    provider: str


resources = [
    PulumiResource("main-vpc", "aws:ec2:Vpc", "created", "aws"),
    PulumiResource("public-subnet", "aws:ec2:Subnet", "created", "aws"),
    PulumiResource("web-sg", "aws:ec2:SecurityGroup", "created", "aws"),
    PulumiResource("main-db", "aws:rds:Instance", "created", "aws"),
    PulumiResource("web-cluster", "aws:ecs:Cluster", "updated", "aws"),
    PulumiResource("api-service", "aws:ecs:Service", "updated", "aws"),
]

print("=== Pulumi Stack Resources ===")
for r in resources:
    print(f" [{r.status}] {r.name} ({r.type})")
Post-mortem Template
=== Post-mortem Analysis ===
## Incident: Database Outage due to IaC Drift
**Date:** 2024-03-15
**Duration:** 45 minutes
**Severity:** P1 - Critical
**Author:** Platform Team
### Timeline
- 14:00 — Alert: Database connection errors
- 14:05 — On-call acknowledges, starts investigation
- 14:10 — Found: Security Group rules changed manually
- 14:15 — Root cause identified: Manual SG change blocked DB port
- 14:20 — pulumi refresh to detect full drift
- 14:25 — pulumi up to restore correct state
- 14:30 — Verified: Database connections restored
- 14:45 — All services healthy, incident resolved
### Root Cause
A Security Group was modified manually via the AWS Console to add a
temporary rule, and the port 5432 rule was accidentally deleted in the process.
### Impact
- 45 minutes downtime for all services using PostgreSQL
- ~500 failed API requests
- ~200 affected users
### Action Items
1. Enable AWS Config rule to detect SG changes
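2. Add a Pulumi CrossGuard policy to validate Security Group rules on every deploy, and restrict console write access with IAM to prevent manual changes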
3. Schedule drift detection every 15 minutes
4. Add database connectivity check to health checks
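The timeline in the template above can be turned into response metrics. A minimal sketch using only the standard library; the metric names (`time_to_root_cause` and so on) are illustrative labels chosen here, not part of the template.

```python
from datetime import datetime

# Key timestamps copied from the timeline above.
ALERT = "14:00"
ROOT_CAUSE = "14:15"
RESTORED = "14:30"
RESOLVED = "14:45"


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


metrics = {
    "time_to_root_cause": minutes_between(ALERT, ROOT_CAUSE),
    "time_to_restore": minutes_between(ALERT, RESTORED),
    "time_to_resolve": minutes_between(ALERT, RESOLVED),
}

for name, minutes in metrics.items():
    print(f"{name}: {minutes} min")
```

Tracking these numbers across incidents shows whether action items such as scheduled drift detection actually shorten detection and recovery.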
from dataclasses import dataclass


@dataclass
class PostMortem:
    incident: str
    date: str
    duration: str
    severity: str
    root_cause: str
    action_items: int  # number of action items
    status: str


incidents = [
    PostMortem("DB Outage (IaC Drift)", "2024-03-15", "45 min", "P1",
               "Manual SG change deleted DB port rule", 4, "Resolved"),
    PostMortem("SSL Cert Expired", "2024-02-20", "15 min", "P2",
               "Certificate renewal not in IaC", 3, "Resolved"),
    PostMortem("Wrong Instance Type", "2024-01-10", "2 hours", "P2",
               "Typo in Pulumi config: t3.micro instead of t3.large", 2, "Resolved"),
    PostMortem("State Lock Conflict", "2024-01-05", "30 min", "P3",
               "Two engineers ran pulumi up simultaneously", 3, "Resolved"),
]

print("\n=== Post-mortem Registry ===")
for pm in incidents:
    print(f" [{pm.severity}] {pm.incident} ({pm.date})")
    print(f" Duration: {pm.duration} | Root Cause: {pm.root_cause}")
    print(f" Actions: {pm.action_items} | Status: {pm.status}")
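The "Wrong Instance Type" incident in the registry above is the kind of mistake a config check can catch before deploy. A minimal sketch, assuming a per-environment allowlist; `ALLOWED_INSTANCE_TYPES` and `check_instance_type` are hypothetical names and the sizes listed are placeholders, not a recommendation.

```python
# Hypothetical per-environment allowlist; adjust to your own sizing policy.
ALLOWED_INSTANCE_TYPES = {
    "dev": {"t3.micro", "t3.small"},
    "prod": {"t3.large", "t3.xlarge"},
}


def check_instance_type(environment: str, instance_type: str) -> None:
    """Raise ValueError if instance_type is not allowed in environment."""
    allowed = ALLOWED_INSTANCE_TYPES.get(environment)
    if allowed is None:
        raise ValueError(f"unknown environment: {environment!r}")
    if instance_type not in allowed:
        raise ValueError(
            f"{instance_type!r} is not allowed in {environment!r}; "
            f"expected one of {sorted(allowed)}"
        )


check_instance_type("prod", "t3.large")  # passes silently

try:
    check_instance_type("prod", "t3.micro")  # the typo from the incident
except ValueError as err:
    print(f"config rejected: {err}")
```

Because both `t3.micro` and `t3.large` are valid instance types, the cloud provider accepts either; only a rule that knows what production should look like can flag the typo.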
Drift Detection and Prevention
=== Drift Detection & Prevention ===
Automated Drift Detection
GitHub Actions: run every 15 minutes
name: Drift Detection
on:
  schedule:
    - cron: '*/15 * * * *'
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Cloud provider credentials must also be available to this step
      - uses: pulumi/actions@v5
        with:
          command: refresh
          stack-name: production
          expect-no-changes: true
        env:
          PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}
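Conceptually, drift detection compares what the code declares with what the cloud reports. The sketch below illustrates that comparison for Security Group rules in plain Python. It is not how Pulumi implements `pulumi refresh`; the `Rule` type, `diff_rules` helper, and sample data are all hypothetical.

```python
from typing import Dict, NamedTuple, Set


class Rule(NamedTuple):
    """One ingress rule, reduced to the fields compared here."""
    protocol: str
    from_port: int
    to_port: int
    cidr: str


def diff_rules(desired: Set[Rule], actual: Set[Rule]) -> Dict[str, Set[Rule]]:
    """Return rules missing from, and unexpected in, the live resource."""
    return {
        "missing": desired - actual,
        "unexpected": actual - desired,
    }


# Hypothetical data: what the code declares vs. what the cloud reports.
desired = {
    Rule("tcp", 443, 443, "0.0.0.0/0"),
    Rule("tcp", 5432, 5432, "10.0.0.0/16"),
}
actual = {
    Rule("tcp", 443, 443, "0.0.0.0/0"),
    Rule("tcp", 22, 22, "203.0.113.10/32"),  # temporary manual rule
}

drift = diff_rules(desired, actual)
has_drift = any(drift.values())

for kind, rules in drift.items():
    for rule in sorted(rules):
        print(f"{kind}: {rule.protocol} {rule.from_port}-{rule.to_port} from {rule.cidr}")
print("drift detected" if has_drift else "no drift")
```

The sample data mirrors the incident in the post-mortem: a temporary rule was added by hand and the port 5432 rule disappeared, so one rule shows up as unexpected and one as missing.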
Pulumi Policy (CrossGuard)
from pulumi_policy import (
    EnforcementLevel, PolicyPack, ResourceValidationPolicy
)


def no_public_s3(args, report_violation):
    # The AWS provider's type token for aws.s3.Bucket is "aws:s3/bucket:Bucket"
    if args.resource_type == "aws:s3/bucket:Bucket":
        acl = args.props.get("acl")
        if acl == "public-read" or acl == "public-read-write":
            report_violation("S3 buckets must not be public")


PolicyPack("security-policies", policies=[
    ResourceValidationPolicy(
        name="no-public-s3",
        description="Prevent public S3 buckets",
        validate=no_public_s3,
        enforcement_level=EnforcementLevel.MANDATORY,
    ),
])
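Because the validation function is plain Python, its logic can be exercised without the Pulumi engine. A minimal sketch: the function is repeated here so the block runs on its own, and it uses the type token `aws:s3/bucket:Bucket` that the AWS provider assigns to `aws.s3.Bucket`. The `run_policy` helper and the `SimpleNamespace` stand-in are hypothetical test scaffolding, not part of the `pulumi_policy` API.

```python
from types import SimpleNamespace


def no_public_s3(args, report_violation):
    """Report a violation for S3 buckets with a public ACL."""
    if args.resource_type == "aws:s3/bucket:Bucket":
        acl = args.props.get("acl")
        if acl in ("public-read", "public-read-write"):
            report_violation("S3 buckets must not be public")


def run_policy(resource_type, props):
    """Call the validator with a stand-in args object; return violations."""
    violations = []
    # Stand-in exposing only the two attributes the validator reads.
    fake_args = SimpleNamespace(resource_type=resource_type, props=props)
    no_public_s3(fake_args, violations.append)
    return violations


print(run_policy("aws:s3/bucket:Bucket", {"acl": "public-read"}))
print(run_policy("aws:s3/bucket:Bucket", {"acl": "private"}))
print(run_policy("aws:ec2/vpc:Vpc", {"acl": "public-read"}))
```

A test like this guards against a policy that silently never fires, for example because the resource type string does not match any real resource.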
prevention = {
    "Drift Detection": "Run pulumi refresh every 15 minutes; alert if drift is found",
    "Policy as Code": "CrossGuard blocks misconfigurations",
    "State Locking": "Lock state to prevent concurrent updates",
    "Code Review": "PR review for every infrastructure change",
    "Testing": "Unit and integration tests before deploy",
    "Audit Log": "Record every change: who did what, and when",
    "Rollback Plan": "Have a rollback plan for every deploy",
    "No Manual Changes": "No console edits; make changes through code only",
}

print("Prevention Strategies:")
for strategy, desc in prevention.items():
    print(f" [{strategy}]: {desc}")
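The "State Locking" strategy above is what prevents two simultaneous deploys from corrupting state. The sketch below illustrates the idea with an exclusive lock file. It is a teaching example only and does not show how any Pulumi backend implements locking; `deploy_lock` and the lock path are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def deploy_lock(path):
    """Hold an exclusive lock file for the duration of a deploy."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another deploy holds {path}") from None
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)


lock_path = os.path.join(tempfile.mkdtemp(), "production.lock")
second_deploy_refused = False

with deploy_lock(lock_path):
    print("first deploy running")
    try:
        with deploy_lock(lock_path):
            print("second deploy running")
    except RuntimeError as err:
        second_deploy_refused = True
        print(f"second deploy refused: {err}")
```

The second attempt fails fast instead of racing the first, and the lock is released even if the deploy raises.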
Tips
- Preview: run pulumi preview every time before pulumi up
- Drift: check for drift automatically every 15 minutes
- Blameless: post-mortems must be blameless; focus on the system, not on blaming people
- Policy: use CrossGuard to prevent misconfigurations
- Test: write unit tests for infrastructure code
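What is Pulumi?
Pulumi is an IaC platform that lets you define infrastructure in real programming languages such as Python, TypeScript, and Go. It targets AWS, Azure, GCP, and Kubernetes, and provides state management, change previews, policy as code, and unit testing.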