Pulumi IaC Post-mortem
This article covers Pulumi Infrastructure as Code in Python and TypeScript, blameless post-mortem analysis and root-cause work, drift detection, state management across AWS, Azure, GCP, and Kubernetes, and Policy as Code.
| IaC Tool | Language | State | Learning Curve | Testing |
|---|---|---|---|---|
| Pulumi | Python/TS/Go/C# | Pulumi Cloud/S3 | Medium | Unit Test |
| Terraform | HCL | S3/Terraform Cloud | Low to medium | Terratest |
| CDK (AWS) | Python/TS/Java | CloudFormation | Medium | Unit Test |
| Crossplane | YAML | Kubernetes | High | kubectl |
Pulumi Infrastructure Code
# === Pulumi Python Infrastructure ===
# pulumi new python
# pip install pulumi-aws
# __main__.py
# import pulumi
# import pulumi_aws as aws
#
# # VPC
# vpc = aws.ec2.Vpc("main-vpc",
#     cidr_block="10.0.0.0/16",
#     enable_dns_hostnames=True,
#     tags={"Name": "production-vpc", "Environment": "prod"},
# )
#
# # Subnets
# public_subnet = aws.ec2.Subnet("public-subnet",
#     vpc_id=vpc.id,
#     cidr_block="10.0.1.0/24",
#     availability_zone="ap-southeast-1a",
#     map_public_ip_on_launch=True,
# )
#
# # Security Group: HTTP and HTTPS in, everything out
# web_sg = aws.ec2.SecurityGroup("web-sg",
#     vpc_id=vpc.id,
#     ingress=[
#         {"protocol": "tcp", "from_port": 80, "to_port": 80,
#          "cidr_blocks": ["0.0.0.0/0"]},
#         {"protocol": "tcp", "from_port": 443, "to_port": 443,
#          "cidr_blocks": ["0.0.0.0/0"]},
#     ],
#     egress=[
#         {"protocol": "-1", "from_port": 0, "to_port": 0,
#          "cidr_blocks": ["0.0.0.0/0"]},
#     ],
# )
#
# # RDS
# db = aws.rds.Instance("main-db",
#     engine="postgres",
#     engine_version="15",
#     instance_class="db.t3.medium",
#     allocated_storage=100,
#     db_name="production",
#     # "admin" is a reserved master username for PostgreSQL on RDS
#     username="dbadmin",
#     password=pulumi.Config().require_secret("db_password"),
#     skip_final_snapshot=False,
#     # Needed on deletion when skip_final_snapshot is False (name is illustrative)
#     final_snapshot_identifier="main-db-final",
#     backup_retention_period=7,
#     multi_az=True,
# )
#
# pulumi.export("vpc_id", vpc.id)
# pulumi.export("db_endpoint", db.endpoint)
# CLI Commands
# pulumi preview  - show changes before deploying
# pulumi up       - deploy infrastructure
# pulumi refresh  - detect drift
# pulumi destroy  - delete everything
# pulumi stack ls - list stacks
from dataclasses import dataclass
from typing import List
@dataclass
class PulumiResource:
    name: str
    type: str
    status: str
    provider: str

resources = [
    PulumiResource("main-vpc", "aws:ec2:Vpc", "created", "aws"),
    PulumiResource("public-subnet", "aws:ec2:Subnet", "created", "aws"),
    PulumiResource("web-sg", "aws:ec2:SecurityGroup", "created", "aws"),
    PulumiResource("main-db", "aws:rds:Instance", "created", "aws"),
    PulumiResource("web-cluster", "aws:ecs:Cluster", "updated", "aws"),
    PulumiResource("api-service", "aws:ecs:Service", "updated", "aws"),
]

print("=== Pulumi Stack Resources ===")
for r in resources:
    print(f" [{r.status}] {r.name} ({r.type})")
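The unit-test idea from the comparison table can be tried without any cloud access. The sketch below is an illustration only, not part of the Pulumi SDK: `subnet_fits_vpc` is a hypothetical helper that checks the subnet CIDR from the listing above against the VPC CIDR before anything is deployed.

```python
import ipaddress


def subnet_fits_vpc(vpc_cidr: str, subnet_cidr: str) -> bool:
    """Return True when subnet_cidr lies entirely inside vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vpc)


# CIDR values taken from the VPC and subnet definitions in the listing.
print(subnet_fits_vpc("10.0.0.0/16", "10.0.1.0/24"))
# A subnet outside the VPC range (hypothetical typo) is rejected.
print(subnet_fits_vpc("10.0.0.0/16", "10.1.0.0/24"))
```

A check like this can run in CI next to pulumi preview, so a mistyped CIDR fails fast instead of surfacing as a provider error during pulumi up.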
Post-mortem Template
# === Post-mortem Analysis ===
# Post-mortem Template
# ## Incident: Database Outage due to IaC Drift
# **Date:** 2024-03-15
# **Duration:** 45 minutes
# **Severity:** P1 - Critical
# **Author:** Platform Team
#
# ### Timeline
# - 14:00 — Alert: Database connection errors
# - 14:05 — On-call acknowledges, starts investigation
# - 14:10 — Found: Security Group rules changed manually
# - 14:15 — Root cause identified: Manual SG change blocked DB port
# - 14:20 — pulumi refresh to detect full drift
# - 14:25 — pulumi up to restore correct state
# - 14:30 — Verified: Database connections restored
# - 14:45 — All services healthy, incident resolved
#
# ### Root Cause
# Engineer manually modified Security Group via AWS Console
# to add temporary rule, accidentally deleted port 5432 rule
#
# ### Impact
# - 45 minutes downtime for all services using PostgreSQL
# - ~500 failed API requests
# - ~200 affected users
#
# ### Action Items
# 1. Enable AWS Config rule to detect SG changes
# 2. Add a Pulumi policy (CrossGuard) for Security Group rules and restrict console write access
# 3. Schedule drift detection every 15 minutes
# 4. Add database connectivity check to health checks
@dataclass
class PostMortem:
    incident: str
    date: str
    duration: str
    severity: str
    root_cause: str
    action_items: int
    status: str

incidents = [
    PostMortem("DB Outage (IaC Drift)", "2024-03-15", "45 min", "P1",
               "Manual SG change deleted DB port rule", 4, "Resolved"),
    PostMortem("SSL Cert Expired", "2024-02-20", "15 min", "P2",
               "Certificate renewal not in IaC", 3, "Resolved"),
    PostMortem("Wrong Instance Type", "2024-01-10", "2 hours", "P2",
               "Typo in Pulumi config: t3.micro instead of t3.large", 2, "Resolved"),
    PostMortem("State Lock Conflict", "2024-01-05", "30 min", "P3",
               "Two engineers ran pulumi up simultaneously", 3, "Resolved"),
]

print("\n=== Post-mortem Registry ===")
for pm in incidents:
    print(f" [{pm.severity}] {pm.incident} ({pm.date})")
    print(f" Duration: {pm.duration} | Root Cause: {pm.root_cause}")
    print(f" Actions: {pm.action_items} | Status: {pm.status}")
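A registry like this becomes more useful once the durations can be added up. The following is a minimal sketch, assuming durations are always written as "N min" or "N hours" as in the four incidents above; `duration_to_minutes` and `downtime_by_severity` are hypothetical helper names.

```python
from collections import defaultdict


def duration_to_minutes(text: str) -> int:
    """Convert strings such as '45 min' or '2 hours' to whole minutes."""
    value, unit = text.split()
    if unit.startswith("min"):
        return int(value)
    if unit.startswith("hour"):
        return int(value) * 60
    raise ValueError(f"unsupported duration: {text!r}")


def downtime_by_severity(records):
    """Sum downtime in minutes per severity from (severity, duration) pairs."""
    totals = defaultdict(int)
    for severity, duration in records:
        totals[severity] += duration_to_minutes(duration)
    return dict(totals)


# (severity, duration) pairs copied from the registry above.
records = [("P1", "45 min"), ("P2", "15 min"), ("P2", "2 hours"), ("P3", "30 min")]
print(downtime_by_severity(records))  # {'P1': 45, 'P2': 135, 'P3': 30}
```

Storing the duration as a number of minutes in the dataclass would remove the parsing step; the string form is kept here only to match the registry as written.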
Drift Detection and Prevention
# === Drift Detection & Prevention ===
# Automated Drift Detection
# GitHub Actions: run every 15 min
# name: Drift Detection
# on:
#   schedule:
#     - cron: '*/15 * * * *'
# jobs:
#   detect-drift:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - uses: pulumi/actions@v5
#         with:
#           command: refresh
#           stack-name: production
#           expect-no-changes: true
#         env:
#           PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}  # secret name is an assumption
# Pulumi Policy (CrossGuard)
# from pulumi_policy import (
#     EnforcementLevel, PolicyPack, ResourceValidationPolicy
# )
#
# def no_public_s3(args, report_violation):
#     # Policies see the full type token, not the short form printed by the CLI
#     if args.resource_type == "aws:s3/bucket:Bucket":
#         acl = args.props.get("acl")
#         if acl == "public-read" or acl == "public-read-write":
#             report_violation("S3 buckets must not be public")
#
# PolicyPack("security-policies", policies=[
#     ResourceValidationPolicy(
#         name="no-public-s3",
#         description="Prevent public S3 buckets",
#         validate=no_public_s3,
#         enforcement_level=EnforcementLevel.MANDATORY,
#     ),
# ])
prevention = {
    "Drift Detection": "Run pulumi refresh every 15 minutes and alert when drift is found",
    "Policy as Code": "CrossGuard blocks misconfiguration",
    "State Locking": "Lock state to prevent concurrent updates",
    "Code Review": "PR review for every infrastructure change",
    "Testing": "Unit tests and integration tests before deploy",
    "Audit Log": "Record every change: who did what, and when",
    "Rollback Plan": "Have a rollback plan for every deploy",
    "No Manual Changes": "No edits through the console; change through code only",
}

print("Prevention Strategies:")
for strategy, desc in prevention.items():
    print(f" [{strategy}]: {desc}")
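To make the idea of drift concrete, the sketch below compares a desired set of security group rules with the rules actually found, in the spirit of the port 5432 incident from the post-mortem. It is a plain-Python illustration, not how Pulumi computes its diff; the rule values and the `rule_key` and `diff_rules` helpers are hypothetical.

```python
def rule_key(rule: dict) -> tuple:
    """Reduce a rule dict to a hashable, order-independent key."""
    return (rule["protocol"], rule["from_port"], rule["to_port"],
            tuple(sorted(rule["cidr_blocks"])))


def diff_rules(desired, actual):
    """Return rules missing from the cloud and rules that were never declared."""
    desired_keys = {rule_key(r) for r in desired}
    actual_keys = {rule_key(r) for r in actual}
    return {
        "missing": sorted(desired_keys - actual_keys),
        "unexpected": sorted(actual_keys - desired_keys),
    }


# Hypothetical data: the code declares HTTPS and PostgreSQL access,
# while a manual console edit added SSH and dropped the 5432 rule.
desired = [
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
    {"protocol": "tcp", "from_port": 5432, "to_port": 5432, "cidr_blocks": ["10.0.0.0/16"]},
]
actual = [
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
    {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
]

drift = diff_rules(desired, actual)
print("missing:", drift["missing"])
print("unexpected:", drift["unexpected"])
```

Both lists matter for an alert: a missing rule explains the outage, and an unexpected rule points at the manual change that caused it.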
Tips
- Preview: run pulumi preview every time before pulumi up
- Drift: check for drift automatically every 15 minutes
- Blameless: post-mortems must be blameless; focus on the system, not on blaming people
- Policy: use CrossGuard to prevent misconfiguration
- Test: write unit tests for infrastructure code
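The Policy and Test tips combine well: if the rule logic lives in a plain function, it can be unit tested on its own and then called from a policy pack. The sketch below assumes a hypothetical list of sensitive ports and rule dicts shaped like the ingress entries in the security group listing; `open_sensitive_ports` is an illustrative name, not a Pulumi API.

```python
# Assumption: ports that should never be reachable from the whole internet.
SENSITIVE_PORTS = {22, 5432}


def open_sensitive_ports(ingress_rules):
    """Return the sensitive ports that any rule exposes to 0.0.0.0/0."""
    exposed = set()
    for rule in ingress_rules:
        if "0.0.0.0/0" not in rule.get("cidr_blocks", []):
            continue
        if rule["protocol"] == "-1":
            # Protocol "-1" means all traffic, so every port is exposed.
            exposed.update(SENSITIVE_PORTS)
            continue
        for port in SENSITIVE_PORTS:
            if rule["from_port"] <= port <= rule["to_port"]:
                exposed.add(port)
    return sorted(exposed)


# The web rules from the listing only open 80 and 443, so nothing is flagged.
web_rules = [
    {"protocol": "tcp", "from_port": 80, "to_port": 80, "cidr_blocks": ["0.0.0.0/0"]},
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
]
print(open_sensitive_ports(web_rules))

# A hypothetical wide-open rule is flagged for both sensitive ports.
wide_open = [{"protocol": "tcp", "from_port": 0, "to_port": 65535, "cidr_blocks": ["0.0.0.0/0"]}]
print(open_sensitive_ports(wide_open))
```

Inside a CrossGuard validation function, a non-empty result would be passed to report_violation; keeping the check separate means the same logic runs in ordinary unit tests.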
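What is Pulumi?
Pulumi is an IaC platform that uses general-purpose programming languages such as Python, TypeScript, and Go. It targets AWS, Azure, GCP, and Kubernetes, and provides state management, change previews, policy enforcement, and unit testing.
What is post-mortem analysis?
A post-mortem is the analysis carried out after an incident. It is blameless and records the timeline, root cause, impact, action items, and follow-up. It is part of SRE and DevOps culture, and its purpose is to prevent a recurrence.
How do Pulumi and Terraform differ?
Pulumi uses general-purpose programming languages, which brings IDE autocomplete, type safety, and unit testing. Terraform uses HCL, a DSL that is easier to pick up, and it has a larger community and more providers.
What is drift detection?
Drift detection checks whether the real infrastructure still matches the code. Manual changes cause drift. Run pulumi refresh to see the diff, run pulumi up to correct it, and use policy to forbid manual changes.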
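One way to script such a check is to wrap the CLI and look at its exit code. The sketch below is an assumption-laden illustration: it uses pulumi preview with the --refresh and --expect-no-changes flags rather than pulumi refresh, so the stored state is left untouched, and the helper names and the injectable `runner` parameter are hypothetical. A non-zero exit code also occurs on login or credential errors, and un-deployed code changes count as a difference, so a real job should inspect the output before alerting.

```python
import subprocess


def build_drift_command(stack: str) -> list:
    """Build a preview command that fails when the stack would change."""
    return [
        "pulumi", "preview",
        "--stack", stack,
        "--refresh",             # read the real cloud state first
        "--expect-no-changes",   # non-zero exit code if anything differs
        "--non-interactive",
    ]


def has_drift(stack: str, runner=subprocess.run) -> bool:
    """Return True when the preview reports differences (or fails to run)."""
    result = runner(build_drift_command(stack), capture_output=True, text=True)
    return result.returncode != 0


# Usage (requires the pulumi CLI and cloud credentials on the machine):
#     if has_drift("production"):
#         print("Drift or pending changes detected in production")
```

The `runner` parameter exists only so the function can be tested with a stub instead of calling the real CLI.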
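Summary
Pulumi lets teams write IaC in Python or TypeScript with state management, previews, testing, and CI/CD across AWS, Azure, and GCP. Blameless post-mortem analysis, with a clear timeline and root cause, shows where the process failed. Drift detection and CrossGuard policies then help prevent the same incident from happening again.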
