oVirt Virtualization Site Reliability Engineering (SRE)

🤖 AI by อ.บอม กิตติทัศน์ เจริญพนาสิทธิ์ · Published 2026-05-28

oVirt SRE Practices

This article covers SRE practices for running oVirt, a KVM-based virtualization platform, in production: SLIs, SLOs, and error budgets; automation with Ansible and Terraform; and monitoring with Prometheus and Grafana.

SLI	SLO Target	Measurement	Alert Threshold
VM Availability	> 99.9%	VM Up Time / Total Time	Any unexpected VM Down
VM Boot Time	< 60 seconds	API call → VM Running	> 120 seconds
Live Migration	< 30 seconds	Migration Start → Complete	> 60 seconds
Storage Latency P99	< 5ms	Disk I/O Latency	> 10ms
Engine API Response	< 2 seconds	API Call Duration	> 5 seconds
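
The VM Availability SLO in the table implies an error budget, that is, the amount of downtime the SLO still permits. A minimal sketch of that arithmetic follows, assuming a 30-day measurement window. The function names and the window length are illustrative assumptions, not something the article specifies.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows per window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(99.9)
    print(f"99.9% over 30 days allows {budget:.1f} minutes of downtime")
    print(f"After 10 minutes down: {budget_remaining(99.9, 10):.0%} of budget left")
```

For a 99.9% target the budget is roughly 43 minutes per 30 days, so a single slow host failover can consume a large share of it.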

Monitoring Setup

# === oVirt Monitoring with Prometheus ===

# Prometheus config (prometheus.yml)
# scrape_configs:
#   - job_name: 'ovirt-hosts'
#     static_configs:
#       - targets: ['host1:9100', 'host2:9100', 'host3:9100']
#   - job_name: 'ovirt-engine'
#     static_configs:
#       - targets: ['engine:9100']
#   - job_name: 'ovirt-exporter'
#     static_configs:
#       - targets: ['ovirt-exporter:9325']

# Alert Rules (alerts.yml)
# groups:
#   - name: ovirt
#     rules:
#       - alert: HostCPUHigh
#         expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
#         for: 5m
#         labels: { severity: warning }
#       - alert: HostRAMCritical
#         expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.95
#         for: 2m
#         labels: { severity: critical }
#       - alert: VMUnexpectedDown
#         expr: ovirt_vm_status{status!="up"} == 1
#         for: 1m
#         labels: { severity: critical }

from dataclasses import dataclass

@dataclass
class MonitorLayer:
    layer: str
    metrics: str
    tool: str
    alert_example: str

layers = [
    MonitorLayer("Host (Physical)",
        "CPU RAM Disk I/O Network Temperature",
        "Prometheus + Node Exporter",
        "CPU > 90% 5min → Warning RAM > 95% → Critical"),
    MonitorLayer("VM (Virtual Machine)",
        "vCPU vRAM vDisk IOPS Network VM State",
        "oVirt Exporter + Prometheus",
        "VM Down Unexpected → Critical IOPS > Threshold"),
    MonitorLayer("oVirt Engine",
        "API Response Time DB Pool Active Tasks",
        "Prometheus + Custom Exporter",
        "API > 5s → Critical DB Connection > 90%"),
    MonitorLayer("Storage",
        "IOPS Latency Throughput Capacity Used%",
        "Node Exporter + Storage Exporter",
        "Latency P99 > 10ms → Warning Capacity > 85%"),
    MonitorLayer("Network",
        "Bandwidth Packet Loss Latency Errors",
        "Node Exporter + SNMP Exporter",
        "Packet Loss > 0.1% → Warning Bond Down → Critical"),
]

print("=== Monitoring Layers ===")
for l in layers:
    print(f"  [{l.layer}] Metrics: {l.metrics}")
    print(f"    Tool: {l.tool}")
    print(f"    Alert: {l.alert_example}")
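
The two host alert expressions in the rules above reduce to simple arithmetic on Node Exporter values. The sketch below reproduces that arithmetic in plain Python to make the thresholds concrete. The function names and sample values are hypothetical, and in a real setup Prometheus evaluates the expressions itself, including the `for:` delay that this sketch ignores.

```python
from typing import List, Sequence


def ram_used_fraction(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Mirror of: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes."""
    return 1.0 - mem_available_bytes / mem_total_bytes


def cpu_busy_percent(idle_rates: Sequence[float]) -> float:
    """Mirror of: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100.

    idle_rates holds one idle rate per core, each between 0.0 and 1.0.
    """
    return 100.0 - (sum(idle_rates) / len(idle_rates)) * 100.0


def host_alerts(mem_available_bytes: float, mem_total_bytes: float,
                idle_rates: Sequence[float]) -> List[str]:
    """Return the names of the alert conditions that are currently true."""
    firing = []
    if cpu_busy_percent(idle_rates) > 90:
        firing.append("HostCPUHigh")
    if ram_used_fraction(mem_available_bytes, mem_total_bytes) > 0.95:
        firing.append("HostRAMCritical")
    return firing


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical host: 128 GiB total, 4 GiB available, four mostly busy cores.
    print(host_alerts(4 * gib, 128 * gib, [0.05, 0.08, 0.04, 0.07]))
```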

Automation with Ansible

# === Ansible Automation for oVirt ===

# Install: ansible-galaxy collection install ovirt.ovirt

# Playbook: Create VM from Template
# - hosts: localhost
#   connection: local
#   collections:
#     - ovirt.ovirt
#   tasks:
#     - ovirt_auth:
#         url: https://engine.example.com/ovirt-engine/api
#         username: admin@internal
#         password: "{{ vault_ovirt_password }}"
#     - ovirt_vm:
#         auth: "{{ ovirt_auth }}"
#         name: web-server-01
#         template: centos9-template
#         cluster: production
#         memory: 4GiB
#         cpu_cores: 2
#         state: running
#         nics:
#           - name: nic1
#             profile_name: production-net  # ovirt_vm nics take a vNIC profile name
#     - ovirt_auth:
#         state: absent
#         ovirt_auth: "{{ ovirt_auth }}"

from dataclasses import dataclass

@dataclass
class AutomationTask:
    task: str
    tool: str
    trigger: str
    playbook: str

tasks = [
    AutomationTask("VM Provisioning",
        "Ansible ovirt.ovirt",
        "Jira Ticket / API Request",
        "create_vm.yml: Create VM from Template → Configure Network, IP, DNS"),
    AutomationTask("Host Patching",
        "Ansible ovirt.ovirt + yum",
        "Monthly Patch Window",
        "patch_host.yml: Maintenance → Migrate VMs → Patch → Reboot → Activate"),
    AutomationTask("VM Backup",
        "Ansible + oVirt API",
        "Daily Cron 02:00",
        "backup_vm.yml: Snapshot → Export → Upload S3 → Delete Old"),
    AutomationTask("Capacity Report",
        "Python + oVirt SDK",
        "Weekly Monday 09:00",
        "capacity_report.py: CPU RAM Storage Usage Trend → Email Report"),
    AutomationTask("Disaster Recovery",
        "Ansible + oVirt API",
        "DR Drill Quarterly / Actual DR",
        "dr_failover.yml: Import VM → Start → Verify → Update DNS"),
]

print("=== Automation Tasks ===")
for t in tasks:
    print(f"  [{t.task}] Tool: {t.tool}")
    print(f"    Trigger: {t.trigger}")
    print(f"    Playbook: {t.playbook}")
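
The last step of the VM Backup task, "Delete Old", needs a retention rule. Here is a minimal sketch of one, assuming a simple age-based policy. The 14-day retention period, the `Backup` type, and the function name are hypothetical choices for illustration and do not come from the article or from the oVirt API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Backup:
    vm_name: str
    created_at: datetime


def backups_to_delete(backups, now, retention_days=14):
    """Return the backups older than the retention period, oldest first."""
    cutoff = now - timedelta(days=retention_days)
    expired = [b for b in backups if b.created_at < cutoff]
    return sorted(expired, key=lambda b: b.created_at)


if __name__ == "__main__":
    now = datetime(2026, 5, 28, 2, 0)
    backups = [
        Backup("web-server-01", now - timedelta(days=1)),
        Backup("web-server-01", now - timedelta(days=15)),
        Backup("web-server-01", now - timedelta(days=30)),
    ]
    for b in backups_to_delete(backups, now):
        print(f"delete {b.vm_name} backup from {b.created_at:%Y-%m-%d}")
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the rule easy to test against fixed dates.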

Capacity Planning

# === Capacity Planning ===

from dataclasses import dataclass

@dataclass
class CapacityMetric:
    resource: str
    current_usage: str
    threshold: str
    forecast: str
    action: str

capacity = [
    CapacityMetric("CPU (Total Cluster)",
        "65% average 85% peak",
        "Warning 70% avg Critical 90% peak",
        "+5% per month → Full in 7 months",
        "Add 2 hosts in Q3"),
    CapacityMetric("RAM (Total Cluster)",
        "72% allocated 55% actual",
        "Warning 80% allocated Critical 90%",
        "+8% per month → Full in 4 months",
        "Upgrade RAM on each host from 64GB to 128GB"),
    CapacityMetric("Storage (NFS)",
        "4.2TB / 6TB (70%)",
        "Warning 80% Critical 90%",
        "+200GB per month → Full in 9 months",
        "Add a 4TB storage volume in Q3"),
    CapacityMetric("Network (10Gbps Bond)",
        "3.5Gbps peak 2.1Gbps avg",
        "Warning 70% peak Critical 90%",
        "+10% per month → Full in 12 months",
        "Consider a 25Gbps upgrade in Q4"),
]

print("=== Capacity Planning ===")
for c in capacity:
    print(f"  [{c.resource}] Current: {c.current_usage}")
    print(f"    Threshold: {c.threshold}")
    print(f"    Forecast: {c.forecast}")
    print(f"    Action: {c.action}")
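
The "Full in N months" forecasts for CPU and storage above follow from a linear extrapolation of current usage. A minimal sketch of that calculation, assuming growth is a constant amount per month. The function names are hypothetical, and the 80% default mirrors the storage warning threshold in the list.

```python
import math


def months_until_full(used: float, capacity: float, growth_per_month: float) -> float:
    """Months until `used` reaches `capacity` at a constant monthly growth.

    All three arguments must share one unit (GB, percentage points, ...).
    """
    if growth_per_month <= 0:
        return math.inf  # usage is flat or shrinking, so it never fills
    return max(capacity - used, 0.0) / growth_per_month


def months_until_threshold(used: float, capacity: float,
                           growth_per_month: float, threshold: float = 0.8) -> float:
    """Months until usage crosses a warning threshold (a fraction of capacity)."""
    return months_until_full(used, capacity * threshold, growth_per_month)


if __name__ == "__main__":
    # Figures from the capacity list: NFS storage (in GB) and cluster CPU (in %).
    print(f"Storage full in {months_until_full(4200, 6000, 200):.1f} months")
    print(f"Storage hits 80% in {months_until_threshold(4200, 6000, 200):.1f} months")
    print(f"CPU full in {months_until_full(65, 100, 5):.1f} months")
```

With the storage figures from the list, the 80% warning threshold arrives after about 3 months, well before the 9-month full date, so the threshold date is the one to plan purchases against.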

Tips

SLO: Set clear SLOs, measure them every week, and use the error budget to manage change.
Ansible: Automate every manual task to reduce toil.
Template: Use standard VM templates so that VMs are created quickly and consistently.
HA: Enable HA for critical VMs so they restart automatically when a host goes down.
Capacity: Analyze trends every month and plan expansion ahead of time.

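What is oVirt?

oVirt is a free, open-source virtualization platform built on KVM and the upstream project of Red Hat Virtualization (RHV). It consists of an Engine and hosts, and it provides a web UI, a REST API, live migration, HA, snapshots, templates, and quotas, which makes it suitable for building a private cloud.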

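What are the SRE practices for virtualization?

Define SLIs and SLOs and manage an error budget. The targets used here are VM availability of 99.9%, boot time under 60 seconds, live migration under 30 seconds, and storage latency under 5 ms. Reduce toil through automation and runbooks.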

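How do you set up monitoring?

Use Prometheus with Node Exporter and an oVirt exporter, and build dashboards in Grafana. Zabbix and ELK can also be used. Monitor hosts, VMs, the Engine, storage, and the network, and alert on CPU, RAM, disk IOPS, and latency.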

How do you automate?

Use the Ansible ovirt.ovirt collection, the Terraform oVirt provider, and the Python SDK to automate VM provisioning, patching, backup, capacity reports, and DR failover. Keep runbooks and infrastructure as code (IaC) in Git.

Summary

Running oVirt (KVM virtualization) in production with SRE practices means defining SLIs, SLOs, and error budgets, automating with Ansible and Terraform, monitoring with Prometheus and Grafana, and planning capacity ahead of demand.