oVirt SRE Practices

SRE practices for running oVirt (KVM virtualization) in production: SLIs, SLOs, and error budgets; automation with Ansible and Terraform; monitoring with Prometheus and Grafana.

SLI                 | SLO Target   | Measurement                | Alert Threshold
VM Availability     | > 99.9%      | VM Up Time / Total Time    | Any unexpected VM Down
VM Boot Time        | < 60 seconds | API call → VM Running      | > 120 seconds
Live Migration      | < 30 seconds | Migration Start → Complete | > 60 seconds
Storage Latency P99 | < 5 ms       | Disk I/O Latency           | > 10 ms
Engine API Response | < 2 seconds  | API Call Duration          | > 5 seconds
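
To make the availability target concrete, the sketch below converts an SLO into an error budget, that is, the downtime the target still permits. The 30-day window and the function names are assumptions for illustration; the article does not specify a measurement window.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) allowed by an availability SLO over the window.

    slo_target is a fraction, e.g. 0.999 for 99.9%.
    The 30-day default window is an assumption, not from the article.
    """
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining_fraction(slo_target: float,
                              downtime_minutes: float,
                              window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"99.9% over 30 days allows {budget:.1f} minutes of downtime")
    left = budget_remaining_fraction(0.999, downtime_minutes=10)
    print(f"After 10 minutes of downtime, {left:.0%} of the budget remains")
```

One common way to use such a number is to slow down risky changes once the remaining fraction gets low, and to freeze them when it goes negative.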

Monitoring Setup

# === oVirt Monitoring with Prometheus ===

# Prometheus config (prometheus.yml)
# scrape_configs:
#   - job_name: 'ovirt-hosts'
#     static_configs:
#       - targets: ['host1:9100', 'host2:9100', 'host3:9100']
#   - job_name: 'ovirt-engine'
#     static_configs:
#       - targets: ['engine:9100']
#   - job_name: 'ovirt-exporter'
#     static_configs:
#       - targets: ['ovirt-exporter:9325']

# Alert Rules (alerts.yml)
# groups:
#   - name: ovirt
#     rules:
#       - alert: HostCPUHigh
#         expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
#         for: 5m
#         labels: { severity: warning }
#       - alert: HostRAMCritical
#         expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.95
#         for: 2m
#         labels: { severity: critical }
#       - alert: VMUnexpectedDown
#         expr: ovirt_vm_status{status!="up"} == 1
#         for: 1m
#         labels: { severity: critical }

from dataclasses import dataclass

@dataclass
class MonitorLayer:
    layer: str
    metrics: str
    tool: str
    alert_example: str

layers = [
    MonitorLayer("Host (Physical)",
        "CPU RAM Disk I/O Network Temperature",
        "Prometheus + Node Exporter",
        "CPU > 90% 5min → Warning RAM > 95% → Critical"),
    MonitorLayer("VM (Virtual Machine)",
        "vCPU vRAM vDisk IOPS Network VM State",
        "oVirt Exporter + Prometheus",
        "VM Down Unexpected → Critical IOPS > Threshold"),
    MonitorLayer("oVirt Engine",
        "API Response Time DB Pool Active Tasks",
        "Prometheus + Custom Exporter",
        "API > 5s → Critical DB Connection > 90%"),
    MonitorLayer("Storage",
        "IOPS Latency Throughput Capacity Used%",
        "Node Exporter + Storage Exporter",
        "Latency P99 > 10ms → Warning Capacity > 85%"),
    MonitorLayer("Network",
        "Bandwidth Packet Loss Latency Errors",
        "Node Exporter + SNMP Exporter",
        "Packet Loss > 0.1% → Warning Bond Down → Critical"),
]

print("=== Monitoring Layers ===")
for l in layers:
    print(f"  [{l.layer}] Metrics: {l.metrics}")
    print(f"    Tool: {l.tool}")
    print(f"    Alert: {l.alert_example}")
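
The SLO table separates a target (for example Storage Latency P99 < 5 ms) from an alert threshold (> 10 ms). The following is a minimal sketch of how a batch of latency samples could be reduced to a P99 value and checked against both. The nearest-rank percentile method, the function names, and the sample data are assumptions for illustration; in practice Prometheus would normally compute the quantile.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-based; multiply before dividing to avoid float rounding drift.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def classify(value, slo_target, alert_threshold):
    """Classify a 'lower is better' measurement.

    'ok'       : value is under the SLO target
    'slo_miss' : target missed, but still under the alert threshold
    'alert'    : value exceeds the alert threshold
    """
    if value > alert_threshold:
        return "alert"
    if value >= slo_target:
        return "slo_miss"
    return "ok"


if __name__ == "__main__":
    # Hypothetical disk latency samples in milliseconds.
    latencies_ms = [1.2] * 98 + [6.0, 12.0]
    p99 = percentile_nearest_rank(latencies_ms, 99)
    print(f"P99 = {p99} ms -> {classify(p99, slo_target=5, alert_threshold=10)}")
```

Keeping the SLO target and the alert threshold as separate inputs mirrors the table: a value can consume error budget without being urgent enough to page anyone.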

Automation with Ansible

# === Ansible Automation for oVirt ===

# Install: ansible-galaxy collection install ovirt.ovirt

# Playbook: Create VM from Template
# - hosts: localhost
#   connection: local
#   collections:
#     - ovirt.ovirt
#   tasks:
#     - ovirt_auth:
#         url: https://engine.example.com/ovirt-engine/api
#         username: admin@internal
#         password: "{{ vault_ovirt_password }}"
#     - ovirt_vm:
#         auth: "{{ ovirt_auth }}"
#         name: web-server-01
#         template: centos9-template
#         cluster: production
#         memory: 4GiB
#         cpu_cores: 2
#         state: running
#         nics:
#           - name: nic1
#             network: production-net
#     - ovirt_auth:
#         state: absent
#         ovirt_auth: "{{ ovirt_auth }}"

from dataclasses import dataclass

@dataclass
class AutomationTask:
    task: str
    tool: str
    trigger: str
    playbook: str

tasks = [
    AutomationTask("VM Provisioning",
        "Ansible ovirt.ovirt",
        "Jira Ticket / API Request",
        "create_vm.yml: Create VM from Template, set Network, IP, DNS"),
    AutomationTask("Host Patching",
        "Ansible ovirt.ovirt + yum",
        "Monthly Patch Window",
        "patch_host.yml: Maintenance → Migrate VMs → Patch → Reboot → Activate"),
    AutomationTask("VM Backup",
        "Ansible + oVirt API",
        "Daily Cron 02:00",
        "backup_vm.yml: Snapshot → Export → Upload S3 → Delete Old"),
    AutomationTask("Capacity Report",
        "Python + oVirt SDK",
        "Weekly Monday 09:00",
        "capacity_report.py: CPU RAM Storage Usage Trend → Email Report"),
    AutomationTask("Disaster Recovery",
        "Ansible + oVirt API",
        "DR Drill Quarterly / Actual DR",
        "dr_failover.yml: Import VM → Start → Verify → Update DNS"),
]

print("=== Automation Tasks ===")
for t in tasks:
    print(f"  [{t.task}] Tool: {t.tool}")
    print(f"    Trigger: {t.trigger}")
    print(f"    Playbook: {t.playbook}")
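
The "Delete Old" step of the VM Backup task needs a retention rule. A minimal sketch of one possible rule follows, assuming backups are tracked as (name, creation time) pairs and kept for 7 days. The 7-day retention, the tuple layout, and the function name are hypothetical; the article does not state them.

```python
from datetime import datetime, timedelta


def select_expired(backups, now, retention_days=7):
    """Return names of backups created before the retention cutoff.

    backups        : iterable of (name, created_at) tuples (assumed layout)
    now            : current time, passed in so the function is testable
    retention_days : assumed retention period, not taken from the article
    """
    cutoff = now - timedelta(days=retention_days)
    return [name for name, created_at in backups if created_at < cutoff]


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 2, 0)  # e.g. the daily 02:00 run
    backups = [
        ("web-server-01-2024-01-01", datetime(2024, 1, 1, 2, 0)),
        ("web-server-01-2024-01-03", datetime(2024, 1, 3, 2, 0)),
        ("web-server-01-2024-01-09", datetime(2024, 1, 9, 2, 0)),
    ]
    for name in select_expired(backups, now):
        print(f"would delete: {name}")
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function keeps the rule deterministic, which makes it easier to verify before wiring it to an actual delete.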

Capacity Planning

# === Capacity Planning ===

from dataclasses import dataclass

@dataclass
class CapacityMetric:
    resource: str
    current_usage: str
    threshold: str
    forecast: str
    action: str

capacity = [
    CapacityMetric("CPU (Total Cluster)",
        "65% average 85% peak",
        "Warning 70% avg Critical 90% peak",
        "Grows 5% per month → Full in 7 months",
        "Add 2 hosts in Q3"),
    CapacityMetric("RAM (Total Cluster)",
        "72% allocated 55% actual",
        "Warning 80% allocated Critical 90%",
        "Grows 8% per month → Full in 4 months",
        "Upgrade RAM per host 64GB → 128GB"),
    CapacityMetric("Storage (NFS)",
        "4.2TB / 6TB (70%)",
        "Warning 80% Critical 90%",
        "Grows 200GB per month → Full in 9 months",
        "Add 4TB storage volume in Q3"),
    CapacityMetric("Network (10Gbps Bond)",
        "3.5Gbps peak 2.1Gbps avg",
        "Warning 70% peak Critical 90%",
        "Grows 10% per month → Full in 12 months",
        "Consider 25Gbps upgrade in Q4"),
]

print("=== Capacity Planning ===")
for c in capacity:
    print(f"  [{c.resource}] Current: {c.current_usage}")
    print(f"    Threshold: {c.threshold}")
    print(f"    Forecast: {c.forecast}")
    print(f"    Action: {c.action}")
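
The "Full in N months" forecasts above are consistent with a simple linear projection: the remaining headroom divided by the monthly growth. The sketch below reproduces the CPU, RAM, and storage figures under that assumption. Linear growth and the function name are assumptions; real usage may grow unevenly or compound.

```python
import math


def months_until(current: float, limit: float, growth_per_month: float) -> float:
    """Months until `current` reaches `limit` at a constant monthly growth.

    All three arguments share one unit (percentage points, GB, ...).
    Returns 0 if the limit is already reached, infinity if there is no growth.
    """
    if current >= limit:
        return 0.0
    if growth_per_month <= 0:
        return math.inf
    return (limit - current) / growth_per_month


if __name__ == "__main__":
    # Figures taken from the capacity table above.
    print("CPU full in", months_until(65, 100, 5), "months")
    print("RAM full in", months_until(72, 100, 8), "months (table rounds up to 4)")
    print("Storage full in", months_until(4200, 6000, 200), "months")
    # The same function also shows when a warning threshold will be crossed:
    # 80% of 6 TB is 4800 GB.
    print("Storage warning (80%) in", months_until(4200, 4800, 200), "months")
```

Projecting to the warning threshold rather than to 100% gives the more useful planning date, since procurement has to finish before the cluster is actually full.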

Tips

  • SLO: Set clear SLOs, measure them weekly, and use the error budget to manage change.
  • Ansible: Automate every manual task to reduce toil.
  • Template: Use standard VM templates so VMs are created quickly and consistently.
  • HA: Enable HA for critical VMs so they restart automatically when a host fails.
  • Capacity: Analyze trends every month and plan expansion ahead of time.

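What is oVirt?

oVirt is a free, open-source KVM virtualization platform and the upstream project of Red Hat Virtualization (RHV). It consists of the Engine and the hosts it manages, offers a web UI and a REST API, and supports live migration, HA, snapshots, templates, and quotas, which makes it suitable for building a private cloud.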