oVirt SRE Practices
This article covers SRE practices for running oVirt (KVM virtualization) in production: SLIs, SLOs, and error budgets; automation with Ansible and Terraform; and monitoring with Prometheus and Grafana.
| SLI | SLO Target | Measurement | Alert Threshold |
|---|---|---|---|
| VM Availability | > 99.9% | VM Up Time / Total Time | Any unexpected VM Down |
| VM Boot Time | < 60 seconds | API call → VM Running | > 120 seconds |
| Live Migration | < 30 seconds | Migration Start → Complete | > 60 seconds |
| Storage Latency P99 | < 5ms | Disk I/O Latency | > 10ms |
| Engine API Response | < 2 seconds | API Call Duration | > 5 seconds |
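Each latency-style row in the table has two limits: the SLO target and a looser alert threshold. The sketch below shows one way to classify a measurement against both. The class name `LatencySlo`, the dictionary keys, and the three state names are illustrative assumptions; only the numbers come from the table. VM Availability is left out because it is a ratio rather than an upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySlo:
    """An upper-bound SLO with a looser alert threshold (hypothetical helper)."""
    name: str
    slo_limit: float    # SLO target: measurements should stay below this
    alert_limit: float  # alert threshold: alert when a measurement exceeds this
    unit: str

    def classify(self, measured: float) -> str:
        """Return 'ok', 'slo_miss' (over target, under alert), or 'alert'."""
        if measured > self.alert_limit:
            return "alert"
        if measured >= self.slo_limit:
            return "slo_miss"
        return "ok"


# Limits copied from the SLI/SLO table above.
SLOS = {
    "vm_boot_time": LatencySlo("VM Boot Time", 60, 120, "s"),
    "live_migration": LatencySlo("Live Migration", 30, 60, "s"),
    "storage_latency_p99": LatencySlo("Storage Latency P99", 5, 10, "ms"),
    "engine_api_response": LatencySlo("Engine API Response", 2, 5, "s"),
}

# Example measurements (made-up values, for illustration only).
for key, measured in [("vm_boot_time", 45), ("vm_boot_time", 75), ("engine_api_response", 6)]:
    slo = SLOS[key]
    print(f"{slo.name}: {measured}{slo.unit} -> {slo.classify(measured)}")
```

A measurement in the `slo_miss` band does not alert anyone, but it still counts against the SLO when the results are reviewed.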
Monitoring Setup
# === oVirt Monitoring with Prometheus ===

# Prometheus config (prometheus.yml)
# scrape_configs:
#   - job_name: 'ovirt-hosts'
#     static_configs:
#       - targets: ['host1:9100', 'host2:9100', 'host3:9100']
#   - job_name: 'ovirt-engine'
#     static_configs:
#       - targets: ['engine:9100']
#   - job_name: 'ovirt-exporter'
#     static_configs:
#       - targets: ['ovirt-exporter:9325']

# Alert Rules (alerts.yml)
# groups:
#   - name: ovirt
#     rules:
#       - alert: HostCPUHigh
#         expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
#         for: 5m
#         labels: { severity: warning }
#       - alert: HostRAMCritical
#         expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.95
#         for: 2m
#         labels: { severity: critical }
#       - alert: VMUnexpectedDown
#         expr: ovirt_vm_status{status!="up"} == 1
#         for: 1m
#         labels: { severity: critical }
from dataclasses import dataclass


@dataclass
class MonitorLayer:
    layer: str
    metrics: str
    tool: str
    alert_example: str


# One entry per monitoring layer, from physical hosts up to the network.
layers = [
    MonitorLayer("Host (Physical)",
                 "CPU RAM Disk I/O Network Temperature",
                 "Prometheus + Node Exporter",
                 "CPU > 90% 5min → Warning RAM > 95% → Critical"),
    MonitorLayer("VM (Virtual Machine)",
                 "vCPU vRAM vDisk IOPS Network VM State",
                 "oVirt Exporter + Prometheus",
                 "VM Down Unexpected → Critical IOPS > Threshold"),
    MonitorLayer("oVirt Engine",
                 "API Response Time DB Pool Active Tasks",
                 "Prometheus + Custom Exporter",
                 "API > 5s → Critical DB Connection > 90%"),
    MonitorLayer("Storage",
                 "IOPS Latency Throughput Capacity Used%",
                 "Node Exporter + Storage Exporter",
                 "Latency P99 > 10ms → Warning Capacity > 85%"),
    MonitorLayer("Network",
                 "Bandwidth Packet Loss Latency Errors",
                 "Node Exporter + SNMP Exporter",
                 "Packet Loss > 0.1% → Warning Bond Down → Critical"),
]

print("=== Monitoring Layers ===")
for entry in layers:
    print(f" [{entry.layer}] Metrics: {entry.metrics}")
    print(f" Tool: {entry.tool}")
    print(f" Alert: {entry.alert_example}")
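To make the HostCPUHigh rule easier to reason about, here is a minimal sketch of the arithmetic behind its expression and of the effect of `for:`. It is a simplified model, not Prometheus's actual alert state machine: the function names are hypothetical, and treating `for:` as "the last N evaluations were all above the threshold" is an assumption that depends on the evaluation interval.

```python
from typing import Sequence


def cpu_busy_percent(idle_rates: Sequence[float]) -> float:
    """Compute 100 - avg(idle rate) * 100 for one host.

    idle_rates holds the per-core fraction of time spent idle (0.0 to 1.0),
    which is what rate(node_cpu_seconds_total{mode="idle"}[5m]) yields per core.
    """
    if not idle_rates:
        raise ValueError("need at least one per-core idle rate")
    return 100.0 - (sum(idle_rates) / len(idle_rates)) * 100.0


def should_fire(values: Sequence[float], threshold: float,
                required_evaluations: int) -> bool:
    """Simplified `for:` model: fire only if the most recent
    required_evaluations values are all above the threshold."""
    if required_evaluations < 1 or len(values) < required_evaluations:
        return False
    return all(v > threshold for v in values[-required_evaluations:])


# Two cores, one fully busy and one 12.5% idle.
print(cpu_busy_percent([0.0, 0.125]))

# A single dip below 90% inside the window prevents the alert from firing.
print(should_fire([95, 96, 97, 98, 99], threshold=90, required_evaluations=5))
print(should_fire([95, 96, 80, 98, 99], threshold=90, required_evaluations=5))
```

The point of the `for:` clause is visible in the last two calls: a short spike or a brief recovery does not page anyone, only a sustained breach does.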
Automation with Ansible
# === Ansible Automation for oVirt ===
# Install: ansible-galaxy collection install ovirt.ovirt

# Playbook: Create VM from Template
# - hosts: localhost
#   connection: local
#   collections:
#     - ovirt.ovirt
#   tasks:
#     - ovirt_auth:
#         url: https://engine.example.com/ovirt-engine/api
#         username: admin@internal
#         password: "{{ vault_ovirt_password }}"
#     - ovirt_vm:
#         auth: "{{ ovirt_auth }}"
#         name: web-server-01
#         template: centos9-template
#         cluster: production
#         memory: 4GiB
#         cpu_cores: 2
#         state: running
#         nics:
#           - name: nic1
#             profile_name: production-net  # vNIC profile to attach the NIC to
#     - ovirt_auth:
#         state: absent
#         ovirt_auth: "{{ ovirt_auth }}"
from dataclasses import dataclass


@dataclass
class AutomationTask:
    task: str
    tool: str
    trigger: str
    playbook: str


# Recurring operational tasks, the tool that runs each one, and what starts it.
tasks = [
    AutomationTask("VM Provisioning",
                   "Ansible ovirt.ovirt",
                   "Jira Ticket / API Request",
                   "create_vm.yml: Create VM from Template, configure Network, IP, DNS"),
    AutomationTask("Host Patching",
                   "Ansible ovirt.ovirt + yum",
                   "Monthly Patch Window",
                   "patch_host.yml: Maintenance → Migrate VMs → Patch → Reboot → Activate"),
    AutomationTask("VM Backup",
                   "Ansible + oVirt API",
                   "Daily Cron 02:00",
                   "backup_vm.yml: Snapshot → Export → Upload S3 → Delete Old"),
    AutomationTask("Capacity Report",
                   "Python + oVirt SDK",
                   "Weekly Monday 09:00",
                   "capacity_report.py: CPU RAM Storage Usage Trend → Email Report"),
    AutomationTask("Disaster Recovery",
                   "Ansible + oVirt API",
                   "DR Drill Quarterly / Actual DR",
                   "dr_failover.yml: Import VM → Start → Verify → Update DNS"),
]

print("=== Automation Tasks ===")
for t in tasks:
    print(f" [{t.task}] Tool: {t.tool}")
    print(f" Trigger: {t.trigger}")
    print(f" Playbook: {t.playbook}")
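The host patching entry describes an ordered sequence (Maintenance → Migrate VMs → Patch → Reboot → Activate) in which a later step should not run if an earlier one failed. A host that still carries VMs, for example, should not be rebooted. The sketch below illustrates that stop-on-first-failure behavior in plain Python. The `run_workflow` function and the placeholder actions are hypothetical; in a real setup each action would call Ansible or the oVirt API.

```python
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_workflow(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (success, names of the steps that completed).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed


# Placeholder actions that always succeed (assumption, for illustration only).
patch_host_steps: List[Step] = [
    ("Maintenance", lambda: True),
    ("Migrate VMs", lambda: True),
    ("Patch", lambda: True),
    ("Reboot", lambda: True),
    ("Activate", lambda: True),
]

ok, done = run_workflow(patch_host_steps)
print("success:", ok, "completed:", done)
```

Returning the list of completed steps makes it clear where a failed run stopped, which is the information an operator needs to resume or roll back.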
Capacity Planning
# === Capacity Planning ===
from dataclasses import dataclass


@dataclass
class CapacityMetric:
    resource: str
    current_usage: str
    threshold: str
    forecast: str
    action: str


# Current usage, alert thresholds, growth forecast, and planned action per resource.
capacity = [
    CapacityMetric("CPU (Total Cluster)",
                   "65% average 85% peak",
                   "Warning 70% avg Critical 90% peak",
                   "Grows 5% per month → Full in 7 months",
                   "Add 2 hosts in Q3"),
    CapacityMetric("RAM (Total Cluster)",
                   "72% allocated 55% actual",
                   "Warning 80% allocated Critical 90%",
                   "Grows 8% per month → Full in 4 months",
                   "Upgrade RAM on each host from 64GB to 128GB"),
    CapacityMetric("Storage (NFS)",
                   "4.2TB / 6TB (70%)",
                   "Warning 80% Critical 90%",
                   "Grows 200GB per month → Full in 9 months",
                   "Add a 4TB storage volume in Q3"),
    CapacityMetric("Network (10Gbps Bond)",
                   "3.5Gbps peak 2.1Gbps avg",
                   "Warning 70% peak Critical 90%",
                   "Grows 10% per month → Full in 12 months",
                   "Consider a 25Gbps upgrade in Q4"),
]

print("=== Capacity Planning ===")
for c in capacity:
    print(f" [{c.resource}] Current: {c.current_usage}")
    print(f" Threshold: {c.threshold}")
    print(f" Forecast: {c.forecast}")
    print(f" Action: {c.action}")
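The CPU, RAM, and storage forecasts above are consistent with a simple linear projection: remaining headroom divided by monthly growth, rounded up. The sketch below reproduces those three figures under that assumption. The function name `months_until_full` is hypothetical, and the network row is left out because its figures do not reduce to this linear form.

```python
import math


def months_until_full(used: float, capacity: float, growth_per_month: float) -> int:
    """Months until `used` reaches `capacity`, assuming constant linear growth.

    All three arguments must use the same unit (percentage points, GB, ...).
    """
    if growth_per_month <= 0:
        raise ValueError("growth_per_month must be positive")
    remaining = capacity - used
    if remaining <= 0:
        return 0  # already full
    return math.ceil(remaining / growth_per_month)


# Figures taken from the capacity list above.
print("CPU:    ", months_until_full(65, 100, 5), "months")       # percentage points
print("RAM:    ", months_until_full(72, 100, 8), "months")       # percentage points
print("Storage:", months_until_full(4200, 6000, 200), "months")  # GB
```

Linear growth is the simplest model and tends to be optimistic when usage compounds, so it is worth rechecking the projection each month against the actual trend.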
Tips
- SLO: Set clear SLOs, measure them every week, and use the error budget to manage change.
- Ansible: Automate every manual task to reduce toil.
- Template: Use standard VM templates so that VMs are created quickly and consistently.
- HA: Enable HA for critical VMs so that they restart automatically when a host fails.
- Capacity: Analyze usage trends every month and plan expansion in advance.
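The first tip, using the error budget to manage change, can be made concrete with a little arithmetic. The sketch below converts an availability SLO into allowed downtime and gates changes on what is left. The 30-day window, the function names, and the policy of freezing changes once the budget is spent are assumptions for illustration, not something the article prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def change_allowed(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Hypothetical policy: allow risky changes only while budget remains."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0.0


slo = 0.999  # VM Availability target from the SLO table
print(f"Budget over 30 days: {error_budget_minutes(slo):.1f} minutes")
print(f"Remaining after 10 min of downtime: {budget_remaining(slo, 10):.0%}")
print("Change allowed after 50 min of downtime:", change_allowed(slo, 50))
```

A 99.9% target leaves roughly 43 minutes of downtime per 30 days, so a single failed migration or host outage can consume most of the budget. That is why the weekly measurement in the tip matters.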
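What is oVirt?
oVirt is a free, open-source virtualization platform built on KVM and the upstream project of Red Hat Virtualization (RHV). It consists of an Engine and hosts, provides a web UI and a REST API, and supports live migration, HA, snapshots, templates, and quotas, which makes it a fit for a private cloud.
What SRE practices apply to virtualization?
Define SLIs, SLOs, and an error budget, for example VM availability of 99.9%, boot time under 60 seconds, live migration under 30 seconds, and storage latency under 5 ms. Reduce toil through automation and keep runbooks for recurring operations.
How do you set up monitoring?
Use Prometheus with Node Exporter and an oVirt exporter, and visualize the data in Grafana; Zabbix and ELK are other options. Cover the host, VM, Engine, storage, and network layers, and alert on CPU, RAM, disk IOPS, and latency.
How do you automate operations?
Use the Ansible ovirt.ovirt collection, the Terraform oVirt provider, and the Python SDK for VM provisioning, patching, backup, capacity reports, and DR failover. Keep runbooks and infrastructure as code (IaC) in Git.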
Summary
Running oVirt on KVM in production with an SRE approach means defining SLIs, SLOs, and an error budget, automating operations with Ansible and Terraform, monitoring with Prometheus and Grafana, and planning capacity ahead of demand.