Ceph Storage Cluster Capacity Planning: Planning Storage with Ceph

What Is a Ceph Storage Cluster?

Ceph is an open source distributed storage system that provides object storage, block storage, and file storage on a single unified platform. It is designed for high availability, scalability, and self-healing, uses the CRUSH algorithm for data placement, and has no single point of failure.

Ceph's main components are: MON (Monitor), which maintains the cluster map and consensus and needs at least 3 nodes; OSD (Object Storage Daemon), which stores the actual data, with each OSD managing one disk; MDS (Metadata Server), which manages metadata for CephFS; MGR (Manager), which handles monitoring, the dashboard, and orchestration; and RGW (RADOS Gateway), which provides S3/Swift-compatible object storage.

Capacity planning matters a great deal for Ceph because the replication factor directly determines usable capacity (3x replication consumes three times the raw space), placement groups must be sized correctly for good performance, network bandwidth must be sufficient for replication traffic, recovery from a disk failure needs spare capacity, and growth projections should be planned 6-12 months ahead.
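The spare-capacity point can be made concrete with a small calculation. The sketch below is a rule of thumb rather than anything Ceph computes itself: it assumes equally sized failure units (disks or hosts) and that data from a failed unit is spread evenly over the survivors. The function name and parameters are illustrative; 0.85 is Ceph's default nearfull ratio.

```python
def max_safe_fill(unit_count: int, failed_units: int = 1,
                  nearfull_ratio: float = 0.85) -> float:
    """Highest cluster-wide fill ratio that still lets the survivors absorb
    the data of `failed_units` failed units without crossing nearfull_ratio.

    Assumption: all units have the same size and the re-replicated data is
    distributed evenly across the remaining units.
    """
    if failed_units >= unit_count:
        raise ValueError("at least one unit must survive")
    survivors = unit_count - failed_units
    return nearfull_ratio * survivors / unit_count


# Example: how full may the cluster get if one OSD out of N can fail?
for osds in (6, 9, 18, 36):
    print(f"{osds} OSDs: keep utilization below {max_safe_fill(osds):.1%}")
```

Small clusters have to reserve proportionally more headroom, which is one reason to plan capacity before the first disk fails.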

Installing a Ceph Cluster

How to install Ceph with cephadm

# === Install a Ceph cluster with cephadm ===

# Prerequisites:
# - 3+ nodes with Ubuntu 22.04 or Rocky Linux 9
# - Each node: 4+ CPU cores, 8+ GB RAM, 1+ SSD/NVMe for OSD
# - Network: 10GbE recommended (minimum 1GbE)
# - Separate public and cluster networks recommended

# 1. Install cephadm on first node
curl --silent --remote-name --location https://download.ceph.com/rpm-reef/el9/noarch/cephadm
chmod +x cephadm
./cephadm add-repo --release reef
./cephadm install
# Install the ceph CLI on the host (otherwise run ceph commands inside "cephadm shell")
cephadm install ceph-common

# 2. Bootstrap cluster (replace the example password with your own)
cephadm bootstrap \
  --mon-ip 10.0.1.10 \
  --cluster-network 10.0.2.0/24 \
  --dashboard-password-noupdate \
  --initial-dashboard-password MyDashP@ss

# Dashboard: https://10.0.1.10:8443
# Default user: admin

# 3. Add hosts
# Copy the cluster's public SSH key to the other nodes first
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-node2
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-node3

ceph orch host add ceph-node2 10.0.1.11
ceph orch host add ceph-node3 10.0.1.12

# 4. Add OSDs (auto-detect available disks)
ceph orch apply osd --all-available-devices

# Or specific devices
ceph orch daemon add osd ceph-node1:/dev/sdb
ceph orch daemon add osd ceph-node2:/dev/sdb
ceph orch daemon add osd ceph-node3:/dev/sdb

# 5. Set MON count
ceph orch apply mon --placement="3 ceph-node1 ceph-node2 ceph-node3"

# 6. Enable features
ceph mgr module enable dashboard
ceph mgr module enable prometheus
# pg_autoscaler is usually already on in recent releases
ceph mgr module enable pg_autoscaler

# 7. Create pools
ceph osd pool create rbd-pool 128
ceph osd pool set rbd-pool size 3
ceph osd pool set rbd-pool min_size 2
ceph osd pool application enable rbd-pool rbd

# Create CephFS
ceph fs volume create cephfs

# Create RGW (S3)
ceph orch apply rgw myrgw --placement="2 ceph-node1 ceph-node2" --port=7480

# 8. Verify cluster (abridged, illustrative output)
ceph -s
# cluster:
#   id: abc123
#   health: HEALTH_OK
# services:
#   mon: 3 daemons
#   mgr: 1 active, 1 standby
#   osd: 9 osds: 9 up, 9 in
#   rgw: 2 daemons active

ceph osd tree
ceph df

echo "Ceph cluster installed"
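Before running the bootstrap it can help to check the addressing plan, since the monitor IP must sit on the public network and the cluster network should be a separate range. The following is a minimal sketch using only Python's standard `ipaddress` module; the function name `check_network_plan` is hypothetical, and the public range 10.0.1.0/24 is assumed from the monitor address used in the bootstrap step.

```python
import ipaddress


def check_network_plan(mon_ip: str, public_network: str,
                       cluster_network: str) -> list[str]:
    """Return a list of human-readable problems (empty when the plan looks sane)."""
    problems = []
    public = ipaddress.ip_network(public_network)
    cluster = ipaddress.ip_network(cluster_network)

    # Monitors listen on the public network, so --mon-ip must fall inside it.
    if ipaddress.ip_address(mon_ip) not in public:
        problems.append(f"mon IP {mon_ip} is not inside public network {public}")

    # Replication traffic only gets its own path if the ranges do not overlap.
    if public.overlaps(cluster):
        problems.append(f"public {public} and cluster {cluster} networks overlap")

    return problems


# Values taken from the bootstrap example; the public range is an assumption.
print(check_network_plan("10.0.1.10", "10.0.1.0/24", "10.0.2.0/24"))
```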

Capacity Planning and Sizing

A capacity calculation tool

#!/usr/bin/env python3
# ceph_capacity_planner.py: Ceph cluster capacity planning
import json
import logging
import math
from dataclasses import dataclass
from typing import List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("capacity")


@dataclass
class DiskSpec:
    size_tb: float
    type: str  # hdd, ssd, nvme
    rpm: int = 0  # HDD only
    iops: int = 0
    throughput_mbps: int = 0


@dataclass
class NodeSpec:
    hostname: str
    cpu_cores: int
    ram_gb: int
    disks: List[DiskSpec]
    network_gbps: float = 10


class CephCapacityPlanner:
    def __init__(self):
        self.nodes: List[NodeSpec] = []

    def add_node(self, node: NodeSpec):
        self.nodes.append(node)

    def calculate_raw_capacity(self):
        """Sum raw disk capacity; one OSD is assumed per disk."""
        total_tb = 0
        by_type = {"hdd": 0, "ssd": 0, "nvme": 0}
        osd_count = 0

        for node in self.nodes:
            for disk in node.disks:
                total_tb += disk.size_tb
                by_type[disk.type] += disk.size_tb
                osd_count += 1

        return {
            "total_raw_tb": round(total_tb, 1),
            "by_type": {k: round(v, 1) for k, v in by_type.items() if v > 0},
            "osd_count": osd_count,
            "node_count": len(self.nodes),
        }

    def calculate_usable_capacity(self, replication_factor=3,
                                  overhead_pct=5, max_fill_pct=85):
        raw = self.calculate_raw_capacity()
        raw_tb = raw["total_raw_tb"]

        # After replication
        after_replication = raw_tb / replication_factor

        # Reserved for overhead (BlueStore, metadata)
        after_overhead = after_replication * (1 - overhead_pct / 100)

        # Max usable (never fill above max_fill_pct, 85% by default)
        usable = after_overhead * (max_fill_pct / 100)

        return {
            "raw_tb": round(raw_tb, 1),
            "replication_factor": replication_factor,
            "after_replication_tb": round(after_replication, 1),
            "overhead_pct": overhead_pct,
            "after_overhead_tb": round(after_overhead, 1),
            "max_fill_pct": max_fill_pct,
            "usable_tb": round(usable, 1),
            "efficiency_pct": round(usable / raw_tb * 100, 1) if raw_tb else 0.0,
            "osd_count": raw["osd_count"],
        }

    def calculate_pg_count(self, pool_pct_of_data=100, osd_count=None,
                           replication_factor=3, target_pgs_per_osd=100):
        if osd_count is None:
            osd_count = self.calculate_raw_capacity()["osd_count"]

        # Formula: (target_pgs_per_osd * osd_count * pool_pct / 100) / replication_factor
        ideal = (target_pgs_per_osd * osd_count * pool_pct_of_data / 100) / replication_factor

        # Round up to the next power of 2
        pg_count = 2 ** math.ceil(math.log2(max(ideal, 1)))

        return {
            "osd_count": osd_count,
            "pool_data_pct": pool_pct_of_data,
            "replication_factor": replication_factor,
            "ideal_pgs": round(ideal),
            "recommended_pg_count": pg_count,
        }

    def forecast_growth(self, current_usage_tb, monthly_growth_tb,
                        months=12):
        capacity = self.calculate_usable_capacity()
        usable = capacity["usable_tb"]

        # Thresholds below are relative to usable capacity, which already
        # includes the max_fill_pct margin, so the forecast is conservative.
        forecast = []
        for month in range(1, months + 1):
            projected = current_usage_tb + (monthly_growth_tb * month)
            utilization = projected / usable * 100

            if utilization >= 85:
                status = "CRITICAL"
            elif utilization >= 70:
                status = "WARNING"
            else:
                status = "OK"

            forecast.append({
                "month": month,
                "projected_tb": round(projected, 1),
                "utilization_pct": round(utilization, 1),
                "free_tb": round(usable - projected, 1),
                "status": status,
            })

        # Find the first month in which utilization reaches 85%
        months_until_full = None
        for f in forecast:
            if f["utilization_pct"] >= 85:
                months_until_full = f["month"]
                break

        return {
            "current_usage_tb": current_usage_tb,
            "usable_capacity_tb": usable,
            "monthly_growth_tb": monthly_growth_tb,
            "months_until_85pct": months_until_full,
            "forecast": forecast,
        }

    def recommend_hardware(self, target_usable_tb, workload="mixed"):
        configs = {
            "archive": {
                "disk": DiskSpec(18, "hdd", rpm=7200, iops=100, throughput_mbps=200),
                "disks_per_node": 12, "nodes_min": 3, "replication": 3,
            },
            "mixed": {
                "disk": DiskSpec(4, "ssd", iops=50000, throughput_mbps=500),
                "disks_per_node": 6, "nodes_min": 3, "replication": 3,
            },
            "performance": {
                "disk": DiskSpec(2, "nvme", iops=200000, throughput_mbps=3000),
                "disks_per_node": 4, "nodes_min": 3, "replication": 3,
            },
        }

        # Unknown workloads fall back to the "mixed" profile
        config = configs.get(workload, configs["mixed"])

        # 0.80 is a planning factor for overhead plus fill margin
        # (close to the 0.95 * 0.85 used in calculate_usable_capacity)
        raw_needed = target_usable_tb * config["replication"] / 0.80
        disks_needed = math.ceil(raw_needed / config["disk"].size_tb)
        nodes_needed = max(
            math.ceil(disks_needed / config["disks_per_node"]),
            config["nodes_min"],
        )

        actual_disks = nodes_needed * config["disks_per_node"]
        actual_raw = actual_disks * config["disk"].size_tb
        actual_usable = actual_raw / config["replication"] * 0.80

        return {
            "workload": workload,
            "target_usable_tb": target_usable_tb,
            "recommendation": {
                "nodes": nodes_needed,
                "disks_per_node": config["disks_per_node"],
                "total_disks": actual_disks,
                "disk_type": config["disk"].type,
                "disk_size_tb": config["disk"].size_tb,
                "raw_capacity_tb": round(actual_raw, 1),
                "usable_capacity_tb": round(actual_usable, 1),
                "replication": config["replication"],
            },
        }


planner = CephCapacityPlanner()

# Example cluster: 3 nodes, each with 6 x 4 TB SSDs
for i in range(3):
    planner.add_node(NodeSpec(
        hostname=f"ceph-node{i+1}",
        cpu_cores=16, ram_gb=64,
        disks=[DiskSpec(4, "ssd", iops=50000) for _ in range(6)],
        network_gbps=25,
    ))

print("Raw:", json.dumps(planner.calculate_raw_capacity(), indent=2))
print("Usable:", json.dumps(planner.calculate_usable_capacity(), indent=2))
print("PGs:", json.dumps(planner.calculate_pg_count(pool_pct_of_data=80), indent=2))
print("Forecast:", json.dumps(planner.forecast_growth(10, 2, months=12), indent=2))
print("Recommend:", json.dumps(planner.recommend_hardware(50, "mixed"), indent=2))

Monitoring Cluster Health

Monitoring Ceph cluster health and performance

#!/usr/bin/env python3
# ceph_monitor.py: Ceph cluster health monitoring
# Note: JSON field names follow recent Ceph releases; verify them against your version.
import json
import logging
import subprocess
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ceph_mon")


class CephMonitor:
    def __init__(self):
        self.history = []

    def _run_ceph(self, *args):
        """Run a ceph CLI command and return its parsed JSON ({} on failure)."""
        cmd = ["ceph", "--format", "json"] + list(args)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            logger.error(f"ceph command failed: {result.stderr}")
            return {}
        return json.loads(result.stdout)

    def get_cluster_status(self):
        status = self._run_ceph("status")

        health = status.get("health", {})
        # Older releases nest the OSD counters under osdmap.osdmap, newer
        # ones report them directly under osdmap; accept both layouts.
        osd_map = status.get("osdmap", {})
        osd_map = osd_map.get("osdmap", osd_map)
        pg_map = status.get("pgmap", {})

        # Count only PGs whose state is exactly active+clean
        active_clean = sum(
            s.get("count", 0)
            for s in pg_map.get("pgs_by_state", [])
            if s.get("state_name") == "active+clean"
        )

        return {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "health": health.get("status", "UNKNOWN"),
            "health_checks": list(health.get("checks", {}).keys()),
            "osds": {
                "total": osd_map.get("num_osds", 0),
                "up": osd_map.get("num_up_osds", 0),
                "in": osd_map.get("num_in_osds", 0),
            },
            "pgs": {
                "total": pg_map.get("num_pgs", 0),
                "active_clean": active_clean,
            },
            "storage": {
                "total_bytes": pg_map.get("bytes_total", 0),
                "used_bytes": pg_map.get("bytes_used", 0),
                "available_bytes": pg_map.get("bytes_avail", 0),
                "used_pct": round(
                    pg_map.get("bytes_used", 0) / max(pg_map.get("bytes_total", 1), 1) * 100, 1
                ),
            },
            "io": {
                "read_bytes_sec": pg_map.get("read_bytes_sec", 0),
                "write_bytes_sec": pg_map.get("write_bytes_sec", 0),
                "read_op_sec": pg_map.get("read_op_per_sec", 0),
                "write_op_sec": pg_map.get("write_op_per_sec", 0),
            },
        }

    def get_osd_status(self):
        osds = self._run_ceph("osd", "df")

        osd_list = []
        for osd in osds.get("nodes", []):
            utilization = osd.get("utilization", 0)

            status = "ok"
            if utilization > 85:
                status = "critical"
            elif utilization > 75:
                status = "warning"

            osd_list.append({
                "id": osd.get("id"),
                "name": osd.get("name"),
                "total_kb": osd.get("kb"),
                "used_kb": osd.get("kb_used"),
                "available_kb": osd.get("kb_avail"),
                "utilization_pct": round(utilization, 1),
                "status": status,
            })

        # Fullest OSDs first
        return sorted(osd_list, key=lambda x: x["utilization_pct"], reverse=True)

    def get_pool_stats(self):
        # "ceph osd pool stats" reports per-pool client I/O rates;
        # "ceph df" reports stored bytes and object counts.
        io_stats = self._run_ceph("osd", "pool", "stats") or []
        usage = {
            p.get("name"): p.get("stats", {})
            for p in self._run_ceph("df").get("pools", [])
        }

        pool_list = []
        for pool in io_stats:
            name = pool.get("pool_name")
            client_io = pool.get("client_io_rate", {})
            stats = usage.get(name, {})
            pool_list.append({
                "name": name,
                "id": pool.get("pool_id"),
                "stored_bytes": stats.get("stored", 0),
                "objects": stats.get("objects", 0),
                "read_bytes_sec": client_io.get("read_bytes_sec", 0),
                "write_bytes_sec": client_io.get("write_bytes_sec", 0),
            })

        return pool_list

    def health_check(self):
        status = self.get_cluster_status()
        osds = self.get_osd_status()

        issues = []

        if status["health"] != "HEALTH_OK":
            issues.append(f"Cluster health: {status['health']}")
            for check in status["health_checks"]:
                issues.append(f" - {check}")

        down_osds = status["osds"]["total"] - status["osds"]["up"]
        if down_osds > 0:
            issues.append(f"{down_osds} OSDs are down")

        if status["storage"]["used_pct"] > 80:
            issues.append(f"Storage usage: {status['storage']['used_pct']}%")

        for osd in osds:
            if osd["status"] == "critical":
                issues.append(f"OSD {osd['name']} at {osd['utilization_pct']}%")

        return {
            "timestamp": status["timestamp"],
            "healthy": len(issues) == 0,
            "issues": issues,
            "summary": status,
        }


# monitor = CephMonitor()
# health = monitor.health_check()
# print(json.dumps(health, indent=2))
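To plug a report like the one returned by `health_check()` into cron or a Nagios-style checker, it has to be reduced to an exit code. The sketch below shows one possible mapping; the helper name `exit_code_for` and the 0/1/2/3 convention are assumptions made for illustration, not part of the monitor script.

```python
def exit_code_for(report: dict) -> int:
    """Map a health_check()-style report to a Nagios-style exit code.

    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN (assumed convention).
    """
    if report.get("healthy"):
        return 0
    health = report.get("summary", {}).get("health", "UNKNOWN")
    if health == "HEALTH_ERR":
        return 2
    if health in ("HEALTH_OK", "HEALTH_WARN"):
        # HEALTH_OK with issues means only the script's own thresholds fired.
        return 1
    return 3


# Hand-written sample report, shaped like the health_check() return value.
sample = {
    "healthy": False,
    "issues": ["Cluster health: HEALTH_WARN", " - OSD_NEARFULL"],
    "summary": {"health": "HEALTH_WARN"},
}
print(exit_code_for(sample))
```

A wrapper script could then call `sys.exit(exit_code_for(report))` after printing the issues.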

Performance Tuning and Optimization

Tuning Ceph performance

# === Ceph Performance Tuning ===

# 1. BlueStore Tuning
# Dedicated WAL/DB on NVMe for HDD OSDs

# Capacity thresholds (keep nearfull < backfillfull < full)
ceph osd set-full-ratio 0.90
ceph osd set-nearfull-ratio 0.80
ceph osd set-backfillfull-ratio 0.85

# Per-OSD settings in ceph.conf
# (with autotune enabled, cache sizing is driven by osd_memory_target)
[osd]
bluestore_cache_size = 4294967296 # 4 GiB cache per OSD
bluestore_cache_autotune = true
bluestore_cache_meta_ratio = 0.4
bluestore_cache_kv_ratio = 0.4
bluestore_cache_data_ratio = 0.2

# Discard (TRIM) for SSD/NVMe devices
bdev_async_discard = true
bdev_enable_discard = true

# Recovery tuning (balance between recovery speed and client IO)
osd_recovery_max_active = 3
osd_recovery_max_active_hdd = 1
osd_recovery_max_active_ssd = 10
osd_max_backfills = 1
osd_recovery_sleep = 0
osd_recovery_priority = 5

# 2. Network Tuning
# Separate public and cluster networks (ceph.conf)
[global]
public_network = 10.0.1.0/24
cluster_network = 10.0.2.0/24
ms_type = async+posix
ms_async_op_threads = 5

# System network tuning
sudo sysctl -w net.core.rmem_max=67108864
sudo sysctl -w net.core.wmem_max=67108864
sudo sysctl -w net.core.rmem_default=33554432
sudo sysctl -w net.core.wmem_default=33554432
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"

# 3. PG Autoscaler
ceph mgr module enable pg_autoscaler
ceph osd pool set rbd-pool pg_autoscale_mode on

# Check PG status
ceph osd pool autoscale-status

# 4. Pool Optimization
# Set compression for suitable pools
ceph osd pool set archive-pool compression_algorithm snappy
ceph osd pool set archive-pool compression_mode aggressive
ceph osd pool set archive-pool compression_required_ratio 0.875

# 5. Benchmark
# RADOS bench (sequential write; keep the objects for the read tests)
rados bench -p rbd-pool 60 write --no-cleanup

# RADOS bench (sequential read)
rados bench -p rbd-pool 60 seq

# RADOS bench (random read)
rados bench -p rbd-pool 60 rand

# Remove the benchmark objects left behind by --no-cleanup
rados -p rbd-pool cleanup

# RBD bench
rbd create test-image --size 10G --pool rbd-pool
rbd bench test-image --io-type write --io-size 4K --io-threads 16 --pool rbd-pool
rbd bench test-image --io-type read --io-size 4K --io-threads 16 --pool rbd-pool

# fio with RBD (job file; fio needs at least one job section after [global])
[global]
ioengine=rbd
clientname=admin
pool=rbd-pool
rbdname=test-image
rw=randread
bs=4k
numjobs=4
iodepth=32
runtime=60

[rbd-randread]

echo "Ceph tuning complete"
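One common way to sanity-check TCP buffer limits such as the 32 MiB maximum used in the sysctl settings is the bandwidth-delay product (BDP): the amount of data a single connection must keep in flight to fill the link. The sketch below is plain arithmetic; the link speeds and round-trip times are illustrative assumptions, not measurements from any cluster.

```python
TCP_BUFFER_MAX_BYTES = 33554432  # max value from the tcp_rmem/tcp_wmem settings


def bandwidth_delay_product_bytes(link_gbps: float, rtt_ms: float) -> int:
    """Bytes in flight needed to saturate a link: bandwidth x round-trip time."""
    bits_in_flight = link_gbps * 1e9 * rtt_ms / 1000
    return round(bits_in_flight / 8)


# Hypothetical links: (speed in Gbit/s, round-trip time in ms)
for gbps, rtt in ((10, 0.5), (25, 1.0), (25, 10.0)):
    bdp = bandwidth_delay_product_bytes(gbps, rtt)
    verdict = "fits" if bdp <= TCP_BUFFER_MAX_BYTES else "exceeds"
    print(f"{gbps} Gbit/s at {rtt} ms RTT: BDP {bdp} bytes, {verdict} the buffer limit")
```

Inside a single rack the BDP is far below the limit, so the larger buffers mainly matter for stretched or high-latency links.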

Scaling and Maintenance

Expanding and maintaining the cluster

# === Ceph Cluster Scaling ===

# 1. Add New OSD Host
# Add host to cluster
ceph orch host add ceph-node4 10.0.1.13

# Add OSDs on new host
ceph orch daemon add osd ceph-node4:/dev/sdb
ceph orch daemon add osd ceph-node4:/dev/sdc
ceph orch daemon add osd ceph-node4:/dev/sdd

# Monitor rebalancing
ceph -w
# Watch PG states: active+remapped, backfilling, recovering
# Wait until all PGs are active+clean

# 2. Remove OSD (Graceful)
# Mark OSD out (starts rebalancing)
ceph osd out osd.5

# Wait for rebalancing
while ! ceph osd safe-to-destroy osd.5; do
  echo "Waiting for safe-to-destroy..."
  sleep 30
done

# Stop and remove (cephadm asks for --force when removing an OSD daemon)
ceph orch daemon rm osd.5 --force
ceph osd purge osd.5 --yes-i-really-mean-it

# 3. Replace Failed Disk
# Identify failed OSD
ceph osd tree | grep down
ceph health detail

# Remove failed OSD
# (alternative on cephadm clusters: "ceph orch osd rm 3 --replace" keeps the OSD ID for the new disk)
ceph osd out osd.3
ceph osd purge osd.3 --yes-i-really-mean-it

# Replace disk physically, then add new OSD
ceph orch daemon add osd ceph-node2:/dev/sdc

# 4. Maintenance Window
# Set noout flag (prevent rebalancing during maintenance)
ceph osd set noout

# Perform maintenance...
# Restart services, update firmware, etc.
# (on cephadm clusters daemons run in containers, so restart them through the orchestrator)
ceph orch daemon restart osd.3

# Unset flag after maintenance
ceph osd unset noout

# 5. Upgrade Ceph
# Check current version
ceph versions

# Set container image for upgrade
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.2

# Monitor upgrade progress
ceph orch upgrade status

# 6. Backup Strategies
# RBD snapshots
rbd snap create rbd-pool/myimage@snap1
rbd snap ls rbd-pool/myimage

# Export snapshot
rbd export rbd-pool/myimage@snap1 /backup/myimage-snap1.img

# Incremental export (assumes an earlier snapshot named snap0 exists)
rbd export-diff rbd-pool/myimage@snap1 --from-snap snap0 /backup/myimage-diff.img

# CephFS snapshots
mkdir /mnt/cephfs/.snap/daily-$(date +%Y%m%d)

# 7. Monitoring Alerts
# Prometheus alerting rules for Ceph
groups:
  - name: ceph
    rules:
      - alert: CephHealthWarning
        expr: ceph_health_status == 1
        for: 5m
      - alert: CephHealthError
        expr: ceph_health_status == 2
        for: 1m
      - alert: CephOSDDown
        expr: count(ceph_osd_up == 0) > 0
        for: 5m
      - alert: CephStorageFull
        expr: ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > 0.80
        for: 10m

echo "Ceph scaling and maintenance documented"
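When planning an expansion it helps to estimate how much data the rebalancing will move. The sketch below uses a rough approximation: with equally weighted OSDs, CRUSH relocates about the share of data that corresponds to the new OSDs' share of the total weight. The function name and the recovery throughput figure are hypothetical; real backfill speed depends on the recovery settings and on client load.

```python
def estimate_rebalance(raw_used_tb: float, existing_osds: int, new_osds: int,
                       recovery_mb_per_s: float) -> dict:
    """Rough estimate of data movement after adding equally sized OSDs.

    Assumptions: all OSDs have the same CRUSH weight, raw_used_tb counts all
    replicas, and recovery runs at a constant aggregate rate.
    """
    moved_fraction = new_osds / (existing_osds + new_osds)
    moved_tb = raw_used_tb * moved_fraction
    hours = moved_tb * 1_000_000 / recovery_mb_per_s / 3600  # 1 TB = 1e6 MB
    return {
        "moved_fraction": round(moved_fraction, 3),
        "moved_tb": round(moved_tb, 2),
        "estimated_hours": round(hours, 1),
    }


# Example: 9 existing OSDs holding 30 TB raw, 3 new OSDs on ceph-node4,
# and an assumed aggregate recovery rate of 500 MB/s.
print(estimate_rebalance(30, 9, 3, 500))
```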

FAQ: Frequently Asked Questions

Q: What hardware does Ceph need?

A: The production minimum is 3 nodes. Each node needs 4+ CPU cores (1 core per OSD recommended), RAM of 4GB base plus 5GB per OSD (for example, 6 OSDs need 34GB), one SSD/NVMe per OSD for data (or an HDD with NVMe WAL/DB), and 10GbE networking at minimum (25GbE for NVMe clusters). Use SSDs for MON and MGR; they need not be large, 50-100GB is enough. For serious production use, separate the public network from the cluster network.
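The sizing rule in this answer is simple enough to turn into a helper. This is a minimal sketch of exactly that arithmetic (4GB base plus 5GB per OSD, one core per OSD with a floor of four); the function name and return shape are made up for the example.

```python
def node_requirements(osd_count: int, base_ram_gb: int = 4,
                      ram_per_osd_gb: int = 5, min_cores: int = 4) -> dict:
    """Per-node RAM and CPU according to the rule of thumb in the answer above."""
    return {
        "ram_gb": base_ram_gb + ram_per_osd_gb * osd_count,
        "cpu_cores": max(min_cores, osd_count),  # 1 core per OSD, at least 4
    }


for osds in (2, 6, 12):
    print(osds, "OSDs per node:", node_requirements(osds))
```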

Q: How do 3x replication and erasure coding differ?

A: 3x replication stores 3 copies of the data and uses 3 times the raw space, but recovery is fast and read performance is good because a read needs no reconstruction from several chunks. It suits hot data that needs low latency. Erasure coding (for example k=4, m=2) uses 1.5 times the raw space, which is far more economical, but write latency is higher, recovery is slower, and it uses more CPU. It suits cold or archive data where capacity efficiency matters most.
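The space factors quoted in this answer follow directly from the layout: replication writes `size` full copies, while erasure coding writes k data chunks plus m coding chunks. A short sketch of that arithmetic, with illustrative function names:

```python
def replicated_raw_multiplier(size: int) -> float:
    """Raw bytes consumed per logical byte with `size` replicas."""
    return float(size)


def ec_raw_multiplier(k: int, m: int) -> float:
    """Raw bytes consumed per logical byte with k data and m coding chunks."""
    return (k + m) / k


def raw_needed_tb(usable_tb: float, multiplier: float) -> float:
    """Raw capacity required to hold usable_tb (before any fill margin)."""
    return usable_tb * multiplier


target = 100  # TB of logical data, an arbitrary example
print("3x replication:", raw_needed_tb(target, replicated_raw_multiplier(3)), "TB raw")
print("EC k=4, m=2:   ", raw_needed_tb(target, ec_raw_multiplier(4, 2)), "TB raw")
```

Both layouts in this example tolerate the loss of two devices (size minus one copies, or m chunks), which is why k=4, m=2 is a common comparison point for 3x replication.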

Q: What PG count should I set?

A: Use the formula (target_pgs_per_osd * total_osds * pool_pct) / replication_factor and round up to a power of 2. For example, with 18 OSDs, replication 3, and a pool holding 80% of the data: (100 * 18 * 0.8) / 3 = 480, rounded up to 512 PGs. Fewer than 100 PGs per OSD costs performance; more than 200 PGs per OSD uses a lot of memory. Ceph has included pg_autoscaler since the Nautilus release, which adjusts the count automatically; enabling it is recommended.
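The worked example can be checked in a few lines, including the per-OSD load that results after rounding up. This is a sketch of the formula as stated in the answer, with hypothetical function names; the rounding uses integer bit arithmetic to avoid floating-point surprises at exact powers of two.

```python
import math


def recommended_pg_count(osd_count: int, replication_factor: int,
                         pool_fraction: float, target_pgs_per_osd: int = 100) -> int:
    """Apply the FAQ formula and round up to the next power of 2."""
    ideal = target_pgs_per_osd * osd_count * pool_fraction / replication_factor
    whole = max(1, math.ceil(round(ideal, 6)))
    return 1 << (whole - 1).bit_length()


def pg_replicas_per_osd(pg_count: int, replication_factor: int, osd_count: int) -> float:
    """Average number of PG replicas each OSD hosts for this pool."""
    return pg_count * replication_factor / osd_count


pgs = recommended_pg_count(osd_count=18, replication_factor=3, pool_fraction=0.8)
print("PG count:", pgs)
print("PG replicas per OSD:", round(pg_replicas_per_osd(pgs, 3, 18), 1))
```

The per-OSD figure for this pool lands near 80% of the 100-PG target, matching the pool's 80% share of the data; the remaining pools account for the rest.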

Q: Which workloads is Ceph a good fit for?

A: It is a very good fit for object storage (S3 compatible, as an alternative to MinIO at large scale), block storage for VMs (Proxmox, OpenStack), Kubernetes persistent volumes (Rook-Ceph), backup storage, and data lakes. It is not a good fit for workloads that need ultra-low latency (use local NVMe instead), small clusters with fewer than 3 nodes, or single-tenant databases that need consistent latency.

XM Legend · Forex trader and instructor for 13 years

Founder of SiamCafe since 1997 · Forex trader for more than 13 years, recognized as an XM Legend · Shares knowledge on Forex, IT, AI, and trading from real experience in real markets